Index: projects/vnet/release/doc/en_US.ISO8859-1/relnotes/article.xml =================================================================== --- projects/vnet/release/doc/en_US.ISO8859-1/relnotes/article.xml (revision 301546) +++ projects/vnet/release/doc/en_US.ISO8859-1/relnotes/article.xml (revision 301547) @@ -1,1853 +1,1857 @@ %release; %sponsor; %vendor; ]>
&os; &release.current; Release Notes The &os; Project $FreeBSD$ 2015 2016 The &os; Documentation Project &tm-attrib.freebsd; &tm-attrib.ibm; &tm-attrib.ieee; &tm-attrib.intel; &tm-attrib.sparc; &tm-attrib.general; The release notes for &os; &release.current; contain a summary of the changes made to the &os; base system on the &release.branch; development line. This document lists applicable security advisories that were issued since the last release, as well as significant changes to the &os; kernel and userland. Some brief remarks on upgrading are also presented. Introduction This document contains the release notes for &os; &release.current;. It describes recently added, changed, or deleted features of &os;. It also provides some notes on upgrading from previous versions of &os;. The &release.type; distribution to which these release notes apply represents the latest point along the &release.branch; development branch since &release.branch; was created. Information regarding pre-built, binary &release.type; distributions along this branch can be found at &release.url;. The &release.type; distribution to which these release notes apply represents a point along the &release.branch; development branch between &release.prev; and the future &release.next;. Information regarding pre-built, binary &release.type; distributions along this branch can be found at &release.url;. This distribution of &os; &release.current; is a &release.type; distribution. It can be found at &release.url; or any of its mirrors. More information on obtaining this (or other) &release.type; distributions of &os; can be found in the Obtaining &os; appendix to the &os; Handbook. All users are encouraged to consult the release errata before installing &os;. The errata document is updated with late-breaking information discovered late in the release cycle or after the release. Typically, it contains information on known bugs, security advisories, and corrections to documentation. An up-to-date copy of the errata for &os; &release.current; can be found on the &os; Web site. This document describes the most user-visible new or changed features in &os; since &release.prev;. In general, changes described here are unique to the &release.branch; branch unless specifically marked as &merged; features. Typical release note items document recent security advisories issued after &release.prev;, new drivers or hardware support, new commands or options, major bug fixes, or contributed software upgrades. They may also list changes to major ports/packages or release engineering practices. Clearly the release notes cannot list every single change made to &os; between releases; this document focuses primarily on security advisories, user-visible changes, and major architectural improvements. Upgrading from Previous Releases of &os; Binary upgrades between RELEASE versions (and snapshots of the various security branches) are supported using the &man.freebsd-update.8; utility. The binary upgrade procedure will update unmodified userland utilities, as well as unmodified GENERIC kernels distributed as a part of an official &os; release. The &man.freebsd-update.8; utility requires that the host being upgraded have Internet connectivity. Source-based upgrades (those based on recompiling the &os; base system from source code) from previous versions are supported, according to the instructions in /usr/src/UPDATING. Upgrading &os; should only be attempted after backing up all data and configuration files. Security and Errata This section lists the various Security Advisories and Errata Notices since &release.prev;. Security Advisories &security; Errata Notices &errata; Userland This section covers changes and additions to userland applications, contributed software, and system utilities. Userland Configuration Changes The default &man.newsyslog.conf.5; now includes files in the /etc/newsyslog.conf.d/ and /usr/local/etc/newsyslog.conf.d/ directories by default for &man.newsyslog.8;. The &man.mailwrapper.8; utility has been updated to use &man.mailer.conf.5; from the LOCALBASE environment variable, which defaults to /usr/local if unset. The MK_ARM_EABI &man.src.conf.5; option has been removed. The ntp suite has been updated to version 4.2.8p8. Userland Application Changes The &man.casperd.8; daemon has been added, which provides access to functionality that is not available in the capability mode sandbox. When unable to load a kernel module with &man.kldload.8;, a message informing to view output of &man.dmesg.8; is now printed, opposed to the previous output Exec format error.. Allow &man.pciconf.8; to identify PCI devices that are attached to a driver to be identified by their device name instead of just the selector. Additionally, an optional device argument to the -l flag to restrict the output to only listing details about a single device. A new flag, onifconsole has been added to /etc/ttys. This allows the system to provide a login prompt via serial console if the device is an active kernel console, otherwise it is equivalent to off. Support for displaying VPD for PCI devices via &man.pciconf.8; has been added. &man.ping.8; protects against malicious network packets using the Capsicum framework to drop privileges. The &man.ps.1; utility has been updated to include the -J flag, used to filter output by matching &man.jail.8; IDs and names. Additionally, argument 0 can be used to -J to only list processes running on the host system. The &man.top.1; utility has been updated to filter by &man.jail.8; ID or name, in followup to the &man.ps.1; change in r265229. The &man.pmcstat.8; utility has been updated to include a new flag, -l, which ends event collection after the specified number of seconds. The &man.ps.1; utility has been updated to include a new keyword, tracer, which displays the PID of the tracing process. Support for adding empty partitions has been added to the &man.mkimg.1; utility. The &man.primes.6; utility has been updated to correctly enumerate prime numbers between 4295098369 and 3825123056546413050, which prior to this change, it would be possible for returned values to be incorrectly identified as prime numbers. The &man.mkimg.1; utility has been updated to include three options used to print information about &man.mkimg.1; itself: Option Output --version The current version of the &man.mkimg.1; utility --formats The disk image file formats supported by &man.mkimg.1; --schemes The partition schemes supported by &man.mkimg.1; Userland &man.ctf.5; support in &man.dtrace.1; has been added. With this change, &man.dtrace.1; is able to resolve type info for function and USDT probe arguments, and function return values. The &man.elfdump.1; utility has been updated to support capability mode provided by &man.capsicum.4;. The &man.fstyp.8; utility has been added, which is used to determine the filesystem on a specified device. The libedit library has been updated to support UTF-8, which additionally provides unicode support to &man.sh.1;. The &man.mkimg.1; utility has been updated to support the MBR EFI partition type. The &man.ptrace.2; system call has been updated include support for Altivec registers on &os;/&arch.powerpc;. A new device control utility, &man.devctl.8; has been added, which allows making administrative changes to individual devices, such as attaching and detaching drivers, and enabling and disabling devices. The &man.devctl.8; utility uses the new &man.devctl.3; library. The &man.netstat.1; utility has been updated to link against the &man.libxo.3; shared library. A new flag, -c, has been added to the &man.mkimg.1; utility, which allows specifying the capacity of the target disk image. The &man.uefisign.8; utility has been added. The &man.freebsd-update.8; utility has been updated to prevent fetching updated binary patches when a previous upgrade has not been thoroughly completed. A regression in the &man.libarchive.3; library that would prevent a directory from being included in the archive when --one-file-system is used has been fixed. The &man.ar.1; utility has been updated to set ARCHIVE_EXTRACT_SECURE_SYMLINKS and ARCHIVE_EXTRACT_SECURE_NODOTDOT to disallow directory traversal when extracting an archive, similar to &man.tar.1;. A race condition in &man.wc.1; that would cause final results to be sent to &man.stderr.4; when receiving the SIGINFO signal has been fixed. The &man.chflags.1;, &man.chgrp.1;, &man.chmod.1;, and &man.chown.8; utilities now affect symbolic links when the -R flag is specified, as documented in &man.symlink.7;. The &man.date.1; utility has been updated to print the modification time of the file passed as an argument to the -r flag, improving compatibility with the GNU &man.date.1; utility behavior. The &man.pw.8; utility has been updated with a new flag, -R, that sets the root directory within which the utility will operate. The &man.lockstat.1; utility has been updated with several improvements: Spin locks are now reported as the amount of time spinning, instead of loop iterations. Reader locks are now recognized as adaptive that can spin on &os;. Lock aquisition events for successful reader try-lock events are now reported. Spin and block events are now reported before lock acquisition events. The &man.fstyp.8; utility has been updated to be able to detect &man.zfs.8; and &man.geli.8; filesystems. The &man.mkimg.1; utility has been updated to include support for NTFS filesystems in both MBR and GPT partitioning schemes. The &man.quota.1; utility has been updated to include support for IPv6. The &man.jexec.8; utility has been updated to include a new flag, -l, which ensures a clean environment in the target jail when used. Additionally, &man.jexec.8; will run a shell within the target jail when run no commands are specified. The &man.w.1; utility has been updated to display the full IPv6 remote address of the host from which a user is connected. The &man.jail.8; framework has been updated to allow mounting &man.linprocfs.5; and &man.linsysfs.5; within a jail. The &man.patch.1; utility has been updated to include a new option to the -V flag, none, which disables backup file creation when applying a patch. The &man.ar.1; utility now enables deterministic mode (-D) by default. This behavior can be disabled by specifying the -U flag. The &man.xargs.1; utility has been updated to allow specifying 0 as an argument to the -P (parallel mode) flag, which allows creating as many concurrent processes as possible. The &man.patch.1; utility has been updated to remove the automatic checkout feature. A new utility, &man.sesutil.8;, has been added, which is used to manage &man.ses.4; devices. The &man.pciconf.8; utility has been updated to use the PCI ID database from the misc/pciids package, if present, falling back to the PCI ID database in the &os; base system. The &man.ifconfig.8; utility has been updated to always exit with an error code if an important &man.ioctl.2; fails. Contributed Software &man.byacc.1; has been updated to version 20140101. OpenSSH has been updated to 7.2p2. mdocml has been updated to version 1.12.3. The binutils suite of utilities has been updated to include upstream patches that add new relocations for &arch.powerpc; support. The ELF Tool Chain has been updated to upstream revision r3136. The texinfo utility and info pages were removed from the base system. The print/texinfo port should be installed on systems where info pages are needed. The ELF object manipulation tools addr2line, elfcopy (strip), nm, readelf, size, and strings were switched to the versions from the ELF Tool Chain project. The libedit library has been updated to include UTF-8 support, adding UTF-8 support to the &man.sh.1; shell. The &man.xz.1; utility has been updated to support multi-threaded compression. The elftoolchain utilities have been updated to version 3179. The &man.xz.1; utility has been updated to version 5.2.1. The &man.nvi.1; utility has been updated to version 2.1.3. The &man.wpa.supplicant.8; and &man.hostapd.8; utilities have been updated to version 2.4. The &man.resolvconf.8; utility has been updated to version 3.7.3. bmake has been updated to version 20150606. sendmail has been updated to 8.15.2. Starting with &os; 11.0 and sendmail 8.15, sendmail uses uncompressed IPv6 addresses by default, i.e., they will not contain ::. For example, instead of ::1, it will be 0:0:0:0:0:0:0:1. This permits a zero subnet to have a more specific match, such as different map entries for IPv6:0:0 versus IPv6:0. This change requires that configuration data (including maps, files, classes, custom ruleset, etc.) must use the same format, so make certain such configuration data is upgrading. As a very simple check search for patterns like 'IPv6:[0-9a-fA-F:]*::' and 'IPv6::'. To return to the old behavior, set the m4 option confUSE_COMPRESSED_IPV6_ADDRESSES or the cf option UseCompressedIPv6Addresses. The &man.tcpdump.1; utility has been updated to version 4.7.4. OpenSSL has been updated to version 1.0.2h. The &man.ssh.1; utility has been updated to re-implement hostname canonicalization before locating the host in known_hosts. The &man.libarchive.3; library has been updated to properly skip a sparse file entry in a &man.tar.1; file, which would previously produce errors. The apr library used by &man.svnlite.1; has been updated to version 1.5.2. The serf library used by &man.svnlite.1; has been updated to version 1.3.8. The &man.svnlite.1; utility has been updated to version 1.8.14. The sqlite3 library used by &man.svnlite.1; and &man.kerberos.8; has been updated to version 3.12.1. Timezone data files have been updated to version 2015f. The &man.acpi.4; subsystem has been updated to version 20150818. The &man.unbound.8; utility has been updated to version 1.5.4. &man.jemalloc.3; has been updated to version 4.0.2. The &man.file.1; utility has been updated to version 5.26. The &man.nc.1; utility has been updated to the OpenBSD 5.8 version. Clang has been updated to version 3.8.0. LLVM has been updated to version 3.8.0. LLDB has been updated to version 3.8.0. libc++ has been updated to version 3.8.0. The compiler_rt utility has been updated to version 3.8.0. Installation and Configuration Tools The &man.bsdinstall.8; partition editor and &man.sade.8; utility have been updated to include native ZFS support. The &os; installation utility, &man.bsdinstall.8;, has been updated to set the canmount &man.zfs.8; property to off for the /var dataset, preventing the contents of directories within /var from conflicting when using multiple boot environments, such as that provided by sysutils/beadm. The &man.bsdconfig.8; utility has been updated to skip the initial &man.tzsetup.8; UTC versus wall-clock time prompt when run in a virtual machine, determined when the kern.vm_guest &man.sysctl.8; is set to 1. The &man.bsdinstall.8; utility has been updated to use the new &man.dpv.3; library to display progress when extracting the &os; distributions. Support for detecting and implementing aligning partitions on 1Mb boundaries has been added to &man.bsdinstall.8;. Support for detecting and implementing a workaround for various laptops and motherboards that do not boot properly from GPT-partitioned disks has been added to &man.bsdinstall.8;. Additionally, the active flag will be set on the partition when needed. Support for selecting the partitioning scheme when installing on the UFS filesystem has been added to &man.bsdinstall.8;. <filename class="directory">/etc/rc.d</filename> Scripts The &man.rc.8; subsystem has been updated to allow configuring services in ${LOCALBASE}/etc/rc.conf.d/. If LOCALBASE is unset, it defaults to /usr/local. A new &man.rc.8; script, growfs, has been added, which will resize the root filesystem on boot if /firstboot exists. The mrouted &man.rc.8; script has been removed from the base system. An equivalent script is available from the net/mrouted port. A new &man.rc.8; script, iovctl, has been added, which allows automatically starting the &man.iovctl.8; utility at boot. The &man.service.8; utility has been updated to honor entries within /etc/rc.conf.d/. <filename class="directory">/etc/periodic</filename> Scripts The daily &man.periodic.8; script 110.clean-tmps has been updated to avoid crossing filesystem mount boundaries when cleaning files in /tmp. A new &man.periodic.8; script, 510.status-world-kernel, has been added, which evaluates the running userland and kernel versions from the &man.uname.1; -U and -K arguments, and prints an error if the system userland and kernel are not in sync. Runtime Libraries and API The Blowfish &man.crypt.3; default format has been changed to $2b$. The &man.readline.3; library is now statically linked in software within the base system, and the shared library is no longer installed, allowing the Ports Collection to use a modern version of the library. The &man.strptime.3; library has been updated to add support for POSIX-2001 features %U and %W. The &man.dl.iterate.phdr.3; library has been changed to always return the path name of the ELF object in the dlpi_name structure member. The &man.libxo.3; library has been imported to the base system. A userland library for Chelsio Terminator 5 based iWARP cards has been added, allowing userland RDMA applications to work over compatible NICs. The &man.gpio.3; library has been added, providing a wrapper around the &man.gpio.4; kernel interface. The &man.procctl.2; system call has been updated to include a facility for non-&man.init.8; processes to be declared as the reaper of child processes and their decendants. The futimens() and utimensat() system calls have been added. See &man.utimensat.2; for more information. The &man.elf.3; compile-time dependency has been removed from dtri.o, which allows adding DTrace probes to userland applications and libraries without also linking against &man.elf.3;. The &man.setmode.3; function has been updated to consistently set errno on failure. The &man.qsort.3; functions have been updated to be able to handle 32-bit aligned data on 64-bit platforms, also providing a significant improvement in 32-bit workloads. Several standard include headers have been updated to use of gcc attributes, such as __result_use_check(), __alloc_size(), and __nonnull(). Support for file verification in MAC has been added. The libgomp library is now only built when building GCC from the base system. An up-to-date version is available in the Ports Collection as devel/libiomp5-devel. The stdlib.h and malloc.h headers have been updated to make use of the gcc alloc_align() attribute. The Blowfish &man.crypt.3; library has been updated to support $2y$ hashes. The &man.execl.3; and &man.execlp.3; library functions have been updated to use the __sentinel gcc attribute. ABI Compatibility The &linux; compatibility version has been updated to 2.6.18. The compat.linux.osrelease &man.sysctl.8; is evaluated when building the emulators/linux-c6 and related ports. The stack protector has been upgraded to the "strong" level, elevating the protection against buffer overflows. While this significantly improves the security of the system, extensive testing was done to ensure there are no measurable side effects in performance or functionality. Kernel This section covers changes to kernel configurations, system tuning, and system control parameters that are not otherwise categorized. Kernel Bug Fixes A kernel bug that inhibited proper functionality of the dev.cpu.0.freq &man.sysctl.8; on &intel; processors with Turbo Boost ™ enabled has been fixed. Support for &man.dtrace.1; stack tracing has been fixed for &os;/&arch.powerpc;, using the trapexit() and asttrapexit() functions instead of checking within addressed kernel space. A kernel panic triggered when destroying a &man.vnet.9; &man.jail.8; configured with &man.gif.4; has been fixed. A kernel panic triggered when destroying a &man.vnet.9; &man.jail.8; configured with &man.gre.4; has been fixed. A bug in &man.ipfw.4; that could potentially lead to a kernel panic when using &man.dummynet.4; at layer 2 has been fixed. The kernel RPC has been updated to include several enhancements: The 45 MiB limit on requests queued for &man.nfsd.8; threads has been removed. Avoids unnecessary throttling by not deferring accounting for completed requests. Fixes an integer overflow and signedness bugs. Support for &man.dtrace.1; has been added for the Book-E ™. The &man.kqueue.2; system call has been updated to handle write events to files larger than 2 gigabytes. Kernel Configuration The IMAGACT_BINMISC kernel configuration option has been enabled by default, which enables application execution through emulators, such as Qemu. The VT kernel configuration file has been removed, and the &man.vt.4; driver is included in the GENERIC kernel. To enable &man.vt.4;, enter set kern.vty=vt at the &man.loader.8; prompt during boot, or add kern.vty=vt to &man.loader.conf.5; and reboot the system. The &man.config.8; utility has been updated to allow using a non-standard src/ tree, specified as an argument to the -s flag. The &os;/&arch.powerpc64; kernel now builds as a position-independent executable, allowing the kernel to be loaded into and run from any physical or virtual address. This change requires an update to &man.loader.8;. The userland and kernel must be updated before rebooting the system. A new module for creating rpi.dtb has been added for the Raspberry Pi. The rpi.dtb module is now installed to /boot/dtb/ by default for the Raspberry Pi system. Kernel support for Vector-Scalar eXtension (VSX) found on POWER7 and POWER8 hardware has been added. The &man.pmap.9; implementation for 64-bit &powerpc; processors has been overhaulded to improve concurrency. A new module for creating the dtb module for AM335x systems has been added. The PAE_TABLES kernel configuration option has been added for &os;/&arch.i386;, which instructs &man.pmap.9; to use PAE format for page tables while maintaining a 32-bit physical address size elsewhere in the kernel. The use of this option can enhance application-level security by enabling the creation of no execute mappings on modern &arch.i386; processors. Unlike the PAE option, PAE_TABLES preserves kernel binary interface (KBI) compatibility with non-PAE kernels, allowing non-PAE kernel modules and drivers to work with a PAE_TABLES-enabled kernel. Additionally, system limits are tuned for 4GB maximum RAM, avoiding kernel virtual address space (KVA) exhaustion. The SIFTR kernel configuration has been added, allowing building &man.siftr.4; statically into the kernel. The &arch.arm; boot loader, ubldr, is now relocatable. In addition, ubldr.bin is now created during build time, which is a stripped binary with an entry point of 0, providing the ability to specify the load address by running go ${loadaddr} in u-boot. The &man.nvd.4; and &man.nvme.4; drivers are now included in the GENERIC kernel configuration by default. A new kernel configuration option, EM_MULTIQUEUE, has been added which enables multi-queue support in the &man.em.4; driver. Multi-queue support in the &man.em.4; driver is not officially supported by &intel;. The GENERIC kernel configuration has been updated to include the IPSEC option by default. Initial NUMA affinity and policy configuration has been added. See &man.numactl.1;, and &man.numa.getaffinity.2;, for usage details. The &man.pms.4; driver has been added to the GENERIC kernel configuration for supported architectures. The CUBIEBOARD2 kernel configuration has been renamed to A20. Kernel debugging symbols are now installed to /usr/lib/debug/boot/kernel/. To retain the previous behavior, add KERN_DEBUGDIR="" to &man.src.conf.5;. System Tuning and Controls The &man.hwpmc.4; default and maximum callchain depths have been increased. The default has been increased from 16 to 32, and the maximum increased from 32 to 128. The kern.osrelease and kern.osreldate are now configurable &man.jail.8; parameters. The &man.devfs.5; device filesystem has been changed to update timestamps for read/write operations using seconds precision. A new &man.sysctl.8;, vfs.devfs.dotimes has been added, which when set to a non-zero value, enables default precision timestamps for these operations. A new &man.sysctl.8;, kern.racct.enable, has been added, which when set to a non-zero value allows using &man.rctl.8; with the GENERIC kernel. A new kernel configuration option, RACCT_DISABLED has also been added. The GENERIC kernel configuration now includes RACCT and RCTL by default. To enable RACCT and RCTL on a system using the GENERIC kernel configuration, add kern.racct.enable=1 to &man.loader.conf.5;, and reboot the system. A new &man.sysctl.8;, net.inet.tcp.hostcache.purgenow, has been added, which when set to 1 during runtime will flush all net.inet.tcp.hostcache entries. A new &man.sysctl.8;, hw.model, has been added, which displays CPU model information. The &man.uart.4; driver has been updated to allow tuning pulses per second captured in the CTS line during runtime, whereas previously only the DCD line could be used without rebuilding the kernel. Devices and Drivers This section covers changes and additions to devices and device drivers since &release.prev;. Device Drivers Support for GPS ports has been added to &man.uhso.4;. The &man.full.4; device has been added, and the lindev(4) device has been removed. Prior to this change, lindev(4) provided only the /dev/full character device, returning ENOSPC on write attempts. As this device is not specific to &linux;, a native &os; version has been added. Hardware context support has been added to the drm/i915 driver, adding support for Mesa 9.2 and later. The &man.vt.4; driver has been updated, replacing the bitmapped kern.vt.spclkeys &man.sysctl.8; with individual kern.vt.kbd_* variants. The &man.hpet.4; driver has been updated to create a /dev/hpetN device, providing access to HPET from userspace. The drm code has been updated to match &linux; version 3.8.13. The &man.psm.4; driver has been updated to include improved support for newer Synaptics ® touchpads and the ClickPad ® mouse on newer Lenovo ™ laptops. Support for the Freescale PCI Root Complex device has been added. The &man.cyapa.4; driver has been added, supporting the Cypress APA I2C trackpad. The &man.isl.4; driver has been added, supporting the Intersil I2C ISL29018 digital ambient light sensor. Storage Drivers The &man.mpr.4; device has been added, providing support for LSI Fusion-MPT 3 12Gb SCSI/SATA controllers. The &man.mrsas.4; driver has been added, providing support for LSI MegaRAID SAS controllers. The &man.mfi.4; driver will attach to the controller, by default. To enable &man.mrsas.4; add hw.mfi.mrsas_enable=1 to /boot/loader.conf, which turns off &man.mfi.4; device probing. At this time, the &man.mfiutil.8; utility and the &os; version of MegaCLI and StorCli do not work with &man.mrsas.4;. The &man.ctl.4; subsystem has been updated, increasing the ports limit from 128 to 256, and LUN limit from 256 to 1024. The asr(4) driver has been removed, and is no longer supported. The &man.hptnr.4; driver has been updated to version 1.1.1. The &man.pms.4; driver has been added, providing support for the PMC Sierra line of SAS/SATA host bus adapters. The &man.ioat.4; driver has been added, providing support for the PSE (Platform Storage Extension). The CTL High Availability implementation has been rewritten. The &man.ctl.4; driver has been updated to support CD-ROM and removable devices. The &man.isp.4; driver has been updated and improved: added support for 16Gbps FC cards, improved target mode support, completed Multi-ID (NPIV) functionality. Network Drivers Support for Broadcom chipsets BCM57764, BCM57767, BCM57782, BCM57786 and BCM57787 has been added to &man.bge.4;. Support for the &intel; Centrino™ Wireless-N 135 chipset has been added. Firmware for &intel; Centrino™ Wireless-N 105 devices has been added to the base system. The deprecated nve(4) driver has been removed. Users of NVIDIA nForce MCP network adapters are advised to use the &man.nfe.4; driver instead, which has been the default driver for this hardware since &os; 7.0. The if_nf10bmac(4) device has been added, providing support for NetFPGA-10G Embedded CPU Ethernet Core. The if_nf10bmac(4) driver operates on the FPGA, and is not suited for the PCI host interface. The &man.ath.hal.4; driver has been updated to support the Atheros AR1111 chipset. Support for the &intel; Centrino™ Wireless-N 105 chipset has been added. Support for the &man.cxgbe.4; Terminator 5 (T5) 10G/40G cards has been added to &man.netmap.4;. The &man.alc.4; driver has been updated to support AR816x and AR817x ethernet controllers. The &man.pf.4; packet filter default hash has been changed from Jenkins to Murmur3, providing a 3-percent performance increase in packets-per-second. The &man.vxlan.4; driver has been added, which creates a virtual Layer 2 (Ethernet) network overlaid in a Layer 3 (IP/UDP) network. The &man.vxlan.4; driver is analogous to &man.vlan.4;, but is designed to be better suited for large, multiple-tenant datacenter environments. The &man.gre.4; driver has been significantly overhauled, and has been split into two separate modules, &man.gre.4; and &man.me.4;. The &man.ral.4; driver has been updated to support the RT5390 and RT5392 chipsets. The &man.sfxge.4; driver has been updated to support Solarflare Flareon Ultra 7000-series chipsets. The &man.em.4; driver has been updated with improved transmission queue hang detection. The &man.cdce.4; driver has been updated to include support for the RTL8153 chipset. The &man.iwm.4; driver has been imported from OpenBSD, providing support for &intel; 3160/7260/7265 wireless chipsets. The &man.em.4; driver has been updated to allow disabling CRC stripping. The &man.pf.4; implementation has been updated to remove support for the scrub fragment crop|drop-ovl filtering rule. Systems with this rule in &man.pf.conf.5; will implicitly be converted to the scrub fragment reassemble filtering rule, without necessary intervention. The &man.lagg.4; driver has been updated to remove support for the fec protocol. Hardware Support This section covers general hardware support for physical machines, hypervisors, and virtualization environments, as well as hardware changes and updates that do not otherwise fit in other sections of this document. Hardware Support The &man.asmc.4; driver has been updated to support the &apple; MacMini 3,1. Support for &os;/ia64 has been dropped as of &os; 11. An issue that could cause a system to hang when entering ACPI S3 state (suspend to RAM) has been corrected in the &man.acpi.4; and &man.pci.4; drivers. The power management unit subsystem has been updated to support power button events on certain &arch.powerpc; hardware, such as aluminum PowerBook ®. The &man.hwpmc.4; driver has been updated to correct performance counter sampling on G4 (MPC74xxx) and G5 class processors. The OpenCrypto framework has been updated to include AES-ICM and AES-GCM modes, both of which have also been added to the &man.aesni.4; driver. The &man.hwpmc.4; driver has been updated to support the Freescale e500 core. The &man.ig4.4; driver has been added, providing support for the fourth generation &intel; I2C SMBus. The &man.uart.4; driver has been updated to support AMT devices on newer systems. Initial SMP support has been added to the &os;/&arch.arm64; port. Virtualization Support Support for the Virtual Interrupt Delivery feature of &intel; VT-x is enabled if supported by the CPU. This feature can be disabled by running sysctl hw.vmm.vmx.use_apic_vid=0. Additionally, to persist this setting across reboots, add hw.vmm.vmx.use_apic_vid=0 to /etc/sysctl.conf. Support for Posted Interrupt Processing is enabled if supported by the CPU. This feature can be disabled by running sysctl hw.vmm.vmx.use_apic_pir=0. Additionally, to persist this setting across reboots, add hw.vmm.vmx.use_apic_pir=0 to /etc/sysctl.conf. Unmapped IO support has been added to &man.virtio_blk.4;. Unmapped IO support has been added to &man.virtio_scsi.4;. The &man.virtio_random.4; driver has been added to harvest entropy from the host system. &os;/&arch.i386; guests can be run under bhyve. Support for running a &os;/&arch.amd64; Xen guest instance as PVH guest has been added. PVH mode, short for Para-Virtualized Hardware, uses para-virtualized drivers for boot and I/O, and uses hardware virtualization extensions for all other tasks, without the need for emulation. The &man.bhyve.8; hypervisor has been updated to support &amd; processors with SVM and AMD-V hardware extensions. The &man.virtio.console.4; driver has been added, which provides an interface to VirtIO console devices through a &man.tty.4; device. The &man.bhyve.8; hypervisor has been updated to support DSM TRIM commands for virtual AHCI disks. Support for the QEMU virt system has been added. The Hyper-V™ drivers have been updated with several enhancements: The &man.hv.vmbus.4; driver now has multi-channel support. The &man.hv.storvsc.4; driver now has scatter/gather support, in addition to performance improvements. The &man.hv.kvp.4; driver has received several bug fixes. Support for &man.xen.4; para-virtualized domU kernels has been removed. The &man.hv.netvsc.4; driver has been updated to support checksum offloading and TSO. The &man.xen.4; driver has been updated to include support for blkif indirect segment I/O. ARM Support The &man.nand.4; device is enabled for ARM devices by default. Support for the Exynos 5420 Octa system has been added. The SMP option has been enabled for all Exynos 5 systems supported by &os;. Support for the Toradex Apalis i.MX6 development board has been added. An issue that could cause instability when detecting SD cards on the Raspberry Pi SOC has been fixed. The bcm2835_cpufreq driver has been added, which supports CPU frequency and voltage control on the Raspberry Pi SOC. Support to turn off the BeagleBone Black system with the &man.shutdown.8; -p flag or by invoking &man.poweroff.8; has been added. Audio transmission drivers have been added for Digital Audio Multiplexer (AUDMUXM), Smart Direct Memory Access Controller (SDMA), and Syncronous Serial Interface (SSI). Initial support for the ARM AArch64 architecture has been added. Kernel support for Thumb-2 userland has been added. Support for the hardware power button on the BeagleBone Black system has been added. Initial ACPI support has been added for &os;/&arch.arm64;. Support for 1-Wire devices has been added, providing support for 1-Wire hardware through &man.gpio.4;. See &man.ow.4;, &man.owc.4;, and &man.ow.temp.4; for more information. Support for the HiSilicon HI6220 SoC has been added. Storage This section covers changes and additions to file systems and other storage subsystems, both local and networked. General Storage The &man.ctl.4; LUN mapping has been rewritten, replacing iSCSI-specific mapping mechanisms with a new mechanism that works for any port. The &man.ctld.8; utility has been updated to allow controlling non-iSCSI &man.ctl.4; ports. The &man.autofs.5; subsystem has been updated to include a new &man.auto.master.5; map, -media, which allows automatically mounting removable media, such as CD drives or USB flash drives. The &man.autofs.5; subsystem has been updated to include a new &man.auto.master.5; map, -noauto, which handles &man.fstab.5; entries set to noauto. The GELI class has been updated to support the BIO_DELETE &man.g.bio.9; bio_cmd field, providing TRIM/UNMAP support on GELI-backed SSD storage providers. Networked Storage The new filesystem automount facility, &man.autofs.5;, has been added. The new &man.autofs.5; facility is similar to that found in other &unix;-like operating systems, such as OS X™ and Solaris™. The &man.autofs.5; facility uses a &sun;-compatible &man.auto.master.5; configuration file, and is administered with the &man.automount.8; userland utility, and the &man.automountd.8; and &man.autounmountd.8; daemons. Support for the timeo, actimeo, noac, and proto options have been added to &man.mount.nfs.8;. ZFS The arc_meta_limit statistics are now visible through the kstat &man.sysctl.8;. As a result of this change, the vfs.zfs.arc_meta_used &man.sysctl.8; has been removed, and replaced with the kstat.zfs.misc.arcstats.arc_meta_used &man.sysctl.8;. The &man.zfs.8; l2arc code has been updated to take ashift into account when gathering buffers to be written to the l2arc device. The zfsd daemon has been added, which manages hotspares and replements in drive slots that publish physical paths. &man.geom.4; Support for the disklabel64 partitioning scheme has been added to &man.gpart.8;. Support for the apple-boot, apple-hfs, and apple-ufs MBR partitioning schemes have been added to &man.gpart.8;. The &man.gpart.8; utility has been updated to include a new attribute for GPT partitions, lenovofix, which when set, which works around BIOS compatibility issues reported on several Lenovo ™ laptops. Boot Loader Changes This section covers the boot loader, boot menu, and other boot-related changes. Boot Loader Changes The memory test run at boot time on &os;/&arch.amd64; platforms has been disabled by default. A new &man.ttys.5; class, 3wire, has been added. This is similar to the existing terminal classes, but does not have a defined baudrate. The &man.vt.4; driver has been made the default system console driver. The &man.syscons.4; driver is still available, and can be enabled by adding kern.vty=sc in &man.loader.conf.5;. Alternatively, &man.syscons.4; can be enabled at boot time by entering set kern.vty=sc at the &man.loader.8; prompt. Support for bzipfs has been added to the EFI loader. The boot loader has been updated to support entering the GELI passphrase before loading the kernel. To enable this behavior, add geom_eli_passphrase_prompt="YES" to &man.loader.conf.5;. The &man.ttys.5; file for &os;/&arch.arm; has been updated to enable ttyu1, ttyu2, and ttyu3 by default, if the callin port is an active console port. Boot Menu Changes   Networking This section describes changes that affect networking in &os;. Network Protocols Support for the IPX network transport protocol has been removed, and will not be supported in &os; 11 and later releases. Support for PLPMTUD blackhole detection (RFC 4821) has been added to the &man.tcp.4; stack, disabled by default. New control tunables have been added: Tunable Description net.inet.tcp.pmtud_blackhole_detection Enables or disables PLPMTUD blackhole detection net.inet.tcp.pmtud_blackhole_mss MSS to try for IPv4 net.inet.tcp.v6pmtud_blackhole_mss MSS to try for IPv6 New monitoring &man.sysctl.8;s haven been added: Tunable Description net.inet.tcp.pmtud_blackhole_activated Number of times the code was activated to attempt downshifting the MSS net.inet.tcp.pmtud_blackhole_min_activated Number of times the blackhole MSS was used in an attempt to downshift net.inet.tcp.pmtud_blackhole_failed Number of times that the blackhole failed to connect after downshifting the MSS Support for IP identification for atomic datagrams (RFC 6864) has been added. Support for this feature can be toggled with the net.inet.ip.rfc6864 &man.sysctl.8;, which is enabled by default. The IPSEC has been updated to include support for AES modes on both software-only and hardware-backed (&man.aesni.4;) systems. The network stack has been updated to fix handling of IPv6 On-Link redirects. The net.inet.tcp.ecn.enable sysctl mib has been changed from a binary off/on control to a three way setting. Value Description 0 Totally disable ECN. 1 Enable ECN if incoming connections request it. Outgoing connections will request ECN. 2 Enable ECN if incoming connections request it. Outgoing conections will not request ECN. + Dummynet AQM, an independent implementation of + CoDel and FQ-CoDel for ipfw/dummynet has been imported to the base + system. + Ports Collection and Package Infrastructure This section covers changes to the &os; Ports Collection, package infrastructure, and package maintenance and installation tools. Infrastructure Changes   Packaging Changes   Documentation This section covers changes to the &os; Documentation Project sources and toolchain. Documentation Source Changes   Documentation Toolchain Changes   Release Engineering and Integration This section convers changes that are specific to the &os; Release Engineering processes. Integration Changes The Release Engineering build tools have been updated to include support for producing virtual machine disk images for various cloud hosting providers. The Release Engineering build tools have been updated to use multi-threaded &man.xz.1;. By default, the number of &man.xz.1; threads is set to the number of cores available. The Release Engineering build tools have been updated to include support for building &os;/&arch.arm64; virtual machine and memory stick installation images. The Release Engineering build tools have been updated to support building &os;/&arch.arm; images without external utilities for supported boards where a corresponding u-boot port exists in the Ports Collection. The &os;/&arch.i386; memory stick installation images are now created using the &man.mkimg.1; utility, matching the way the &os;/&arch.amd64; images are created.
Index: projects/vnet/share/man/man5/rc.conf.5 =================================================================== --- projects/vnet/share/man/man5/rc.conf.5 (revision 301546) +++ projects/vnet/share/man/man5/rc.conf.5 (revision 301547) @@ -1,4663 +1,4679 @@ .\" Copyright (c) 1995 .\" Jordan K. Hubbard .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd April 30, 2016 +.Dd June 8, 2016 .Dt RC.CONF 5 .Os .Sh NAME .Nm rc.conf .Nd system configuration information .Sh DESCRIPTION The file .Nm contains descriptive information about the local host name, configuration details for any potential network interfaces and which services should be started up at system initial boot time. In new installations, the .Nm file is generally initialized by the system installation utility. .Pp The purpose of .Nm is not to run commands or perform system startup actions directly. Instead, it is included by the various generic startup scripts in .Pa /etc which conditionalize their internal actions according to the settings found there. .Pp The .Pa /etc/rc.conf file is included from the file .Pa /etc/defaults/rc.conf , which specifies the default settings for all the available options. Options need only be specified in .Pa /etc/rc.conf when the system administrator wishes to override these defaults. The file .Pa /etc/rc.conf.local is used to override settings in .Pa /etc/rc.conf for historical reasons. .Pp In addition to .Pa /etc/rc.conf.local you can also place smaller configuration files for each .Xr rc 8 script in the .Pa /etc/rc.conf.d directory or .Ao Ar dir Ac Ns Pa /rc.conf.d directories specified in .Va local_startup , which will be included by the .Va load_rc_config function. For jail configurations you could use the file .Pa /etc/rc.conf.d/jail to store jail specific configuration options. If .Va local_startup contains .Pa /usr/local/etc/rc.d and .Pa /opt/conf , .Pa /usr/local/rc.conf.d/jail and .Pa /opt/conf/rc.conf.d/jail will be loaded. If .Ao Ar dir Ac Ns Pa /rc.conf.d/ Ns Ao Ar name Ac is a directory, all of files in the directory will be loaded. Also see the .Va rc_conf_files variable below. .Pp Options are set with .Dq Ar name Ns Li = Ns Ar value assignments that use .Xr sh 1 syntax. The following list provides a name and short description for each variable that can be set in the .Nm file: .Bl -tag -width indent-two .It Va rc_debug .Pq Vt bool If set to .Dq Li YES , enable output of debug messages from rc scripts. This variable can be helpful in diagnosing mistakes when editing or integrating new scripts. Beware that this produces copious output to the terminal and .Xr syslog 3 . .It Va rc_info .Pq Vt bool If set to .Dq Li NO , disable informational messages from the rc scripts. Informational messages are displayed when a condition that is not serious enough to warrant a warning or an error occurs. .It Va rc_startmsgs .Pq Vt bool If set to .Dq Li YES , show .Dq Starting foo: when faststart is used (e.g., at boot time). .It Va early_late_divider .Pq Vt str The name of the script that should be used as the delimiter between the .Dq early and .Dq late stages of the boot process. The early stage should contain all the services needed to get the disks (local or remote) mounted so that the late stage can include scripts contained in the directories listed in the .Va local_startup variable (see below). Thus, the two likely candidates for this value are .Pa mountcritlocal for the typical system, and .Pa mountcritremote if the system needs remote file systems mounted to get access to the .Va local_startup directories; for example when .Pa /usr/local is NFS mounted. For .Pa rc.conf within a .Xr jail 8 .Pa NETWORKING is likely to be an appropriate value. Extreme care should be taken when changing this value, and before changing it one should ensure that there are adequate provisions to recover from a failed boot (such as physical contact with the machine, or reliable remote console access). .It Va always_force_depends .Pq Vt bool Various .Pa rc.d scripts use the force_depend function to check whether required services are already running, and to start them if necessary. By default during boot time this check is bypassed if the required service is enabled in .Pa /etc/rc.conf[.local] . Setting this option will bypass that check at boot time and always test whether or not the service is actually running. Enabling this option is likely to increase your boot time if services are enabled that utilize the force_depend check. .It Ao Ar name Ac Ns Va _chroot .Pq Vt str .Xr chroot to this directory before running the service. .It Ao Ar name Ac Ns Va _user .Pq Vt str Run the service under this user account. .It Ao Ar name Ac Ns Va _group .Pq Vt str Run the chrooted service under this system group. Unlike the _user setting, this setting has no effect if the service is not chrooted. .It Ao Ar name Ac Ns Va _fib .Pq Vt int The .Xr setfib 1 value to run the service under. .It Ao Ar name Ac Ns Va _nice .Pq Vt int The .Xr nice 1 value to run the service under. .It Va apm_enable .Pq Vt bool If set to .Dq Li YES , enable support for Automatic Power Management with the .Xr apm 8 command. .It Va apmd_enable .Pq Vt bool Run .Xr apmd 8 to handle APM event from userland. This also enables support for APM. .It Va apmd_flags .Pq Vt str If .Va apmd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr apmd 8 daemon. .It Va devd_enable .Pq Vt bool Run .Xr devd 8 to handle device added, removed or unknown events from the kernel. .It Va ddb_enable .Pq Vt bool Run .Xr ddb 8 to install .Xr ddb 4 scripts at boot time. .It Va ddb_config .Pq Vt str Configuration file for .Xr ddb 8 . Default .Pa /etc/ddb.conf . .It Va kld_list .Pq Vt str A list of kernel modules to load right after the local disks are mounted. Loading modules at this point in the boot process is much faster than doing it via .Pa /boot/loader.conf for those modules not necessary for mounting local disk. .It Va kldxref_enable .Pq Vt bool Set to .Dq Li NO by default. Set to .Dq Li YES to automatically rebuild .Pa linker.hints files with .Xr kldxref 8 at boot time. .It Va kldxref_clobber .Pq Vt bool Set to .Dq Li NO by default. If .Va kldxref_enable is true, setting to .Dq Li YES will overwrite existing .Pa linker.hints files at boot time. Otherwise, only missing .Pa linker.hints files are generated. .It Va kldxref_module_path .Pq Vt str Empty by default. A semi-colon .Pq Ql \&; delimited list of paths containing .Xr kld 4 modules. If empty, the contents of the .Va kern.module_path .Xr sysctl 8 are used. .It Va powerd_enable .Pq Vt bool If set to .Dq Li YES , enable the system power control facility with the .Xr powerd 8 daemon. .It Va powerd_flags .Pq Vt str If .Va powerd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr powerd 8 daemon. .It Va tmpmfs Controls the creation of a .Pa /tmp memory file system. Always happens if set to .Dq Li YES and never happens if set to .Dq Li NO . If set to anything else, a memory file system is created if .Pa /tmp is not writable. .It Va tmpsize Controls the size of a created .Pa /tmp memory file system. .It Va tmpmfs_flags Extra options passed to the .Xr mdmfs 8 utility when the memory file system for .Pa /tmp is created. The default is .Dq Li "-S" , which inhibits the use of softupdates on .Pa /tmp so that file system space is freed without delay after file truncation or deletion. See .Xr mdmfs 8 for other options you can use in .Va tmpmfs_flags . .It Va varmfs Controls the creation of a .Pa /var memory file system. Always happens if set to .Dq Li YES and never happens if set to .Dq Li NO . If set to anything else, a memory file system is created if .Pa /var is not writable. .It Va varsize Controls the size of a created .Pa /var memory file system. .It Va varmfs_flags Extra options passed to the .Xr mdmfs 8 utility when the memory file system for .Pa /var is created. The default is .Dq Li "-S" , which inhibits the use of softupdates on .Pa /var so that file system space is freed without delay after file truncation or deletion. See .Xr mdmfs 8 for other options you can use in .Va varmfs_flags . .It Va populate_var Controls the automatic population of the .Pa /var file system. Always happens if set to .Dq Li YES and never happens if set to .Dq Li NO . If set to anything else, a memory file system is created if .Pa /var is not writable. Note that this process requires access to certain commands in .Pa /usr before .Pa /usr is mounted on normal systems. .It Va cleanvar_enable .Pq Vt bool Clean the .Pa /var directory. .It Va local_startup .Pq Vt str List of directories to search for startup script files. .It Va script_name_sep .Pq Vt str The field separator to use for breaking down the list of startup script files into individual filenames. The default is a space. It is not necessary to change this unless there are startup scripts with names containing spaces. .It Va hostapd_enable .Pq Vt bool Set to .Dq Li YES to start .Xr hostapd 8 at system boot time. .It Va hostname .Pq Vt str The fully qualified domain name (FQDN) of this host on the network. This should almost certainly be set to something meaningful, even if there is no network connection. If .Xr dhclient 8 is used to set the hostname via DHCP, this variable should be set to an empty string. If this value remains unset when the system is done booting your console login will display the default hostname of .Dq Amnesiac . .It Va nisdomainname .Pq Vt str The NIS domain name of this host, or .Dq Li NO if NIS is not used. .It Va dhclient_program .Pq Vt str Path to the DHCP client program .Pa ( /sbin/dhclient , the .Ox DHCP client, is the default). .It Va dhclient_flags .Pq Vt str Additional flags to pass to the DHCP client program. For the .Ox DHCP client, see the .Xr dhclient 8 manpage for a description of the command line options available. .It Va dhclient_flags_ Ns Aq Ar iface Additional flags to pass to the DHCP client program running on .Ar iface only. When specified, this variable overrides .Va dhclient_flags . .It Va background_dhclient .Pq Vt bool Set to .Dq Li YES to start the DHCP client in background. This can cause trouble with applications depending on a working network, but it will provide a faster startup in many cases. .It Va background_dhclient_ Ns Aq Ar iface When specified, this variable overrides the .Va background_dhclient variable for interface .Ar iface only. .It Va synchronous_dhclient .Pq Vt bool Set to .Dq Li YES to start .Xr dhclient 8 synchronously at startup. This behavior can be overridden on a per-interface basis by replacing the .Dq Li DHCP keyword in the .Va ifconfig_ Ns Aq Ar interface variable with .Dq Li SYNCDHCP or .Dq Li NOSYNCDHCP . .It Va defaultroute_delay .Pq Vt int When set to a positive value, wait up to this long after configuring DHCP interfaces at startup to give the interfaces time to receive a lease. .It Va firewall_enable .Pq Vt bool Set to .Dq Li YES to load firewall rules at startup. If the kernel was not built with .Cd "options IPFIREWALL" , the .Pa ipfw.ko kernel module will be loaded. See also .Va ipfilter_enable . .It Va firewall_script .Pq Vt str This variable specifies the full path to the firewall script to run. The default is .Pa /etc/rc.firewall . .It Va firewall_type .Pq Vt str Names the firewall type from the selection in .Pa /etc/rc.firewall , or the file which contains the local firewall ruleset. Valid selections from .Pa /etc/rc.firewall are: .Pp .Bl -tag -width ".Li simple" -compact .It Li open unrestricted IP access .It Li closed all IP services disabled, except via .Dq Li lo0 .It Li client basic protection for a workstation .It Li simple basic protection for a LAN. .El .Pp If a filename is specified, the full path must be given. .It Va firewall_quiet .Pq Vt bool Set to .Dq Li YES to disable the display of firewall rules on the console during boot. .It Va firewall_logging .Pq Vt bool Set to .Dq Li YES to enable firewall event logging. This is equivalent to the .Dv IPFIREWALL_VERBOSE kernel option. .It Va firewall_logif .Pq Vt bool Set to .Dq Li YES to create pseudo interface .Li ipfw0 for logging. For more details, see .Xr ipfw 8 manual page. .It Va firewall_flags .Pq Vt str Flags passed to .Xr ipfw 8 if .Va firewall_type specifies a filename. .It Va firewall_coscripts .Pq Vt str List of executables and/or rc scripts to run after firewall starts/stops. Default is empty. .\" ----- firewall_nat_enable setting -------------------------------- .It Va firewall_nat_enable .Pq Vt bool The .Xr ipfw 8 equivalent of .Va natd_enable . Setting this to .Dq Li YES enables kernel NAT. .Va firewall_enable must also be set to .Dq Li YES . .It Va firewall_nat_interface .Pq Vt str The .Xr ipfw 8 equivalent of .Va natd_interface . This is the name of the public interface or IP address on which kernel NAT should run. .It Va firewall_nat_flags .Pq Vt str Additional configuration parameters for kernel NAT should be placed here. .It Va dummynet_enable .Pq Vt bool Setting this to .Dq Li YES will automatically load the .Xr dummynet 4 module if .Va firewall_enable is also set to .Dq Li YES . .\" ------------------------------------------------------------------- .It Va natd_program .Pq Vt str Path to .Xr natd 8 . .It Va natd_enable .Pq Vt bool Set to .Dq Li YES to enable .Xr natd 8 . .Va firewall_enable must also be set to .Dq Li YES , and .Xr divert 4 sockets must be enabled in the kernel. If the kernel was not built with .Cd "options IPDIVERT" , the .Pa ipdivert.ko kernel module will be loaded. .It Va natd_interface .Pq Vt str This is the name of the public interface on which .Xr natd 8 should run. The interface may be given as an interface name or as an IP address. .It Va natd_flags .Pq Vt str Additional .Xr natd 8 flags should be placed here. The .Fl n or .Fl a flag is automatically added with the above .Va natd_interface as an argument. .\" ----- ipfilter_enable setting -------------------------------- .It Va ipfilter_enable .Pq Vt bool Set to .Dq Li NO by default. Setting this to .Dq Li YES enables .Xr ipf 8 packet filtering. .Pp Typical usage will require putting .Bd -literal ipfilter_enable="YES" ipnat_enable="YES" ipmon_enable="YES" ipfs_enable="YES" .Ed .Pp into .Pa /etc/rc.conf and editing .Pa /etc/ipf.rules and .Pa /etc/ipnat.rules appropriately. .Pp Note that .Va ipfilter_enable and .Va ipnat_enable can be enabled independently. .Va ipmon_enable and .Va ipfs_enable both require at least one of .Va ipfilter_enable and .Va ipnat_enable to be enabled. .Pp Having .Bd -literal options IPFILTER options IPFILTER_LOG options IPFILTER_DEFAULT_BLOCK .Ed .Pp in the kernel configuration file is a good idea, too. .\" ----- ipfilter_program setting ------------------------------ .It Va ipfilter_program .Pq Vt str Path to .Xr ipf 8 (default .Pa /sbin/ipf ) . .\" ----- ipfilter_rules setting -------------------------------- .It Va ipfilter_rules .Pq Vt str Set to .Pa /etc/ipf.rules by default. This variable contains the name of the filter rule definition file. The file is expected to be readable for the .Xr ipf 8 command to execute. .\" ----- ipv6_ipfilter_rules setting --------------------------- .It Va ipv6_ipfilter_rules .Pq Vt str Set to .Pa /etc/ipf6.rules by default. This variable contains the IPv6 filter rule definition file. The file is expected to be readable for the .Xr ipf 8 command to execute. .\" ----- ipfilter_flags setting -------------------------------- .It Va ipfilter_flags .Pq Vt str Empty by default. This variable contains flags passed to the .Xr ipf 8 program. .\" ----- ipnat_enable setting ---------------------------------- .It Va ipnat_enable .Pq Vt bool Set to .Dq Li NO by default. Set it to .Dq Li YES to enable .Xr ipnat 8 network address translation. See .Va ipfilter_enable for a detailed discussion. .\" ----- ipnat_program setting --------------------------------- .It Va ipnat_program .Pq Vt str Path to .Xr ipnat 8 (default .Pa /sbin/ipnat ) . .\" ----- ipnat_rules setting ----------------------------------- .It Va ipnat_rules .Pq Vt str Set to .Pa /etc/ipnat.rules by default. This variable contains the name of the file holding the network address translation definition. This file is expected to be readable for the .Xr ipnat 8 command to execute. .\" ----- ipnat_flags setting ----------------------------------- .It Va ipnat_flags .Pq Vt str Empty by default. This variable contains flags passed to the .Xr ipnat 8 program. .\" ----- ipmon_enable setting ---------------------------------- .It Va ipmon_enable .Pq Vt bool Set to .Dq Li NO by default. Set it to .Dq Li YES to enable .Xr ipmon 8 monitoring (logging .Xr ipf 8 and .Xr ipnat 8 events). Setting this variable needs setting .Va ipfilter_enable or .Va ipnat_enable too. See .Va ipfilter_enable for a detailed discussion. .\" ----- ipmon_program setting --------------------------------- .It Va ipmon_program .Pq Vt str Path to .Xr ipmon 8 (default .Pa /sbin/ipmon ) . .\" ----- ipmon_flags setting ----------------------------------- .It Va ipmon_flags .Pq Vt str Set to .Dq Li -Ds by default. This variable contains flags passed to the .Xr ipmon 8 program. Another typical example would be .Dq Fl D Pa /var/log/ipflog to have .Xr ipmon 8 log directly to a file bypassing .Xr syslogd 8 . Make sure to adjust .Pa /etc/newsyslog.conf in such case like this: .Bd -literal /var/log/ipflog 640 10 100 * Z /var/run/ipmon.pid .Ed .\" ----- ipfs_enable setting ----------------------------------- .It Va ipfs_enable .Pq Vt bool Set to .Dq Li NO by default. Set it to .Dq Li YES to enable .Xr ipfs 8 saving the filter and NAT state tables during shutdown and reloading them during startup again. Setting this variable needs setting .Va ipfilter_enable or .Va ipnat_enable to .Dq Li YES too. See .Va ipfilter_enable for a detailed discussion. Note that if .Va kern_securelevel is set to 3, .Va ipfs_enable cannot be used because the raised securelevel will prevent .Xr ipfs 8 from saving the state tables at shutdown time. .\" ----- ipfs_program setting ---------------------------------- .It Va ipfs_program .Pq Vt str Path to .Xr ipfs 8 (default .Pa /sbin/ipfs ) . .\" ----- ipfs_flags setting ------------------------------------ .It Va ipfs_flags .Pq Vt str Empty by default. This variable contains flags passed to the .Xr ipfs 8 program. .\" ----- end of added ipf hook --------------------------------- .It Va pf_enable .Pq Vt bool Set to .Dq Li NO by default. Setting this to .Dq Li YES enables .Xr pf 4 packet filtering. .Pp Typical usage will require putting .Pp .Dl pf_enable="YES" .Pp into .Pa /etc/rc.conf and editing .Pa /etc/pf.conf appropriately. Adding .Pp .Dl "device pf" .Pp builds support for .Xr pf 4 into the kernel, otherwise the kernel module will be loaded. .It Va pf_rules .Pq Vt str Path to .Xr pf 4 ruleset configuration file (default .Pa /etc/pf.conf ) . .It Va pf_program .Pq Vt str Path to .Xr pfctl 8 (default .Pa /sbin/pfctl ) . .It Va pf_flags .Pq Vt str If .Va pf_enable is set to .Dq Li YES , these flags are passed to the .Xr pfctl 8 program when loading the ruleset. .It Va pflog_enable .Pq Vt bool Set to .Dq Li NO by default. Setting this to .Dq Li YES enables .Xr pflogd 8 which logs packets from the .Xr pf 4 packet filter. .It Va pflog_logfile .Pq Vt str If .Va pflog_enable is set to .Dq Li YES this controls where .Xr pflogd 8 stores the logfile (default .Pa /var/log/pflog ) . Check .Pa /etc/newsyslog.conf to adjust logfile rotation for this. .It Va pflog_program .Pq Vt str Path to .Xr pflogd 8 (default .Pa /sbin/pflogd ) . .It Va pflog_flags .Pq Vt str Empty by default. This variable contains additional flags passed to the .Xr pflogd 8 program. .It Va pflog_instances .Pq Vt str If logging to more than one .Xr pflog 4 interface is desired, .Va pflog_instances is set to the list of .Xr pflogd 8 instances that should be started at system boot time. If .Va pflog_instances is set, for each whitespace-seperated .Ar element in the list, .Ao Ar element Ac Ns Va _dev and .Ao Ar element Ac Ns Va _logfile elements are assumed to exist. .Ao Ar element Ac Ns Va _dev must contain the .Xr pflog 4 interface to be watched by the named .Xr pflogd 8 instance. .Ao Ar element Ac Ns Va _logfile must contain the name of the logfile that will be used by the .Xr pflogd 8 instance. .It Va ftpproxy_enable .Pq Vt bool Set to .Dq Li NO by default. Setting this to .Dq Li YES enables .Xr ftp-proxy 8 which supports the .Xr pf 4 packet filter in translating ftp connections. .It Va ftpproxy_flags .Pq Vt str Empty by default. This variable contains additional flags passed to the .Xr ftp-proxy 8 program. .It Va ftpproxy_instances .Pq Vt str Empty by default. If multiple instances of .Xr ftp-proxy 8 are desired at boot time, .Va ftpproxy_instances should contain a whitespace-seperated list of instance names. For each .Ar element in the list, a variable named .Ao Ar element Ac Ns Va _flags should be defined, containing the command-line flags to be passed to the .Xr ftp-proxy 8 instance. .It Va pfsync_enable .Pq Vt bool Set to .Dq Li NO by default. Setting this to .Dq Li YES enables exposing .Xr pf 4 state changes to other hosts over the network by means of .Xr pfsync 4 . The .Va pfsync_syncdev variable must also be set then. .It Va pfsync_syncdev .Pq Vt str Empty by default. This variable specifies the name of the network interface .Xr pfsync 4 should operate through. It must be set accordingly if .Va pfsync_enable is set to .Dq Li YES . .It Va pfsync_syncpeer .Pq Vt str Empty by default. This variable is optional. By default, state change messages are sent out on the synchronisation interface using IP multicast packets. The protocol is IP protocol 240, PFSYNC, and the multicast group used is 224.0.0.240. When a peer address is specified using the .Va pfsync_syncpeer option, the peer address is used as a destination for the pfsync traffic, and the traffic can then be protected using .Xr ipsec 4 . See the .Xr pfsync 4 manpage for more details about using .Xr ipsec 4 with .Xr pfsync 4 interfaces. .It Va pfsync_ifconfig .Pq Vt str Empty by default. This variable can contain additional options to be passed to the .Xr ifconfig 8 command used to set up .Xr pfsync 4 . .It Va tcp_extensions .Pq Vt bool Set to .Dq Li YES by default. Setting this to .Dq Li NO disables certain TCP options as described by .Rs .%T "RFC 1323" .Re Setting this to .Dq Li NO might help remedy such problems with connections as randomly hanging or other weird behavior. Some network devices are known to be broken with respect to these options. .It Va log_in_vain .Pq Vt int Set to 0 by default. The .Xr sysctl 8 variables, .Va net.inet.tcp.log_in_vain and .Va net.inet.udp.log_in_vain , as described in .Xr tcp 4 and .Xr udp 4 , are set to the given value. .It Va tcp_keepalive .Pq Vt bool Set to .Dq Li YES by default. Setting to .Dq Li NO will disable probing idle TCP connections to verify that the peer is still up and reachable. .It Va tcp_drop_synfin .Pq Vt bool Set to .Dq Li NO by default. Setting to .Dq Li YES will cause the kernel to ignore TCP frames that have both the SYN and FIN flags set. This prevents OS fingerprinting, but may break some legitimate applications. .It Va icmp_drop_redirect .Pq Vt bool Set to .Dq Li NO by default. Setting to .Dq Li YES will cause the kernel to ignore ICMP REDIRECT packets. Refer to .Xr icmp 4 for more information. .It Va icmp_log_redirect .Pq Vt bool Set to .Dq Li NO by default. Setting to .Dq Li YES will cause the kernel to log ICMP REDIRECT packets. Note that the log messages are not rate-limited, so this option should only be used for troubleshooting networks. Refer to .Xr icmp 4 for more information. .It Va icmp_bmcastecho .Pq Vt bool Set to .Dq Li YES to respond to broadcast or multicast ICMP ping packets. Refer to .Xr icmp 4 for more information. .It Va ip_portrange_first .Pq Vt int If not set to .Dq Li NO , this is the first port in the default portrange. Refer to .Xr ip 4 for more information. .It Va ip_portrange_last .Pq Vt int If not set to .Dq Li NO , this is the last port in the default portrange. Refer to .Xr ip 4 for more information. .It Va network_interfaces .Pq Vt str Set to the list of network interfaces to configure on this host or .Dq Li AUTO (the default) for all current interfaces. Setting the .Va network_interfaces variable to anything other than the default is deprecated. Interfaces that the administrator wishes to store configuration for, but not start at boot should be configured with the .Dq Li NOAUTO keyword in their .Va ifconfig_ Ns Aq Ar interface variables as described below. .Pp An .Va ifconfig_ Ns Aq Ar interface variable is also assumed to exist for each value of .Ar interface . When an interface name contains any of the characters .Dq Li .-/+ they are translated to .Dq Li _ before lookup. The variable can contain arguments to .Xr ifconfig 8 , as well as special case-insensitive keywords described below. Such keywords are removed before passing the value to .Xr ifconfig 8 while the order of the other arguments is preserved. .Pp It is possible to add IP alias entries using .Xr ifconfig 8 syntax with the address family keyword such as .Li inet . Assuming that the interface in question was .Li ed0 , it might look something like this: .Bd -literal ifconfig_ed0_alias0="inet 127.0.0.253 netmask 0xffffffff" ifconfig_ed0_alias1="inet 127.0.0.254 netmask 0xffffffff" .Ed .Pp It also possible to configure multiple IP addresses in Classless Inter-Domain Routing .Pq CIDR address notation, whose each address component can be a range like .Li inet 192.0.2.5-23/24 or .Li inet6 2001:db8:1-f::1/64 . This notation allows address and prefix length part only, not the other address modifiers. Note that the maximum number of the generated addresses from a range specification is limited to an integer value specified in .Va netif_ipexpand_max in .Xr rc.conf 5 because a small typo can unexpectedly generate a large number of addresses. The default value is .Li 2048 . It can be increased by adding the following line into .Xr rc.conf 5 : .Bd -literal netif_ipexpand_max="4096" .Ed .Pp In the case of .Li 192.0.2.5-23/24 , the address 192.0.2.5 will be configured with the netmask /24 and the addresses 192.0.2.6 to 192.0.2.23 with the non-conflicting netmask /32 as explained in the .Xr ifconfig 8 alias section. Note that this special netmask handling is only for .Li inet , not for the other address families such as .Li inet6 . .Pp With the interface in question being .Li ed0 , an example could look like: .Bd -literal ifconfig_ed0_alias2="inet 192.0.2.129/27" ifconfig_ed0_alias3="inet 192.0.2.1-5/28" .Ed .Pp and so on. .Pp Note that .Va ipv4_addrs_ Ns Aq Ar interface variable was supported for IPv4 CIDR address notation. It is now deprecated because the functionality was integrated into .Va ifconfig_ Ns Ao Ar interface Ac Ns Va _alias Ns Aq Ar n though .Va ipv4_addrs_ Ns Aq Ar interface is still supported for backward compatibility. .Pp For each .Va ifconfig_ Ns Ao Ar interface Ac Ns Va _alias Ns Aq Ar n entry with an address family keyword, its contents are passed to .Xr ifconfig 8 . Execution stops at the first unsuccessful access, so if something like this is present: .Bd -literal ifconfig_ed0_alias0="inet 127.0.0.251 netmask 0xffffffff" ifconfig_ed0_alias1="inet 127.0.0.252 netmask 0xffffffff" ifconfig_ed0_alias2="inet 127.0.0.253 netmask 0xffffffff" ifconfig_ed0_alias4="inet 127.0.0.254 netmask 0xffffffff" .Ed .Pp Then note that alias4 would .Em not be added since the search would stop with the missing .Dq Li alias3 entry. Because of this difficult to manage behavior, there is .Va ifconfig_ Ns Ao Ar interface Ac Ns Va _aliases variable, which has the same functionality as .Va ifconfig_ Ns Ao Ar interface Ac Ns Va _alias Ns Aq Ar n and can have all of entries in a variable like the following: .Bd -literal ifconfig_ed0_aliases="\\ inet 127.0.0.251 netmask 0xffffffff \\ inet 127.0.0.252 netmask 0xffffffff \\ inet 127.0.0.253 netmask 0xffffffff \\ inet 127.0.0.254 netmask 0xffffffff" .Ed .Pp It also supports CIDR notation. .Pp If the .Pa /etc/start_if. Ns Aq Ar interface file is present, it is read and executed by the .Xr sh 1 interpreter before configuring the interface as specified in the .Va ifconfig_ Ns Aq Ar interface and .Va ifconfig_ Ns Ao Ar interface Ac Ns Va _alias Ns Aq Ar n variables. .Pp If a .Va vlans_ Ns Aq Ar interface variable is set, a .Xr vlan 4 interface will be created for each item in the list with the .Ar vlandev argument set to .Ar interface . If a vlan interface's name is a number, then that number is used as the vlan tag and the new vlan interface is named .Ar interface . Ns Ar tag . Otherwise, the vlan tag must be specified via a .Va vlan parameter in the .Va create_args_ Ns Aq Ar interface variable. .Pp To create a vlan device named .Li em0.101 on .Li em0 with the vlan tag 101 and the optional the IPv4 address 192.0.2.1/24: .Bd -literal vlans_em0="101" ifconfig_em0_101="inet 192.0.2.1/24" .Ed .Pp To create a vlan device named .Li myvlan on .Li em0 with the vlan tag 102: .Bd -literal vlans_em0="myvlan" create_args_myvlan="vlan 102" .Ed .Pp If a .Va wlans_ Ns Aq Ar interface variable is set, an .Xr wlan 4 interface will be created for each item in the list with the .Ar wlandev argument set to .Ar interface . Further wlan cloning arguments may be passed to the .Xr ifconfig 8 .Cm create command by setting the .Va create_args_ Ns Aq Ar interface variable. One or more .Xr wlan 4 devices must be created for each wireless devices as of .Fx 8.0 . Debugging flags for .Xr wlan 4 devices as set by .Xr wlandebug 8 may be specified with an .Va wlandebug_ Ns Aq Ar interface variable. The contents of this variable will be passed directly to .Xr wlandebug 8 . .Pp If the .Va ifconfig_ Ns Aq Ar interface contains the keyword .Dq Li NOAUTO then the interface will not be configured at boot or by .Pa /etc/pccard_ether when .Va network_interfaces is set to .Dq Li AUTO . .Pp It is possible to bring up an interface with DHCP by adding .Dq Li DHCP to the .Va ifconfig_ Ns Aq Ar interface variable. For instance, to initialize the .Li ed0 device via DHCP, it is possible to use something like: .Bd -literal ifconfig_ed0="DHCP" .Ed .Pp If you want to configure your wireless interface with .Xr wpa_supplicant 8 for use with WPA, EAP/LEAP or WEP, you need to add .Dq Li WPA to the .Va ifconfig_ Ns Aq Ar interface variable. .Pp On the other hand, if you want to configure your wireless interface with .Xr hostapd 8 , you need to add .Dq Li HOSTAP to the .Va ifconfig_ Ns Aq Ar interface variable. .Xr hostapd 8 will use the settings from .Pa /etc/hostapd- Ns Ao Ar interface Ac Ns .conf .Pp Finally, you can add .Xr ifconfig 8 options in this variable, in addition to the .Pa /etc/start_if. Ns Aq Ar interface file. For instance, to configure an .Xr ath 4 wireless device in station mode with an address obtained via DHCP, using WPA authentication and 802.11b mode, it is possible to use something like: .Bd -literal wlans_ath0="wlan0" ifconfig_wlan0="DHCP WPA mode 11b" .Ed .Pp In addition to the .Va ifconfig_ Ns Aq Ar interface form, a fallback variable .Va ifconfig_DEFAULT may be configured. It will be used for all interfaces with no .Va ifconfig_ Ns Aq Ar interface variable. This is intended to replace the no longer supported .Va pccard_ifconfig variable. .Pp It is also possible to rename an interface by doing: .Bd -literal ifconfig_ed0_name="net0" ifconfig_net0="inet 192.0.2.1 netmask 0xffffff00" .Ed .It Va ipv6_enable .Pq Vt bool This variable is deprecated. Use .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 and .Va ipv6_activate_all_interfaces if necessary. .Pp If the variable is .Dq Li YES , .Dq Li inet6 accept_rtadv is added to all of .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 and the .Va ipv6_activate_all_interfaces is defined as .Dq Li YES . .It Va ipv6_prefer .Pq Vt bool This variable is deprecated. Use .Va ip6addrctl_policy instead. .Pp If the variable is .Dq Li YES , the default address selection policy table set by .Xr ip6addrctl 8 will be IPv6-preferred. .Pp If the variable is .Dq Li NO , the default address selection policy table set by .Xr ip6addrctl 8 will be IPv4-preferred. .It Va ipv6_activate_all_interfaces .Pq Vt bool This controls initial configuration on IPv6-capable interfaces with no corresponding .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 variable. Note that it is not always necessary to set this variable to .Dq YES to use IPv6 functionality on .Fx . In most cases, just configuring .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 variables works. .Pp If the variable is .Dq Li NO , all interfaces which do not have a corresponding .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 variable will be marked as .Dq Li IFDISABLED at creation. This means that all of IPv6 functionality on that interface is completely disabled to enforce a security policy. If the variable is set to .Dq YES , the flag will be cleared on all of the interfaces. .Pp In most cases, just defining an .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 for an IPv6-capable interface should be sufficient. However, if an interface is added dynamically .Pq by some tunneling protocols such as PPP, for example , it is often difficult to define the variable in advance. In such a case, configuring the .Dq Li IFDISABLED flag can be disabled by setting this variable to .Dq YES . .Pp For more details of the .Dq Li IFDISABLED flag and keywords .Dq Li inet6 ifdisabled , see .Xr ifconfig 8 . .Pp Default is .Dq Li NO . .It Va ipv6_privacy .Pq Vt bool If the variable is .Dq Li YES privacy addresses will be generated for each IPv6 interface as described in RFC 4941. .It Va ipv6_network_interfaces .Pq Vt str This is the IPv6 equivalent of .Va network_interfaces . Normally manual configuration of this variable is not needed. .It Va ipv6_cpe_wanif .Pq Vt str If the variable is set to an interface name, the .Xr ifconfig 8 options .Dq inet6 -no_radr accept_rtadv will be added to the specified interface automatically before evaluating .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 , and two .Xr sysctl 8 variables .Va net.inet6.ip6.rfc6204w3 and .Va net.inet6.ip6.no_radr will be set to 1. .Pp This means the specified interface will accept ICMPv6 Router Advertisement messages on that link and add the discovered routers into the Default Router List. While the other interfaces can still accept RA messages if the .Dq inet6 accept_rtadv option is specified, adding routes into the Default Router List will be disabled by .Dq inet6 no_radr option by default. See .Xr ifconfig 8 for more details. .Pp Note that ICMPv6 Router Advertisement messages will be accepted even when .Va net.inet6.ip6.forwarding is 1 .Pq packet forwarding is enabled when .Va net.inet6.ip6.rfc6204w3 is set to 1. .Pp Default is .Dq Li NO . .It Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 .Pq Vt str IPv6 functionality on an interface should be configured by .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 , instead of setting ifconfig parameters in .Va ifconfig_ Ns Aq Ar interface . If this variable is empty, all of IPv6 configurations on the specified interface by other variables such as .Va ipv6_prefix_ Ns Ao Ar interface Ac will be ignored. .Pp Aliases should be set by .Va ifconfig_ Ns Ao Ar interface Ac Ns Va _alias Ns Aq Ar n with .Dq Li inet6 keyword. For example: .Bd -literal ifconfig_ed0_ipv6="inet6 2001:db8:1::1 prefixlen 64" ifconfig_ed0_alias0="inet6 2001:db8:2::1 prefixlen 64" .Ed .Pp Interfaces that have an .Dq Li inet6 accept_rtadv keyword in .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 setting will be automatically configured by SLAAC .Pq StateLess Address AutoConfiguration described in .Rs .%T "RFC 4862" .Re .Pp Note that a link-local address will be automatically configured in addition to the configured global-scope addresses because the IPv6 specifications require it on each link. The address is calculated from the MAC address by using an algorithm defined in .Rs .%T "RFC 4862" .%O "Section 5.3" .Re .Pp If only a link-local address is needed on the interface, the following configuration can be used: .Bd -literal ifconfig_ed0_ipv6="inet6 auto_linklocal" .Ed .Pp A link-local address can also be configured manually. This is useful for the default router address of an IPv6 router so that it does not change when the network interface card is replaced. For example: .Bd -literal ifconfig_ed0_ipv6="inet6 fe80::1 prefixlen 64" .Ed .It Va ipv6_prefix_ Ns Aq Ar interface .Pq Vt str If one or more prefixes are defined in .Va ipv6_prefix_ Ns Aq Ar interface addresses based on each prefix and the EUI-64 interface index will be configured on that interface. Note that this variable will be ignored when .Va ifconfig_ Ns Ao Ar interface Ac Ns _ipv6 is empty. .Pp For example, the following configuration .Bd -literal ipv6_prefix_ed0="2001:db8:1:0 2001:db8:2:0" .Ed .Pp is equivalent to the following: .Bd -literal ifconfig_ed0_alias0="inet6 2001:db8:1:: eui64 prefixlen 64" ifconfig_ed0_alias1="inet6 2001:db8:1:: prefixlen 64 anycast" ifconfig_ed0_alias2="inet6 2001:db8:2:: eui64 prefixlen 64" ifconfig_ed0_alias3="inet6 2001:db8:2:: prefixlen 64 anycast" .Ed .Pp These Subnet-Router anycast addresses will be added only when .Va ipv6_gateway_enable is YES. .It Va ipv6_default_interface .Pq Vt str If not set to .Dq Li NO , this is the default output interface for scoped addresses. This works only with ipv6_gateway_enable="NO". .It Va ip6addrctl_enable .Pq Vt bool This variable is to enable configuring default address selection policy table .Pq RFC 3484 . The table can be specified in another variable .Va ip6addrctl_policy . For .Va ip6addrctl_policy the following keywords can be specified: .Dq Li ipv4_prefer , .Dq Li ipv6_prefer , or .Dq Li AUTO . .Pp If .Dq Li ipv4_prefer or .Dq Li ipv6_prefer is specified, .Xr ip6addrctl 8 installs a pre-defined policy table described in Section 2.1 .Pq IPv6-preferred or 10.3 .Pq IPv4-preferred of RFC 3484. .Pp If .Dq Li AUTO is specified, it attempts to read a file .Pa /etc/ip6addrctl.conf first. If this file is found, .Xr ip6addrctl 8 reads and installs it. If not found, a policy is automatically set according to .Va ipv6_activate_all_interfaces variable; if the variable is set to .Dq Li YES the IPv6-preferred one is used. Otherwise IPv4-preferred. .Pp The default value of .Va ip6addrctl_enable and .Va ip6addrctl_policy are .Dq Li YES and .Dq Li AUTO , respectively. .It Va cloned_interfaces .Pq Vt str Set to the list of clonable network interfaces to create on this host. Further cloning arguments may be passed to the .Xr ifconfig 8 .Cm create command for each interface by setting the .Va create_args_ Ns Aq Ar interface variable. If an interface name is specified with .Dq :sticky keyword, the interface will not be destroyed even when .Pa rc.d/netif script is invoked with .Dq stop argument. This is useful when reconfiguring the interface without destroying it. Entries in .Va cloned_interfaces are automatically appended to .Va network_interfaces for configuration. .It Va cloned_interfaces_sticky .Pq Vt bool This variable is to globally enable functionality of .Dq :sticky keyword in .Va cloned_interfaces for all interfaces. The default value is .Dq NO . Even if this variable is specified to .Dq YES , .Dq :nosticky keyword can be used to override it on per interface basis. .It Va gif_interfaces .Pq Vt str This variable is deprecated in favor of .Va cloned_interfaces . Set to the list of .Xr gif 4 tunnel interfaces to configure on this host. A .Va gifconfig_ Ns Aq Ar interface variable is assumed to exist for each value of .Ar interface . The value of this variable is used to configure the link layer of the tunnel according to the syntax of the .Cm tunnel option to .Xr ifconfig 8 . Additionally, this option ensures that each listed interface is created via the .Cm create option to .Xr ifconfig 8 before attempting to configure it. .It Va sppp_interfaces .Pq Vt str Set to the list of .Xr sppp 4 interfaces to configure on this host. A .Va spppconfig_ Ns Aq Ar interface variable is assumed to exist for each value of .Ar interface . Each interface should also be configured by a general .Va ifconfig_ Ns Aq Ar interface setting. Refer to .Xr spppcontrol 8 for more information about available options. .It Va ppp_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr ppp 8 daemon. .It Va ppp_profile .Pq Vt str The name of the profile to use from .Pa /etc/ppp/ppp.conf . Also used for per-profile overrides of .Va ppp_mode and .Va ppp_nat , and .Va ppp_ Ns Ao Ar profile Ac Ns _unit . When the profile name contains any of the characters .Dq Li .-/+ they are translated to .Dq Li _ for the proposes of the override variable names. .It Va ppp_mode .Pq Vt str Mode in which to run the .Xr ppp 8 daemon. .It Va ppp_ Ns Ao Ar profile Ac Ns _mode .Pq Vt str Overrides the global .Va ppp_mode for .Ar profile . Accepted modes are .Dq Li auto , .Dq Li ddial , .Dq Li direct and .Dq Li dedicated . See the manual for a full description. .It Va ppp_nat .Pq Vt bool If set to .Dq Li YES , enables network address translation. Used in conjunction with .Va gateway_enable allows hosts on private network addresses access to the Internet using this host as a network address translating router. .It Va ppp_ Ns Ao Ar profile Ac Ns _nat .Pq Vt str Overrides the global .Va ppp_nat for .Ar profile . .It Va ppp_ Ns Ao Ar profile Ac Ns _unit .Pq Vt int Set the unit number to be used for this profile. See the manual description of .Fl unit Ns Ar N for details. .It Va ppp_user .Pq Vt str The name of the user under which .Xr ppp 8 should be started. By default, .Xr ppp 8 is started as .Dq Li root . .It Va rc_conf_files .Pq Vt str This option is used to specify a list of files that will override the settings in .Pa /etc/defaults/rc.conf . The files will be read in the order in which they are specified and should include the full path to the file. By default, the files specified are .Pa /etc/rc.conf and .Pa /etc/rc.conf.local .It Va zfs_enable .Pq Vt bool If set to .Dq Li YES , .Pa /etc/rc.d/zfs will attempt to automatically mount ZFS file systems and initialize ZFS volumes (ZVOLs). .It Va gptboot_enable .Pq Vt bool If set to .Dq Li YES , .Pa /etc/rc.d/gptboot will log if the system successfully (or not) booted from a GPT partition, which had the .Ar bootonce attribute set using .Xr gpart 8 utility. .It Va gbde_autoattach_all .Pq Vt bool If set to .Dq Li YES , .Pa /etc/rc.d/gbde will attempt to automatically initialize your .bde devices in .Pa /etc/fstab . .It Va gbde_devices .Pq Vt str List the devices that the script should try to attach, or .Dq Li AUTO . .It Va gbde_lockdir .Pq Vt str The directory where the .Xr gbde 4 lockfiles are located. The default lockfile directory is .Pa /etc . .Pp The lockfile for each individual .Xr gbde 4 device can be overridden by setting the variable .Va gbde_lock_ Ns Aq Ar device , where .Ar device is the encrypted device without the .Dq Pa /dev/ and .Dq Pa .bde parts. .It Va gbde_attach_attempts .Pq Vt int Number of times to attempt attaching to a .Xr gbde 4 device, i.e., how many times the user is asked for the pass-phrase. Default is 3. .It Va geli_devices .Pq Vt str List of devices to automatically attach on boot. Note that .eli devices from .Pa /etc/fstab are automatically appended to this list. .It Va geli_tries .Pq Vt int Number of times user is asked for the pass-phrase. If empty, it will be taken from .Va kern.geom.eli.tries sysctl variable. .It Va geli_default_flags .Pq Vt str Default flags to use by .Xr geli 8 when configuring disk encryption. Flags can be configured for every device separately by defining .Va geli_ Ns Ao Ar device Ac Ns Va _flags variable. .It Va geli_autodetach .Pq Vt str Specifies if GELI devices should be marked for detach on last close after file systems are mounted. Default is .Dq Li YES . This can be changed for every device separately by defining .Va geli_ Ns Ao Ar device Ac Ns Va _autodetach variable. .It Va root_rw_mount .Pq Vt bool Set to .Dq Li YES by default. After the file systems are checked at boot time, the root file system is remounted as read-write if this is set to .Dq Li YES . Diskless systems that mount their root file system from a read-only remote NFS share should set this to .Dq Li NO in their .Pa rc.conf . .It Va fsck_y_enable .Pq Vt bool If set to .Dq Li YES , .Xr fsck 8 will be run with the .Fl y flag if the initial preen of the file systems fails. .It Va background_fsck .Pq Vt bool If set to .Dq Li YES , the system will attempt to run .Xr fsck 8 in the background where possible. .It Va background_fsck_delay .Pq Vt int The amount of time in seconds to sleep before starting a background .Xr fsck 8 . It defaults to sixty seconds to allow large applications such as the X server to start before disk I/O bandwidth is monopolized by .Xr fsck 8 . If set to a negative number, the background file system check will be delayed indefinitely to allow the administrator to run it at a more convenient time. For example it may be run from .Xr cron 8 by adding a line like .Pp .Dl "0 4 * * * root /etc/rc.d/bgfsck forcestart" .Pp to .Pa /etc/crontab . .It Va netfs_types .Pq Vt str List of file system types that are network-based. This list should generally not be modified by end users. Use .Va extra_netfs_types instead. .It Va extra_netfs_types .Pq Vt str If set to something other than .Dq Li NO (the default), this variable extends the list of file system types for which automatic mounting at startup by .Xr rc 8 should be delayed until the network is initialized. It should contain a whitespace-separated list of network file system descriptor pairs, each consisting of a file system type as passed to .Xr mount 8 and a human-readable, one-word description, joined with a colon .Pq Ql \&: . Extending the default list in this way is only necessary when third party file system types are used. .It Va syslogd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr syslogd 8 daemon. .It Va syslogd_program .Pq Vt str Path to .Xr syslogd 8 (default .Pa /usr/sbin/syslogd ) . .It Va syslogd_flags .Pq Vt str If .Va syslogd_enable is set to .Dq Li YES , these are the flags to pass to .Xr syslogd 8 . .It Va inetd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr inetd 8 daemon. .It Va inetd_program .Pq Vt str Path to .Xr inetd 8 (default .Pa /usr/sbin/inetd ) . .It Va inetd_flags .Pq Vt str If .Va inetd_enable is set to .Dq Li YES , these are the flags to pass to .Xr inetd 8 . .It Va hastd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr hastd 8 daemon. .It Va hastd_program .Pq Vt str Path to .Xr hastd 8 (default .Pa /sbin/hastd ) . .It Va hastd_flags .Pq Vt str If .Va hastd_enable is set to .Dq Li YES , these are the flags to pass to .Xr hastd 8 . .It Va local_unbound_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr unbound 8 daemon as a local caching resolver. .It Va kdc_enable .Pq Vt bool Set to .Dq Li YES to start a Kerberos 5 authentication server at boot time. .It Va kdc_program .Pq Vt str If .Va kdc_enable is set to .Dq Li YES this is the path to Kerberos 5 Authentication Server. .It Va kdc_flags .Pq Vt str Empty by default. This variable contains additional flags to be passed to the Kerberos 5 authentication server. .It Va kadmind_enable .Pq Vt bool Set to .Dq Li YES to start .Xr kadmind 8 , the Kerberos 5 Administration Daemon; set to .Dq Li NO on a slave server. .It Va kadmind_program .Pq Vt str If .Va kadmind_enable is set to .Dq Li YES this is the path to Kerberos 5 Administration Daemon. .It Va kpasswdd_enable .Pq Vt bool Set to .Dq Li YES to start .Xr kpasswdd 8 , the Kerberos 5 Password-Changing Daemon; set to .Dq Li NO on a slave server. .It Va kpasswdd_program .Pq Vt str If .Va kpasswdd_enable is set to .Dq Li YES this is the path to Kerberos 5 Password-Changing Daemon. .It Va kfd_enable .Pq Vt bool Set to .Dq Li YES to start .Xr kfd 8 , the Kerberos 5 ticket forwarding daemon, at the boot time. .It Va kfd_program .Pq Vt str Path to .Xr kfd 8 (default .Pa /usr/libexec/kfd ) . .It Va rwhod_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr rwhod 8 daemon at boot time. .It Va rwhod_flags .Pq Vt str If .Va rwhod_enable is set to .Dq Li YES , these are the flags to pass to it. .It Va amd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr amd 8 daemon at boot time. .It Va amd_flags .Pq Vt str If .Va amd_enable is set to .Dq Li YES , these are the flags to pass to it. See the .Xr amd 8 manpage for more information. .It Va amd_map_program .Pq Vt str If set, the specified program is run to get the list of .Xr amd 8 maps. For example, if the .Xr amd 8 maps are stored in NIS, one can set this to run .Xr ypcat 1 to get a list of .Xr amd 8 maps from the .Pa amd.master NIS map. .It Va update_motd .Pq Vt bool If set to .Dq Li YES , .Pa /etc/motd will be updated at boot time to reflect the kernel release being run. If set to .Dq Li NO , .Pa /etc/motd will not be updated. .It Va nfs_client_enable .Pq Vt bool If set to .Dq Li YES , run the NFS client daemons at boot time. .It Va nfs_access_cache .Pq Vt int If .Va nfs_client_enable is set to .Dq Li YES , this can be set to .Dq Li 0 to disable NFS ACCESS RPC caching, or to the number of seconds for which NFS ACCESS results should be cached. A value of 2-10 seconds will substantially reduce network traffic for many NFS operations. .It Va nfs_server_enable .Pq Vt bool If set to .Dq Li YES , run the NFS server daemons at boot time. .It Va nfs_server_flags .Pq Vt str If .Va nfs_server_enable is set to .Dq Li YES , these are the flags to pass to the .Xr nfsd 8 daemon. .It Va nfsv4_server_enable .Pq Vt bool If .Va nfs_server_enable is set to .Dq Li YES and .Va nfsv4_server_enable are set to .Dq Li YES , enable the server for NFSv4 as well as NFSv2 and NFSv3. .It Va nfsuserd_enable .Pq Vt bool If .Va nfsuserd_enable is set to .Dq Li YES , run the nfsuserd daemon, which is needed for NFSv4 in order to map between user/group names vs uid/gid numbers. If .Va nfsv4_server_enable is set to .Dq Li YES , this will be forced enabled. .It Va nfsuserd_flags .Pq Vt str If .Va nfsuserd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr nfsuserd 8 daemon. .It Va nfscbd_enable .Pq Vt bool If .Va nfscbd_enable is set to .Dq Li YES , run the nfscbd daemon, which enables callbacks/delegations for the NFSv4 client. .It Va nfscbd_flags .Pq Vt str If .Va nfscbd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr nfscbd 8 daemon. .It Va mountd_enable .Pq Vt bool If set to .Dq Li YES , and no .Va nfs_server_enable is set, start .Xr mountd 8 , but not .Xr nfsd 8 daemon. It is commonly needed to run CFS without real NFS used. .It Va mountd_flags .Pq Vt str If .Va mountd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr mountd 8 daemon. .It Va weak_mountd_authentication .Pq Vt bool If set to .Dq Li YES , allow services like PCNFSD to make non-privileged mount requests. .It Va nfs_reserved_port_only .Pq Vt bool If set to .Dq Li YES , provide NFS services only on a secure port. .It Va nfs_bufpackets .Pq Vt int If set to a number, indicates the number of packets worth of socket buffer space to reserve on an NFS client. The kernel default is typically 4. Using a higher number may be useful on gigabit networks to improve performance. The minimum value is 2 and the maximum is 64. .It Va rpc_lockd_enable .Pq Vt bool If set to .Dq Li YES and also an NFS server or client, run .Xr rpc.lockd 8 at boot time. .It Va rpc_lockd_flags .Pq Vt str If .Va rpc_lockd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr rpc.lockd 8 daemon. .It Va rpc_statd_enable .Pq Vt bool If set to .Dq Li YES and also an NFS server or client, run .Xr rpc.statd 8 at boot time. .It Va rpc_statd_flags .Pq Vt str If .Va rpc_statd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr rpc.statd 8 daemon. .It Va rpcbind_program .Pq Vt str Path to .Xr rpcbind 8 (default .Pa /usr/sbin/rpcbind ) . .It Va rpcbind_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr rpcbind 8 service at boot time. .It Va rpcbind_flags .Pq Vt str If .Va rpcbind_enable is set to .Dq Li YES , these are the flags to pass to the .Xr rpcbind 8 daemon. .It Va keyserv_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr keyserv 8 daemon on boot for running Secure RPC. .It Va keyserv_flags .Pq Vt str If .Va keyserv_enable is set to .Dq Li YES , these are the flags to pass to .Xr keyserv 8 daemon. .It Va pppoed_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr pppoed 8 daemon at boot time to provide PPP over Ethernet services. .It Va pppoed_ Ns Aq Ar provider .Pq Vt str .Xr pppoed 8 listens to requests to this .Ar provider and ultimately runs .Xr ppp 8 with a .Ar system argument of the same name. .It Va pppoed_flags .Pq Vt str Additional flags to pass to .Xr pppoed 8 . .It Va pppoed_interface .Pq Vt str The network interface to run .Xr pppoed 8 on. This is mandatory when .Va pppoed_enable is set to .Dq Li YES . .It Va timed_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr timed 8 service at boot time. This command is intended for networks of machines where a consistent .Dq "network time" for all hosts must be established. This is often useful in large NFS environments where time stamps on files are expected to be consistent network-wide. .It Va timed_flags .Pq Vt str If .Va timed_enable is set to .Dq Li YES , these are the flags to pass to the .Xr timed 8 service. .It Va ntpdate_enable .Pq Vt bool If set to .Dq Li YES , run .Xr ntpdate 8 at system startup. This command is intended to synchronize the system clock only .Em once from some standard reference. .It Va ntpdate_config .Pq Vt str Configuration file for .Xr ntpdate 8 . Default .Pa /etc/ntp.conf . .It Va ntpdate_hosts .Pq Vt str A whitespace-separated list of NTP servers to synchronize with at startup. The default is to use the servers listed in .Va ntpdate_config , if that file exists. .It Va ntpdate_program .Pq Vt str Path to .Xr ntpdate 8 (default .Pa /usr/sbin/ntpdate ) . .It Va ntpdate_flags .Pq Vt str If .Va ntpdate_enable is set to .Dq Li YES , these are the flags to pass to the .Xr ntpdate 8 command (typically a hostname). .It Va ntpd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr ntpd 8 command at boot time. .It Va ntpd_program .Pq Vt str Path to .Xr ntpd 8 (default .Pa /usr/sbin/ntpd ) . .It Va ntpd_config .Pq Vt str Path to .Xr ntpd 8 configuration file. Default .Pa /etc/ntp.conf . .It Va ntpd_flags .Pq Vt str If .Va ntpd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr ntpd 8 daemon. .It Va ntpd_sync_on_start .Pq Vt bool If set to .Dq Li YES , .Xr ntpd 8 is run with the .Fl g flag, which syncs the system's clock on startup. See .Xr ntpd 8 for more information regarding the .Fl g option. This is a preferred alternative to using .Xr ntpdate 8 or specifying the .Va ntpdate_enable variable. .It Va nis_client_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr ypbind 8 service at system boot time. .It Va nis_client_flags .Pq Vt str If .Va nis_client_enable is set to .Dq Li YES , these are the flags to pass to the .Xr ypbind 8 service. +.It Va nis_ypldap_enable +.Pq Vt bool +If set to +.Dq Li YES , +run the +.Xr ypldap 8 +daemon at system boot time. +.It Va nis_ypldap_flags +.Pq Vt str +If +.Va nis.ypldap_enable +is set to +.Dq Li YES , +these are the flags to pass to the +.Xr ypldap 8 +daemon. .It Va nis_ypset_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr ypset 8 daemon at system boot time. .It Va nis_ypset_flags .Pq Vt str If .Va nis_ypset_enable is set to .Dq Li YES , these are the flags to pass to the .Xr ypset 8 daemon. .It Va nis_server_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr ypserv 8 daemon at system boot time. .It Va nis_server_flags .Pq Vt str If .Va nis_server_enable is set to .Dq Li YES , these are the flags to pass to the .Xr ypserv 8 daemon. .It Va nis_ypxfrd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr rpc.ypxfrd 8 daemon at system boot time. .It Va nis_ypxfrd_flags .Pq Vt str If .Va nis_ypxfrd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr rpc.ypxfrd 8 daemon. .It Va nis_yppasswdd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr rpc.yppasswdd 8 daemon at system boot time. .It Va nis_yppasswdd_flags .Pq Vt str If .Va nis_yppasswdd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr rpc.yppasswdd 8 daemon. .It Va rpc_ypupdated_enable .Pq Vt bool If set to .Dq Li YES , run the .Nm rpc.ypupdated daemon at system boot time. .It Va bsnmpd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr bsnmpd 1 daemon at system boot time. Be sure to understand the security implications of running SNMP daemon on your host. .It Va bsnmpd_flags .Pq Vt str If .Va bsnmpd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr bsnmpd 1 daemon. .It Va defaultrouter .Pq Vt str If not set to .Dq Li NO , create a default route to this host name or IP address (use an IP address if this router is also required to get to the name server!). .It Va ipv6_defaultrouter .Pq Vt str The IPv6 equivalent of .Va defaultrouter . .It Va static_arp_pairs .Pq Vt str Set to the list of static ARP pairs that are to be added at system boot time. For each whitespace separated .Ar element in the value, a .Va static_arp_ Ns Aq Ar element variable is assumed to exist whose contents will later be passed to a .Dq Nm arp Cm -S operation. For example .Bd -literal static_arp_pairs="gw" static_arp_gw="192.168.1.1 00:01:02:03:04:05" .Ed .It Va static_ndp_pairs .Pq Vt str Set to the list of static NDP pairs that are to be added at system boot time. For each whitespace separated .Ar element in the value, a .Va static_ndp_ Ns Aq Ar element variable is assumed to exist whose contents will later be passed to a .Dq Nm ndp Cm -s operation. For example .Bd -literal static_ndp_pairs="gw" static_ndp_gw="2001:db8:3::1 00:01:02:03:04:05" .Ed .It Va static_routes .Pq Vt str Set to the list of static routes that are to be added at system boot time. If not set to .Dq Li NO then for each whitespace separated .Ar element in the value, a .Va route_ Ns Aq Ar element variable is assumed to exist whose contents will later be passed to a .Dq Nm route Cm add operation. For example: .Bd -literal static_routes="ext mcast:gif0 gif0local:gif0" route_ext="-net 10.0.0.0/24 -gateway 192.168.0.1" route_mcast="-net 224.0.0.0/4 -iface gif0" route_gif0local="-host 169.254.1.1 -iface lo0" .Ed .Pp When an .Ar element is in the form of .Li name:ifname , the route is specific to the interface .Li ifname . .It Va ipv6_static_routes .Pq Vt str The IPv6 equivalent of .Va static_routes . If not set to .Dq Li NO then for each whitespace separated .Ar element in the value, a .Va ipv6_route_ Ns Aq Ar element variable is assumed to exist whose contents will later be passed to a .Dq Nm route Cm add Fl inet6 operation. .It Va natm_static_routes .Pq Vt str The .Xr natmip 4 equivalent of .Va static_routes . If not empty then for each whitespace separated .Ar element in the value, a .Va route_ Ns Aq Ar element variable is assumed to exist whose contents will later be passed to a .Dq Nm atmconfig Cm natm Cm add operation. .It Va gateway_enable .Pq Vt bool If set to .Dq Li YES , configure host to act as an IP router, e.g.\& to forward packets between interfaces. .It Va ipv6_gateway_enable .Pq Vt bool The IPv6 equivalent of .Va gateway_enable . .It Va routed_enable .Pq Vt bool If set to .Dq Li YES , run a routing daemon of some sort, based on the settings of .Va routed_program and .Va routed_flags . .It Va route6d_enable .Pq Vt bool The IPv6 equivalent of .Va routed_enable . If set to .Dq Li YES , run a routing daemon of some sort, based on the settings of .Va route6d_program and .Va route6d_flags . .It Va routed_program .Pq Vt str If .Va routed_enable is set to .Dq Li YES , this is the name of the routing daemon to use. .It Va route6d_program .Pq Vt str The IPv6 equivalent of .Va routed_program . .It Va routed_flags .Pq Vt str If .Va routed_enable is set to .Dq Li YES , these are the flags to pass to the routing daemon. .It Va route6d_flags .Pq Vt str The IPv6 equivalent of .Va routed_flags . .It Va mroute6d_enable .Pq Vt bool If set to .Dq Li YES , run the IPv6 multicast routing daemon. .Pp Note that multicast routing daemons are no longer included in the .Fx base system, however, both .Xr mrouted 8 and .Xr pim6dd 8 may be installed from the .Fx Ports Collection. .It Va mroute6d_flags .Pq Vt str If .Va mroute6d_enable is set to .Dq Li YES , these are the flags passed to the IPv6 multicast routing daemon. .It Va mroute6d_program .Pq Vt str If .Va mroute6d_enable is set to .Dq Li YES , this is the path to the IPv6 multicast routing daemon. .It Va rtadvd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr rtadvd 8 daemon at boot time. The .Xr rtadvd 8 utility sends ICMPv6 Router Advertisement messages to the interfaces specified in .Va rtadvd_interfaces . This should only be enabled with great care. You may want to fine-tune .Xr rtadvd.conf 5 . .It Va rtadvd_interfaces .Pq Vt str If .Va rtadvd_enable is set to .Dq Li YES this is the list of interfaces to use. .It Va arpproxy_all .Pq Vt bool If set to .Dq Li YES , enable global proxy ARP. .It Va forward_sourceroute .Pq Vt bool If set to .Dq Li YES and .Va gateway_enable is also set to .Dq Li YES , source-routed packets are forwarded. .It Va accept_sourceroute .Pq Vt bool If set to .Dq Li YES , the system will accept source-routed packets directed at it. .It Va rarpd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr rarpd 8 daemon at system boot time. .It Va rarpd_flags .Pq Vt str If .Va rarpd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr rarpd 8 daemon. .It Va bootparamd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr bootparamd 8 daemon at system boot time. .It Va bootparamd_flags .Pq Vt str If .Va bootparamd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr bootparamd 8 daemon. .It Va stf_interface_ipv4addr .Pq Vt str If not set to .Dq Li NO , this is the local IPv4 address for 6to4 (IPv6 over IPv4 tunneling interface). Specify this entry to enable the 6to4 interface. .It Va stf_interface_ipv4plen .Pq Vt int Prefix length for 6to4 IPv4 addresses, to limit peer address range. An effective value is 0-31. .It Va stf_interface_ipv6_ifid .Pq Vt str IPv6 interface ID for .Xr stf 4 . This can be set to .Dq Li AUTO . .It Va stf_interface_ipv6_slaid .Pq Vt str IPv6 Site Level Aggregator for .Xr stf 4 . .It Va ipv6_ipv4mapping .Pq Vt bool If set to .Dq Li YES this enables IPv4 mapped IPv6 address communication (like .Li ::ffff:a.b.c.d ) . .It Va rtsold_enable .Pq Vt bool Set to .Dq Li YES to enable the .Xr rtsold 8 daemon to send ICMPv6 Router Solicitation messages. .It Va rtsold_flags .Pq Vt str If .Va rtsold_enable is set to .Dq Li YES , these are the flags to pass to .Xr rtsold 8 . .It Va rtsol_flags .Pq Vt str For interfaces configured with the .Dq Li inet6 accept_rtadv keyword, these are the flags to pass to .Xr rtsol 8 . .Pp Note that .Va rtsold_enable is mutually exclusive to .Va rtsol_flags ; .Va rtsold_enable takes precedence. .It Va atm_enable .Pq Vt bool Set to .Dq Li YES to enable the configuration of ATM interfaces at system boot time. For all of the ATM variables described below, please refer to the .Xr atm 8 manual page for further details on the available command parameters. Also refer to the files in .Pa /usr/share/examples/atm for more detailed configuration information. .It Va atm_load .Pq Vt str This is a list of physical ATM interface drivers to load. Typical values are .Dq Li hfa_pci and/or .Dq Li hea_pci . .It Va atm_netif_ Ns Aq Ar intf .Pq Vt str For the ATM physical interface .Ar intf , this variable defines the name prefix and count for the ATM network interfaces to be created. The value will be passed as the parameters of an .Dq Nm atm Cm "set netif" Ar intf command. .It Va atm_sigmgr_ Ns Aq Ar intf .Pq Vt str For the ATM physical interface .Ar intf , this variable defines the ATM signalling manager to be used. The value will be passed as the parameters of an .Dq Nm atm Cm attach Ar intf command. .It Va atm_prefix_ Ns Aq Ar intf .Pq Vt str For the ATM physical interface .Ar intf , this variable defines the NSAP prefix for interfaces using a UNI signalling manager. If set to .Dq Li ILMI , the prefix will automatically be set via the .Xr ilmid 8 daemon. Otherwise, the value will be passed as the parameters of an .Dq Nm atm Cm "set prefix" Ar intf command. .It Va atm_macaddr_ Ns Aq Ar intf .Pq Vt str For the ATM physical interface .Ar intf , this variable defines the MAC address for interfaces using a UNI signalling manager. If set to .Dq Li NO , the hardware MAC address contained in the ATM interface card will be used. Otherwise, the value will be passed as the parameters of an .Dq Nm atm Cm "set mac" Ar intf command. .It Va atm_arpserver_ Ns Aq Ar netif .Pq Vt str For the ATM network interface .Ar netif , this variable defines the ATM address for a host which is to provide ATMARP service. This variable is only applicable to interfaces using a UNI signalling manager. If set to .Dq Li local , this host will become an ATMARP server. The value will be passed as the parameters of an .Dq Nm atm Cm "set arpserver" Ar netif command. .It Va atm_scsparp_ Ns Aq Ar netif .Pq Vt bool If set to .Dq Li YES , SCSP/ATMARP service for the network interface .Ar netif will be initiated using the .Xr scspd 8 and .Xr atmarpd 8 daemons. This variable is only applicable if .Va atm_arpserver_ Ns Aq Ar netif is set to .Dq Li local . .It Va atm_pvcs .Pq Vt str Set to the list of ATM PVCs to be added at system boot time. For each whitespace separated .Ar element in the value, an .Va atm_pvc_ Ns Aq Ar element variable is assumed to exist. The value of each of these variables will be passed as the parameters of an .Dq Nm atm Cm "add pvc" command. .It Va atm_arps .Pq Vt str Set to the list of permanent ATM ARP entries to be added at system boot time. For each whitespace separated .Ar element in the value, an .Va atm_arp_ Ns Aq Ar element variable is assumed to exist. The value of each of these variables will be passed as the parameters of an .Dq Nm atm Cm "add arp" command. .It Va natm_interfaces .Pq Vt str Set to the list of .Xr natm 4 interfaces that will also be used for HARP through .Xr harp 4 . If this list is not empty all interfaces in the list will be brought up with .Xr ifconfig 8 and .Xr harp 4 will be loaded. For this to work the interface drivers must be either compiled into the kernel or must reside on the root partition. .It Va keybell .Pq Vt str The keyboard bell sound. Set to .Dq Li normal , .Dq Li visual , .Dq Li off , or .Dq Li NO if the default behavior is desired. For details, refer to the .Xr kbdcontrol 1 manpage. .It Va keyboard .Pq Vt str If set to a non-null string, the virtual console's keyboard input is set to this device. .It Va keymap .Pq Vt str If set to .Dq Li NO , no keymap is installed, otherwise the value is used to install the keymap file found in .Pa /usr/share/syscons/keymaps/ Ns Ao Ar value Ac Ns Pa .kbd (if using .Xr syscons 4 ) or .Pa /usr/share/vt/keymaps/ Ns Ao Ar value Ac Ns Pa .kbd (if using .Xr vt 4 ) . .It Va keyrate .Pq Vt str The keyboard repeat speed. Set to .Dq Li slow , .Dq Li normal , .Dq Li fast , or .Dq Li NO if the default behavior is desired. .It Va keychange .Pq Vt str If not set to .Dq Li NO , attempt to program the function keys with the value. The value should be a single string of the form: .Dq Ar funkey_number new_value Op Ar funkey_number new_value ... . .It Va cursor .Pq Vt str Can be set to the value of .Dq Li normal , .Dq Li blink , .Dq Li destructive , or .Dq Li NO to set the cursor behavior explicitly or choose the default behavior. .It Va scrnmap .Pq Vt str If set to .Dq Li NO , no screen map is installed, otherwise the value is used to install the screen map file in .Pa /usr/share/syscons/scrnmaps/ Ns Aq Ar value . This parameter is ignored when using .Xr vt 4 as the console driver. .It Va font8x16 .Pq Vt str If set to .Dq Li NO , the default 8x16 font value is used for screen size requests, otherwise the value in .Pa /usr/share/syscons/fonts/ Ns Aq Ar value or .Pa /usr/share/vt/fonts/ Ns Aq Ar value is used (depending on the console driver being used). .It Va font8x14 .Pq Vt str If set to .Dq Li NO , the default 8x14 font value is used for screen size requests, otherwise the value in .Pa /usr/share/syscons/fonts/ Ns Aq Ar value or .Pa /usr/share/vt/fonts/ Ns Aq Ar value is used (depending on the console driver being used). .It Va font8x8 .Pq Vt str If set to .Dq Li NO , the default 8x8 font value is used for screen size requests, otherwise the value in .Pa /usr/share/syscons/fonts/ Ns Aq Ar value or .Pa /usr/share/vt/fonts/ Ns Aq Ar value is used (depending on the console driver being used). .It Va blanktime .Pq Vt int If set to .Dq Li NO , the default screen blanking interval is used, otherwise it is set to .Ar value seconds. .It Va saver .Pq Vt str If not set to .Dq Li NO , this is the actual screen saver to use .Li ( blank , snake , daemon , etc). .It Va moused_nondefault_enable .Pq Vt str If set to .Dq Li NO , the mouse device specified on the command line is not automatically treated as enabled by the .Pa /etc/rc.d/moused script. Having this variable set to .Dq Li YES allows a .Xr usb 4 mouse, for example, to be enabled as soon as it is plugged in. .It Va moused_enable .Pq Vt str If set to .Dq Li YES , the .Xr moused 8 daemon is started for doing cut/paste selection on the console. .It Va moused_type .Pq Vt str This is the protocol type of the mouse connected to this host. This variable must be set if .Va moused_enable is set to .Dq Li YES . The .Xr moused 8 daemon is able to detect the appropriate mouse type automatically in many cases. Set this variable to .Dq Li auto to let the daemon detect it, or select one from the following list if the automatic detection fails. .Pp If the mouse is attached to the PS/2 mouse port, choose .Dq Li auto or .Dq Li ps/2 , regardless of the brand and model of the mouse. Likewise, if the mouse is attached to the bus mouse port, choose .Dq Li auto or .Dq Li busmouse . All other protocols are for serial mice and will not work with the PS/2 and bus mice. If this is a USB mouse, .Dq Li auto is the only protocol type which will work. .Pp .Bl -tag -width ".Li x10mouseremote" -compact .It Li microsoft Microsoft mouse (serial) .It Li intellimouse Microsoft IntelliMouse (serial) .It Li mousesystems Mouse systems Corp.\& mouse (serial) .It Li mmseries MM Series mouse (serial) .It Li logitech Logitech mouse (serial) .It Li busmouse A bus mouse .It Li mouseman Logitech MouseMan and TrackMan (serial) .It Li glidepoint ALPS GlidePoint (serial) .It Li thinkingmouse Kensington ThinkingMouse (serial) .It Li ps/2 PS/2 mouse .It Li mmhittab MM HitTablet (serial) .It Li x10mouseremote X10 MouseRemote (serial) .It Li versapad Interlink VersaPad (serial) .El .Pp Even if the mouse is not in the above list, it may be compatible with one in the list. Refer to the manual page for .Xr moused 8 for compatibility information. .Pp It should also be noted that while this is enabled, any other client of the mouse (such as an X server) should access the mouse through the virtual mouse device, .Pa /dev/sysmouse , and configure it as a .Dq Li sysmouse type mouse, since all mouse data is converted to this single canonical format when using .Xr moused 8 . If the client program does not support the .Dq Li sysmouse type, specify the .Dq Li mousesystems type. It is the second preferred type. .It Va moused_port .Pq Vt str If .Va moused_enable is set to .Dq Li YES , this is the actual port the mouse is on. It might be .Pa /dev/cuau0 for a COM1 serial mouse, .Pa /dev/psm0 for a PS/2 mouse or .Pa /dev/mse0 for a bus mouse, for example. .It Va moused_flags .Pq Vt str If .Va moused_flags is set, its value is used as an additional set of flags to pass to the .Xr moused 8 daemon. .It Va "moused_" Ns Ar XXX Ns Va "_flags" When .Va moused_nondefault_enable is enabled, and a .Xr moused 8 daemon is started for a non-default port, the .Va "moused_" Ns Ar XXX Ns Va "_flags" set of options has precedence over and replaces the default .Va moused_flags (where .Ar XXX is the name of the non-default port, i.e.,\& .Ar ums0 ) . By setting .Va "moused_" Ns Ar XXX Ns Va "_flags" it is possible to set up a different set of default flags for each .Xr moused 8 instance. For example, you can use .Dq Li "-3" for the default .Va moused_flags to make your laptop's touchpad more comfortable to use, but an empty set of options for .Va moused_ums0_flags when your .Xr usb 4 mouse has three or more buttons. .It Va mousechar_start .Pq Vt int If set to .Dq Li NO , the default mouse cursor character range .Li 0xd0 Ns - Ns Li 0xd3 is used, otherwise the range start is set to .Ar value character, see .Xr vidcontrol 1 . Use if the default range is occupied in the language code table. .It Va allscreens_flags .Pq Vt str If set, .Xr vidcontrol 1 is run with these options for each of the virtual terminals .Pq Pa /dev/ttyv* . For example, .Dq Fl m Cm on will enable the mouse pointer on all virtual terminals if .Va moused_enable is set to .Dq Li YES . .It Va allscreens_kbdflags .Pq Vt str If set, .Xr kbdcontrol 1 is run with these options for each of the virtual terminals .Pq Pa /dev/ttyv* . For example, .Dq Fl h Li 200 will set the .Xr syscons 4 or .Xr vt 4 scrollback (history) buffer to 200 lines. .It Va cron_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr cron 8 daemon at system boot time. .It Va cron_program .Pq Vt str Path to .Xr cron 8 (default .Pa /usr/sbin/cron ) . .It Va cron_flags .Pq Vt str If .Va cron_enable is set to .Dq Li YES , these are the flags to pass to .Xr cron 8 . .It Va cron_dst .Pq Vt bool If set to .Dq Li YES , enable the special handling of transitions to and from the Daylight Saving Time in .Xr cron 8 (equivalent to using the flag .Fl s ) . .It Va lpd_program .Pq Vt str Path to .Xr lpd 8 (default .Pa /usr/sbin/lpd ) . .It Va lpd_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr lpd 8 daemon at system boot time. .It Va lpd_flags .Pq Vt str If .Va lpd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr lpd 8 daemon. .It Va chkprintcap_enable .Pq Vt bool If set to .Dq Li YES , run the .Xr chkprintcap 8 command before starting the .Xr lpd 8 daemon. .It Va chkprintcap_flags .Pq Vt str If .Va lpd_enable and .Va chkprintcap_enable are set to .Dq Li YES , these are the flags to pass to the .Xr chkprintcap 8 program. The default is .Dq Li -d , which causes missing directories to be created. .It Va mta_start_script .Pq Vt str This variable specifies the full path to the script to run to start a mail transfer agent. The default is .Pa /etc/rc.sendmail . The .Va sendmail_* variables which .Pa /etc/rc.sendmail uses are documented in the .Xr rc.sendmail 8 manual page. .It Va dumpdev .Pq Vt str Indicates the device (usually a swap partition) to which a crash dump should be written in the event of a system crash. If the value of this variable is .Dq Li AUTO , the first suitable swap device listed in .Pa /etc/fstab will be used as dump device. Otherwise, the value of this variable is passed as the argument to .Xr dumpon 8 . To disable crash dumps, set this variable to .Dq Li NO . .It Va dumpdir .Pq Vt str When the system reboots after a crash and a crash dump is found on the device specified by the .Va dumpdev variable, .Xr savecore 8 will save that crash dump and a copy of the kernel to the directory specified by the .Va dumpdir variable. The default value is .Pa /var/crash . Set to .Dq Li NO to not run .Xr savecore 8 at boot time when .Va dumpdir is set. .It Va savecore_enable .Pq Vt bool If set to .Dq Li NO , disable automatic extraction of the crash dump from the .Va dumpdev . .It Va savecore_flags .Pq Vt str If crash dumps are enabled, these are the flags to pass to the .Xr savecore 8 utility. .It Va quota_enable .Pq Vt bool Set to .Dq Li YES to turn on user and group disk quotas on system startup via the .Xr quotaon 8 command for all file systems marked as having quotas enabled in .Pa /etc/fstab . The kernel must be built with .Cd "options QUOTA" for disk quotas to function. .It Va check_quotas .Pq Vt bool Set to .Dq Li YES to enable user and group disk quota checking via the .Xr quotacheck 8 command. .It Va quotacheck_flags .Pq Vt str If .Va quota_enable is set to .Dq Li YES , and .Va check_quotas is set to .Dq Li YES , these are the flags to pass to the .Xr quotacheck 8 utility. The default is .Dq Li "-a" , which checks quotas for all file systems with quotas enabled in .Pa /etc/fstab . .It Va quotaon_flags .Pq Vt str If .Va quota_enable is set to .Dq Li YES , these are the flags to pass to the .Xr quotaon 8 utility. The default is .Dq Li "-a" , which enables quotas for all file systems with quotas enabled in .Pa /etc/fstab . .It Va quotaoff_flags .Pq Vt str If .Va quota_enable is set to .Dq Li YES , these are the flags to pass to the .Xr quotaoff 8 utility when shutting down the quota system. The default is .Dq Li "-a" , which disables quotas for all file systems with quotas enabled in .Pa /etc/fstab . .It Va accounting_enable .Pq Vt bool Set to .Dq Li YES to enable system accounting through the .Xr accton 8 facility. .It Va ibcs2_enable .Pq Vt bool Set to .Dq Li YES to enable iBCS2 (SCO) binary emulation at system initial boot time. .It Va ibcs2_loaders .Pq Vt str If not set to .Dq Li NO and if .Va ibcs2_enable is set to .Dq Li YES , this specifies a list of additional iBCS2 loaders to enable. .It Va firstboot_sentinel .Pq Vt str This variable specifies the full path to a .Dq first boot sentinel file. If a file exists with this path, .Pa rc.d scripts with the .Dq firstboot keyword will be run on startup and the sentinel file will be deleted after the boot process completes. The sentinel file must be located on a writable file system which is mounted no later than .Va early_late_divider to function properly. The default is .Pa /firstboot . .It Va linux_enable .Pq Vt bool Set to .Dq Li YES to enable Linux/ELF binary emulation at system initial boot time. .It Va svr4_enable .Pq Vt bool If set to .Dq Li YES , enable SysVR4 emulation at boot time. .It Va sysvipc_enable .Pq Vt bool If set to .Dq Li YES , load System V IPC primitives at boot time. .It Va clear_tmp_enable .Pq Vt bool Set to .Dq Li YES to have .Pa /tmp cleaned at startup. .It Va clear_tmp_X .Pq Vt bool Set to .Dq Li NO to disable removing of X11 lock files, and the removal and (secure) recreation of the various socket directories for X11 related programs. .It Va ldconfig_paths .Pq Vt str Set to the list of shared library paths to use with .Xr ldconfig 8 . NOTE: .Pa /usr/lib will always be added first, so it need not appear in this list. .It Va ldconfig32_paths .Pq Vt str Set to the list of 32-bit compatibility shared library paths to use with .Xr ldconfig 8 . .It Va ldconfig_paths_aout .Pq Vt str Set to the list of shared library paths to use with .Xr ldconfig 8 legacy .Xr a.out 5 support. .It Va ldconfig_insecure .Pq Vt bool The .Xr ldconfig 8 utility normally refuses to use directories which are writable by anyone except root. Set this variable to .Dq Li YES to disable that security check during system startup. .It Va ldconfig_local_dirs .Pq Vt str Set to the list of local .Xr ldconfig 8 directories. The names of all files in the directories listed will be passed as arguments to .Xr ldconfig 8 . .It Va ldconfig_local32_dirs .Pq Vt str Set to the list of local 32-bit compatibility .Xr ldconfig 8 directories. The names of all files in the directories listed will be passed as arguments to .Dq Nm ldconfig Fl 32 . .It Va kern_securelevel_enable .Pq Vt bool Set to .Dq Li YES to set the kernel security level at system startup. .It Va kern_securelevel .Pq Vt int The kernel security level to set at startup. The allowed range of .Ar value ranges from \-1 (the compile time default) to 3 (the most secure). See .Xr security 7 for the list of possible security levels and their effect on system operation. .It Va sshd_program .Pq Vt str Path to the SSH server program .Pa ( /usr/sbin/sshd is the default). .It Va sshd_enable .Pq Vt bool Set to .Dq Li YES to start .Xr sshd 8 at system boot time. .It Va sshd_flags .Pq Vt str If .Va sshd_enable is set to .Dq Li YES , these are the flags to pass to the .Xr sshd 8 daemon. .It Va ftpd_program .Pq Vt str Path to the FTP server program .Pa ( /usr/libexec/ftpd is the default). .It Va ftpd_enable .Pq Vt bool Set to .Dq Li YES to start .Xr ftpd 8 as a stand-alone daemon at system boot time. .It Va ftpd_flags .Pq Vt str If .Va ftpd_enable is set to .Dq Li YES , these are the additional flags to pass to the .Xr ftpd 8 daemon. .It Va watchdogd_enable .Pq Vt bool If set to .Dq Li YES , start the .Xr watchdogd 8 daemon at boot time. This requires that the kernel have been compiled with a .Xr watchdog 4 compatible device. .It Va watchdogd_flags .Pq Vt str If .Va watchdogd_enable is set to .Dq Li YES , these are the flags passed to the .Xr watchdogd 8 daemon. .It Va devfs_rulesets .Pq Vt str List of files containing sets of rules for .Xr devfs 8 . .It Va devfs_system_ruleset .Pq Vt str Rule name(s) to apply to the system .Pa /dev itself. .It Va devfs_set_rulesets .Pq Vt str Pairs of already-mounted .Pa dev directories and rulesets that should be applied to them. For example: /mount/dev=ruleset_name .It Va devfs_load_rulesets .Pq Vt bool If set, always load the default rulesets listed in .Va devfs_rulesets . .It Va performance_cx_lowest .Pq Vt str CPU idle state to use while on AC power. The string .Dq Li LOW indicates that .Xr acpi 4 should use the lowest power state available while .Dq Li HIGH indicates that the lowest latency state (less power savings) should be used. .It Va performance_cpu_freq .Pq Vt str CPU clock frequency to use while on AC power. The string .Dq Li LOW indicates that .Xr cpufreq 4 should use the lowest frequency available while .Dq Li HIGH indicates that the highest frequency (less power savings) should be used. .It Va economy_cx_lowest .Pq Vt str CPU idle state to use when off AC power. The string .Dq Li LOW indicates that .Xr acpi 4 should use the lowest power state available while .Dq Li HIGH indicates that the lowest latency state (less power savings) should be used. .It Va economy_cpu_freq .Pq Vt str CPU clock frequency to use when off AC power. The string .Dq Li LOW indicates that .Xr cpufreq 4 should use the lowest frequency available while .Dq Li HIGH indicates that the highest frequency (less power savings) should be used. .It Va jail_enable .Pq Vt bool If set to .Dq Li NO , any configured jails will not be started. .It Va jail_conf .Pq Vt str The configuration filename used by .Xr jail 8 utility. The default value is .Pa /etc/jail.conf . .It Va jail_parallel_start .Pq Vt bool If set to .Dq Li YES , all configured jails will be started in the background (in parallel). .It Va jail_flags .Pq Vt str Unset by default. When set, use as default value for .Va jail_ Ns Ao Ar jname Ac Ns Va _flags for every jail in .Va jail_list . .It Va jail_list .Pq Vt str A space-delimited list of jail names. When left empty, all of the .Xr jail 8 instances defined in the configuration file are started. The names specified in this list control the jail startup order. .Xr jail 8 instances missing from .Va jail_list must be started manually. Note that a jail's .Va depend parameter in the configuration file may override this list. .It Va jail_reverse_stop .Pq Vt bool When set to .Dq Li YES , all configured jails in .Va jail_list are stopped in reverse order. .It Va jail_* variables Note that older releases supported per-jail configuration via .Xr rc.conf 5 variables. For example, hostname of a jail named .Li vjail was able to be set by .Li jail_vjail_hostname . These per-jail configuration variables are now obsolete in favor of .Xr jail 8 configuration file. For backward compatibility, when per-jail configuration variables are defined, .Xr jail 8 configuration files are created as .Pa /var/run/jail. Ns Ao Ar jname Ac Ns Pa .conf and used. .Pp The following per-jail parameters are handled by .Pa rc.d/jail script out of their corresponding .Nm variables. In addition to them, parameters in .Va jail_ Ns Ao Ar jname Ac Ns Va _parameters will be added to the configuration file. They must be a semi-colon .Pq Ql \&; delimited list of .Dq key=value . For more details, see .Xr jail 8 manual page. .Bl -tag -width "host.hostname" -offset indent .It Li path set from .Va jail_ Ns Ao Ar jname Ac Ns Va _rootdir .It Li host.hostname set from .Va jail_ Ns Ao Ar jname Ac Ns Va _hostname .It Li exec.consolelog set from .Va jail_ Ns Ao Ar jname Ac Ns Va _consolelog . The default value is .Pa /var/log/jail_ Ao Ar jname Ac Pa _console.log . .It Li interface set from .Va jail_ Ns Ao Ar jname Ac Ns Va _interface . .It Li vnet.interface set from .Va jail_ Ns Ao Ar jname Ac Ns Va _vnet_interface . This implies .Li vnet parameter will be enabled and cannot be specified with .Va jail_ Ns Ao Ar jname Ac Ns Va _interface , .Va jail_ Ns Ao Ar jname Ac Ns Va _ip and/or .Va jail_ Ns Ao Ar jname Ac Ns Va _ip_multi Ns Aq Ar n at the same time. .It Li fstab set from .Va jail_ Ns Ao Ar jname Ac Ns Va _fstab .It Li mount set from .Va jail_ Ns Ao Ar jname Ac Ns Va _procfs_enable . .It Li exec.fib set from .Va jail_ Ns Ao Ar jname Ac Ns Va _fib .It Li exec.start set from .Va jail_ Ns Ao Ar jname Ac Ns Va _exec_start . The parameter name was .Li command in some older releases. .It Li exec.prestart set from .Va jail_ Ns Ao Ar jname Ac Ns Va _exec_prestart .It Li exec.poststart set from .Va jail_ Ns Ao Ar jname Ac Ns Va _exec_poststart .It Li exec.stop set from .Va jail_ Ns Ao Ar jname Ac Ns Va _exec_stop .It Li exec.prestop set from .Va jail_ Ns Ao Ar jname Ac Ns Va _exec_prestop .It Li exec.poststop set from .Va jail_ Ns Ao Ar jname Ac Ns Va _exec_poststop .It Li ip4.addr set if .Va jail_ Ns Ao Ar jname Ac Ns Va _ip or .Va jail_ Ns Ao Ar jname Ac Ns Va _ip_multi Ns Aq Ar n contain IPv4 addresses .It Li ip6.addr set if .Va jail_ Ns Ao Ar jname Ac Ns Va _ip or .Va jail_ Ns Ao Ar jname Ac Ns Va _ip_multi Ns Aq Ar n contain IPv6 addresses .It Li allow.mount set from .Va jail_ Ns Ao Ar jname Ac Ns Va _mount_enable .It Li mount.devfs set from .Va jail_ Ns Ao Ar jname Ac Ns Va _devfs_enable .It Li devfs_ruleset set from .Va jail_ Ns Ao Ar jname Ac Ns Va _devfs_ruleset . This must be an integer, not a string. .It Li mount.fdescfs set from .Va jail_ Ns Ao Ar jname Ac Ns Va _fdescfs_enable .It Li allow.set_hostname set from .Va jail_ Ns Ao Ar jname Ac Ns Va _set_hostname_allow .It Li allow.rawsocket set from .Va jail_ Ns Ao Ar jname Ac Ns Va _socket_unixiproute_only .It Li allow.sysvipc set from .Va jail_ Ns Ao Ar jname Ac Ns Va _sysvipc_allow .El .\" ----------------------------------------------------- .It Va harvest_mask .Pq Vt int Set to a bit-mask representing the entropy sources you wish to harvest. Refer to .Xr random 4 for more information. .It Va entropy_dir .Pq Vt str Set to .Dq Li NO to disable caching entropy via .Xr cron 8 . Otherwise set to the directory in which the entropy files are stored. To be useful, there must be a system cron job that regularly writes and rotates files here. All files found will be used at boot time. The default is .Pa /var/db/entropy . .It Va entropy_file .Pq Vt str Set to .Dq Li NO to disable caching entropy through reboots. Otherwise set to the name of a file used to store cached entropy. This file should be located on a file system that is readable before all the volumes specified in .Xr fstab 5 are mounted. By default, .Pa /entropy is used, but if .Pa /var/db/entropy-file is found it will also be used. This will be of some use to .Xr bsdinstall 8 . .It Va entropy_boot_file .Pq Vt str Set to .Dq Li NO to disable very early caching entropy through reboots. Otherwise set to the filename used to read very early reboot cached entropy. This file should be located where .Xr loader 8 can read it. See also .Xr loader.conf 5 . The default location is .Pa /boot/entropy . .It Va entropy_save_sz .Pq Vt int Size of the entropy cache files saved by .Nm save-entropy periodically. .It Va entropy_save_num .Pq Vt int Number of entropy cache files to save by .Nm save-entropy periodically. .It Va ipsec_enable .Pq Vt bool Set to .Dq Li YES to run .Xr setkey 8 on .Va ipsec_file at boot time. .It Va ipsec_file .Pq Vt str Configuration file for .Xr setkey 8 . .It Va dmesg_enable .Pq Vt bool Set to .Dq Li YES to save .Xr dmesg 8 to .Pa /var/run/dmesg.boot on boot. .It Va rcshutdown_timeout .Pq Vt int If set, start a watchdog timer in the background which will terminate .Pa rc.shutdown if .Xr shutdown 8 has not completed within the specified time (in seconds). Notice that in addition to this soft timeout, .Xr init 8 also applies a hard timeout for the execution of .Pa rc.shutdown . This is configured via .Xr sysctl 8 variable .Va kern.init_shutdown_timeout and defaults to 120 seconds. Setting the value of .Va rcshutdown_timeout to more than 120 seconds will have no effect until the .Xr sysctl 8 variable .Va kern.init_shutdown_timeout is also increased. .It Va virecover_enable .Pq Vt bool Set to .Dq Li NO to prevent the system from trying to recover pre-maturely terminated .Xr vi 1 sessions. .It Va ugidfw_enable .Pq Vt bool Set to .Dq Li YES to load the .Xr mac_bsdextended 4 module upon system initialization and load a default ruleset file. .It Va bsdextended_script .Pq Vt str The default .Xr mac_bsdextended 4 ruleset file to load. The default value of this variable is .Pa /etc/rc.bsdextended . .It Va newsyslog_enable .Pq Vt bool If set to .Dq Li YES , run .Xr newsyslog 8 command at startup. .It Va newsyslog_flags .Pq Vt str If .Va newsyslog_enable is set to .Dq Li YES , these are the flags to pass to the .Xr newsyslog 8 program. The default is .Dq Li -CN , which causes log files flagged with a .Cm C to be created. .It Va mdconfig_md Ns Aq Ar X .Pq Vt str Arguments to .Xr mdconfig 8 for .Xr md 4 device .Ar X . At minimum a .Fl t Ar type must be specified and either a .Fl s Ar size for malloc or swap backed .Xr md 4 devices or a .Fl f Ar file for vnode backed .Xr md 4 devices. Note that .Va mdconfig_md Ns Aq Ar X variables are evaluated until one variable is unset or null. .It Va mdconfig_md Ns Ao Ar X Ac Ns Va _newfs .Pq Vt str Optional arguments passed to .Xr newfs 8 to initialize .Xr md 4 device .Ar X . .It Va mdconfig_md Ns Ao Ar X Ac Ns Va _owner .Pq Vt str An ownership specification passed to .Xr chown 8 after the specified .Xr md 4 device .Ar X has been mounted. Both the .Xr md 4 device and the mount point will be changed. .It Va mdconfig_md Ns Ao Ar X Ac Ns Va _perms .Pq Vt str A mode string passed to .Xr chmod 1 after the specified .Xr md 4 device .Ar X has been mounted. Both the .Xr md 4 device and the mount point will be changed. .It Va mdconfig_md Ns Ao Ar X Ac Ns Va _files .Pq Vt str Files to be copied to the mount point of the .Xr md 4 device .Ar X after it has been mounted. .It Va mdconfig_md Ns Ao Ar X Ac Ns Va _cmd .Pq Vt str Command to execute after the specified .Xr md 4 device .Ar X has been mounted. Note that the command is passed to .Ic eval and that both .Va _dev and .Va _mp variables can be used to reference respectively the .Xr md 4 device and the mount point. Assuming that the .Xr md 4 device is .Li md0 , one could set the following: .Bd -literal mdconfig_md0_cmd="tar xfzC /var/file.tgz \e${_mp}" .Ed .It Va autobridge_interfaces .Pq Vt str Set to the list of bridge interfaces that will have newly arriving interfaces checked against to be automatically added. If not set to .Dq Li NO then for each whitespace separated .Ar element in the value, a .Va autobridge_ Ns Aq Ar element variable is assumed to exist which has a whitespace separated list of interface names to match, these names can use wildcards. For example: .Bd -literal autobridge_interfaces="bridge0" autobridge_bridge0="tap* dc0 vlan[345]" .Ed .It Va mixer_enable .Pq Vt bool If set to .Dq Li YES , enable support for sound mixer. .It Va hcsecd_enable .Pq Vt bool If set to .Dq Li YES , enable Bluetooth security daemon. .It Va hcsecd_config .Pq Vt str Configuration file for .Xr hcsecd 8 . Default .Pa /etc/bluetooth/hcsecd.conf . .It Va sdpd_enable .Pq Vt bool If set to .Dq Li YES , enable Bluetooth Service Discovery Protocol daemon. .It Va sdpd_control .Pq Vt str Path to .Xr sdpd 8 control socket. Default .Pa /var/run/sdp . .It Va sdpd_groupname .Pq Vt str Sets .Xr sdpd 8 group to run as after it initializes. Default .Dq Li nobody . .It Va sdpd_username .Pq Vt str Sets .Xr sdpd 8 user to run as after it initializes. Default .Dq Li nobody . .It Va bthidd_enable .Pq Vt bool If set to .Dq Li YES , enable Bluetooth Human Interface Device daemon. .It Va bthidd_config .Pq Vt str Configuration file for .Xr bthidd 8 . Default .Pa /etc/bluetooth/bthidd.conf . .It Va bthidd_hids .Pq Vt str Path to a file, where .Xr bthidd 8 will store information about known HID devices. Default .Pa /var/db/bthidd.hids . .It Va rfcomm_pppd_server_enable .Pq Vt bool If set to .Dq Li YES , enable Bluetooth RFCOMM PPP wrapper daemon. .It Va rfcomm_pppd_server_profile .Pq Vt str The name of the profile to use from .Pa /etc/ppp/ppp.conf . Multiple profiles can be specified here. Also used to specify per-profile overrides. When the profile name contains any of the characters .Dq Li .-/+ they are translated to .Dq Li _ for the proposes of the override variable names. .It Va rfcomm_pppd_server_ Ns Ao Ar profile Ac Ns _bdaddr .Pq Vt str Overrides local address to listen on. By default .Xr rfcomm_pppd 8 will listen on .Dq Li ANY address. The address can be specified as BD_ADDR or name. .It Va rfcomm_pppd_server_ Ns Ao Ar profile Ac Ns _channel .Pq Vt str Overrides local RFCOMM channel to listen on. By default .Xr rfcomm_pppd 8 will listen on RFCOMM channel 1. Must set properly if multiple profiles used in the same time. .It Va rfcomm_pppd_server_ Ns Ao Ar profile Ac Ns _register_sp .Pq Vt bool Tells .Xr rfcomm_pppd 8 if it should register Serial Port service on the specified RFCOMM channel. Default .Dq Li NO . .It Va rfcomm_pppd_server_ Ns Ao Ar profile Ac Ns _register_dun .Pq Vt bool Tells .Xr rfcomm_pppd 8 if it should register Dial-Up Networking service on the specified RFCOMM channel. Default .Dq Li NO . .It Va ubthidhci_enable .Pq Vt bool If set to .Dq Li YES , change the USB Bluetooth controller from HID mode to HCI mode. You also need to specify the location of USB Bluetooth controller with the .Va ubthidhci_busnum and .Va ubthidhci_addr variables. .It Va ubthidhci_busnum Bus number where the USB Bluetooth controller is located. Check the output of .Xr usbconfig 8 on your system to find this information. .It Va ubthidhci_addr Bus address of the USB Bluetooth controller. Check the output of .Xr usbconfig 8 on your system to find this information. .It Va netwait_enable .Pq Vt bool If set to .Dq Li YES , delays the start of network-reliant services until .Va netwait_if is up and ICMP packets to a destination defined in .Va netwait_ip are flowing. Link state is examined first, followed by .Dq Li pinging an IP address to verify network usability. If no destination can be reached or timeouts are exceeded, network services are started anyway with no guarantee that the network is usable. Use of this variable requires both .Va netwait_ip and .Va netwait_if to be set. .It Va netwait_ip .Pq Vt str Empty by default. This variable contains a space-delimited list of IP addresses to .Xr ping 8 . DNS hostnames should not be used as resolution is not guaranteed to be functional at this point. If multiple IP addresses are specified, each will be tried until one is successful or the list is exhausted. .It Va netwait_timeout .Pq Vt int Indicates the total number of seconds to perform a .Dq Li ping against each IP address in .Va netwait_ip , at a rate of one ping per second. If any of the pings are successful, full network connectivity is considered reliable. The default is 60. .It Va netwait_if .Pq Vt str Empty by default. Defines the name of the network interface on which watch for link. .Xr ifconfig 8 is used to monitor the interface, looking for .Dq Li status: no carrier . Once gone, the link is considered up. This can be a .Xr vlan 4 interface if desired. .It Va netwait_if_timeout .Pq Vt int Defines the total number of seconds to wait for link to become usable, polled at a 1-second interval. The default is 30. .It Va rctl_enable .Pq Vt bool If set to .Dq Li YES , load .Xr rctl 8 rules from the defined ruleset. The kernel must be built with .Cd "options RACCT" and .Cd "options RCTL" . .It Va rctl_rules .Pq Vt str Set to .Pa /etc/rctl.conf by default. This variables contains the .Xr rctl.conf 5 ruleset to load for .Xr rctl 8 . .It Va iovctl_files .Pq Vt str A space-separated list of configuration files used by .Xr iovctl 8 . The default value is an empty string. .It Va autofs_enable .Pq Vt bool If set to .Dq Li YES , start the .Xr automount 8 utility and the .Xr automountd 8 and .Xr autounmountd 8 daemons at boot time. .It Va automount_flags .Pq Vt str If .Va autofs_enable is set to .Dq Li YES , these are the flags to pass to the .Xr automount 8 program. By default no flags are passed. .It Va automountd_flags .Pq Vt str If .Va autofs_enable is set to .Dq Li YES , these are the flags to pass to the .Xr automountd 8 daemon. By default no flags are passed. .It Va autounmountd_flags .Pq Vt str If .Va autofs_enable is set to .Dq Li YES , these are the flags to pass to the .Xr autounmountd 8 daemon. By default no flags are passed. .It Va ctld_enable .Pq Vt bool If set to .Dq Li YES , start the .Xr ctld 8 daemon at boot time. .It Va iscsid_enable .Pq Vt bool If set to .Dq Li YES , start the .Xr iscsid 8 daemon at boot time. .It Va iscsictl_enable .Pq Vt bool If set to .Dq Li YES , start the .Xr iscsictl 8 utility at boot time. .It Va iscsictl_flags .Pq Vt str If .Va iscsictl_enable is set to .Dq Li YES , these are the flags to pass to the .Xr iscsictl 8 program. The default is .Dq Li -Aa , which configures sessions based on the .Pa /etc/iscsi.conf configuration file. .El .Sh FILES .Bl -tag -width ".Pa /etc/defaults/rc.conf" -compact .It Pa /etc/defaults/rc.conf .It Pa /etc/rc.conf .It Pa /etc/rc.conf.local .El .Sh SEE ALSO .Xr catman 1 , .Xr chmod 1 , .Xr gdb 1 , .Xr info 1 , .Xr kbdcontrol 1 , .Xr makewhatis 1 , .Xr sh 1 , .Xr vi 1 , .Xr vidcontrol 1 , .Xr bridge 4 , .Xr dummynet 4 , .Xr ip 4 , .Xr ipf 4 , .Xr ipfw 4 , .Xr ipnat 4 , .Xr kld 4 , .Xr pf 4 , .Xr pflog 4 , .Xr pfsync 4 , .Xr tcp 4 , .Xr udp 4 , .Xr exports 5 , .Xr fstab 5 , .Xr ipf 5 , .Xr ipnat 5 , .Xr jail.conf 5 , .Xr loader.conf 5 , .Xr motd 5 , .Xr newsyslog.conf 5 , .Xr pf.conf 5 , .Xr security 7 , .Xr accton 8 , .Xr amd 8 , .Xr apm 8 , .Xr atm 8 , .Xr bsdinstall 8 , .Xr bthidd 8 , .Xr chkprintcap 8 , .Xr chown 8 , .Xr cron 8 , .Xr devfs 8 , .Xr dhclient 8 , .Xr ftpd 8 , .Xr geli 8 , .Xr hcsecd 8 , .Xr ifconfig 8 , .Xr inetd 8 , .Xr iovctl 8 , .Xr ipf 8 , .Xr ipfw 8 , .Xr ipnat 8 , .Xr jail 8 , .Xr kldxref 8 , .Xr loader 8 , .Xr lpd 8 , .Xr mdconfig 8 , .Xr mdmfs 8 , .Xr mixer 8 , .Xr mountd 8 , .Xr moused 8 , .Xr newfs 8 , .Xr newsyslog 8 , .Xr nfsd 8 , .Xr ntpd 8 , .Xr ntpdate 8 , .Xr pfctl 8 , .Xr pflogd 8 , .Xr ping 8 , .Xr powerd 8 , .Xr quotacheck 8 , .Xr quotaon 8 , .Xr rc 8 , .Xr rc.sendmail 8 , .Xr rfcomm_pppd 8 , .Xr route 8 , .Xr routed 8 , .Xr rpc.lockd 8 , .Xr rpc.statd 8 , .Xr rpcbind 8 , .Xr rwhod 8 , .Xr savecore 8 , .Xr sdpd 8 , .Xr sshd 8 , .Xr swapon 8 , .Xr sysctl 8 , .Xr syslogd 8 , .Xr timed 8 , .Xr unbound 8 , .Xr usbconfig 8 , .Xr wlandebug 8 , .Xr yp 8 , .Xr ypbind 8 , .Xr ypserv 8 , .Xr ypset 8 .Sh HISTORY The .Nm file appeared in .Fx 2.2.2 . .Sh AUTHORS .An Jordan K. Hubbard . Index: projects/vnet/sys/compat/linuxkpi/common/include/linux/etherdevice.h =================================================================== --- projects/vnet/sys/compat/linuxkpi/common/include/linux/etherdevice.h (revision 301546) +++ projects/vnet/sys/compat/linuxkpi/common/include/linux/etherdevice.h (revision 301547) @@ -1,111 +1,112 @@ /*- - * Copyright (c) 2015 Mellanox Technologies, Ltd. All rights reserved. + * Copyright (c) 2015-2016 Mellanox Technologies, Ltd. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _LINUX_ETHERDEVICE #define _LINUX_ETHERDEVICE #include #include #include #define ETH_MODULE_SFF_8079 1 #define ETH_MODULE_SFF_8079_LEN 256 #define ETH_MODULE_SFF_8472 2 #define ETH_MODULE_SFF_8472_LEN 512 #define ETH_MODULE_SFF_8636 3 #define ETH_MODULE_SFF_8636_LEN 256 #define ETH_MODULE_SFF_8436 4 #define ETH_MODULE_SFF_8436_LEN 256 struct ethtool_eeprom { u32 offset; u32 len; }; struct ethtool_modinfo { u32 type; u32 eeprom_len; }; static inline bool is_zero_ether_addr(const u8 * addr) { return ((addr[0] + addr[1] + addr[2] + addr[3] + addr[4] + addr[5]) == 0x00); } static inline bool is_multicast_ether_addr(const u8 * addr) { return (0x01 & addr[0]); } static inline bool is_broadcast_ether_addr(const u8 * addr) { return ((addr[0] + addr[1] + addr[2] + addr[3] + addr[4] + addr[5]) == (6 * 0xff)); } static inline bool is_valid_ether_addr(const u8 * addr) { return !is_multicast_ether_addr(addr) && !is_zero_ether_addr(addr); } static inline void ether_addr_copy(u8 * dst, const u8 * src) { memcpy(dst, src, 6); } static inline bool ether_addr_equal(const u8 *pa, const u8 *pb) { return (memcmp(pa, pb, 6) == 0); } static inline bool ether_addr_equal_64bits(const u8 *pa, const u8 *pb) { return (memcmp(pa, pb, 6) == 0); } static inline void eth_broadcast_addr(u8 *pa) { memset(pa, 0xff, 6); } static inline void random_ether_addr(u8 * dst) { - read_random(dst, 6); + if (read_random(dst, 6) == 0) + arc4rand(dst, 6, 0); dst[0] &= 0xfe; dst[0] |= 0x02; } #endif /* _LINUX_ETHERDEVICE */ Index: projects/vnet/sys/compat/linuxkpi/common/include/linux/random.h =================================================================== --- projects/vnet/sys/compat/linuxkpi/common/include/linux/random.h (revision 301546) +++ projects/vnet/sys/compat/linuxkpi/common/include/linux/random.h (revision 301547) @@ -1,42 +1,44 @@ /*- * Copyright (c) 2010 Isilon Systems, Inc. * Copyright (c) 2010 iX Systems, Inc. * Copyright (c) 2010 Panasas, Inc. - * Copyright (c) 2013, 2014 Mellanox Technologies, Ltd. + * Copyright (c) 2013-2016 Mellanox Technologies, Ltd. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _LINUX_RANDOM_H_ #define _LINUX_RANDOM_H_ #include +#include static inline void get_random_bytes(void *buf, int nbytes) { - read_random(buf, nbytes); + if (read_random(buf, nbytes) == 0) + arc4rand(buf, nbytes, 0); } #endif /* _LINUX_RANDOM_H_ */ Index: projects/vnet/sys/dev/cxgb/cxgb_sge.c =================================================================== --- projects/vnet/sys/dev/cxgb/cxgb_sge.c (revision 301546) +++ projects/vnet/sys/dev/cxgb/cxgb_sge.c (revision 301547) @@ -1,3706 +1,3707 @@ /************************************************************************** Copyright (c) 2007-2009, Chelsio Inc. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Neither the name of the Chelsio Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ***************************************************************************/ #include __FBSDID("$FreeBSD$"); #include "opt_inet6.h" #include "opt_inet.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include int txq_fills = 0; int multiq_tx_enable = 1; #ifdef TCP_OFFLOAD CTASSERT(NUM_CPL_HANDLERS >= NUM_CPL_CMDS); #endif extern struct sysctl_oid_list sysctl__hw_cxgb_children; int cxgb_txq_buf_ring_size = TX_ETH_Q_SIZE; SYSCTL_INT(_hw_cxgb, OID_AUTO, txq_mr_size, CTLFLAG_RDTUN, &cxgb_txq_buf_ring_size, 0, "size of per-queue mbuf ring"); static int cxgb_tx_coalesce_force = 0; SYSCTL_INT(_hw_cxgb, OID_AUTO, tx_coalesce_force, CTLFLAG_RWTUN, &cxgb_tx_coalesce_force, 0, "coalesce small packets into a single work request regardless of ring state"); #define COALESCE_START_DEFAULT TX_ETH_Q_SIZE>>1 #define COALESCE_START_MAX (TX_ETH_Q_SIZE-(TX_ETH_Q_SIZE>>3)) #define COALESCE_STOP_DEFAULT TX_ETH_Q_SIZE>>2 #define COALESCE_STOP_MIN TX_ETH_Q_SIZE>>5 #define TX_RECLAIM_DEFAULT TX_ETH_Q_SIZE>>5 #define TX_RECLAIM_MAX TX_ETH_Q_SIZE>>2 #define TX_RECLAIM_MIN TX_ETH_Q_SIZE>>6 static int cxgb_tx_coalesce_enable_start = COALESCE_START_DEFAULT; SYSCTL_INT(_hw_cxgb, OID_AUTO, tx_coalesce_enable_start, CTLFLAG_RWTUN, &cxgb_tx_coalesce_enable_start, 0, "coalesce enable threshold"); static int cxgb_tx_coalesce_enable_stop = COALESCE_STOP_DEFAULT; SYSCTL_INT(_hw_cxgb, OID_AUTO, tx_coalesce_enable_stop, CTLFLAG_RWTUN, &cxgb_tx_coalesce_enable_stop, 0, "coalesce disable threshold"); static int cxgb_tx_reclaim_threshold = TX_RECLAIM_DEFAULT; SYSCTL_INT(_hw_cxgb, OID_AUTO, tx_reclaim_threshold, CTLFLAG_RWTUN, &cxgb_tx_reclaim_threshold, 0, "tx cleaning minimum threshold"); /* * XXX don't re-enable this until TOE stops assuming * we have an m_ext */ static int recycle_enable = 0; extern int cxgb_use_16k_clusters; extern int nmbjumbop; extern int nmbjumbo9; extern int nmbjumbo16; #define USE_GTS 0 #define SGE_RX_SM_BUF_SIZE 1536 #define SGE_RX_DROP_THRES 16 #define SGE_RX_COPY_THRES 128 /* * Period of the Tx buffer reclaim timer. This timer does not need to run * frequently as Tx buffers are usually reclaimed by new Tx packets. */ #define TX_RECLAIM_PERIOD (hz >> 1) /* * Values for sge_txq.flags */ enum { TXQ_RUNNING = 1 << 0, /* fetch engine is running */ TXQ_LAST_PKT_DB = 1 << 1, /* last packet rang the doorbell */ }; struct tx_desc { uint64_t flit[TX_DESC_FLITS]; } __packed; struct rx_desc { uint32_t addr_lo; uint32_t len_gen; uint32_t gen2; uint32_t addr_hi; } __packed; struct rsp_desc { /* response queue descriptor */ struct rss_header rss_hdr; uint32_t flags; uint32_t len_cq; uint8_t imm_data[47]; uint8_t intr_gen; } __packed; #define RX_SW_DESC_MAP_CREATED (1 << 0) #define TX_SW_DESC_MAP_CREATED (1 << 1) #define RX_SW_DESC_INUSE (1 << 3) #define TX_SW_DESC_MAPPED (1 << 4) #define RSPQ_NSOP_NEOP G_RSPD_SOP_EOP(0) #define RSPQ_EOP G_RSPD_SOP_EOP(F_RSPD_EOP) #define RSPQ_SOP G_RSPD_SOP_EOP(F_RSPD_SOP) #define RSPQ_SOP_EOP G_RSPD_SOP_EOP(F_RSPD_SOP|F_RSPD_EOP) struct tx_sw_desc { /* SW state per Tx descriptor */ struct mbuf *m; bus_dmamap_t map; int flags; }; struct rx_sw_desc { /* SW state per Rx descriptor */ caddr_t rxsd_cl; struct mbuf *m; bus_dmamap_t map; int flags; }; struct txq_state { unsigned int compl; unsigned int gen; unsigned int pidx; }; struct refill_fl_cb_arg { int error; bus_dma_segment_t seg; int nseg; }; /* * Maps a number of flits to the number of Tx descriptors that can hold them. * The formula is * * desc = 1 + (flits - 2) / (WR_FLITS - 1). * * HW allows up to 4 descriptors to be combined into a WR. */ static uint8_t flit_desc_map[] = { 0, #if SGE_NUM_GENBITS == 1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4 #elif SGE_NUM_GENBITS == 2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, #else # error "SGE_NUM_GENBITS must be 1 or 2" #endif }; #define TXQ_LOCK_ASSERT(qs) mtx_assert(&(qs)->lock, MA_OWNED) #define TXQ_TRYLOCK(qs) mtx_trylock(&(qs)->lock) #define TXQ_LOCK(qs) mtx_lock(&(qs)->lock) #define TXQ_UNLOCK(qs) mtx_unlock(&(qs)->lock) #define TXQ_RING_EMPTY(qs) drbr_empty((qs)->port->ifp, (qs)->txq[TXQ_ETH].txq_mr) #define TXQ_RING_NEEDS_ENQUEUE(qs) \ drbr_needs_enqueue((qs)->port->ifp, (qs)->txq[TXQ_ETH].txq_mr) #define TXQ_RING_FLUSH(qs) drbr_flush((qs)->port->ifp, (qs)->txq[TXQ_ETH].txq_mr) #define TXQ_RING_DEQUEUE_COND(qs, func, arg) \ drbr_dequeue_cond((qs)->port->ifp, (qs)->txq[TXQ_ETH].txq_mr, func, arg) #define TXQ_RING_DEQUEUE(qs) \ drbr_dequeue((qs)->port->ifp, (qs)->txq[TXQ_ETH].txq_mr) int cxgb_debug = 0; static void sge_timer_cb(void *arg); static void sge_timer_reclaim(void *arg, int ncount); static void sge_txq_reclaim_handler(void *arg, int ncount); static void cxgb_start_locked(struct sge_qset *qs); /* * XXX need to cope with bursty scheduling by looking at a wider * window than we are now for determining the need for coalescing * */ static __inline uint64_t check_pkt_coalesce(struct sge_qset *qs) { struct adapter *sc; struct sge_txq *txq; uint8_t *fill; if (__predict_false(cxgb_tx_coalesce_force)) return (1); txq = &qs->txq[TXQ_ETH]; sc = qs->port->adapter; fill = &sc->tunq_fill[qs->idx]; if (cxgb_tx_coalesce_enable_start > COALESCE_START_MAX) cxgb_tx_coalesce_enable_start = COALESCE_START_MAX; if (cxgb_tx_coalesce_enable_stop < COALESCE_STOP_MIN) cxgb_tx_coalesce_enable_start = COALESCE_STOP_MIN; /* * if the hardware transmit queue is more than 1/8 full * we mark it as coalescing - we drop back from coalescing * when we go below 1/32 full and there are no packets enqueued, * this provides us with some degree of hysteresis */ if (*fill != 0 && (txq->in_use <= cxgb_tx_coalesce_enable_stop) && TXQ_RING_EMPTY(qs) && (qs->coalescing == 0)) *fill = 0; else if (*fill == 0 && (txq->in_use >= cxgb_tx_coalesce_enable_start)) *fill = 1; return (sc->tunq_coalesce); } #ifdef __LP64__ static void set_wr_hdr(struct work_request_hdr *wrp, uint32_t wr_hi, uint32_t wr_lo) { uint64_t wr_hilo; #if _BYTE_ORDER == _LITTLE_ENDIAN wr_hilo = wr_hi; wr_hilo |= (((uint64_t)wr_lo)<<32); #else wr_hilo = wr_lo; wr_hilo |= (((uint64_t)wr_hi)<<32); #endif wrp->wrh_hilo = wr_hilo; } #else static void set_wr_hdr(struct work_request_hdr *wrp, uint32_t wr_hi, uint32_t wr_lo) { wrp->wrh_hi = wr_hi; wmb(); wrp->wrh_lo = wr_lo; } #endif struct coalesce_info { int count; int nbytes; }; static int coalesce_check(struct mbuf *m, void *arg) { struct coalesce_info *ci = arg; int *count = &ci->count; int *nbytes = &ci->nbytes; if ((*nbytes == 0) || ((*nbytes + m->m_len <= 10500) && (*count < 7) && (m->m_next == NULL))) { *count += 1; *nbytes += m->m_len; return (1); } return (0); } static struct mbuf * cxgb_dequeue(struct sge_qset *qs) { struct mbuf *m, *m_head, *m_tail; struct coalesce_info ci; if (check_pkt_coalesce(qs) == 0) return TXQ_RING_DEQUEUE(qs); m_head = m_tail = NULL; ci.count = ci.nbytes = 0; do { m = TXQ_RING_DEQUEUE_COND(qs, coalesce_check, &ci); if (m_head == NULL) { m_tail = m_head = m; } else if (m != NULL) { m_tail->m_nextpkt = m; m_tail = m; } } while (m != NULL); if (ci.count > 7) panic("trying to coalesce %d packets in to one WR", ci.count); return (m_head); } /** * reclaim_completed_tx - reclaims completed Tx descriptors * @adapter: the adapter * @q: the Tx queue to reclaim completed descriptors from * * Reclaims Tx descriptors that the SGE has indicated it has processed, * and frees the associated buffers if possible. Called with the Tx * queue's lock held. */ static __inline int reclaim_completed_tx(struct sge_qset *qs, int reclaim_min, int queue) { struct sge_txq *q = &qs->txq[queue]; int reclaim = desc_reclaimable(q); if ((cxgb_tx_reclaim_threshold > TX_RECLAIM_MAX) || (cxgb_tx_reclaim_threshold < TX_RECLAIM_MIN)) cxgb_tx_reclaim_threshold = TX_RECLAIM_DEFAULT; if (reclaim < reclaim_min) return (0); mtx_assert(&qs->lock, MA_OWNED); if (reclaim > 0) { t3_free_tx_desc(qs, reclaim, queue); q->cleaned += reclaim; q->in_use -= reclaim; } if (isset(&qs->txq_stopped, TXQ_ETH)) clrbit(&qs->txq_stopped, TXQ_ETH); return (reclaim); } /** * should_restart_tx - are there enough resources to restart a Tx queue? * @q: the Tx queue * * Checks if there are enough descriptors to restart a suspended Tx queue. */ static __inline int should_restart_tx(const struct sge_txq *q) { unsigned int r = q->processed - q->cleaned; return q->in_use - r < (q->size >> 1); } /** * t3_sge_init - initialize SGE * @adap: the adapter * @p: the SGE parameters * * Performs SGE initialization needed every time after a chip reset. * We do not initialize any of the queue sets here, instead the driver * top-level must request those individually. We also do not enable DMA * here, that should be done after the queues have been set up. */ void t3_sge_init(adapter_t *adap, struct sge_params *p) { u_int ctrl, ups; ups = 0; /* = ffs(pci_resource_len(adap->pdev, 2) >> 12); */ ctrl = F_DROPPKT | V_PKTSHIFT(2) | F_FLMODE | F_AVOIDCQOVFL | F_CQCRDTCTRL | F_CONGMODE | F_TNLFLMODE | F_FATLPERREN | V_HOSTPAGESIZE(PAGE_SHIFT - 11) | F_BIGENDIANINGRESS | V_USERSPACESIZE(ups ? ups - 1 : 0) | F_ISCSICOALESCING; #if SGE_NUM_GENBITS == 1 ctrl |= F_EGRGENCTRL; #endif if (adap->params.rev > 0) { if (!(adap->flags & (USING_MSIX | USING_MSI))) ctrl |= F_ONEINTMULTQ | F_OPTONEINTMULTQ; } t3_write_reg(adap, A_SG_CONTROL, ctrl); t3_write_reg(adap, A_SG_EGR_RCQ_DRB_THRSH, V_HIRCQDRBTHRSH(512) | V_LORCQDRBTHRSH(512)); t3_write_reg(adap, A_SG_TIMER_TICK, core_ticks_per_usec(adap) / 10); t3_write_reg(adap, A_SG_CMDQ_CREDIT_TH, V_THRESHOLD(32) | V_TIMEOUT(200 * core_ticks_per_usec(adap))); t3_write_reg(adap, A_SG_HI_DRB_HI_THRSH, adap->params.rev < T3_REV_C ? 1000 : 500); t3_write_reg(adap, A_SG_HI_DRB_LO_THRSH, 256); t3_write_reg(adap, A_SG_LO_DRB_HI_THRSH, 1000); t3_write_reg(adap, A_SG_LO_DRB_LO_THRSH, 256); t3_write_reg(adap, A_SG_OCO_BASE, V_BASE1(0xfff)); t3_write_reg(adap, A_SG_DRB_PRI_THRESH, 63 * 1024); } /** * sgl_len - calculates the size of an SGL of the given capacity * @n: the number of SGL entries * * Calculates the number of flits needed for a scatter/gather list that * can hold the given number of entries. */ static __inline unsigned int sgl_len(unsigned int n) { return ((3 * n) / 2 + (n & 1)); } /** * get_imm_packet - return the next ingress packet buffer from a response * @resp: the response descriptor containing the packet data * * Return a packet containing the immediate data of the given response. */ static int get_imm_packet(adapter_t *sc, const struct rsp_desc *resp, struct mbuf *m) { if (resp->rss_hdr.opcode == CPL_RX_DATA) { const struct cpl_rx_data *cpl = (const void *)&resp->imm_data[0]; m->m_len = sizeof(*cpl) + ntohs(cpl->len); } else if (resp->rss_hdr.opcode == CPL_RX_PKT) { const struct cpl_rx_pkt *cpl = (const void *)&resp->imm_data[0]; m->m_len = sizeof(*cpl) + ntohs(cpl->len); } else m->m_len = IMMED_PKT_SIZE; m->m_ext.ext_buf = NULL; m->m_ext.ext_type = 0; memcpy(mtod(m, uint8_t *), resp->imm_data, m->m_len); return (0); } static __inline u_int flits_to_desc(u_int n) { return (flit_desc_map[n]); } #define SGE_PARERR (F_CPPARITYERROR | F_OCPARITYERROR | F_RCPARITYERROR | \ F_IRPARITYERROR | V_ITPARITYERROR(M_ITPARITYERROR) | \ V_FLPARITYERROR(M_FLPARITYERROR) | F_LODRBPARITYERROR | \ F_HIDRBPARITYERROR | F_LORCQPARITYERROR | \ F_HIRCQPARITYERROR) #define SGE_FRAMINGERR (F_UC_REQ_FRAMINGERROR | F_R_REQ_FRAMINGERROR) #define SGE_FATALERR (SGE_PARERR | SGE_FRAMINGERR | F_RSPQCREDITOVERFOW | \ F_RSPQDISABLED) /** * t3_sge_err_intr_handler - SGE async event interrupt handler * @adapter: the adapter * * Interrupt handler for SGE asynchronous (non-data) events. */ void t3_sge_err_intr_handler(adapter_t *adapter) { unsigned int v, status; status = t3_read_reg(adapter, A_SG_INT_CAUSE); if (status & SGE_PARERR) CH_ALERT(adapter, "SGE parity error (0x%x)\n", status & SGE_PARERR); if (status & SGE_FRAMINGERR) CH_ALERT(adapter, "SGE framing error (0x%x)\n", status & SGE_FRAMINGERR); if (status & F_RSPQCREDITOVERFOW) CH_ALERT(adapter, "SGE response queue credit overflow\n"); if (status & F_RSPQDISABLED) { v = t3_read_reg(adapter, A_SG_RSPQ_FL_STATUS); CH_ALERT(adapter, "packet delivered to disabled response queue (0x%x)\n", (v >> S_RSPQ0DISABLED) & 0xff); } t3_write_reg(adapter, A_SG_INT_CAUSE, status); if (status & SGE_FATALERR) t3_fatal_err(adapter); } void t3_sge_prep(adapter_t *adap, struct sge_params *p) { int i, nqsets, fl_q_size, jumbo_q_size, use_16k, jumbo_buf_size; nqsets = min(SGE_QSETS / adap->params.nports, mp_ncpus); nqsets *= adap->params.nports; fl_q_size = min(nmbclusters/(3*nqsets), FL_Q_SIZE); while (!powerof2(fl_q_size)) fl_q_size--; use_16k = cxgb_use_16k_clusters != -1 ? cxgb_use_16k_clusters : is_offload(adap); #if __FreeBSD_version >= 700111 if (use_16k) { jumbo_q_size = min(nmbjumbo16/(3*nqsets), JUMBO_Q_SIZE); jumbo_buf_size = MJUM16BYTES; } else { jumbo_q_size = min(nmbjumbo9/(3*nqsets), JUMBO_Q_SIZE); jumbo_buf_size = MJUM9BYTES; } #else jumbo_q_size = min(nmbjumbop/(3*nqsets), JUMBO_Q_SIZE); jumbo_buf_size = MJUMPAGESIZE; #endif while (!powerof2(jumbo_q_size)) jumbo_q_size--; if (fl_q_size < (FL_Q_SIZE / 4) || jumbo_q_size < (JUMBO_Q_SIZE / 2)) device_printf(adap->dev, "Insufficient clusters and/or jumbo buffers.\n"); p->max_pkt_size = jumbo_buf_size - sizeof(struct cpl_rx_data); for (i = 0; i < SGE_QSETS; ++i) { struct qset_params *q = p->qset + i; if (adap->params.nports > 2) { q->coalesce_usecs = 50; } else { #ifdef INVARIANTS q->coalesce_usecs = 10; #else q->coalesce_usecs = 5; #endif } q->polling = 0; q->rspq_size = RSPQ_Q_SIZE; q->fl_size = fl_q_size; q->jumbo_size = jumbo_q_size; q->jumbo_buf_size = jumbo_buf_size; q->txq_size[TXQ_ETH] = TX_ETH_Q_SIZE; q->txq_size[TXQ_OFLD] = is_offload(adap) ? TX_OFLD_Q_SIZE : 16; q->txq_size[TXQ_CTRL] = TX_CTRL_Q_SIZE; q->cong_thres = 0; } } int t3_sge_alloc(adapter_t *sc) { /* The parent tag. */ if (bus_dma_tag_create( bus_get_dma_tag(sc->dev),/* PCI parent */ 1, 0, /* algnmnt, boundary */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ BUS_SPACE_MAXSIZE_32BIT,/* maxsize */ BUS_SPACE_UNRESTRICTED, /* nsegments */ BUS_SPACE_MAXSIZE_32BIT,/* maxsegsize */ 0, /* flags */ NULL, NULL, /* lock, lockarg */ &sc->parent_dmat)) { device_printf(sc->dev, "Cannot allocate parent DMA tag\n"); return (ENOMEM); } /* * DMA tag for normal sized RX frames */ if (bus_dma_tag_create(sc->parent_dmat, MCLBYTES, 0, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, MCLBYTES, 1, MCLBYTES, BUS_DMA_ALLOCNOW, NULL, NULL, &sc->rx_dmat)) { device_printf(sc->dev, "Cannot allocate RX DMA tag\n"); return (ENOMEM); } /* * DMA tag for jumbo sized RX frames. */ if (bus_dma_tag_create(sc->parent_dmat, MJUM16BYTES, 0, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, MJUM16BYTES, 1, MJUM16BYTES, BUS_DMA_ALLOCNOW, NULL, NULL, &sc->rx_jumbo_dmat)) { device_printf(sc->dev, "Cannot allocate RX jumbo DMA tag\n"); return (ENOMEM); } /* * DMA tag for TX frames. */ if (bus_dma_tag_create(sc->parent_dmat, 1, 0, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, TX_MAX_SIZE, TX_MAX_SEGS, TX_MAX_SIZE, BUS_DMA_ALLOCNOW, NULL, NULL, &sc->tx_dmat)) { device_printf(sc->dev, "Cannot allocate TX DMA tag\n"); return (ENOMEM); } return (0); } int t3_sge_free(struct adapter * sc) { if (sc->tx_dmat != NULL) bus_dma_tag_destroy(sc->tx_dmat); if (sc->rx_jumbo_dmat != NULL) bus_dma_tag_destroy(sc->rx_jumbo_dmat); if (sc->rx_dmat != NULL) bus_dma_tag_destroy(sc->rx_dmat); if (sc->parent_dmat != NULL) bus_dma_tag_destroy(sc->parent_dmat); return (0); } void t3_update_qset_coalesce(struct sge_qset *qs, const struct qset_params *p) { qs->rspq.holdoff_tmr = max(p->coalesce_usecs * 10, 1U); qs->rspq.polling = 0 /* p->polling */; } #if !defined(__i386__) && !defined(__amd64__) static void refill_fl_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error) { struct refill_fl_cb_arg *cb_arg = arg; cb_arg->error = error; cb_arg->seg = segs[0]; cb_arg->nseg = nseg; } #endif /** * refill_fl - refill an SGE free-buffer list * @sc: the controller softc * @q: the free-list to refill * @n: the number of new buffers to allocate * * (Re)populate an SGE free-buffer list with up to @n new packet buffers. * The caller must assure that @n does not exceed the queue's capacity. */ static void refill_fl(adapter_t *sc, struct sge_fl *q, int n) { struct rx_sw_desc *sd = &q->sdesc[q->pidx]; struct rx_desc *d = &q->desc[q->pidx]; struct refill_fl_cb_arg cb_arg; struct mbuf *m; caddr_t cl; int err; cb_arg.error = 0; while (n--) { /* * We allocate an uninitialized mbuf + cluster, mbuf is * initialized after rx. */ if (q->zone == zone_pack) { if ((m = m_getcl(M_NOWAIT, MT_NOINIT, M_PKTHDR)) == NULL) break; cl = m->m_ext.ext_buf; } else { if ((cl = m_cljget(NULL, M_NOWAIT, q->buf_size)) == NULL) break; if ((m = m_gethdr(M_NOWAIT, MT_NOINIT)) == NULL) { uma_zfree(q->zone, cl); break; } } if ((sd->flags & RX_SW_DESC_MAP_CREATED) == 0) { if ((err = bus_dmamap_create(q->entry_tag, 0, &sd->map))) { log(LOG_WARNING, "bus_dmamap_create failed %d\n", err); uma_zfree(q->zone, cl); goto done; } sd->flags |= RX_SW_DESC_MAP_CREATED; } #if !defined(__i386__) && !defined(__amd64__) err = bus_dmamap_load(q->entry_tag, sd->map, cl, q->buf_size, refill_fl_cb, &cb_arg, 0); if (err != 0 || cb_arg.error) { if (q->zone != zone_pack) uma_zfree(q->zone, cl); m_free(m); goto done; } #else cb_arg.seg.ds_addr = pmap_kextract((vm_offset_t)cl); #endif sd->flags |= RX_SW_DESC_INUSE; sd->rxsd_cl = cl; sd->m = m; d->addr_lo = htobe32(cb_arg.seg.ds_addr & 0xffffffff); d->addr_hi = htobe32(((uint64_t)cb_arg.seg.ds_addr >>32) & 0xffffffff); d->len_gen = htobe32(V_FLD_GEN1(q->gen)); d->gen2 = htobe32(V_FLD_GEN2(q->gen)); d++; sd++; if (++q->pidx == q->size) { q->pidx = 0; q->gen ^= 1; sd = q->sdesc; d = q->desc; } q->credits++; q->db_pending++; } done: if (q->db_pending >= 32) { q->db_pending = 0; t3_write_reg(sc, A_SG_KDOORBELL, V_EGRCNTX(q->cntxt_id)); } } /** * free_rx_bufs - free the Rx buffers on an SGE free list * @sc: the controle softc * @q: the SGE free list to clean up * * Release the buffers on an SGE free-buffer Rx queue. HW fetching from * this queue should be stopped before calling this function. */ static void free_rx_bufs(adapter_t *sc, struct sge_fl *q) { u_int cidx = q->cidx; while (q->credits--) { struct rx_sw_desc *d = &q->sdesc[cidx]; if (d->flags & RX_SW_DESC_INUSE) { bus_dmamap_unload(q->entry_tag, d->map); bus_dmamap_destroy(q->entry_tag, d->map); if (q->zone == zone_pack) { m_init(d->m, M_NOWAIT, MT_DATA, M_EXT); uma_zfree(zone_pack, d->m); } else { m_init(d->m, M_NOWAIT, MT_DATA, 0); uma_zfree(zone_mbuf, d->m); uma_zfree(q->zone, d->rxsd_cl); } } d->rxsd_cl = NULL; d->m = NULL; if (++cidx == q->size) cidx = 0; } } static __inline void __refill_fl(adapter_t *adap, struct sge_fl *fl) { refill_fl(adap, fl, min(16U, fl->size - fl->credits)); } static __inline void __refill_fl_lt(adapter_t *adap, struct sge_fl *fl, int max) { uint32_t reclaimable = fl->size - fl->credits; if (reclaimable > 0) refill_fl(adap, fl, min(max, reclaimable)); } /** * recycle_rx_buf - recycle a receive buffer * @adapter: the adapter * @q: the SGE free list * @idx: index of buffer to recycle * * Recycles the specified buffer on the given free list by adding it at * the next available slot on the list. */ static void recycle_rx_buf(adapter_t *adap, struct sge_fl *q, unsigned int idx) { struct rx_desc *from = &q->desc[idx]; struct rx_desc *to = &q->desc[q->pidx]; q->sdesc[q->pidx] = q->sdesc[idx]; to->addr_lo = from->addr_lo; // already big endian to->addr_hi = from->addr_hi; // likewise wmb(); /* necessary ? */ to->len_gen = htobe32(V_FLD_GEN1(q->gen)); to->gen2 = htobe32(V_FLD_GEN2(q->gen)); q->credits++; if (++q->pidx == q->size) { q->pidx = 0; q->gen ^= 1; } t3_write_reg(adap, A_SG_KDOORBELL, V_EGRCNTX(q->cntxt_id)); } static void alloc_ring_cb(void *arg, bus_dma_segment_t *segs, int nsegs, int error) { uint32_t *addr; addr = arg; *addr = segs[0].ds_addr; } static int alloc_ring(adapter_t *sc, size_t nelem, size_t elem_size, size_t sw_size, bus_addr_t *phys, void *desc, void *sdesc, bus_dma_tag_t *tag, bus_dmamap_t *map, bus_dma_tag_t parent_entry_tag, bus_dma_tag_t *entry_tag) { size_t len = nelem * elem_size; void *s = NULL; void *p = NULL; int err; if ((err = bus_dma_tag_create(sc->parent_dmat, PAGE_SIZE, 0, BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL, len, 1, len, 0, NULL, NULL, tag)) != 0) { device_printf(sc->dev, "Cannot allocate descriptor tag\n"); return (ENOMEM); } if ((err = bus_dmamem_alloc(*tag, (void **)&p, BUS_DMA_NOWAIT, map)) != 0) { device_printf(sc->dev, "Cannot allocate descriptor memory\n"); return (ENOMEM); } bus_dmamap_load(*tag, *map, p, len, alloc_ring_cb, phys, 0); bzero(p, len); *(void **)desc = p; if (sw_size) { len = nelem * sw_size; s = malloc(len, M_DEVBUF, M_WAITOK|M_ZERO); *(void **)sdesc = s; } if (parent_entry_tag == NULL) return (0); if ((err = bus_dma_tag_create(parent_entry_tag, 1, 0, BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL, TX_MAX_SIZE, TX_MAX_SEGS, TX_MAX_SIZE, BUS_DMA_ALLOCNOW, NULL, NULL, entry_tag)) != 0) { device_printf(sc->dev, "Cannot allocate descriptor entry tag\n"); return (ENOMEM); } return (0); } static void sge_slow_intr_handler(void *arg, int ncount) { adapter_t *sc = arg; t3_slow_intr_handler(sc); t3_write_reg(sc, A_PL_INT_ENABLE0, sc->slow_intr_mask); (void) t3_read_reg(sc, A_PL_INT_ENABLE0); } /** * sge_timer_cb - perform periodic maintenance of an SGE qset * @data: the SGE queue set to maintain * * Runs periodically from a timer to perform maintenance of an SGE queue * set. It performs two tasks: * * a) Cleans up any completed Tx descriptors that may still be pending. * Normal descriptor cleanup happens when new packets are added to a Tx * queue so this timer is relatively infrequent and does any cleanup only * if the Tx queue has not seen any new packets in a while. We make a * best effort attempt to reclaim descriptors, in that we don't wait * around if we cannot get a queue's lock (which most likely is because * someone else is queueing new packets and so will also handle the clean * up). Since control queues use immediate data exclusively we don't * bother cleaning them up here. * * b) Replenishes Rx queues that have run out due to memory shortage. * Normally new Rx buffers are added when existing ones are consumed but * when out of memory a queue can become empty. We try to add only a few * buffers here, the queue will be replenished fully as these new buffers * are used up if memory shortage has subsided. * * c) Return coalesced response queue credits in case a response queue is * starved. * * d) Ring doorbells for T304 tunnel queues since we have seen doorbell * fifo overflows and the FW doesn't implement any recovery scheme yet. */ static void sge_timer_cb(void *arg) { adapter_t *sc = arg; if ((sc->flags & USING_MSIX) == 0) { struct port_info *pi; struct sge_qset *qs; struct sge_txq *txq; int i, j; int reclaim_ofl, refill_rx; if (sc->open_device_map == 0) return; for (i = 0; i < sc->params.nports; i++) { pi = &sc->port[i]; for (j = 0; j < pi->nqsets; j++) { qs = &sc->sge.qs[pi->first_qset + j]; txq = &qs->txq[0]; reclaim_ofl = txq[TXQ_OFLD].processed - txq[TXQ_OFLD].cleaned; refill_rx = ((qs->fl[0].credits < qs->fl[0].size) || (qs->fl[1].credits < qs->fl[1].size)); if (reclaim_ofl || refill_rx) { taskqueue_enqueue(sc->tq, &pi->timer_reclaim_task); break; } } } } if (sc->params.nports > 2) { int i; for_each_port(sc, i) { struct port_info *pi = &sc->port[i]; t3_write_reg(sc, A_SG_KDOORBELL, F_SELEGRCNTX | (FW_TUNNEL_SGEEC_START + pi->first_qset)); } } if (((sc->flags & USING_MSIX) == 0 || sc->params.nports > 2) && sc->open_device_map != 0) callout_reset(&sc->sge_timer_ch, TX_RECLAIM_PERIOD, sge_timer_cb, sc); } /* * This is meant to be a catch-all function to keep sge state private * to sge.c * */ int t3_sge_init_adapter(adapter_t *sc) { callout_init(&sc->sge_timer_ch, 1); callout_reset(&sc->sge_timer_ch, TX_RECLAIM_PERIOD, sge_timer_cb, sc); TASK_INIT(&sc->slow_intr_task, 0, sge_slow_intr_handler, sc); return (0); } int t3_sge_reset_adapter(adapter_t *sc) { callout_reset(&sc->sge_timer_ch, TX_RECLAIM_PERIOD, sge_timer_cb, sc); return (0); } int t3_sge_init_port(struct port_info *pi) { TASK_INIT(&pi->timer_reclaim_task, 0, sge_timer_reclaim, pi); return (0); } /** * refill_rspq - replenish an SGE response queue * @adapter: the adapter * @q: the response queue to replenish * @credits: how many new responses to make available * * Replenishes a response queue by making the supplied number of responses * available to HW. */ static __inline void refill_rspq(adapter_t *sc, const struct sge_rspq *q, u_int credits) { /* mbufs are allocated on demand when a rspq entry is processed. */ t3_write_reg(sc, A_SG_RSPQ_CREDIT_RETURN, V_RSPQ(q->cntxt_id) | V_CREDITS(credits)); } static void sge_txq_reclaim_handler(void *arg, int ncount) { struct sge_qset *qs = arg; int i; for (i = 0; i < 3; i++) reclaim_completed_tx(qs, 16, i); } static void sge_timer_reclaim(void *arg, int ncount) { struct port_info *pi = arg; int i, nqsets = pi->nqsets; adapter_t *sc = pi->adapter; struct sge_qset *qs; struct mtx *lock; KASSERT((sc->flags & USING_MSIX) == 0, ("can't call timer reclaim for msi-x")); for (i = 0; i < nqsets; i++) { qs = &sc->sge.qs[pi->first_qset + i]; reclaim_completed_tx(qs, 16, TXQ_OFLD); lock = (sc->flags & USING_MSIX) ? &qs->rspq.lock : &sc->sge.qs[0].rspq.lock; if (mtx_trylock(lock)) { /* XXX currently assume that we are *NOT* polling */ uint32_t status = t3_read_reg(sc, A_SG_RSPQ_FL_STATUS); if (qs->fl[0].credits < qs->fl[0].size - 16) __refill_fl(sc, &qs->fl[0]); if (qs->fl[1].credits < qs->fl[1].size - 16) __refill_fl(sc, &qs->fl[1]); if (status & (1 << qs->rspq.cntxt_id)) { if (qs->rspq.credits) { refill_rspq(sc, &qs->rspq, 1); qs->rspq.credits--; t3_write_reg(sc, A_SG_RSPQ_FL_STATUS, 1 << qs->rspq.cntxt_id); } } mtx_unlock(lock); } } } /** * init_qset_cntxt - initialize an SGE queue set context info * @qs: the queue set * @id: the queue set id * * Initializes the TIDs and context ids for the queues of a queue set. */ static void init_qset_cntxt(struct sge_qset *qs, u_int id) { qs->rspq.cntxt_id = id; qs->fl[0].cntxt_id = 2 * id; qs->fl[1].cntxt_id = 2 * id + 1; qs->txq[TXQ_ETH].cntxt_id = FW_TUNNEL_SGEEC_START + id; qs->txq[TXQ_ETH].token = FW_TUNNEL_TID_START + id; qs->txq[TXQ_OFLD].cntxt_id = FW_OFLD_SGEEC_START + id; qs->txq[TXQ_CTRL].cntxt_id = FW_CTRL_SGEEC_START + id; qs->txq[TXQ_CTRL].token = FW_CTRL_TID_START + id; /* XXX: a sane limit is needed instead of INT_MAX */ mbufq_init(&qs->txq[TXQ_ETH].sendq, INT_MAX); mbufq_init(&qs->txq[TXQ_OFLD].sendq, INT_MAX); mbufq_init(&qs->txq[TXQ_CTRL].sendq, INT_MAX); } static void txq_prod(struct sge_txq *txq, unsigned int ndesc, struct txq_state *txqs) { txq->in_use += ndesc; /* * XXX we don't handle stopping of queue * presumably start handles this when we bump against the end */ txqs->gen = txq->gen; txq->unacked += ndesc; txqs->compl = (txq->unacked & 32) << (S_WR_COMPL - 5); txq->unacked &= 31; txqs->pidx = txq->pidx; txq->pidx += ndesc; #ifdef INVARIANTS if (((txqs->pidx > txq->cidx) && (txq->pidx < txqs->pidx) && (txq->pidx >= txq->cidx)) || ((txqs->pidx < txq->cidx) && (txq->pidx >= txq-> cidx)) || ((txqs->pidx < txq->cidx) && (txq->cidx < txqs->pidx))) panic("txqs->pidx=%d txq->pidx=%d txq->cidx=%d", txqs->pidx, txq->pidx, txq->cidx); #endif if (txq->pidx >= txq->size) { txq->pidx -= txq->size; txq->gen ^= 1; } } /** * calc_tx_descs - calculate the number of Tx descriptors for a packet * @m: the packet mbufs * @nsegs: the number of segments * * Returns the number of Tx descriptors needed for the given Ethernet * packet. Ethernet packets require addition of WR and CPL headers. */ static __inline unsigned int calc_tx_descs(const struct mbuf *m, int nsegs) { unsigned int flits; if (m->m_pkthdr.len <= PIO_LEN) return 1; flits = sgl_len(nsegs) + 2; if (m->m_pkthdr.csum_flags & CSUM_TSO) flits++; return flits_to_desc(flits); } /** * make_sgl - populate a scatter/gather list for a packet * @sgp: the SGL to populate * @segs: the packet dma segments * @nsegs: the number of segments * * Generates a scatter/gather list for the buffers that make up a packet * and returns the SGL size in 8-byte words. The caller must size the SGL * appropriately. */ static __inline void make_sgl(struct sg_ent *sgp, bus_dma_segment_t *segs, int nsegs) { int i, idx; for (idx = 0, i = 0; i < nsegs; i++) { /* * firmware doesn't like empty segments */ if (segs[i].ds_len == 0) continue; if (i && idx == 0) ++sgp; sgp->len[idx] = htobe32(segs[i].ds_len); sgp->addr[idx] = htobe64(segs[i].ds_addr); idx ^= 1; } if (idx) { sgp->len[idx] = 0; sgp->addr[idx] = 0; } } /** * check_ring_tx_db - check and potentially ring a Tx queue's doorbell * @adap: the adapter * @q: the Tx queue * * Ring the doorbell if a Tx queue is asleep. There is a natural race, * where the HW is going to sleep just after we checked, however, * then the interrupt handler will detect the outstanding TX packet * and ring the doorbell for us. * * When GTS is disabled we unconditionally ring the doorbell. */ static __inline void check_ring_tx_db(adapter_t *adap, struct sge_txq *q, int mustring) { #if USE_GTS clear_bit(TXQ_LAST_PKT_DB, &q->flags); if (test_and_set_bit(TXQ_RUNNING, &q->flags) == 0) { set_bit(TXQ_LAST_PKT_DB, &q->flags); #ifdef T3_TRACE T3_TRACE1(adap->tb[q->cntxt_id & 7], "doorbell Tx, cntxt %d", q->cntxt_id); #endif t3_write_reg(adap, A_SG_KDOORBELL, F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id)); } #else if (mustring || ++q->db_pending >= 32) { wmb(); /* write descriptors before telling HW */ t3_write_reg(adap, A_SG_KDOORBELL, F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id)); q->db_pending = 0; } #endif } static __inline void wr_gen2(struct tx_desc *d, unsigned int gen) { #if SGE_NUM_GENBITS == 2 d->flit[TX_DESC_FLITS - 1] = htobe64(gen); #endif } /** * write_wr_hdr_sgl - write a WR header and, optionally, SGL * @ndesc: number of Tx descriptors spanned by the SGL * @txd: first Tx descriptor to be written * @txqs: txq state (generation and producer index) * @txq: the SGE Tx queue * @sgl: the SGL * @flits: number of flits to the start of the SGL in the first descriptor * @sgl_flits: the SGL size in flits * @wr_hi: top 32 bits of WR header based on WR type (big endian) * @wr_lo: low 32 bits of WR header based on WR type (big endian) * * Write a work request header and an associated SGL. If the SGL is * small enough to fit into one Tx descriptor it has already been written * and we just need to write the WR header. Otherwise we distribute the * SGL across the number of descriptors it spans. */ static void write_wr_hdr_sgl(unsigned int ndesc, struct tx_desc *txd, struct txq_state *txqs, const struct sge_txq *txq, const struct sg_ent *sgl, unsigned int flits, unsigned int sgl_flits, unsigned int wr_hi, unsigned int wr_lo) { struct work_request_hdr *wrp = (struct work_request_hdr *)txd; struct tx_sw_desc *txsd = &txq->sdesc[txqs->pidx]; if (__predict_true(ndesc == 1)) { set_wr_hdr(wrp, htonl(F_WR_SOP | F_WR_EOP | V_WR_DATATYPE(1) | V_WR_SGLSFLT(flits)) | wr_hi, htonl(V_WR_LEN(flits + sgl_flits) | V_WR_GEN(txqs->gen)) | wr_lo); wr_gen2(txd, txqs->gen); } else { unsigned int ogen = txqs->gen; const uint64_t *fp = (const uint64_t *)sgl; struct work_request_hdr *wp = wrp; wrp->wrh_hi = htonl(F_WR_SOP | V_WR_DATATYPE(1) | V_WR_SGLSFLT(flits)) | wr_hi; while (sgl_flits) { unsigned int avail = WR_FLITS - flits; if (avail > sgl_flits) avail = sgl_flits; memcpy(&txd->flit[flits], fp, avail * sizeof(*fp)); sgl_flits -= avail; ndesc--; if (!sgl_flits) break; fp += avail; txd++; txsd++; if (++txqs->pidx == txq->size) { txqs->pidx = 0; txqs->gen ^= 1; txd = txq->desc; txsd = txq->sdesc; } /* * when the head of the mbuf chain * is freed all clusters will be freed * with it */ wrp = (struct work_request_hdr *)txd; wrp->wrh_hi = htonl(V_WR_DATATYPE(1) | V_WR_SGLSFLT(1)) | wr_hi; wrp->wrh_lo = htonl(V_WR_LEN(min(WR_FLITS, sgl_flits + 1)) | V_WR_GEN(txqs->gen)) | wr_lo; wr_gen2(txd, txqs->gen); flits = 1; } wrp->wrh_hi |= htonl(F_WR_EOP); wmb(); wp->wrh_lo = htonl(V_WR_LEN(WR_FLITS) | V_WR_GEN(ogen)) | wr_lo; wr_gen2((struct tx_desc *)wp, ogen); } } /* sizeof(*eh) + sizeof(*ip) + sizeof(*tcp) */ #define TCPPKTHDRSIZE (ETHER_HDR_LEN + 20 + 20) #define GET_VTAG(cntrl, m) \ do { \ if ((m)->m_flags & M_VLANTAG) \ cntrl |= F_TXPKT_VLAN_VLD | V_TXPKT_VLAN((m)->m_pkthdr.ether_vtag); \ } while (0) static int t3_encap(struct sge_qset *qs, struct mbuf **m) { adapter_t *sc; struct mbuf *m0; struct sge_txq *txq; struct txq_state txqs; struct port_info *pi; unsigned int ndesc, flits, cntrl, mlen; int err, nsegs, tso_info = 0; struct work_request_hdr *wrp; struct tx_sw_desc *txsd; struct sg_ent *sgp, *sgl; uint32_t wr_hi, wr_lo, sgl_flits; bus_dma_segment_t segs[TX_MAX_SEGS]; struct tx_desc *txd; pi = qs->port; sc = pi->adapter; txq = &qs->txq[TXQ_ETH]; txd = &txq->desc[txq->pidx]; txsd = &txq->sdesc[txq->pidx]; sgl = txq->txq_sgl; prefetch(txd); m0 = *m; mtx_assert(&qs->lock, MA_OWNED); cntrl = V_TXPKT_INTF(pi->txpkt_intf); KASSERT(m0->m_flags & M_PKTHDR, ("not packet header\n")); if (m0->m_nextpkt == NULL && m0->m_next != NULL && m0->m_pkthdr.csum_flags & (CSUM_TSO)) tso_info = V_LSO_MSS(m0->m_pkthdr.tso_segsz); if (m0->m_nextpkt != NULL) { busdma_map_sg_vec(txq->entry_tag, txsd->map, m0, segs, &nsegs); ndesc = 1; mlen = 0; } else { if ((err = busdma_map_sg_collapse(txq->entry_tag, txsd->map, &m0, segs, &nsegs))) { if (cxgb_debug) printf("failed ... err=%d\n", err); return (err); } mlen = m0->m_pkthdr.len; ndesc = calc_tx_descs(m0, nsegs); } txq_prod(txq, ndesc, &txqs); KASSERT(m0->m_pkthdr.len, ("empty packet nsegs=%d", nsegs)); txsd->m = m0; if (m0->m_nextpkt != NULL) { struct cpl_tx_pkt_batch *cpl_batch = (struct cpl_tx_pkt_batch *)txd; int i, fidx; if (nsegs > 7) panic("trying to coalesce %d packets in to one WR", nsegs); txq->txq_coalesced += nsegs; wrp = (struct work_request_hdr *)txd; flits = nsegs*2 + 1; for (fidx = 1, i = 0; i < nsegs; i++, fidx += 2) { struct cpl_tx_pkt_batch_entry *cbe; uint64_t flit; uint32_t *hflit = (uint32_t *)&flit; int cflags = m0->m_pkthdr.csum_flags; cntrl = V_TXPKT_INTF(pi->txpkt_intf); GET_VTAG(cntrl, m0); cntrl |= V_TXPKT_OPCODE(CPL_TX_PKT); if (__predict_false(!(cflags & CSUM_IP))) cntrl |= F_TXPKT_IPCSUM_DIS; if (__predict_false(!(cflags & (CSUM_TCP | CSUM_UDP | CSUM_UDP_IPV6 | CSUM_TCP_IPV6)))) cntrl |= F_TXPKT_L4CSUM_DIS; hflit[0] = htonl(cntrl); hflit[1] = htonl(segs[i].ds_len | 0x80000000); flit |= htobe64(1 << 24); cbe = &cpl_batch->pkt_entry[i]; cbe->cntrl = hflit[0]; cbe->len = hflit[1]; cbe->addr = htobe64(segs[i].ds_addr); } wr_hi = htonl(F_WR_SOP | F_WR_EOP | V_WR_DATATYPE(1) | V_WR_SGLSFLT(flits)) | htonl(V_WR_OP(FW_WROPCODE_TUNNEL_TX_PKT) | txqs.compl); wr_lo = htonl(V_WR_LEN(flits) | V_WR_GEN(txqs.gen)) | htonl(V_WR_TID(txq->token)); set_wr_hdr(wrp, wr_hi, wr_lo); wmb(); ETHER_BPF_MTAP(pi->ifp, m0); wr_gen2(txd, txqs.gen); check_ring_tx_db(sc, txq, 0); return (0); } else if (tso_info) { uint16_t eth_type; struct cpl_tx_pkt_lso *hdr = (struct cpl_tx_pkt_lso *)txd; struct ether_header *eh; void *l3hdr; struct tcphdr *tcp; txd->flit[2] = 0; GET_VTAG(cntrl, m0); cntrl |= V_TXPKT_OPCODE(CPL_TX_PKT_LSO); hdr->cntrl = htonl(cntrl); hdr->len = htonl(mlen | 0x80000000); if (__predict_false(mlen < TCPPKTHDRSIZE)) { printf("mbuf=%p,len=%d,tso_segsz=%d,csum_flags=%b,flags=%#x", m0, mlen, m0->m_pkthdr.tso_segsz, (int)m0->m_pkthdr.csum_flags, CSUM_BITS, m0->m_flags); panic("tx tso packet too small"); } /* Make sure that ether, ip, tcp headers are all in m0 */ if (__predict_false(m0->m_len < TCPPKTHDRSIZE)) { m0 = m_pullup(m0, TCPPKTHDRSIZE); if (__predict_false(m0 == NULL)) { /* XXX panic probably an overreaction */ panic("couldn't fit header into mbuf"); } } eh = mtod(m0, struct ether_header *); eth_type = eh->ether_type; if (eth_type == htons(ETHERTYPE_VLAN)) { struct ether_vlan_header *evh = (void *)eh; tso_info |= V_LSO_ETH_TYPE(CPL_ETH_II_VLAN); l3hdr = evh + 1; eth_type = evh->evl_proto; } else { tso_info |= V_LSO_ETH_TYPE(CPL_ETH_II); l3hdr = eh + 1; } if (eth_type == htons(ETHERTYPE_IP)) { struct ip *ip = l3hdr; tso_info |= V_LSO_IPHDR_WORDS(ip->ip_hl); tcp = (struct tcphdr *)(ip + 1); } else if (eth_type == htons(ETHERTYPE_IPV6)) { struct ip6_hdr *ip6 = l3hdr; KASSERT(ip6->ip6_nxt == IPPROTO_TCP, ("%s: CSUM_TSO with ip6_nxt %d", __func__, ip6->ip6_nxt)); tso_info |= F_LSO_IPV6; tso_info |= V_LSO_IPHDR_WORDS(sizeof(*ip6) >> 2); tcp = (struct tcphdr *)(ip6 + 1); } else panic("%s: CSUM_TSO but neither ip nor ip6", __func__); tso_info |= V_LSO_TCPHDR_WORDS(tcp->th_off); hdr->lso_info = htonl(tso_info); if (__predict_false(mlen <= PIO_LEN)) { /* * pkt not undersized but fits in PIO_LEN * Indicates a TSO bug at the higher levels. */ txsd->m = NULL; m_copydata(m0, 0, mlen, (caddr_t)&txd->flit[3]); flits = (mlen + 7) / 8 + 3; wr_hi = htonl(V_WR_BCNTLFLT(mlen & 7) | V_WR_OP(FW_WROPCODE_TUNNEL_TX_PKT) | F_WR_SOP | F_WR_EOP | txqs.compl); wr_lo = htonl(V_WR_LEN(flits) | V_WR_GEN(txqs.gen) | V_WR_TID(txq->token)); set_wr_hdr(&hdr->wr, wr_hi, wr_lo); wmb(); ETHER_BPF_MTAP(pi->ifp, m0); wr_gen2(txd, txqs.gen); check_ring_tx_db(sc, txq, 0); m_freem(m0); return (0); } flits = 3; } else { struct cpl_tx_pkt *cpl = (struct cpl_tx_pkt *)txd; GET_VTAG(cntrl, m0); cntrl |= V_TXPKT_OPCODE(CPL_TX_PKT); if (__predict_false(!(m0->m_pkthdr.csum_flags & CSUM_IP))) cntrl |= F_TXPKT_IPCSUM_DIS; if (__predict_false(!(m0->m_pkthdr.csum_flags & (CSUM_TCP | CSUM_UDP | CSUM_UDP_IPV6 | CSUM_TCP_IPV6)))) cntrl |= F_TXPKT_L4CSUM_DIS; cpl->cntrl = htonl(cntrl); cpl->len = htonl(mlen | 0x80000000); if (mlen <= PIO_LEN) { txsd->m = NULL; m_copydata(m0, 0, mlen, (caddr_t)&txd->flit[2]); flits = (mlen + 7) / 8 + 2; wr_hi = htonl(V_WR_BCNTLFLT(mlen & 7) | V_WR_OP(FW_WROPCODE_TUNNEL_TX_PKT) | F_WR_SOP | F_WR_EOP | txqs.compl); wr_lo = htonl(V_WR_LEN(flits) | V_WR_GEN(txqs.gen) | V_WR_TID(txq->token)); set_wr_hdr(&cpl->wr, wr_hi, wr_lo); wmb(); ETHER_BPF_MTAP(pi->ifp, m0); wr_gen2(txd, txqs.gen); check_ring_tx_db(sc, txq, 0); m_freem(m0); return (0); } flits = 2; } wrp = (struct work_request_hdr *)txd; sgp = (ndesc == 1) ? (struct sg_ent *)&txd->flit[flits] : sgl; make_sgl(sgp, segs, nsegs); sgl_flits = sgl_len(nsegs); ETHER_BPF_MTAP(pi->ifp, m0); KASSERT(ndesc <= 4, ("ndesc too large %d", ndesc)); wr_hi = htonl(V_WR_OP(FW_WROPCODE_TUNNEL_TX_PKT) | txqs.compl); wr_lo = htonl(V_WR_TID(txq->token)); write_wr_hdr_sgl(ndesc, txd, &txqs, txq, sgl, flits, sgl_flits, wr_hi, wr_lo); check_ring_tx_db(sc, txq, 0); return (0); } void cxgb_tx_watchdog(void *arg) { struct sge_qset *qs = arg; struct sge_txq *txq = &qs->txq[TXQ_ETH]; if (qs->coalescing != 0 && (txq->in_use <= cxgb_tx_coalesce_enable_stop) && TXQ_RING_EMPTY(qs)) qs->coalescing = 0; else if (qs->coalescing == 0 && (txq->in_use >= cxgb_tx_coalesce_enable_start)) qs->coalescing = 1; if (TXQ_TRYLOCK(qs)) { qs->qs_flags |= QS_FLUSHING; cxgb_start_locked(qs); qs->qs_flags &= ~QS_FLUSHING; TXQ_UNLOCK(qs); } if (qs->port->ifp->if_drv_flags & IFF_DRV_RUNNING) callout_reset_on(&txq->txq_watchdog, hz/4, cxgb_tx_watchdog, qs, txq->txq_watchdog.c_cpu); } static void cxgb_tx_timeout(void *arg) { struct sge_qset *qs = arg; struct sge_txq *txq = &qs->txq[TXQ_ETH]; if (qs->coalescing == 0 && (txq->in_use >= (txq->size>>3))) qs->coalescing = 1; if (TXQ_TRYLOCK(qs)) { qs->qs_flags |= QS_TIMEOUT; cxgb_start_locked(qs); qs->qs_flags &= ~QS_TIMEOUT; TXQ_UNLOCK(qs); } } static void cxgb_start_locked(struct sge_qset *qs) { struct mbuf *m_head = NULL; struct sge_txq *txq = &qs->txq[TXQ_ETH]; struct port_info *pi = qs->port; struct ifnet *ifp = pi->ifp; if (qs->qs_flags & (QS_FLUSHING|QS_TIMEOUT)) reclaim_completed_tx(qs, 0, TXQ_ETH); if (!pi->link_config.link_ok) { TXQ_RING_FLUSH(qs); return; } TXQ_LOCK_ASSERT(qs); while (!TXQ_RING_EMPTY(qs) && (ifp->if_drv_flags & IFF_DRV_RUNNING) && pi->link_config.link_ok) { reclaim_completed_tx(qs, cxgb_tx_reclaim_threshold, TXQ_ETH); if (txq->size - txq->in_use <= TX_MAX_DESC) break; if ((m_head = cxgb_dequeue(qs)) == NULL) break; /* * Encapsulation can modify our pointer, and or make it * NULL on failure. In that event, we can't requeue. */ if (t3_encap(qs, &m_head) || m_head == NULL) break; m_head = NULL; } if (txq->db_pending) check_ring_tx_db(pi->adapter, txq, 1); if (!TXQ_RING_EMPTY(qs) && callout_pending(&txq->txq_timer) == 0 && pi->link_config.link_ok) callout_reset_on(&txq->txq_timer, 1, cxgb_tx_timeout, qs, txq->txq_timer.c_cpu); if (m_head != NULL) m_freem(m_head); } static int cxgb_transmit_locked(struct ifnet *ifp, struct sge_qset *qs, struct mbuf *m) { struct port_info *pi = qs->port; struct sge_txq *txq = &qs->txq[TXQ_ETH]; struct buf_ring *br = txq->txq_mr; int error, avail; avail = txq->size - txq->in_use; TXQ_LOCK_ASSERT(qs); /* * We can only do a direct transmit if the following are true: * - we aren't coalescing (ring < 3/4 full) * - the link is up -- checked in caller * - there are no packets enqueued already * - there is space in hardware transmit queue */ if (check_pkt_coalesce(qs) == 0 && !TXQ_RING_NEEDS_ENQUEUE(qs) && avail > TX_MAX_DESC) { if (t3_encap(qs, &m)) { if (m != NULL && (error = drbr_enqueue(ifp, br, m)) != 0) return (error); } else { if (txq->db_pending) check_ring_tx_db(pi->adapter, txq, 1); /* * We've bypassed the buf ring so we need to update * the stats directly */ txq->txq_direct_packets++; txq->txq_direct_bytes += m->m_pkthdr.len; } } else if ((error = drbr_enqueue(ifp, br, m)) != 0) return (error); reclaim_completed_tx(qs, cxgb_tx_reclaim_threshold, TXQ_ETH); if (!TXQ_RING_EMPTY(qs) && pi->link_config.link_ok && (!check_pkt_coalesce(qs) || (drbr_inuse(ifp, br) >= 7))) cxgb_start_locked(qs); else if (!TXQ_RING_EMPTY(qs) && !callout_pending(&txq->txq_timer)) callout_reset_on(&txq->txq_timer, 1, cxgb_tx_timeout, qs, txq->txq_timer.c_cpu); return (0); } int cxgb_transmit(struct ifnet *ifp, struct mbuf *m) { struct sge_qset *qs; struct port_info *pi = ifp->if_softc; int error, qidx = pi->first_qset; if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 ||(!pi->link_config.link_ok)) { m_freem(m); return (0); } /* check if flowid is set */ if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) qidx = (m->m_pkthdr.flowid % pi->nqsets) + pi->first_qset; qs = &pi->adapter->sge.qs[qidx]; if (TXQ_TRYLOCK(qs)) { /* XXX running */ error = cxgb_transmit_locked(ifp, qs, m); TXQ_UNLOCK(qs); } else error = drbr_enqueue(ifp, qs->txq[TXQ_ETH].txq_mr, m); return (error); } void cxgb_qflush(struct ifnet *ifp) { /* * flush any enqueued mbufs in the buf_rings * and in the transmit queues * no-op for now */ return; } /** * write_imm - write a packet into a Tx descriptor as immediate data * @d: the Tx descriptor to write * @m: the packet * @len: the length of packet data to write as immediate data * @gen: the generation bit value to write * * Writes a packet as immediate data into a Tx descriptor. The packet * contains a work request at its beginning. We must write the packet * carefully so the SGE doesn't read accidentally before it's written in * its entirety. */ static __inline void write_imm(struct tx_desc *d, caddr_t src, unsigned int len, unsigned int gen) { struct work_request_hdr *from = (struct work_request_hdr *)src; struct work_request_hdr *to = (struct work_request_hdr *)d; uint32_t wr_hi, wr_lo; KASSERT(len <= WR_LEN && len >= sizeof(*from), ("%s: invalid len %d", __func__, len)); memcpy(&to[1], &from[1], len - sizeof(*from)); wr_hi = from->wrh_hi | htonl(F_WR_SOP | F_WR_EOP | V_WR_BCNTLFLT(len & 7)); wr_lo = from->wrh_lo | htonl(V_WR_GEN(gen) | V_WR_LEN((len + 7) / 8)); set_wr_hdr(to, wr_hi, wr_lo); wmb(); wr_gen2(d, gen); } /** * check_desc_avail - check descriptor availability on a send queue * @adap: the adapter * @q: the TX queue * @m: the packet needing the descriptors * @ndesc: the number of Tx descriptors needed * @qid: the Tx queue number in its queue set (TXQ_OFLD or TXQ_CTRL) * * Checks if the requested number of Tx descriptors is available on an * SGE send queue. If the queue is already suspended or not enough * descriptors are available the packet is queued for later transmission. * Must be called with the Tx queue locked. * * Returns 0 if enough descriptors are available, 1 if there aren't * enough descriptors and the packet has been queued, and 2 if the caller * needs to retry because there weren't enough descriptors at the * beginning of the call but some freed up in the mean time. */ static __inline int check_desc_avail(adapter_t *adap, struct sge_txq *q, struct mbuf *m, unsigned int ndesc, unsigned int qid) { /* * XXX We currently only use this for checking the control queue * the control queue is only used for binding qsets which happens * at init time so we are guaranteed enough descriptors */ if (__predict_false(mbufq_len(&q->sendq))) { addq_exit: (void )mbufq_enqueue(&q->sendq, m); return 1; } if (__predict_false(q->size - q->in_use < ndesc)) { struct sge_qset *qs = txq_to_qset(q, qid); setbit(&qs->txq_stopped, qid); if (should_restart_tx(q) && test_and_clear_bit(qid, &qs->txq_stopped)) return 2; q->stops++; goto addq_exit; } return 0; } /** * reclaim_completed_tx_imm - reclaim completed control-queue Tx descs * @q: the SGE control Tx queue * * This is a variant of reclaim_completed_tx() that is used for Tx queues * that send only immediate data (presently just the control queues) and * thus do not have any mbufs */ static __inline void reclaim_completed_tx_imm(struct sge_txq *q) { unsigned int reclaim = q->processed - q->cleaned; q->in_use -= reclaim; q->cleaned += reclaim; } /** * ctrl_xmit - send a packet through an SGE control Tx queue * @adap: the adapter * @q: the control queue * @m: the packet * * Send a packet through an SGE control Tx queue. Packets sent through * a control queue must fit entirely as immediate data in a single Tx * descriptor and have no page fragments. */ static int ctrl_xmit(adapter_t *adap, struct sge_qset *qs, struct mbuf *m) { int ret; struct work_request_hdr *wrp = mtod(m, struct work_request_hdr *); struct sge_txq *q = &qs->txq[TXQ_CTRL]; KASSERT(m->m_len <= WR_LEN, ("%s: bad tx data", __func__)); wrp->wrh_hi |= htonl(F_WR_SOP | F_WR_EOP); wrp->wrh_lo = htonl(V_WR_TID(q->token)); TXQ_LOCK(qs); again: reclaim_completed_tx_imm(q); ret = check_desc_avail(adap, q, m, 1, TXQ_CTRL); if (__predict_false(ret)) { if (ret == 1) { TXQ_UNLOCK(qs); return (ENOSPC); } goto again; } write_imm(&q->desc[q->pidx], m->m_data, m->m_len, q->gen); q->in_use++; if (++q->pidx >= q->size) { q->pidx = 0; q->gen ^= 1; } TXQ_UNLOCK(qs); wmb(); t3_write_reg(adap, A_SG_KDOORBELL, F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id)); m_free(m); return (0); } /** * restart_ctrlq - restart a suspended control queue * @qs: the queue set cotaining the control queue * * Resumes transmission on a suspended Tx control queue. */ static void restart_ctrlq(void *data, int npending) { struct mbuf *m; struct sge_qset *qs = (struct sge_qset *)data; struct sge_txq *q = &qs->txq[TXQ_CTRL]; adapter_t *adap = qs->port->adapter; TXQ_LOCK(qs); again: reclaim_completed_tx_imm(q); while (q->in_use < q->size && (m = mbufq_dequeue(&q->sendq)) != NULL) { write_imm(&q->desc[q->pidx], m->m_data, m->m_len, q->gen); m_free(m); if (++q->pidx >= q->size) { q->pidx = 0; q->gen ^= 1; } q->in_use++; } if (mbufq_len(&q->sendq)) { setbit(&qs->txq_stopped, TXQ_CTRL); if (should_restart_tx(q) && test_and_clear_bit(TXQ_CTRL, &qs->txq_stopped)) goto again; q->stops++; } TXQ_UNLOCK(qs); t3_write_reg(adap, A_SG_KDOORBELL, F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id)); } /* * Send a management message through control queue 0 */ int t3_mgmt_tx(struct adapter *adap, struct mbuf *m) { return ctrl_xmit(adap, &adap->sge.qs[0], m); } /** * free_qset - free the resources of an SGE queue set * @sc: the controller owning the queue set * @q: the queue set * * Release the HW and SW resources associated with an SGE queue set, such * as HW contexts, packet buffers, and descriptor rings. Traffic to the * queue set must be quiesced prior to calling this. */ static void t3_free_qset(adapter_t *sc, struct sge_qset *q) { int i; reclaim_completed_tx(q, 0, TXQ_ETH); if (q->txq[TXQ_ETH].txq_mr != NULL) buf_ring_free(q->txq[TXQ_ETH].txq_mr, M_DEVBUF); if (q->txq[TXQ_ETH].txq_ifq != NULL) { ifq_delete(q->txq[TXQ_ETH].txq_ifq); free(q->txq[TXQ_ETH].txq_ifq, M_DEVBUF); } for (i = 0; i < SGE_RXQ_PER_SET; ++i) { if (q->fl[i].desc) { mtx_lock_spin(&sc->sge.reg_lock); t3_sge_disable_fl(sc, q->fl[i].cntxt_id); mtx_unlock_spin(&sc->sge.reg_lock); bus_dmamap_unload(q->fl[i].desc_tag, q->fl[i].desc_map); bus_dmamem_free(q->fl[i].desc_tag, q->fl[i].desc, q->fl[i].desc_map); bus_dma_tag_destroy(q->fl[i].desc_tag); bus_dma_tag_destroy(q->fl[i].entry_tag); } if (q->fl[i].sdesc) { free_rx_bufs(sc, &q->fl[i]); free(q->fl[i].sdesc, M_DEVBUF); } } mtx_unlock(&q->lock); MTX_DESTROY(&q->lock); for (i = 0; i < SGE_TXQ_PER_SET; i++) { if (q->txq[i].desc) { mtx_lock_spin(&sc->sge.reg_lock); t3_sge_enable_ecntxt(sc, q->txq[i].cntxt_id, 0); mtx_unlock_spin(&sc->sge.reg_lock); bus_dmamap_unload(q->txq[i].desc_tag, q->txq[i].desc_map); bus_dmamem_free(q->txq[i].desc_tag, q->txq[i].desc, q->txq[i].desc_map); bus_dma_tag_destroy(q->txq[i].desc_tag); bus_dma_tag_destroy(q->txq[i].entry_tag); } if (q->txq[i].sdesc) { free(q->txq[i].sdesc, M_DEVBUF); } } if (q->rspq.desc) { mtx_lock_spin(&sc->sge.reg_lock); t3_sge_disable_rspcntxt(sc, q->rspq.cntxt_id); mtx_unlock_spin(&sc->sge.reg_lock); bus_dmamap_unload(q->rspq.desc_tag, q->rspq.desc_map); bus_dmamem_free(q->rspq.desc_tag, q->rspq.desc, q->rspq.desc_map); bus_dma_tag_destroy(q->rspq.desc_tag); MTX_DESTROY(&q->rspq.lock); } #if defined(INET6) || defined(INET) tcp_lro_free(&q->lro.ctrl); #endif bzero(q, sizeof(*q)); } /** * t3_free_sge_resources - free SGE resources * @sc: the adapter softc * * Frees resources used by the SGE queue sets. */ void t3_free_sge_resources(adapter_t *sc, int nqsets) { int i; for (i = 0; i < nqsets; ++i) { TXQ_LOCK(&sc->sge.qs[i]); t3_free_qset(sc, &sc->sge.qs[i]); } } /** * t3_sge_start - enable SGE * @sc: the controller softc * * Enables the SGE for DMAs. This is the last step in starting packet * transfers. */ void t3_sge_start(adapter_t *sc) { t3_set_reg_field(sc, A_SG_CONTROL, F_GLOBALENABLE, F_GLOBALENABLE); } /** * t3_sge_stop - disable SGE operation * @sc: the adapter * * Disables the DMA engine. This can be called in emeregencies (e.g., * from error interrupts) or from normal process context. In the latter * case it also disables any pending queue restart tasklets. Note that * if it is called in interrupt context it cannot disable the restart * tasklets as it cannot wait, however the tasklets will have no effect * since the doorbells are disabled and the driver will call this again * later from process context, at which time the tasklets will be stopped * if they are still running. */ void t3_sge_stop(adapter_t *sc) { int i, nqsets; t3_set_reg_field(sc, A_SG_CONTROL, F_GLOBALENABLE, 0); if (sc->tq == NULL) return; for (nqsets = i = 0; i < (sc)->params.nports; i++) nqsets += sc->port[i].nqsets; #ifdef notyet /* * * XXX */ for (i = 0; i < nqsets; ++i) { struct sge_qset *qs = &sc->sge.qs[i]; taskqueue_drain(sc->tq, &qs->txq[TXQ_OFLD].qresume_task); taskqueue_drain(sc->tq, &qs->txq[TXQ_CTRL].qresume_task); } #endif } /** * t3_free_tx_desc - reclaims Tx descriptors and their buffers * @adapter: the adapter * @q: the Tx queue to reclaim descriptors from * @reclaimable: the number of descriptors to reclaim * @m_vec_size: maximum number of buffers to reclaim * @desc_reclaimed: returns the number of descriptors reclaimed * * Reclaims Tx descriptors from an SGE Tx queue and frees the associated * Tx buffers. Called with the Tx queue lock held. * * Returns number of buffers of reclaimed */ void t3_free_tx_desc(struct sge_qset *qs, int reclaimable, int queue) { struct tx_sw_desc *txsd; unsigned int cidx, mask; struct sge_txq *q = &qs->txq[queue]; #ifdef T3_TRACE T3_TRACE2(sc->tb[q->cntxt_id & 7], "reclaiming %u Tx descriptors at cidx %u", reclaimable, cidx); #endif cidx = q->cidx; mask = q->size - 1; txsd = &q->sdesc[cidx]; mtx_assert(&qs->lock, MA_OWNED); while (reclaimable--) { prefetch(q->sdesc[(cidx + 1) & mask].m); prefetch(q->sdesc[(cidx + 2) & mask].m); if (txsd->m != NULL) { if (txsd->flags & TX_SW_DESC_MAPPED) { bus_dmamap_unload(q->entry_tag, txsd->map); txsd->flags &= ~TX_SW_DESC_MAPPED; } m_freem_list(txsd->m); txsd->m = NULL; } else q->txq_skipped++; ++txsd; if (++cidx == q->size) { cidx = 0; txsd = q->sdesc; } } q->cidx = cidx; } /** * is_new_response - check if a response is newly written * @r: the response descriptor * @q: the response queue * * Returns true if a response descriptor contains a yet unprocessed * response. */ static __inline int is_new_response(const struct rsp_desc *r, const struct sge_rspq *q) { return (r->intr_gen & F_RSPD_GEN2) == q->gen; } #define RSPD_GTS_MASK (F_RSPD_TXQ0_GTS | F_RSPD_TXQ1_GTS) #define RSPD_CTRL_MASK (RSPD_GTS_MASK | \ V_RSPD_TXQ0_CR(M_RSPD_TXQ0_CR) | \ V_RSPD_TXQ1_CR(M_RSPD_TXQ1_CR) | \ V_RSPD_TXQ2_CR(M_RSPD_TXQ2_CR)) /* How long to delay the next interrupt in case of memory shortage, in 0.1us. */ #define NOMEM_INTR_DELAY 2500 #ifdef TCP_OFFLOAD /** * write_ofld_wr - write an offload work request * @adap: the adapter * @m: the packet to send * @q: the Tx queue * @pidx: index of the first Tx descriptor to write * @gen: the generation value to use * @ndesc: number of descriptors the packet will occupy * * Write an offload work request to send the supplied packet. The packet * data already carry the work request with most fields populated. */ static void write_ofld_wr(adapter_t *adap, struct mbuf *m, struct sge_txq *q, unsigned int pidx, unsigned int gen, unsigned int ndesc) { unsigned int sgl_flits, flits; int i, idx, nsegs, wrlen; struct work_request_hdr *from; struct sg_ent *sgp, t3sgl[TX_MAX_SEGS / 2 + 1]; struct tx_desc *d = &q->desc[pidx]; struct txq_state txqs; struct sglist_seg *segs; struct ofld_hdr *oh = mtod(m, struct ofld_hdr *); struct sglist *sgl; from = (void *)(oh + 1); /* Start of WR within mbuf */ wrlen = m->m_len - sizeof(*oh); if (!(oh->flags & F_HDR_SGL)) { write_imm(d, (caddr_t)from, wrlen, gen); /* * mbuf with "real" immediate tx data will be enqueue_wr'd by * t3_push_frames and freed in wr_ack. Others, like those sent * down by close_conn, t3_send_reset, etc. should be freed here. */ if (!(oh->flags & F_HDR_DF)) m_free(m); return; } memcpy(&d->flit[1], &from[1], wrlen - sizeof(*from)); sgl = oh->sgl; flits = wrlen / 8; sgp = (ndesc == 1) ? (struct sg_ent *)&d->flit[flits] : t3sgl; nsegs = sgl->sg_nseg; segs = sgl->sg_segs; for (idx = 0, i = 0; i < nsegs; i++) { KASSERT(segs[i].ss_len, ("%s: 0 len in sgl", __func__)); if (i && idx == 0) ++sgp; sgp->len[idx] = htobe32(segs[i].ss_len); sgp->addr[idx] = htobe64(segs[i].ss_paddr); idx ^= 1; } if (idx) { sgp->len[idx] = 0; sgp->addr[idx] = 0; } sgl_flits = sgl_len(nsegs); txqs.gen = gen; txqs.pidx = pidx; txqs.compl = 0; write_wr_hdr_sgl(ndesc, d, &txqs, q, t3sgl, flits, sgl_flits, from->wrh_hi, from->wrh_lo); } /** * ofld_xmit - send a packet through an offload queue * @adap: the adapter * @q: the Tx offload queue * @m: the packet * * Send an offload packet through an SGE offload queue. */ static int ofld_xmit(adapter_t *adap, struct sge_qset *qs, struct mbuf *m) { int ret; unsigned int ndesc; unsigned int pidx, gen; struct sge_txq *q = &qs->txq[TXQ_OFLD]; struct ofld_hdr *oh = mtod(m, struct ofld_hdr *); ndesc = G_HDR_NDESC(oh->flags); TXQ_LOCK(qs); again: reclaim_completed_tx(qs, 16, TXQ_OFLD); ret = check_desc_avail(adap, q, m, ndesc, TXQ_OFLD); if (__predict_false(ret)) { if (ret == 1) { TXQ_UNLOCK(qs); return (EINTR); } goto again; } gen = q->gen; q->in_use += ndesc; pidx = q->pidx; q->pidx += ndesc; if (q->pidx >= q->size) { q->pidx -= q->size; q->gen ^= 1; } write_ofld_wr(adap, m, q, pidx, gen, ndesc); check_ring_tx_db(adap, q, 1); TXQ_UNLOCK(qs); return (0); } /** * restart_offloadq - restart a suspended offload queue * @qs: the queue set cotaining the offload queue * * Resumes transmission on a suspended Tx offload queue. */ static void restart_offloadq(void *data, int npending) { struct mbuf *m; struct sge_qset *qs = data; struct sge_txq *q = &qs->txq[TXQ_OFLD]; adapter_t *adap = qs->port->adapter; int cleaned; TXQ_LOCK(qs); again: cleaned = reclaim_completed_tx(qs, 16, TXQ_OFLD); while ((m = mbufq_first(&q->sendq)) != NULL) { unsigned int gen, pidx; struct ofld_hdr *oh = mtod(m, struct ofld_hdr *); unsigned int ndesc = G_HDR_NDESC(oh->flags); if (__predict_false(q->size - q->in_use < ndesc)) { setbit(&qs->txq_stopped, TXQ_OFLD); if (should_restart_tx(q) && test_and_clear_bit(TXQ_OFLD, &qs->txq_stopped)) goto again; q->stops++; break; } gen = q->gen; q->in_use += ndesc; pidx = q->pidx; q->pidx += ndesc; if (q->pidx >= q->size) { q->pidx -= q->size; q->gen ^= 1; } (void)mbufq_dequeue(&q->sendq); TXQ_UNLOCK(qs); write_ofld_wr(adap, m, q, pidx, gen, ndesc); TXQ_LOCK(qs); } #if USE_GTS set_bit(TXQ_RUNNING, &q->flags); set_bit(TXQ_LAST_PKT_DB, &q->flags); #endif TXQ_UNLOCK(qs); wmb(); t3_write_reg(adap, A_SG_KDOORBELL, F_SELEGRCNTX | V_EGRCNTX(q->cntxt_id)); } /** * t3_offload_tx - send an offload packet * @m: the packet * * Sends an offload packet. We use the packet priority to select the * appropriate Tx queue as follows: bit 0 indicates whether the packet * should be sent as regular or control, bits 1-3 select the queue set. */ int t3_offload_tx(struct adapter *sc, struct mbuf *m) { struct ofld_hdr *oh = mtod(m, struct ofld_hdr *); struct sge_qset *qs = &sc->sge.qs[G_HDR_QSET(oh->flags)]; if (oh->flags & F_HDR_CTRL) { m_adj(m, sizeof (*oh)); /* trim ofld_hdr off */ return (ctrl_xmit(sc, qs, m)); } else return (ofld_xmit(sc, qs, m)); } #endif static void restart_tx(struct sge_qset *qs) { struct adapter *sc = qs->port->adapter; if (isset(&qs->txq_stopped, TXQ_OFLD) && should_restart_tx(&qs->txq[TXQ_OFLD]) && test_and_clear_bit(TXQ_OFLD, &qs->txq_stopped)) { qs->txq[TXQ_OFLD].restarts++; taskqueue_enqueue(sc->tq, &qs->txq[TXQ_OFLD].qresume_task); } if (isset(&qs->txq_stopped, TXQ_CTRL) && should_restart_tx(&qs->txq[TXQ_CTRL]) && test_and_clear_bit(TXQ_CTRL, &qs->txq_stopped)) { qs->txq[TXQ_CTRL].restarts++; taskqueue_enqueue(sc->tq, &qs->txq[TXQ_CTRL].qresume_task); } } /** * t3_sge_alloc_qset - initialize an SGE queue set * @sc: the controller softc * @id: the queue set id * @nports: how many Ethernet ports will be using this queue set * @irq_vec_idx: the IRQ vector index for response queue interrupts * @p: configuration parameters for this queue set * @ntxq: number of Tx queues for the queue set * @pi: port info for queue set * * Allocate resources and initialize an SGE queue set. A queue set * comprises a response queue, two Rx free-buffer queues, and up to 3 * Tx queues. The Tx queues are assigned roles in the order Ethernet * queue, offload queue, and control queue. */ int t3_sge_alloc_qset(adapter_t *sc, u_int id, int nports, int irq_vec_idx, const struct qset_params *p, int ntxq, struct port_info *pi) { struct sge_qset *q = &sc->sge.qs[id]; int i, ret = 0; MTX_INIT(&q->lock, q->namebuf, NULL, MTX_DEF); q->port = pi; q->adap = sc; if ((q->txq[TXQ_ETH].txq_mr = buf_ring_alloc(cxgb_txq_buf_ring_size, M_DEVBUF, M_WAITOK, &q->lock)) == NULL) { device_printf(sc->dev, "failed to allocate mbuf ring\n"); goto err; } if ((q->txq[TXQ_ETH].txq_ifq = malloc(sizeof(struct ifaltq), M_DEVBUF, M_NOWAIT | M_ZERO)) == NULL) { device_printf(sc->dev, "failed to allocate ifq\n"); goto err; } ifq_init(q->txq[TXQ_ETH].txq_ifq, pi->ifp); callout_init(&q->txq[TXQ_ETH].txq_timer, 1); callout_init(&q->txq[TXQ_ETH].txq_watchdog, 1); q->txq[TXQ_ETH].txq_timer.c_cpu = id % mp_ncpus; q->txq[TXQ_ETH].txq_watchdog.c_cpu = id % mp_ncpus; init_qset_cntxt(q, id); q->idx = id; if ((ret = alloc_ring(sc, p->fl_size, sizeof(struct rx_desc), sizeof(struct rx_sw_desc), &q->fl[0].phys_addr, &q->fl[0].desc, &q->fl[0].sdesc, &q->fl[0].desc_tag, &q->fl[0].desc_map, sc->rx_dmat, &q->fl[0].entry_tag)) != 0) { printf("error %d from alloc ring fl0\n", ret); goto err; } if ((ret = alloc_ring(sc, p->jumbo_size, sizeof(struct rx_desc), sizeof(struct rx_sw_desc), &q->fl[1].phys_addr, &q->fl[1].desc, &q->fl[1].sdesc, &q->fl[1].desc_tag, &q->fl[1].desc_map, sc->rx_jumbo_dmat, &q->fl[1].entry_tag)) != 0) { printf("error %d from alloc ring fl1\n", ret); goto err; } if ((ret = alloc_ring(sc, p->rspq_size, sizeof(struct rsp_desc), 0, &q->rspq.phys_addr, &q->rspq.desc, NULL, &q->rspq.desc_tag, &q->rspq.desc_map, NULL, NULL)) != 0) { printf("error %d from alloc ring rspq\n", ret); goto err; } snprintf(q->rspq.lockbuf, RSPQ_NAME_LEN, "t3 rspq lock %d:%d", device_get_unit(sc->dev), irq_vec_idx); MTX_INIT(&q->rspq.lock, q->rspq.lockbuf, NULL, MTX_DEF); for (i = 0; i < ntxq; ++i) { size_t sz = i == TXQ_CTRL ? 0 : sizeof(struct tx_sw_desc); if ((ret = alloc_ring(sc, p->txq_size[i], sizeof(struct tx_desc), sz, &q->txq[i].phys_addr, &q->txq[i].desc, &q->txq[i].sdesc, &q->txq[i].desc_tag, &q->txq[i].desc_map, sc->tx_dmat, &q->txq[i].entry_tag)) != 0) { printf("error %d from alloc ring tx %i\n", ret, i); goto err; } mbufq_init(&q->txq[i].sendq, INT_MAX); q->txq[i].gen = 1; q->txq[i].size = p->txq_size[i]; } #ifdef TCP_OFFLOAD TASK_INIT(&q->txq[TXQ_OFLD].qresume_task, 0, restart_offloadq, q); #endif TASK_INIT(&q->txq[TXQ_CTRL].qresume_task, 0, restart_ctrlq, q); TASK_INIT(&q->txq[TXQ_ETH].qreclaim_task, 0, sge_txq_reclaim_handler, q); TASK_INIT(&q->txq[TXQ_OFLD].qreclaim_task, 0, sge_txq_reclaim_handler, q); q->fl[0].gen = q->fl[1].gen = 1; q->fl[0].size = p->fl_size; q->fl[1].size = p->jumbo_size; q->rspq.gen = 1; q->rspq.cidx = 0; q->rspq.size = p->rspq_size; q->txq[TXQ_ETH].stop_thres = nports * flits_to_desc(sgl_len(TX_MAX_SEGS + 1) + 3); q->fl[0].buf_size = MCLBYTES; q->fl[0].zone = zone_pack; q->fl[0].type = EXT_PACKET; if (p->jumbo_buf_size == MJUM16BYTES) { q->fl[1].zone = zone_jumbo16; q->fl[1].type = EXT_JUMBO16; } else if (p->jumbo_buf_size == MJUM9BYTES) { q->fl[1].zone = zone_jumbo9; q->fl[1].type = EXT_JUMBO9; } else if (p->jumbo_buf_size == MJUMPAGESIZE) { q->fl[1].zone = zone_jumbop; q->fl[1].type = EXT_JUMBOP; } else { KASSERT(0, ("can't deal with jumbo_buf_size %d.", p->jumbo_buf_size)); ret = EDOOFUS; goto err; } q->fl[1].buf_size = p->jumbo_buf_size; /* Allocate and setup the lro_ctrl structure */ q->lro.enabled = !!(pi->ifp->if_capenable & IFCAP_LRO); #if defined(INET6) || defined(INET) ret = tcp_lro_init(&q->lro.ctrl); if (ret) { printf("error %d from tcp_lro_init\n", ret); goto err; } #endif q->lro.ctrl.ifp = pi->ifp; mtx_lock_spin(&sc->sge.reg_lock); ret = -t3_sge_init_rspcntxt(sc, q->rspq.cntxt_id, irq_vec_idx, q->rspq.phys_addr, q->rspq.size, q->fl[0].buf_size, 1, 0); if (ret) { printf("error %d from t3_sge_init_rspcntxt\n", ret); goto err_unlock; } for (i = 0; i < SGE_RXQ_PER_SET; ++i) { ret = -t3_sge_init_flcntxt(sc, q->fl[i].cntxt_id, 0, q->fl[i].phys_addr, q->fl[i].size, q->fl[i].buf_size, p->cong_thres, 1, 0); if (ret) { printf("error %d from t3_sge_init_flcntxt for index i=%d\n", ret, i); goto err_unlock; } } ret = -t3_sge_init_ecntxt(sc, q->txq[TXQ_ETH].cntxt_id, USE_GTS, SGE_CNTXT_ETH, id, q->txq[TXQ_ETH].phys_addr, q->txq[TXQ_ETH].size, q->txq[TXQ_ETH].token, 1, 0); if (ret) { printf("error %d from t3_sge_init_ecntxt\n", ret); goto err_unlock; } if (ntxq > 1) { ret = -t3_sge_init_ecntxt(sc, q->txq[TXQ_OFLD].cntxt_id, USE_GTS, SGE_CNTXT_OFLD, id, q->txq[TXQ_OFLD].phys_addr, q->txq[TXQ_OFLD].size, 0, 1, 0); if (ret) { printf("error %d from t3_sge_init_ecntxt\n", ret); goto err_unlock; } } if (ntxq > 2) { ret = -t3_sge_init_ecntxt(sc, q->txq[TXQ_CTRL].cntxt_id, 0, SGE_CNTXT_CTRL, id, q->txq[TXQ_CTRL].phys_addr, q->txq[TXQ_CTRL].size, q->txq[TXQ_CTRL].token, 1, 0); if (ret) { printf("error %d from t3_sge_init_ecntxt\n", ret); goto err_unlock; } } mtx_unlock_spin(&sc->sge.reg_lock); t3_update_qset_coalesce(q, p); refill_fl(sc, &q->fl[0], q->fl[0].size); refill_fl(sc, &q->fl[1], q->fl[1].size); refill_rspq(sc, &q->rspq, q->rspq.size - 1); t3_write_reg(sc, A_SG_GTS, V_RSPQ(q->rspq.cntxt_id) | V_NEWTIMER(q->rspq.holdoff_tmr)); return (0); err_unlock: mtx_unlock_spin(&sc->sge.reg_lock); err: TXQ_LOCK(q); t3_free_qset(sc, q); return (ret); } /* * Remove CPL_RX_PKT headers from the mbuf and reduce it to a regular mbuf with * ethernet data. Hardware assistance with various checksums and any vlan tag * will also be taken into account here. */ void t3_rx_eth(struct adapter *adap, struct mbuf *m, int ethpad) { struct cpl_rx_pkt *cpl = (struct cpl_rx_pkt *)(mtod(m, uint8_t *) + ethpad); struct port_info *pi = &adap->port[adap->rxpkt_map[cpl->iff]]; struct ifnet *ifp = pi->ifp; if (cpl->vlan_valid) { m->m_pkthdr.ether_vtag = ntohs(cpl->vlan); m->m_flags |= M_VLANTAG; } m->m_pkthdr.rcvif = ifp; /* * adjust after conversion to mbuf chain */ m->m_pkthdr.len -= (sizeof(*cpl) + ethpad); m->m_len -= (sizeof(*cpl) + ethpad); m->m_data += (sizeof(*cpl) + ethpad); if (!cpl->fragment && cpl->csum_valid && cpl->csum == 0xffff) { struct ether_header *eh = mtod(m, void *); uint16_t eh_type; if (eh->ether_type == htons(ETHERTYPE_VLAN)) { struct ether_vlan_header *evh = mtod(m, void *); eh_type = evh->evl_proto; } else eh_type = eh->ether_type; if (ifp->if_capenable & IFCAP_RXCSUM && eh_type == htons(ETHERTYPE_IP)) { m->m_pkthdr.csum_flags = (CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR); m->m_pkthdr.csum_data = 0xffff; } else if (ifp->if_capenable & IFCAP_RXCSUM_IPV6 && eh_type == htons(ETHERTYPE_IPV6)) { m->m_pkthdr.csum_flags = (CSUM_DATA_VALID_IPV6 | CSUM_PSEUDO_HDR); m->m_pkthdr.csum_data = 0xffff; } } } /** * get_packet - return the next ingress packet buffer from a free list * @adap: the adapter that received the packet * @drop_thres: # of remaining buffers before we start dropping packets * @qs: the qset that the SGE free list holding the packet belongs to * @mh: the mbuf header, contains a pointer to the head and tail of the mbuf chain * @r: response descriptor * * Get the next packet from a free list and complete setup of the * sk_buff. If the packet is small we make a copy and recycle the * original buffer, otherwise we use the original buffer itself. If a * positive drop threshold is supplied packets are dropped and their * buffers recycled if (a) the number of remaining buffers is under the * threshold and the packet is too big to copy, or (b) the packet should * be copied but there is no memory for the copy. */ static int get_packet(adapter_t *adap, unsigned int drop_thres, struct sge_qset *qs, struct t3_mbuf_hdr *mh, struct rsp_desc *r) { unsigned int len_cq = ntohl(r->len_cq); struct sge_fl *fl = (len_cq & F_RSPD_FLQ) ? &qs->fl[1] : &qs->fl[0]; int mask, cidx = fl->cidx; struct rx_sw_desc *sd = &fl->sdesc[cidx]; uint32_t len = G_RSPD_LEN(len_cq); uint32_t flags = M_EXT; uint8_t sopeop = G_RSPD_SOP_EOP(ntohl(r->flags)); caddr_t cl; struct mbuf *m; int ret = 0; mask = fl->size - 1; prefetch(fl->sdesc[(cidx + 1) & mask].m); prefetch(fl->sdesc[(cidx + 2) & mask].m); prefetch(fl->sdesc[(cidx + 1) & mask].rxsd_cl); prefetch(fl->sdesc[(cidx + 2) & mask].rxsd_cl); fl->credits--; bus_dmamap_sync(fl->entry_tag, sd->map, BUS_DMASYNC_POSTREAD); if (recycle_enable && len <= SGE_RX_COPY_THRES && sopeop == RSPQ_SOP_EOP) { if ((m = m_gethdr(M_NOWAIT, MT_DATA)) == NULL) goto skip_recycle; cl = mtod(m, void *); memcpy(cl, sd->rxsd_cl, len); recycle_rx_buf(adap, fl, fl->cidx); m->m_pkthdr.len = m->m_len = len; m->m_flags = 0; mh->mh_head = mh->mh_tail = m; ret = 1; goto done; } else { skip_recycle: bus_dmamap_unload(fl->entry_tag, sd->map); cl = sd->rxsd_cl; m = sd->m; if ((sopeop == RSPQ_SOP_EOP) || (sopeop == RSPQ_SOP)) flags |= M_PKTHDR; m_init(m, M_NOWAIT, MT_DATA, flags); if (fl->zone == zone_pack) { /* * restore clobbered data pointer */ m->m_data = m->m_ext.ext_buf; } else { m_cljset(m, cl, fl->type); } m->m_len = len; } switch(sopeop) { case RSPQ_SOP_EOP: ret = 1; /* FALLTHROUGH */ case RSPQ_SOP: mh->mh_head = mh->mh_tail = m; m->m_pkthdr.len = len; break; case RSPQ_EOP: ret = 1; /* FALLTHROUGH */ case RSPQ_NSOP_NEOP: if (mh->mh_tail == NULL) { log(LOG_ERR, "discarding intermediate descriptor entry\n"); m_freem(m); break; } mh->mh_tail->m_next = m; mh->mh_tail = m; mh->mh_head->m_pkthdr.len += len; break; } if (cxgb_debug) printf("len=%d pktlen=%d\n", m->m_len, m->m_pkthdr.len); done: if (++fl->cidx == fl->size) fl->cidx = 0; return (ret); } /** * handle_rsp_cntrl_info - handles control information in a response * @qs: the queue set corresponding to the response * @flags: the response control flags * * Handles the control information of an SGE response, such as GTS * indications and completion credits for the queue set's Tx queues. * HW coalesces credits, we don't do any extra SW coalescing. */ static __inline void handle_rsp_cntrl_info(struct sge_qset *qs, uint32_t flags) { unsigned int credits; #if USE_GTS if (flags & F_RSPD_TXQ0_GTS) clear_bit(TXQ_RUNNING, &qs->txq[TXQ_ETH].flags); #endif credits = G_RSPD_TXQ0_CR(flags); if (credits) qs->txq[TXQ_ETH].processed += credits; credits = G_RSPD_TXQ2_CR(flags); if (credits) qs->txq[TXQ_CTRL].processed += credits; # if USE_GTS if (flags & F_RSPD_TXQ1_GTS) clear_bit(TXQ_RUNNING, &qs->txq[TXQ_OFLD].flags); # endif credits = G_RSPD_TXQ1_CR(flags); if (credits) qs->txq[TXQ_OFLD].processed += credits; } static void check_ring_db(adapter_t *adap, struct sge_qset *qs, unsigned int sleeping) { ; } /** * process_responses - process responses from an SGE response queue * @adap: the adapter * @qs: the queue set to which the response queue belongs * @budget: how many responses can be processed in this round * * Process responses from an SGE response queue up to the supplied budget. * Responses include received packets as well as credits and other events * for the queues that belong to the response queue's queue set. * A negative budget is effectively unlimited. * * Additionally choose the interrupt holdoff time for the next interrupt * on this queue. If the system is under memory shortage use a fairly * long delay to help recovery. */ static int process_responses(adapter_t *adap, struct sge_qset *qs, int budget) { struct sge_rspq *rspq = &qs->rspq; struct rsp_desc *r = &rspq->desc[rspq->cidx]; int budget_left = budget; unsigned int sleeping = 0; #if defined(INET6) || defined(INET) int lro_enabled = qs->lro.enabled; int skip_lro; struct lro_ctrl *lro_ctrl = &qs->lro.ctrl; #endif struct t3_mbuf_hdr *mh = &rspq->rspq_mh; #ifdef DEBUG static int last_holdoff = 0; if (cxgb_debug && rspq->holdoff_tmr != last_holdoff) { printf("next_holdoff=%d\n", rspq->holdoff_tmr); last_holdoff = rspq->holdoff_tmr; } #endif rspq->next_holdoff = rspq->holdoff_tmr; while (__predict_true(budget_left && is_new_response(r, rspq))) { int eth, eop = 0, ethpad = 0; uint32_t flags = ntohl(r->flags); uint32_t rss_hash = be32toh(r->rss_hdr.rss_hash_val); uint8_t opcode = r->rss_hdr.opcode; eth = (opcode == CPL_RX_PKT); if (__predict_false(flags & F_RSPD_ASYNC_NOTIF)) { struct mbuf *m; if (cxgb_debug) printf("async notification\n"); if (mh->mh_head == NULL) { mh->mh_head = m_gethdr(M_NOWAIT, MT_DATA); m = mh->mh_head; } else { m = m_gethdr(M_NOWAIT, MT_DATA); } if (m == NULL) goto no_mem; memcpy(mtod(m, char *), r, AN_PKT_SIZE); m->m_len = m->m_pkthdr.len = AN_PKT_SIZE; *mtod(m, char *) = CPL_ASYNC_NOTIF; opcode = CPL_ASYNC_NOTIF; eop = 1; rspq->async_notif++; goto skip; } else if (flags & F_RSPD_IMM_DATA_VALID) { struct mbuf *m = m_gethdr(M_NOWAIT, MT_DATA); if (m == NULL) { no_mem: rspq->next_holdoff = NOMEM_INTR_DELAY; budget_left--; break; } if (mh->mh_head == NULL) mh->mh_head = m; else mh->mh_tail->m_next = m; mh->mh_tail = m; get_imm_packet(adap, r, m); mh->mh_head->m_pkthdr.len += m->m_len; eop = 1; rspq->imm_data++; } else if (r->len_cq) { int drop_thresh = eth ? SGE_RX_DROP_THRES : 0; eop = get_packet(adap, drop_thresh, qs, mh, r); if (eop) { if (r->rss_hdr.hash_type && !adap->timestamp) { - M_HASHTYPE_SET(mh->mh_head, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(mh->mh_head, + M_HASHTYPE_OPAQUE_HASH); mh->mh_head->m_pkthdr.flowid = rss_hash; } } ethpad = 2; } else { rspq->pure_rsps++; } skip: if (flags & RSPD_CTRL_MASK) { sleeping |= flags & RSPD_GTS_MASK; handle_rsp_cntrl_info(qs, flags); } if (!eth && eop) { rspq->offload_pkts++; #ifdef TCP_OFFLOAD adap->cpl_handler[opcode](qs, r, mh->mh_head); #else m_freem(mh->mh_head); #endif mh->mh_head = NULL; } else if (eth && eop) { struct mbuf *m = mh->mh_head; t3_rx_eth(adap, m, ethpad); /* * The T304 sends incoming packets on any qset. If LRO * is also enabled, we could end up sending packet up * lro_ctrl->ifp's input. That is incorrect. * * The mbuf's rcvif was derived from the cpl header and * is accurate. Skip LRO and just use that. */ #if defined(INET6) || defined(INET) skip_lro = __predict_false(qs->port->ifp != m->m_pkthdr.rcvif); if (lro_enabled && lro_ctrl->lro_cnt && !skip_lro && (tcp_lro_rx(lro_ctrl, m, 0) == 0) ) { /* successfully queue'd for LRO */ } else #endif { /* * LRO not enabled, packet unsuitable for LRO, * or unable to queue. Pass it up right now in * either case. */ struct ifnet *ifp = m->m_pkthdr.rcvif; (*ifp->if_input)(ifp, m); } mh->mh_head = NULL; } r++; if (__predict_false(++rspq->cidx == rspq->size)) { rspq->cidx = 0; rspq->gen ^= 1; r = rspq->desc; } if (++rspq->credits >= 64) { refill_rspq(adap, rspq, rspq->credits); rspq->credits = 0; } __refill_fl_lt(adap, &qs->fl[0], 32); __refill_fl_lt(adap, &qs->fl[1], 32); --budget_left; } #if defined(INET6) || defined(INET) /* Flush LRO */ tcp_lro_flush_all(lro_ctrl); #endif if (sleeping) check_ring_db(adap, qs, sleeping); mb(); /* commit Tx queue processed updates */ if (__predict_false(qs->txq_stopped > 1)) restart_tx(qs); __refill_fl_lt(adap, &qs->fl[0], 512); __refill_fl_lt(adap, &qs->fl[1], 512); budget -= budget_left; return (budget); } /* * A helper function that processes responses and issues GTS. */ static __inline int process_responses_gts(adapter_t *adap, struct sge_rspq *rq) { int work; static int last_holdoff = 0; work = process_responses(adap, rspq_to_qset(rq), -1); if (cxgb_debug && (rq->next_holdoff != last_holdoff)) { printf("next_holdoff=%d\n", rq->next_holdoff); last_holdoff = rq->next_holdoff; } t3_write_reg(adap, A_SG_GTS, V_RSPQ(rq->cntxt_id) | V_NEWTIMER(rq->next_holdoff) | V_NEWINDEX(rq->cidx)); return (work); } /* * Interrupt handler for legacy INTx interrupts for T3B-based cards. * Handles data events from SGE response queues as well as error and other * async events as they all use the same interrupt pin. We use one SGE * response queue per port in this mode and protect all response queues with * queue 0's lock. */ void t3b_intr(void *data) { uint32_t i, map; adapter_t *adap = data; struct sge_rspq *q0 = &adap->sge.qs[0].rspq; t3_write_reg(adap, A_PL_CLI, 0); map = t3_read_reg(adap, A_SG_DATA_INTR); if (!map) return; if (__predict_false(map & F_ERRINTR)) { t3_write_reg(adap, A_PL_INT_ENABLE0, 0); (void) t3_read_reg(adap, A_PL_INT_ENABLE0); taskqueue_enqueue(adap->tq, &adap->slow_intr_task); } mtx_lock(&q0->lock); for_each_port(adap, i) if (map & (1 << i)) process_responses_gts(adap, &adap->sge.qs[i].rspq); mtx_unlock(&q0->lock); } /* * The MSI interrupt handler. This needs to handle data events from SGE * response queues as well as error and other async events as they all use * the same MSI vector. We use one SGE response queue per port in this mode * and protect all response queues with queue 0's lock. */ void t3_intr_msi(void *data) { adapter_t *adap = data; struct sge_rspq *q0 = &adap->sge.qs[0].rspq; int i, new_packets = 0; mtx_lock(&q0->lock); for_each_port(adap, i) if (process_responses_gts(adap, &adap->sge.qs[i].rspq)) new_packets = 1; mtx_unlock(&q0->lock); if (new_packets == 0) { t3_write_reg(adap, A_PL_INT_ENABLE0, 0); (void) t3_read_reg(adap, A_PL_INT_ENABLE0); taskqueue_enqueue(adap->tq, &adap->slow_intr_task); } } void t3_intr_msix(void *data) { struct sge_qset *qs = data; adapter_t *adap = qs->port->adapter; struct sge_rspq *rspq = &qs->rspq; if (process_responses_gts(adap, rspq) == 0) rspq->unhandled_irqs++; } #define QDUMP_SBUF_SIZE 32 * 400 static int t3_dump_rspq(SYSCTL_HANDLER_ARGS) { struct sge_rspq *rspq; struct sge_qset *qs; int i, err, dump_end, idx; struct sbuf *sb; struct rsp_desc *rspd; uint32_t data[4]; rspq = arg1; qs = rspq_to_qset(rspq); if (rspq->rspq_dump_count == 0) return (0); if (rspq->rspq_dump_count > RSPQ_Q_SIZE) { log(LOG_WARNING, "dump count is too large %d\n", rspq->rspq_dump_count); rspq->rspq_dump_count = 0; return (EINVAL); } if (rspq->rspq_dump_start > (RSPQ_Q_SIZE-1)) { log(LOG_WARNING, "dump start of %d is greater than queue size\n", rspq->rspq_dump_start); rspq->rspq_dump_start = 0; return (EINVAL); } err = t3_sge_read_rspq(qs->port->adapter, rspq->cntxt_id, data); if (err) return (err); err = sysctl_wire_old_buffer(req, 0); if (err) return (err); sb = sbuf_new_for_sysctl(NULL, NULL, QDUMP_SBUF_SIZE, req); sbuf_printf(sb, " \n index=%u size=%u MSI-X/RspQ=%u intr enable=%u intr armed=%u\n", (data[0] & 0xffff), data[0] >> 16, ((data[2] >> 20) & 0x3f), ((data[2] >> 26) & 1), ((data[2] >> 27) & 1)); sbuf_printf(sb, " generation=%u CQ mode=%u FL threshold=%u\n", ((data[2] >> 28) & 1), ((data[2] >> 31) & 1), data[3]); sbuf_printf(sb, " start=%d -> end=%d\n", rspq->rspq_dump_start, (rspq->rspq_dump_start + rspq->rspq_dump_count) & (RSPQ_Q_SIZE-1)); dump_end = rspq->rspq_dump_start + rspq->rspq_dump_count; for (i = rspq->rspq_dump_start; i < dump_end; i++) { idx = i & (RSPQ_Q_SIZE-1); rspd = &rspq->desc[idx]; sbuf_printf(sb, "\tidx=%04d opcode=%02x cpu_idx=%x hash_type=%x cq_idx=%x\n", idx, rspd->rss_hdr.opcode, rspd->rss_hdr.cpu_idx, rspd->rss_hdr.hash_type, be16toh(rspd->rss_hdr.cq_idx)); sbuf_printf(sb, "\trss_hash_val=%x flags=%08x len_cq=%x intr_gen=%x\n", rspd->rss_hdr.rss_hash_val, be32toh(rspd->flags), be32toh(rspd->len_cq), rspd->intr_gen); } err = sbuf_finish(sb); sbuf_delete(sb); return (err); } static int t3_dump_txq_eth(SYSCTL_HANDLER_ARGS) { struct sge_txq *txq; struct sge_qset *qs; int i, j, err, dump_end; struct sbuf *sb; struct tx_desc *txd; uint32_t *WR, wr_hi, wr_lo, gen; uint32_t data[4]; txq = arg1; qs = txq_to_qset(txq, TXQ_ETH); if (txq->txq_dump_count == 0) { return (0); } if (txq->txq_dump_count > TX_ETH_Q_SIZE) { log(LOG_WARNING, "dump count is too large %d\n", txq->txq_dump_count); txq->txq_dump_count = 1; return (EINVAL); } if (txq->txq_dump_start > (TX_ETH_Q_SIZE-1)) { log(LOG_WARNING, "dump start of %d is greater than queue size\n", txq->txq_dump_start); txq->txq_dump_start = 0; return (EINVAL); } err = t3_sge_read_ecntxt(qs->port->adapter, qs->rspq.cntxt_id, data); if (err) return (err); err = sysctl_wire_old_buffer(req, 0); if (err) return (err); sb = sbuf_new_for_sysctl(NULL, NULL, QDUMP_SBUF_SIZE, req); sbuf_printf(sb, " \n credits=%u GTS=%u index=%u size=%u rspq#=%u cmdq#=%u\n", (data[0] & 0x7fff), ((data[0] >> 15) & 1), (data[0] >> 16), (data[1] & 0xffff), ((data[3] >> 4) & 7), ((data[3] >> 7) & 1)); sbuf_printf(sb, " TUN=%u TOE=%u generation%u uP token=%u valid=%u\n", ((data[3] >> 8) & 1), ((data[3] >> 9) & 1), ((data[3] >> 10) & 1), ((data[3] >> 11) & 0xfffff), ((data[3] >> 31) & 1)); sbuf_printf(sb, " qid=%d start=%d -> end=%d\n", qs->idx, txq->txq_dump_start, (txq->txq_dump_start + txq->txq_dump_count) & (TX_ETH_Q_SIZE-1)); dump_end = txq->txq_dump_start + txq->txq_dump_count; for (i = txq->txq_dump_start; i < dump_end; i++) { txd = &txq->desc[i & (TX_ETH_Q_SIZE-1)]; WR = (uint32_t *)txd->flit; wr_hi = ntohl(WR[0]); wr_lo = ntohl(WR[1]); gen = G_WR_GEN(wr_lo); sbuf_printf(sb," wr_hi %08x wr_lo %08x gen %d\n", wr_hi, wr_lo, gen); for (j = 2; j < 30; j += 4) sbuf_printf(sb, "\t%08x %08x %08x %08x \n", WR[j], WR[j + 1], WR[j + 2], WR[j + 3]); } err = sbuf_finish(sb); sbuf_delete(sb); return (err); } static int t3_dump_txq_ctrl(SYSCTL_HANDLER_ARGS) { struct sge_txq *txq; struct sge_qset *qs; int i, j, err, dump_end; struct sbuf *sb; struct tx_desc *txd; uint32_t *WR, wr_hi, wr_lo, gen; txq = arg1; qs = txq_to_qset(txq, TXQ_CTRL); if (txq->txq_dump_count == 0) { return (0); } if (txq->txq_dump_count > 256) { log(LOG_WARNING, "dump count is too large %d\n", txq->txq_dump_count); txq->txq_dump_count = 1; return (EINVAL); } if (txq->txq_dump_start > 255) { log(LOG_WARNING, "dump start of %d is greater than queue size\n", txq->txq_dump_start); txq->txq_dump_start = 0; return (EINVAL); } err = sysctl_wire_old_buffer(req, 0); if (err != 0) return (err); sb = sbuf_new_for_sysctl(NULL, NULL, QDUMP_SBUF_SIZE, req); sbuf_printf(sb, " qid=%d start=%d -> end=%d\n", qs->idx, txq->txq_dump_start, (txq->txq_dump_start + txq->txq_dump_count) & 255); dump_end = txq->txq_dump_start + txq->txq_dump_count; for (i = txq->txq_dump_start; i < dump_end; i++) { txd = &txq->desc[i & (255)]; WR = (uint32_t *)txd->flit; wr_hi = ntohl(WR[0]); wr_lo = ntohl(WR[1]); gen = G_WR_GEN(wr_lo); sbuf_printf(sb," wr_hi %08x wr_lo %08x gen %d\n", wr_hi, wr_lo, gen); for (j = 2; j < 30; j += 4) sbuf_printf(sb, "\t%08x %08x %08x %08x \n", WR[j], WR[j + 1], WR[j + 2], WR[j + 3]); } err = sbuf_finish(sb); sbuf_delete(sb); return (err); } static int t3_set_coalesce_usecs(SYSCTL_HANDLER_ARGS) { adapter_t *sc = arg1; struct qset_params *qsp = &sc->params.sge.qset[0]; int coalesce_usecs; struct sge_qset *qs; int i, j, err, nqsets = 0; struct mtx *lock; if ((sc->flags & FULL_INIT_DONE) == 0) return (ENXIO); coalesce_usecs = qsp->coalesce_usecs; err = sysctl_handle_int(oidp, &coalesce_usecs, arg2, req); if (err != 0) { return (err); } if (coalesce_usecs == qsp->coalesce_usecs) return (0); for (i = 0; i < sc->params.nports; i++) for (j = 0; j < sc->port[i].nqsets; j++) nqsets++; coalesce_usecs = max(1, coalesce_usecs); for (i = 0; i < nqsets; i++) { qs = &sc->sge.qs[i]; qsp = &sc->params.sge.qset[i]; qsp->coalesce_usecs = coalesce_usecs; lock = (sc->flags & USING_MSIX) ? &qs->rspq.lock : &sc->sge.qs[0].rspq.lock; mtx_lock(lock); t3_update_qset_coalesce(qs, qsp); t3_write_reg(sc, A_SG_GTS, V_RSPQ(qs->rspq.cntxt_id) | V_NEWTIMER(qs->rspq.holdoff_tmr)); mtx_unlock(lock); } return (0); } static int t3_pkt_timestamp(SYSCTL_HANDLER_ARGS) { adapter_t *sc = arg1; int rc, timestamp; if ((sc->flags & FULL_INIT_DONE) == 0) return (ENXIO); timestamp = sc->timestamp; rc = sysctl_handle_int(oidp, ×tamp, arg2, req); if (rc != 0) return (rc); if (timestamp != sc->timestamp) { t3_set_reg_field(sc, A_TP_PC_CONFIG2, F_ENABLERXPKTTMSTPRSS, timestamp ? F_ENABLERXPKTTMSTPRSS : 0); sc->timestamp = timestamp; } return (0); } void t3_add_attach_sysctls(adapter_t *sc) { struct sysctl_ctx_list *ctx; struct sysctl_oid_list *children; ctx = device_get_sysctl_ctx(sc->dev); children = SYSCTL_CHILDREN(device_get_sysctl_tree(sc->dev)); /* random information */ SYSCTL_ADD_STRING(ctx, children, OID_AUTO, "firmware_version", CTLFLAG_RD, sc->fw_version, 0, "firmware version"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "hw_revision", CTLFLAG_RD, &sc->params.rev, 0, "chip model"); SYSCTL_ADD_STRING(ctx, children, OID_AUTO, "port_types", CTLFLAG_RD, sc->port_types, 0, "type of ports"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "enable_debug", CTLFLAG_RW, &cxgb_debug, 0, "enable verbose debugging output"); SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, "tunq_coalesce", CTLFLAG_RD, &sc->tunq_coalesce, "#tunneled packets freed"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "txq_overrun", CTLFLAG_RD, &txq_fills, 0, "#times txq overrun"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "core_clock", CTLFLAG_RD, &sc->params.vpd.cclk, 0, "core clock frequency (in KHz)"); } static const char *rspq_name = "rspq"; static const char *txq_names[] = { "txq_eth", "txq_ofld", "txq_ctrl" }; static int sysctl_handle_macstat(SYSCTL_HANDLER_ARGS) { struct port_info *p = arg1; uint64_t *parg; if (!p) return (EINVAL); cxgb_refresh_stats(p); parg = (uint64_t *) ((uint8_t *)&p->mac.stats + arg2); return (sysctl_handle_64(oidp, parg, 0, req)); } void t3_add_configured_sysctls(adapter_t *sc) { struct sysctl_ctx_list *ctx; struct sysctl_oid_list *children; int i, j; ctx = device_get_sysctl_ctx(sc->dev); children = SYSCTL_CHILDREN(device_get_sysctl_tree(sc->dev)); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "intr_coal", CTLTYPE_INT|CTLFLAG_RW, sc, 0, t3_set_coalesce_usecs, "I", "interrupt coalescing timer (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "pkt_timestamp", CTLTYPE_INT | CTLFLAG_RW, sc, 0, t3_pkt_timestamp, "I", "provide packet timestamp instead of connection hash"); for (i = 0; i < sc->params.nports; i++) { struct port_info *pi = &sc->port[i]; struct sysctl_oid *poid; struct sysctl_oid_list *poidlist; struct mac_stats *mstats = &pi->mac.stats; snprintf(pi->namebuf, PORT_NAME_LEN, "port%d", i); poid = SYSCTL_ADD_NODE(ctx, children, OID_AUTO, pi->namebuf, CTLFLAG_RD, NULL, "port statistics"); poidlist = SYSCTL_CHILDREN(poid); SYSCTL_ADD_UINT(ctx, poidlist, OID_AUTO, "nqsets", CTLFLAG_RD, &pi->nqsets, 0, "#queue sets"); for (j = 0; j < pi->nqsets; j++) { struct sge_qset *qs = &sc->sge.qs[pi->first_qset + j]; struct sysctl_oid *qspoid, *rspqpoid, *txqpoid, *ctrlqpoid, *lropoid; struct sysctl_oid_list *qspoidlist, *rspqpoidlist, *txqpoidlist, *ctrlqpoidlist, *lropoidlist; struct sge_txq *txq = &qs->txq[TXQ_ETH]; snprintf(qs->namebuf, QS_NAME_LEN, "qs%d", j); qspoid = SYSCTL_ADD_NODE(ctx, poidlist, OID_AUTO, qs->namebuf, CTLFLAG_RD, NULL, "qset statistics"); qspoidlist = SYSCTL_CHILDREN(qspoid); SYSCTL_ADD_UINT(ctx, qspoidlist, OID_AUTO, "fl0_empty", CTLFLAG_RD, &qs->fl[0].empty, 0, "freelist #0 empty"); SYSCTL_ADD_UINT(ctx, qspoidlist, OID_AUTO, "fl1_empty", CTLFLAG_RD, &qs->fl[1].empty, 0, "freelist #1 empty"); rspqpoid = SYSCTL_ADD_NODE(ctx, qspoidlist, OID_AUTO, rspq_name, CTLFLAG_RD, NULL, "rspq statistics"); rspqpoidlist = SYSCTL_CHILDREN(rspqpoid); txqpoid = SYSCTL_ADD_NODE(ctx, qspoidlist, OID_AUTO, txq_names[0], CTLFLAG_RD, NULL, "txq statistics"); txqpoidlist = SYSCTL_CHILDREN(txqpoid); ctrlqpoid = SYSCTL_ADD_NODE(ctx, qspoidlist, OID_AUTO, txq_names[2], CTLFLAG_RD, NULL, "ctrlq statistics"); ctrlqpoidlist = SYSCTL_CHILDREN(ctrlqpoid); lropoid = SYSCTL_ADD_NODE(ctx, qspoidlist, OID_AUTO, "lro_stats", CTLFLAG_RD, NULL, "LRO statistics"); lropoidlist = SYSCTL_CHILDREN(lropoid); SYSCTL_ADD_UINT(ctx, rspqpoidlist, OID_AUTO, "size", CTLFLAG_RD, &qs->rspq.size, 0, "#entries in response queue"); SYSCTL_ADD_UINT(ctx, rspqpoidlist, OID_AUTO, "cidx", CTLFLAG_RD, &qs->rspq.cidx, 0, "consumer index"); SYSCTL_ADD_UINT(ctx, rspqpoidlist, OID_AUTO, "credits", CTLFLAG_RD, &qs->rspq.credits, 0, "#credits"); SYSCTL_ADD_UINT(ctx, rspqpoidlist, OID_AUTO, "starved", CTLFLAG_RD, &qs->rspq.starved, 0, "#times starved"); SYSCTL_ADD_UAUTO(ctx, rspqpoidlist, OID_AUTO, "phys_addr", CTLFLAG_RD, &qs->rspq.phys_addr, "physical_address_of the queue"); SYSCTL_ADD_UINT(ctx, rspqpoidlist, OID_AUTO, "dump_start", CTLFLAG_RW, &qs->rspq.rspq_dump_start, 0, "start rspq dump entry"); SYSCTL_ADD_UINT(ctx, rspqpoidlist, OID_AUTO, "dump_count", CTLFLAG_RW, &qs->rspq.rspq_dump_count, 0, "#rspq entries to dump"); SYSCTL_ADD_PROC(ctx, rspqpoidlist, OID_AUTO, "qdump", CTLTYPE_STRING | CTLFLAG_RD, &qs->rspq, 0, t3_dump_rspq, "A", "dump of the response queue"); SYSCTL_ADD_UQUAD(ctx, txqpoidlist, OID_AUTO, "dropped", CTLFLAG_RD, &qs->txq[TXQ_ETH].txq_mr->br_drops, "#tunneled packets dropped"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "sendqlen", CTLFLAG_RD, &qs->txq[TXQ_ETH].sendq.mq_len, 0, "#tunneled packets waiting to be sent"); #if 0 SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "queue_pidx", CTLFLAG_RD, (uint32_t *)(uintptr_t)&qs->txq[TXQ_ETH].txq_mr.br_prod, 0, "#tunneled packets queue producer index"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "queue_cidx", CTLFLAG_RD, (uint32_t *)(uintptr_t)&qs->txq[TXQ_ETH].txq_mr.br_cons, 0, "#tunneled packets queue consumer index"); #endif SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "processed", CTLFLAG_RD, &qs->txq[TXQ_ETH].processed, 0, "#tunneled packets processed by the card"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "cleaned", CTLFLAG_RD, &txq->cleaned, 0, "#tunneled packets cleaned"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "in_use", CTLFLAG_RD, &txq->in_use, 0, "#tunneled packet slots in use"); SYSCTL_ADD_UQUAD(ctx, txqpoidlist, OID_AUTO, "frees", CTLFLAG_RD, &txq->txq_frees, "#tunneled packets freed"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "skipped", CTLFLAG_RD, &txq->txq_skipped, 0, "#tunneled packet descriptors skipped"); SYSCTL_ADD_UQUAD(ctx, txqpoidlist, OID_AUTO, "coalesced", CTLFLAG_RD, &txq->txq_coalesced, "#tunneled packets coalesced"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "enqueued", CTLFLAG_RD, &txq->txq_enqueued, 0, "#tunneled packets enqueued to hardware"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "stopped_flags", CTLFLAG_RD, &qs->txq_stopped, 0, "tx queues stopped"); SYSCTL_ADD_UAUTO(ctx, txqpoidlist, OID_AUTO, "phys_addr", CTLFLAG_RD, &txq->phys_addr, "physical_address_of the queue"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "qgen", CTLFLAG_RW, &qs->txq[TXQ_ETH].gen, 0, "txq generation"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "hw_cidx", CTLFLAG_RD, &txq->cidx, 0, "hardware queue cidx"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "hw_pidx", CTLFLAG_RD, &txq->pidx, 0, "hardware queue pidx"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "dump_start", CTLFLAG_RW, &qs->txq[TXQ_ETH].txq_dump_start, 0, "txq start idx for dump"); SYSCTL_ADD_UINT(ctx, txqpoidlist, OID_AUTO, "dump_count", CTLFLAG_RW, &qs->txq[TXQ_ETH].txq_dump_count, 0, "txq #entries to dump"); SYSCTL_ADD_PROC(ctx, txqpoidlist, OID_AUTO, "qdump", CTLTYPE_STRING | CTLFLAG_RD, &qs->txq[TXQ_ETH], 0, t3_dump_txq_eth, "A", "dump of the transmit queue"); SYSCTL_ADD_UINT(ctx, ctrlqpoidlist, OID_AUTO, "dump_start", CTLFLAG_RW, &qs->txq[TXQ_CTRL].txq_dump_start, 0, "ctrlq start idx for dump"); SYSCTL_ADD_UINT(ctx, ctrlqpoidlist, OID_AUTO, "dump_count", CTLFLAG_RW, &qs->txq[TXQ_CTRL].txq_dump_count, 0, "ctrl #entries to dump"); SYSCTL_ADD_PROC(ctx, ctrlqpoidlist, OID_AUTO, "qdump", CTLTYPE_STRING | CTLFLAG_RD, &qs->txq[TXQ_CTRL], 0, t3_dump_txq_ctrl, "A", "dump of the transmit queue"); SYSCTL_ADD_U64(ctx, lropoidlist, OID_AUTO, "lro_queued", CTLFLAG_RD, &qs->lro.ctrl.lro_queued, 0, NULL); SYSCTL_ADD_U64(ctx, lropoidlist, OID_AUTO, "lro_flushed", CTLFLAG_RD, &qs->lro.ctrl.lro_flushed, 0, NULL); SYSCTL_ADD_U64(ctx, lropoidlist, OID_AUTO, "lro_bad_csum", CTLFLAG_RD, &qs->lro.ctrl.lro_bad_csum, 0, NULL); SYSCTL_ADD_INT(ctx, lropoidlist, OID_AUTO, "lro_cnt", CTLFLAG_RD, &qs->lro.ctrl.lro_cnt, 0, NULL); } /* Now add a node for mac stats. */ poid = SYSCTL_ADD_NODE(ctx, poidlist, OID_AUTO, "mac_stats", CTLFLAG_RD, NULL, "MAC statistics"); poidlist = SYSCTL_CHILDREN(poid); /* * We (ab)use the length argument (arg2) to pass on the offset * of the data that we are interested in. This is only required * for the quad counters that are updated from the hardware (we * make sure that we return the latest value). * sysctl_handle_macstat first updates *all* the counters from * the hardware, and then returns the latest value of the * requested counter. Best would be to update only the * requested counter from hardware, but t3_mac_update_stats() * hides all the register details and we don't want to dive into * all that here. */ #define CXGB_SYSCTL_ADD_QUAD(a) SYSCTL_ADD_OID(ctx, poidlist, OID_AUTO, #a, \ (CTLTYPE_U64 | CTLFLAG_RD), pi, offsetof(struct mac_stats, a), \ sysctl_handle_macstat, "QU", 0) CXGB_SYSCTL_ADD_QUAD(tx_octets); CXGB_SYSCTL_ADD_QUAD(tx_octets_bad); CXGB_SYSCTL_ADD_QUAD(tx_frames); CXGB_SYSCTL_ADD_QUAD(tx_mcast_frames); CXGB_SYSCTL_ADD_QUAD(tx_bcast_frames); CXGB_SYSCTL_ADD_QUAD(tx_pause); CXGB_SYSCTL_ADD_QUAD(tx_deferred); CXGB_SYSCTL_ADD_QUAD(tx_late_collisions); CXGB_SYSCTL_ADD_QUAD(tx_total_collisions); CXGB_SYSCTL_ADD_QUAD(tx_excess_collisions); CXGB_SYSCTL_ADD_QUAD(tx_underrun); CXGB_SYSCTL_ADD_QUAD(tx_len_errs); CXGB_SYSCTL_ADD_QUAD(tx_mac_internal_errs); CXGB_SYSCTL_ADD_QUAD(tx_excess_deferral); CXGB_SYSCTL_ADD_QUAD(tx_fcs_errs); CXGB_SYSCTL_ADD_QUAD(tx_frames_64); CXGB_SYSCTL_ADD_QUAD(tx_frames_65_127); CXGB_SYSCTL_ADD_QUAD(tx_frames_128_255); CXGB_SYSCTL_ADD_QUAD(tx_frames_256_511); CXGB_SYSCTL_ADD_QUAD(tx_frames_512_1023); CXGB_SYSCTL_ADD_QUAD(tx_frames_1024_1518); CXGB_SYSCTL_ADD_QUAD(tx_frames_1519_max); CXGB_SYSCTL_ADD_QUAD(rx_octets); CXGB_SYSCTL_ADD_QUAD(rx_octets_bad); CXGB_SYSCTL_ADD_QUAD(rx_frames); CXGB_SYSCTL_ADD_QUAD(rx_mcast_frames); CXGB_SYSCTL_ADD_QUAD(rx_bcast_frames); CXGB_SYSCTL_ADD_QUAD(rx_pause); CXGB_SYSCTL_ADD_QUAD(rx_fcs_errs); CXGB_SYSCTL_ADD_QUAD(rx_align_errs); CXGB_SYSCTL_ADD_QUAD(rx_symbol_errs); CXGB_SYSCTL_ADD_QUAD(rx_data_errs); CXGB_SYSCTL_ADD_QUAD(rx_sequence_errs); CXGB_SYSCTL_ADD_QUAD(rx_runt); CXGB_SYSCTL_ADD_QUAD(rx_jabber); CXGB_SYSCTL_ADD_QUAD(rx_short); CXGB_SYSCTL_ADD_QUAD(rx_too_long); CXGB_SYSCTL_ADD_QUAD(rx_mac_internal_errs); CXGB_SYSCTL_ADD_QUAD(rx_cong_drops); CXGB_SYSCTL_ADD_QUAD(rx_frames_64); CXGB_SYSCTL_ADD_QUAD(rx_frames_65_127); CXGB_SYSCTL_ADD_QUAD(rx_frames_128_255); CXGB_SYSCTL_ADD_QUAD(rx_frames_256_511); CXGB_SYSCTL_ADD_QUAD(rx_frames_512_1023); CXGB_SYSCTL_ADD_QUAD(rx_frames_1024_1518); CXGB_SYSCTL_ADD_QUAD(rx_frames_1519_max); #undef CXGB_SYSCTL_ADD_QUAD #define CXGB_SYSCTL_ADD_ULONG(a) SYSCTL_ADD_ULONG(ctx, poidlist, OID_AUTO, #a, \ CTLFLAG_RD, &mstats->a, 0) CXGB_SYSCTL_ADD_ULONG(tx_fifo_parity_err); CXGB_SYSCTL_ADD_ULONG(rx_fifo_parity_err); CXGB_SYSCTL_ADD_ULONG(tx_fifo_urun); CXGB_SYSCTL_ADD_ULONG(rx_fifo_ovfl); CXGB_SYSCTL_ADD_ULONG(serdes_signal_loss); CXGB_SYSCTL_ADD_ULONG(xaui_pcs_ctc_err); CXGB_SYSCTL_ADD_ULONG(xaui_pcs_align_change); CXGB_SYSCTL_ADD_ULONG(num_toggled); CXGB_SYSCTL_ADD_ULONG(num_resets); CXGB_SYSCTL_ADD_ULONG(link_faults); #undef CXGB_SYSCTL_ADD_ULONG } } /** * t3_get_desc - dump an SGE descriptor for debugging purposes * @qs: the queue set * @qnum: identifies the specific queue (0..2: Tx, 3:response, 4..5: Rx) * @idx: the descriptor index in the queue * @data: where to dump the descriptor contents * * Dumps the contents of a HW descriptor of an SGE queue. Returns the * size of the descriptor. */ int t3_get_desc(const struct sge_qset *qs, unsigned int qnum, unsigned int idx, unsigned char *data) { if (qnum >= 6) return (EINVAL); if (qnum < 3) { if (!qs->txq[qnum].desc || idx >= qs->txq[qnum].size) return -EINVAL; memcpy(data, &qs->txq[qnum].desc[idx], sizeof(struct tx_desc)); return sizeof(struct tx_desc); } if (qnum == 3) { if (!qs->rspq.desc || idx >= qs->rspq.size) return (EINVAL); memcpy(data, &qs->rspq.desc[idx], sizeof(struct rsp_desc)); return sizeof(struct rsp_desc); } qnum -= 4; if (!qs->fl[qnum].desc || idx >= qs->fl[qnum].size) return (EINVAL); memcpy(data, &qs->fl[qnum].desc[idx], sizeof(struct rx_desc)); return sizeof(struct rx_desc); } Index: projects/vnet/sys/dev/cxgbe/adapter.h =================================================================== --- projects/vnet/sys/dev/cxgbe/adapter.h (revision 301546) +++ projects/vnet/sys/dev/cxgbe/adapter.h (revision 301547) @@ -1,1152 +1,1166 @@ /*- * Copyright (c) 2011 Chelsio Communications, Inc. * All rights reserved. * Written by: Navdeep Parhar * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ * */ #ifndef __T4_ADAPTER_H__ #define __T4_ADAPTER_H__ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "offload.h" +#include "t4_ioctl.h" #include "common/t4_msg.h" #include "firmware/t4fw_interface.h" #define KTR_CXGBE KTR_SPARE3 MALLOC_DECLARE(M_CXGBE); #define CXGBE_UNIMPLEMENTED(s) \ panic("%s (%s, line %d) not implemented yet.", s, __FILE__, __LINE__) #if defined(__i386__) || defined(__amd64__) static __inline void prefetch(void *x) { __asm volatile("prefetcht0 %0" :: "m" (*(unsigned long *)x)); } #else #define prefetch(x) #endif #ifndef SYSCTL_ADD_UQUAD #define SYSCTL_ADD_UQUAD SYSCTL_ADD_QUAD #define sysctl_handle_64 sysctl_handle_quad #define CTLTYPE_U64 CTLTYPE_QUAD #endif #if (__FreeBSD_version >= 900030) || \ ((__FreeBSD_version >= 802507) && (__FreeBSD_version < 900000)) #define SBUF_DRAIN 1 #endif #ifdef __amd64__ /* XXX: need systemwide bus_space_read_8/bus_space_write_8 */ static __inline uint64_t t4_bus_space_read_8(bus_space_tag_t tag, bus_space_handle_t handle, bus_size_t offset) { KASSERT(tag == X86_BUS_SPACE_MEM, ("%s: can only handle mem space", __func__)); return (*(volatile uint64_t *)(handle + offset)); } static __inline void t4_bus_space_write_8(bus_space_tag_t tag, bus_space_handle_t bsh, bus_size_t offset, uint64_t value) { KASSERT(tag == X86_BUS_SPACE_MEM, ("%s: can only handle mem space", __func__)); *(volatile uint64_t *)(bsh + offset) = value; } #else static __inline uint64_t t4_bus_space_read_8(bus_space_tag_t tag, bus_space_handle_t handle, bus_size_t offset) { return (uint64_t)bus_space_read_4(tag, handle, offset) + ((uint64_t)bus_space_read_4(tag, handle, offset + 4) << 32); } static __inline void t4_bus_space_write_8(bus_space_tag_t tag, bus_space_handle_t bsh, bus_size_t offset, uint64_t value) { bus_space_write_4(tag, bsh, offset, value); bus_space_write_4(tag, bsh, offset + 4, value >> 32); } #endif struct adapter; typedef struct adapter adapter_t; enum { /* * All ingress queues use this entry size. Note that the firmware event * queue and any iq expecting CPL_RX_PKT in the descriptor needs this to * be at least 64. */ IQ_ESIZE = 64, /* Default queue sizes for all kinds of ingress queues */ FW_IQ_QSIZE = 256, RX_IQ_QSIZE = 1024, /* All egress queues use this entry size */ EQ_ESIZE = 64, /* Default queue sizes for all kinds of egress queues */ CTRL_EQ_QSIZE = 128, TX_EQ_QSIZE = 1024, #if MJUMPAGESIZE != MCLBYTES SW_ZONE_SIZES = 4, /* cluster, jumbop, jumbo9k, jumbo16k */ #else SW_ZONE_SIZES = 3, /* cluster, jumbo9k, jumbo16k */ #endif CL_METADATA_SIZE = CACHE_LINE_SIZE, SGE_MAX_WR_NDESC = SGE_MAX_WR_LEN / EQ_ESIZE, /* max WR size in desc */ TX_SGL_SEGS = 39, TX_SGL_SEGS_TSO = 38, TX_WR_FLITS = SGE_MAX_WR_LEN / 8 }; enum { /* adapter intr_type */ INTR_INTX = (1 << 0), INTR_MSI = (1 << 1), INTR_MSIX = (1 << 2) }; enum { XGMAC_MTU = (1 << 0), XGMAC_PROMISC = (1 << 1), XGMAC_ALLMULTI = (1 << 2), XGMAC_VLANEX = (1 << 3), XGMAC_UCADDR = (1 << 4), XGMAC_MCADDRS = (1 << 5), XGMAC_ALL = 0xffff }; enum { /* flags understood by begin_synchronized_op */ HOLD_LOCK = (1 << 0), SLEEP_OK = (1 << 1), INTR_OK = (1 << 2), /* flags understood by end_synchronized_op */ LOCK_HELD = HOLD_LOCK, }; enum { /* adapter flags */ FULL_INIT_DONE = (1 << 0), FW_OK = (1 << 1), /* INTR_DIRECT = (1 << 2), No longer used. */ MASTER_PF = (1 << 3), ADAP_SYSCTL_CTX = (1 << 4), /* TOM_INIT_DONE= (1 << 5), No longer used */ BUF_PACKING_OK = (1 << 6), CXGBE_BUSY = (1 << 9), /* port flags */ HAS_TRACEQ = (1 << 3), /* VI flags */ DOOMED = (1 << 0), VI_INIT_DONE = (1 << 1), VI_SYSCTL_CTX = (1 << 2), INTR_RXQ = (1 << 4), /* All NIC rxq's take interrupts */ INTR_OFLD_RXQ = (1 << 5), /* All TOE rxq's take interrupts */ INTR_ALL = (INTR_RXQ | INTR_OFLD_RXQ), VI_NETMAP = (1 << 6), /* adapter debug_flags */ DF_DUMP_MBOX = (1 << 0), }; #define IS_DOOMED(vi) ((vi)->flags & DOOMED) #define SET_DOOMED(vi) do {(vi)->flags |= DOOMED;} while (0) #define IS_BUSY(sc) ((sc)->flags & CXGBE_BUSY) #define SET_BUSY(sc) do {(sc)->flags |= CXGBE_BUSY;} while (0) #define CLR_BUSY(sc) do {(sc)->flags &= ~CXGBE_BUSY;} while (0) struct vi_info { device_t dev; struct port_info *pi; struct ifnet *ifp; struct ifmedia media; unsigned long flags; int if_flags; uint16_t *rss; uint16_t viid; int16_t xact_addr_filt;/* index of exact MAC address filter */ uint16_t rss_size; /* size of VI's RSS table slice */ uint16_t rss_base; /* start of VI's RSS table slice */ eventhandler_tag vlan_c; int nintr; int first_intr; /* These need to be int as they are used in sysctl */ int ntxq; /* # of tx queues */ int first_txq; /* index of first tx queue */ int rsrv_noflowq; /* Reserve queue 0 for non-flowid packets */ int nrxq; /* # of rx queues */ int first_rxq; /* index of first rx queue */ int nofldtxq; /* # of offload tx queues */ int first_ofld_txq; /* index of first offload tx queue */ int nofldrxq; /* # of offload rx queues */ int first_ofld_rxq; /* index of first offload rx queue */ int tmr_idx; int pktc_idx; int qsize_rxq; int qsize_txq; struct timeval last_refreshed; struct fw_vi_stats_vf stats; struct callout tick; struct sysctl_ctx_list ctx; /* from ifconfig up to driver detach */ uint8_t hw_addr[ETHER_ADDR_LEN]; /* factory MAC address, won't change */ }; +enum { + /* tx_sched_class flags */ + TX_SC_OK = (1 << 0), /* Set up in hardware, active. */ +}; + +struct tx_sched_class { + int refcount; + int flags; + struct t4_sched_class_params params; +}; + struct port_info { device_t dev; struct adapter *adapter; struct vi_info *vi; int nvi; int up_vis; int uld_vis; + + struct tx_sched_class *tc; /* traffic classes for this channel */ struct mtx pi_lock; char lockname[16]; unsigned long flags; uint8_t lport; /* associated offload logical port */ int8_t mdio_addr; uint8_t port_type; uint8_t mod_type; uint8_t port_id; uint8_t tx_chan; uint8_t rx_chan_map; /* rx MPS channel bitmap */ int linkdnrc; struct link_config link_cfg; struct timeval last_refreshed; struct port_stats stats; u_int tnl_cong_drops; u_int tx_parse_error; struct callout tick; }; #define IS_MAIN_VI(vi) ((vi) == &((vi)->pi->vi[0])) /* Where the cluster came from, how it has been carved up. */ struct cluster_layout { int8_t zidx; int8_t hwidx; uint16_t region1; /* mbufs laid out within this region */ /* region2 is the DMA region */ uint16_t region3; /* cluster_metadata within this region */ }; struct cluster_metadata { u_int refcount; struct fl_sdesc *sd; /* For debug only. Could easily be stale */ }; struct fl_sdesc { caddr_t cl; uint16_t nmbuf; /* # of driver originated mbufs with ref on cluster */ struct cluster_layout cll; }; struct tx_desc { __be64 flit[8]; }; struct tx_sdesc { struct mbuf *m; /* m_nextpkt linked chain of frames */ uint8_t desc_used; /* # of hardware descriptors used by the WR */ }; #define IQ_PAD (IQ_ESIZE - sizeof(struct rsp_ctrl) - sizeof(struct rss_header)) struct iq_desc { struct rss_header rss; uint8_t cpl[IQ_PAD]; struct rsp_ctrl rsp; }; #undef IQ_PAD CTASSERT(sizeof(struct iq_desc) == IQ_ESIZE); enum { /* iq flags */ IQ_ALLOCATED = (1 << 0), /* firmware resources allocated */ IQ_HAS_FL = (1 << 1), /* iq associated with a freelist */ IQ_INTR = (1 << 2), /* iq takes direct interrupt */ IQ_LRO_ENABLED = (1 << 3), /* iq is an eth rxq with LRO enabled */ /* iq state */ IQS_DISABLED = 0, IQS_BUSY = 1, IQS_IDLE = 2, }; /* * Ingress Queue: T4 is producer, driver is consumer. */ struct sge_iq { uint32_t flags; volatile int state; struct adapter *adapter; struct iq_desc *desc; /* KVA of descriptor ring */ int8_t intr_pktc_idx; /* packet count threshold index */ uint8_t gen; /* generation bit */ uint8_t intr_params; /* interrupt holdoff parameters */ uint8_t intr_next; /* XXX: holdoff for next interrupt */ uint16_t qsize; /* size (# of entries) of the queue */ uint16_t sidx; /* index of the entry with the status page */ uint16_t cidx; /* consumer index */ uint16_t cntxt_id; /* SGE context id for the iq */ uint16_t abs_id; /* absolute SGE id for the iq */ STAILQ_ENTRY(sge_iq) link; bus_dma_tag_t desc_tag; bus_dmamap_t desc_map; bus_addr_t ba; /* bus address of descriptor ring */ }; enum { EQ_CTRL = 1, EQ_ETH = 2, EQ_OFLD = 3, /* eq flags */ EQ_TYPEMASK = 0x3, /* 2 lsbits hold the type (see above) */ EQ_ALLOCATED = (1 << 2), /* firmware resources allocated */ EQ_ENABLED = (1 << 3), /* open for business */ }; /* Listed in order of preference. Update t4_sysctls too if you change these */ enum {DOORBELL_UDB, DOORBELL_WCWR, DOORBELL_UDBWC, DOORBELL_KDB}; /* * Egress Queue: driver is producer, T4 is consumer. * * Note: A free list is an egress queue (driver produces the buffers and T4 * consumes them) but it's special enough to have its own struct (see sge_fl). */ struct sge_eq { unsigned int flags; /* MUST be first */ unsigned int cntxt_id; /* SGE context id for the eq */ struct mtx eq_lock; struct tx_desc *desc; /* KVA of descriptor ring */ uint16_t doorbells; volatile uint32_t *udb; /* KVA of doorbell (lies within BAR2) */ u_int udb_qid; /* relative qid within the doorbell page */ uint16_t sidx; /* index of the entry with the status page */ uint16_t cidx; /* consumer idx (desc idx) */ uint16_t pidx; /* producer idx (desc idx) */ uint16_t equeqidx; /* EQUEQ last requested at this pidx */ uint16_t dbidx; /* pidx of the most recent doorbell */ uint16_t iqid; /* iq that gets egr_update for the eq */ uint8_t tx_chan; /* tx channel used by the eq */ volatile u_int equiq; /* EQUIQ outstanding */ bus_dma_tag_t desc_tag; bus_dmamap_t desc_map; bus_addr_t ba; /* bus address of descriptor ring */ char lockname[16]; }; struct sw_zone_info { uma_zone_t zone; /* zone that this cluster comes from */ int size; /* size of cluster: 2K, 4K, 9K, 16K, etc. */ int type; /* EXT_xxx type of the cluster */ int8_t head_hwidx; int8_t tail_hwidx; }; struct hw_buf_info { int8_t zidx; /* backpointer to zone; -ve means unused */ int8_t next; /* next hwidx for this zone; -1 means no more */ int size; }; enum { NUM_MEMWIN = 3, MEMWIN0_APERTURE = 2048, MEMWIN0_BASE = 0x1b800, MEMWIN1_APERTURE = 32768, MEMWIN1_BASE = 0x28000, MEMWIN2_APERTURE_T4 = 65536, MEMWIN2_BASE_T4 = 0x30000, MEMWIN2_APERTURE_T5 = 128 * 1024, MEMWIN2_BASE_T5 = 0x60000, }; struct memwin { struct rwlock mw_lock __aligned(CACHE_LINE_SIZE); uint32_t mw_base; /* constant after setup_memwin */ uint32_t mw_aperture; /* ditto */ uint32_t mw_curpos; /* protected by mw_lock */ }; enum { FL_STARVING = (1 << 0), /* on the adapter's list of starving fl's */ FL_DOOMED = (1 << 1), /* about to be destroyed */ FL_BUF_PACKING = (1 << 2), /* buffer packing enabled */ FL_BUF_RESUME = (1 << 3), /* resume from the middle of the frame */ }; #define FL_RUNNING_LOW(fl) \ (IDXDIFF(fl->dbidx * 8, fl->cidx, fl->sidx * 8) <= fl->lowat) #define FL_NOT_RUNNING_LOW(fl) \ (IDXDIFF(fl->dbidx * 8, fl->cidx, fl->sidx * 8) >= 2 * fl->lowat) struct sge_fl { struct mtx fl_lock; __be64 *desc; /* KVA of descriptor ring, ptr to addresses */ struct fl_sdesc *sdesc; /* KVA of software descriptor ring */ struct cluster_layout cll_def; /* default refill zone, layout */ uint16_t lowat; /* # of buffers <= this means fl needs help */ int flags; uint16_t buf_boundary; /* The 16b idx all deal with hw descriptors */ uint16_t dbidx; /* hw pidx after last doorbell */ uint16_t sidx; /* index of status page */ volatile uint16_t hw_cidx; /* The 32b idx are all buffer idx, not hardware descriptor idx */ uint32_t cidx; /* consumer index */ uint32_t pidx; /* producer index */ uint32_t dbval; u_int rx_offset; /* offset in fl buf (when buffer packing) */ volatile uint32_t *udb; uint64_t mbuf_allocated;/* # of mbuf allocated from zone_mbuf */ uint64_t mbuf_inlined; /* # of mbuf created within clusters */ uint64_t cl_allocated; /* # of clusters allocated */ uint64_t cl_recycled; /* # of clusters recycled */ uint64_t cl_fast_recycled; /* # of clusters recycled (fast) */ /* These 3 are valid when FL_BUF_RESUME is set, stale otherwise. */ struct mbuf *m0; struct mbuf **pnext; u_int remaining; uint16_t qsize; /* # of hw descriptors (status page included) */ uint16_t cntxt_id; /* SGE context id for the freelist */ TAILQ_ENTRY(sge_fl) link; /* All starving freelists */ bus_dma_tag_t desc_tag; bus_dmamap_t desc_map; char lockname[16]; bus_addr_t ba; /* bus address of descriptor ring */ struct cluster_layout cll_alt; /* alternate refill zone, layout */ }; struct mp_ring; /* txq: SGE egress queue + what's needed for Ethernet NIC */ struct sge_txq { struct sge_eq eq; /* MUST be first */ struct ifnet *ifp; /* the interface this txq belongs to */ struct mp_ring *r; /* tx software ring */ struct tx_sdesc *sdesc; /* KVA of software descriptor ring */ struct sglist *gl; __be32 cpl_ctrl0; /* for convenience */ struct task tx_reclaim_task; /* stats for common events first */ uint64_t txcsum; /* # of times hardware assisted with checksum */ uint64_t tso_wrs; /* # of TSO work requests */ uint64_t vlan_insertion;/* # of times VLAN tag was inserted */ uint64_t imm_wrs; /* # of work requests with immediate data */ uint64_t sgl_wrs; /* # of work requests with direct SGL */ uint64_t txpkt_wrs; /* # of txpkt work requests (not coalesced) */ uint64_t txpkts0_wrs; /* # of type0 coalesced tx work requests */ uint64_t txpkts1_wrs; /* # of type1 coalesced tx work requests */ uint64_t txpkts0_pkts; /* # of frames in type0 coalesced tx WRs */ uint64_t txpkts1_pkts; /* # of frames in type1 coalesced tx WRs */ /* stats for not-that-common events */ } __aligned(CACHE_LINE_SIZE); /* rxq: SGE ingress queue + SGE free list + miscellaneous items */ struct sge_rxq { struct sge_iq iq; /* MUST be first */ struct sge_fl fl; /* MUST follow iq */ struct ifnet *ifp; /* the interface this rxq belongs to */ #if defined(INET) || defined(INET6) struct lro_ctrl lro; /* LRO state */ #endif /* stats for common events first */ uint64_t rxcsum; /* # of times hardware assisted with checksum */ uint64_t vlan_extraction;/* # of times VLAN tag was extracted */ /* stats for not-that-common events */ } __aligned(CACHE_LINE_SIZE); static inline struct sge_rxq * iq_to_rxq(struct sge_iq *iq) { return (__containerof(iq, struct sge_rxq, iq)); } /* ofld_rxq: SGE ingress queue + SGE free list + miscellaneous items */ struct sge_ofld_rxq { struct sge_iq iq; /* MUST be first */ struct sge_fl fl; /* MUST follow iq */ } __aligned(CACHE_LINE_SIZE); static inline struct sge_ofld_rxq * iq_to_ofld_rxq(struct sge_iq *iq) { return (__containerof(iq, struct sge_ofld_rxq, iq)); } struct wrqe { STAILQ_ENTRY(wrqe) link; struct sge_wrq *wrq; int wr_len; char wr[] __aligned(16); }; struct wrq_cookie { TAILQ_ENTRY(wrq_cookie) link; int ndesc; int pidx; }; /* * wrq: SGE egress queue that is given prebuilt work requests. Both the control * and offload tx queues are of this type. */ struct sge_wrq { struct sge_eq eq; /* MUST be first */ struct adapter *adapter; struct task wrq_tx_task; /* Tx desc reserved but WR not "committed" yet. */ TAILQ_HEAD(wrq_incomplete_wrs , wrq_cookie) incomplete_wrs; /* List of WRs ready to go out as soon as descriptors are available. */ STAILQ_HEAD(, wrqe) wr_list; u_int nwr_pending; u_int ndesc_needed; /* stats for common events first */ uint64_t tx_wrs_direct; /* # of WRs written directly to desc ring. */ uint64_t tx_wrs_ss; /* # of WRs copied from scratch space. */ uint64_t tx_wrs_copied; /* # of WRs queued and copied to desc ring. */ /* stats for not-that-common events */ /* * Scratch space for work requests that wrap around after reaching the * status page, and some information about the last WR that used it. */ uint16_t ss_pidx; uint16_t ss_len; uint8_t ss[SGE_MAX_WR_LEN]; } __aligned(CACHE_LINE_SIZE); struct sge_nm_rxq { struct vi_info *vi; struct iq_desc *iq_desc; uint16_t iq_abs_id; uint16_t iq_cntxt_id; uint16_t iq_cidx; uint16_t iq_sidx; uint8_t iq_gen; __be64 *fl_desc; uint16_t fl_cntxt_id; uint32_t fl_cidx; uint32_t fl_pidx; uint32_t fl_sidx; uint32_t fl_db_val; u_int fl_hwidx:4; u_int nid; /* netmap ring # for this queue */ /* infrequently used items after this */ bus_dma_tag_t iq_desc_tag; bus_dmamap_t iq_desc_map; bus_addr_t iq_ba; int intr_idx; bus_dma_tag_t fl_desc_tag; bus_dmamap_t fl_desc_map; bus_addr_t fl_ba; } __aligned(CACHE_LINE_SIZE); struct sge_nm_txq { struct tx_desc *desc; uint16_t cidx; uint16_t pidx; uint16_t sidx; uint16_t equiqidx; /* EQUIQ last requested at this pidx */ uint16_t equeqidx; /* EQUEQ last requested at this pidx */ uint16_t dbidx; /* pidx of the most recent doorbell */ uint16_t doorbells; volatile uint32_t *udb; u_int udb_qid; u_int cntxt_id; __be32 cpl_ctrl0; /* for convenience */ u_int nid; /* netmap ring # for this queue */ /* infrequently used items after this */ bus_dma_tag_t desc_tag; bus_dmamap_t desc_map; bus_addr_t ba; int iqidx; } __aligned(CACHE_LINE_SIZE); struct sge { int nrxq; /* total # of Ethernet rx queues */ int ntxq; /* total # of Ethernet tx tx queues */ int nofldrxq; /* total # of TOE rx queues */ int nofldtxq; /* total # of TOE tx queues */ int nnmrxq; /* total # of netmap rx queues */ int nnmtxq; /* total # of netmap tx queues */ int niq; /* total # of ingress queues */ int neq; /* total # of egress queues */ struct sge_iq fwq; /* Firmware event queue */ struct sge_wrq mgmtq; /* Management queue (control queue) */ struct sge_wrq *ctrlq; /* Control queues */ struct sge_txq *txq; /* NIC tx queues */ struct sge_rxq *rxq; /* NIC rx queues */ struct sge_wrq *ofld_txq; /* TOE tx queues */ struct sge_ofld_rxq *ofld_rxq; /* TOE rx queues */ struct sge_nm_txq *nm_txq; /* netmap tx queues */ struct sge_nm_rxq *nm_rxq; /* netmap rx queues */ uint16_t iq_start; int eq_start; struct sge_iq **iqmap; /* iq->cntxt_id to iq mapping */ struct sge_eq **eqmap; /* eq->cntxt_id to eq mapping */ int8_t safe_hwidx1; /* may not have room for metadata */ int8_t safe_hwidx2; /* with room for metadata and maybe more */ struct sw_zone_info sw_zone_info[SW_ZONE_SIZES]; struct hw_buf_info hw_buf_info[SGE_FLBUF_SIZES]; }; struct rss_header; typedef int (*cpl_handler_t)(struct sge_iq *, const struct rss_header *, struct mbuf *); typedef int (*an_handler_t)(struct sge_iq *, const struct rsp_ctrl *); typedef int (*fw_msg_handler_t)(struct adapter *, const __be64 *); struct adapter { SLIST_ENTRY(adapter) link; device_t dev; struct cdev *cdev; /* PCIe register resources */ int regs_rid; struct resource *regs_res; int msix_rid; struct resource *msix_res; bus_space_handle_t bh; bus_space_tag_t bt; bus_size_t mmio_len; int udbs_rid; struct resource *udbs_res; volatile uint8_t *udbs_base; unsigned int pf; unsigned int mbox; unsigned int vpd_busy; unsigned int vpd_flag; /* Interrupt information */ int intr_type; int intr_count; struct irq { struct resource *res; int rid; void *tag; } *irq; bus_dma_tag_t dmat; /* Parent DMA tag */ struct sge sge; int lro_timeout; struct taskqueue *tq[MAX_NCHAN]; /* General purpose taskqueues */ struct port_info *port[MAX_NPORTS]; uint8_t chan_map[MAX_NCHAN]; void *tom_softc; /* (struct tom_data *) */ struct tom_tunables tt; void *iwarp_softc; /* (struct c4iw_dev *) */ void *iscsi_ulp_softc; /* (struct cxgbei_data *) */ struct l2t_data *l2t; /* L2 table */ struct tid_info tids; uint16_t doorbells; int offload_map; /* ports with IFCAP_TOE enabled */ int active_ulds; /* ULDs activated on this adapter */ int flags; int debug_flags; char ifp_lockname[16]; struct mtx ifp_lock; struct ifnet *ifp; /* tracer ifp */ struct ifmedia media; int traceq; /* iq used by all tracers, -1 if none */ int tracer_valid; /* bitmap of valid tracers */ int tracer_enabled; /* bitmap of enabled tracers */ char fw_version[16]; char tp_version[16]; char exprom_version[16]; char cfg_file[32]; u_int cfcsum; struct adapter_params params; const struct chip_params *chip_params; struct t4_virt_res vres; uint16_t nbmcaps; uint16_t linkcaps; uint16_t switchcaps; uint16_t niccaps; uint16_t toecaps; uint16_t rdmacaps; uint16_t tlscaps; uint16_t iscsicaps; uint16_t fcoecaps; struct sysctl_ctx_list ctx; /* from adapter_full_init to full_uninit */ struct mtx sc_lock; char lockname[16]; /* Starving free lists */ struct mtx sfl_lock; /* same cache-line as sc_lock? but that's ok */ TAILQ_HEAD(, sge_fl) sfl; struct callout sfl_callout; struct mtx reg_lock; /* for indirect register access */ struct memwin memwin[NUM_MEMWIN]; /* memory windows */ an_handler_t an_handler __aligned(CACHE_LINE_SIZE); fw_msg_handler_t fw_msg_handler[7]; /* NUM_FW6_TYPES */ cpl_handler_t cpl_handler[0xef]; /* NUM_CPL_CMDS */ const char *last_op; const void *last_op_thr; int last_op_flags; int sc_do_rxcopy; }; #define ADAPTER_LOCK(sc) mtx_lock(&(sc)->sc_lock) #define ADAPTER_UNLOCK(sc) mtx_unlock(&(sc)->sc_lock) #define ADAPTER_LOCK_ASSERT_OWNED(sc) mtx_assert(&(sc)->sc_lock, MA_OWNED) #define ADAPTER_LOCK_ASSERT_NOTOWNED(sc) mtx_assert(&(sc)->sc_lock, MA_NOTOWNED) #define ASSERT_SYNCHRONIZED_OP(sc) \ KASSERT(IS_BUSY(sc) && \ (mtx_owned(&(sc)->sc_lock) || sc->last_op_thr == curthread), \ ("%s: operation not synchronized.", __func__)) #define PORT_LOCK(pi) mtx_lock(&(pi)->pi_lock) #define PORT_UNLOCK(pi) mtx_unlock(&(pi)->pi_lock) #define PORT_LOCK_ASSERT_OWNED(pi) mtx_assert(&(pi)->pi_lock, MA_OWNED) #define PORT_LOCK_ASSERT_NOTOWNED(pi) mtx_assert(&(pi)->pi_lock, MA_NOTOWNED) #define FL_LOCK(fl) mtx_lock(&(fl)->fl_lock) #define FL_TRYLOCK(fl) mtx_trylock(&(fl)->fl_lock) #define FL_UNLOCK(fl) mtx_unlock(&(fl)->fl_lock) #define FL_LOCK_ASSERT_OWNED(fl) mtx_assert(&(fl)->fl_lock, MA_OWNED) #define FL_LOCK_ASSERT_NOTOWNED(fl) mtx_assert(&(fl)->fl_lock, MA_NOTOWNED) #define RXQ_FL_LOCK(rxq) FL_LOCK(&(rxq)->fl) #define RXQ_FL_UNLOCK(rxq) FL_UNLOCK(&(rxq)->fl) #define RXQ_FL_LOCK_ASSERT_OWNED(rxq) FL_LOCK_ASSERT_OWNED(&(rxq)->fl) #define RXQ_FL_LOCK_ASSERT_NOTOWNED(rxq) FL_LOCK_ASSERT_NOTOWNED(&(rxq)->fl) #define EQ_LOCK(eq) mtx_lock(&(eq)->eq_lock) #define EQ_TRYLOCK(eq) mtx_trylock(&(eq)->eq_lock) #define EQ_UNLOCK(eq) mtx_unlock(&(eq)->eq_lock) #define EQ_LOCK_ASSERT_OWNED(eq) mtx_assert(&(eq)->eq_lock, MA_OWNED) #define EQ_LOCK_ASSERT_NOTOWNED(eq) mtx_assert(&(eq)->eq_lock, MA_NOTOWNED) #define TXQ_LOCK(txq) EQ_LOCK(&(txq)->eq) #define TXQ_TRYLOCK(txq) EQ_TRYLOCK(&(txq)->eq) #define TXQ_UNLOCK(txq) EQ_UNLOCK(&(txq)->eq) #define TXQ_LOCK_ASSERT_OWNED(txq) EQ_LOCK_ASSERT_OWNED(&(txq)->eq) #define TXQ_LOCK_ASSERT_NOTOWNED(txq) EQ_LOCK_ASSERT_NOTOWNED(&(txq)->eq) #define CH_DUMP_MBOX(sc, mbox, data_reg) \ do { \ if (sc->debug_flags & DF_DUMP_MBOX) { \ log(LOG_NOTICE, \ "%s mbox %u: %016llx %016llx %016llx %016llx " \ "%016llx %016llx %016llx %016llx\n", \ device_get_nameunit(sc->dev), mbox, \ (unsigned long long)t4_read_reg64(sc, data_reg), \ (unsigned long long)t4_read_reg64(sc, data_reg + 8), \ (unsigned long long)t4_read_reg64(sc, data_reg + 16), \ (unsigned long long)t4_read_reg64(sc, data_reg + 24), \ (unsigned long long)t4_read_reg64(sc, data_reg + 32), \ (unsigned long long)t4_read_reg64(sc, data_reg + 40), \ (unsigned long long)t4_read_reg64(sc, data_reg + 48), \ (unsigned long long)t4_read_reg64(sc, data_reg + 56)); \ } \ } while (0) #define for_each_txq(vi, iter, q) \ for (q = &vi->pi->adapter->sge.txq[vi->first_txq], iter = 0; \ iter < vi->ntxq; ++iter, ++q) #define for_each_rxq(vi, iter, q) \ for (q = &vi->pi->adapter->sge.rxq[vi->first_rxq], iter = 0; \ iter < vi->nrxq; ++iter, ++q) #define for_each_ofld_txq(vi, iter, q) \ for (q = &vi->pi->adapter->sge.ofld_txq[vi->first_ofld_txq], iter = 0; \ iter < vi->nofldtxq; ++iter, ++q) #define for_each_ofld_rxq(vi, iter, q) \ for (q = &vi->pi->adapter->sge.ofld_rxq[vi->first_ofld_rxq], iter = 0; \ iter < vi->nofldrxq; ++iter, ++q) #define for_each_nm_txq(vi, iter, q) \ for (q = &vi->pi->adapter->sge.nm_txq[vi->first_txq], iter = 0; \ iter < vi->ntxq; ++iter, ++q) #define for_each_nm_rxq(vi, iter, q) \ for (q = &vi->pi->adapter->sge.nm_rxq[vi->first_rxq], iter = 0; \ iter < vi->nrxq; ++iter, ++q) #define for_each_vi(_pi, _iter, _vi) \ for ((_vi) = (_pi)->vi, (_iter) = 0; (_iter) < (_pi)->nvi; \ ++(_iter), ++(_vi)) #define IDXINCR(idx, incr, wrap) do { \ idx = wrap - idx > incr ? idx + incr : incr - (wrap - idx); \ } while (0) #define IDXDIFF(head, tail, wrap) \ ((head) >= (tail) ? (head) - (tail) : (wrap) - (tail) + (head)) /* One for errors, one for firmware events */ #define T4_EXTRA_INTR 2 static inline uint32_t t4_read_reg(struct adapter *sc, uint32_t reg) { return bus_space_read_4(sc->bt, sc->bh, reg); } static inline void t4_write_reg(struct adapter *sc, uint32_t reg, uint32_t val) { bus_space_write_4(sc->bt, sc->bh, reg, val); } static inline uint64_t t4_read_reg64(struct adapter *sc, uint32_t reg) { return t4_bus_space_read_8(sc->bt, sc->bh, reg); } static inline void t4_write_reg64(struct adapter *sc, uint32_t reg, uint64_t val) { t4_bus_space_write_8(sc->bt, sc->bh, reg, val); } static inline void t4_os_pci_read_cfg1(struct adapter *sc, int reg, uint8_t *val) { *val = pci_read_config(sc->dev, reg, 1); } static inline void t4_os_pci_write_cfg1(struct adapter *sc, int reg, uint8_t val) { pci_write_config(sc->dev, reg, val, 1); } static inline void t4_os_pci_read_cfg2(struct adapter *sc, int reg, uint16_t *val) { *val = pci_read_config(sc->dev, reg, 2); } static inline void t4_os_pci_write_cfg2(struct adapter *sc, int reg, uint16_t val) { pci_write_config(sc->dev, reg, val, 2); } static inline void t4_os_pci_read_cfg4(struct adapter *sc, int reg, uint32_t *val) { *val = pci_read_config(sc->dev, reg, 4); } static inline void t4_os_pci_write_cfg4(struct adapter *sc, int reg, uint32_t val) { pci_write_config(sc->dev, reg, val, 4); } static inline struct port_info * adap2pinfo(struct adapter *sc, int idx) { return (sc->port[idx]); } static inline void t4_os_set_hw_addr(struct adapter *sc, int idx, uint8_t hw_addr[]) { bcopy(hw_addr, sc->port[idx]->vi[0].hw_addr, ETHER_ADDR_LEN); } static inline bool is_10G_port(const struct port_info *pi) { return ((pi->link_cfg.supported & FW_PORT_CAP_SPEED_10G) != 0); } static inline bool is_40G_port(const struct port_info *pi) { return ((pi->link_cfg.supported & FW_PORT_CAP_SPEED_40G) != 0); } static inline int port_top_speed(const struct port_info *pi) { if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_100G) return (100); if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_40G) return (40); if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_10G) return (10); if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_1G) return (1); return (0); } static inline int tx_resume_threshold(struct sge_eq *eq) { /* not quite the same as qsize / 4, but this will do. */ return (eq->sidx / 4); } static inline int t4_use_ldst(struct adapter *sc) { #ifdef notyet return (sc->flags & FW_OK || !sc->use_bd); #else return (0); #endif } /* t4_main.c */ int t4_os_find_pci_capability(struct adapter *, int); int t4_os_pci_save_state(struct adapter *); int t4_os_pci_restore_state(struct adapter *); void t4_os_portmod_changed(const struct adapter *, int); void t4_os_link_changed(struct adapter *, int, int, int); void t4_iterate(void (*)(struct adapter *, void *), void *); int t4_register_cpl_handler(struct adapter *, int, cpl_handler_t); int t4_register_an_handler(struct adapter *, an_handler_t); int t4_register_fw_msg_handler(struct adapter *, int, fw_msg_handler_t); int t4_filter_rpl(struct sge_iq *, const struct rss_header *, struct mbuf *); int begin_synchronized_op(struct adapter *, struct vi_info *, int, char *); void doom_vi(struct adapter *, struct vi_info *); void end_synchronized_op(struct adapter *, int); int update_mac_settings(struct ifnet *, int); int adapter_full_init(struct adapter *); int adapter_full_uninit(struct adapter *); uint64_t cxgbe_get_counter(struct ifnet *, ift_counter); int vi_full_init(struct vi_info *); int vi_full_uninit(struct vi_info *); void vi_sysctls(struct vi_info *); void vi_tick(void *); #ifdef DEV_NETMAP /* t4_netmap.c */ int create_netmap_ifnet(struct port_info *); int destroy_netmap_ifnet(struct port_info *); void t4_nm_intr(void *); #endif /* t4_sge.c */ void t4_sge_modload(void); void t4_sge_modunload(void); uint64_t t4_sge_extfree_refs(void); void t4_init_sge_cpl_handlers(struct adapter *); void t4_tweak_chip_settings(struct adapter *); int t4_read_chip_settings(struct adapter *); int t4_create_dma_tag(struct adapter *); void t4_sge_sysctls(struct adapter *, struct sysctl_ctx_list *, struct sysctl_oid_list *); int t4_destroy_dma_tag(struct adapter *); int t4_setup_adapter_queues(struct adapter *); int t4_teardown_adapter_queues(struct adapter *); int t4_setup_vi_queues(struct vi_info *); int t4_teardown_vi_queues(struct vi_info *); void t4_intr_all(void *); void t4_intr(void *); void t4_intr_err(void *); void t4_intr_evt(void *); void t4_wrq_tx_locked(struct adapter *, struct sge_wrq *, struct wrqe *); void t4_update_fl_bufsize(struct ifnet *); int parse_pkt(struct mbuf **); void *start_wrq_wr(struct sge_wrq *, int, struct wrq_cookie *); void commit_wrq_wr(struct sge_wrq *, void *, struct wrq_cookie *); int tnl_cong(struct port_info *, int); /* t4_tracer.c */ struct t4_tracer; void t4_tracer_modload(void); void t4_tracer_modunload(void); void t4_tracer_port_detach(struct adapter *); int t4_get_tracer(struct adapter *, struct t4_tracer *); int t4_set_tracer(struct adapter *, struct t4_tracer *); int t4_trace_pkt(struct sge_iq *, const struct rss_header *, struct mbuf *); int t5_trace_pkt(struct sge_iq *, const struct rss_header *, struct mbuf *); static inline struct wrqe * alloc_wrqe(int wr_len, struct sge_wrq *wrq) { int len = offsetof(struct wrqe, wr) + wr_len; struct wrqe *wr; wr = malloc(len, M_CXGBE, M_NOWAIT); if (__predict_false(wr == NULL)) return (NULL); wr->wr_len = wr_len; wr->wrq = wrq; return (wr); } static inline void * wrtod(struct wrqe *wr) { return (&wr->wr[0]); } static inline void free_wrqe(struct wrqe *wr) { free(wr, M_CXGBE); } static inline void t4_wrq_tx(struct adapter *sc, struct wrqe *wr) { struct sge_wrq *wrq = wr->wrq; TXQ_LOCK(wrq); t4_wrq_tx_locked(sc, wrq, wr); TXQ_UNLOCK(wrq); } #endif Index: projects/vnet/sys/dev/cxgbe/t4_main.c =================================================================== --- projects/vnet/sys/dev/cxgbe/t4_main.c (revision 301546) +++ projects/vnet/sys/dev/cxgbe/t4_main.c (revision 301547) @@ -1,9442 +1,9561 @@ /*- * Copyright (c) 2011 Chelsio Communications, Inc. * All rights reserved. * Written by: Navdeep Parhar * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_inet.h" #include "opt_inet6.h" #include "opt_rss.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef RSS #include #endif #if defined(__i386__) || defined(__amd64__) #include #include #endif #ifdef DDB #include #include #endif #include "common/common.h" #include "common/t4_msg.h" #include "common/t4_regs.h" #include "common/t4_regs_values.h" #include "t4_ioctl.h" #include "t4_l2t.h" #include "t4_mp_ring.h" /* T4 bus driver interface */ static int t4_probe(device_t); static int t4_attach(device_t); static int t4_detach(device_t); static device_method_t t4_methods[] = { DEVMETHOD(device_probe, t4_probe), DEVMETHOD(device_attach, t4_attach), DEVMETHOD(device_detach, t4_detach), DEVMETHOD_END }; static driver_t t4_driver = { "t4nex", t4_methods, sizeof(struct adapter) }; /* T4 port (cxgbe) interface */ static int cxgbe_probe(device_t); static int cxgbe_attach(device_t); static int cxgbe_detach(device_t); static device_method_t cxgbe_methods[] = { DEVMETHOD(device_probe, cxgbe_probe), DEVMETHOD(device_attach, cxgbe_attach), DEVMETHOD(device_detach, cxgbe_detach), { 0, 0 } }; static driver_t cxgbe_driver = { "cxgbe", cxgbe_methods, sizeof(struct port_info) }; /* T4 VI (vcxgbe) interface */ static int vcxgbe_probe(device_t); static int vcxgbe_attach(device_t); static int vcxgbe_detach(device_t); static device_method_t vcxgbe_methods[] = { DEVMETHOD(device_probe, vcxgbe_probe), DEVMETHOD(device_attach, vcxgbe_attach), DEVMETHOD(device_detach, vcxgbe_detach), { 0, 0 } }; static driver_t vcxgbe_driver = { "vcxgbe", vcxgbe_methods, sizeof(struct vi_info) }; static d_ioctl_t t4_ioctl; static d_open_t t4_open; static d_close_t t4_close; static struct cdevsw t4_cdevsw = { .d_version = D_VERSION, .d_flags = 0, .d_open = t4_open, .d_close = t4_close, .d_ioctl = t4_ioctl, .d_name = "t4nex", }; /* T5 bus driver interface */ static int t5_probe(device_t); static device_method_t t5_methods[] = { DEVMETHOD(device_probe, t5_probe), DEVMETHOD(device_attach, t4_attach), DEVMETHOD(device_detach, t4_detach), DEVMETHOD_END }; static driver_t t5_driver = { "t5nex", t5_methods, sizeof(struct adapter) }; /* T5 port (cxl) interface */ static driver_t cxl_driver = { "cxl", cxgbe_methods, sizeof(struct port_info) }; /* T5 VI (vcxl) interface */ static driver_t vcxl_driver = { "vcxl", vcxgbe_methods, sizeof(struct vi_info) }; static struct cdevsw t5_cdevsw = { .d_version = D_VERSION, .d_flags = 0, .d_open = t4_open, .d_close = t4_close, .d_ioctl = t4_ioctl, .d_name = "t5nex", }; /* ifnet + media interface */ static void cxgbe_init(void *); static int cxgbe_ioctl(struct ifnet *, unsigned long, caddr_t); static int cxgbe_transmit(struct ifnet *, struct mbuf *); static void cxgbe_qflush(struct ifnet *); static int cxgbe_media_change(struct ifnet *); static void cxgbe_media_status(struct ifnet *, struct ifmediareq *); MALLOC_DEFINE(M_CXGBE, "cxgbe", "Chelsio T4/T5 Ethernet driver and services"); /* * Correct lock order when you need to acquire multiple locks is t4_list_lock, * then ADAPTER_LOCK, then t4_uld_list_lock. */ static struct sx t4_list_lock; SLIST_HEAD(, adapter) t4_list; #ifdef TCP_OFFLOAD static struct sx t4_uld_list_lock; SLIST_HEAD(, uld_info) t4_uld_list; #endif /* * Tunables. See tweak_tunables() too. * * Each tunable is set to a default value here if it's known at compile-time. * Otherwise it is set to -1 as an indication to tweak_tunables() that it should * provide a reasonable default when the driver is loaded. * * Tunables applicable to both T4 and T5 are under hw.cxgbe. Those specific to * T5 are under hw.cxl. */ /* * Number of queues for tx and rx, 10G and 1G, NIC and offload. */ #define NTXQ_10G 16 static int t4_ntxq10g = -1; TUNABLE_INT("hw.cxgbe.ntxq10g", &t4_ntxq10g); #define NRXQ_10G 8 static int t4_nrxq10g = -1; TUNABLE_INT("hw.cxgbe.nrxq10g", &t4_nrxq10g); #define NTXQ_1G 4 static int t4_ntxq1g = -1; TUNABLE_INT("hw.cxgbe.ntxq1g", &t4_ntxq1g); #define NRXQ_1G 2 static int t4_nrxq1g = -1; TUNABLE_INT("hw.cxgbe.nrxq1g", &t4_nrxq1g); static int t4_rsrv_noflowq = 0; TUNABLE_INT("hw.cxgbe.rsrv_noflowq", &t4_rsrv_noflowq); #ifdef TCP_OFFLOAD #define NOFLDTXQ_10G 8 static int t4_nofldtxq10g = -1; TUNABLE_INT("hw.cxgbe.nofldtxq10g", &t4_nofldtxq10g); #define NOFLDRXQ_10G 2 static int t4_nofldrxq10g = -1; TUNABLE_INT("hw.cxgbe.nofldrxq10g", &t4_nofldrxq10g); #define NOFLDTXQ_1G 2 static int t4_nofldtxq1g = -1; TUNABLE_INT("hw.cxgbe.nofldtxq1g", &t4_nofldtxq1g); #define NOFLDRXQ_1G 1 static int t4_nofldrxq1g = -1; TUNABLE_INT("hw.cxgbe.nofldrxq1g", &t4_nofldrxq1g); #endif #ifdef DEV_NETMAP #define NNMTXQ_10G 2 static int t4_nnmtxq10g = -1; TUNABLE_INT("hw.cxgbe.nnmtxq10g", &t4_nnmtxq10g); #define NNMRXQ_10G 2 static int t4_nnmrxq10g = -1; TUNABLE_INT("hw.cxgbe.nnmrxq10g", &t4_nnmrxq10g); #define NNMTXQ_1G 1 static int t4_nnmtxq1g = -1; TUNABLE_INT("hw.cxgbe.nnmtxq1g", &t4_nnmtxq1g); #define NNMRXQ_1G 1 static int t4_nnmrxq1g = -1; TUNABLE_INT("hw.cxgbe.nnmrxq1g", &t4_nnmrxq1g); #endif /* * Holdoff parameters for 10G and 1G ports. */ #define TMR_IDX_10G 1 static int t4_tmr_idx_10g = TMR_IDX_10G; TUNABLE_INT("hw.cxgbe.holdoff_timer_idx_10G", &t4_tmr_idx_10g); #define PKTC_IDX_10G (-1) static int t4_pktc_idx_10g = PKTC_IDX_10G; TUNABLE_INT("hw.cxgbe.holdoff_pktc_idx_10G", &t4_pktc_idx_10g); #define TMR_IDX_1G 1 static int t4_tmr_idx_1g = TMR_IDX_1G; TUNABLE_INT("hw.cxgbe.holdoff_timer_idx_1G", &t4_tmr_idx_1g); #define PKTC_IDX_1G (-1) static int t4_pktc_idx_1g = PKTC_IDX_1G; TUNABLE_INT("hw.cxgbe.holdoff_pktc_idx_1G", &t4_pktc_idx_1g); /* * Size (# of entries) of each tx and rx queue. */ static unsigned int t4_qsize_txq = TX_EQ_QSIZE; TUNABLE_INT("hw.cxgbe.qsize_txq", &t4_qsize_txq); static unsigned int t4_qsize_rxq = RX_IQ_QSIZE; TUNABLE_INT("hw.cxgbe.qsize_rxq", &t4_qsize_rxq); /* * Interrupt types allowed (bits 0, 1, 2 = INTx, MSI, MSI-X respectively). */ static int t4_intr_types = INTR_MSIX | INTR_MSI | INTR_INTX; TUNABLE_INT("hw.cxgbe.interrupt_types", &t4_intr_types); /* * Configuration file. */ #define DEFAULT_CF "default" #define FLASH_CF "flash" #define UWIRE_CF "uwire" #define FPGA_CF "fpga" static char t4_cfg_file[32] = DEFAULT_CF; TUNABLE_STR("hw.cxgbe.config_file", t4_cfg_file, sizeof(t4_cfg_file)); /* * PAUSE settings (bit 0, 1 = rx_pause, tx_pause respectively). * rx_pause = 1 to heed incoming PAUSE frames, 0 to ignore them. * tx_pause = 1 to emit PAUSE frames when the rx FIFO reaches its high water * mark or when signalled to do so, 0 to never emit PAUSE. */ static int t4_pause_settings = PAUSE_TX | PAUSE_RX; TUNABLE_INT("hw.cxgbe.pause_settings", &t4_pause_settings); /* * Firmware auto-install by driver during attach (0, 1, 2 = prohibited, allowed, * encouraged respectively). */ static unsigned int t4_fw_install = 1; TUNABLE_INT("hw.cxgbe.fw_install", &t4_fw_install); /* * ASIC features that will be used. Disable the ones you don't want so that the * chip resources aren't wasted on features that will not be used. */ static int t4_nbmcaps_allowed = 0; TUNABLE_INT("hw.cxgbe.nbmcaps_allowed", &t4_nbmcaps_allowed); static int t4_linkcaps_allowed = 0; /* No DCBX, PPP, etc. by default */ TUNABLE_INT("hw.cxgbe.linkcaps_allowed", &t4_linkcaps_allowed); static int t4_switchcaps_allowed = FW_CAPS_CONFIG_SWITCH_INGRESS | FW_CAPS_CONFIG_SWITCH_EGRESS; TUNABLE_INT("hw.cxgbe.switchcaps_allowed", &t4_switchcaps_allowed); static int t4_niccaps_allowed = FW_CAPS_CONFIG_NIC; TUNABLE_INT("hw.cxgbe.niccaps_allowed", &t4_niccaps_allowed); static int t4_toecaps_allowed = -1; TUNABLE_INT("hw.cxgbe.toecaps_allowed", &t4_toecaps_allowed); static int t4_rdmacaps_allowed = -1; TUNABLE_INT("hw.cxgbe.rdmacaps_allowed", &t4_rdmacaps_allowed); static int t4_tlscaps_allowed = 0; TUNABLE_INT("hw.cxgbe.tlscaps_allowed", &t4_tlscaps_allowed); static int t4_iscsicaps_allowed = -1; TUNABLE_INT("hw.cxgbe.iscsicaps_allowed", &t4_iscsicaps_allowed); static int t4_fcoecaps_allowed = 0; TUNABLE_INT("hw.cxgbe.fcoecaps_allowed", &t4_fcoecaps_allowed); static int t5_write_combine = 0; TUNABLE_INT("hw.cxl.write_combine", &t5_write_combine); static int t4_num_vis = 1; TUNABLE_INT("hw.cxgbe.num_vis", &t4_num_vis); /* Functions used by extra VIs to obtain unique MAC addresses for each VI. */ static int vi_mac_funcs[] = { FW_VI_FUNC_OFLD, FW_VI_FUNC_IWARP, FW_VI_FUNC_OPENISCSI, FW_VI_FUNC_OPENFCOE, FW_VI_FUNC_FOISCSI, FW_VI_FUNC_FOFCOE, }; struct intrs_and_queues { uint16_t intr_type; /* INTx, MSI, or MSI-X */ uint16_t nirq; /* Total # of vectors */ uint16_t intr_flags_10g;/* Interrupt flags for each 10G port */ uint16_t intr_flags_1g; /* Interrupt flags for each 1G port */ uint16_t ntxq10g; /* # of NIC txq's for each 10G port */ uint16_t nrxq10g; /* # of NIC rxq's for each 10G port */ uint16_t ntxq1g; /* # of NIC txq's for each 1G port */ uint16_t nrxq1g; /* # of NIC rxq's for each 1G port */ uint16_t rsrv_noflowq; /* Flag whether to reserve queue 0 */ #ifdef TCP_OFFLOAD uint16_t nofldtxq10g; /* # of TOE txq's for each 10G port */ uint16_t nofldrxq10g; /* # of TOE rxq's for each 10G port */ uint16_t nofldtxq1g; /* # of TOE txq's for each 1G port */ uint16_t nofldrxq1g; /* # of TOE rxq's for each 1G port */ #endif #ifdef DEV_NETMAP uint16_t nnmtxq10g; /* # of netmap txq's for each 10G port */ uint16_t nnmrxq10g; /* # of netmap rxq's for each 10G port */ uint16_t nnmtxq1g; /* # of netmap txq's for each 1G port */ uint16_t nnmrxq1g; /* # of netmap rxq's for each 1G port */ #endif }; struct filter_entry { uint32_t valid:1; /* filter allocated and valid */ uint32_t locked:1; /* filter is administratively locked */ uint32_t pending:1; /* filter action is pending firmware reply */ uint32_t smtidx:8; /* Source MAC Table index for smac */ struct l2t_entry *l2t; /* Layer Two Table entry for dmac */ struct t4_filter_specification fs; }; static int map_bars_0_and_4(struct adapter *); static int map_bar_2(struct adapter *); static void setup_memwin(struct adapter *); static void position_memwin(struct adapter *, int, uint32_t); static int rw_via_memwin(struct adapter *, int, uint32_t, uint32_t *, int, int); static inline int read_via_memwin(struct adapter *, int, uint32_t, uint32_t *, int); static inline int write_via_memwin(struct adapter *, int, uint32_t, const uint32_t *, int); static int validate_mem_range(struct adapter *, uint32_t, int); static int fwmtype_to_hwmtype(int); static int validate_mt_off_len(struct adapter *, int, uint32_t, int, uint32_t *); static int fixup_devlog_params(struct adapter *); static int cfg_itype_and_nqueues(struct adapter *, int, int, int, struct intrs_and_queues *); static int prep_firmware(struct adapter *); static int partition_resources(struct adapter *, const struct firmware *, const char *); static int get_params__pre_init(struct adapter *); static int get_params__post_init(struct adapter *); static int set_params__post_init(struct adapter *); static void t4_set_desc(struct adapter *); static void build_medialist(struct port_info *, struct ifmedia *); static int cxgbe_init_synchronized(struct vi_info *); static int cxgbe_uninit_synchronized(struct vi_info *); static int setup_intr_handlers(struct adapter *); static void quiesce_txq(struct adapter *, struct sge_txq *); static void quiesce_wrq(struct adapter *, struct sge_wrq *); static void quiesce_iq(struct adapter *, struct sge_iq *); static void quiesce_fl(struct adapter *, struct sge_fl *); static int t4_alloc_irq(struct adapter *, struct irq *, int rid, driver_intr_t *, void *, char *); static int t4_free_irq(struct adapter *, struct irq *); static void get_regs(struct adapter *, struct t4_regdump *, uint8_t *); static void vi_refresh_stats(struct adapter *, struct vi_info *); static void cxgbe_refresh_stats(struct adapter *, struct port_info *); static void cxgbe_tick(void *); static void cxgbe_vlan_config(void *, struct ifnet *, uint16_t); static int cpl_not_handled(struct sge_iq *, const struct rss_header *, struct mbuf *); static int an_not_handled(struct sge_iq *, const struct rsp_ctrl *); static int fw_msg_not_handled(struct adapter *, const __be64 *); static void t4_sysctls(struct adapter *); static void cxgbe_sysctls(struct port_info *); static int sysctl_int_array(SYSCTL_HANDLER_ARGS); static int sysctl_bitfield(SYSCTL_HANDLER_ARGS); static int sysctl_btphy(SYSCTL_HANDLER_ARGS); static int sysctl_noflowq(SYSCTL_HANDLER_ARGS); static int sysctl_holdoff_tmr_idx(SYSCTL_HANDLER_ARGS); static int sysctl_holdoff_pktc_idx(SYSCTL_HANDLER_ARGS); static int sysctl_qsize_rxq(SYSCTL_HANDLER_ARGS); static int sysctl_qsize_txq(SYSCTL_HANDLER_ARGS); static int sysctl_pause_settings(SYSCTL_HANDLER_ARGS); static int sysctl_handle_t4_reg64(SYSCTL_HANDLER_ARGS); static int sysctl_temperature(SYSCTL_HANDLER_ARGS); #ifdef SBUF_DRAIN static int sysctl_cctrl(SYSCTL_HANDLER_ARGS); static int sysctl_cim_ibq_obq(SYSCTL_HANDLER_ARGS); static int sysctl_cim_la(SYSCTL_HANDLER_ARGS); static int sysctl_cim_la_t6(SYSCTL_HANDLER_ARGS); static int sysctl_cim_ma_la(SYSCTL_HANDLER_ARGS); static int sysctl_cim_pif_la(SYSCTL_HANDLER_ARGS); static int sysctl_cim_qcfg(SYSCTL_HANDLER_ARGS); static int sysctl_cpl_stats(SYSCTL_HANDLER_ARGS); static int sysctl_ddp_stats(SYSCTL_HANDLER_ARGS); static int sysctl_devlog(SYSCTL_HANDLER_ARGS); static int sysctl_fcoe_stats(SYSCTL_HANDLER_ARGS); static int sysctl_hw_sched(SYSCTL_HANDLER_ARGS); static int sysctl_lb_stats(SYSCTL_HANDLER_ARGS); static int sysctl_linkdnrc(SYSCTL_HANDLER_ARGS); static int sysctl_meminfo(SYSCTL_HANDLER_ARGS); static int sysctl_mps_tcam(SYSCTL_HANDLER_ARGS); static int sysctl_mps_tcam_t6(SYSCTL_HANDLER_ARGS); static int sysctl_path_mtus(SYSCTL_HANDLER_ARGS); static int sysctl_pm_stats(SYSCTL_HANDLER_ARGS); static int sysctl_rdma_stats(SYSCTL_HANDLER_ARGS); static int sysctl_tcp_stats(SYSCTL_HANDLER_ARGS); static int sysctl_tids(SYSCTL_HANDLER_ARGS); static int sysctl_tp_err_stats(SYSCTL_HANDLER_ARGS); static int sysctl_tp_la_mask(SYSCTL_HANDLER_ARGS); static int sysctl_tp_la(SYSCTL_HANDLER_ARGS); static int sysctl_tx_rate(SYSCTL_HANDLER_ARGS); static int sysctl_ulprx_la(SYSCTL_HANDLER_ARGS); static int sysctl_wcwr_stats(SYSCTL_HANDLER_ARGS); +static int sysctl_tc_params(SYSCTL_HANDLER_ARGS); #endif #ifdef TCP_OFFLOAD static int sysctl_tp_tick(SYSCTL_HANDLER_ARGS); static int sysctl_tp_dack_timer(SYSCTL_HANDLER_ARGS); static int sysctl_tp_timer(SYSCTL_HANDLER_ARGS); #endif static uint32_t fconf_iconf_to_mode(uint32_t, uint32_t); static uint32_t mode_to_fconf(uint32_t); static uint32_t mode_to_iconf(uint32_t); static int check_fspec_against_fconf_iconf(struct adapter *, struct t4_filter_specification *); static int get_filter_mode(struct adapter *, uint32_t *); static int set_filter_mode(struct adapter *, uint32_t); static inline uint64_t get_filter_hits(struct adapter *, uint32_t); static int get_filter(struct adapter *, struct t4_filter *); static int set_filter(struct adapter *, struct t4_filter *); static int del_filter(struct adapter *, struct t4_filter *); static void clear_filter(struct filter_entry *); static int set_filter_wr(struct adapter *, int); static int del_filter_wr(struct adapter *, int); static int get_sge_context(struct adapter *, struct t4_sge_context *); static int load_fw(struct adapter *, struct t4_data *); static int read_card_mem(struct adapter *, int, struct t4_mem_range *); static int read_i2c(struct adapter *, struct t4_i2c_data *); static int set_sched_class(struct adapter *, struct t4_sched_params *); static int set_sched_queue(struct adapter *, struct t4_sched_queue *); #ifdef TCP_OFFLOAD static int toe_capability(struct vi_info *, int); #endif static int mod_event(module_t, int, void *); struct { uint16_t device; char *desc; } t4_pciids[] = { {0xa000, "Chelsio Terminator 4 FPGA"}, {0x4400, "Chelsio T440-dbg"}, {0x4401, "Chelsio T420-CR"}, {0x4402, "Chelsio T422-CR"}, {0x4403, "Chelsio T440-CR"}, {0x4404, "Chelsio T420-BCH"}, {0x4405, "Chelsio T440-BCH"}, {0x4406, "Chelsio T440-CH"}, {0x4407, "Chelsio T420-SO"}, {0x4408, "Chelsio T420-CX"}, {0x4409, "Chelsio T420-BT"}, {0x440a, "Chelsio T404-BT"}, {0x440e, "Chelsio T440-LP-CR"}, }, t5_pciids[] = { {0xb000, "Chelsio Terminator 5 FPGA"}, {0x5400, "Chelsio T580-dbg"}, {0x5401, "Chelsio T520-CR"}, /* 2 x 10G */ {0x5402, "Chelsio T522-CR"}, /* 2 x 10G, 2 X 1G */ {0x5403, "Chelsio T540-CR"}, /* 4 x 10G */ {0x5407, "Chelsio T520-SO"}, /* 2 x 10G, nomem */ {0x5409, "Chelsio T520-BT"}, /* 2 x 10GBaseT */ {0x540a, "Chelsio T504-BT"}, /* 4 x 1G */ {0x540d, "Chelsio T580-CR"}, /* 2 x 40G */ {0x540e, "Chelsio T540-LP-CR"}, /* 4 x 10G */ {0x5410, "Chelsio T580-LP-CR"}, /* 2 x 40G */ {0x5411, "Chelsio T520-LL-CR"}, /* 2 x 10G */ {0x5412, "Chelsio T560-CR"}, /* 1 x 40G, 2 x 10G */ {0x5414, "Chelsio T580-LP-SO-CR"}, /* 2 x 40G, nomem */ {0x5415, "Chelsio T502-BT"}, /* 2 x 1G */ #ifdef notyet {0x5404, "Chelsio T520-BCH"}, {0x5405, "Chelsio T540-BCH"}, {0x5406, "Chelsio T540-CH"}, {0x5408, "Chelsio T520-CX"}, {0x540b, "Chelsio B520-SR"}, {0x540c, "Chelsio B504-BT"}, {0x540f, "Chelsio Amsterdam"}, {0x5413, "Chelsio T580-CHR"}, #endif }; #ifdef TCP_OFFLOAD /* * service_iq() has an iq and needs the fl. Offset of fl from the iq should be * exactly the same for both rxq and ofld_rxq. */ CTASSERT(offsetof(struct sge_ofld_rxq, iq) == offsetof(struct sge_rxq, iq)); CTASSERT(offsetof(struct sge_ofld_rxq, fl) == offsetof(struct sge_rxq, fl)); #endif /* No easy way to include t4_msg.h before adapter.h so we check this way */ CTASSERT(nitems(((struct adapter *)0)->cpl_handler) == NUM_CPL_CMDS); CTASSERT(nitems(((struct adapter *)0)->fw_msg_handler) == NUM_FW6_TYPES); CTASSERT(sizeof(struct cluster_metadata) <= CL_METADATA_SIZE); static int t4_probe(device_t dev) { int i; uint16_t v = pci_get_vendor(dev); uint16_t d = pci_get_device(dev); uint8_t f = pci_get_function(dev); if (v != PCI_VENDOR_ID_CHELSIO) return (ENXIO); /* Attach only to PF0 of the FPGA */ if (d == 0xa000 && f != 0) return (ENXIO); for (i = 0; i < nitems(t4_pciids); i++) { if (d == t4_pciids[i].device) { device_set_desc(dev, t4_pciids[i].desc); return (BUS_PROBE_DEFAULT); } } return (ENXIO); } static int t5_probe(device_t dev) { int i; uint16_t v = pci_get_vendor(dev); uint16_t d = pci_get_device(dev); uint8_t f = pci_get_function(dev); if (v != PCI_VENDOR_ID_CHELSIO) return (ENXIO); /* Attach only to PF0 of the FPGA */ if (d == 0xb000 && f != 0) return (ENXIO); for (i = 0; i < nitems(t5_pciids); i++) { if (d == t5_pciids[i].device) { device_set_desc(dev, t5_pciids[i].desc); return (BUS_PROBE_DEFAULT); } } return (ENXIO); } static void t5_attribute_workaround(device_t dev) { device_t root_port; uint32_t v; /* * The T5 chips do not properly echo the No Snoop and Relaxed * Ordering attributes when replying to a TLP from a Root * Port. As a workaround, find the parent Root Port and * disable No Snoop and Relaxed Ordering. Note that this * affects all devices under this root port. */ root_port = pci_find_pcie_root_port(dev); if (root_port == NULL) { device_printf(dev, "Unable to find parent root port\n"); return; } v = pcie_adjust_config(root_port, PCIER_DEVICE_CTL, PCIEM_CTL_RELAXED_ORD_ENABLE | PCIEM_CTL_NOSNOOP_ENABLE, 0, 2); if ((v & (PCIEM_CTL_RELAXED_ORD_ENABLE | PCIEM_CTL_NOSNOOP_ENABLE)) != 0) device_printf(dev, "Disabled No Snoop/Relaxed Ordering on %s\n", device_get_nameunit(root_port)); } static int t4_attach(device_t dev) { struct adapter *sc; int rc = 0, i, j, n10g, n1g, rqidx, tqidx; struct intrs_and_queues iaq; struct sge *s; uint8_t *buf; #ifdef TCP_OFFLOAD int ofld_rqidx, ofld_tqidx; #endif #ifdef DEV_NETMAP int nm_rqidx, nm_tqidx; #endif int num_vis; sc = device_get_softc(dev); sc->dev = dev; TUNABLE_INT_FETCH("hw.cxgbe.debug_flags", &sc->debug_flags); if ((pci_get_device(dev) & 0xff00) == 0x5400) t5_attribute_workaround(dev); pci_enable_busmaster(dev); if (pci_find_cap(dev, PCIY_EXPRESS, &i) == 0) { uint32_t v; pci_set_max_read_req(dev, 4096); v = pci_read_config(dev, i + PCIER_DEVICE_CTL, 2); v |= PCIEM_CTL_RELAXED_ORD_ENABLE; pci_write_config(dev, i + PCIER_DEVICE_CTL, v, 2); sc->params.pci.mps = 128 << ((v & PCIEM_CTL_MAX_PAYLOAD) >> 5); } sc->traceq = -1; mtx_init(&sc->ifp_lock, sc->ifp_lockname, 0, MTX_DEF); snprintf(sc->ifp_lockname, sizeof(sc->ifp_lockname), "%s tracer", device_get_nameunit(dev)); snprintf(sc->lockname, sizeof(sc->lockname), "%s", device_get_nameunit(dev)); mtx_init(&sc->sc_lock, sc->lockname, 0, MTX_DEF); sx_xlock(&t4_list_lock); SLIST_INSERT_HEAD(&t4_list, sc, link); sx_xunlock(&t4_list_lock); mtx_init(&sc->sfl_lock, "starving freelists", 0, MTX_DEF); TAILQ_INIT(&sc->sfl); callout_init_mtx(&sc->sfl_callout, &sc->sfl_lock, 0); mtx_init(&sc->reg_lock, "indirect register access", 0, MTX_DEF); rc = map_bars_0_and_4(sc); if (rc != 0) goto done; /* error message displayed already */ /* * This is the real PF# to which we're attaching. Works from within PCI * passthrough environments too, where pci_get_function() could return a * different PF# depending on the passthrough configuration. We need to * use the real PF# in all our communication with the firmware. */ sc->pf = G_SOURCEPF(t4_read_reg(sc, A_PL_WHOAMI)); sc->mbox = sc->pf; memset(sc->chan_map, 0xff, sizeof(sc->chan_map)); sc->an_handler = an_not_handled; for (i = 0; i < nitems(sc->cpl_handler); i++) sc->cpl_handler[i] = cpl_not_handled; for (i = 0; i < nitems(sc->fw_msg_handler); i++) sc->fw_msg_handler[i] = fw_msg_not_handled; t4_register_cpl_handler(sc, CPL_SET_TCB_RPL, t4_filter_rpl); t4_register_cpl_handler(sc, CPL_TRACE_PKT, t4_trace_pkt); t4_register_cpl_handler(sc, CPL_T5_TRACE_PKT, t5_trace_pkt); t4_init_sge_cpl_handlers(sc); /* Prepare the adapter for operation. */ buf = malloc(PAGE_SIZE, M_CXGBE, M_ZERO | M_WAITOK); rc = -t4_prep_adapter(sc, buf); free(buf, M_CXGBE); if (rc != 0) { device_printf(dev, "failed to prepare adapter: %d.\n", rc); goto done; } /* * Do this really early, with the memory windows set up even before the * character device. The userland tool's register i/o and mem read * will work even in "recovery mode". */ setup_memwin(sc); if (t4_init_devlog_params(sc, 0) == 0) fixup_devlog_params(sc); sc->cdev = make_dev(is_t4(sc) ? &t4_cdevsw : &t5_cdevsw, device_get_unit(dev), UID_ROOT, GID_WHEEL, 0600, "%s", device_get_nameunit(dev)); if (sc->cdev == NULL) device_printf(dev, "failed to create nexus char device.\n"); else sc->cdev->si_drv1 = sc; /* Go no further if recovery mode has been requested. */ if (TUNABLE_INT_FETCH("hw.cxgbe.sos", &i) && i != 0) { device_printf(dev, "recovery mode.\n"); goto done; } #if defined(__i386__) if ((cpu_feature & CPUID_CX8) == 0) { device_printf(dev, "64 bit atomics not available.\n"); rc = ENOTSUP; goto done; } #endif /* Prepare the firmware for operation */ rc = prep_firmware(sc); if (rc != 0) goto done; /* error message displayed already */ rc = get_params__post_init(sc); if (rc != 0) goto done; /* error message displayed already */ rc = set_params__post_init(sc); if (rc != 0) goto done; /* error message displayed already */ rc = map_bar_2(sc); if (rc != 0) goto done; /* error message displayed already */ rc = t4_create_dma_tag(sc); if (rc != 0) goto done; /* error message displayed already */ /* * Number of VIs to create per-port. The first VI is the * "main" regular VI for the port. The second VI is used for * netmap if present, and any remaining VIs are used for * additional virtual interfaces. * * Limit the number of VIs per port to the number of available * MAC addresses per port. */ if (t4_num_vis >= 1) num_vis = t4_num_vis; else num_vis = 1; #ifdef DEV_NETMAP num_vis++; #endif if (num_vis > nitems(vi_mac_funcs)) { num_vis = nitems(vi_mac_funcs); device_printf(dev, "Number of VIs limited to %d\n", num_vis); } /* * First pass over all the ports - allocate VIs and initialize some * basic parameters like mac address, port type, etc. We also figure * out whether a port is 10G or 1G and use that information when * calculating how many interrupts to attempt to allocate. */ n10g = n1g = 0; for_each_port(sc, i) { struct port_info *pi; struct vi_info *vi; pi = malloc(sizeof(*pi), M_CXGBE, M_ZERO | M_WAITOK); sc->port[i] = pi; /* These must be set before t4_port_init */ pi->adapter = sc; pi->port_id = i; pi->nvi = num_vis; pi->vi = malloc(sizeof(struct vi_info) * num_vis, M_CXGBE, M_ZERO | M_WAITOK); /* * Allocate the "main" VI and initialize parameters * like mac addr. */ rc = -t4_port_init(sc, sc->mbox, sc->pf, 0, i); if (rc != 0) { device_printf(dev, "unable to initialize port %d: %d\n", i, rc); free(pi->vi, M_CXGBE); free(pi, M_CXGBE); sc->port[i] = NULL; goto done; } pi->link_cfg.requested_fc &= ~(PAUSE_TX | PAUSE_RX); pi->link_cfg.requested_fc |= t4_pause_settings; pi->link_cfg.fc &= ~(PAUSE_TX | PAUSE_RX); pi->link_cfg.fc |= t4_pause_settings; rc = -t4_link_l1cfg(sc, sc->mbox, pi->tx_chan, &pi->link_cfg); if (rc != 0) { device_printf(dev, "port %d l1cfg failed: %d\n", i, rc); free(pi->vi, M_CXGBE); free(pi, M_CXGBE); sc->port[i] = NULL; goto done; } snprintf(pi->lockname, sizeof(pi->lockname), "%sp%d", device_get_nameunit(dev), i); mtx_init(&pi->pi_lock, pi->lockname, 0, MTX_DEF); sc->chan_map[pi->tx_chan] = i; + pi->tc = malloc(sizeof(struct tx_sched_class) * + sc->chip_params->nsched_cls, M_CXGBE, M_ZERO | M_WAITOK); + if (is_10G_port(pi) || is_40G_port(pi)) { n10g++; for_each_vi(pi, j, vi) { vi->tmr_idx = t4_tmr_idx_10g; vi->pktc_idx = t4_pktc_idx_10g; } } else { n1g++; for_each_vi(pi, j, vi) { vi->tmr_idx = t4_tmr_idx_1g; vi->pktc_idx = t4_pktc_idx_1g; } } pi->linkdnrc = -1; for_each_vi(pi, j, vi) { vi->qsize_rxq = t4_qsize_rxq; vi->qsize_txq = t4_qsize_txq; vi->pi = pi; } pi->dev = device_add_child(dev, is_t4(sc) ? "cxgbe" : "cxl", -1); if (pi->dev == NULL) { device_printf(dev, "failed to add device for port %d.\n", i); rc = ENXIO; goto done; } pi->vi[0].dev = pi->dev; device_set_softc(pi->dev, pi); } /* * Interrupt type, # of interrupts, # of rx/tx queues, etc. */ #ifdef DEV_NETMAP num_vis--; #endif rc = cfg_itype_and_nqueues(sc, n10g, n1g, num_vis, &iaq); if (rc != 0) goto done; /* error message displayed already */ sc->intr_type = iaq.intr_type; sc->intr_count = iaq.nirq; s = &sc->sge; s->nrxq = n10g * iaq.nrxq10g + n1g * iaq.nrxq1g; s->ntxq = n10g * iaq.ntxq10g + n1g * iaq.ntxq1g; if (num_vis > 1) { s->nrxq += (n10g + n1g) * (num_vis - 1); s->ntxq += (n10g + n1g) * (num_vis - 1); } s->neq = s->ntxq + s->nrxq; /* the free list in an rxq is an eq */ s->neq += sc->params.nports + 1;/* ctrl queues: 1 per port + 1 mgmt */ s->niq = s->nrxq + 1; /* 1 extra for firmware event queue */ #ifdef TCP_OFFLOAD if (is_offload(sc)) { s->nofldrxq = n10g * iaq.nofldrxq10g + n1g * iaq.nofldrxq1g; s->nofldtxq = n10g * iaq.nofldtxq10g + n1g * iaq.nofldtxq1g; if (num_vis > 1) { s->nofldrxq += (n10g + n1g) * (num_vis - 1); s->nofldtxq += (n10g + n1g) * (num_vis - 1); } s->neq += s->nofldtxq + s->nofldrxq; s->niq += s->nofldrxq; s->ofld_rxq = malloc(s->nofldrxq * sizeof(struct sge_ofld_rxq), M_CXGBE, M_ZERO | M_WAITOK); s->ofld_txq = malloc(s->nofldtxq * sizeof(struct sge_wrq), M_CXGBE, M_ZERO | M_WAITOK); } #endif #ifdef DEV_NETMAP s->nnmrxq = n10g * iaq.nnmrxq10g + n1g * iaq.nnmrxq1g; s->nnmtxq = n10g * iaq.nnmtxq10g + n1g * iaq.nnmtxq1g; s->neq += s->nnmtxq + s->nnmrxq; s->niq += s->nnmrxq; s->nm_rxq = malloc(s->nnmrxq * sizeof(struct sge_nm_rxq), M_CXGBE, M_ZERO | M_WAITOK); s->nm_txq = malloc(s->nnmtxq * sizeof(struct sge_nm_txq), M_CXGBE, M_ZERO | M_WAITOK); #endif s->ctrlq = malloc(sc->params.nports * sizeof(struct sge_wrq), M_CXGBE, M_ZERO | M_WAITOK); s->rxq = malloc(s->nrxq * sizeof(struct sge_rxq), M_CXGBE, M_ZERO | M_WAITOK); s->txq = malloc(s->ntxq * sizeof(struct sge_txq), M_CXGBE, M_ZERO | M_WAITOK); s->iqmap = malloc(s->niq * sizeof(struct sge_iq *), M_CXGBE, M_ZERO | M_WAITOK); s->eqmap = malloc(s->neq * sizeof(struct sge_eq *), M_CXGBE, M_ZERO | M_WAITOK); sc->irq = malloc(sc->intr_count * sizeof(struct irq), M_CXGBE, M_ZERO | M_WAITOK); t4_init_l2t(sc, M_WAITOK); /* * Second pass over the ports. This time we know the number of rx and * tx queues that each port should get. */ rqidx = tqidx = 0; #ifdef TCP_OFFLOAD ofld_rqidx = ofld_tqidx = 0; #endif #ifdef DEV_NETMAP nm_rqidx = nm_tqidx = 0; #endif for_each_port(sc, i) { struct port_info *pi = sc->port[i]; struct vi_info *vi; if (pi == NULL) continue; for_each_vi(pi, j, vi) { #ifdef DEV_NETMAP if (j == 1) { vi->flags |= VI_NETMAP | INTR_RXQ; vi->first_rxq = nm_rqidx; vi->first_txq = nm_tqidx; if (is_10G_port(pi) || is_40G_port(pi)) { vi->nrxq = iaq.nnmrxq10g; vi->ntxq = iaq.nnmtxq10g; } else { vi->nrxq = iaq.nnmrxq1g; vi->ntxq = iaq.nnmtxq1g; } nm_rqidx += vi->nrxq; nm_tqidx += vi->ntxq; continue; } #endif vi->first_rxq = rqidx; vi->first_txq = tqidx; if (is_10G_port(pi) || is_40G_port(pi)) { vi->flags |= iaq.intr_flags_10g & INTR_RXQ; vi->nrxq = j == 0 ? iaq.nrxq10g : 1; vi->ntxq = j == 0 ? iaq.ntxq10g : 1; } else { vi->flags |= iaq.intr_flags_1g & INTR_RXQ; vi->nrxq = j == 0 ? iaq.nrxq1g : 1; vi->ntxq = j == 0 ? iaq.ntxq1g : 1; } if (vi->ntxq > 1) vi->rsrv_noflowq = iaq.rsrv_noflowq ? 1 : 0; else vi->rsrv_noflowq = 0; rqidx += vi->nrxq; tqidx += vi->ntxq; #ifdef TCP_OFFLOAD if (!is_offload(sc)) continue; vi->first_ofld_rxq = ofld_rqidx; vi->first_ofld_txq = ofld_tqidx; if (is_10G_port(pi) || is_40G_port(pi)) { vi->flags |= iaq.intr_flags_10g & INTR_OFLD_RXQ; vi->nofldrxq = j == 0 ? iaq.nofldrxq10g : 1; vi->nofldtxq = j == 0 ? iaq.nofldtxq10g : 1; } else { vi->flags |= iaq.intr_flags_1g & INTR_OFLD_RXQ; vi->nofldrxq = j == 0 ? iaq.nofldrxq1g : 1; vi->nofldtxq = j == 0 ? iaq.nofldtxq1g : 1; } ofld_rqidx += vi->nofldrxq; ofld_tqidx += vi->nofldtxq; #endif } } rc = setup_intr_handlers(sc); if (rc != 0) { device_printf(dev, "failed to setup interrupt handlers: %d\n", rc); goto done; } rc = bus_generic_attach(dev); if (rc != 0) { device_printf(dev, "failed to attach all child ports: %d\n", rc); goto done; } device_printf(dev, "PCIe gen%d x%d, %d ports, %d %s interrupt%s, %d eq, %d iq\n", sc->params.pci.speed, sc->params.pci.width, sc->params.nports, sc->intr_count, sc->intr_type == INTR_MSIX ? "MSI-X" : (sc->intr_type == INTR_MSI ? "MSI" : "INTx"), sc->intr_count > 1 ? "s" : "", sc->sge.neq, sc->sge.niq); t4_set_desc(sc); done: if (rc != 0 && sc->cdev) { /* cdev was created and so cxgbetool works; recover that way. */ device_printf(dev, "error during attach, adapter is now in recovery mode.\n"); rc = 0; } if (rc != 0) t4_detach(dev); else t4_sysctls(sc); return (rc); } /* * Idempotent */ static int t4_detach(device_t dev) { struct adapter *sc; struct port_info *pi; int i, rc; sc = device_get_softc(dev); if (sc->flags & FULL_INIT_DONE) t4_intr_disable(sc); if (sc->cdev) { destroy_dev(sc->cdev); sc->cdev = NULL; } rc = bus_generic_detach(dev); if (rc) { device_printf(dev, "failed to detach child devices: %d\n", rc); return (rc); } for (i = 0; i < sc->intr_count; i++) t4_free_irq(sc, &sc->irq[i]); for (i = 0; i < MAX_NPORTS; i++) { pi = sc->port[i]; if (pi) { t4_free_vi(sc, sc->mbox, sc->pf, 0, pi->vi[0].viid); if (pi->dev) device_delete_child(dev, pi->dev); mtx_destroy(&pi->pi_lock); free(pi->vi, M_CXGBE); + free(pi->tc, M_CXGBE); free(pi, M_CXGBE); } } if (sc->flags & FULL_INIT_DONE) adapter_full_uninit(sc); if (sc->flags & FW_OK) t4_fw_bye(sc, sc->mbox); if (sc->intr_type == INTR_MSI || sc->intr_type == INTR_MSIX) pci_release_msi(dev); if (sc->regs_res) bus_release_resource(dev, SYS_RES_MEMORY, sc->regs_rid, sc->regs_res); if (sc->udbs_res) bus_release_resource(dev, SYS_RES_MEMORY, sc->udbs_rid, sc->udbs_res); if (sc->msix_res) bus_release_resource(dev, SYS_RES_MEMORY, sc->msix_rid, sc->msix_res); if (sc->l2t) t4_free_l2t(sc->l2t); #ifdef TCP_OFFLOAD free(sc->sge.ofld_rxq, M_CXGBE); free(sc->sge.ofld_txq, M_CXGBE); #endif #ifdef DEV_NETMAP free(sc->sge.nm_rxq, M_CXGBE); free(sc->sge.nm_txq, M_CXGBE); #endif free(sc->irq, M_CXGBE); free(sc->sge.rxq, M_CXGBE); free(sc->sge.txq, M_CXGBE); free(sc->sge.ctrlq, M_CXGBE); free(sc->sge.iqmap, M_CXGBE); free(sc->sge.eqmap, M_CXGBE); free(sc->tids.ftid_tab, M_CXGBE); t4_destroy_dma_tag(sc); if (mtx_initialized(&sc->sc_lock)) { sx_xlock(&t4_list_lock); SLIST_REMOVE(&t4_list, sc, adapter, link); sx_xunlock(&t4_list_lock); mtx_destroy(&sc->sc_lock); } callout_drain(&sc->sfl_callout); if (mtx_initialized(&sc->tids.ftid_lock)) mtx_destroy(&sc->tids.ftid_lock); if (mtx_initialized(&sc->sfl_lock)) mtx_destroy(&sc->sfl_lock); if (mtx_initialized(&sc->ifp_lock)) mtx_destroy(&sc->ifp_lock); if (mtx_initialized(&sc->reg_lock)) mtx_destroy(&sc->reg_lock); for (i = 0; i < NUM_MEMWIN; i++) { struct memwin *mw = &sc->memwin[i]; if (rw_initialized(&mw->mw_lock)) rw_destroy(&mw->mw_lock); } bzero(sc, sizeof(*sc)); return (0); } static int cxgbe_probe(device_t dev) { char buf[128]; struct port_info *pi = device_get_softc(dev); snprintf(buf, sizeof(buf), "port %d", pi->port_id); device_set_desc_copy(dev, buf); return (BUS_PROBE_DEFAULT); } #define T4_CAP (IFCAP_VLAN_HWTAGGING | IFCAP_VLAN_MTU | IFCAP_HWCSUM | \ IFCAP_VLAN_HWCSUM | IFCAP_TSO | IFCAP_JUMBO_MTU | IFCAP_LRO | \ IFCAP_VLAN_HWTSO | IFCAP_LINKSTATE | IFCAP_HWCSUM_IPV6 | IFCAP_HWSTATS) #define T4_CAP_ENABLE (T4_CAP) static int cxgbe_vi_attach(device_t dev, struct vi_info *vi) { struct ifnet *ifp; struct sbuf *sb; vi->xact_addr_filt = -1; callout_init(&vi->tick, 1); /* Allocate an ifnet and set it up */ ifp = if_alloc(IFT_ETHER); if (ifp == NULL) { device_printf(dev, "Cannot allocate ifnet\n"); return (ENOMEM); } vi->ifp = ifp; ifp->if_softc = vi; if_initname(ifp, device_get_name(dev), device_get_unit(dev)); ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; ifp->if_init = cxgbe_init; ifp->if_ioctl = cxgbe_ioctl; ifp->if_transmit = cxgbe_transmit; ifp->if_qflush = cxgbe_qflush; ifp->if_get_counter = cxgbe_get_counter; ifp->if_capabilities = T4_CAP; #ifdef TCP_OFFLOAD if (vi->nofldrxq != 0) ifp->if_capabilities |= IFCAP_TOE; #endif ifp->if_capenable = T4_CAP_ENABLE; ifp->if_hwassist = CSUM_TCP | CSUM_UDP | CSUM_IP | CSUM_TSO | CSUM_UDP_IPV6 | CSUM_TCP_IPV6; ifp->if_hw_tsomax = 65536 - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN); ifp->if_hw_tsomaxsegcount = TX_SGL_SEGS; ifp->if_hw_tsomaxsegsize = 65536; /* Initialize ifmedia for this VI */ ifmedia_init(&vi->media, IFM_IMASK, cxgbe_media_change, cxgbe_media_status); build_medialist(vi->pi, &vi->media); vi->vlan_c = EVENTHANDLER_REGISTER(vlan_config, cxgbe_vlan_config, ifp, EVENTHANDLER_PRI_ANY); ether_ifattach(ifp, vi->hw_addr); sb = sbuf_new_auto(); sbuf_printf(sb, "%d txq, %d rxq (NIC)", vi->ntxq, vi->nrxq); #ifdef TCP_OFFLOAD if (ifp->if_capabilities & IFCAP_TOE) sbuf_printf(sb, "; %d txq, %d rxq (TOE)", vi->nofldtxq, vi->nofldrxq); #endif sbuf_finish(sb); device_printf(dev, "%s\n", sbuf_data(sb)); sbuf_delete(sb); vi_sysctls(vi); return (0); } static int cxgbe_attach(device_t dev) { struct port_info *pi = device_get_softc(dev); struct vi_info *vi; int i, rc; callout_init_mtx(&pi->tick, &pi->pi_lock, 0); rc = cxgbe_vi_attach(dev, &pi->vi[0]); if (rc) return (rc); for_each_vi(pi, i, vi) { if (i == 0) continue; #ifdef DEV_NETMAP if (vi->flags & VI_NETMAP) { /* * media handled here to keep * implementation private to this file */ ifmedia_init(&vi->media, IFM_IMASK, cxgbe_media_change, cxgbe_media_status); build_medialist(pi, &vi->media); vi->dev = device_add_child(dev, is_t4(pi->adapter) ? "ncxgbe" : "ncxl", device_get_unit(dev)); } else #endif vi->dev = device_add_child(dev, is_t4(pi->adapter) ? "vcxgbe" : "vcxl", -1); if (vi->dev == NULL) { device_printf(dev, "failed to add VI %d\n", i); continue; } device_set_softc(vi->dev, vi); } cxgbe_sysctls(pi); bus_generic_attach(dev); return (0); } static void cxgbe_vi_detach(struct vi_info *vi) { struct ifnet *ifp = vi->ifp; ether_ifdetach(ifp); if (vi->vlan_c) EVENTHANDLER_DEREGISTER(vlan_config, vi->vlan_c); /* Let detach proceed even if these fail. */ cxgbe_uninit_synchronized(vi); callout_drain(&vi->tick); vi_full_uninit(vi); ifmedia_removeall(&vi->media); if_free(vi->ifp); vi->ifp = NULL; } static int cxgbe_detach(device_t dev) { struct port_info *pi = device_get_softc(dev); struct adapter *sc = pi->adapter; int rc; /* Detach the extra VIs first. */ rc = bus_generic_detach(dev); if (rc) return (rc); device_delete_children(dev); doom_vi(sc, &pi->vi[0]); if (pi->flags & HAS_TRACEQ) { sc->traceq = -1; /* cloner should not create ifnet */ t4_tracer_port_detach(sc); } cxgbe_vi_detach(&pi->vi[0]); callout_drain(&pi->tick); end_synchronized_op(sc, 0); return (0); } static void cxgbe_init(void *arg) { struct vi_info *vi = arg; struct adapter *sc = vi->pi->adapter; if (begin_synchronized_op(sc, vi, SLEEP_OK | INTR_OK, "t4init") != 0) return; cxgbe_init_synchronized(vi); end_synchronized_op(sc, 0); } static int cxgbe_ioctl(struct ifnet *ifp, unsigned long cmd, caddr_t data) { int rc = 0, mtu, flags, can_sleep; struct vi_info *vi = ifp->if_softc; struct adapter *sc = vi->pi->adapter; struct ifreq *ifr = (struct ifreq *)data; uint32_t mask; switch (cmd) { case SIOCSIFMTU: mtu = ifr->ifr_mtu; if ((mtu < ETHERMIN) || (mtu > ETHERMTU_JUMBO)) return (EINVAL); rc = begin_synchronized_op(sc, vi, SLEEP_OK | INTR_OK, "t4mtu"); if (rc) return (rc); ifp->if_mtu = mtu; if (vi->flags & VI_INIT_DONE) { t4_update_fl_bufsize(ifp); if (ifp->if_drv_flags & IFF_DRV_RUNNING) rc = update_mac_settings(ifp, XGMAC_MTU); } end_synchronized_op(sc, 0); break; case SIOCSIFFLAGS: can_sleep = 0; redo_sifflags: rc = begin_synchronized_op(sc, vi, can_sleep ? (SLEEP_OK | INTR_OK) : HOLD_LOCK, "t4flg"); if (rc) return (rc); if (ifp->if_flags & IFF_UP) { if (ifp->if_drv_flags & IFF_DRV_RUNNING) { flags = vi->if_flags; if ((ifp->if_flags ^ flags) & (IFF_PROMISC | IFF_ALLMULTI)) { if (can_sleep == 1) { end_synchronized_op(sc, 0); can_sleep = 0; goto redo_sifflags; } rc = update_mac_settings(ifp, XGMAC_PROMISC | XGMAC_ALLMULTI); } } else { if (can_sleep == 0) { end_synchronized_op(sc, LOCK_HELD); can_sleep = 1; goto redo_sifflags; } rc = cxgbe_init_synchronized(vi); } vi->if_flags = ifp->if_flags; } else if (ifp->if_drv_flags & IFF_DRV_RUNNING) { if (can_sleep == 0) { end_synchronized_op(sc, LOCK_HELD); can_sleep = 1; goto redo_sifflags; } rc = cxgbe_uninit_synchronized(vi); } end_synchronized_op(sc, can_sleep ? 0 : LOCK_HELD); break; case SIOCADDMULTI: case SIOCDELMULTI: /* these two are called with a mutex held :-( */ rc = begin_synchronized_op(sc, vi, HOLD_LOCK, "t4multi"); if (rc) return (rc); if (ifp->if_drv_flags & IFF_DRV_RUNNING) rc = update_mac_settings(ifp, XGMAC_MCADDRS); end_synchronized_op(sc, LOCK_HELD); break; case SIOCSIFCAP: rc = begin_synchronized_op(sc, vi, SLEEP_OK | INTR_OK, "t4cap"); if (rc) return (rc); mask = ifr->ifr_reqcap ^ ifp->if_capenable; if (mask & IFCAP_TXCSUM) { ifp->if_capenable ^= IFCAP_TXCSUM; ifp->if_hwassist ^= (CSUM_TCP | CSUM_UDP | CSUM_IP); if (IFCAP_TSO4 & ifp->if_capenable && !(IFCAP_TXCSUM & ifp->if_capenable)) { ifp->if_capenable &= ~IFCAP_TSO4; if_printf(ifp, "tso4 disabled due to -txcsum.\n"); } } if (mask & IFCAP_TXCSUM_IPV6) { ifp->if_capenable ^= IFCAP_TXCSUM_IPV6; ifp->if_hwassist ^= (CSUM_UDP_IPV6 | CSUM_TCP_IPV6); if (IFCAP_TSO6 & ifp->if_capenable && !(IFCAP_TXCSUM_IPV6 & ifp->if_capenable)) { ifp->if_capenable &= ~IFCAP_TSO6; if_printf(ifp, "tso6 disabled due to -txcsum6.\n"); } } if (mask & IFCAP_RXCSUM) ifp->if_capenable ^= IFCAP_RXCSUM; if (mask & IFCAP_RXCSUM_IPV6) ifp->if_capenable ^= IFCAP_RXCSUM_IPV6; /* * Note that we leave CSUM_TSO alone (it is always set). The * kernel takes both IFCAP_TSOx and CSUM_TSO into account before * sending a TSO request our way, so it's sufficient to toggle * IFCAP_TSOx only. */ if (mask & IFCAP_TSO4) { if (!(IFCAP_TSO4 & ifp->if_capenable) && !(IFCAP_TXCSUM & ifp->if_capenable)) { if_printf(ifp, "enable txcsum first.\n"); rc = EAGAIN; goto fail; } ifp->if_capenable ^= IFCAP_TSO4; } if (mask & IFCAP_TSO6) { if (!(IFCAP_TSO6 & ifp->if_capenable) && !(IFCAP_TXCSUM_IPV6 & ifp->if_capenable)) { if_printf(ifp, "enable txcsum6 first.\n"); rc = EAGAIN; goto fail; } ifp->if_capenable ^= IFCAP_TSO6; } if (mask & IFCAP_LRO) { #if defined(INET) || defined(INET6) int i; struct sge_rxq *rxq; ifp->if_capenable ^= IFCAP_LRO; for_each_rxq(vi, i, rxq) { if (ifp->if_capenable & IFCAP_LRO) rxq->iq.flags |= IQ_LRO_ENABLED; else rxq->iq.flags &= ~IQ_LRO_ENABLED; } #endif } #ifdef TCP_OFFLOAD if (mask & IFCAP_TOE) { int enable = (ifp->if_capenable ^ mask) & IFCAP_TOE; rc = toe_capability(vi, enable); if (rc != 0) goto fail; ifp->if_capenable ^= mask; } #endif if (mask & IFCAP_VLAN_HWTAGGING) { ifp->if_capenable ^= IFCAP_VLAN_HWTAGGING; if (ifp->if_drv_flags & IFF_DRV_RUNNING) rc = update_mac_settings(ifp, XGMAC_VLANEX); } if (mask & IFCAP_VLAN_MTU) { ifp->if_capenable ^= IFCAP_VLAN_MTU; /* Need to find out how to disable auto-mtu-inflation */ } if (mask & IFCAP_VLAN_HWTSO) ifp->if_capenable ^= IFCAP_VLAN_HWTSO; if (mask & IFCAP_VLAN_HWCSUM) ifp->if_capenable ^= IFCAP_VLAN_HWCSUM; #ifdef VLAN_CAPABILITIES VLAN_CAPABILITIES(ifp); #endif fail: end_synchronized_op(sc, 0); break; case SIOCSIFMEDIA: case SIOCGIFMEDIA: ifmedia_ioctl(ifp, ifr, &vi->media, cmd); break; case SIOCGI2C: { struct ifi2creq i2c; rc = copyin(ifr->ifr_data, &i2c, sizeof(i2c)); if (rc != 0) break; if (i2c.dev_addr != 0xA0 && i2c.dev_addr != 0xA2) { rc = EPERM; break; } if (i2c.len > sizeof(i2c.data)) { rc = EINVAL; break; } rc = begin_synchronized_op(sc, vi, SLEEP_OK | INTR_OK, "t4i2c"); if (rc) return (rc); rc = -t4_i2c_rd(sc, sc->mbox, vi->pi->port_id, i2c.dev_addr, i2c.offset, i2c.len, &i2c.data[0]); end_synchronized_op(sc, 0); if (rc == 0) rc = copyout(&i2c, ifr->ifr_data, sizeof(i2c)); break; } default: rc = ether_ioctl(ifp, cmd, data); } return (rc); } static int cxgbe_transmit(struct ifnet *ifp, struct mbuf *m) { struct vi_info *vi = ifp->if_softc; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; struct sge_txq *txq; void *items[1]; int rc; M_ASSERTPKTHDR(m); MPASS(m->m_nextpkt == NULL); /* not quite ready for this yet */ if (__predict_false(pi->link_cfg.link_ok == 0)) { m_freem(m); return (ENETDOWN); } rc = parse_pkt(&m); if (__predict_false(rc != 0)) { MPASS(m == NULL); /* was freed already */ atomic_add_int(&pi->tx_parse_error, 1); /* rare, atomic is ok */ return (rc); } /* Select a txq. */ txq = &sc->sge.txq[vi->first_txq]; if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) txq += ((m->m_pkthdr.flowid % (vi->ntxq - vi->rsrv_noflowq)) + vi->rsrv_noflowq); items[0] = m; rc = mp_ring_enqueue(txq->r, items, 1, 4096); if (__predict_false(rc != 0)) m_freem(m); return (rc); } static void cxgbe_qflush(struct ifnet *ifp) { struct vi_info *vi = ifp->if_softc; struct sge_txq *txq; int i; /* queues do not exist if !VI_INIT_DONE. */ if (vi->flags & VI_INIT_DONE) { for_each_txq(vi, i, txq) { TXQ_LOCK(txq); txq->eq.flags &= ~EQ_ENABLED; TXQ_UNLOCK(txq); while (!mp_ring_is_idle(txq->r)) { mp_ring_check_drainage(txq->r, 0); pause("qflush", 1); } } } if_qflush(ifp); } static uint64_t vi_get_counter(struct ifnet *ifp, ift_counter c) { struct vi_info *vi = ifp->if_softc; struct fw_vi_stats_vf *s = &vi->stats; vi_refresh_stats(vi->pi->adapter, vi); switch (c) { case IFCOUNTER_IPACKETS: return (s->rx_bcast_frames + s->rx_mcast_frames + s->rx_ucast_frames); case IFCOUNTER_IERRORS: return (s->rx_err_frames); case IFCOUNTER_OPACKETS: return (s->tx_bcast_frames + s->tx_mcast_frames + s->tx_ucast_frames + s->tx_offload_frames); case IFCOUNTER_OERRORS: return (s->tx_drop_frames); case IFCOUNTER_IBYTES: return (s->rx_bcast_bytes + s->rx_mcast_bytes + s->rx_ucast_bytes); case IFCOUNTER_OBYTES: return (s->tx_bcast_bytes + s->tx_mcast_bytes + s->tx_ucast_bytes + s->tx_offload_bytes); case IFCOUNTER_IMCASTS: return (s->rx_mcast_frames); case IFCOUNTER_OMCASTS: return (s->tx_mcast_frames); case IFCOUNTER_OQDROPS: { uint64_t drops; drops = 0; if ((vi->flags & (VI_INIT_DONE | VI_NETMAP)) == VI_INIT_DONE) { int i; struct sge_txq *txq; for_each_txq(vi, i, txq) drops += counter_u64_fetch(txq->r->drops); } return (drops); } default: return (if_get_counter_default(ifp, c)); } } uint64_t cxgbe_get_counter(struct ifnet *ifp, ift_counter c) { struct vi_info *vi = ifp->if_softc; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; struct port_stats *s = &pi->stats; if (pi->nvi > 1) return (vi_get_counter(ifp, c)); cxgbe_refresh_stats(sc, pi); switch (c) { case IFCOUNTER_IPACKETS: return (s->rx_frames); case IFCOUNTER_IERRORS: return (s->rx_jabber + s->rx_runt + s->rx_too_long + s->rx_fcs_err + s->rx_len_err); case IFCOUNTER_OPACKETS: return (s->tx_frames); case IFCOUNTER_OERRORS: return (s->tx_error_frames); case IFCOUNTER_IBYTES: return (s->rx_octets); case IFCOUNTER_OBYTES: return (s->tx_octets); case IFCOUNTER_IMCASTS: return (s->rx_mcast_frames); case IFCOUNTER_OMCASTS: return (s->tx_mcast_frames); case IFCOUNTER_IQDROPS: return (s->rx_ovflow0 + s->rx_ovflow1 + s->rx_ovflow2 + s->rx_ovflow3 + s->rx_trunc0 + s->rx_trunc1 + s->rx_trunc2 + s->rx_trunc3 + pi->tnl_cong_drops); case IFCOUNTER_OQDROPS: { uint64_t drops; drops = s->tx_drop; if (vi->flags & VI_INIT_DONE) { int i; struct sge_txq *txq; for_each_txq(vi, i, txq) drops += counter_u64_fetch(txq->r->drops); } return (drops); } default: return (if_get_counter_default(ifp, c)); } } static int cxgbe_media_change(struct ifnet *ifp) { struct vi_info *vi = ifp->if_softc; device_printf(vi->dev, "%s unimplemented.\n", __func__); return (EOPNOTSUPP); } static void cxgbe_media_status(struct ifnet *ifp, struct ifmediareq *ifmr) { struct vi_info *vi = ifp->if_softc; struct port_info *pi = vi->pi; struct ifmedia_entry *cur; int speed = pi->link_cfg.speed; cur = vi->media.ifm_cur; ifmr->ifm_status = IFM_AVALID; if (!pi->link_cfg.link_ok) return; ifmr->ifm_status |= IFM_ACTIVE; /* active and current will differ iff current media is autoselect. */ if (IFM_SUBTYPE(cur->ifm_media) != IFM_AUTO) return; ifmr->ifm_active = IFM_ETHER | IFM_FDX; if (speed == 10000) ifmr->ifm_active |= IFM_10G_T; else if (speed == 1000) ifmr->ifm_active |= IFM_1000_T; else if (speed == 100) ifmr->ifm_active |= IFM_100_TX; else if (speed == 10) ifmr->ifm_active |= IFM_10_T; else KASSERT(0, ("%s: link up but speed unknown (%u)", __func__, speed)); } static int vcxgbe_probe(device_t dev) { char buf[128]; struct vi_info *vi = device_get_softc(dev); snprintf(buf, sizeof(buf), "port %d vi %td", vi->pi->port_id, vi - vi->pi->vi); device_set_desc_copy(dev, buf); return (BUS_PROBE_DEFAULT); } static int vcxgbe_attach(device_t dev) { struct vi_info *vi; struct port_info *pi; struct adapter *sc; int func, index, rc; u32 param, val; vi = device_get_softc(dev); pi = vi->pi; sc = pi->adapter; index = vi - pi->vi; KASSERT(index < nitems(vi_mac_funcs), ("%s: VI %s doesn't have a MAC func", __func__, device_get_nameunit(dev))); func = vi_mac_funcs[index]; rc = t4_alloc_vi_func(sc, sc->mbox, pi->tx_chan, sc->pf, 0, 1, vi->hw_addr, &vi->rss_size, func, 0); if (rc < 0) { device_printf(dev, "Failed to allocate virtual interface " "for port %d: %d\n", pi->port_id, -rc); return (-rc); } vi->viid = rc; param = V_FW_PARAMS_MNEM(FW_PARAMS_MNEM_DEV) | V_FW_PARAMS_PARAM_X(FW_PARAMS_PARAM_DEV_RSSINFO) | V_FW_PARAMS_PARAM_YZ(vi->viid); rc = t4_query_params(sc, sc->mbox, sc->pf, 0, 1, ¶m, &val); if (rc) vi->rss_base = 0xffff; else { /* MPASS((val >> 16) == rss_size); */ vi->rss_base = val & 0xffff; } rc = cxgbe_vi_attach(dev, vi); if (rc) { t4_free_vi(sc, sc->mbox, sc->pf, 0, vi->viid); return (rc); } return (0); } static int vcxgbe_detach(device_t dev) { struct vi_info *vi; struct adapter *sc; vi = device_get_softc(dev); sc = vi->pi->adapter; doom_vi(sc, vi); cxgbe_vi_detach(vi); t4_free_vi(sc, sc->mbox, sc->pf, 0, vi->viid); end_synchronized_op(sc, 0); return (0); } void t4_fatal_err(struct adapter *sc) { t4_set_reg_field(sc, A_SGE_CONTROL, F_GLOBALENABLE, 0); t4_intr_disable(sc); log(LOG_EMERG, "%s: encountered fatal error, adapter stopped.\n", device_get_nameunit(sc->dev)); } static int map_bars_0_and_4(struct adapter *sc) { sc->regs_rid = PCIR_BAR(0); sc->regs_res = bus_alloc_resource_any(sc->dev, SYS_RES_MEMORY, &sc->regs_rid, RF_ACTIVE); if (sc->regs_res == NULL) { device_printf(sc->dev, "cannot map registers.\n"); return (ENXIO); } sc->bt = rman_get_bustag(sc->regs_res); sc->bh = rman_get_bushandle(sc->regs_res); sc->mmio_len = rman_get_size(sc->regs_res); setbit(&sc->doorbells, DOORBELL_KDB); sc->msix_rid = PCIR_BAR(4); sc->msix_res = bus_alloc_resource_any(sc->dev, SYS_RES_MEMORY, &sc->msix_rid, RF_ACTIVE); if (sc->msix_res == NULL) { device_printf(sc->dev, "cannot map MSI-X BAR.\n"); return (ENXIO); } return (0); } static int map_bar_2(struct adapter *sc) { /* * T4: only iWARP driver uses the userspace doorbells. There is no need * to map it if RDMA is disabled. */ if (is_t4(sc) && sc->rdmacaps == 0) return (0); sc->udbs_rid = PCIR_BAR(2); sc->udbs_res = bus_alloc_resource_any(sc->dev, SYS_RES_MEMORY, &sc->udbs_rid, RF_ACTIVE); if (sc->udbs_res == NULL) { device_printf(sc->dev, "cannot map doorbell BAR.\n"); return (ENXIO); } sc->udbs_base = rman_get_virtual(sc->udbs_res); if (is_t5(sc)) { setbit(&sc->doorbells, DOORBELL_UDB); #if defined(__i386__) || defined(__amd64__) if (t5_write_combine) { int rc; /* * Enable write combining on BAR2. This is the * userspace doorbell BAR and is split into 128B * (UDBS_SEG_SIZE) doorbell regions, each associated * with an egress queue. The first 64B has the doorbell * and the second 64B can be used to submit a tx work * request with an implicit doorbell. */ rc = pmap_change_attr((vm_offset_t)sc->udbs_base, rman_get_size(sc->udbs_res), PAT_WRITE_COMBINING); if (rc == 0) { clrbit(&sc->doorbells, DOORBELL_UDB); setbit(&sc->doorbells, DOORBELL_WCWR); setbit(&sc->doorbells, DOORBELL_UDBWC); } else { device_printf(sc->dev, "couldn't enable write combining: %d\n", rc); } t4_write_reg(sc, A_SGE_STAT_CFG, V_STATSOURCE_T5(7) | V_STATMODE(0)); } #endif } return (0); } struct memwin_init { uint32_t base; uint32_t aperture; }; static const struct memwin_init t4_memwin[NUM_MEMWIN] = { { MEMWIN0_BASE, MEMWIN0_APERTURE }, { MEMWIN1_BASE, MEMWIN1_APERTURE }, { MEMWIN2_BASE_T4, MEMWIN2_APERTURE_T4 } }; static const struct memwin_init t5_memwin[NUM_MEMWIN] = { { MEMWIN0_BASE, MEMWIN0_APERTURE }, { MEMWIN1_BASE, MEMWIN1_APERTURE }, { MEMWIN2_BASE_T5, MEMWIN2_APERTURE_T5 }, }; static void setup_memwin(struct adapter *sc) { const struct memwin_init *mw_init; struct memwin *mw; int i; uint32_t bar0; if (is_t4(sc)) { /* * Read low 32b of bar0 indirectly via the hardware backdoor * mechanism. Works from within PCI passthrough environments * too, where rman_get_start() can return a different value. We * need to program the T4 memory window decoders with the actual * addresses that will be coming across the PCIe link. */ bar0 = t4_hw_pci_read_cfg4(sc, PCIR_BAR(0)); bar0 &= (uint32_t) PCIM_BAR_MEM_BASE; mw_init = &t4_memwin[0]; } else { /* T5+ use the relative offset inside the PCIe BAR */ bar0 = 0; mw_init = &t5_memwin[0]; } for (i = 0, mw = &sc->memwin[0]; i < NUM_MEMWIN; i++, mw_init++, mw++) { rw_init(&mw->mw_lock, "memory window access"); mw->mw_base = mw_init->base; mw->mw_aperture = mw_init->aperture; mw->mw_curpos = 0; t4_write_reg(sc, PCIE_MEM_ACCESS_REG(A_PCIE_MEM_ACCESS_BASE_WIN, i), (mw->mw_base + bar0) | V_BIR(0) | V_WINDOW(ilog2(mw->mw_aperture) - 10)); rw_wlock(&mw->mw_lock); position_memwin(sc, i, 0); rw_wunlock(&mw->mw_lock); } /* flush */ t4_read_reg(sc, PCIE_MEM_ACCESS_REG(A_PCIE_MEM_ACCESS_BASE_WIN, 2)); } /* * Positions the memory window at the given address in the card's address space. * There are some alignment requirements and the actual position may be at an * address prior to the requested address. mw->mw_curpos always has the actual * position of the window. */ static void position_memwin(struct adapter *sc, int idx, uint32_t addr) { struct memwin *mw; uint32_t pf; uint32_t reg; MPASS(idx >= 0 && idx < NUM_MEMWIN); mw = &sc->memwin[idx]; rw_assert(&mw->mw_lock, RA_WLOCKED); if (is_t4(sc)) { pf = 0; mw->mw_curpos = addr & ~0xf; /* start must be 16B aligned */ } else { pf = V_PFNUM(sc->pf); mw->mw_curpos = addr & ~0x7f; /* start must be 128B aligned */ } reg = PCIE_MEM_ACCESS_REG(A_PCIE_MEM_ACCESS_OFFSET, idx); t4_write_reg(sc, reg, mw->mw_curpos | pf); t4_read_reg(sc, reg); /* flush */ } static int rw_via_memwin(struct adapter *sc, int idx, uint32_t addr, uint32_t *val, int len, int rw) { struct memwin *mw; uint32_t mw_end, v; MPASS(idx >= 0 && idx < NUM_MEMWIN); /* Memory can only be accessed in naturally aligned 4 byte units */ if (addr & 3 || len & 3 || len <= 0) return (EINVAL); mw = &sc->memwin[idx]; while (len > 0) { rw_rlock(&mw->mw_lock); mw_end = mw->mw_curpos + mw->mw_aperture; if (addr >= mw_end || addr < mw->mw_curpos) { /* Will need to reposition the window */ if (!rw_try_upgrade(&mw->mw_lock)) { rw_runlock(&mw->mw_lock); rw_wlock(&mw->mw_lock); } rw_assert(&mw->mw_lock, RA_WLOCKED); position_memwin(sc, idx, addr); rw_downgrade(&mw->mw_lock); mw_end = mw->mw_curpos + mw->mw_aperture; } rw_assert(&mw->mw_lock, RA_RLOCKED); while (addr < mw_end && len > 0) { if (rw == 0) { v = t4_read_reg(sc, mw->mw_base + addr - mw->mw_curpos); *val++ = le32toh(v); } else { v = *val++; t4_write_reg(sc, mw->mw_base + addr - mw->mw_curpos, htole32(v)); } addr += 4; len -= 4; } rw_runlock(&mw->mw_lock); } return (0); } static inline int read_via_memwin(struct adapter *sc, int idx, uint32_t addr, uint32_t *val, int len) { return (rw_via_memwin(sc, idx, addr, val, len, 0)); } static inline int write_via_memwin(struct adapter *sc, int idx, uint32_t addr, const uint32_t *val, int len) { return (rw_via_memwin(sc, idx, addr, (void *)(uintptr_t)val, len, 1)); } static int t4_range_cmp(const void *a, const void *b) { return ((const struct t4_range *)a)->start - ((const struct t4_range *)b)->start; } /* * Verify that the memory range specified by the addr/len pair is valid within * the card's address space. */ static int validate_mem_range(struct adapter *sc, uint32_t addr, int len) { struct t4_range mem_ranges[4], *r, *next; uint32_t em, addr_len; int i, n, remaining; /* Memory can only be accessed in naturally aligned 4 byte units */ if (addr & 3 || len & 3 || len <= 0) return (EINVAL); /* Enabled memories */ em = t4_read_reg(sc, A_MA_TARGET_MEM_ENABLE); r = &mem_ranges[0]; n = 0; bzero(r, sizeof(mem_ranges)); if (em & F_EDRAM0_ENABLE) { addr_len = t4_read_reg(sc, A_MA_EDRAM0_BAR); r->size = G_EDRAM0_SIZE(addr_len) << 20; if (r->size > 0) { r->start = G_EDRAM0_BASE(addr_len) << 20; if (addr >= r->start && addr + len <= r->start + r->size) return (0); r++; n++; } } if (em & F_EDRAM1_ENABLE) { addr_len = t4_read_reg(sc, A_MA_EDRAM1_BAR); r->size = G_EDRAM1_SIZE(addr_len) << 20; if (r->size > 0) { r->start = G_EDRAM1_BASE(addr_len) << 20; if (addr >= r->start && addr + len <= r->start + r->size) return (0); r++; n++; } } if (em & F_EXT_MEM_ENABLE) { addr_len = t4_read_reg(sc, A_MA_EXT_MEMORY_BAR); r->size = G_EXT_MEM_SIZE(addr_len) << 20; if (r->size > 0) { r->start = G_EXT_MEM_BASE(addr_len) << 20; if (addr >= r->start && addr + len <= r->start + r->size) return (0); r++; n++; } } if (is_t5(sc) && em & F_EXT_MEM1_ENABLE) { addr_len = t4_read_reg(sc, A_MA_EXT_MEMORY1_BAR); r->size = G_EXT_MEM1_SIZE(addr_len) << 20; if (r->size > 0) { r->start = G_EXT_MEM1_BASE(addr_len) << 20; if (addr >= r->start && addr + len <= r->start + r->size) return (0); r++; n++; } } MPASS(n <= nitems(mem_ranges)); if (n > 1) { /* Sort and merge the ranges. */ qsort(mem_ranges, n, sizeof(struct t4_range), t4_range_cmp); /* Start from index 0 and examine the next n - 1 entries. */ r = &mem_ranges[0]; for (remaining = n - 1; remaining > 0; remaining--, r++) { MPASS(r->size > 0); /* r is a valid entry. */ next = r + 1; MPASS(next->size > 0); /* and so is the next one. */ while (r->start + r->size >= next->start) { /* Merge the next one into the current entry. */ r->size = max(r->start + r->size, next->start + next->size) - r->start; n--; /* One fewer entry in total. */ if (--remaining == 0) goto done; /* short circuit */ next++; } if (next != r + 1) { /* * Some entries were merged into r and next * points to the first valid entry that couldn't * be merged. */ MPASS(next->size > 0); /* must be valid */ memcpy(r + 1, next, remaining * sizeof(*r)); #ifdef INVARIANTS /* * This so that the foo->size assertion in the * next iteration of the loop do the right * thing for entries that were pulled up and are * no longer valid. */ MPASS(n < nitems(mem_ranges)); bzero(&mem_ranges[n], (nitems(mem_ranges) - n) * sizeof(struct t4_range)); #endif } } done: /* Done merging the ranges. */ MPASS(n > 0); r = &mem_ranges[0]; for (i = 0; i < n; i++, r++) { if (addr >= r->start && addr + len <= r->start + r->size) return (0); } } return (EFAULT); } static int fwmtype_to_hwmtype(int mtype) { switch (mtype) { case FW_MEMTYPE_EDC0: return (MEM_EDC0); case FW_MEMTYPE_EDC1: return (MEM_EDC1); case FW_MEMTYPE_EXTMEM: return (MEM_MC0); case FW_MEMTYPE_EXTMEM1: return (MEM_MC1); default: panic("%s: cannot translate fw mtype %d.", __func__, mtype); } } /* * Verify that the memory range specified by the memtype/offset/len pair is * valid and lies entirely within the memtype specified. The global address of * the start of the range is returned in addr. */ static int validate_mt_off_len(struct adapter *sc, int mtype, uint32_t off, int len, uint32_t *addr) { uint32_t em, addr_len, maddr; /* Memory can only be accessed in naturally aligned 4 byte units */ if (off & 3 || len & 3 || len == 0) return (EINVAL); em = t4_read_reg(sc, A_MA_TARGET_MEM_ENABLE); switch (fwmtype_to_hwmtype(mtype)) { case MEM_EDC0: if (!(em & F_EDRAM0_ENABLE)) return (EINVAL); addr_len = t4_read_reg(sc, A_MA_EDRAM0_BAR); maddr = G_EDRAM0_BASE(addr_len) << 20; break; case MEM_EDC1: if (!(em & F_EDRAM1_ENABLE)) return (EINVAL); addr_len = t4_read_reg(sc, A_MA_EDRAM1_BAR); maddr = G_EDRAM1_BASE(addr_len) << 20; break; case MEM_MC: if (!(em & F_EXT_MEM_ENABLE)) return (EINVAL); addr_len = t4_read_reg(sc, A_MA_EXT_MEMORY_BAR); maddr = G_EXT_MEM_BASE(addr_len) << 20; break; case MEM_MC1: if (!is_t5(sc) || !(em & F_EXT_MEM1_ENABLE)) return (EINVAL); addr_len = t4_read_reg(sc, A_MA_EXT_MEMORY1_BAR); maddr = G_EXT_MEM1_BASE(addr_len) << 20; break; default: return (EINVAL); } *addr = maddr + off; /* global address */ return (validate_mem_range(sc, *addr, len)); } static int fixup_devlog_params(struct adapter *sc) { struct devlog_params *dparams = &sc->params.devlog; int rc; rc = validate_mt_off_len(sc, dparams->memtype, dparams->start, dparams->size, &dparams->addr); return (rc); } static int cfg_itype_and_nqueues(struct adapter *sc, int n10g, int n1g, int num_vis, struct intrs_and_queues *iaq) { int rc, itype, navail, nrxq10g, nrxq1g, n; int nofldrxq10g = 0, nofldrxq1g = 0; int nnmrxq10g = 0, nnmrxq1g = 0; bzero(iaq, sizeof(*iaq)); iaq->ntxq10g = t4_ntxq10g; iaq->ntxq1g = t4_ntxq1g; iaq->nrxq10g = nrxq10g = t4_nrxq10g; iaq->nrxq1g = nrxq1g = t4_nrxq1g; iaq->rsrv_noflowq = t4_rsrv_noflowq; #ifdef TCP_OFFLOAD if (is_offload(sc)) { iaq->nofldtxq10g = t4_nofldtxq10g; iaq->nofldtxq1g = t4_nofldtxq1g; iaq->nofldrxq10g = nofldrxq10g = t4_nofldrxq10g; iaq->nofldrxq1g = nofldrxq1g = t4_nofldrxq1g; } #endif #ifdef DEV_NETMAP iaq->nnmtxq10g = t4_nnmtxq10g; iaq->nnmtxq1g = t4_nnmtxq1g; iaq->nnmrxq10g = nnmrxq10g = t4_nnmrxq10g; iaq->nnmrxq1g = nnmrxq1g = t4_nnmrxq1g; #endif for (itype = INTR_MSIX; itype; itype >>= 1) { if ((itype & t4_intr_types) == 0) continue; /* not allowed */ if (itype == INTR_MSIX) navail = pci_msix_count(sc->dev); else if (itype == INTR_MSI) navail = pci_msi_count(sc->dev); else navail = 1; restart: if (navail == 0) continue; iaq->intr_type = itype; iaq->intr_flags_10g = 0; iaq->intr_flags_1g = 0; /* * Best option: an interrupt vector for errors, one for the * firmware event queue, and one for every rxq (NIC, TOE, and * netmap). */ iaq->nirq = T4_EXTRA_INTR; iaq->nirq += n10g * (nrxq10g + nofldrxq10g + nnmrxq10g); iaq->nirq += n10g * 2 * (num_vis - 1); iaq->nirq += n1g * (nrxq1g + nofldrxq1g + nnmrxq1g); iaq->nirq += n1g * 2 * (num_vis - 1); if (iaq->nirq <= navail && (itype != INTR_MSI || powerof2(iaq->nirq))) { iaq->intr_flags_10g = INTR_ALL; iaq->intr_flags_1g = INTR_ALL; goto allocate; } /* * Second best option: a vector for errors, one for the firmware * event queue, and vectors for either all the NIC rx queues or * all the TOE rx queues. The queues that don't get vectors * will forward their interrupts to those that do. * * Note: netmap rx queues cannot be created early and so they * can't be setup to receive forwarded interrupts for others. */ iaq->nirq = T4_EXTRA_INTR; if (nrxq10g >= nofldrxq10g) { iaq->intr_flags_10g = INTR_RXQ; iaq->nirq += n10g * nrxq10g; iaq->nirq += n10g * (num_vis - 1); #ifdef DEV_NETMAP iaq->nnmrxq10g = min(nnmrxq10g, nrxq10g); #endif } else { iaq->intr_flags_10g = INTR_OFLD_RXQ; iaq->nirq += n10g * nofldrxq10g; #ifdef DEV_NETMAP iaq->nnmrxq10g = min(nnmrxq10g, nofldrxq10g); #endif } if (nrxq1g >= nofldrxq1g) { iaq->intr_flags_1g = INTR_RXQ; iaq->nirq += n1g * nrxq1g; iaq->nirq += n1g * (num_vis - 1); #ifdef DEV_NETMAP iaq->nnmrxq1g = min(nnmrxq1g, nrxq1g); #endif } else { iaq->intr_flags_1g = INTR_OFLD_RXQ; iaq->nirq += n1g * nofldrxq1g; #ifdef DEV_NETMAP iaq->nnmrxq1g = min(nnmrxq1g, nofldrxq1g); #endif } if (iaq->nirq <= navail && (itype != INTR_MSI || powerof2(iaq->nirq))) goto allocate; /* * Next best option: an interrupt vector for errors, one for the * firmware event queue, and at least one per VI. At this * point we know we'll have to downsize nrxq and/or nofldrxq * and/or nnmrxq to fit what's available to us. */ iaq->nirq = T4_EXTRA_INTR; iaq->nirq += (n10g + n1g) * num_vis; if (iaq->nirq <= navail) { int leftover = navail - iaq->nirq; if (n10g > 0) { int target = max(nrxq10g, nofldrxq10g); iaq->intr_flags_10g = nrxq10g >= nofldrxq10g ? INTR_RXQ : INTR_OFLD_RXQ; n = 1; while (n < target && leftover >= n10g) { leftover -= n10g; iaq->nirq += n10g; n++; } iaq->nrxq10g = min(n, nrxq10g); #ifdef TCP_OFFLOAD iaq->nofldrxq10g = min(n, nofldrxq10g); #endif #ifdef DEV_NETMAP iaq->nnmrxq10g = min(n, nnmrxq10g); #endif } if (n1g > 0) { int target = max(nrxq1g, nofldrxq1g); iaq->intr_flags_1g = nrxq1g >= nofldrxq1g ? INTR_RXQ : INTR_OFLD_RXQ; n = 1; while (n < target && leftover >= n1g) { leftover -= n1g; iaq->nirq += n1g; n++; } iaq->nrxq1g = min(n, nrxq1g); #ifdef TCP_OFFLOAD iaq->nofldrxq1g = min(n, nofldrxq1g); #endif #ifdef DEV_NETMAP iaq->nnmrxq1g = min(n, nnmrxq1g); #endif } if (itype != INTR_MSI || powerof2(iaq->nirq)) goto allocate; } /* * Least desirable option: one interrupt vector for everything. */ iaq->nirq = iaq->nrxq10g = iaq->nrxq1g = 1; iaq->intr_flags_10g = iaq->intr_flags_1g = 0; #ifdef TCP_OFFLOAD if (is_offload(sc)) iaq->nofldrxq10g = iaq->nofldrxq1g = 1; #endif #ifdef DEV_NETMAP iaq->nnmrxq10g = iaq->nnmrxq1g = 1; #endif allocate: navail = iaq->nirq; rc = 0; if (itype == INTR_MSIX) rc = pci_alloc_msix(sc->dev, &navail); else if (itype == INTR_MSI) rc = pci_alloc_msi(sc->dev, &navail); if (rc == 0) { if (navail == iaq->nirq) return (0); /* * Didn't get the number requested. Use whatever number * the kernel is willing to allocate (it's in navail). */ device_printf(sc->dev, "fewer vectors than requested, " "type=%d, req=%d, rcvd=%d; will downshift req.\n", itype, iaq->nirq, navail); pci_release_msi(sc->dev); goto restart; } device_printf(sc->dev, "failed to allocate vectors:%d, type=%d, req=%d, rcvd=%d\n", itype, rc, iaq->nirq, navail); } device_printf(sc->dev, "failed to find a usable interrupt type. " "allowed=%d, msi-x=%d, msi=%d, intx=1", t4_intr_types, pci_msix_count(sc->dev), pci_msi_count(sc->dev)); return (ENXIO); } #define FW_VERSION(chip) ( \ V_FW_HDR_FW_VER_MAJOR(chip##FW_VERSION_MAJOR) | \ V_FW_HDR_FW_VER_MINOR(chip##FW_VERSION_MINOR) | \ V_FW_HDR_FW_VER_MICRO(chip##FW_VERSION_MICRO) | \ V_FW_HDR_FW_VER_BUILD(chip##FW_VERSION_BUILD)) #define FW_INTFVER(chip, intf) (chip##FW_HDR_INTFVER_##intf) struct fw_info { uint8_t chip; char *kld_name; char *fw_mod_name; struct fw_hdr fw_hdr; /* XXX: waste of space, need a sparse struct */ } fw_info[] = { { .chip = CHELSIO_T4, .kld_name = "t4fw_cfg", .fw_mod_name = "t4fw", .fw_hdr = { .chip = FW_HDR_CHIP_T4, .fw_ver = htobe32_const(FW_VERSION(T4)), .intfver_nic = FW_INTFVER(T4, NIC), .intfver_vnic = FW_INTFVER(T4, VNIC), .intfver_ofld = FW_INTFVER(T4, OFLD), .intfver_ri = FW_INTFVER(T4, RI), .intfver_iscsipdu = FW_INTFVER(T4, ISCSIPDU), .intfver_iscsi = FW_INTFVER(T4, ISCSI), .intfver_fcoepdu = FW_INTFVER(T4, FCOEPDU), .intfver_fcoe = FW_INTFVER(T4, FCOE), }, }, { .chip = CHELSIO_T5, .kld_name = "t5fw_cfg", .fw_mod_name = "t5fw", .fw_hdr = { .chip = FW_HDR_CHIP_T5, .fw_ver = htobe32_const(FW_VERSION(T5)), .intfver_nic = FW_INTFVER(T5, NIC), .intfver_vnic = FW_INTFVER(T5, VNIC), .intfver_ofld = FW_INTFVER(T5, OFLD), .intfver_ri = FW_INTFVER(T5, RI), .intfver_iscsipdu = FW_INTFVER(T5, ISCSIPDU), .intfver_iscsi = FW_INTFVER(T5, ISCSI), .intfver_fcoepdu = FW_INTFVER(T5, FCOEPDU), .intfver_fcoe = FW_INTFVER(T5, FCOE), }, } }; static struct fw_info * find_fw_info(int chip) { int i; for (i = 0; i < nitems(fw_info); i++) { if (fw_info[i].chip == chip) return (&fw_info[i]); } return (NULL); } /* * Is the given firmware API compatible with the one the driver was compiled * with? */ static int fw_compatible(const struct fw_hdr *hdr1, const struct fw_hdr *hdr2) { /* short circuit if it's the exact same firmware version */ if (hdr1->chip == hdr2->chip && hdr1->fw_ver == hdr2->fw_ver) return (1); /* * XXX: Is this too conservative? Perhaps I should limit this to the * features that are supported in the driver. */ #define SAME_INTF(x) (hdr1->intfver_##x == hdr2->intfver_##x) if (hdr1->chip == hdr2->chip && SAME_INTF(nic) && SAME_INTF(vnic) && SAME_INTF(ofld) && SAME_INTF(ri) && SAME_INTF(iscsipdu) && SAME_INTF(iscsi) && SAME_INTF(fcoepdu) && SAME_INTF(fcoe)) return (1); #undef SAME_INTF return (0); } /* * The firmware in the KLD is usable, but should it be installed? This routine * explains itself in detail if it indicates the KLD firmware should be * installed. */ static int should_install_kld_fw(struct adapter *sc, int card_fw_usable, int k, int c) { const char *reason; if (!card_fw_usable) { reason = "incompatible or unusable"; goto install; } if (k > c) { reason = "older than the version bundled with this driver"; goto install; } if (t4_fw_install == 2 && k != c) { reason = "different than the version bundled with this driver"; goto install; } return (0); install: if (t4_fw_install == 0) { device_printf(sc->dev, "firmware on card (%u.%u.%u.%u) is %s, " "but the driver is prohibited from installing a different " "firmware on the card.\n", G_FW_HDR_FW_VER_MAJOR(c), G_FW_HDR_FW_VER_MINOR(c), G_FW_HDR_FW_VER_MICRO(c), G_FW_HDR_FW_VER_BUILD(c), reason); return (0); } device_printf(sc->dev, "firmware on card (%u.%u.%u.%u) is %s, " "installing firmware %u.%u.%u.%u on card.\n", G_FW_HDR_FW_VER_MAJOR(c), G_FW_HDR_FW_VER_MINOR(c), G_FW_HDR_FW_VER_MICRO(c), G_FW_HDR_FW_VER_BUILD(c), reason, G_FW_HDR_FW_VER_MAJOR(k), G_FW_HDR_FW_VER_MINOR(k), G_FW_HDR_FW_VER_MICRO(k), G_FW_HDR_FW_VER_BUILD(k)); return (1); } /* * Establish contact with the firmware and determine if we are the master driver * or not, and whether we are responsible for chip initialization. */ static int prep_firmware(struct adapter *sc) { const struct firmware *fw = NULL, *default_cfg; int rc, pf, card_fw_usable, kld_fw_usable, need_fw_reset = 1; enum dev_state state; struct fw_info *fw_info; struct fw_hdr *card_fw; /* fw on the card */ const struct fw_hdr *kld_fw; /* fw in the KLD */ const struct fw_hdr *drv_fw; /* fw header the driver was compiled against */ /* Contact firmware. */ rc = t4_fw_hello(sc, sc->mbox, sc->mbox, MASTER_MAY, &state); if (rc < 0 || state == DEV_STATE_ERR) { rc = -rc; device_printf(sc->dev, "failed to connect to the firmware: %d, %d.\n", rc, state); return (rc); } pf = rc; if (pf == sc->mbox) sc->flags |= MASTER_PF; else if (state == DEV_STATE_UNINIT) { /* * We didn't get to be the master so we definitely won't be * configuring the chip. It's a bug if someone else hasn't * configured it already. */ device_printf(sc->dev, "couldn't be master(%d), " "device not already initialized either(%d).\n", rc, state); return (EDOOFUS); } /* This is the firmware whose headers the driver was compiled against */ fw_info = find_fw_info(chip_id(sc)); if (fw_info == NULL) { device_printf(sc->dev, "unable to look up firmware information for chip %d.\n", chip_id(sc)); return (EINVAL); } drv_fw = &fw_info->fw_hdr; /* * The firmware KLD contains many modules. The KLD name is also the * name of the module that contains the default config file. */ default_cfg = firmware_get(fw_info->kld_name); /* Read the header of the firmware on the card */ card_fw = malloc(sizeof(*card_fw), M_CXGBE, M_ZERO | M_WAITOK); rc = -t4_read_flash(sc, FLASH_FW_START, sizeof (*card_fw) / sizeof (uint32_t), (uint32_t *)card_fw, 1); if (rc == 0) card_fw_usable = fw_compatible(drv_fw, (const void*)card_fw); else { device_printf(sc->dev, "Unable to read card's firmware header: %d\n", rc); card_fw_usable = 0; } /* This is the firmware in the KLD */ fw = firmware_get(fw_info->fw_mod_name); if (fw != NULL) { kld_fw = (const void *)fw->data; kld_fw_usable = fw_compatible(drv_fw, kld_fw); } else { kld_fw = NULL; kld_fw_usable = 0; } if (card_fw_usable && card_fw->fw_ver == drv_fw->fw_ver && (!kld_fw_usable || kld_fw->fw_ver == drv_fw->fw_ver)) { /* * Common case: the firmware on the card is an exact match and * the KLD is an exact match too, or the KLD is * absent/incompatible. Note that t4_fw_install = 2 is ignored * here -- use cxgbetool loadfw if you want to reinstall the * same firmware as the one on the card. */ } else if (kld_fw_usable && state == DEV_STATE_UNINIT && should_install_kld_fw(sc, card_fw_usable, be32toh(kld_fw->fw_ver), be32toh(card_fw->fw_ver))) { rc = -t4_fw_upgrade(sc, sc->mbox, fw->data, fw->datasize, 0); if (rc != 0) { device_printf(sc->dev, "failed to install firmware: %d\n", rc); goto done; } /* Installed successfully, update the cached header too. */ memcpy(card_fw, kld_fw, sizeof(*card_fw)); card_fw_usable = 1; need_fw_reset = 0; /* already reset as part of load_fw */ } if (!card_fw_usable) { uint32_t d, c, k; d = ntohl(drv_fw->fw_ver); c = ntohl(card_fw->fw_ver); k = kld_fw ? ntohl(kld_fw->fw_ver) : 0; device_printf(sc->dev, "Cannot find a usable firmware: " "fw_install %d, chip state %d, " "driver compiled with %d.%d.%d.%d, " "card has %d.%d.%d.%d, KLD has %d.%d.%d.%d\n", t4_fw_install, state, G_FW_HDR_FW_VER_MAJOR(d), G_FW_HDR_FW_VER_MINOR(d), G_FW_HDR_FW_VER_MICRO(d), G_FW_HDR_FW_VER_BUILD(d), G_FW_HDR_FW_VER_MAJOR(c), G_FW_HDR_FW_VER_MINOR(c), G_FW_HDR_FW_VER_MICRO(c), G_FW_HDR_FW_VER_BUILD(c), G_FW_HDR_FW_VER_MAJOR(k), G_FW_HDR_FW_VER_MINOR(k), G_FW_HDR_FW_VER_MICRO(k), G_FW_HDR_FW_VER_BUILD(k)); rc = EINVAL; goto done; } /* We're using whatever's on the card and it's known to be good. */ sc->params.fw_vers = ntohl(card_fw->fw_ver); snprintf(sc->fw_version, sizeof(sc->fw_version), "%u.%u.%u.%u", G_FW_HDR_FW_VER_MAJOR(sc->params.fw_vers), G_FW_HDR_FW_VER_MINOR(sc->params.fw_vers), G_FW_HDR_FW_VER_MICRO(sc->params.fw_vers), G_FW_HDR_FW_VER_BUILD(sc->params.fw_vers)); t4_get_tp_version(sc, &sc->params.tp_vers); snprintf(sc->tp_version, sizeof(sc->tp_version), "%u.%u.%u.%u", G_FW_HDR_FW_VER_MAJOR(sc->params.tp_vers), G_FW_HDR_FW_VER_MINOR(sc->params.tp_vers), G_FW_HDR_FW_VER_MICRO(sc->params.tp_vers), G_FW_HDR_FW_VER_BUILD(sc->params.tp_vers)); if (t4_get_exprom_version(sc, &sc->params.exprom_vers) != 0) sc->params.exprom_vers = 0; else { snprintf(sc->exprom_version, sizeof(sc->exprom_version), "%u.%u.%u.%u", G_FW_HDR_FW_VER_MAJOR(sc->params.exprom_vers), G_FW_HDR_FW_VER_MINOR(sc->params.exprom_vers), G_FW_HDR_FW_VER_MICRO(sc->params.exprom_vers), G_FW_HDR_FW_VER_BUILD(sc->params.exprom_vers)); } /* Reset device */ if (need_fw_reset && (rc = -t4_fw_reset(sc, sc->mbox, F_PIORSTMODE | F_PIORST)) != 0) { device_printf(sc->dev, "firmware reset failed: %d.\n", rc); if (rc != ETIMEDOUT && rc != EIO) t4_fw_bye(sc, sc->mbox); goto done; } sc->flags |= FW_OK; rc = get_params__pre_init(sc); if (rc != 0) goto done; /* error message displayed already */ /* Partition adapter resources as specified in the config file. */ if (state == DEV_STATE_UNINIT) { KASSERT(sc->flags & MASTER_PF, ("%s: trying to change chip settings when not master.", __func__)); rc = partition_resources(sc, default_cfg, fw_info->kld_name); if (rc != 0) goto done; /* error message displayed already */ t4_tweak_chip_settings(sc); /* get basic stuff going */ rc = -t4_fw_initialize(sc, sc->mbox); if (rc != 0) { device_printf(sc->dev, "fw init failed: %d.\n", rc); goto done; } } else { snprintf(sc->cfg_file, sizeof(sc->cfg_file), "pf%d", pf); sc->cfcsum = 0; } done: free(card_fw, M_CXGBE); if (fw != NULL) firmware_put(fw, FIRMWARE_UNLOAD); if (default_cfg != NULL) firmware_put(default_cfg, FIRMWARE_UNLOAD); return (rc); } #define FW_PARAM_DEV(param) \ (V_FW_PARAMS_MNEM(FW_PARAMS_MNEM_DEV) | \ V_FW_PARAMS_PARAM_X(FW_PARAMS_PARAM_DEV_##param)) #define FW_PARAM_PFVF(param) \ (V_FW_PARAMS_MNEM(FW_PARAMS_MNEM_PFVF) | \ V_FW_PARAMS_PARAM_X(FW_PARAMS_PARAM_PFVF_##param)) /* * Partition chip resources for use between various PFs, VFs, etc. */ static int partition_resources(struct adapter *sc, const struct firmware *default_cfg, const char *name_prefix) { const struct firmware *cfg = NULL; int rc = 0; struct fw_caps_config_cmd caps; uint32_t mtype, moff, finicsum, cfcsum; /* * Figure out what configuration file to use. Pick the default config * file for the card if the user hasn't specified one explicitly. */ snprintf(sc->cfg_file, sizeof(sc->cfg_file), "%s", t4_cfg_file); if (strncmp(t4_cfg_file, DEFAULT_CF, sizeof(t4_cfg_file)) == 0) { /* Card specific overrides go here. */ if (pci_get_device(sc->dev) == 0x440a) snprintf(sc->cfg_file, sizeof(sc->cfg_file), UWIRE_CF); if (is_fpga(sc)) snprintf(sc->cfg_file, sizeof(sc->cfg_file), FPGA_CF); } /* * We need to load another module if the profile is anything except * "default" or "flash". */ if (strncmp(sc->cfg_file, DEFAULT_CF, sizeof(sc->cfg_file)) != 0 && strncmp(sc->cfg_file, FLASH_CF, sizeof(sc->cfg_file)) != 0) { char s[32]; snprintf(s, sizeof(s), "%s_%s", name_prefix, sc->cfg_file); cfg = firmware_get(s); if (cfg == NULL) { if (default_cfg != NULL) { device_printf(sc->dev, "unable to load module \"%s\" for " "configuration profile \"%s\", will use " "the default config file instead.\n", s, sc->cfg_file); snprintf(sc->cfg_file, sizeof(sc->cfg_file), "%s", DEFAULT_CF); } else { device_printf(sc->dev, "unable to load module \"%s\" for " "configuration profile \"%s\", will use " "the config file on the card's flash " "instead.\n", s, sc->cfg_file); snprintf(sc->cfg_file, sizeof(sc->cfg_file), "%s", FLASH_CF); } } } if (strncmp(sc->cfg_file, DEFAULT_CF, sizeof(sc->cfg_file)) == 0 && default_cfg == NULL) { device_printf(sc->dev, "default config file not available, will use the config " "file on the card's flash instead.\n"); snprintf(sc->cfg_file, sizeof(sc->cfg_file), "%s", FLASH_CF); } if (strncmp(sc->cfg_file, FLASH_CF, sizeof(sc->cfg_file)) != 0) { u_int cflen; const uint32_t *cfdata; uint32_t param, val, addr; KASSERT(cfg != NULL || default_cfg != NULL, ("%s: no config to upload", __func__)); /* * Ask the firmware where it wants us to upload the config file. */ param = FW_PARAM_DEV(CF); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 1, ¶m, &val); if (rc != 0) { /* No support for config file? Shouldn't happen. */ device_printf(sc->dev, "failed to query config file location: %d.\n", rc); goto done; } mtype = G_FW_PARAMS_PARAM_Y(val); moff = G_FW_PARAMS_PARAM_Z(val) << 16; /* * XXX: sheer laziness. We deliberately added 4 bytes of * useless stuffing/comments at the end of the config file so * it's ok to simply throw away the last remaining bytes when * the config file is not an exact multiple of 4. This also * helps with the validate_mt_off_len check. */ if (cfg != NULL) { cflen = cfg->datasize & ~3; cfdata = cfg->data; } else { cflen = default_cfg->datasize & ~3; cfdata = default_cfg->data; } if (cflen > FLASH_CFG_MAX_SIZE) { device_printf(sc->dev, "config file too long (%d, max allowed is %d). " "Will try to use the config on the card, if any.\n", cflen, FLASH_CFG_MAX_SIZE); goto use_config_on_flash; } rc = validate_mt_off_len(sc, mtype, moff, cflen, &addr); if (rc != 0) { device_printf(sc->dev, "%s: addr (%d/0x%x) or len %d is not valid: %d. " "Will try to use the config on the card, if any.\n", __func__, mtype, moff, cflen, rc); goto use_config_on_flash; } write_via_memwin(sc, 2, addr, cfdata, cflen); } else { use_config_on_flash: mtype = FW_MEMTYPE_FLASH; moff = t4_flash_cfg_addr(sc); } bzero(&caps, sizeof(caps)); caps.op_to_write = htobe32(V_FW_CMD_OP(FW_CAPS_CONFIG_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_READ); caps.cfvalid_to_len16 = htobe32(F_FW_CAPS_CONFIG_CMD_CFVALID | V_FW_CAPS_CONFIG_CMD_MEMTYPE_CF(mtype) | V_FW_CAPS_CONFIG_CMD_MEMADDR64K_CF(moff >> 16) | FW_LEN16(caps)); rc = -t4_wr_mbox(sc, sc->mbox, &caps, sizeof(caps), &caps); if (rc != 0) { device_printf(sc->dev, "failed to pre-process config file: %d " "(mtype %d, moff 0x%x).\n", rc, mtype, moff); goto done; } finicsum = be32toh(caps.finicsum); cfcsum = be32toh(caps.cfcsum); if (finicsum != cfcsum) { device_printf(sc->dev, "WARNING: config file checksum mismatch: %08x %08x\n", finicsum, cfcsum); } sc->cfcsum = cfcsum; #define LIMIT_CAPS(x) do { \ caps.x &= htobe16(t4_##x##_allowed); \ } while (0) /* * Let the firmware know what features will (not) be used so it can tune * things accordingly. */ LIMIT_CAPS(nbmcaps); LIMIT_CAPS(linkcaps); LIMIT_CAPS(switchcaps); LIMIT_CAPS(niccaps); LIMIT_CAPS(toecaps); LIMIT_CAPS(rdmacaps); LIMIT_CAPS(tlscaps); LIMIT_CAPS(iscsicaps); LIMIT_CAPS(fcoecaps); #undef LIMIT_CAPS caps.op_to_write = htobe32(V_FW_CMD_OP(FW_CAPS_CONFIG_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_WRITE); caps.cfvalid_to_len16 = htobe32(FW_LEN16(caps)); rc = -t4_wr_mbox(sc, sc->mbox, &caps, sizeof(caps), NULL); if (rc != 0) { device_printf(sc->dev, "failed to process config file: %d.\n", rc); } done: if (cfg != NULL) firmware_put(cfg, FIRMWARE_UNLOAD); return (rc); } /* * Retrieve parameters that are needed (or nice to have) very early. */ static int get_params__pre_init(struct adapter *sc) { int rc; uint32_t param[2], val[2]; param[0] = FW_PARAM_DEV(PORTVEC); param[1] = FW_PARAM_DEV(CCLK); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 2, param, val); if (rc != 0) { device_printf(sc->dev, "failed to query parameters (pre_init): %d.\n", rc); return (rc); } sc->params.portvec = val[0]; sc->params.nports = bitcount32(val[0]); sc->params.vpd.cclk = val[1]; /* Read device log parameters. */ rc = -t4_init_devlog_params(sc, 1); if (rc == 0) fixup_devlog_params(sc); else { device_printf(sc->dev, "failed to get devlog parameters: %d.\n", rc); rc = 0; /* devlog isn't critical for device operation */ } return (rc); } /* * Retrieve various parameters that are of interest to the driver. The device * has been initialized by the firmware at this point. */ static int get_params__post_init(struct adapter *sc) { int rc; uint32_t param[7], val[7]; struct fw_caps_config_cmd caps; param[0] = FW_PARAM_PFVF(IQFLINT_START); param[1] = FW_PARAM_PFVF(EQ_START); param[2] = FW_PARAM_PFVF(FILTER_START); param[3] = FW_PARAM_PFVF(FILTER_END); param[4] = FW_PARAM_PFVF(L2T_START); param[5] = FW_PARAM_PFVF(L2T_END); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 6, param, val); if (rc != 0) { device_printf(sc->dev, "failed to query parameters (post_init): %d.\n", rc); return (rc); } sc->sge.iq_start = val[0]; sc->sge.eq_start = val[1]; sc->tids.ftid_base = val[2]; sc->tids.nftids = val[3] - val[2] + 1; sc->params.ftid_min = val[2]; sc->params.ftid_max = val[3]; sc->vres.l2t.start = val[4]; sc->vres.l2t.size = val[5] - val[4] + 1; KASSERT(sc->vres.l2t.size <= L2T_SIZE, ("%s: L2 table size (%u) larger than expected (%u)", __func__, sc->vres.l2t.size, L2T_SIZE)); /* get capabilites */ bzero(&caps, sizeof(caps)); caps.op_to_write = htobe32(V_FW_CMD_OP(FW_CAPS_CONFIG_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_READ); caps.cfvalid_to_len16 = htobe32(FW_LEN16(caps)); rc = -t4_wr_mbox(sc, sc->mbox, &caps, sizeof(caps), &caps); if (rc != 0) { device_printf(sc->dev, "failed to get card capabilities: %d.\n", rc); return (rc); } #define READ_CAPS(x) do { \ sc->x = htobe16(caps.x); \ } while (0) READ_CAPS(nbmcaps); READ_CAPS(linkcaps); READ_CAPS(switchcaps); READ_CAPS(niccaps); READ_CAPS(toecaps); READ_CAPS(rdmacaps); READ_CAPS(tlscaps); READ_CAPS(iscsicaps); READ_CAPS(fcoecaps); if (sc->niccaps & FW_CAPS_CONFIG_NIC_ETHOFLD) { param[0] = FW_PARAM_PFVF(ETHOFLD_START); param[1] = FW_PARAM_PFVF(ETHOFLD_END); param[2] = FW_PARAM_DEV(FLOWC_BUFFIFO_SZ); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 3, param, val); if (rc != 0) { device_printf(sc->dev, "failed to query NIC parameters: %d.\n", rc); return (rc); } sc->tids.etid_base = val[0]; sc->params.etid_min = val[0]; sc->tids.netids = val[1] - val[0] + 1; sc->params.netids = sc->tids.netids; sc->params.eo_wr_cred = val[2]; sc->params.ethoffload = 1; } if (sc->toecaps) { /* query offload-related parameters */ param[0] = FW_PARAM_DEV(NTID); param[1] = FW_PARAM_PFVF(SERVER_START); param[2] = FW_PARAM_PFVF(SERVER_END); param[3] = FW_PARAM_PFVF(TDDP_START); param[4] = FW_PARAM_PFVF(TDDP_END); param[5] = FW_PARAM_DEV(FLOWC_BUFFIFO_SZ); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 6, param, val); if (rc != 0) { device_printf(sc->dev, "failed to query TOE parameters: %d.\n", rc); return (rc); } sc->tids.ntids = val[0]; sc->tids.natids = min(sc->tids.ntids / 2, MAX_ATIDS); sc->tids.stid_base = val[1]; sc->tids.nstids = val[2] - val[1] + 1; sc->vres.ddp.start = val[3]; sc->vres.ddp.size = val[4] - val[3] + 1; sc->params.ofldq_wr_cred = val[5]; sc->params.offload = 1; } if (sc->rdmacaps) { param[0] = FW_PARAM_PFVF(STAG_START); param[1] = FW_PARAM_PFVF(STAG_END); param[2] = FW_PARAM_PFVF(RQ_START); param[3] = FW_PARAM_PFVF(RQ_END); param[4] = FW_PARAM_PFVF(PBL_START); param[5] = FW_PARAM_PFVF(PBL_END); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 6, param, val); if (rc != 0) { device_printf(sc->dev, "failed to query RDMA parameters(1): %d.\n", rc); return (rc); } sc->vres.stag.start = val[0]; sc->vres.stag.size = val[1] - val[0] + 1; sc->vres.rq.start = val[2]; sc->vres.rq.size = val[3] - val[2] + 1; sc->vres.pbl.start = val[4]; sc->vres.pbl.size = val[5] - val[4] + 1; param[0] = FW_PARAM_PFVF(SQRQ_START); param[1] = FW_PARAM_PFVF(SQRQ_END); param[2] = FW_PARAM_PFVF(CQ_START); param[3] = FW_PARAM_PFVF(CQ_END); param[4] = FW_PARAM_PFVF(OCQ_START); param[5] = FW_PARAM_PFVF(OCQ_END); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 6, param, val); if (rc != 0) { device_printf(sc->dev, "failed to query RDMA parameters(2): %d.\n", rc); return (rc); } sc->vres.qp.start = val[0]; sc->vres.qp.size = val[1] - val[0] + 1; sc->vres.cq.start = val[2]; sc->vres.cq.size = val[3] - val[2] + 1; sc->vres.ocq.start = val[4]; sc->vres.ocq.size = val[5] - val[4] + 1; } if (sc->iscsicaps) { param[0] = FW_PARAM_PFVF(ISCSI_START); param[1] = FW_PARAM_PFVF(ISCSI_END); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 2, param, val); if (rc != 0) { device_printf(sc->dev, "failed to query iSCSI parameters: %d.\n", rc); return (rc); } sc->vres.iscsi.start = val[0]; sc->vres.iscsi.size = val[1] - val[0] + 1; } /* * We've got the params we wanted to query via the firmware. Now grab * some others directly from the chip. */ rc = t4_read_chip_settings(sc); return (rc); } static int set_params__post_init(struct adapter *sc) { uint32_t param, val; /* ask for encapsulated CPLs */ param = FW_PARAM_PFVF(CPLFW4MSG_ENCAP); val = 1; (void)t4_set_params(sc, sc->mbox, sc->pf, 0, 1, ¶m, &val); return (0); } #undef FW_PARAM_PFVF #undef FW_PARAM_DEV static void t4_set_desc(struct adapter *sc) { char buf[128]; struct adapter_params *p = &sc->params; snprintf(buf, sizeof(buf), "Chelsio %s %sNIC (rev %d), S/N:%s, " "P/N:%s, E/C:%s", p->vpd.id, is_offload(sc) ? "R" : "", chip_rev(sc), p->vpd.sn, p->vpd.pn, p->vpd.ec); device_set_desc_copy(sc->dev, buf); } static void build_medialist(struct port_info *pi, struct ifmedia *media) { int m; PORT_LOCK(pi); ifmedia_removeall(media); m = IFM_ETHER | IFM_FDX; switch(pi->port_type) { case FW_PORT_TYPE_BT_XFI: case FW_PORT_TYPE_BT_XAUI: ifmedia_add(media, m | IFM_10G_T, 0, NULL); /* fall through */ case FW_PORT_TYPE_BT_SGMII: ifmedia_add(media, m | IFM_1000_T, 0, NULL); ifmedia_add(media, m | IFM_100_TX, 0, NULL); ifmedia_add(media, IFM_ETHER | IFM_AUTO, 0, NULL); ifmedia_set(media, IFM_ETHER | IFM_AUTO); break; case FW_PORT_TYPE_CX4: ifmedia_add(media, m | IFM_10G_CX4, 0, NULL); ifmedia_set(media, m | IFM_10G_CX4); break; case FW_PORT_TYPE_QSFP_10G: case FW_PORT_TYPE_SFP: case FW_PORT_TYPE_FIBER_XFI: case FW_PORT_TYPE_FIBER_XAUI: switch (pi->mod_type) { case FW_PORT_MOD_TYPE_LR: ifmedia_add(media, m | IFM_10G_LR, 0, NULL); ifmedia_set(media, m | IFM_10G_LR); break; case FW_PORT_MOD_TYPE_SR: ifmedia_add(media, m | IFM_10G_SR, 0, NULL); ifmedia_set(media, m | IFM_10G_SR); break; case FW_PORT_MOD_TYPE_LRM: ifmedia_add(media, m | IFM_10G_LRM, 0, NULL); ifmedia_set(media, m | IFM_10G_LRM); break; case FW_PORT_MOD_TYPE_TWINAX_PASSIVE: case FW_PORT_MOD_TYPE_TWINAX_ACTIVE: ifmedia_add(media, m | IFM_10G_TWINAX, 0, NULL); ifmedia_set(media, m | IFM_10G_TWINAX); break; case FW_PORT_MOD_TYPE_NONE: m &= ~IFM_FDX; ifmedia_add(media, m | IFM_NONE, 0, NULL); ifmedia_set(media, m | IFM_NONE); break; case FW_PORT_MOD_TYPE_NA: case FW_PORT_MOD_TYPE_ER: default: device_printf(pi->dev, "unknown port_type (%d), mod_type (%d)\n", pi->port_type, pi->mod_type); ifmedia_add(media, m | IFM_UNKNOWN, 0, NULL); ifmedia_set(media, m | IFM_UNKNOWN); break; } break; case FW_PORT_TYPE_QSFP: switch (pi->mod_type) { case FW_PORT_MOD_TYPE_LR: ifmedia_add(media, m | IFM_40G_LR4, 0, NULL); ifmedia_set(media, m | IFM_40G_LR4); break; case FW_PORT_MOD_TYPE_SR: ifmedia_add(media, m | IFM_40G_SR4, 0, NULL); ifmedia_set(media, m | IFM_40G_SR4); break; case FW_PORT_MOD_TYPE_TWINAX_PASSIVE: case FW_PORT_MOD_TYPE_TWINAX_ACTIVE: ifmedia_add(media, m | IFM_40G_CR4, 0, NULL); ifmedia_set(media, m | IFM_40G_CR4); break; case FW_PORT_MOD_TYPE_NONE: m &= ~IFM_FDX; ifmedia_add(media, m | IFM_NONE, 0, NULL); ifmedia_set(media, m | IFM_NONE); break; default: device_printf(pi->dev, "unknown port_type (%d), mod_type (%d)\n", pi->port_type, pi->mod_type); ifmedia_add(media, m | IFM_UNKNOWN, 0, NULL); ifmedia_set(media, m | IFM_UNKNOWN); break; } break; default: device_printf(pi->dev, "unknown port_type (%d), mod_type (%d)\n", pi->port_type, pi->mod_type); ifmedia_add(media, m | IFM_UNKNOWN, 0, NULL); ifmedia_set(media, m | IFM_UNKNOWN); break; } PORT_UNLOCK(pi); } #define FW_MAC_EXACT_CHUNK 7 /* * Program the port's XGMAC based on parameters in ifnet. The caller also * indicates which parameters should be programmed (the rest are left alone). */ int update_mac_settings(struct ifnet *ifp, int flags) { int rc = 0; struct vi_info *vi = ifp->if_softc; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; int mtu = -1, promisc = -1, allmulti = -1, vlanex = -1; ASSERT_SYNCHRONIZED_OP(sc); KASSERT(flags, ("%s: not told what to update.", __func__)); if (flags & XGMAC_MTU) mtu = ifp->if_mtu; if (flags & XGMAC_PROMISC) promisc = ifp->if_flags & IFF_PROMISC ? 1 : 0; if (flags & XGMAC_ALLMULTI) allmulti = ifp->if_flags & IFF_ALLMULTI ? 1 : 0; if (flags & XGMAC_VLANEX) vlanex = ifp->if_capenable & IFCAP_VLAN_HWTAGGING ? 1 : 0; if (flags & (XGMAC_MTU|XGMAC_PROMISC|XGMAC_ALLMULTI|XGMAC_VLANEX)) { rc = -t4_set_rxmode(sc, sc->mbox, vi->viid, mtu, promisc, allmulti, 1, vlanex, false); if (rc) { if_printf(ifp, "set_rxmode (%x) failed: %d\n", flags, rc); return (rc); } } if (flags & XGMAC_UCADDR) { uint8_t ucaddr[ETHER_ADDR_LEN]; bcopy(IF_LLADDR(ifp), ucaddr, sizeof(ucaddr)); rc = t4_change_mac(sc, sc->mbox, vi->viid, vi->xact_addr_filt, ucaddr, true, true); if (rc < 0) { rc = -rc; if_printf(ifp, "change_mac failed: %d\n", rc); return (rc); } else { vi->xact_addr_filt = rc; rc = 0; } } if (flags & XGMAC_MCADDRS) { const uint8_t *mcaddr[FW_MAC_EXACT_CHUNK]; int del = 1; uint64_t hash = 0; struct ifmultiaddr *ifma; int i = 0, j; if_maddr_rlock(ifp); TAILQ_FOREACH(ifma, &ifp->if_multiaddrs, ifma_link) { if (ifma->ifma_addr->sa_family != AF_LINK) continue; mcaddr[i] = LLADDR((struct sockaddr_dl *)ifma->ifma_addr); MPASS(ETHER_IS_MULTICAST(mcaddr[i])); i++; if (i == FW_MAC_EXACT_CHUNK) { rc = t4_alloc_mac_filt(sc, sc->mbox, vi->viid, del, i, mcaddr, NULL, &hash, 0); if (rc < 0) { rc = -rc; for (j = 0; j < i; j++) { if_printf(ifp, "failed to add mc address" " %02x:%02x:%02x:" "%02x:%02x:%02x rc=%d\n", mcaddr[j][0], mcaddr[j][1], mcaddr[j][2], mcaddr[j][3], mcaddr[j][4], mcaddr[j][5], rc); } goto mcfail; } del = 0; i = 0; } } if (i > 0) { rc = t4_alloc_mac_filt(sc, sc->mbox, vi->viid, del, i, mcaddr, NULL, &hash, 0); if (rc < 0) { rc = -rc; for (j = 0; j < i; j++) { if_printf(ifp, "failed to add mc address" " %02x:%02x:%02x:" "%02x:%02x:%02x rc=%d\n", mcaddr[j][0], mcaddr[j][1], mcaddr[j][2], mcaddr[j][3], mcaddr[j][4], mcaddr[j][5], rc); } goto mcfail; } } rc = -t4_set_addr_hash(sc, sc->mbox, vi->viid, 0, hash, 0); if (rc != 0) if_printf(ifp, "failed to set mc address hash: %d", rc); mcfail: if_maddr_runlock(ifp); } return (rc); } /* * {begin|end}_synchronized_op must be called from the same thread. */ int begin_synchronized_op(struct adapter *sc, struct vi_info *vi, int flags, char *wmesg) { int rc, pri; #ifdef WITNESS /* the caller thinks it's ok to sleep, but is it really? */ if (flags & SLEEP_OK) WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "begin_synchronized_op"); #endif if (INTR_OK) pri = PCATCH; else pri = 0; ADAPTER_LOCK(sc); for (;;) { if (vi && IS_DOOMED(vi)) { rc = ENXIO; goto done; } if (!IS_BUSY(sc)) { rc = 0; break; } if (!(flags & SLEEP_OK)) { rc = EBUSY; goto done; } if (mtx_sleep(&sc->flags, &sc->sc_lock, pri, wmesg, 0)) { rc = EINTR; goto done; } } KASSERT(!IS_BUSY(sc), ("%s: controller busy.", __func__)); SET_BUSY(sc); #ifdef INVARIANTS sc->last_op = wmesg; sc->last_op_thr = curthread; sc->last_op_flags = flags; #endif done: if (!(flags & HOLD_LOCK) || rc) ADAPTER_UNLOCK(sc); return (rc); } /* * Tell if_ioctl and if_init that the VI is going away. This is * special variant of begin_synchronized_op and must be paired with a * call to end_synchronized_op. */ void doom_vi(struct adapter *sc, struct vi_info *vi) { ADAPTER_LOCK(sc); SET_DOOMED(vi); wakeup(&sc->flags); while (IS_BUSY(sc)) mtx_sleep(&sc->flags, &sc->sc_lock, 0, "t4detach", 0); SET_BUSY(sc); #ifdef INVARIANTS sc->last_op = "t4detach"; sc->last_op_thr = curthread; sc->last_op_flags = 0; #endif ADAPTER_UNLOCK(sc); } /* * {begin|end}_synchronized_op must be called from the same thread. */ void end_synchronized_op(struct adapter *sc, int flags) { if (flags & LOCK_HELD) ADAPTER_LOCK_ASSERT_OWNED(sc); else ADAPTER_LOCK(sc); KASSERT(IS_BUSY(sc), ("%s: controller not busy.", __func__)); CLR_BUSY(sc); wakeup(&sc->flags); ADAPTER_UNLOCK(sc); } static int cxgbe_init_synchronized(struct vi_info *vi) { struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; struct ifnet *ifp = vi->ifp; int rc = 0, i; struct sge_txq *txq; ASSERT_SYNCHRONIZED_OP(sc); if (ifp->if_drv_flags & IFF_DRV_RUNNING) return (0); /* already running */ if (!(sc->flags & FULL_INIT_DONE) && ((rc = adapter_full_init(sc)) != 0)) return (rc); /* error message displayed already */ if (!(vi->flags & VI_INIT_DONE) && ((rc = vi_full_init(vi)) != 0)) return (rc); /* error message displayed already */ rc = update_mac_settings(ifp, XGMAC_ALL); if (rc) goto done; /* error message displayed already */ rc = -t4_enable_vi(sc, sc->mbox, vi->viid, true, true); if (rc != 0) { if_printf(ifp, "enable_vi failed: %d\n", rc); goto done; } /* * Can't fail from this point onwards. Review cxgbe_uninit_synchronized * if this changes. */ for_each_txq(vi, i, txq) { TXQ_LOCK(txq); txq->eq.flags |= EQ_ENABLED; TXQ_UNLOCK(txq); } /* * The first iq of the first port to come up is used for tracing. */ if (sc->traceq < 0 && IS_MAIN_VI(vi)) { sc->traceq = sc->sge.rxq[vi->first_rxq].iq.abs_id; t4_write_reg(sc, is_t4(sc) ? A_MPS_TRC_RSS_CONTROL : A_MPS_T5_TRC_RSS_CONTROL, V_RSSCONTROL(pi->tx_chan) | V_QUEUENUMBER(sc->traceq)); pi->flags |= HAS_TRACEQ; } /* all ok */ PORT_LOCK(pi); ifp->if_drv_flags |= IFF_DRV_RUNNING; pi->up_vis++; if (pi->nvi > 1) callout_reset(&vi->tick, hz, vi_tick, vi); else callout_reset(&pi->tick, hz, cxgbe_tick, pi); PORT_UNLOCK(pi); done: if (rc != 0) cxgbe_uninit_synchronized(vi); return (rc); } /* * Idempotent. */ static int cxgbe_uninit_synchronized(struct vi_info *vi) { struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; struct ifnet *ifp = vi->ifp; int rc, i; struct sge_txq *txq; ASSERT_SYNCHRONIZED_OP(sc); if (!(vi->flags & VI_INIT_DONE)) { KASSERT(!(ifp->if_drv_flags & IFF_DRV_RUNNING), ("uninited VI is running")); return (0); } /* * Disable the VI so that all its data in either direction is discarded * by the MPS. Leave everything else (the queues, interrupts, and 1Hz * tick) intact as the TP can deliver negative advice or data that it's * holding in its RAM (for an offloaded connection) even after the VI is * disabled. */ rc = -t4_enable_vi(sc, sc->mbox, vi->viid, false, false); if (rc) { if_printf(ifp, "disable_vi failed: %d\n", rc); return (rc); } for_each_txq(vi, i, txq) { TXQ_LOCK(txq); txq->eq.flags &= ~EQ_ENABLED; TXQ_UNLOCK(txq); } PORT_LOCK(pi); if (pi->nvi == 1) callout_stop(&pi->tick); else callout_stop(&vi->tick); if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) { PORT_UNLOCK(pi); return (0); } ifp->if_drv_flags &= ~IFF_DRV_RUNNING; pi->up_vis--; if (pi->up_vis > 0) { PORT_UNLOCK(pi); return (0); } PORT_UNLOCK(pi); pi->link_cfg.link_ok = 0; pi->link_cfg.speed = 0; pi->linkdnrc = -1; t4_os_link_changed(sc, pi->port_id, 0, -1); return (0); } /* * It is ok for this function to fail midway and return right away. t4_detach * will walk the entire sc->irq list and clean up whatever is valid. */ static int setup_intr_handlers(struct adapter *sc) { int rc, rid, p, q, v; char s[8]; struct irq *irq; struct port_info *pi; struct vi_info *vi; struct sge_rxq *rxq; #ifdef TCP_OFFLOAD struct sge_ofld_rxq *ofld_rxq; #endif #ifdef DEV_NETMAP struct sge_nm_rxq *nm_rxq; #endif #ifdef RSS int nbuckets = rss_getnumbuckets(); #endif /* * Setup interrupts. */ irq = &sc->irq[0]; rid = sc->intr_type == INTR_INTX ? 0 : 1; if (sc->intr_count == 1) return (t4_alloc_irq(sc, irq, rid, t4_intr_all, sc, "all")); /* Multiple interrupts. */ KASSERT(sc->intr_count >= T4_EXTRA_INTR + sc->params.nports, ("%s: too few intr.", __func__)); /* The first one is always error intr */ rc = t4_alloc_irq(sc, irq, rid, t4_intr_err, sc, "err"); if (rc != 0) return (rc); irq++; rid++; /* The second one is always the firmware event queue */ rc = t4_alloc_irq(sc, irq, rid, t4_intr_evt, &sc->sge.fwq, "evt"); if (rc != 0) return (rc); irq++; rid++; for_each_port(sc, p) { pi = sc->port[p]; for_each_vi(pi, v, vi) { vi->first_intr = rid - 1; #ifdef DEV_NETMAP if (vi->flags & VI_NETMAP) { for_each_nm_rxq(vi, q, nm_rxq) { snprintf(s, sizeof(s), "%d-%d", p, q); rc = t4_alloc_irq(sc, irq, rid, t4_nm_intr, nm_rxq, s); if (rc != 0) return (rc); irq++; rid++; vi->nintr++; } continue; } #endif if (vi->flags & INTR_RXQ) { for_each_rxq(vi, q, rxq) { if (v == 0) snprintf(s, sizeof(s), "%d.%d", p, q); else snprintf(s, sizeof(s), "%d(%d).%d", p, v, q); rc = t4_alloc_irq(sc, irq, rid, t4_intr, rxq, s); if (rc != 0) return (rc); #ifdef RSS bus_bind_intr(sc->dev, irq->res, rss_getcpu(q % nbuckets)); #endif irq++; rid++; vi->nintr++; } } #ifdef TCP_OFFLOAD if (vi->flags & INTR_OFLD_RXQ) { for_each_ofld_rxq(vi, q, ofld_rxq) { snprintf(s, sizeof(s), "%d,%d", p, q); rc = t4_alloc_irq(sc, irq, rid, t4_intr, ofld_rxq, s); if (rc != 0) return (rc); irq++; rid++; vi->nintr++; } } #endif } } MPASS(irq == &sc->irq[sc->intr_count]); return (0); } int adapter_full_init(struct adapter *sc) { int rc, i; ASSERT_SYNCHRONIZED_OP(sc); ADAPTER_LOCK_ASSERT_NOTOWNED(sc); KASSERT((sc->flags & FULL_INIT_DONE) == 0, ("%s: FULL_INIT_DONE already", __func__)); /* * queues that belong to the adapter (not any particular port). */ rc = t4_setup_adapter_queues(sc); if (rc != 0) goto done; for (i = 0; i < nitems(sc->tq); i++) { sc->tq[i] = taskqueue_create("t4 taskq", M_NOWAIT, taskqueue_thread_enqueue, &sc->tq[i]); if (sc->tq[i] == NULL) { device_printf(sc->dev, "failed to allocate task queue %d\n", i); rc = ENOMEM; goto done; } taskqueue_start_threads(&sc->tq[i], 1, PI_NET, "%s tq%d", device_get_nameunit(sc->dev), i); } t4_intr_enable(sc); sc->flags |= FULL_INIT_DONE; done: if (rc != 0) adapter_full_uninit(sc); return (rc); } int adapter_full_uninit(struct adapter *sc) { int i; ADAPTER_LOCK_ASSERT_NOTOWNED(sc); t4_teardown_adapter_queues(sc); for (i = 0; i < nitems(sc->tq) && sc->tq[i]; i++) { taskqueue_free(sc->tq[i]); sc->tq[i] = NULL; } sc->flags &= ~FULL_INIT_DONE; return (0); } #ifdef RSS #define SUPPORTED_RSS_HASHTYPES (RSS_HASHTYPE_RSS_IPV4 | \ RSS_HASHTYPE_RSS_TCP_IPV4 | RSS_HASHTYPE_RSS_IPV6 | \ RSS_HASHTYPE_RSS_TCP_IPV6 | RSS_HASHTYPE_RSS_UDP_IPV4 | \ RSS_HASHTYPE_RSS_UDP_IPV6) /* Translates kernel hash types to hardware. */ static int hashconfig_to_hashen(int hashconfig) { int hashen = 0; if (hashconfig & RSS_HASHTYPE_RSS_IPV4) hashen |= F_FW_RSS_VI_CONFIG_CMD_IP4TWOTUPEN; if (hashconfig & RSS_HASHTYPE_RSS_IPV6) hashen |= F_FW_RSS_VI_CONFIG_CMD_IP6TWOTUPEN; if (hashconfig & RSS_HASHTYPE_RSS_UDP_IPV4) { hashen |= F_FW_RSS_VI_CONFIG_CMD_UDPEN | F_FW_RSS_VI_CONFIG_CMD_IP4FOURTUPEN; } if (hashconfig & RSS_HASHTYPE_RSS_UDP_IPV6) { hashen |= F_FW_RSS_VI_CONFIG_CMD_UDPEN | F_FW_RSS_VI_CONFIG_CMD_IP6FOURTUPEN; } if (hashconfig & RSS_HASHTYPE_RSS_TCP_IPV4) hashen |= F_FW_RSS_VI_CONFIG_CMD_IP4FOURTUPEN; if (hashconfig & RSS_HASHTYPE_RSS_TCP_IPV6) hashen |= F_FW_RSS_VI_CONFIG_CMD_IP6FOURTUPEN; return (hashen); } /* Translates hardware hash types to kernel. */ static int hashen_to_hashconfig(int hashen) { int hashconfig = 0; if (hashen & F_FW_RSS_VI_CONFIG_CMD_UDPEN) { /* * If UDP hashing was enabled it must have been enabled for * either IPv4 or IPv6 (inclusive or). Enabling UDP without * enabling any 4-tuple hash is nonsense configuration. */ MPASS(hashen & (F_FW_RSS_VI_CONFIG_CMD_IP4FOURTUPEN | F_FW_RSS_VI_CONFIG_CMD_IP6FOURTUPEN)); if (hashen & F_FW_RSS_VI_CONFIG_CMD_IP4FOURTUPEN) hashconfig |= RSS_HASHTYPE_RSS_UDP_IPV4; if (hashen & F_FW_RSS_VI_CONFIG_CMD_IP6FOURTUPEN) hashconfig |= RSS_HASHTYPE_RSS_UDP_IPV6; } if (hashen & F_FW_RSS_VI_CONFIG_CMD_IP4FOURTUPEN) hashconfig |= RSS_HASHTYPE_RSS_TCP_IPV4; if (hashen & F_FW_RSS_VI_CONFIG_CMD_IP6FOURTUPEN) hashconfig |= RSS_HASHTYPE_RSS_TCP_IPV6; if (hashen & F_FW_RSS_VI_CONFIG_CMD_IP4TWOTUPEN) hashconfig |= RSS_HASHTYPE_RSS_IPV4; if (hashen & F_FW_RSS_VI_CONFIG_CMD_IP6TWOTUPEN) hashconfig |= RSS_HASHTYPE_RSS_IPV6; return (hashconfig); } #endif int vi_full_init(struct vi_info *vi) { struct adapter *sc = vi->pi->adapter; struct ifnet *ifp = vi->ifp; uint16_t *rss; struct sge_rxq *rxq; int rc, i, j, hashen; #ifdef RSS int nbuckets = rss_getnumbuckets(); int hashconfig = rss_gethashconfig(); int extra; uint32_t raw_rss_key[RSS_KEYSIZE / sizeof(uint32_t)]; uint32_t rss_key[RSS_KEYSIZE / sizeof(uint32_t)]; #endif ASSERT_SYNCHRONIZED_OP(sc); KASSERT((vi->flags & VI_INIT_DONE) == 0, ("%s: VI_INIT_DONE already", __func__)); sysctl_ctx_init(&vi->ctx); vi->flags |= VI_SYSCTL_CTX; /* * Allocate tx/rx/fl queues for this VI. */ rc = t4_setup_vi_queues(vi); if (rc != 0) goto done; /* error message displayed already */ #ifdef DEV_NETMAP /* Netmap VIs configure RSS when netmap is enabled. */ if (vi->flags & VI_NETMAP) { vi->flags |= VI_INIT_DONE; return (0); } #endif /* * Setup RSS for this VI. Save a copy of the RSS table for later use. */ if (vi->nrxq > vi->rss_size) { if_printf(ifp, "nrxq (%d) > hw RSS table size (%d); " "some queues will never receive traffic.\n", vi->nrxq, vi->rss_size); } else if (vi->rss_size % vi->nrxq) { if_printf(ifp, "nrxq (%d), hw RSS table size (%d); " "expect uneven traffic distribution.\n", vi->nrxq, vi->rss_size); } #ifdef RSS MPASS(RSS_KEYSIZE == 40); if (vi->nrxq != nbuckets) { if_printf(ifp, "nrxq (%d) != kernel RSS buckets (%d);" "performance will be impacted.\n", vi->nrxq, nbuckets); } rss_getkey((void *)&raw_rss_key[0]); for (i = 0; i < nitems(rss_key); i++) { rss_key[i] = htobe32(raw_rss_key[nitems(rss_key) - 1 - i]); } t4_write_rss_key(sc, &rss_key[0], -1); #endif rss = malloc(vi->rss_size * sizeof (*rss), M_CXGBE, M_ZERO | M_WAITOK); for (i = 0; i < vi->rss_size;) { #ifdef RSS j = rss_get_indirection_to_bucket(i); j %= vi->nrxq; rxq = &sc->sge.rxq[vi->first_rxq + j]; rss[i++] = rxq->iq.abs_id; #else for_each_rxq(vi, j, rxq) { rss[i++] = rxq->iq.abs_id; if (i == vi->rss_size) break; } #endif } rc = -t4_config_rss_range(sc, sc->mbox, vi->viid, 0, vi->rss_size, rss, vi->rss_size); if (rc != 0) { if_printf(ifp, "rss_config failed: %d\n", rc); goto done; } #ifdef RSS hashen = hashconfig_to_hashen(hashconfig); /* * We may have had to enable some hashes even though the global config * wants them disabled. This is a potential problem that must be * reported to the user. */ extra = hashen_to_hashconfig(hashen) ^ hashconfig; /* * If we consider only the supported hash types, then the enabled hashes * are a superset of the requested hashes. In other words, there cannot * be any supported hash that was requested but not enabled, but there * can be hashes that were not requested but had to be enabled. */ extra &= SUPPORTED_RSS_HASHTYPES; MPASS((extra & hashconfig) == 0); if (extra) { if_printf(ifp, "global RSS config (0x%x) cannot be accommodated.\n", hashconfig); } if (extra & RSS_HASHTYPE_RSS_IPV4) if_printf(ifp, "IPv4 2-tuple hashing forced on.\n"); if (extra & RSS_HASHTYPE_RSS_TCP_IPV4) if_printf(ifp, "TCP/IPv4 4-tuple hashing forced on.\n"); if (extra & RSS_HASHTYPE_RSS_IPV6) if_printf(ifp, "IPv6 2-tuple hashing forced on.\n"); if (extra & RSS_HASHTYPE_RSS_TCP_IPV6) if_printf(ifp, "TCP/IPv6 4-tuple hashing forced on.\n"); if (extra & RSS_HASHTYPE_RSS_UDP_IPV4) if_printf(ifp, "UDP/IPv4 4-tuple hashing forced on.\n"); if (extra & RSS_HASHTYPE_RSS_UDP_IPV6) if_printf(ifp, "UDP/IPv6 4-tuple hashing forced on.\n"); #else hashen = F_FW_RSS_VI_CONFIG_CMD_IP6FOURTUPEN | F_FW_RSS_VI_CONFIG_CMD_IP6TWOTUPEN | F_FW_RSS_VI_CONFIG_CMD_IP4FOURTUPEN | F_FW_RSS_VI_CONFIG_CMD_IP4TWOTUPEN | F_FW_RSS_VI_CONFIG_CMD_UDPEN; #endif rc = -t4_config_vi_rss(sc, sc->mbox, vi->viid, hashen, rss[0]); if (rc != 0) { if_printf(ifp, "rss hash/defaultq config failed: %d\n", rc); goto done; } vi->rss = rss; vi->flags |= VI_INIT_DONE; done: if (rc != 0) vi_full_uninit(vi); return (rc); } /* * Idempotent. */ int vi_full_uninit(struct vi_info *vi) { struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; int i; struct sge_rxq *rxq; struct sge_txq *txq; #ifdef TCP_OFFLOAD struct sge_ofld_rxq *ofld_rxq; struct sge_wrq *ofld_txq; #endif if (vi->flags & VI_INIT_DONE) { /* Need to quiesce queues. */ #ifdef DEV_NETMAP if (vi->flags & VI_NETMAP) goto skip; #endif /* XXX: Only for the first VI? */ if (IS_MAIN_VI(vi)) quiesce_wrq(sc, &sc->sge.ctrlq[pi->port_id]); for_each_txq(vi, i, txq) { quiesce_txq(sc, txq); } #ifdef TCP_OFFLOAD for_each_ofld_txq(vi, i, ofld_txq) { quiesce_wrq(sc, ofld_txq); } #endif for_each_rxq(vi, i, rxq) { quiesce_iq(sc, &rxq->iq); quiesce_fl(sc, &rxq->fl); } #ifdef TCP_OFFLOAD for_each_ofld_rxq(vi, i, ofld_rxq) { quiesce_iq(sc, &ofld_rxq->iq); quiesce_fl(sc, &ofld_rxq->fl); } #endif free(vi->rss, M_CXGBE); } #ifdef DEV_NETMAP skip: #endif t4_teardown_vi_queues(vi); vi->flags &= ~VI_INIT_DONE; return (0); } static void quiesce_txq(struct adapter *sc, struct sge_txq *txq) { struct sge_eq *eq = &txq->eq; struct sge_qstat *spg = (void *)&eq->desc[eq->sidx]; (void) sc; /* unused */ #ifdef INVARIANTS TXQ_LOCK(txq); MPASS((eq->flags & EQ_ENABLED) == 0); TXQ_UNLOCK(txq); #endif /* Wait for the mp_ring to empty. */ while (!mp_ring_is_idle(txq->r)) { mp_ring_check_drainage(txq->r, 0); pause("rquiesce", 1); } /* Then wait for the hardware to finish. */ while (spg->cidx != htobe16(eq->pidx)) pause("equiesce", 1); /* Finally, wait for the driver to reclaim all descriptors. */ while (eq->cidx != eq->pidx) pause("dquiesce", 1); } static void quiesce_wrq(struct adapter *sc, struct sge_wrq *wrq) { /* XXXTX */ } static void quiesce_iq(struct adapter *sc, struct sge_iq *iq) { (void) sc; /* unused */ /* Synchronize with the interrupt handler */ while (!atomic_cmpset_int(&iq->state, IQS_IDLE, IQS_DISABLED)) pause("iqfree", 1); } static void quiesce_fl(struct adapter *sc, struct sge_fl *fl) { mtx_lock(&sc->sfl_lock); FL_LOCK(fl); fl->flags |= FL_DOOMED; FL_UNLOCK(fl); callout_stop(&sc->sfl_callout); mtx_unlock(&sc->sfl_lock); KASSERT((fl->flags & FL_STARVING) == 0, ("%s: still starving", __func__)); } static int t4_alloc_irq(struct adapter *sc, struct irq *irq, int rid, driver_intr_t *handler, void *arg, char *name) { int rc; irq->rid = rid; irq->res = bus_alloc_resource_any(sc->dev, SYS_RES_IRQ, &irq->rid, RF_SHAREABLE | RF_ACTIVE); if (irq->res == NULL) { device_printf(sc->dev, "failed to allocate IRQ for rid %d, name %s.\n", rid, name); return (ENOMEM); } rc = bus_setup_intr(sc->dev, irq->res, INTR_MPSAFE | INTR_TYPE_NET, NULL, handler, arg, &irq->tag); if (rc != 0) { device_printf(sc->dev, "failed to setup interrupt for rid %d, name %s: %d\n", rid, name, rc); } else if (name) bus_describe_intr(sc->dev, irq->res, irq->tag, name); return (rc); } static int t4_free_irq(struct adapter *sc, struct irq *irq) { if (irq->tag) bus_teardown_intr(sc->dev, irq->res, irq->tag); if (irq->res) bus_release_resource(sc->dev, SYS_RES_IRQ, irq->rid, irq->res); bzero(irq, sizeof(*irq)); return (0); } static void get_regs(struct adapter *sc, struct t4_regdump *regs, uint8_t *buf) { regs->version = chip_id(sc) | chip_rev(sc) << 10; t4_get_regs(sc, buf, regs->len); } #define A_PL_INDIR_CMD 0x1f8 #define S_PL_AUTOINC 31 #define M_PL_AUTOINC 0x1U #define V_PL_AUTOINC(x) ((x) << S_PL_AUTOINC) #define G_PL_AUTOINC(x) (((x) >> S_PL_AUTOINC) & M_PL_AUTOINC) #define S_PL_VFID 20 #define M_PL_VFID 0xffU #define V_PL_VFID(x) ((x) << S_PL_VFID) #define G_PL_VFID(x) (((x) >> S_PL_VFID) & M_PL_VFID) #define S_PL_ADDR 0 #define M_PL_ADDR 0xfffffU #define V_PL_ADDR(x) ((x) << S_PL_ADDR) #define G_PL_ADDR(x) (((x) >> S_PL_ADDR) & M_PL_ADDR) #define A_PL_INDIR_DATA 0x1fc static uint64_t read_vf_stat(struct adapter *sc, unsigned int viid, int reg) { u32 stats[2]; mtx_assert(&sc->reg_lock, MA_OWNED); t4_write_reg(sc, A_PL_INDIR_CMD, V_PL_AUTOINC(1) | V_PL_VFID(G_FW_VIID_VIN(viid)) | V_PL_ADDR(VF_MPS_REG(reg))); stats[0] = t4_read_reg(sc, A_PL_INDIR_DATA); stats[1] = t4_read_reg(sc, A_PL_INDIR_DATA); return (((uint64_t)stats[1]) << 32 | stats[0]); } static void t4_get_vi_stats(struct adapter *sc, unsigned int viid, struct fw_vi_stats_vf *stats) { #define GET_STAT(name) \ read_vf_stat(sc, viid, A_MPS_VF_STAT_##name##_L) stats->tx_bcast_bytes = GET_STAT(TX_VF_BCAST_BYTES); stats->tx_bcast_frames = GET_STAT(TX_VF_BCAST_FRAMES); stats->tx_mcast_bytes = GET_STAT(TX_VF_MCAST_BYTES); stats->tx_mcast_frames = GET_STAT(TX_VF_MCAST_FRAMES); stats->tx_ucast_bytes = GET_STAT(TX_VF_UCAST_BYTES); stats->tx_ucast_frames = GET_STAT(TX_VF_UCAST_FRAMES); stats->tx_drop_frames = GET_STAT(TX_VF_DROP_FRAMES); stats->tx_offload_bytes = GET_STAT(TX_VF_OFFLOAD_BYTES); stats->tx_offload_frames = GET_STAT(TX_VF_OFFLOAD_FRAMES); stats->rx_bcast_bytes = GET_STAT(RX_VF_BCAST_BYTES); stats->rx_bcast_frames = GET_STAT(RX_VF_BCAST_FRAMES); stats->rx_mcast_bytes = GET_STAT(RX_VF_MCAST_BYTES); stats->rx_mcast_frames = GET_STAT(RX_VF_MCAST_FRAMES); stats->rx_ucast_bytes = GET_STAT(RX_VF_UCAST_BYTES); stats->rx_ucast_frames = GET_STAT(RX_VF_UCAST_FRAMES); stats->rx_err_frames = GET_STAT(RX_VF_ERR_FRAMES); #undef GET_STAT } static void t4_clr_vi_stats(struct adapter *sc, unsigned int viid) { int reg; t4_write_reg(sc, A_PL_INDIR_CMD, V_PL_AUTOINC(1) | V_PL_VFID(G_FW_VIID_VIN(viid)) | V_PL_ADDR(VF_MPS_REG(A_MPS_VF_STAT_TX_VF_BCAST_BYTES_L))); for (reg = A_MPS_VF_STAT_TX_VF_BCAST_BYTES_L; reg <= A_MPS_VF_STAT_RX_VF_ERR_FRAMES_H; reg += 4) t4_write_reg(sc, A_PL_INDIR_DATA, 0); } static void vi_refresh_stats(struct adapter *sc, struct vi_info *vi) { struct timeval tv; const struct timeval interval = {0, 250000}; /* 250ms */ if (!(vi->flags & VI_INIT_DONE)) return; getmicrotime(&tv); timevalsub(&tv, &interval); if (timevalcmp(&tv, &vi->last_refreshed, <)) return; mtx_lock(&sc->reg_lock); t4_get_vi_stats(sc, vi->viid, &vi->stats); getmicrotime(&vi->last_refreshed); mtx_unlock(&sc->reg_lock); } static void cxgbe_refresh_stats(struct adapter *sc, struct port_info *pi) { int i; u_int v, tnl_cong_drops; struct timeval tv; const struct timeval interval = {0, 250000}; /* 250ms */ getmicrotime(&tv); timevalsub(&tv, &interval); if (timevalcmp(&tv, &pi->last_refreshed, <)) return; tnl_cong_drops = 0; t4_get_port_stats(sc, pi->tx_chan, &pi->stats); for (i = 0; i < sc->chip_params->nchan; i++) { if (pi->rx_chan_map & (1 << i)) { mtx_lock(&sc->reg_lock); t4_read_indirect(sc, A_TP_MIB_INDEX, A_TP_MIB_DATA, &v, 1, A_TP_MIB_TNL_CNG_DROP_0 + i); mtx_unlock(&sc->reg_lock); tnl_cong_drops += v; } } pi->tnl_cong_drops = tnl_cong_drops; getmicrotime(&pi->last_refreshed); } static void cxgbe_tick(void *arg) { struct port_info *pi = arg; struct adapter *sc = pi->adapter; PORT_LOCK_ASSERT_OWNED(pi); cxgbe_refresh_stats(sc, pi); callout_schedule(&pi->tick, hz); } void vi_tick(void *arg) { struct vi_info *vi = arg; struct adapter *sc = vi->pi->adapter; vi_refresh_stats(sc, vi); callout_schedule(&vi->tick, hz); } static void cxgbe_vlan_config(void *arg, struct ifnet *ifp, uint16_t vid) { struct ifnet *vlan; if (arg != ifp || ifp->if_type != IFT_ETHER) return; vlan = VLAN_DEVAT(ifp, vid); VLAN_SETCOOKIE(vlan, ifp); } static int cpl_not_handled(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { #ifdef INVARIANTS panic("%s: opcode 0x%02x on iq %p with payload %p", __func__, rss->opcode, iq, m); #else log(LOG_ERR, "%s: opcode 0x%02x on iq %p with payload %p\n", __func__, rss->opcode, iq, m); m_freem(m); #endif return (EDOOFUS); } int t4_register_cpl_handler(struct adapter *sc, int opcode, cpl_handler_t h) { uintptr_t *loc, new; if (opcode >= nitems(sc->cpl_handler)) return (EINVAL); new = h ? (uintptr_t)h : (uintptr_t)cpl_not_handled; loc = (uintptr_t *) &sc->cpl_handler[opcode]; atomic_store_rel_ptr(loc, new); return (0); } static int an_not_handled(struct sge_iq *iq, const struct rsp_ctrl *ctrl) { #ifdef INVARIANTS panic("%s: async notification on iq %p (ctrl %p)", __func__, iq, ctrl); #else log(LOG_ERR, "%s: async notification on iq %p (ctrl %p)\n", __func__, iq, ctrl); #endif return (EDOOFUS); } int t4_register_an_handler(struct adapter *sc, an_handler_t h) { uintptr_t *loc, new; new = h ? (uintptr_t)h : (uintptr_t)an_not_handled; loc = (uintptr_t *) &sc->an_handler; atomic_store_rel_ptr(loc, new); return (0); } static int fw_msg_not_handled(struct adapter *sc, const __be64 *rpl) { const struct cpl_fw6_msg *cpl = __containerof(rpl, struct cpl_fw6_msg, data[0]); #ifdef INVARIANTS panic("%s: fw_msg type %d", __func__, cpl->type); #else log(LOG_ERR, "%s: fw_msg type %d\n", __func__, cpl->type); #endif return (EDOOFUS); } int t4_register_fw_msg_handler(struct adapter *sc, int type, fw_msg_handler_t h) { uintptr_t *loc, new; if (type >= nitems(sc->fw_msg_handler)) return (EINVAL); /* * These are dispatched by the handler for FW{4|6}_CPL_MSG using the CPL * handler dispatch table. Reject any attempt to install a handler for * this subtype. */ if (type == FW_TYPE_RSSCPL || type == FW6_TYPE_RSSCPL) return (EINVAL); new = h ? (uintptr_t)h : (uintptr_t)fw_msg_not_handled; loc = (uintptr_t *) &sc->fw_msg_handler[type]; atomic_store_rel_ptr(loc, new); return (0); } /* * Should match fw_caps_config_ enums in t4fw_interface.h */ static char *caps_decoder[] = { "\20\001IPMI\002NCSI", /* 0: NBM */ "\20\001PPP\002QFC\003DCBX", /* 1: link */ "\20\001INGRESS\002EGRESS", /* 2: switch */ "\20\001NIC\002VM\003IDS\004UM\005UM_ISGL" /* 3: NIC */ "\006HASHFILTER\007ETHOFLD", "\20\001TOE", /* 4: TOE */ "\20\001RDDP\002RDMAC", /* 5: RDMA */ "\20\001INITIATOR_PDU\002TARGET_PDU" /* 6: iSCSI */ "\003INITIATOR_CNXOFLD\004TARGET_CNXOFLD" "\005INITIATOR_SSNOFLD\006TARGET_SSNOFLD" "\007T10DIF" "\010INITIATOR_CMDOFLD\011TARGET_CMDOFLD", "\20\00KEYS", /* 7: TLS */ "\20\001INITIATOR\002TARGET\003CTRL_OFLD" /* 8: FCoE */ "\004PO_INITIATOR\005PO_TARGET", }; static void t4_sysctls(struct adapter *sc) { struct sysctl_ctx_list *ctx; struct sysctl_oid *oid; struct sysctl_oid_list *children, *c0; static char *doorbells = {"\20\1UDB\2WCWR\3UDBWC\4KDB"}; ctx = device_get_sysctl_ctx(sc->dev); /* * dev.t4nex.X. */ oid = device_get_sysctl_tree(sc->dev); c0 = children = SYSCTL_CHILDREN(oid); sc->sc_do_rxcopy = 1; SYSCTL_ADD_INT(ctx, children, OID_AUTO, "do_rx_copy", CTLFLAG_RW, &sc->sc_do_rxcopy, 1, "Do RX copy of small frames"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "nports", CTLFLAG_RD, NULL, sc->params.nports, "# of ports"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "hw_revision", CTLFLAG_RD, NULL, chip_rev(sc), "chip hardware revision"); SYSCTL_ADD_STRING(ctx, children, OID_AUTO, "tp_version", CTLFLAG_RD, sc->tp_version, 0, "TP microcode version"); if (sc->params.exprom_vers != 0) { SYSCTL_ADD_STRING(ctx, children, OID_AUTO, "exprom_version", CTLFLAG_RD, sc->exprom_version, 0, "expansion ROM version"); } SYSCTL_ADD_STRING(ctx, children, OID_AUTO, "firmware_version", CTLFLAG_RD, sc->fw_version, 0, "firmware version"); SYSCTL_ADD_STRING(ctx, children, OID_AUTO, "cf", CTLFLAG_RD, sc->cfg_file, 0, "configuration file"); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "cfcsum", CTLFLAG_RD, NULL, sc->cfcsum, "config file checksum"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "doorbells", CTLTYPE_STRING | CTLFLAG_RD, doorbells, sc->doorbells, sysctl_bitfield, "A", "available doorbells"); #define SYSCTL_CAP(name, n, text) \ SYSCTL_ADD_PROC(ctx, children, OID_AUTO, #name, \ CTLTYPE_STRING | CTLFLAG_RD, caps_decoder[n], sc->name, \ sysctl_bitfield, "A", "available " text "capabilities") SYSCTL_CAP(nbmcaps, 0, "NBM"); SYSCTL_CAP(linkcaps, 1, "link"); SYSCTL_CAP(switchcaps, 2, "switch"); SYSCTL_CAP(niccaps, 3, "NIC"); SYSCTL_CAP(toecaps, 4, "TCP offload"); SYSCTL_CAP(rdmacaps, 5, "RDMA"); SYSCTL_CAP(iscsicaps, 6, "iSCSI"); SYSCTL_CAP(tlscaps, 7, "TLS"); SYSCTL_CAP(fcoecaps, 8, "FCoE"); #undef SYSCTL_CAP SYSCTL_ADD_INT(ctx, children, OID_AUTO, "core_clock", CTLFLAG_RD, NULL, sc->params.vpd.cclk, "core clock frequency (in KHz)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "holdoff_timers", CTLTYPE_STRING | CTLFLAG_RD, sc->params.sge.timer_val, sizeof(sc->params.sge.timer_val), sysctl_int_array, "A", "interrupt holdoff timer values (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "holdoff_pkt_counts", CTLTYPE_STRING | CTLFLAG_RD, sc->params.sge.counter_val, sizeof(sc->params.sge.counter_val), sysctl_int_array, "A", "interrupt holdoff packet counter values"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "nfilters", CTLFLAG_RD, NULL, sc->tids.nftids, "number of filters"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "temperature", CTLTYPE_INT | CTLFLAG_RD, sc, 0, sysctl_temperature, "I", "chip temperature (in Celsius)"); t4_sge_sysctls(sc, ctx, children); sc->lro_timeout = 100; SYSCTL_ADD_INT(ctx, children, OID_AUTO, "lro_timeout", CTLFLAG_RW, &sc->lro_timeout, 0, "lro inactive-flush timeout (in us)"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "debug_flags", CTLFLAG_RW, &sc->debug_flags, 0, "flags to enable runtime debugging"); #ifdef SBUF_DRAIN /* * dev.t4nex.X.misc. Marked CTLFLAG_SKIP to avoid information overload. */ oid = SYSCTL_ADD_NODE(ctx, c0, OID_AUTO, "misc", CTLFLAG_RD | CTLFLAG_SKIP, NULL, "logs and miscellaneous information"); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cctrl", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_cctrl, "A", "congestion control"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_ibq_tp0", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_cim_ibq_obq, "A", "CIM IBQ 0 (TP0)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_ibq_tp1", CTLTYPE_STRING | CTLFLAG_RD, sc, 1, sysctl_cim_ibq_obq, "A", "CIM IBQ 1 (TP1)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_ibq_ulp", CTLTYPE_STRING | CTLFLAG_RD, sc, 2, sysctl_cim_ibq_obq, "A", "CIM IBQ 2 (ULP)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_ibq_sge0", CTLTYPE_STRING | CTLFLAG_RD, sc, 3, sysctl_cim_ibq_obq, "A", "CIM IBQ 3 (SGE0)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_ibq_sge1", CTLTYPE_STRING | CTLFLAG_RD, sc, 4, sysctl_cim_ibq_obq, "A", "CIM IBQ 4 (SGE1)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_ibq_ncsi", CTLTYPE_STRING | CTLFLAG_RD, sc, 5, sysctl_cim_ibq_obq, "A", "CIM IBQ 5 (NCSI)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_la", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, chip_id(sc) <= CHELSIO_T5 ? sysctl_cim_la : sysctl_cim_la_t6, "A", "CIM logic analyzer"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_ma_la", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_cim_ma_la, "A", "CIM MA logic analyzer"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_obq_ulp0", CTLTYPE_STRING | CTLFLAG_RD, sc, 0 + CIM_NUM_IBQ, sysctl_cim_ibq_obq, "A", "CIM OBQ 0 (ULP0)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_obq_ulp1", CTLTYPE_STRING | CTLFLAG_RD, sc, 1 + CIM_NUM_IBQ, sysctl_cim_ibq_obq, "A", "CIM OBQ 1 (ULP1)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_obq_ulp2", CTLTYPE_STRING | CTLFLAG_RD, sc, 2 + CIM_NUM_IBQ, sysctl_cim_ibq_obq, "A", "CIM OBQ 2 (ULP2)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_obq_ulp3", CTLTYPE_STRING | CTLFLAG_RD, sc, 3 + CIM_NUM_IBQ, sysctl_cim_ibq_obq, "A", "CIM OBQ 3 (ULP3)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_obq_sge", CTLTYPE_STRING | CTLFLAG_RD, sc, 4 + CIM_NUM_IBQ, sysctl_cim_ibq_obq, "A", "CIM OBQ 4 (SGE)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_obq_ncsi", CTLTYPE_STRING | CTLFLAG_RD, sc, 5 + CIM_NUM_IBQ, sysctl_cim_ibq_obq, "A", "CIM OBQ 5 (NCSI)"); if (chip_id(sc) > CHELSIO_T4) { SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_obq_sge0_rx", CTLTYPE_STRING | CTLFLAG_RD, sc, 6 + CIM_NUM_IBQ, sysctl_cim_ibq_obq, "A", "CIM OBQ 6 (SGE0-RX)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_obq_sge1_rx", CTLTYPE_STRING | CTLFLAG_RD, sc, 7 + CIM_NUM_IBQ, sysctl_cim_ibq_obq, "A", "CIM OBQ 7 (SGE1-RX)"); } SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_pif_la", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_cim_pif_la, "A", "CIM PIF logic analyzer"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cim_qcfg", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_cim_qcfg, "A", "CIM queue configuration"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "cpl_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_cpl_stats, "A", "CPL statistics"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "ddp_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_ddp_stats, "A", "non-TCP DDP statistics"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "devlog", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_devlog, "A", "firmware's device log"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "fcoe_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_fcoe_stats, "A", "FCoE statistics"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "hw_sched", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_hw_sched, "A", "hardware scheduler "); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "l2t", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_l2t, "A", "hardware L2 table"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "lb_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_lb_stats, "A", "loopback statistics"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "meminfo", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_meminfo, "A", "memory regions"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "mps_tcam", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, chip_id(sc) <= CHELSIO_T5 ? sysctl_mps_tcam : sysctl_mps_tcam_t6, "A", "MPS TCAM entries"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "path_mtus", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_path_mtus, "A", "path MTUs"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "pm_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_pm_stats, "A", "PM statistics"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "rdma_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_rdma_stats, "A", "RDMA statistics"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "tcp_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_tcp_stats, "A", "TCP statistics"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "tids", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_tids, "A", "TID information"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "tp_err_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_tp_err_stats, "A", "TP error statistics"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "tp_la_mask", CTLTYPE_INT | CTLFLAG_RW, sc, 0, sysctl_tp_la_mask, "I", "TP logic analyzer event capture mask"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "tp_la", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_tp_la, "A", "TP logic analyzer"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "tx_rate", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_tx_rate, "A", "Tx rate"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "ulprx_la", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_ulprx_la, "A", "ULPRX logic analyzer"); if (is_t5(sc)) { SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "wcwr_stats", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_wcwr_stats, "A", "write combined work requests"); } #endif #ifdef TCP_OFFLOAD if (is_offload(sc)) { /* * dev.t4nex.X.toe. */ oid = SYSCTL_ADD_NODE(ctx, c0, OID_AUTO, "toe", CTLFLAG_RD, NULL, "TOE parameters"); children = SYSCTL_CHILDREN(oid); sc->tt.sndbuf = 256 * 1024; SYSCTL_ADD_INT(ctx, children, OID_AUTO, "sndbuf", CTLFLAG_RW, &sc->tt.sndbuf, 0, "max hardware send buffer size"); sc->tt.ddp = 0; SYSCTL_ADD_INT(ctx, children, OID_AUTO, "ddp", CTLFLAG_RW, &sc->tt.ddp, 0, "DDP allowed"); sc->tt.rx_coalesce = 1; SYSCTL_ADD_INT(ctx, children, OID_AUTO, "rx_coalesce", CTLFLAG_RW, &sc->tt.rx_coalesce, 0, "receive coalescing"); sc->tt.tx_align = 1; SYSCTL_ADD_INT(ctx, children, OID_AUTO, "tx_align", CTLFLAG_RW, &sc->tt.tx_align, 0, "chop and align payload"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "timer_tick", CTLTYPE_STRING | CTLFLAG_RD, sc, 0, sysctl_tp_tick, "A", "TP timer tick (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "timestamp_tick", CTLTYPE_STRING | CTLFLAG_RD, sc, 1, sysctl_tp_tick, "A", "TCP timestamp tick (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "dack_tick", CTLTYPE_STRING | CTLFLAG_RD, sc, 2, sysctl_tp_tick, "A", "DACK tick (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "dack_timer", CTLTYPE_UINT | CTLFLAG_RD, sc, 0, sysctl_tp_dack_timer, "IU", "DACK timer (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "rexmt_min", CTLTYPE_ULONG | CTLFLAG_RD, sc, A_TP_RXT_MIN, sysctl_tp_timer, "LU", "Retransmit min (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "rexmt_max", CTLTYPE_ULONG | CTLFLAG_RD, sc, A_TP_RXT_MAX, sysctl_tp_timer, "LU", "Retransmit max (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "persist_min", CTLTYPE_ULONG | CTLFLAG_RD, sc, A_TP_PERS_MIN, sysctl_tp_timer, "LU", "Persist timer min (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "persist_max", CTLTYPE_ULONG | CTLFLAG_RD, sc, A_TP_PERS_MAX, sysctl_tp_timer, "LU", "Persist timer max (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "keepalive_idle", CTLTYPE_ULONG | CTLFLAG_RD, sc, A_TP_KEEP_IDLE, sysctl_tp_timer, "LU", "Keepidle idle timer (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "keepalive_intvl", CTLTYPE_ULONG | CTLFLAG_RD, sc, A_TP_KEEP_INTVL, sysctl_tp_timer, "LU", "Keepidle interval (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "initial_srtt", CTLTYPE_ULONG | CTLFLAG_RD, sc, A_TP_INIT_SRTT, sysctl_tp_timer, "LU", "Initial SRTT (us)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "finwait2_timer", CTLTYPE_ULONG | CTLFLAG_RD, sc, A_TP_FINWAIT2_TIMER, sysctl_tp_timer, "LU", "FINWAIT2 timer (us)"); } #endif } void vi_sysctls(struct vi_info *vi) { struct sysctl_ctx_list *ctx; struct sysctl_oid *oid; struct sysctl_oid_list *children; ctx = device_get_sysctl_ctx(vi->dev); /* * dev.[nv](cxgbe|cxl).X. */ oid = device_get_sysctl_tree(vi->dev); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "viid", CTLFLAG_RD, NULL, vi->viid, "VI identifer"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "nrxq", CTLFLAG_RD, &vi->nrxq, 0, "# of rx queues"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "ntxq", CTLFLAG_RD, &vi->ntxq, 0, "# of tx queues"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "first_rxq", CTLFLAG_RD, &vi->first_rxq, 0, "index of first rx queue"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "first_txq", CTLFLAG_RD, &vi->first_txq, 0, "index of first tx queue"); if (vi->flags & VI_NETMAP) return; SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "rsrv_noflowq", CTLTYPE_INT | CTLFLAG_RW, vi, 0, sysctl_noflowq, "IU", "Reserve queue 0 for non-flowid packets"); #ifdef TCP_OFFLOAD if (vi->nofldrxq != 0) { SYSCTL_ADD_INT(ctx, children, OID_AUTO, "nofldrxq", CTLFLAG_RD, &vi->nofldrxq, 0, "# of rx queues for offloaded TCP connections"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "nofldtxq", CTLFLAG_RD, &vi->nofldtxq, 0, "# of tx queues for offloaded TCP connections"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "first_ofld_rxq", CTLFLAG_RD, &vi->first_ofld_rxq, 0, "index of first TOE rx queue"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "first_ofld_txq", CTLFLAG_RD, &vi->first_ofld_txq, 0, "index of first TOE tx queue"); } #endif SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "holdoff_tmr_idx", CTLTYPE_INT | CTLFLAG_RW, vi, 0, sysctl_holdoff_tmr_idx, "I", "holdoff timer index"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "holdoff_pktc_idx", CTLTYPE_INT | CTLFLAG_RW, vi, 0, sysctl_holdoff_pktc_idx, "I", "holdoff packet counter index"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "qsize_rxq", CTLTYPE_INT | CTLFLAG_RW, vi, 0, sysctl_qsize_rxq, "I", "rx queue size"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "qsize_txq", CTLTYPE_INT | CTLFLAG_RW, vi, 0, sysctl_qsize_txq, "I", "tx queue size"); } static void cxgbe_sysctls(struct port_info *pi) { struct sysctl_ctx_list *ctx; struct sysctl_oid *oid; - struct sysctl_oid_list *children; + struct sysctl_oid_list *children, *children2; struct adapter *sc = pi->adapter; + int i; + char name[16]; ctx = device_get_sysctl_ctx(pi->dev); /* * dev.cxgbe.X. */ oid = device_get_sysctl_tree(pi->dev); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "linkdnrc", CTLTYPE_STRING | CTLFLAG_RD, pi, 0, sysctl_linkdnrc, "A", "reason why link is down"); if (pi->port_type == FW_PORT_TYPE_BT_XAUI) { SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "temperature", CTLTYPE_INT | CTLFLAG_RD, pi, 0, sysctl_btphy, "I", "PHY temperature (in Celsius)"); SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "fw_version", CTLTYPE_INT | CTLFLAG_RD, pi, 1, sysctl_btphy, "I", "PHY firmware version"); } SYSCTL_ADD_PROC(ctx, children, OID_AUTO, "pause_settings", CTLTYPE_STRING | CTLFLAG_RW, pi, PAUSE_TX, sysctl_pause_settings, "A", "PAUSE settings (bit 0 = rx_pause, bit 1 = tx_pause)"); SYSCTL_ADD_INT(ctx, children, OID_AUTO, "max_speed", CTLFLAG_RD, NULL, port_top_speed(pi), "max speed (in Gbps)"); /* + * dev.(cxgbe|cxl).X.tc. + */ + oid = SYSCTL_ADD_NODE(ctx, children, OID_AUTO, "tc", CTLFLAG_RD, NULL, + "Tx scheduler traffic classes"); + for (i = 0; i < sc->chip_params->nsched_cls; i++) { + struct tx_sched_class *tc = &pi->tc[i]; + + snprintf(name, sizeof(name), "%d", i); + children2 = SYSCTL_CHILDREN(SYSCTL_ADD_NODE(ctx, + SYSCTL_CHILDREN(oid), OID_AUTO, name, CTLFLAG_RD, NULL, + "traffic class")); + SYSCTL_ADD_UINT(ctx, children2, OID_AUTO, "flags", CTLFLAG_RD, + &tc->flags, 0, "flags"); + SYSCTL_ADD_UINT(ctx, children2, OID_AUTO, "refcount", + CTLFLAG_RD, &tc->refcount, 0, "references to this class"); +#ifdef SBUF_DRAIN + SYSCTL_ADD_PROC(ctx, children2, OID_AUTO, "params", + CTLTYPE_STRING | CTLFLAG_RD, sc, (pi->port_id << 16) | i, + sysctl_tc_params, "A", "traffic class parameters"); +#endif + } + + /* * dev.cxgbe.X.stats. */ oid = SYSCTL_ADD_NODE(ctx, children, OID_AUTO, "stats", CTLFLAG_RD, NULL, "port statistics"); children = SYSCTL_CHILDREN(oid); SYSCTL_ADD_UINT(ctx, children, OID_AUTO, "tx_parse_error", CTLFLAG_RD, &pi->tx_parse_error, 0, "# of tx packets with invalid length or # of segments"); #define SYSCTL_ADD_T4_REG64(pi, name, desc, reg) \ SYSCTL_ADD_OID(ctx, children, OID_AUTO, name, \ CTLTYPE_U64 | CTLFLAG_RD, sc, reg, \ sysctl_handle_t4_reg64, "QU", desc) SYSCTL_ADD_T4_REG64(pi, "tx_octets", "# of octets in good frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_BYTES_L)); SYSCTL_ADD_T4_REG64(pi, "tx_frames", "total # of good frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_FRAMES_L)); SYSCTL_ADD_T4_REG64(pi, "tx_bcast_frames", "# of broadcast frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_BCAST_L)); SYSCTL_ADD_T4_REG64(pi, "tx_mcast_frames", "# of multicast frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_MCAST_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ucast_frames", "# of unicast frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_UCAST_L)); SYSCTL_ADD_T4_REG64(pi, "tx_error_frames", "# of error frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_ERROR_L)); SYSCTL_ADD_T4_REG64(pi, "tx_frames_64", "# of tx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_64B_L)); SYSCTL_ADD_T4_REG64(pi, "tx_frames_65_127", "# of tx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_65B_127B_L)); SYSCTL_ADD_T4_REG64(pi, "tx_frames_128_255", "# of tx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_128B_255B_L)); SYSCTL_ADD_T4_REG64(pi, "tx_frames_256_511", "# of tx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_256B_511B_L)); SYSCTL_ADD_T4_REG64(pi, "tx_frames_512_1023", "# of tx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_512B_1023B_L)); SYSCTL_ADD_T4_REG64(pi, "tx_frames_1024_1518", "# of tx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_1024B_1518B_L)); SYSCTL_ADD_T4_REG64(pi, "tx_frames_1519_max", "# of tx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_1519B_MAX_L)); SYSCTL_ADD_T4_REG64(pi, "tx_drop", "# of dropped tx frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_DROP_L)); SYSCTL_ADD_T4_REG64(pi, "tx_pause", "# of pause frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PAUSE_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ppp0", "# of PPP prio 0 frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PPP0_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ppp1", "# of PPP prio 1 frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PPP1_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ppp2", "# of PPP prio 2 frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PPP2_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ppp3", "# of PPP prio 3 frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PPP3_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ppp4", "# of PPP prio 4 frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PPP4_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ppp5", "# of PPP prio 5 frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PPP5_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ppp6", "# of PPP prio 6 frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PPP6_L)); SYSCTL_ADD_T4_REG64(pi, "tx_ppp7", "# of PPP prio 7 frames transmitted", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_TX_PORT_PPP7_L)); SYSCTL_ADD_T4_REG64(pi, "rx_octets", "# of octets in good frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_BYTES_L)); SYSCTL_ADD_T4_REG64(pi, "rx_frames", "total # of good frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_FRAMES_L)); SYSCTL_ADD_T4_REG64(pi, "rx_bcast_frames", "# of broadcast frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_BCAST_L)); SYSCTL_ADD_T4_REG64(pi, "rx_mcast_frames", "# of multicast frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_MCAST_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ucast_frames", "# of unicast frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_UCAST_L)); SYSCTL_ADD_T4_REG64(pi, "rx_too_long", "# of frames exceeding MTU", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_MTU_ERROR_L)); SYSCTL_ADD_T4_REG64(pi, "rx_jabber", "# of jabber frames", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_MTU_CRC_ERROR_L)); SYSCTL_ADD_T4_REG64(pi, "rx_fcs_err", "# of frames received with bad FCS", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_CRC_ERROR_L)); SYSCTL_ADD_T4_REG64(pi, "rx_len_err", "# of frames received with length error", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_LEN_ERROR_L)); SYSCTL_ADD_T4_REG64(pi, "rx_symbol_err", "symbol errors", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_SYM_ERROR_L)); SYSCTL_ADD_T4_REG64(pi, "rx_runt", "# of short frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_LESS_64B_L)); SYSCTL_ADD_T4_REG64(pi, "rx_frames_64", "# of rx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_64B_L)); SYSCTL_ADD_T4_REG64(pi, "rx_frames_65_127", "# of rx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_65B_127B_L)); SYSCTL_ADD_T4_REG64(pi, "rx_frames_128_255", "# of rx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_128B_255B_L)); SYSCTL_ADD_T4_REG64(pi, "rx_frames_256_511", "# of rx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_256B_511B_L)); SYSCTL_ADD_T4_REG64(pi, "rx_frames_512_1023", "# of rx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_512B_1023B_L)); SYSCTL_ADD_T4_REG64(pi, "rx_frames_1024_1518", "# of rx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_1024B_1518B_L)); SYSCTL_ADD_T4_REG64(pi, "rx_frames_1519_max", "# of rx frames in this range", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_1519B_MAX_L)); SYSCTL_ADD_T4_REG64(pi, "rx_pause", "# of pause frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PAUSE_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ppp0", "# of PPP prio 0 frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PPP0_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ppp1", "# of PPP prio 1 frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PPP1_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ppp2", "# of PPP prio 2 frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PPP2_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ppp3", "# of PPP prio 3 frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PPP3_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ppp4", "# of PPP prio 4 frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PPP4_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ppp5", "# of PPP prio 5 frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PPP5_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ppp6", "# of PPP prio 6 frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PPP6_L)); SYSCTL_ADD_T4_REG64(pi, "rx_ppp7", "# of PPP prio 7 frames received", PORT_REG(pi->tx_chan, A_MPS_PORT_STAT_RX_PORT_PPP7_L)); #undef SYSCTL_ADD_T4_REG64 #define SYSCTL_ADD_T4_PORTSTAT(name, desc) \ SYSCTL_ADD_UQUAD(ctx, children, OID_AUTO, #name, CTLFLAG_RD, \ &pi->stats.name, desc) /* We get these from port_stats and they may be stale by up to 1s */ SYSCTL_ADD_T4_PORTSTAT(rx_ovflow0, "# drops due to buffer-group 0 overflows"); SYSCTL_ADD_T4_PORTSTAT(rx_ovflow1, "# drops due to buffer-group 1 overflows"); SYSCTL_ADD_T4_PORTSTAT(rx_ovflow2, "# drops due to buffer-group 2 overflows"); SYSCTL_ADD_T4_PORTSTAT(rx_ovflow3, "# drops due to buffer-group 3 overflows"); SYSCTL_ADD_T4_PORTSTAT(rx_trunc0, "# of buffer-group 0 truncated packets"); SYSCTL_ADD_T4_PORTSTAT(rx_trunc1, "# of buffer-group 1 truncated packets"); SYSCTL_ADD_T4_PORTSTAT(rx_trunc2, "# of buffer-group 2 truncated packets"); SYSCTL_ADD_T4_PORTSTAT(rx_trunc3, "# of buffer-group 3 truncated packets"); #undef SYSCTL_ADD_T4_PORTSTAT } static int sysctl_int_array(SYSCTL_HANDLER_ARGS) { int rc, *i, space = 0; struct sbuf sb; sbuf_new_for_sysctl(&sb, NULL, 64, req); for (i = arg1; arg2; arg2 -= sizeof(int), i++) { if (space) sbuf_printf(&sb, " "); sbuf_printf(&sb, "%d", *i); space = 1; } rc = sbuf_finish(&sb); sbuf_delete(&sb); return (rc); } static int sysctl_bitfield(SYSCTL_HANDLER_ARGS) { int rc; struct sbuf *sb; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return(rc); sb = sbuf_new_for_sysctl(NULL, NULL, 128, req); if (sb == NULL) return (ENOMEM); sbuf_printf(sb, "%b", (int)arg2, (char *)arg1); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_btphy(SYSCTL_HANDLER_ARGS) { struct port_info *pi = arg1; int op = arg2; struct adapter *sc = pi->adapter; u_int v; int rc; rc = begin_synchronized_op(sc, &pi->vi[0], SLEEP_OK | INTR_OK, "t4btt"); if (rc) return (rc); /* XXX: magic numbers */ rc = -t4_mdio_rd(sc, sc->mbox, pi->mdio_addr, 0x1e, op ? 0x20 : 0xc820, &v); end_synchronized_op(sc, 0); if (rc) return (rc); if (op == 0) v /= 256; rc = sysctl_handle_int(oidp, &v, 0, req); return (rc); } static int sysctl_noflowq(SYSCTL_HANDLER_ARGS) { struct vi_info *vi = arg1; int rc, val; val = vi->rsrv_noflowq; rc = sysctl_handle_int(oidp, &val, 0, req); if (rc != 0 || req->newptr == NULL) return (rc); if ((val >= 1) && (vi->ntxq > 1)) vi->rsrv_noflowq = 1; else vi->rsrv_noflowq = 0; return (rc); } static int sysctl_holdoff_tmr_idx(SYSCTL_HANDLER_ARGS) { struct vi_info *vi = arg1; struct adapter *sc = vi->pi->adapter; int idx, rc, i; struct sge_rxq *rxq; #ifdef TCP_OFFLOAD struct sge_ofld_rxq *ofld_rxq; #endif uint8_t v; idx = vi->tmr_idx; rc = sysctl_handle_int(oidp, &idx, 0, req); if (rc != 0 || req->newptr == NULL) return (rc); if (idx < 0 || idx >= SGE_NTIMERS) return (EINVAL); rc = begin_synchronized_op(sc, vi, HOLD_LOCK | SLEEP_OK | INTR_OK, "t4tmr"); if (rc) return (rc); v = V_QINTR_TIMER_IDX(idx) | V_QINTR_CNT_EN(vi->pktc_idx != -1); for_each_rxq(vi, i, rxq) { #ifdef atomic_store_rel_8 atomic_store_rel_8(&rxq->iq.intr_params, v); #else rxq->iq.intr_params = v; #endif } #ifdef TCP_OFFLOAD for_each_ofld_rxq(vi, i, ofld_rxq) { #ifdef atomic_store_rel_8 atomic_store_rel_8(&ofld_rxq->iq.intr_params, v); #else ofld_rxq->iq.intr_params = v; #endif } #endif vi->tmr_idx = idx; end_synchronized_op(sc, LOCK_HELD); return (0); } static int sysctl_holdoff_pktc_idx(SYSCTL_HANDLER_ARGS) { struct vi_info *vi = arg1; struct adapter *sc = vi->pi->adapter; int idx, rc; idx = vi->pktc_idx; rc = sysctl_handle_int(oidp, &idx, 0, req); if (rc != 0 || req->newptr == NULL) return (rc); if (idx < -1 || idx >= SGE_NCOUNTERS) return (EINVAL); rc = begin_synchronized_op(sc, vi, HOLD_LOCK | SLEEP_OK | INTR_OK, "t4pktc"); if (rc) return (rc); if (vi->flags & VI_INIT_DONE) rc = EBUSY; /* cannot be changed once the queues are created */ else vi->pktc_idx = idx; end_synchronized_op(sc, LOCK_HELD); return (rc); } static int sysctl_qsize_rxq(SYSCTL_HANDLER_ARGS) { struct vi_info *vi = arg1; struct adapter *sc = vi->pi->adapter; int qsize, rc; qsize = vi->qsize_rxq; rc = sysctl_handle_int(oidp, &qsize, 0, req); if (rc != 0 || req->newptr == NULL) return (rc); if (qsize < 128 || (qsize & 7)) return (EINVAL); rc = begin_synchronized_op(sc, vi, HOLD_LOCK | SLEEP_OK | INTR_OK, "t4rxqs"); if (rc) return (rc); if (vi->flags & VI_INIT_DONE) rc = EBUSY; /* cannot be changed once the queues are created */ else vi->qsize_rxq = qsize; end_synchronized_op(sc, LOCK_HELD); return (rc); } static int sysctl_qsize_txq(SYSCTL_HANDLER_ARGS) { struct vi_info *vi = arg1; struct adapter *sc = vi->pi->adapter; int qsize, rc; qsize = vi->qsize_txq; rc = sysctl_handle_int(oidp, &qsize, 0, req); if (rc != 0 || req->newptr == NULL) return (rc); if (qsize < 128 || qsize > 65536) return (EINVAL); rc = begin_synchronized_op(sc, vi, HOLD_LOCK | SLEEP_OK | INTR_OK, "t4txqs"); if (rc) return (rc); if (vi->flags & VI_INIT_DONE) rc = EBUSY; /* cannot be changed once the queues are created */ else vi->qsize_txq = qsize; end_synchronized_op(sc, LOCK_HELD); return (rc); } static int sysctl_pause_settings(SYSCTL_HANDLER_ARGS) { struct port_info *pi = arg1; struct adapter *sc = pi->adapter; struct link_config *lc = &pi->link_cfg; int rc; if (req->newptr == NULL) { struct sbuf *sb; static char *bits = "\20\1PAUSE_RX\2PAUSE_TX"; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return(rc); sb = sbuf_new_for_sysctl(NULL, NULL, 128, req); if (sb == NULL) return (ENOMEM); sbuf_printf(sb, "%b", lc->fc & (PAUSE_TX | PAUSE_RX), bits); rc = sbuf_finish(sb); sbuf_delete(sb); } else { char s[2]; int n; s[0] = '0' + (lc->requested_fc & (PAUSE_TX | PAUSE_RX)); s[1] = 0; rc = sysctl_handle_string(oidp, s, sizeof(s), req); if (rc != 0) return(rc); if (s[1] != 0) return (EINVAL); if (s[0] < '0' || s[0] > '9') return (EINVAL); /* not a number */ n = s[0] - '0'; if (n & ~(PAUSE_TX | PAUSE_RX)) return (EINVAL); /* some other bit is set too */ rc = begin_synchronized_op(sc, &pi->vi[0], SLEEP_OK | INTR_OK, "t4PAUSE"); if (rc) return (rc); if ((lc->requested_fc & (PAUSE_TX | PAUSE_RX)) != n) { int link_ok = lc->link_ok; lc->requested_fc &= ~(PAUSE_TX | PAUSE_RX); lc->requested_fc |= n; rc = -t4_link_l1cfg(sc, sc->mbox, pi->tx_chan, lc); lc->link_ok = link_ok; /* restore */ } end_synchronized_op(sc, 0); } return (rc); } static int sysctl_handle_t4_reg64(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; int reg = arg2; uint64_t val; val = t4_read_reg64(sc, reg); return (sysctl_handle_64(oidp, &val, 0, req)); } static int sysctl_temperature(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; int rc, t; uint32_t param, val; rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4temp"); if (rc) return (rc); param = V_FW_PARAMS_MNEM(FW_PARAMS_MNEM_DEV) | V_FW_PARAMS_PARAM_X(FW_PARAMS_PARAM_DEV_DIAG) | V_FW_PARAMS_PARAM_Y(FW_PARAM_DEV_DIAG_TMP); rc = -t4_query_params(sc, sc->mbox, sc->pf, 0, 1, ¶m, &val); end_synchronized_op(sc, 0); if (rc) return (rc); /* unknown is returned as 0 but we display -1 in that case */ t = val == 0 ? -1 : val; rc = sysctl_handle_int(oidp, &t, 0, req); return (rc); } #ifdef SBUF_DRAIN static int sysctl_cctrl(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i; uint16_t incr[NMTUS][NCCTRL_WIN]; static const char *dec_fac[] = { "0.5", "0.5625", "0.625", "0.6875", "0.75", "0.8125", "0.875", "0.9375" }; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); t4_read_cong_tbl(sc, incr); for (i = 0; i < NCCTRL_WIN; ++i) { sbuf_printf(sb, "%2d: %4u %4u %4u %4u %4u %4u %4u %4u\n", i, incr[0][i], incr[1][i], incr[2][i], incr[3][i], incr[4][i], incr[5][i], incr[6][i], incr[7][i]); sbuf_printf(sb, "%8u %4u %4u %4u %4u %4u %4u %4u %5u %s\n", incr[8][i], incr[9][i], incr[10][i], incr[11][i], incr[12][i], incr[13][i], incr[14][i], incr[15][i], sc->params.a_wnd[i], dec_fac[sc->params.b_wnd[i]]); } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static const char *qname[CIM_NUM_IBQ + CIM_NUM_OBQ_T5] = { "TP0", "TP1", "ULP", "SGE0", "SGE1", "NC-SI", /* ibq's */ "ULP0", "ULP1", "ULP2", "ULP3", "SGE", "NC-SI", /* obq's */ "SGE0-RX", "SGE1-RX" /* additional obq's (T5 onwards) */ }; static int sysctl_cim_ibq_obq(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i, n, qid = arg2; uint32_t *buf, *p; char *qtype; u_int cim_num_obq = sc->chip_params->cim_num_obq; KASSERT(qid >= 0 && qid < CIM_NUM_IBQ + cim_num_obq, ("%s: bad qid %d\n", __func__, qid)); if (qid < CIM_NUM_IBQ) { /* inbound queue */ qtype = "IBQ"; n = 4 * CIM_IBQ_SIZE; buf = malloc(n * sizeof(uint32_t), M_CXGBE, M_ZERO | M_WAITOK); rc = t4_read_cim_ibq(sc, qid, buf, n); } else { /* outbound queue */ qtype = "OBQ"; qid -= CIM_NUM_IBQ; n = 4 * cim_num_obq * CIM_OBQ_SIZE; buf = malloc(n * sizeof(uint32_t), M_CXGBE, M_ZERO | M_WAITOK); rc = t4_read_cim_obq(sc, qid, buf, n); } if (rc < 0) { rc = -rc; goto done; } n = rc * sizeof(uint32_t); /* rc has # of words actually read */ rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) goto done; sb = sbuf_new_for_sysctl(NULL, NULL, PAGE_SIZE, req); if (sb == NULL) { rc = ENOMEM; goto done; } sbuf_printf(sb, "%s%d %s", qtype , qid, qname[arg2]); for (i = 0, p = buf; i < n; i += 16, p += 4) sbuf_printf(sb, "\n%#06x: %08x %08x %08x %08x", i, p[0], p[1], p[2], p[3]); rc = sbuf_finish(sb); sbuf_delete(sb); done: free(buf, M_CXGBE); return (rc); } static int sysctl_cim_la(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; u_int cfg; struct sbuf *sb; uint32_t *buf, *p; int rc; MPASS(chip_id(sc) <= CHELSIO_T5); rc = -t4_cim_read(sc, A_UP_UP_DBG_LA_CFG, 1, &cfg); if (rc != 0) return (rc); rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); buf = malloc(sc->params.cim_la_size * sizeof(uint32_t), M_CXGBE, M_ZERO | M_WAITOK); rc = -t4_cim_read_la(sc, buf, NULL); if (rc != 0) goto done; sbuf_printf(sb, "Status Data PC%s", cfg & F_UPDBGLACAPTPCONLY ? "" : " LS0Stat LS0Addr LS0Data"); for (p = buf; p <= &buf[sc->params.cim_la_size - 8]; p += 8) { if (cfg & F_UPDBGLACAPTPCONLY) { sbuf_printf(sb, "\n %02x %08x %08x", p[5] & 0xff, p[6], p[7]); sbuf_printf(sb, "\n %02x %02x%06x %02x%06x", (p[3] >> 8) & 0xff, p[3] & 0xff, p[4] >> 8, p[4] & 0xff, p[5] >> 8); sbuf_printf(sb, "\n %02x %x%07x %x%07x", (p[0] >> 4) & 0xff, p[0] & 0xf, p[1] >> 4, p[1] & 0xf, p[2] >> 4); } else { sbuf_printf(sb, "\n %02x %x%07x %x%07x %08x %08x " "%08x%08x%08x%08x", (p[0] >> 4) & 0xff, p[0] & 0xf, p[1] >> 4, p[1] & 0xf, p[2] >> 4, p[2] & 0xf, p[3], p[4], p[5], p[6], p[7]); } } rc = sbuf_finish(sb); sbuf_delete(sb); done: free(buf, M_CXGBE); return (rc); } static int sysctl_cim_la_t6(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; u_int cfg; struct sbuf *sb; uint32_t *buf, *p; int rc; MPASS(chip_id(sc) > CHELSIO_T5); rc = -t4_cim_read(sc, A_UP_UP_DBG_LA_CFG, 1, &cfg); if (rc != 0) return (rc); rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); buf = malloc(sc->params.cim_la_size * sizeof(uint32_t), M_CXGBE, M_ZERO | M_WAITOK); rc = -t4_cim_read_la(sc, buf, NULL); if (rc != 0) goto done; sbuf_printf(sb, "Status Inst Data PC%s", cfg & F_UPDBGLACAPTPCONLY ? "" : " LS0Stat LS0Addr LS0Data LS1Stat LS1Addr LS1Data"); for (p = buf; p <= &buf[sc->params.cim_la_size - 10]; p += 10) { if (cfg & F_UPDBGLACAPTPCONLY) { sbuf_printf(sb, "\n %02x %08x %08x %08x", p[3] & 0xff, p[2], p[1], p[0]); sbuf_printf(sb, "\n %02x %02x%06x %02x%06x %02x%06x", (p[6] >> 8) & 0xff, p[6] & 0xff, p[5] >> 8, p[5] & 0xff, p[4] >> 8, p[4] & 0xff, p[3] >> 8); sbuf_printf(sb, "\n %02x %04x%04x %04x%04x %04x%04x", (p[9] >> 16) & 0xff, p[9] & 0xffff, p[8] >> 16, p[8] & 0xffff, p[7] >> 16, p[7] & 0xffff, p[6] >> 16); } else { sbuf_printf(sb, "\n %02x %04x%04x %04x%04x %04x%04x " "%08x %08x %08x %08x %08x %08x", (p[9] >> 16) & 0xff, p[9] & 0xffff, p[8] >> 16, p[8] & 0xffff, p[7] >> 16, p[7] & 0xffff, p[6] >> 16, p[2], p[1], p[0], p[5], p[4], p[3]); } } rc = sbuf_finish(sb); sbuf_delete(sb); done: free(buf, M_CXGBE); return (rc); } static int sysctl_cim_ma_la(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; u_int i; struct sbuf *sb; uint32_t *buf, *p; int rc; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); buf = malloc(2 * CIM_MALA_SIZE * 5 * sizeof(uint32_t), M_CXGBE, M_ZERO | M_WAITOK); t4_cim_read_ma_la(sc, buf, buf + 5 * CIM_MALA_SIZE); p = buf; for (i = 0; i < CIM_MALA_SIZE; i++, p += 5) { sbuf_printf(sb, "\n%02x%08x%08x%08x%08x", p[4], p[3], p[2], p[1], p[0]); } sbuf_printf(sb, "\n\nCnt ID Tag UE Data RDY VLD"); for (i = 0; i < CIM_MALA_SIZE; i++, p += 5) { sbuf_printf(sb, "\n%3u %2u %x %u %08x%08x %u %u", (p[2] >> 10) & 0xff, (p[2] >> 7) & 7, (p[2] >> 3) & 0xf, (p[2] >> 2) & 1, (p[1] >> 2) | ((p[2] & 3) << 30), (p[0] >> 2) | ((p[1] & 3) << 30), (p[0] >> 1) & 1, p[0] & 1); } rc = sbuf_finish(sb); sbuf_delete(sb); free(buf, M_CXGBE); return (rc); } static int sysctl_cim_pif_la(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; u_int i; struct sbuf *sb; uint32_t *buf, *p; int rc; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); buf = malloc(2 * CIM_PIFLA_SIZE * 6 * sizeof(uint32_t), M_CXGBE, M_ZERO | M_WAITOK); t4_cim_read_pif_la(sc, buf, buf + 6 * CIM_PIFLA_SIZE, NULL, NULL); p = buf; sbuf_printf(sb, "Cntl ID DataBE Addr Data"); for (i = 0; i < CIM_PIFLA_SIZE; i++, p += 6) { sbuf_printf(sb, "\n %02x %02x %04x %08x %08x%08x%08x%08x", (p[5] >> 22) & 0xff, (p[5] >> 16) & 0x3f, p[5] & 0xffff, p[4], p[3], p[2], p[1], p[0]); } sbuf_printf(sb, "\n\nCntl ID Data"); for (i = 0; i < CIM_PIFLA_SIZE; i++, p += 6) { sbuf_printf(sb, "\n %02x %02x %08x%08x%08x%08x", (p[4] >> 6) & 0xff, p[4] & 0x3f, p[3], p[2], p[1], p[0]); } rc = sbuf_finish(sb); sbuf_delete(sb); free(buf, M_CXGBE); return (rc); } static int sysctl_cim_qcfg(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i; uint16_t base[CIM_NUM_IBQ + CIM_NUM_OBQ_T5]; uint16_t size[CIM_NUM_IBQ + CIM_NUM_OBQ_T5]; uint16_t thres[CIM_NUM_IBQ]; uint32_t obq_wr[2 * CIM_NUM_OBQ_T5], *wr = obq_wr; uint32_t stat[4 * (CIM_NUM_IBQ + CIM_NUM_OBQ_T5)], *p = stat; u_int cim_num_obq, ibq_rdaddr, obq_rdaddr, nq; cim_num_obq = sc->chip_params->cim_num_obq; if (is_t4(sc)) { ibq_rdaddr = A_UP_IBQ_0_RDADDR; obq_rdaddr = A_UP_OBQ_0_REALADDR; } else { ibq_rdaddr = A_UP_IBQ_0_SHADOW_RDADDR; obq_rdaddr = A_UP_OBQ_0_SHADOW_REALADDR; } nq = CIM_NUM_IBQ + cim_num_obq; rc = -t4_cim_read(sc, ibq_rdaddr, 4 * nq, stat); if (rc == 0) rc = -t4_cim_read(sc, obq_rdaddr, 2 * cim_num_obq, obq_wr); if (rc != 0) return (rc); t4_read_cimq_cfg(sc, base, size, thres); rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, PAGE_SIZE, req); if (sb == NULL) return (ENOMEM); sbuf_printf(sb, "Queue Base Size Thres RdPtr WrPtr SOP EOP Avail"); for (i = 0; i < CIM_NUM_IBQ; i++, p += 4) sbuf_printf(sb, "\n%7s %5x %5u %5u %6x %4x %4u %4u %5u", qname[i], base[i], size[i], thres[i], G_IBQRDADDR(p[0]), G_IBQWRADDR(p[1]), G_QUESOPCNT(p[3]), G_QUEEOPCNT(p[3]), G_QUEREMFLITS(p[2]) * 16); for ( ; i < nq; i++, p += 4, wr += 2) sbuf_printf(sb, "\n%7s %5x %5u %12x %4x %4u %4u %5u", qname[i], base[i], size[i], G_QUERDADDR(p[0]) & 0x3fff, wr[0] - base[i], G_QUESOPCNT(p[3]), G_QUEEOPCNT(p[3]), G_QUEREMFLITS(p[2]) * 16); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_cpl_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; struct tp_cpl_stats stats; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); mtx_lock(&sc->reg_lock); t4_tp_get_cpl_stats(sc, &stats); mtx_unlock(&sc->reg_lock); if (sc->chip_params->nchan > 2) { sbuf_printf(sb, " channel 0 channel 1" " channel 2 channel 3"); sbuf_printf(sb, "\nCPL requests: %10u %10u %10u %10u", stats.req[0], stats.req[1], stats.req[2], stats.req[3]); sbuf_printf(sb, "\nCPL responses: %10u %10u %10u %10u", stats.rsp[0], stats.rsp[1], stats.rsp[2], stats.rsp[3]); } else { sbuf_printf(sb, " channel 0 channel 1"); sbuf_printf(sb, "\nCPL requests: %10u %10u", stats.req[0], stats.req[1]); sbuf_printf(sb, "\nCPL responses: %10u %10u", stats.rsp[0], stats.rsp[1]); } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_ddp_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; struct tp_usm_stats stats; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return(rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); t4_get_usm_stats(sc, &stats); sbuf_printf(sb, "Frames: %u\n", stats.frames); sbuf_printf(sb, "Octets: %ju\n", stats.octets); sbuf_printf(sb, "Drops: %u", stats.drops); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static const char * const devlog_level_strings[] = { [FW_DEVLOG_LEVEL_EMERG] = "EMERG", [FW_DEVLOG_LEVEL_CRIT] = "CRIT", [FW_DEVLOG_LEVEL_ERR] = "ERR", [FW_DEVLOG_LEVEL_NOTICE] = "NOTICE", [FW_DEVLOG_LEVEL_INFO] = "INFO", [FW_DEVLOG_LEVEL_DEBUG] = "DEBUG" }; static const char * const devlog_facility_strings[] = { [FW_DEVLOG_FACILITY_CORE] = "CORE", [FW_DEVLOG_FACILITY_CF] = "CF", [FW_DEVLOG_FACILITY_SCHED] = "SCHED", [FW_DEVLOG_FACILITY_TIMER] = "TIMER", [FW_DEVLOG_FACILITY_RES] = "RES", [FW_DEVLOG_FACILITY_HW] = "HW", [FW_DEVLOG_FACILITY_FLR] = "FLR", [FW_DEVLOG_FACILITY_DMAQ] = "DMAQ", [FW_DEVLOG_FACILITY_PHY] = "PHY", [FW_DEVLOG_FACILITY_MAC] = "MAC", [FW_DEVLOG_FACILITY_PORT] = "PORT", [FW_DEVLOG_FACILITY_VI] = "VI", [FW_DEVLOG_FACILITY_FILTER] = "FILTER", [FW_DEVLOG_FACILITY_ACL] = "ACL", [FW_DEVLOG_FACILITY_TM] = "TM", [FW_DEVLOG_FACILITY_QFC] = "QFC", [FW_DEVLOG_FACILITY_DCB] = "DCB", [FW_DEVLOG_FACILITY_ETH] = "ETH", [FW_DEVLOG_FACILITY_OFLD] = "OFLD", [FW_DEVLOG_FACILITY_RI] = "RI", [FW_DEVLOG_FACILITY_ISCSI] = "ISCSI", [FW_DEVLOG_FACILITY_FCOE] = "FCOE", [FW_DEVLOG_FACILITY_FOISCSI] = "FOISCSI", [FW_DEVLOG_FACILITY_FOFCOE] = "FOFCOE", [FW_DEVLOG_FACILITY_CHNET] = "CHNET", }; static int sysctl_devlog(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct devlog_params *dparams = &sc->params.devlog; struct fw_devlog_e *buf, *e; int i, j, rc, nentries, first = 0; struct sbuf *sb; uint64_t ftstamp = UINT64_MAX; if (dparams->addr == 0) return (ENXIO); buf = malloc(dparams->size, M_CXGBE, M_NOWAIT); if (buf == NULL) return (ENOMEM); rc = read_via_memwin(sc, 1, dparams->addr, (void *)buf, dparams->size); if (rc != 0) goto done; nentries = dparams->size / sizeof(struct fw_devlog_e); for (i = 0; i < nentries; i++) { e = &buf[i]; if (e->timestamp == 0) break; /* end */ e->timestamp = be64toh(e->timestamp); e->seqno = be32toh(e->seqno); for (j = 0; j < 8; j++) e->params[j] = be32toh(e->params[j]); if (e->timestamp < ftstamp) { ftstamp = e->timestamp; first = i; } } if (buf[first].timestamp == 0) goto done; /* nothing in the log */ rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) goto done; sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) { rc = ENOMEM; goto done; } sbuf_printf(sb, "%10s %15s %8s %8s %s\n", "Seq#", "Tstamp", "Level", "Facility", "Message"); i = first; do { e = &buf[i]; if (e->timestamp == 0) break; /* end */ sbuf_printf(sb, "%10d %15ju %8s %8s ", e->seqno, e->timestamp, (e->level < nitems(devlog_level_strings) ? devlog_level_strings[e->level] : "UNKNOWN"), (e->facility < nitems(devlog_facility_strings) ? devlog_facility_strings[e->facility] : "UNKNOWN")); sbuf_printf(sb, e->fmt, e->params[0], e->params[1], e->params[2], e->params[3], e->params[4], e->params[5], e->params[6], e->params[7]); if (++i == nentries) i = 0; } while (i != first); rc = sbuf_finish(sb); sbuf_delete(sb); done: free(buf, M_CXGBE); return (rc); } static int sysctl_fcoe_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; struct tp_fcoe_stats stats[MAX_NCHAN]; int i, nchan = sc->chip_params->nchan; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); for (i = 0; i < nchan; i++) t4_get_fcoe_stats(sc, i, &stats[i]); if (nchan > 2) { sbuf_printf(sb, " channel 0 channel 1" " channel 2 channel 3"); sbuf_printf(sb, "\noctetsDDP: %16ju %16ju %16ju %16ju", stats[0].octets_ddp, stats[1].octets_ddp, stats[2].octets_ddp, stats[3].octets_ddp); sbuf_printf(sb, "\nframesDDP: %16u %16u %16u %16u", stats[0].frames_ddp, stats[1].frames_ddp, stats[2].frames_ddp, stats[3].frames_ddp); sbuf_printf(sb, "\nframesDrop: %16u %16u %16u %16u", stats[0].frames_drop, stats[1].frames_drop, stats[2].frames_drop, stats[3].frames_drop); } else { sbuf_printf(sb, " channel 0 channel 1"); sbuf_printf(sb, "\noctetsDDP: %16ju %16ju", stats[0].octets_ddp, stats[1].octets_ddp); sbuf_printf(sb, "\nframesDDP: %16u %16u", stats[0].frames_ddp, stats[1].frames_ddp); sbuf_printf(sb, "\nframesDrop: %16u %16u", stats[0].frames_drop, stats[1].frames_drop); } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_hw_sched(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i; unsigned int map, kbps, ipg, mode; unsigned int pace_tab[NTX_SCHED]; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); map = t4_read_reg(sc, A_TP_TX_MOD_QUEUE_REQ_MAP); mode = G_TIMERMODE(t4_read_reg(sc, A_TP_MOD_CONFIG)); t4_read_pace_tbl(sc, pace_tab); sbuf_printf(sb, "Scheduler Mode Channel Rate (Kbps) " "Class IPG (0.1 ns) Flow IPG (us)"); for (i = 0; i < NTX_SCHED; ++i, map >>= 2) { t4_get_tx_sched(sc, i, &kbps, &ipg); sbuf_printf(sb, "\n %u %-5s %u ", i, (mode & (1 << i)) ? "flow" : "class", map & 3); if (kbps) sbuf_printf(sb, "%9u ", kbps); else sbuf_printf(sb, " disabled "); if (ipg) sbuf_printf(sb, "%13u ", ipg); else sbuf_printf(sb, " disabled "); if (pace_tab[i]) sbuf_printf(sb, "%10u", pace_tab[i]); else sbuf_printf(sb, " disabled"); } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_lb_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i, j; uint64_t *p0, *p1; struct lb_port_stats s[2]; static const char *stat_name[] = { "OctetsOK:", "FramesOK:", "BcastFrames:", "McastFrames:", "UcastFrames:", "ErrorFrames:", "Frames64:", "Frames65To127:", "Frames128To255:", "Frames256To511:", "Frames512To1023:", "Frames1024To1518:", "Frames1519ToMax:", "FramesDropped:", "BG0FramesDropped:", "BG1FramesDropped:", "BG2FramesDropped:", "BG3FramesDropped:", "BG0FramesTrunc:", "BG1FramesTrunc:", "BG2FramesTrunc:", "BG3FramesTrunc:" }; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); memset(s, 0, sizeof(s)); for (i = 0; i < sc->chip_params->nchan; i += 2) { t4_get_lb_stats(sc, i, &s[0]); t4_get_lb_stats(sc, i + 1, &s[1]); p0 = &s[0].octets; p1 = &s[1].octets; sbuf_printf(sb, "%s Loopback %u" " Loopback %u", i == 0 ? "" : "\n", i, i + 1); for (j = 0; j < nitems(stat_name); j++) sbuf_printf(sb, "\n%-17s %20ju %20ju", stat_name[j], *p0++, *p1++); } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_linkdnrc(SYSCTL_HANDLER_ARGS) { int rc = 0; struct port_info *pi = arg1; struct sbuf *sb; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return(rc); sb = sbuf_new_for_sysctl(NULL, NULL, 64, req); if (sb == NULL) return (ENOMEM); if (pi->linkdnrc < 0) sbuf_printf(sb, "n/a"); else sbuf_printf(sb, "%s", t4_link_down_rc_str(pi->linkdnrc)); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } struct mem_desc { unsigned int base; unsigned int limit; unsigned int idx; }; static int mem_desc_cmp(const void *a, const void *b) { return ((const struct mem_desc *)a)->base - ((const struct mem_desc *)b)->base; } static void mem_region_show(struct sbuf *sb, const char *name, unsigned int from, unsigned int to) { unsigned int size; if (from == to) return; size = to - from + 1; if (size == 0) return; /* XXX: need humanize_number(3) in libkern for a more readable 'size' */ sbuf_printf(sb, "%-15s %#x-%#x [%u]\n", name, from, to, size); } static int sysctl_meminfo(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i, n; uint32_t lo, hi, used, alloc; static const char *memory[] = {"EDC0:", "EDC1:", "MC:", "MC0:", "MC1:"}; static const char *region[] = { "DBQ contexts:", "IMSG contexts:", "FLM cache:", "TCBs:", "Pstructs:", "Timers:", "Rx FL:", "Tx FL:", "Pstruct FL:", "Tx payload:", "Rx payload:", "LE hash:", "iSCSI region:", "TDDP region:", "TPT region:", "STAG region:", "RQ region:", "RQUDP region:", "PBL region:", "TXPBL region:", "DBVFIFO region:", "ULPRX state:", "ULPTX state:", "On-chip queues:" }; struct mem_desc avail[4]; struct mem_desc mem[nitems(region) + 3]; /* up to 3 holes */ struct mem_desc *md = mem; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); for (i = 0; i < nitems(mem); i++) { mem[i].limit = 0; mem[i].idx = i; } /* Find and sort the populated memory ranges */ i = 0; lo = t4_read_reg(sc, A_MA_TARGET_MEM_ENABLE); if (lo & F_EDRAM0_ENABLE) { hi = t4_read_reg(sc, A_MA_EDRAM0_BAR); avail[i].base = G_EDRAM0_BASE(hi) << 20; avail[i].limit = avail[i].base + (G_EDRAM0_SIZE(hi) << 20); avail[i].idx = 0; i++; } if (lo & F_EDRAM1_ENABLE) { hi = t4_read_reg(sc, A_MA_EDRAM1_BAR); avail[i].base = G_EDRAM1_BASE(hi) << 20; avail[i].limit = avail[i].base + (G_EDRAM1_SIZE(hi) << 20); avail[i].idx = 1; i++; } if (lo & F_EXT_MEM_ENABLE) { hi = t4_read_reg(sc, A_MA_EXT_MEMORY_BAR); avail[i].base = G_EXT_MEM_BASE(hi) << 20; avail[i].limit = avail[i].base + (G_EXT_MEM_SIZE(hi) << 20); avail[i].idx = is_t5(sc) ? 3 : 2; /* Call it MC0 for T5 */ i++; } if (is_t5(sc) && lo & F_EXT_MEM1_ENABLE) { hi = t4_read_reg(sc, A_MA_EXT_MEMORY1_BAR); avail[i].base = G_EXT_MEM1_BASE(hi) << 20; avail[i].limit = avail[i].base + (G_EXT_MEM1_SIZE(hi) << 20); avail[i].idx = 4; i++; } if (!i) /* no memory available */ return 0; qsort(avail, i, sizeof(struct mem_desc), mem_desc_cmp); (md++)->base = t4_read_reg(sc, A_SGE_DBQ_CTXT_BADDR); (md++)->base = t4_read_reg(sc, A_SGE_IMSG_CTXT_BADDR); (md++)->base = t4_read_reg(sc, A_SGE_FLM_CACHE_BADDR); (md++)->base = t4_read_reg(sc, A_TP_CMM_TCB_BASE); (md++)->base = t4_read_reg(sc, A_TP_CMM_MM_BASE); (md++)->base = t4_read_reg(sc, A_TP_CMM_TIMER_BASE); (md++)->base = t4_read_reg(sc, A_TP_CMM_MM_RX_FLST_BASE); (md++)->base = t4_read_reg(sc, A_TP_CMM_MM_TX_FLST_BASE); (md++)->base = t4_read_reg(sc, A_TP_CMM_MM_PS_FLST_BASE); /* the next few have explicit upper bounds */ md->base = t4_read_reg(sc, A_TP_PMM_TX_BASE); md->limit = md->base - 1 + t4_read_reg(sc, A_TP_PMM_TX_PAGE_SIZE) * G_PMTXMAXPAGE(t4_read_reg(sc, A_TP_PMM_TX_MAX_PAGE)); md++; md->base = t4_read_reg(sc, A_TP_PMM_RX_BASE); md->limit = md->base - 1 + t4_read_reg(sc, A_TP_PMM_RX_PAGE_SIZE) * G_PMRXMAXPAGE(t4_read_reg(sc, A_TP_PMM_RX_MAX_PAGE)); md++; if (t4_read_reg(sc, A_LE_DB_CONFIG) & F_HASHEN) { if (chip_id(sc) <= CHELSIO_T5) md->base = t4_read_reg(sc, A_LE_DB_HASH_TID_BASE); else md->base = t4_read_reg(sc, A_LE_DB_HASH_TBL_BASE_ADDR); md->limit = 0; } else { md->base = 0; md->idx = nitems(region); /* hide it */ } md++; #define ulp_region(reg) \ md->base = t4_read_reg(sc, A_ULP_ ## reg ## _LLIMIT);\ (md++)->limit = t4_read_reg(sc, A_ULP_ ## reg ## _ULIMIT) ulp_region(RX_ISCSI); ulp_region(RX_TDDP); ulp_region(TX_TPT); ulp_region(RX_STAG); ulp_region(RX_RQ); ulp_region(RX_RQUDP); ulp_region(RX_PBL); ulp_region(TX_PBL); #undef ulp_region md->base = 0; md->idx = nitems(region); if (!is_t4(sc)) { uint32_t size = 0; uint32_t sge_ctrl = t4_read_reg(sc, A_SGE_CONTROL2); uint32_t fifo_size = t4_read_reg(sc, A_SGE_DBVFIFO_SIZE); if (is_t5(sc)) { if (sge_ctrl & F_VFIFO_ENABLE) size = G_DBVFIFO_SIZE(fifo_size); } else size = G_T6_DBVFIFO_SIZE(fifo_size); if (size) { md->base = G_BASEADDR(t4_read_reg(sc, A_SGE_DBVFIFO_BADDR)); md->limit = md->base + (size << 2) - 1; } } md++; md->base = t4_read_reg(sc, A_ULP_RX_CTX_BASE); md->limit = 0; md++; md->base = t4_read_reg(sc, A_ULP_TX_ERR_TABLE_BASE); md->limit = 0; md++; md->base = sc->vres.ocq.start; if (sc->vres.ocq.size) md->limit = md->base + sc->vres.ocq.size - 1; else md->idx = nitems(region); /* hide it */ md++; /* add any address-space holes, there can be up to 3 */ for (n = 0; n < i - 1; n++) if (avail[n].limit < avail[n + 1].base) (md++)->base = avail[n].limit; if (avail[n].limit) (md++)->base = avail[n].limit; n = md - mem; qsort(mem, n, sizeof(struct mem_desc), mem_desc_cmp); for (lo = 0; lo < i; lo++) mem_region_show(sb, memory[avail[lo].idx], avail[lo].base, avail[lo].limit - 1); sbuf_printf(sb, "\n"); for (i = 0; i < n; i++) { if (mem[i].idx >= nitems(region)) continue; /* skip holes */ if (!mem[i].limit) mem[i].limit = i < n - 1 ? mem[i + 1].base - 1 : ~0; mem_region_show(sb, region[mem[i].idx], mem[i].base, mem[i].limit); } sbuf_printf(sb, "\n"); lo = t4_read_reg(sc, A_CIM_SDRAM_BASE_ADDR); hi = t4_read_reg(sc, A_CIM_SDRAM_ADDR_SIZE) + lo - 1; mem_region_show(sb, "uP RAM:", lo, hi); lo = t4_read_reg(sc, A_CIM_EXTMEM2_BASE_ADDR); hi = t4_read_reg(sc, A_CIM_EXTMEM2_ADDR_SIZE) + lo - 1; mem_region_show(sb, "uP Extmem2:", lo, hi); lo = t4_read_reg(sc, A_TP_PMM_RX_MAX_PAGE); sbuf_printf(sb, "\n%u Rx pages of size %uKiB for %u channels\n", G_PMRXMAXPAGE(lo), t4_read_reg(sc, A_TP_PMM_RX_PAGE_SIZE) >> 10, (lo & F_PMRXNUMCHN) ? 2 : 1); lo = t4_read_reg(sc, A_TP_PMM_TX_MAX_PAGE); hi = t4_read_reg(sc, A_TP_PMM_TX_PAGE_SIZE); sbuf_printf(sb, "%u Tx pages of size %u%ciB for %u channels\n", G_PMTXMAXPAGE(lo), hi >= (1 << 20) ? (hi >> 20) : (hi >> 10), hi >= (1 << 20) ? 'M' : 'K', 1 << G_PMTXNUMCHN(lo)); sbuf_printf(sb, "%u p-structs\n", t4_read_reg(sc, A_TP_CMM_MM_MAX_PSTRUCT)); for (i = 0; i < 4; i++) { if (chip_id(sc) > CHELSIO_T5) lo = t4_read_reg(sc, A_MPS_RX_MAC_BG_PG_CNT0 + i * 4); else lo = t4_read_reg(sc, A_MPS_RX_PG_RSV0 + i * 4); if (is_t5(sc)) { used = G_T5_USED(lo); alloc = G_T5_ALLOC(lo); } else { used = G_USED(lo); alloc = G_ALLOC(lo); } /* For T6 these are MAC buffer groups */ sbuf_printf(sb, "\nPort %d using %u pages out of %u allocated", i, used, alloc); } for (i = 0; i < sc->chip_params->nchan; i++) { if (chip_id(sc) > CHELSIO_T5) lo = t4_read_reg(sc, A_MPS_RX_LPBK_BG_PG_CNT0 + i * 4); else lo = t4_read_reg(sc, A_MPS_RX_PG_RSV4 + i * 4); if (is_t5(sc)) { used = G_T5_USED(lo); alloc = G_T5_ALLOC(lo); } else { used = G_USED(lo); alloc = G_ALLOC(lo); } /* For T6 these are MAC buffer groups */ sbuf_printf(sb, "\nLoopback %d using %u pages out of %u allocated", i, used, alloc); } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static inline void tcamxy2valmask(uint64_t x, uint64_t y, uint8_t *addr, uint64_t *mask) { *mask = x | y; y = htobe64(y); memcpy(addr, (char *)&y + 2, ETHER_ADDR_LEN); } static int sysctl_mps_tcam(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i; MPASS(chip_id(sc) <= CHELSIO_T5); rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); sbuf_printf(sb, "Idx Ethernet address Mask Vld Ports PF" " VF Replication P0 P1 P2 P3 ML"); for (i = 0; i < sc->chip_params->mps_tcam_size; i++) { uint64_t tcamx, tcamy, mask; uint32_t cls_lo, cls_hi; uint8_t addr[ETHER_ADDR_LEN]; tcamy = t4_read_reg64(sc, MPS_CLS_TCAM_Y_L(i)); tcamx = t4_read_reg64(sc, MPS_CLS_TCAM_X_L(i)); if (tcamx & tcamy) continue; tcamxy2valmask(tcamx, tcamy, addr, &mask); cls_lo = t4_read_reg(sc, MPS_CLS_SRAM_L(i)); cls_hi = t4_read_reg(sc, MPS_CLS_SRAM_H(i)); sbuf_printf(sb, "\n%3u %02x:%02x:%02x:%02x:%02x:%02x %012jx" " %c %#x%4u%4d", i, addr[0], addr[1], addr[2], addr[3], addr[4], addr[5], (uintmax_t)mask, (cls_lo & F_SRAM_VLD) ? 'Y' : 'N', G_PORTMAP(cls_hi), G_PF(cls_lo), (cls_lo & F_VF_VALID) ? G_VF(cls_lo) : -1); if (cls_lo & F_REPLICATE) { struct fw_ldst_cmd ldst_cmd; memset(&ldst_cmd, 0, sizeof(ldst_cmd)); ldst_cmd.op_to_addrspace = htobe32(V_FW_CMD_OP(FW_LDST_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_READ | V_FW_LDST_CMD_ADDRSPACE(FW_LDST_ADDRSPC_MPS)); ldst_cmd.cycles_to_len16 = htobe32(FW_LEN16(ldst_cmd)); ldst_cmd.u.mps.rplc.fid_idx = htobe16(V_FW_LDST_CMD_FID(FW_LDST_MPS_RPLC) | V_FW_LDST_CMD_IDX(i)); rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4mps"); if (rc) break; rc = -t4_wr_mbox(sc, sc->mbox, &ldst_cmd, sizeof(ldst_cmd), &ldst_cmd); end_synchronized_op(sc, 0); if (rc != 0) { sbuf_printf(sb, "%36d", rc); rc = 0; } else { sbuf_printf(sb, " %08x %08x %08x %08x", be32toh(ldst_cmd.u.mps.rplc.rplc127_96), be32toh(ldst_cmd.u.mps.rplc.rplc95_64), be32toh(ldst_cmd.u.mps.rplc.rplc63_32), be32toh(ldst_cmd.u.mps.rplc.rplc31_0)); } } else sbuf_printf(sb, "%36s", ""); sbuf_printf(sb, "%4u%3u%3u%3u %#3x", G_SRAM_PRIO0(cls_lo), G_SRAM_PRIO1(cls_lo), G_SRAM_PRIO2(cls_lo), G_SRAM_PRIO3(cls_lo), (cls_lo >> S_MULTILISTEN0) & 0xf); } if (rc) (void) sbuf_finish(sb); else rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_mps_tcam_t6(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i; MPASS(chip_id(sc) > CHELSIO_T5); rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); sbuf_printf(sb, "Idx Ethernet address Mask VNI Mask" " IVLAN Vld DIP_Hit Lookup Port Vld Ports PF VF" " Replication" " P0 P1 P2 P3 ML\n"); for (i = 0; i < sc->chip_params->mps_tcam_size; i++) { uint8_t dip_hit, vlan_vld, lookup_type, port_num; uint16_t ivlan; uint64_t tcamx, tcamy, val, mask; uint32_t cls_lo, cls_hi, ctl, data2, vnix, vniy; uint8_t addr[ETHER_ADDR_LEN]; ctl = V_CTLREQID(1) | V_CTLCMDTYPE(0) | V_CTLXYBITSEL(0); if (i < 256) ctl |= V_CTLTCAMINDEX(i) | V_CTLTCAMSEL(0); else ctl |= V_CTLTCAMINDEX(i - 256) | V_CTLTCAMSEL(1); t4_write_reg(sc, A_MPS_CLS_TCAM_DATA2_CTL, ctl); val = t4_read_reg(sc, A_MPS_CLS_TCAM_RDATA1_REQ_ID1); tcamy = G_DMACH(val) << 32; tcamy |= t4_read_reg(sc, A_MPS_CLS_TCAM_RDATA0_REQ_ID1); data2 = t4_read_reg(sc, A_MPS_CLS_TCAM_RDATA2_REQ_ID1); lookup_type = G_DATALKPTYPE(data2); port_num = G_DATAPORTNUM(data2); if (lookup_type && lookup_type != M_DATALKPTYPE) { /* Inner header VNI */ vniy = ((data2 & F_DATAVIDH2) << 23) | (G_DATAVIDH1(data2) << 16) | G_VIDL(val); dip_hit = data2 & F_DATADIPHIT; vlan_vld = 0; } else { vniy = 0; dip_hit = 0; vlan_vld = data2 & F_DATAVIDH2; ivlan = G_VIDL(val); } ctl |= V_CTLXYBITSEL(1); t4_write_reg(sc, A_MPS_CLS_TCAM_DATA2_CTL, ctl); val = t4_read_reg(sc, A_MPS_CLS_TCAM_RDATA1_REQ_ID1); tcamx = G_DMACH(val) << 32; tcamx |= t4_read_reg(sc, A_MPS_CLS_TCAM_RDATA0_REQ_ID1); data2 = t4_read_reg(sc, A_MPS_CLS_TCAM_RDATA2_REQ_ID1); if (lookup_type && lookup_type != M_DATALKPTYPE) { /* Inner header VNI mask */ vnix = ((data2 & F_DATAVIDH2) << 23) | (G_DATAVIDH1(data2) << 16) | G_VIDL(val); } else vnix = 0; if (tcamx & tcamy) continue; tcamxy2valmask(tcamx, tcamy, addr, &mask); cls_lo = t4_read_reg(sc, MPS_CLS_SRAM_L(i)); cls_hi = t4_read_reg(sc, MPS_CLS_SRAM_H(i)); if (lookup_type && lookup_type != M_DATALKPTYPE) { sbuf_printf(sb, "\n%3u %02x:%02x:%02x:%02x:%02x:%02x " "%012jx %06x %06x - - %3c" " 'I' %4x %3c %#x%4u%4d", i, addr[0], addr[1], addr[2], addr[3], addr[4], addr[5], (uintmax_t)mask, vniy, vnix, dip_hit ? 'Y' : 'N', port_num, cls_lo & F_T6_SRAM_VLD ? 'Y' : 'N', G_PORTMAP(cls_hi), G_T6_PF(cls_lo), cls_lo & F_T6_VF_VALID ? G_T6_VF(cls_lo) : -1); } else { sbuf_printf(sb, "\n%3u %02x:%02x:%02x:%02x:%02x:%02x " "%012jx - - ", i, addr[0], addr[1], addr[2], addr[3], addr[4], addr[5], (uintmax_t)mask); if (vlan_vld) sbuf_printf(sb, "%4u Y ", ivlan); else sbuf_printf(sb, " - N "); sbuf_printf(sb, "- %3c %4x %3c %#x%4u%4d", lookup_type ? 'I' : 'O', port_num, cls_lo & F_T6_SRAM_VLD ? 'Y' : 'N', G_PORTMAP(cls_hi), G_T6_PF(cls_lo), cls_lo & F_T6_VF_VALID ? G_T6_VF(cls_lo) : -1); } if (cls_lo & F_T6_REPLICATE) { struct fw_ldst_cmd ldst_cmd; memset(&ldst_cmd, 0, sizeof(ldst_cmd)); ldst_cmd.op_to_addrspace = htobe32(V_FW_CMD_OP(FW_LDST_CMD) | F_FW_CMD_REQUEST | F_FW_CMD_READ | V_FW_LDST_CMD_ADDRSPACE(FW_LDST_ADDRSPC_MPS)); ldst_cmd.cycles_to_len16 = htobe32(FW_LEN16(ldst_cmd)); ldst_cmd.u.mps.rplc.fid_idx = htobe16(V_FW_LDST_CMD_FID(FW_LDST_MPS_RPLC) | V_FW_LDST_CMD_IDX(i)); rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t6mps"); if (rc) break; rc = -t4_wr_mbox(sc, sc->mbox, &ldst_cmd, sizeof(ldst_cmd), &ldst_cmd); end_synchronized_op(sc, 0); if (rc != 0) { sbuf_printf(sb, "%72d", rc); rc = 0; } else { sbuf_printf(sb, " %08x %08x %08x %08x" " %08x %08x %08x %08x", be32toh(ldst_cmd.u.mps.rplc.rplc255_224), be32toh(ldst_cmd.u.mps.rplc.rplc223_192), be32toh(ldst_cmd.u.mps.rplc.rplc191_160), be32toh(ldst_cmd.u.mps.rplc.rplc159_128), be32toh(ldst_cmd.u.mps.rplc.rplc127_96), be32toh(ldst_cmd.u.mps.rplc.rplc95_64), be32toh(ldst_cmd.u.mps.rplc.rplc63_32), be32toh(ldst_cmd.u.mps.rplc.rplc31_0)); } } else sbuf_printf(sb, "%72s", ""); sbuf_printf(sb, "%4u%3u%3u%3u %#x", G_T6_SRAM_PRIO0(cls_lo), G_T6_SRAM_PRIO1(cls_lo), G_T6_SRAM_PRIO2(cls_lo), G_T6_SRAM_PRIO3(cls_lo), (cls_lo >> S_T6_MULTILISTEN0) & 0xf); } if (rc) (void) sbuf_finish(sb); else rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_path_mtus(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; uint16_t mtus[NMTUS]; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); t4_read_mtu_tbl(sc, mtus, NULL); sbuf_printf(sb, "%u %u %u %u %u %u %u %u %u %u %u %u %u %u %u %u", mtus[0], mtus[1], mtus[2], mtus[3], mtus[4], mtus[5], mtus[6], mtus[7], mtus[8], mtus[9], mtus[10], mtus[11], mtus[12], mtus[13], mtus[14], mtus[15]); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_pm_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, i; uint32_t tx_cnt[MAX_PM_NSTATS], rx_cnt[MAX_PM_NSTATS]; uint64_t tx_cyc[MAX_PM_NSTATS], rx_cyc[MAX_PM_NSTATS]; static const char *tx_stats[MAX_PM_NSTATS] = { "Read:", "Write bypass:", "Write mem:", "Bypass + mem:", "Tx FIFO wait", NULL, "Tx latency" }; static const char *rx_stats[MAX_PM_NSTATS] = { "Read:", "Write bypass:", "Write mem:", "Flush:", " Rx FIFO wait", NULL, "Rx latency" }; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); t4_pmtx_get_stats(sc, tx_cnt, tx_cyc); t4_pmrx_get_stats(sc, rx_cnt, rx_cyc); sbuf_printf(sb, " Tx pcmds Tx bytes"); for (i = 0; i < 4; i++) { sbuf_printf(sb, "\n%-13s %10u %20ju", tx_stats[i], tx_cnt[i], tx_cyc[i]); } sbuf_printf(sb, "\n Rx pcmds Rx bytes"); for (i = 0; i < 4; i++) { sbuf_printf(sb, "\n%-13s %10u %20ju", rx_stats[i], rx_cnt[i], rx_cyc[i]); } if (chip_id(sc) > CHELSIO_T5) { sbuf_printf(sb, "\n Total wait Total occupancy"); sbuf_printf(sb, "\n%-13s %10u %20ju", tx_stats[i], tx_cnt[i], tx_cyc[i]); sbuf_printf(sb, "\n%-13s %10u %20ju", rx_stats[i], rx_cnt[i], rx_cyc[i]); i += 2; MPASS(i < nitems(tx_stats)); sbuf_printf(sb, "\n Reads Total wait"); sbuf_printf(sb, "\n%-13s %10u %20ju", tx_stats[i], tx_cnt[i], tx_cyc[i]); sbuf_printf(sb, "\n%-13s %10u %20ju", rx_stats[i], rx_cnt[i], rx_cyc[i]); } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_rdma_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; struct tp_rdma_stats stats; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); mtx_lock(&sc->reg_lock); t4_tp_get_rdma_stats(sc, &stats); mtx_unlock(&sc->reg_lock); sbuf_printf(sb, "NoRQEModDefferals: %u\n", stats.rqe_dfr_mod); sbuf_printf(sb, "NoRQEPktDefferals: %u", stats.rqe_dfr_pkt); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_tcp_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; struct tp_tcp_stats v4, v6; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); mtx_lock(&sc->reg_lock); t4_tp_get_tcp_stats(sc, &v4, &v6); mtx_unlock(&sc->reg_lock); sbuf_printf(sb, " IP IPv6\n"); sbuf_printf(sb, "OutRsts: %20u %20u\n", v4.tcp_out_rsts, v6.tcp_out_rsts); sbuf_printf(sb, "InSegs: %20ju %20ju\n", v4.tcp_in_segs, v6.tcp_in_segs); sbuf_printf(sb, "OutSegs: %20ju %20ju\n", v4.tcp_out_segs, v6.tcp_out_segs); sbuf_printf(sb, "RetransSegs: %20ju %20ju", v4.tcp_retrans_segs, v6.tcp_retrans_segs); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_tids(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; struct tid_info *t = &sc->tids; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); if (t->natids) { sbuf_printf(sb, "ATID range: 0-%u, in use: %u\n", t->natids - 1, t->atids_in_use); } if (t->ntids) { if (t4_read_reg(sc, A_LE_DB_CONFIG) & F_HASHEN) { uint32_t b = t4_read_reg(sc, A_LE_DB_SERVER_INDEX) / 4; if (b) { sbuf_printf(sb, "TID range: 0-%u, %u-%u", b - 1, t4_read_reg(sc, A_LE_DB_TID_HASHBASE) / 4, t->ntids - 1); } else { sbuf_printf(sb, "TID range: %u-%u", t4_read_reg(sc, A_LE_DB_TID_HASHBASE) / 4, t->ntids - 1); } } else sbuf_printf(sb, "TID range: 0-%u", t->ntids - 1); sbuf_printf(sb, ", in use: %u\n", atomic_load_acq_int(&t->tids_in_use)); } if (t->nstids) { sbuf_printf(sb, "STID range: %u-%u, in use: %u\n", t->stid_base, t->stid_base + t->nstids - 1, t->stids_in_use); } if (t->nftids) { sbuf_printf(sb, "FTID range: %u-%u\n", t->ftid_base, t->ftid_base + t->nftids - 1); } if (t->netids) { sbuf_printf(sb, "ETID range: %u-%u\n", t->etid_base, t->etid_base + t->netids - 1); } sbuf_printf(sb, "HW TID usage: %u IP users, %u IPv6 users", t4_read_reg(sc, A_LE_DB_ACT_CNT_IPV4), t4_read_reg(sc, A_LE_DB_ACT_CNT_IPV6)); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_tp_err_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; struct tp_err_stats stats; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); mtx_lock(&sc->reg_lock); t4_tp_get_err_stats(sc, &stats); mtx_unlock(&sc->reg_lock); if (sc->chip_params->nchan > 2) { sbuf_printf(sb, " channel 0 channel 1" " channel 2 channel 3\n"); sbuf_printf(sb, "macInErrs: %10u %10u %10u %10u\n", stats.mac_in_errs[0], stats.mac_in_errs[1], stats.mac_in_errs[2], stats.mac_in_errs[3]); sbuf_printf(sb, "hdrInErrs: %10u %10u %10u %10u\n", stats.hdr_in_errs[0], stats.hdr_in_errs[1], stats.hdr_in_errs[2], stats.hdr_in_errs[3]); sbuf_printf(sb, "tcpInErrs: %10u %10u %10u %10u\n", stats.tcp_in_errs[0], stats.tcp_in_errs[1], stats.tcp_in_errs[2], stats.tcp_in_errs[3]); sbuf_printf(sb, "tcp6InErrs: %10u %10u %10u %10u\n", stats.tcp6_in_errs[0], stats.tcp6_in_errs[1], stats.tcp6_in_errs[2], stats.tcp6_in_errs[3]); sbuf_printf(sb, "tnlCongDrops: %10u %10u %10u %10u\n", stats.tnl_cong_drops[0], stats.tnl_cong_drops[1], stats.tnl_cong_drops[2], stats.tnl_cong_drops[3]); sbuf_printf(sb, "tnlTxDrops: %10u %10u %10u %10u\n", stats.tnl_tx_drops[0], stats.tnl_tx_drops[1], stats.tnl_tx_drops[2], stats.tnl_tx_drops[3]); sbuf_printf(sb, "ofldVlanDrops: %10u %10u %10u %10u\n", stats.ofld_vlan_drops[0], stats.ofld_vlan_drops[1], stats.ofld_vlan_drops[2], stats.ofld_vlan_drops[3]); sbuf_printf(sb, "ofldChanDrops: %10u %10u %10u %10u\n\n", stats.ofld_chan_drops[0], stats.ofld_chan_drops[1], stats.ofld_chan_drops[2], stats.ofld_chan_drops[3]); } else { sbuf_printf(sb, " channel 0 channel 1\n"); sbuf_printf(sb, "macInErrs: %10u %10u\n", stats.mac_in_errs[0], stats.mac_in_errs[1]); sbuf_printf(sb, "hdrInErrs: %10u %10u\n", stats.hdr_in_errs[0], stats.hdr_in_errs[1]); sbuf_printf(sb, "tcpInErrs: %10u %10u\n", stats.tcp_in_errs[0], stats.tcp_in_errs[1]); sbuf_printf(sb, "tcp6InErrs: %10u %10u\n", stats.tcp6_in_errs[0], stats.tcp6_in_errs[1]); sbuf_printf(sb, "tnlCongDrops: %10u %10u\n", stats.tnl_cong_drops[0], stats.tnl_cong_drops[1]); sbuf_printf(sb, "tnlTxDrops: %10u %10u\n", stats.tnl_tx_drops[0], stats.tnl_tx_drops[1]); sbuf_printf(sb, "ofldVlanDrops: %10u %10u\n", stats.ofld_vlan_drops[0], stats.ofld_vlan_drops[1]); sbuf_printf(sb, "ofldChanDrops: %10u %10u\n\n", stats.ofld_chan_drops[0], stats.ofld_chan_drops[1]); } sbuf_printf(sb, "ofldNoNeigh: %u\nofldCongDefer: %u", stats.ofld_no_neigh, stats.ofld_cong_defer); rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_tp_la_mask(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct tp_params *tpp = &sc->params.tp; u_int mask; int rc; mask = tpp->la_mask >> 16; rc = sysctl_handle_int(oidp, &mask, 0, req); if (rc != 0 || req->newptr == NULL) return (rc); if (mask > 0xffff) return (EINVAL); tpp->la_mask = mask << 16; t4_set_reg_field(sc, A_TP_DBG_LA_CONFIG, 0xffff0000U, tpp->la_mask); return (0); } struct field_desc { const char *name; u_int start; u_int width; }; static void field_desc_show(struct sbuf *sb, uint64_t v, const struct field_desc *f) { char buf[32]; int line_size = 0; while (f->name) { uint64_t mask = (1ULL << f->width) - 1; int len = snprintf(buf, sizeof(buf), "%s: %ju", f->name, ((uintmax_t)v >> f->start) & mask); if (line_size + len >= 79) { line_size = 8; sbuf_printf(sb, "\n "); } sbuf_printf(sb, "%s ", buf); line_size += len + 1; f++; } sbuf_printf(sb, "\n"); } static const struct field_desc tp_la0[] = { { "RcfOpCodeOut", 60, 4 }, { "State", 56, 4 }, { "WcfState", 52, 4 }, { "RcfOpcSrcOut", 50, 2 }, { "CRxError", 49, 1 }, { "ERxError", 48, 1 }, { "SanityFailed", 47, 1 }, { "SpuriousMsg", 46, 1 }, { "FlushInputMsg", 45, 1 }, { "FlushInputCpl", 44, 1 }, { "RssUpBit", 43, 1 }, { "RssFilterHit", 42, 1 }, { "Tid", 32, 10 }, { "InitTcb", 31, 1 }, { "LineNumber", 24, 7 }, { "Emsg", 23, 1 }, { "EdataOut", 22, 1 }, { "Cmsg", 21, 1 }, { "CdataOut", 20, 1 }, { "EreadPdu", 19, 1 }, { "CreadPdu", 18, 1 }, { "TunnelPkt", 17, 1 }, { "RcfPeerFin", 16, 1 }, { "RcfReasonOut", 12, 4 }, { "TxCchannel", 10, 2 }, { "RcfTxChannel", 8, 2 }, { "RxEchannel", 6, 2 }, { "RcfRxChannel", 5, 1 }, { "RcfDataOutSrdy", 4, 1 }, { "RxDvld", 3, 1 }, { "RxOoDvld", 2, 1 }, { "RxCongestion", 1, 1 }, { "TxCongestion", 0, 1 }, { NULL } }; static const struct field_desc tp_la1[] = { { "CplCmdIn", 56, 8 }, { "CplCmdOut", 48, 8 }, { "ESynOut", 47, 1 }, { "EAckOut", 46, 1 }, { "EFinOut", 45, 1 }, { "ERstOut", 44, 1 }, { "SynIn", 43, 1 }, { "AckIn", 42, 1 }, { "FinIn", 41, 1 }, { "RstIn", 40, 1 }, { "DataIn", 39, 1 }, { "DataInVld", 38, 1 }, { "PadIn", 37, 1 }, { "RxBufEmpty", 36, 1 }, { "RxDdp", 35, 1 }, { "RxFbCongestion", 34, 1 }, { "TxFbCongestion", 33, 1 }, { "TxPktSumSrdy", 32, 1 }, { "RcfUlpType", 28, 4 }, { "Eread", 27, 1 }, { "Ebypass", 26, 1 }, { "Esave", 25, 1 }, { "Static0", 24, 1 }, { "Cread", 23, 1 }, { "Cbypass", 22, 1 }, { "Csave", 21, 1 }, { "CPktOut", 20, 1 }, { "RxPagePoolFull", 18, 2 }, { "RxLpbkPkt", 17, 1 }, { "TxLpbkPkt", 16, 1 }, { "RxVfValid", 15, 1 }, { "SynLearned", 14, 1 }, { "SetDelEntry", 13, 1 }, { "SetInvEntry", 12, 1 }, { "CpcmdDvld", 11, 1 }, { "CpcmdSave", 10, 1 }, { "RxPstructsFull", 8, 2 }, { "EpcmdDvld", 7, 1 }, { "EpcmdFlush", 6, 1 }, { "EpcmdTrimPrefix", 5, 1 }, { "EpcmdTrimPostfix", 4, 1 }, { "ERssIp4Pkt", 3, 1 }, { "ERssIp6Pkt", 2, 1 }, { "ERssTcpUdpPkt", 1, 1 }, { "ERssFceFipPkt", 0, 1 }, { NULL } }; static const struct field_desc tp_la2[] = { { "CplCmdIn", 56, 8 }, { "MpsVfVld", 55, 1 }, { "MpsPf", 52, 3 }, { "MpsVf", 44, 8 }, { "SynIn", 43, 1 }, { "AckIn", 42, 1 }, { "FinIn", 41, 1 }, { "RstIn", 40, 1 }, { "DataIn", 39, 1 }, { "DataInVld", 38, 1 }, { "PadIn", 37, 1 }, { "RxBufEmpty", 36, 1 }, { "RxDdp", 35, 1 }, { "RxFbCongestion", 34, 1 }, { "TxFbCongestion", 33, 1 }, { "TxPktSumSrdy", 32, 1 }, { "RcfUlpType", 28, 4 }, { "Eread", 27, 1 }, { "Ebypass", 26, 1 }, { "Esave", 25, 1 }, { "Static0", 24, 1 }, { "Cread", 23, 1 }, { "Cbypass", 22, 1 }, { "Csave", 21, 1 }, { "CPktOut", 20, 1 }, { "RxPagePoolFull", 18, 2 }, { "RxLpbkPkt", 17, 1 }, { "TxLpbkPkt", 16, 1 }, { "RxVfValid", 15, 1 }, { "SynLearned", 14, 1 }, { "SetDelEntry", 13, 1 }, { "SetInvEntry", 12, 1 }, { "CpcmdDvld", 11, 1 }, { "CpcmdSave", 10, 1 }, { "RxPstructsFull", 8, 2 }, { "EpcmdDvld", 7, 1 }, { "EpcmdFlush", 6, 1 }, { "EpcmdTrimPrefix", 5, 1 }, { "EpcmdTrimPostfix", 4, 1 }, { "ERssIp4Pkt", 3, 1 }, { "ERssIp6Pkt", 2, 1 }, { "ERssTcpUdpPkt", 1, 1 }, { "ERssFceFipPkt", 0, 1 }, { NULL } }; static void tp_la_show(struct sbuf *sb, uint64_t *p, int idx) { field_desc_show(sb, *p, tp_la0); } static void tp_la_show2(struct sbuf *sb, uint64_t *p, int idx) { if (idx) sbuf_printf(sb, "\n"); field_desc_show(sb, p[0], tp_la0); if (idx < (TPLA_SIZE / 2 - 1) || p[1] != ~0ULL) field_desc_show(sb, p[1], tp_la0); } static void tp_la_show3(struct sbuf *sb, uint64_t *p, int idx) { if (idx) sbuf_printf(sb, "\n"); field_desc_show(sb, p[0], tp_la0); if (idx < (TPLA_SIZE / 2 - 1) || p[1] != ~0ULL) field_desc_show(sb, p[1], (p[0] & (1 << 17)) ? tp_la2 : tp_la1); } static int sysctl_tp_la(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; uint64_t *buf, *p; int rc; u_int i, inc; void (*show_func)(struct sbuf *, uint64_t *, int); rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); buf = malloc(TPLA_SIZE * sizeof(uint64_t), M_CXGBE, M_ZERO | M_WAITOK); t4_tp_read_la(sc, buf, NULL); p = buf; switch (G_DBGLAMODE(t4_read_reg(sc, A_TP_DBG_LA_CONFIG))) { case 2: inc = 2; show_func = tp_la_show2; break; case 3: inc = 2; show_func = tp_la_show3; break; default: inc = 1; show_func = tp_la_show; } for (i = 0; i < TPLA_SIZE / inc; i++, p += inc) (*show_func)(sb, p, i); rc = sbuf_finish(sb); sbuf_delete(sb); free(buf, M_CXGBE); return (rc); } static int sysctl_tx_rate(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc; u64 nrate[MAX_NCHAN], orate[MAX_NCHAN]; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 256, req); if (sb == NULL) return (ENOMEM); t4_get_chan_txrate(sc, nrate, orate); if (sc->chip_params->nchan > 2) { sbuf_printf(sb, " channel 0 channel 1" " channel 2 channel 3\n"); sbuf_printf(sb, "NIC B/s: %10ju %10ju %10ju %10ju\n", nrate[0], nrate[1], nrate[2], nrate[3]); sbuf_printf(sb, "Offload B/s: %10ju %10ju %10ju %10ju", orate[0], orate[1], orate[2], orate[3]); } else { sbuf_printf(sb, " channel 0 channel 1\n"); sbuf_printf(sb, "NIC B/s: %10ju %10ju\n", nrate[0], nrate[1]); sbuf_printf(sb, "Offload B/s: %10ju %10ju", orate[0], orate[1]); } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } static int sysctl_ulprx_la(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; uint32_t *buf, *p; int rc, i; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); buf = malloc(ULPRX_LA_SIZE * 8 * sizeof(uint32_t), M_CXGBE, M_ZERO | M_WAITOK); t4_ulprx_read_la(sc, buf); p = buf; sbuf_printf(sb, " Pcmd Type Message" " Data"); for (i = 0; i < ULPRX_LA_SIZE; i++, p += 8) { sbuf_printf(sb, "\n%08x%08x %4x %08x %08x%08x%08x%08x", p[1], p[0], p[2], p[3], p[7], p[6], p[5], p[4]); } rc = sbuf_finish(sb); sbuf_delete(sb); free(buf, M_CXGBE); return (rc); } static int sysctl_wcwr_stats(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; struct sbuf *sb; int rc, v; rc = sysctl_wire_old_buffer(req, 0); if (rc != 0) return (rc); sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); if (sb == NULL) return (ENOMEM); v = t4_read_reg(sc, A_SGE_STAT_CFG); if (G_STATSOURCE_T5(v) == 7) { if (G_STATMODE(v) == 0) { sbuf_printf(sb, "total %d, incomplete %d", t4_read_reg(sc, A_SGE_STAT_TOTAL), t4_read_reg(sc, A_SGE_STAT_MATCH)); } else if (G_STATMODE(v) == 1) { sbuf_printf(sb, "total %d, data overflow %d", t4_read_reg(sc, A_SGE_STAT_TOTAL), t4_read_reg(sc, A_SGE_STAT_MATCH)); } } rc = sbuf_finish(sb); sbuf_delete(sb); return (rc); } + +static int +sysctl_tc_params(SYSCTL_HANDLER_ARGS) +{ + struct adapter *sc = arg1; + struct tx_sched_class *tc; + struct t4_sched_class_params p; + struct sbuf *sb; + int i, rc, port_id, flags, mbps, gbps; + + rc = sysctl_wire_old_buffer(req, 0); + if (rc != 0) + return (rc); + + sb = sbuf_new_for_sysctl(NULL, NULL, 4096, req); + if (sb == NULL) + return (ENOMEM); + + port_id = arg2 >> 16; + MPASS(port_id < sc->params.nports); + MPASS(sc->port[port_id] != NULL); + i = arg2 & 0xffff; + MPASS(i < sc->chip_params->nsched_cls); + tc = &sc->port[port_id]->tc[i]; + + rc = begin_synchronized_op(sc, NULL, HOLD_LOCK | SLEEP_OK | INTR_OK, + "t4tc_p"); + if (rc) + goto done; + flags = tc->flags; + p = tc->params; + end_synchronized_op(sc, LOCK_HELD); + + if ((flags & TX_SC_OK) == 0) { + sbuf_printf(sb, "none"); + goto done; + } + + if (p.level == SCHED_CLASS_LEVEL_CL_WRR) { + sbuf_printf(sb, "cl-wrr weight %u", p.weight); + goto done; + } else if (p.level == SCHED_CLASS_LEVEL_CL_RL) + sbuf_printf(sb, "cl-rl"); + else if (p.level == SCHED_CLASS_LEVEL_CH_RL) + sbuf_printf(sb, "ch-rl"); + else { + rc = ENXIO; + goto done; + } + + if (p.ratemode == SCHED_CLASS_RATEMODE_REL) { + /* XXX: top speed or actual link speed? */ + gbps = port_top_speed(sc->port[port_id]); + sbuf_printf(sb, " %u%% of %uGbps", p.maxrate, gbps); + } + else if (p.ratemode == SCHED_CLASS_RATEMODE_ABS) { + switch (p.rateunit) { + case SCHED_CLASS_RATEUNIT_BITS: + mbps = p.maxrate / 1000; + gbps = p.maxrate / 1000000; + if (p.maxrate == gbps * 1000000) + sbuf_printf(sb, " %uGbps", gbps); + else if (p.maxrate == mbps * 1000) + sbuf_printf(sb, " %uMbps", mbps); + else + sbuf_printf(sb, " %uKbps", p.maxrate); + break; + case SCHED_CLASS_RATEUNIT_PKTS: + sbuf_printf(sb, " %upps", p.maxrate); + break; + default: + rc = ENXIO; + goto done; + } + } + + switch (p.mode) { + case SCHED_CLASS_MODE_CLASS: + sbuf_printf(sb, " aggregate"); + break; + case SCHED_CLASS_MODE_FLOW: + sbuf_printf(sb, " per-flow"); + break; + default: + rc = ENXIO; + goto done; + } + +done: + if (rc == 0) + rc = sbuf_finish(sb); + sbuf_delete(sb); + + return (rc); +} #endif #ifdef TCP_OFFLOAD static void unit_conv(char *buf, size_t len, u_int val, u_int factor) { u_int rem = val % factor; if (rem == 0) snprintf(buf, len, "%u", val / factor); else { while (rem % 10 == 0) rem /= 10; snprintf(buf, len, "%u.%u", val / factor, rem); } } static int sysctl_tp_tick(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; char buf[16]; u_int res, re; u_int cclk_ps = 1000000000 / sc->params.vpd.cclk; res = t4_read_reg(sc, A_TP_TIMER_RESOLUTION); switch (arg2) { case 0: /* timer_tick */ re = G_TIMERRESOLUTION(res); break; case 1: /* TCP timestamp tick */ re = G_TIMESTAMPRESOLUTION(res); break; case 2: /* DACK tick */ re = G_DELAYEDACKRESOLUTION(res); break; default: return (EDOOFUS); } unit_conv(buf, sizeof(buf), (cclk_ps << re), 1000000); return (sysctl_handle_string(oidp, buf, sizeof(buf), req)); } static int sysctl_tp_dack_timer(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; u_int res, dack_re, v; u_int cclk_ps = 1000000000 / sc->params.vpd.cclk; res = t4_read_reg(sc, A_TP_TIMER_RESOLUTION); dack_re = G_DELAYEDACKRESOLUTION(res); v = ((cclk_ps << dack_re) / 1000000) * t4_read_reg(sc, A_TP_DACK_TIMER); return (sysctl_handle_int(oidp, &v, 0, req)); } static int sysctl_tp_timer(SYSCTL_HANDLER_ARGS) { struct adapter *sc = arg1; int reg = arg2; u_int tre; u_long tp_tick_us, v; u_int cclk_ps = 1000000000 / sc->params.vpd.cclk; MPASS(reg == A_TP_RXT_MIN || reg == A_TP_RXT_MAX || reg == A_TP_PERS_MIN || reg == A_TP_PERS_MAX || reg == A_TP_KEEP_IDLE || A_TP_KEEP_INTVL || reg == A_TP_INIT_SRTT || reg == A_TP_FINWAIT2_TIMER); tre = G_TIMERRESOLUTION(t4_read_reg(sc, A_TP_TIMER_RESOLUTION)); tp_tick_us = (cclk_ps << tre) / 1000000; if (reg == A_TP_INIT_SRTT) v = tp_tick_us * G_INITSRTT(t4_read_reg(sc, reg)); else v = tp_tick_us * t4_read_reg(sc, reg); return (sysctl_handle_long(oidp, &v, 0, req)); } #endif static uint32_t fconf_iconf_to_mode(uint32_t fconf, uint32_t iconf) { uint32_t mode; mode = T4_FILTER_IPv4 | T4_FILTER_IPv6 | T4_FILTER_IP_SADDR | T4_FILTER_IP_DADDR | T4_FILTER_IP_SPORT | T4_FILTER_IP_DPORT; if (fconf & F_FRAGMENTATION) mode |= T4_FILTER_IP_FRAGMENT; if (fconf & F_MPSHITTYPE) mode |= T4_FILTER_MPS_HIT_TYPE; if (fconf & F_MACMATCH) mode |= T4_FILTER_MAC_IDX; if (fconf & F_ETHERTYPE) mode |= T4_FILTER_ETH_TYPE; if (fconf & F_PROTOCOL) mode |= T4_FILTER_IP_PROTO; if (fconf & F_TOS) mode |= T4_FILTER_IP_TOS; if (fconf & F_VLAN) mode |= T4_FILTER_VLAN; if (fconf & F_VNIC_ID) { mode |= T4_FILTER_VNIC; if (iconf & F_VNIC) mode |= T4_FILTER_IC_VNIC; } if (fconf & F_PORT) mode |= T4_FILTER_PORT; if (fconf & F_FCOE) mode |= T4_FILTER_FCoE; return (mode); } static uint32_t mode_to_fconf(uint32_t mode) { uint32_t fconf = 0; if (mode & T4_FILTER_IP_FRAGMENT) fconf |= F_FRAGMENTATION; if (mode & T4_FILTER_MPS_HIT_TYPE) fconf |= F_MPSHITTYPE; if (mode & T4_FILTER_MAC_IDX) fconf |= F_MACMATCH; if (mode & T4_FILTER_ETH_TYPE) fconf |= F_ETHERTYPE; if (mode & T4_FILTER_IP_PROTO) fconf |= F_PROTOCOL; if (mode & T4_FILTER_IP_TOS) fconf |= F_TOS; if (mode & T4_FILTER_VLAN) fconf |= F_VLAN; if (mode & T4_FILTER_VNIC) fconf |= F_VNIC_ID; if (mode & T4_FILTER_PORT) fconf |= F_PORT; if (mode & T4_FILTER_FCoE) fconf |= F_FCOE; return (fconf); } static uint32_t mode_to_iconf(uint32_t mode) { if (mode & T4_FILTER_IC_VNIC) return (F_VNIC); return (0); } static int check_fspec_against_fconf_iconf(struct adapter *sc, struct t4_filter_specification *fs) { struct tp_params *tpp = &sc->params.tp; uint32_t fconf = 0; if (fs->val.frag || fs->mask.frag) fconf |= F_FRAGMENTATION; if (fs->val.matchtype || fs->mask.matchtype) fconf |= F_MPSHITTYPE; if (fs->val.macidx || fs->mask.macidx) fconf |= F_MACMATCH; if (fs->val.ethtype || fs->mask.ethtype) fconf |= F_ETHERTYPE; if (fs->val.proto || fs->mask.proto) fconf |= F_PROTOCOL; if (fs->val.tos || fs->mask.tos) fconf |= F_TOS; if (fs->val.vlan_vld || fs->mask.vlan_vld) fconf |= F_VLAN; if (fs->val.ovlan_vld || fs->mask.ovlan_vld) { fconf |= F_VNIC_ID; if (tpp->ingress_config & F_VNIC) return (EINVAL); } if (fs->val.pfvf_vld || fs->mask.pfvf_vld) { fconf |= F_VNIC_ID; if ((tpp->ingress_config & F_VNIC) == 0) return (EINVAL); } if (fs->val.iport || fs->mask.iport) fconf |= F_PORT; if (fs->val.fcoe || fs->mask.fcoe) fconf |= F_FCOE; if ((tpp->vlan_pri_map | fconf) != tpp->vlan_pri_map) return (E2BIG); return (0); } static int get_filter_mode(struct adapter *sc, uint32_t *mode) { struct tp_params *tpp = &sc->params.tp; /* * We trust the cached values of the relevant TP registers. This means * things work reliably only if writes to those registers are always via * t4_set_filter_mode. */ *mode = fconf_iconf_to_mode(tpp->vlan_pri_map, tpp->ingress_config); return (0); } static int set_filter_mode(struct adapter *sc, uint32_t mode) { struct tp_params *tpp = &sc->params.tp; uint32_t fconf, iconf; int rc; iconf = mode_to_iconf(mode); if ((iconf ^ tpp->ingress_config) & F_VNIC) { /* * For now we just complain if A_TP_INGRESS_CONFIG is not * already set to the correct value for the requested filter * mode. It's not clear if it's safe to write to this register * on the fly. (And we trust the cached value of the register). */ return (EBUSY); } fconf = mode_to_fconf(mode); rc = begin_synchronized_op(sc, NULL, HOLD_LOCK | SLEEP_OK | INTR_OK, "t4setfm"); if (rc) return (rc); if (sc->tids.ftids_in_use > 0) { rc = EBUSY; goto done; } #ifdef TCP_OFFLOAD if (uld_active(sc, ULD_TOM)) { rc = EBUSY; goto done; } #endif rc = -t4_set_filter_mode(sc, fconf); done: end_synchronized_op(sc, LOCK_HELD); return (rc); } static inline uint64_t get_filter_hits(struct adapter *sc, uint32_t fid) { uint32_t tcb_addr; tcb_addr = t4_read_reg(sc, A_TP_CMM_TCB_BASE) + (fid + sc->tids.ftid_base) * TCB_SIZE; if (is_t4(sc)) { uint64_t hits; read_via_memwin(sc, 0, tcb_addr + 16, (uint32_t *)&hits, 8); return (be64toh(hits)); } else { uint32_t hits; read_via_memwin(sc, 0, tcb_addr + 24, &hits, 4); return (be32toh(hits)); } } static int get_filter(struct adapter *sc, struct t4_filter *t) { int i, rc, nfilters = sc->tids.nftids; struct filter_entry *f; rc = begin_synchronized_op(sc, NULL, HOLD_LOCK | SLEEP_OK | INTR_OK, "t4getf"); if (rc) return (rc); if (sc->tids.ftids_in_use == 0 || sc->tids.ftid_tab == NULL || t->idx >= nfilters) { t->idx = 0xffffffff; goto done; } f = &sc->tids.ftid_tab[t->idx]; for (i = t->idx; i < nfilters; i++, f++) { if (f->valid) { t->idx = i; t->l2tidx = f->l2t ? f->l2t->idx : 0; t->smtidx = f->smtidx; if (f->fs.hitcnts) t->hits = get_filter_hits(sc, t->idx); else t->hits = UINT64_MAX; t->fs = f->fs; goto done; } } t->idx = 0xffffffff; done: end_synchronized_op(sc, LOCK_HELD); return (0); } static int set_filter(struct adapter *sc, struct t4_filter *t) { unsigned int nfilters, nports; struct filter_entry *f; int i, rc; rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4setf"); if (rc) return (rc); nfilters = sc->tids.nftids; nports = sc->params.nports; if (nfilters == 0) { rc = ENOTSUP; goto done; } if (!(sc->flags & FULL_INIT_DONE)) { rc = EAGAIN; goto done; } if (t->idx >= nfilters) { rc = EINVAL; goto done; } /* Validate against the global filter mode and ingress config */ rc = check_fspec_against_fconf_iconf(sc, &t->fs); if (rc != 0) goto done; if (t->fs.action == FILTER_SWITCH && t->fs.eport >= nports) { rc = EINVAL; goto done; } if (t->fs.val.iport >= nports) { rc = EINVAL; goto done; } /* Can't specify an iq if not steering to it */ if (!t->fs.dirsteer && t->fs.iq) { rc = EINVAL; goto done; } /* IPv6 filter idx must be 4 aligned */ if (t->fs.type == 1 && ((t->idx & 0x3) || t->idx + 4 >= nfilters)) { rc = EINVAL; goto done; } if (sc->tids.ftid_tab == NULL) { KASSERT(sc->tids.ftids_in_use == 0, ("%s: no memory allocated but filters_in_use > 0", __func__)); sc->tids.ftid_tab = malloc(sizeof (struct filter_entry) * nfilters, M_CXGBE, M_NOWAIT | M_ZERO); if (sc->tids.ftid_tab == NULL) { rc = ENOMEM; goto done; } mtx_init(&sc->tids.ftid_lock, "T4 filters", 0, MTX_DEF); } for (i = 0; i < 4; i++) { f = &sc->tids.ftid_tab[t->idx + i]; if (f->pending || f->valid) { rc = EBUSY; goto done; } if (f->locked) { rc = EPERM; goto done; } if (t->fs.type == 0) break; } f = &sc->tids.ftid_tab[t->idx]; f->fs = t->fs; rc = set_filter_wr(sc, t->idx); done: end_synchronized_op(sc, 0); if (rc == 0) { mtx_lock(&sc->tids.ftid_lock); for (;;) { if (f->pending == 0) { rc = f->valid ? 0 : EIO; break; } if (mtx_sleep(&sc->tids.ftid_tab, &sc->tids.ftid_lock, PCATCH, "t4setfw", 0)) { rc = EINPROGRESS; break; } } mtx_unlock(&sc->tids.ftid_lock); } return (rc); } static int del_filter(struct adapter *sc, struct t4_filter *t) { unsigned int nfilters; struct filter_entry *f; int rc; rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4delf"); if (rc) return (rc); nfilters = sc->tids.nftids; if (nfilters == 0) { rc = ENOTSUP; goto done; } if (sc->tids.ftid_tab == NULL || sc->tids.ftids_in_use == 0 || t->idx >= nfilters) { rc = EINVAL; goto done; } if (!(sc->flags & FULL_INIT_DONE)) { rc = EAGAIN; goto done; } f = &sc->tids.ftid_tab[t->idx]; if (f->pending) { rc = EBUSY; goto done; } if (f->locked) { rc = EPERM; goto done; } if (f->valid) { t->fs = f->fs; /* extra info for the caller */ rc = del_filter_wr(sc, t->idx); } done: end_synchronized_op(sc, 0); if (rc == 0) { mtx_lock(&sc->tids.ftid_lock); for (;;) { if (f->pending == 0) { rc = f->valid ? EIO : 0; break; } if (mtx_sleep(&sc->tids.ftid_tab, &sc->tids.ftid_lock, PCATCH, "t4delfw", 0)) { rc = EINPROGRESS; break; } } mtx_unlock(&sc->tids.ftid_lock); } return (rc); } static void clear_filter(struct filter_entry *f) { if (f->l2t) t4_l2t_release(f->l2t); bzero(f, sizeof (*f)); } static int set_filter_wr(struct adapter *sc, int fidx) { struct filter_entry *f = &sc->tids.ftid_tab[fidx]; struct fw_filter_wr *fwr; unsigned int ftid, vnic_vld, vnic_vld_mask; struct wrq_cookie cookie; ASSERT_SYNCHRONIZED_OP(sc); if (f->fs.newdmac || f->fs.newvlan) { /* This filter needs an L2T entry; allocate one. */ f->l2t = t4_l2t_alloc_switching(sc->l2t); if (f->l2t == NULL) return (EAGAIN); if (t4_l2t_set_switching(sc, f->l2t, f->fs.vlan, f->fs.eport, f->fs.dmac)) { t4_l2t_release(f->l2t); f->l2t = NULL; return (ENOMEM); } } /* Already validated against fconf, iconf */ MPASS((f->fs.val.pfvf_vld & f->fs.val.ovlan_vld) == 0); MPASS((f->fs.mask.pfvf_vld & f->fs.mask.ovlan_vld) == 0); if (f->fs.val.pfvf_vld || f->fs.val.ovlan_vld) vnic_vld = 1; else vnic_vld = 0; if (f->fs.mask.pfvf_vld || f->fs.mask.ovlan_vld) vnic_vld_mask = 1; else vnic_vld_mask = 0; ftid = sc->tids.ftid_base + fidx; fwr = start_wrq_wr(&sc->sge.mgmtq, howmany(sizeof(*fwr), 16), &cookie); if (fwr == NULL) return (ENOMEM); bzero(fwr, sizeof(*fwr)); fwr->op_pkd = htobe32(V_FW_WR_OP(FW_FILTER_WR)); fwr->len16_pkd = htobe32(FW_LEN16(*fwr)); fwr->tid_to_iq = htobe32(V_FW_FILTER_WR_TID(ftid) | V_FW_FILTER_WR_RQTYPE(f->fs.type) | V_FW_FILTER_WR_NOREPLY(0) | V_FW_FILTER_WR_IQ(f->fs.iq)); fwr->del_filter_to_l2tix = htobe32(V_FW_FILTER_WR_RPTTID(f->fs.rpttid) | V_FW_FILTER_WR_DROP(f->fs.action == FILTER_DROP) | V_FW_FILTER_WR_DIRSTEER(f->fs.dirsteer) | V_FW_FILTER_WR_MASKHASH(f->fs.maskhash) | V_FW_FILTER_WR_DIRSTEERHASH(f->fs.dirsteerhash) | V_FW_FILTER_WR_LPBK(f->fs.action == FILTER_SWITCH) | V_FW_FILTER_WR_DMAC(f->fs.newdmac) | V_FW_FILTER_WR_SMAC(f->fs.newsmac) | V_FW_FILTER_WR_INSVLAN(f->fs.newvlan == VLAN_INSERT || f->fs.newvlan == VLAN_REWRITE) | V_FW_FILTER_WR_RMVLAN(f->fs.newvlan == VLAN_REMOVE || f->fs.newvlan == VLAN_REWRITE) | V_FW_FILTER_WR_HITCNTS(f->fs.hitcnts) | V_FW_FILTER_WR_TXCHAN(f->fs.eport) | V_FW_FILTER_WR_PRIO(f->fs.prio) | V_FW_FILTER_WR_L2TIX(f->l2t ? f->l2t->idx : 0)); fwr->ethtype = htobe16(f->fs.val.ethtype); fwr->ethtypem = htobe16(f->fs.mask.ethtype); fwr->frag_to_ovlan_vldm = (V_FW_FILTER_WR_FRAG(f->fs.val.frag) | V_FW_FILTER_WR_FRAGM(f->fs.mask.frag) | V_FW_FILTER_WR_IVLAN_VLD(f->fs.val.vlan_vld) | V_FW_FILTER_WR_OVLAN_VLD(vnic_vld) | V_FW_FILTER_WR_IVLAN_VLDM(f->fs.mask.vlan_vld) | V_FW_FILTER_WR_OVLAN_VLDM(vnic_vld_mask)); fwr->smac_sel = 0; fwr->rx_chan_rx_rpl_iq = htobe16(V_FW_FILTER_WR_RX_CHAN(0) | V_FW_FILTER_WR_RX_RPL_IQ(sc->sge.fwq.abs_id)); fwr->maci_to_matchtypem = htobe32(V_FW_FILTER_WR_MACI(f->fs.val.macidx) | V_FW_FILTER_WR_MACIM(f->fs.mask.macidx) | V_FW_FILTER_WR_FCOE(f->fs.val.fcoe) | V_FW_FILTER_WR_FCOEM(f->fs.mask.fcoe) | V_FW_FILTER_WR_PORT(f->fs.val.iport) | V_FW_FILTER_WR_PORTM(f->fs.mask.iport) | V_FW_FILTER_WR_MATCHTYPE(f->fs.val.matchtype) | V_FW_FILTER_WR_MATCHTYPEM(f->fs.mask.matchtype)); fwr->ptcl = f->fs.val.proto; fwr->ptclm = f->fs.mask.proto; fwr->ttyp = f->fs.val.tos; fwr->ttypm = f->fs.mask.tos; fwr->ivlan = htobe16(f->fs.val.vlan); fwr->ivlanm = htobe16(f->fs.mask.vlan); fwr->ovlan = htobe16(f->fs.val.vnic); fwr->ovlanm = htobe16(f->fs.mask.vnic); bcopy(f->fs.val.dip, fwr->lip, sizeof (fwr->lip)); bcopy(f->fs.mask.dip, fwr->lipm, sizeof (fwr->lipm)); bcopy(f->fs.val.sip, fwr->fip, sizeof (fwr->fip)); bcopy(f->fs.mask.sip, fwr->fipm, sizeof (fwr->fipm)); fwr->lp = htobe16(f->fs.val.dport); fwr->lpm = htobe16(f->fs.mask.dport); fwr->fp = htobe16(f->fs.val.sport); fwr->fpm = htobe16(f->fs.mask.sport); if (f->fs.newsmac) bcopy(f->fs.smac, fwr->sma, sizeof (fwr->sma)); f->pending = 1; sc->tids.ftids_in_use++; commit_wrq_wr(&sc->sge.mgmtq, fwr, &cookie); return (0); } static int del_filter_wr(struct adapter *sc, int fidx) { struct filter_entry *f = &sc->tids.ftid_tab[fidx]; struct fw_filter_wr *fwr; unsigned int ftid; struct wrq_cookie cookie; ftid = sc->tids.ftid_base + fidx; fwr = start_wrq_wr(&sc->sge.mgmtq, howmany(sizeof(*fwr), 16), &cookie); if (fwr == NULL) return (ENOMEM); bzero(fwr, sizeof (*fwr)); t4_mk_filtdelwr(ftid, fwr, sc->sge.fwq.abs_id); f->pending = 1; commit_wrq_wr(&sc->sge.mgmtq, fwr, &cookie); return (0); } int t4_filter_rpl(struct sge_iq *iq, const struct rss_header *rss, struct mbuf *m) { struct adapter *sc = iq->adapter; const struct cpl_set_tcb_rpl *rpl = (const void *)(rss + 1); unsigned int idx = GET_TID(rpl); unsigned int rc; struct filter_entry *f; KASSERT(m == NULL, ("%s: payload with opcode %02x", __func__, rss->opcode)); if (is_ftid(sc, idx)) { idx -= sc->tids.ftid_base; f = &sc->tids.ftid_tab[idx]; rc = G_COOKIE(rpl->cookie); mtx_lock(&sc->tids.ftid_lock); if (rc == FW_FILTER_WR_FLT_ADDED) { KASSERT(f->pending, ("%s: filter[%u] isn't pending.", __func__, idx)); f->smtidx = (be64toh(rpl->oldval) >> 24) & 0xff; f->pending = 0; /* asynchronous setup completed */ f->valid = 1; } else { if (rc != FW_FILTER_WR_FLT_DELETED) { /* Add or delete failed, display an error */ log(LOG_ERR, "filter %u setup failed with error %u\n", idx, rc); } clear_filter(f); sc->tids.ftids_in_use--; } wakeup(&sc->tids.ftid_tab); mtx_unlock(&sc->tids.ftid_lock); } return (0); } static int get_sge_context(struct adapter *sc, struct t4_sge_context *cntxt) { int rc; if (cntxt->cid > M_CTXTQID) return (EINVAL); if (cntxt->mem_id != CTXT_EGRESS && cntxt->mem_id != CTXT_INGRESS && cntxt->mem_id != CTXT_FLM && cntxt->mem_id != CTXT_CNM) return (EINVAL); rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4ctxt"); if (rc) return (rc); if (sc->flags & FW_OK) { rc = -t4_sge_ctxt_rd(sc, sc->mbox, cntxt->cid, cntxt->mem_id, &cntxt->data[0]); if (rc == 0) goto done; } /* * Read via firmware failed or wasn't even attempted. Read directly via * the backdoor. */ rc = -t4_sge_ctxt_rd_bd(sc, cntxt->cid, cntxt->mem_id, &cntxt->data[0]); done: end_synchronized_op(sc, 0); return (rc); } static int load_fw(struct adapter *sc, struct t4_data *fw) { int rc; uint8_t *fw_data; rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4ldfw"); if (rc) return (rc); if (sc->flags & FULL_INIT_DONE) { rc = EBUSY; goto done; } fw_data = malloc(fw->len, M_CXGBE, M_WAITOK); if (fw_data == NULL) { rc = ENOMEM; goto done; } rc = copyin(fw->data, fw_data, fw->len); if (rc == 0) rc = -t4_load_fw(sc, fw_data, fw->len); free(fw_data, M_CXGBE); done: end_synchronized_op(sc, 0); return (rc); } #define MAX_READ_BUF_SIZE (128 * 1024) static int read_card_mem(struct adapter *sc, int win, struct t4_mem_range *mr) { uint32_t addr, remaining, n; uint32_t *buf; int rc; uint8_t *dst; rc = validate_mem_range(sc, mr->addr, mr->len); if (rc != 0) return (rc); buf = malloc(min(mr->len, MAX_READ_BUF_SIZE), M_CXGBE, M_WAITOK); addr = mr->addr; remaining = mr->len; dst = (void *)mr->data; while (remaining) { n = min(remaining, MAX_READ_BUF_SIZE); read_via_memwin(sc, 2, addr, buf, n); rc = copyout(buf, dst, n); if (rc != 0) break; dst += n; remaining -= n; addr += n; } free(buf, M_CXGBE); return (rc); } #undef MAX_READ_BUF_SIZE static int read_i2c(struct adapter *sc, struct t4_i2c_data *i2cd) { int rc; if (i2cd->len == 0 || i2cd->port_id >= sc->params.nports) return (EINVAL); if (i2cd->len > sizeof(i2cd->data)) return (EFBIG); rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4i2crd"); if (rc) return (rc); rc = -t4_i2c_rd(sc, sc->mbox, i2cd->port_id, i2cd->dev_addr, i2cd->offset, i2cd->len, &i2cd->data[0]); end_synchronized_op(sc, 0); return (rc); } static int in_range(int val, int lo, int hi) { return (val < 0 || (val <= hi && val >= lo)); } static int -set_sched_class(struct adapter *sc, struct t4_sched_params *p) +set_sched_class_config(struct adapter *sc, int minmax) { - int fw_subcmd, fw_type, rc; + int rc; - rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4setsc"); + if (minmax < 0) + return (EINVAL); + + rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4sscc"); if (rc) return (rc); + rc = -t4_sched_config(sc, FW_SCHED_TYPE_PKTSCHED, minmax, 1); + end_synchronized_op(sc, 0); - if (!(sc->flags & FULL_INIT_DONE)) { - rc = EAGAIN; - goto done; - } + return (rc); +} - /* - * Translate the cxgbetool parameters into T4 firmware parameters. (The - * sub-command and type are in common locations.) - */ - if (p->subcmd == SCHED_CLASS_SUBCMD_CONFIG) - fw_subcmd = FW_SCHED_SC_CONFIG; - else if (p->subcmd == SCHED_CLASS_SUBCMD_PARAMS) - fw_subcmd = FW_SCHED_SC_PARAMS; - else { - rc = EINVAL; - goto done; - } - if (p->type == SCHED_CLASS_TYPE_PACKET) - fw_type = FW_SCHED_TYPE_PKTSCHED; - else { - rc = EINVAL; - goto done; - } +static int +set_sched_class_params(struct adapter *sc, struct t4_sched_class_params *p, + int sleep_ok) +{ + int rc, top_speed, fw_level, fw_mode, fw_rateunit, fw_ratemode; + struct port_info *pi; + struct tx_sched_class *tc; - if (fw_subcmd == FW_SCHED_SC_CONFIG) { - /* Vet our parameters ..*/ - if (p->u.config.minmax < 0) { - rc = EINVAL; - goto done; - } + if (p->level == SCHED_CLASS_LEVEL_CL_RL) + fw_level = FW_SCHED_PARAMS_LEVEL_CL_RL; + else if (p->level == SCHED_CLASS_LEVEL_CL_WRR) + fw_level = FW_SCHED_PARAMS_LEVEL_CL_WRR; + else if (p->level == SCHED_CLASS_LEVEL_CH_RL) + fw_level = FW_SCHED_PARAMS_LEVEL_CH_RL; + else + return (EINVAL); - /* And pass the request to the firmware ...*/ - rc = -t4_sched_config(sc, fw_type, p->u.config.minmax, 1); - goto done; - } + if (p->mode == SCHED_CLASS_MODE_CLASS) + fw_mode = FW_SCHED_PARAMS_MODE_CLASS; + else if (p->mode == SCHED_CLASS_MODE_FLOW) + fw_mode = FW_SCHED_PARAMS_MODE_FLOW; + else + return (EINVAL); - if (fw_subcmd == FW_SCHED_SC_PARAMS) { - int fw_level; - int fw_mode; - int fw_rateunit; - int fw_ratemode; + if (p->rateunit == SCHED_CLASS_RATEUNIT_BITS) + fw_rateunit = FW_SCHED_PARAMS_UNIT_BITRATE; + else if (p->rateunit == SCHED_CLASS_RATEUNIT_PKTS) + fw_rateunit = FW_SCHED_PARAMS_UNIT_PKTRATE; + else + return (EINVAL); - if (p->u.params.level == SCHED_CLASS_LEVEL_CL_RL) - fw_level = FW_SCHED_PARAMS_LEVEL_CL_RL; - else if (p->u.params.level == SCHED_CLASS_LEVEL_CL_WRR) - fw_level = FW_SCHED_PARAMS_LEVEL_CL_WRR; - else if (p->u.params.level == SCHED_CLASS_LEVEL_CH_RL) - fw_level = FW_SCHED_PARAMS_LEVEL_CH_RL; - else { - rc = EINVAL; - goto done; - } + if (p->ratemode == SCHED_CLASS_RATEMODE_REL) + fw_ratemode = FW_SCHED_PARAMS_RATE_REL; + else if (p->ratemode == SCHED_CLASS_RATEMODE_ABS) + fw_ratemode = FW_SCHED_PARAMS_RATE_ABS; + else + return (EINVAL); - if (p->u.params.mode == SCHED_CLASS_MODE_CLASS) - fw_mode = FW_SCHED_PARAMS_MODE_CLASS; - else if (p->u.params.mode == SCHED_CLASS_MODE_FLOW) - fw_mode = FW_SCHED_PARAMS_MODE_FLOW; - else { - rc = EINVAL; - goto done; - } + /* Vet our parameters ... */ + if (!in_range(p->channel, 0, sc->chip_params->nchan - 1)) + return (ERANGE); - if (p->u.params.rateunit == SCHED_CLASS_RATEUNIT_BITS) - fw_rateunit = FW_SCHED_PARAMS_UNIT_BITRATE; - else if (p->u.params.rateunit == SCHED_CLASS_RATEUNIT_PKTS) - fw_rateunit = FW_SCHED_PARAMS_UNIT_PKTRATE; - else { - rc = EINVAL; - goto done; - } + pi = sc->port[sc->chan_map[p->channel]]; + if (pi == NULL) + return (ENXIO); + MPASS(pi->tx_chan == p->channel); + top_speed = port_top_speed(pi) * 1000000; /* Gbps -> Kbps */ - if (p->u.params.ratemode == SCHED_CLASS_RATEMODE_REL) - fw_ratemode = FW_SCHED_PARAMS_RATE_REL; - else if (p->u.params.ratemode == SCHED_CLASS_RATEMODE_ABS) - fw_ratemode = FW_SCHED_PARAMS_RATE_ABS; - else { - rc = EINVAL; - goto done; - } + if (!in_range(p->cl, 0, sc->chip_params->nsched_cls) || + !in_range(p->minrate, 0, top_speed) || + !in_range(p->maxrate, 0, top_speed) || + !in_range(p->weight, 0, 100)) + return (ERANGE); - /* Vet our parameters ... */ - if (!in_range(p->u.params.channel, 0, 3) || - !in_range(p->u.params.cl, 0, sc->chip_params->nsched_cls) || - !in_range(p->u.params.minrate, 0, 10000000) || - !in_range(p->u.params.maxrate, 0, 10000000) || - !in_range(p->u.params.weight, 0, 100)) { - rc = ERANGE; - goto done; - } + /* + * Translate any unset parameters into the firmware's + * nomenclature and/or fail the call if the parameters + * are required ... + */ + if (p->rateunit < 0 || p->ratemode < 0 || p->channel < 0 || p->cl < 0) + return (EINVAL); + if (p->minrate < 0) + p->minrate = 0; + if (p->maxrate < 0) { + if (p->level == SCHED_CLASS_LEVEL_CL_RL || + p->level == SCHED_CLASS_LEVEL_CH_RL) + return (EINVAL); + else + p->maxrate = 0; + } + if (p->weight < 0) { + if (p->level == SCHED_CLASS_LEVEL_CL_WRR) + return (EINVAL); + else + p->weight = 0; + } + if (p->pktsize < 0) { + if (p->level == SCHED_CLASS_LEVEL_CL_RL || + p->level == SCHED_CLASS_LEVEL_CH_RL) + return (EINVAL); + else + p->pktsize = 0; + } + + rc = begin_synchronized_op(sc, NULL, + sleep_ok ? (SLEEP_OK | INTR_OK) : HOLD_LOCK, "t4sscp"); + if (rc) + return (rc); + tc = &pi->tc[p->cl]; + tc->params = *p; + rc = -t4_sched_params(sc, FW_SCHED_TYPE_PKTSCHED, fw_level, fw_mode, + fw_rateunit, fw_ratemode, p->channel, p->cl, p->minrate, p->maxrate, + p->weight, p->pktsize, sleep_ok); + if (rc == 0) + tc->flags |= TX_SC_OK; + else { /* - * Translate any unset parameters into the firmware's - * nomenclature and/or fail the call if the parameters - * are required ... + * Unknown state at this point, see tc->params for what was + * attempted. */ - if (p->u.params.rateunit < 0 || p->u.params.ratemode < 0 || - p->u.params.channel < 0 || p->u.params.cl < 0) { - rc = EINVAL; - goto done; - } - if (p->u.params.minrate < 0) - p->u.params.minrate = 0; - if (p->u.params.maxrate < 0) { - if (p->u.params.level == SCHED_CLASS_LEVEL_CL_RL || - p->u.params.level == SCHED_CLASS_LEVEL_CH_RL) { - rc = EINVAL; - goto done; - } else - p->u.params.maxrate = 0; - } - if (p->u.params.weight < 0) { - if (p->u.params.level == SCHED_CLASS_LEVEL_CL_WRR) { - rc = EINVAL; - goto done; - } else - p->u.params.weight = 0; - } - if (p->u.params.pktsize < 0) { - if (p->u.params.level == SCHED_CLASS_LEVEL_CL_RL || - p->u.params.level == SCHED_CLASS_LEVEL_CH_RL) { - rc = EINVAL; - goto done; - } else - p->u.params.pktsize = 0; - } - - /* See what the firmware thinks of the request ... */ - rc = -t4_sched_params(sc, fw_type, fw_level, fw_mode, - fw_rateunit, fw_ratemode, p->u.params.channel, - p->u.params.cl, p->u.params.minrate, p->u.params.maxrate, - p->u.params.weight, p->u.params.pktsize, 1); - goto done; + tc->flags &= ~TX_SC_OK; } + end_synchronized_op(sc, sleep_ok ? 0 : LOCK_HELD); - rc = EINVAL; -done: - end_synchronized_op(sc, 0); return (rc); } static int +set_sched_class(struct adapter *sc, struct t4_sched_params *p) +{ + + if (p->type != SCHED_CLASS_TYPE_PACKET) + return (EINVAL); + + if (p->subcmd == SCHED_CLASS_SUBCMD_CONFIG) + return (set_sched_class_config(sc, p->u.config.minmax)); + + if (p->subcmd == SCHED_CLASS_SUBCMD_PARAMS) + return (set_sched_class_params(sc, &p->u.params, 1)); + + return (EINVAL); +} + +static int set_sched_queue(struct adapter *sc, struct t4_sched_queue *p) { struct port_info *pi = NULL; struct vi_info *vi; struct sge_txq *txq; uint32_t fw_mnem, fw_queue, fw_class; int i, rc; rc = begin_synchronized_op(sc, NULL, SLEEP_OK | INTR_OK, "t4setsq"); if (rc) return (rc); - if (!(sc->flags & FULL_INIT_DONE)) { - rc = EAGAIN; - goto done; - } - if (p->port >= sc->params.nports) { rc = EINVAL; goto done; } /* XXX: Only supported for the main VI. */ pi = sc->port[p->port]; vi = &pi->vi[0]; - if (!in_range(p->queue, 0, vi->ntxq - 1) || !in_range(p->cl, 0, 7)) { + if (!(vi->flags & VI_INIT_DONE)) { + /* tx queues not set up yet */ + rc = EAGAIN; + goto done; + } + + if (!in_range(p->queue, 0, vi->ntxq - 1) || + !in_range(p->cl, 0, sc->chip_params->nsched_cls - 1)) { rc = EINVAL; goto done; } /* * Create a template for the FW_PARAMS_CMD mnemonic and value (TX * Scheduling Class in this case). */ fw_mnem = (V_FW_PARAMS_MNEM(FW_PARAMS_MNEM_DMAQ) | V_FW_PARAMS_PARAM_X(FW_PARAMS_PARAM_DMAQ_EQ_SCHEDCLASS_ETH)); fw_class = p->cl < 0 ? 0xffffffff : p->cl; /* * If op.queue is non-negative, then we're only changing the scheduling * on a single specified TX queue. */ if (p->queue >= 0) { txq = &sc->sge.txq[vi->first_txq + p->queue]; fw_queue = (fw_mnem | V_FW_PARAMS_PARAM_YZ(txq->eq.cntxt_id)); rc = -t4_set_params(sc, sc->mbox, sc->pf, 0, 1, &fw_queue, &fw_class); goto done; } /* * Change the scheduling on all the TX queues for the * interface. */ for_each_txq(vi, i, txq) { fw_queue = (fw_mnem | V_FW_PARAMS_PARAM_YZ(txq->eq.cntxt_id)); rc = -t4_set_params(sc, sc->mbox, sc->pf, 0, 1, &fw_queue, &fw_class); if (rc) goto done; } rc = 0; done: end_synchronized_op(sc, 0); return (rc); } int t4_os_find_pci_capability(struct adapter *sc, int cap) { int i; return (pci_find_cap(sc->dev, cap, &i) == 0 ? i : 0); } int t4_os_pci_save_state(struct adapter *sc) { device_t dev; struct pci_devinfo *dinfo; dev = sc->dev; dinfo = device_get_ivars(dev); pci_cfg_save(dev, dinfo, 0); return (0); } int t4_os_pci_restore_state(struct adapter *sc) { device_t dev; struct pci_devinfo *dinfo; dev = sc->dev; dinfo = device_get_ivars(dev); pci_cfg_restore(dev, dinfo); return (0); } void t4_os_portmod_changed(const struct adapter *sc, int idx) { struct port_info *pi = sc->port[idx]; struct vi_info *vi; struct ifnet *ifp; int v; static const char *mod_str[] = { NULL, "LR", "SR", "ER", "TWINAX", "active TWINAX", "LRM" }; for_each_vi(pi, v, vi) { build_medialist(pi, &vi->media); } ifp = pi->vi[0].ifp; if (pi->mod_type == FW_PORT_MOD_TYPE_NONE) if_printf(ifp, "transceiver unplugged.\n"); else if (pi->mod_type == FW_PORT_MOD_TYPE_UNKNOWN) if_printf(ifp, "unknown transceiver inserted.\n"); else if (pi->mod_type == FW_PORT_MOD_TYPE_NOTSUPPORTED) if_printf(ifp, "unsupported transceiver inserted.\n"); else if (pi->mod_type > 0 && pi->mod_type < nitems(mod_str)) { if_printf(ifp, "%s transceiver inserted.\n", mod_str[pi->mod_type]); } else { if_printf(ifp, "transceiver (type %d) inserted.\n", pi->mod_type); } } void t4_os_link_changed(struct adapter *sc, int idx, int link_stat, int reason) { struct port_info *pi = sc->port[idx]; struct vi_info *vi; struct ifnet *ifp; int v; if (link_stat) pi->linkdnrc = -1; else { if (reason >= 0) pi->linkdnrc = reason; } for_each_vi(pi, v, vi) { ifp = vi->ifp; if (ifp == NULL) continue; if (link_stat) { ifp->if_baudrate = IF_Mbps(pi->link_cfg.speed); if_link_state_change(ifp, LINK_STATE_UP); } else { if_link_state_change(ifp, LINK_STATE_DOWN); } } } void t4_iterate(void (*func)(struct adapter *, void *), void *arg) { struct adapter *sc; sx_slock(&t4_list_lock); SLIST_FOREACH(sc, &t4_list, link) { /* * func should not make any assumptions about what state sc is * in - the only guarantee is that sc->sc_lock is a valid lock. */ func(sc, arg); } sx_sunlock(&t4_list_lock); } static int t4_open(struct cdev *dev, int flags, int type, struct thread *td) { return (0); } static int t4_close(struct cdev *dev, int flags, int type, struct thread *td) { return (0); } static int t4_ioctl(struct cdev *dev, unsigned long cmd, caddr_t data, int fflag, struct thread *td) { int rc; struct adapter *sc = dev->si_drv1; rc = priv_check(td, PRIV_DRIVER); if (rc != 0) return (rc); switch (cmd) { case CHELSIO_T4_GETREG: { struct t4_reg *edata = (struct t4_reg *)data; if ((edata->addr & 0x3) != 0 || edata->addr >= sc->mmio_len) return (EFAULT); if (edata->size == 4) edata->val = t4_read_reg(sc, edata->addr); else if (edata->size == 8) edata->val = t4_read_reg64(sc, edata->addr); else return (EINVAL); break; } case CHELSIO_T4_SETREG: { struct t4_reg *edata = (struct t4_reg *)data; if ((edata->addr & 0x3) != 0 || edata->addr >= sc->mmio_len) return (EFAULT); if (edata->size == 4) { if (edata->val & 0xffffffff00000000) return (EINVAL); t4_write_reg(sc, edata->addr, (uint32_t) edata->val); } else if (edata->size == 8) t4_write_reg64(sc, edata->addr, edata->val); else return (EINVAL); break; } case CHELSIO_T4_REGDUMP: { struct t4_regdump *regs = (struct t4_regdump *)data; int reglen = is_t4(sc) ? T4_REGDUMP_SIZE : T5_REGDUMP_SIZE; uint8_t *buf; if (regs->len < reglen) { regs->len = reglen; /* hint to the caller */ return (ENOBUFS); } regs->len = reglen; buf = malloc(reglen, M_CXGBE, M_WAITOK | M_ZERO); get_regs(sc, regs, buf); rc = copyout(buf, regs->data, reglen); free(buf, M_CXGBE); break; } case CHELSIO_T4_GET_FILTER_MODE: rc = get_filter_mode(sc, (uint32_t *)data); break; case CHELSIO_T4_SET_FILTER_MODE: rc = set_filter_mode(sc, *(uint32_t *)data); break; case CHELSIO_T4_GET_FILTER: rc = get_filter(sc, (struct t4_filter *)data); break; case CHELSIO_T4_SET_FILTER: rc = set_filter(sc, (struct t4_filter *)data); break; case CHELSIO_T4_DEL_FILTER: rc = del_filter(sc, (struct t4_filter *)data); break; case CHELSIO_T4_GET_SGE_CONTEXT: rc = get_sge_context(sc, (struct t4_sge_context *)data); break; case CHELSIO_T4_LOAD_FW: rc = load_fw(sc, (struct t4_data *)data); break; case CHELSIO_T4_GET_MEM: rc = read_card_mem(sc, 2, (struct t4_mem_range *)data); break; case CHELSIO_T4_GET_I2C: rc = read_i2c(sc, (struct t4_i2c_data *)data); break; case CHELSIO_T4_CLEAR_STATS: { int i, v; u_int port_id = *(uint32_t *)data; struct port_info *pi; struct vi_info *vi; if (port_id >= sc->params.nports) return (EINVAL); pi = sc->port[port_id]; /* MAC stats */ t4_clr_port_stats(sc, pi->tx_chan); pi->tx_parse_error = 0; mtx_lock(&sc->reg_lock); for_each_vi(pi, v, vi) { if (vi->flags & VI_INIT_DONE) t4_clr_vi_stats(sc, vi->viid); } mtx_unlock(&sc->reg_lock); /* * Since this command accepts a port, clear stats for * all VIs on this port. */ for_each_vi(pi, v, vi) { if (vi->flags & VI_INIT_DONE) { struct sge_rxq *rxq; struct sge_txq *txq; struct sge_wrq *wrq; if (vi->flags & VI_NETMAP) continue; for_each_rxq(vi, i, rxq) { #if defined(INET) || defined(INET6) rxq->lro.lro_queued = 0; rxq->lro.lro_flushed = 0; #endif rxq->rxcsum = 0; rxq->vlan_extraction = 0; } for_each_txq(vi, i, txq) { txq->txcsum = 0; txq->tso_wrs = 0; txq->vlan_insertion = 0; txq->imm_wrs = 0; txq->sgl_wrs = 0; txq->txpkt_wrs = 0; txq->txpkts0_wrs = 0; txq->txpkts1_wrs = 0; txq->txpkts0_pkts = 0; txq->txpkts1_pkts = 0; mp_ring_reset_stats(txq->r); } #ifdef TCP_OFFLOAD /* nothing to clear for each ofld_rxq */ for_each_ofld_txq(vi, i, wrq) { wrq->tx_wrs_direct = 0; wrq->tx_wrs_copied = 0; } #endif if (IS_MAIN_VI(vi)) { wrq = &sc->sge.ctrlq[pi->port_id]; wrq->tx_wrs_direct = 0; wrq->tx_wrs_copied = 0; } } } break; } case CHELSIO_T4_SCHED_CLASS: rc = set_sched_class(sc, (struct t4_sched_params *)data); break; case CHELSIO_T4_SCHED_QUEUE: rc = set_sched_queue(sc, (struct t4_sched_queue *)data); break; case CHELSIO_T4_GET_TRACER: rc = t4_get_tracer(sc, (struct t4_tracer *)data); break; case CHELSIO_T4_SET_TRACER: rc = t4_set_tracer(sc, (struct t4_tracer *)data); break; default: rc = EINVAL; } return (rc); } void t4_db_full(struct adapter *sc) { CXGBE_UNIMPLEMENTED(__func__); } void t4_db_dropped(struct adapter *sc) { CXGBE_UNIMPLEMENTED(__func__); } #ifdef TCP_OFFLOAD void t4_iscsi_init(struct adapter *sc, u_int tag_mask, const u_int *pgsz_order) { t4_write_reg(sc, A_ULP_RX_ISCSI_TAGMASK, tag_mask); t4_write_reg(sc, A_ULP_RX_ISCSI_PSZ, V_HPZ0(pgsz_order[0]) | V_HPZ1(pgsz_order[1]) | V_HPZ2(pgsz_order[2]) | V_HPZ3(pgsz_order[3])); } static int toe_capability(struct vi_info *vi, int enable) { int rc; struct port_info *pi = vi->pi; struct adapter *sc = pi->adapter; ASSERT_SYNCHRONIZED_OP(sc); if (!is_offload(sc)) return (ENODEV); if (enable) { if ((vi->ifp->if_capenable & IFCAP_TOE) != 0) { /* TOE is already enabled. */ return (0); } /* * We need the port's queues around so that we're able to send * and receive CPLs to/from the TOE even if the ifnet for this * port has never been UP'd administratively. */ if (!(vi->flags & VI_INIT_DONE)) { rc = cxgbe_init_synchronized(vi); if (rc) return (rc); } if (!(pi->vi[0].flags & VI_INIT_DONE)) { rc = cxgbe_init_synchronized(&pi->vi[0]); if (rc) return (rc); } if (isset(&sc->offload_map, pi->port_id)) { /* TOE is enabled on another VI of this port. */ pi->uld_vis++; return (0); } if (!uld_active(sc, ULD_TOM)) { rc = t4_activate_uld(sc, ULD_TOM); if (rc == EAGAIN) { log(LOG_WARNING, "You must kldload t4_tom.ko before trying " "to enable TOE on a cxgbe interface.\n"); } if (rc != 0) return (rc); KASSERT(sc->tom_softc != NULL, ("%s: TOM activated but softc NULL", __func__)); KASSERT(uld_active(sc, ULD_TOM), ("%s: TOM activated but flag not set", __func__)); } /* Activate iWARP and iSCSI too, if the modules are loaded. */ if (!uld_active(sc, ULD_IWARP)) (void) t4_activate_uld(sc, ULD_IWARP); if (!uld_active(sc, ULD_ISCSI)) (void) t4_activate_uld(sc, ULD_ISCSI); pi->uld_vis++; setbit(&sc->offload_map, pi->port_id); } else { pi->uld_vis--; if (!isset(&sc->offload_map, pi->port_id) || pi->uld_vis > 0) return (0); KASSERT(uld_active(sc, ULD_TOM), ("%s: TOM never initialized?", __func__)); clrbit(&sc->offload_map, pi->port_id); } return (0); } /* * Add an upper layer driver to the global list. */ int t4_register_uld(struct uld_info *ui) { int rc = 0; struct uld_info *u; sx_xlock(&t4_uld_list_lock); SLIST_FOREACH(u, &t4_uld_list, link) { if (u->uld_id == ui->uld_id) { rc = EEXIST; goto done; } } SLIST_INSERT_HEAD(&t4_uld_list, ui, link); ui->refcount = 0; done: sx_xunlock(&t4_uld_list_lock); return (rc); } int t4_unregister_uld(struct uld_info *ui) { int rc = EINVAL; struct uld_info *u; sx_xlock(&t4_uld_list_lock); SLIST_FOREACH(u, &t4_uld_list, link) { if (u == ui) { if (ui->refcount > 0) { rc = EBUSY; goto done; } SLIST_REMOVE(&t4_uld_list, ui, uld_info, link); rc = 0; goto done; } } done: sx_xunlock(&t4_uld_list_lock); return (rc); } int t4_activate_uld(struct adapter *sc, int id) { int rc; struct uld_info *ui; ASSERT_SYNCHRONIZED_OP(sc); if (id < 0 || id > ULD_MAX) return (EINVAL); rc = EAGAIN; /* kldoad the module with this ULD and try again. */ sx_slock(&t4_uld_list_lock); SLIST_FOREACH(ui, &t4_uld_list, link) { if (ui->uld_id == id) { if (!(sc->flags & FULL_INIT_DONE)) { rc = adapter_full_init(sc); if (rc != 0) break; } rc = ui->activate(sc); if (rc == 0) { setbit(&sc->active_ulds, id); ui->refcount++; } break; } } sx_sunlock(&t4_uld_list_lock); return (rc); } int t4_deactivate_uld(struct adapter *sc, int id) { int rc; struct uld_info *ui; ASSERT_SYNCHRONIZED_OP(sc); if (id < 0 || id > ULD_MAX) return (EINVAL); rc = ENXIO; sx_slock(&t4_uld_list_lock); SLIST_FOREACH(ui, &t4_uld_list, link) { if (ui->uld_id == id) { rc = ui->deactivate(sc); if (rc == 0) { clrbit(&sc->active_ulds, id); ui->refcount--; } break; } } sx_sunlock(&t4_uld_list_lock); return (rc); } int uld_active(struct adapter *sc, int uld_id) { MPASS(uld_id >= 0 && uld_id <= ULD_MAX); return (isset(&sc->active_ulds, uld_id)); } #endif /* * Come up with reasonable defaults for some of the tunables, provided they're * not set by the user (in which case we'll use the values as is). */ static void tweak_tunables(void) { int nc = mp_ncpus; /* our snapshot of the number of CPUs */ if (t4_ntxq10g < 1) { #ifdef RSS t4_ntxq10g = rss_getnumbuckets(); #else t4_ntxq10g = min(nc, NTXQ_10G); #endif } if (t4_ntxq1g < 1) { #ifdef RSS /* XXX: way too many for 1GbE? */ t4_ntxq1g = rss_getnumbuckets(); #else t4_ntxq1g = min(nc, NTXQ_1G); #endif } if (t4_nrxq10g < 1) { #ifdef RSS t4_nrxq10g = rss_getnumbuckets(); #else t4_nrxq10g = min(nc, NRXQ_10G); #endif } if (t4_nrxq1g < 1) { #ifdef RSS /* XXX: way too many for 1GbE? */ t4_nrxq1g = rss_getnumbuckets(); #else t4_nrxq1g = min(nc, NRXQ_1G); #endif } #ifdef TCP_OFFLOAD if (t4_nofldtxq10g < 1) t4_nofldtxq10g = min(nc, NOFLDTXQ_10G); if (t4_nofldtxq1g < 1) t4_nofldtxq1g = min(nc, NOFLDTXQ_1G); if (t4_nofldrxq10g < 1) t4_nofldrxq10g = min(nc, NOFLDRXQ_10G); if (t4_nofldrxq1g < 1) t4_nofldrxq1g = min(nc, NOFLDRXQ_1G); if (t4_toecaps_allowed == -1) t4_toecaps_allowed = FW_CAPS_CONFIG_TOE; if (t4_rdmacaps_allowed == -1) { t4_rdmacaps_allowed = FW_CAPS_CONFIG_RDMA_RDDP | FW_CAPS_CONFIG_RDMA_RDMAC; } if (t4_iscsicaps_allowed == -1) { t4_iscsicaps_allowed = FW_CAPS_CONFIG_ISCSI_INITIATOR_PDU | FW_CAPS_CONFIG_ISCSI_TARGET_PDU | FW_CAPS_CONFIG_ISCSI_T10DIF; } #else if (t4_toecaps_allowed == -1) t4_toecaps_allowed = 0; if (t4_rdmacaps_allowed == -1) t4_rdmacaps_allowed = 0; if (t4_iscsicaps_allowed == -1) t4_iscsicaps_allowed = 0; #endif #ifdef DEV_NETMAP if (t4_nnmtxq10g < 1) t4_nnmtxq10g = min(nc, NNMTXQ_10G); if (t4_nnmtxq1g < 1) t4_nnmtxq1g = min(nc, NNMTXQ_1G); if (t4_nnmrxq10g < 1) t4_nnmrxq10g = min(nc, NNMRXQ_10G); if (t4_nnmrxq1g < 1) t4_nnmrxq1g = min(nc, NNMRXQ_1G); #endif if (t4_tmr_idx_10g < 0 || t4_tmr_idx_10g >= SGE_NTIMERS) t4_tmr_idx_10g = TMR_IDX_10G; if (t4_pktc_idx_10g < -1 || t4_pktc_idx_10g >= SGE_NCOUNTERS) t4_pktc_idx_10g = PKTC_IDX_10G; if (t4_tmr_idx_1g < 0 || t4_tmr_idx_1g >= SGE_NTIMERS) t4_tmr_idx_1g = TMR_IDX_1G; if (t4_pktc_idx_1g < -1 || t4_pktc_idx_1g >= SGE_NCOUNTERS) t4_pktc_idx_1g = PKTC_IDX_1G; if (t4_qsize_txq < 128) t4_qsize_txq = 128; if (t4_qsize_rxq < 128) t4_qsize_rxq = 128; while (t4_qsize_rxq & 7) t4_qsize_rxq++; t4_intr_types &= INTR_MSIX | INTR_MSI | INTR_INTX; } #ifdef DDB static void t4_dump_tcb(struct adapter *sc, int tid) { uint32_t base, i, j, off, pf, reg, save, tcb_addr, win_pos; reg = PCIE_MEM_ACCESS_REG(A_PCIE_MEM_ACCESS_OFFSET, 2); save = t4_read_reg(sc, reg); base = sc->memwin[2].mw_base; /* Dump TCB for the tid */ tcb_addr = t4_read_reg(sc, A_TP_CMM_TCB_BASE); tcb_addr += tid * TCB_SIZE; if (is_t4(sc)) { pf = 0; win_pos = tcb_addr & ~0xf; /* start must be 16B aligned */ } else { pf = V_PFNUM(sc->pf); win_pos = tcb_addr & ~0x7f; /* start must be 128B aligned */ } t4_write_reg(sc, reg, win_pos | pf); t4_read_reg(sc, reg); off = tcb_addr - win_pos; for (i = 0; i < 4; i++) { uint32_t buf[8]; for (j = 0; j < 8; j++, off += 4) buf[j] = htonl(t4_read_reg(sc, base + off)); db_printf("%08x %08x %08x %08x %08x %08x %08x %08x\n", buf[0], buf[1], buf[2], buf[3], buf[4], buf[5], buf[6], buf[7]); } t4_write_reg(sc, reg, save); t4_read_reg(sc, reg); } static void t4_dump_devlog(struct adapter *sc) { struct devlog_params *dparams = &sc->params.devlog; struct fw_devlog_e e; int i, first, j, m, nentries, rc; uint64_t ftstamp = UINT64_MAX; if (dparams->start == 0) { db_printf("devlog params not valid\n"); return; } nentries = dparams->size / sizeof(struct fw_devlog_e); m = fwmtype_to_hwmtype(dparams->memtype); /* Find the first entry. */ first = -1; for (i = 0; i < nentries && !db_pager_quit; i++) { rc = -t4_mem_read(sc, m, dparams->start + i * sizeof(e), sizeof(e), (void *)&e); if (rc != 0) break; if (e.timestamp == 0) break; e.timestamp = be64toh(e.timestamp); if (e.timestamp < ftstamp) { ftstamp = e.timestamp; first = i; } } if (first == -1) return; i = first; do { rc = -t4_mem_read(sc, m, dparams->start + i * sizeof(e), sizeof(e), (void *)&e); if (rc != 0) return; if (e.timestamp == 0) return; e.timestamp = be64toh(e.timestamp); e.seqno = be32toh(e.seqno); for (j = 0; j < 8; j++) e.params[j] = be32toh(e.params[j]); db_printf("%10d %15ju %8s %8s ", e.seqno, e.timestamp, (e.level < nitems(devlog_level_strings) ? devlog_level_strings[e.level] : "UNKNOWN"), (e.facility < nitems(devlog_facility_strings) ? devlog_facility_strings[e.facility] : "UNKNOWN")); db_printf(e.fmt, e.params[0], e.params[1], e.params[2], e.params[3], e.params[4], e.params[5], e.params[6], e.params[7]); if (++i == nentries) i = 0; } while (i != first && !db_pager_quit); } static struct command_table db_t4_table = LIST_HEAD_INITIALIZER(db_t4_table); _DB_SET(_show, t4, NULL, db_show_table, 0, &db_t4_table); DB_FUNC(devlog, db_show_devlog, db_t4_table, CS_OWN, NULL) { device_t dev; int t; bool valid; valid = false; t = db_read_token(); if (t == tIDENT) { dev = device_lookup_by_name(db_tok_string); valid = true; } db_skip_to_eol(); if (!valid) { db_printf("usage: show t4 devlog \n"); return; } if (dev == NULL) { db_printf("device not found\n"); return; } t4_dump_devlog(device_get_softc(dev)); } DB_FUNC(tcb, db_show_t4tcb, db_t4_table, CS_OWN, NULL) { device_t dev; int radix, tid, t; bool valid; valid = false; radix = db_radix; db_radix = 10; t = db_read_token(); if (t == tIDENT) { dev = device_lookup_by_name(db_tok_string); t = db_read_token(); if (t == tNUMBER) { tid = db_tok_number; valid = true; } } db_radix = radix; db_skip_to_eol(); if (!valid) { db_printf("usage: show t4 tcb \n"); return; } if (dev == NULL) { db_printf("device not found\n"); return; } if (tid < 0) { db_printf("invalid tid\n"); return; } t4_dump_tcb(device_get_softc(dev), tid); } #endif static struct sx mlu; /* mod load unload */ SX_SYSINIT(cxgbe_mlu, &mlu, "cxgbe mod load/unload"); static int mod_event(module_t mod, int cmd, void *arg) { int rc = 0; static int loaded = 0; switch (cmd) { case MOD_LOAD: sx_xlock(&mlu); if (loaded++ == 0) { t4_sge_modload(); sx_init(&t4_list_lock, "T4/T5 adapters"); SLIST_INIT(&t4_list); #ifdef TCP_OFFLOAD sx_init(&t4_uld_list_lock, "T4/T5 ULDs"); SLIST_INIT(&t4_uld_list); #endif t4_tracer_modload(); tweak_tunables(); } sx_xunlock(&mlu); break; case MOD_UNLOAD: sx_xlock(&mlu); if (--loaded == 0) { int tries; sx_slock(&t4_list_lock); if (!SLIST_EMPTY(&t4_list)) { rc = EBUSY; sx_sunlock(&t4_list_lock); goto done_unload; } #ifdef TCP_OFFLOAD sx_slock(&t4_uld_list_lock); if (!SLIST_EMPTY(&t4_uld_list)) { rc = EBUSY; sx_sunlock(&t4_uld_list_lock); sx_sunlock(&t4_list_lock); goto done_unload; } #endif tries = 0; while (tries++ < 5 && t4_sge_extfree_refs() != 0) { uprintf("%ju clusters with custom free routine " "still is use.\n", t4_sge_extfree_refs()); pause("t4unload", 2 * hz); } #ifdef TCP_OFFLOAD sx_sunlock(&t4_uld_list_lock); #endif sx_sunlock(&t4_list_lock); if (t4_sge_extfree_refs() == 0) { t4_tracer_modunload(); #ifdef TCP_OFFLOAD sx_destroy(&t4_uld_list_lock); #endif sx_destroy(&t4_list_lock); t4_sge_modunload(); loaded = 0; } else { rc = EBUSY; loaded++; /* undo earlier decrement */ } } done_unload: sx_xunlock(&mlu); break; } return (rc); } static devclass_t t4_devclass, t5_devclass; static devclass_t cxgbe_devclass, cxl_devclass; static devclass_t vcxgbe_devclass, vcxl_devclass; DRIVER_MODULE(t4nex, pci, t4_driver, t4_devclass, mod_event, 0); MODULE_VERSION(t4nex, 1); MODULE_DEPEND(t4nex, firmware, 1, 1, 1); #ifdef DEV_NETMAP MODULE_DEPEND(t4nex, netmap, 1, 1, 1); #endif /* DEV_NETMAP */ DRIVER_MODULE(t5nex, pci, t5_driver, t5_devclass, mod_event, 0); MODULE_VERSION(t5nex, 1); MODULE_DEPEND(t5nex, firmware, 1, 1, 1); #ifdef DEV_NETMAP MODULE_DEPEND(t5nex, netmap, 1, 1, 1); #endif /* DEV_NETMAP */ DRIVER_MODULE(cxgbe, t4nex, cxgbe_driver, cxgbe_devclass, 0, 0); MODULE_VERSION(cxgbe, 1); DRIVER_MODULE(cxl, t5nex, cxl_driver, cxl_devclass, 0, 0); MODULE_VERSION(cxl, 1); DRIVER_MODULE(vcxgbe, cxgbe, vcxgbe_driver, vcxgbe_devclass, 0, 0); MODULE_VERSION(vcxgbe, 1); DRIVER_MODULE(vcxl, cxl, vcxl_driver, vcxl_devclass, 0, 0); MODULE_VERSION(vcxl, 1); Index: projects/vnet/sys/dev/e1000/if_igb.c =================================================================== --- projects/vnet/sys/dev/e1000/if_igb.c (revision 301546) +++ projects/vnet/sys/dev/e1000/if_igb.c (revision 301547) @@ -1,6440 +1,6440 @@ /****************************************************************************** Copyright (c) 2001-2015, Intel Corporation All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ******************************************************************************/ /*$FreeBSD$*/ #include "opt_inet.h" #include "opt_inet6.h" #include "opt_rss.h" #ifdef HAVE_KERNEL_OPTION_HEADERS #include "opt_device_polling.h" #include "opt_altq.h" #endif #include "if_igb.h" /********************************************************************* * Driver version: *********************************************************************/ char igb_driver_version[] = "2.5.3-k"; /********************************************************************* * PCI Device ID Table * * Used by probe to select devices to load on * Last field stores an index into e1000_strings * Last entry must be all 0s * * { Vendor ID, Device ID, SubVendor ID, SubDevice ID, String Index } *********************************************************************/ static igb_vendor_info_t igb_vendor_info_array[] = { {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82575EB_COPPER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82575EB_FIBER_SERDES, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82575GB_QUAD_COPPER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576_NS, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576_NS_SERDES, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576_FIBER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576_SERDES, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576_SERDES_QUAD, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576_QUAD_COPPER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576_QUAD_COPPER_ET2, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82576_VF, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82580_COPPER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82580_FIBER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82580_SERDES, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82580_SGMII, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82580_COPPER_DUAL, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_82580_QUAD_FIBER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_DH89XXCC_SERDES, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_DH89XXCC_SGMII, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_DH89XXCC_SFP, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_DH89XXCC_BACKPLANE, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I350_COPPER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I350_FIBER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I350_SERDES, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I350_SGMII, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I350_VF, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I210_COPPER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I210_COPPER_IT, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I210_COPPER_OEM1, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I210_COPPER_FLASHLESS, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I210_SERDES_FLASHLESS, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I210_FIBER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I210_SERDES, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I210_SGMII, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I211_COPPER, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I354_BACKPLANE_1GBPS, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I354_BACKPLANE_2_5GBPS, 0, 0, 0}, {IGB_INTEL_VENDOR_ID, E1000_DEV_ID_I354_SGMII, 0, 0, 0}, /* required last entry */ {0, 0, 0, 0, 0} }; /********************************************************************* * Table of branding strings for all supported NICs. *********************************************************************/ static char *igb_strings[] = { "Intel(R) PRO/1000 Network Connection" }; /********************************************************************* * Function prototypes *********************************************************************/ static int igb_probe(device_t); static int igb_attach(device_t); static int igb_detach(device_t); static int igb_shutdown(device_t); static int igb_suspend(device_t); static int igb_resume(device_t); #ifndef IGB_LEGACY_TX static int igb_mq_start(struct ifnet *, struct mbuf *); static int igb_mq_start_locked(struct ifnet *, struct tx_ring *); static void igb_qflush(struct ifnet *); static void igb_deferred_mq_start(void *, int); #else static void igb_start(struct ifnet *); static void igb_start_locked(struct tx_ring *, struct ifnet *ifp); #endif static int igb_ioctl(struct ifnet *, u_long, caddr_t); static uint64_t igb_get_counter(if_t, ift_counter); static void igb_init(void *); static void igb_init_locked(struct adapter *); static void igb_stop(void *); static void igb_media_status(struct ifnet *, struct ifmediareq *); static int igb_media_change(struct ifnet *); static void igb_identify_hardware(struct adapter *); static int igb_allocate_pci_resources(struct adapter *); static int igb_allocate_msix(struct adapter *); static int igb_allocate_legacy(struct adapter *); static int igb_setup_msix(struct adapter *); static void igb_free_pci_resources(struct adapter *); static void igb_local_timer(void *); static void igb_reset(struct adapter *); static int igb_setup_interface(device_t, struct adapter *); static int igb_allocate_queues(struct adapter *); static void igb_configure_queues(struct adapter *); static int igb_allocate_transmit_buffers(struct tx_ring *); static void igb_setup_transmit_structures(struct adapter *); static void igb_setup_transmit_ring(struct tx_ring *); static void igb_initialize_transmit_units(struct adapter *); static void igb_free_transmit_structures(struct adapter *); static void igb_free_transmit_buffers(struct tx_ring *); static int igb_allocate_receive_buffers(struct rx_ring *); static int igb_setup_receive_structures(struct adapter *); static int igb_setup_receive_ring(struct rx_ring *); static void igb_initialize_receive_units(struct adapter *); static void igb_free_receive_structures(struct adapter *); static void igb_free_receive_buffers(struct rx_ring *); static void igb_free_receive_ring(struct rx_ring *); static void igb_enable_intr(struct adapter *); static void igb_disable_intr(struct adapter *); static void igb_update_stats_counters(struct adapter *); static bool igb_txeof(struct tx_ring *); static __inline void igb_rx_discard(struct rx_ring *, int); static __inline void igb_rx_input(struct rx_ring *, struct ifnet *, struct mbuf *, u32); static bool igb_rxeof(struct igb_queue *, int, int *); static void igb_rx_checksum(u32, struct mbuf *, u32); static int igb_tx_ctx_setup(struct tx_ring *, struct mbuf *, u32 *, u32 *); static int igb_tso_setup(struct tx_ring *, struct mbuf *, u32 *, u32 *); static void igb_set_promisc(struct adapter *); static void igb_disable_promisc(struct adapter *); static void igb_set_multi(struct adapter *); static void igb_update_link_status(struct adapter *); static void igb_refresh_mbufs(struct rx_ring *, int); static void igb_register_vlan(void *, struct ifnet *, u16); static void igb_unregister_vlan(void *, struct ifnet *, u16); static void igb_setup_vlan_hw_support(struct adapter *); static int igb_xmit(struct tx_ring *, struct mbuf **); static int igb_dma_malloc(struct adapter *, bus_size_t, struct igb_dma_alloc *, int); static void igb_dma_free(struct adapter *, struct igb_dma_alloc *); static int igb_sysctl_nvm_info(SYSCTL_HANDLER_ARGS); static void igb_print_nvm_info(struct adapter *); static int igb_is_valid_ether_addr(u8 *); static void igb_add_hw_stats(struct adapter *); static void igb_vf_init_stats(struct adapter *); static void igb_update_vf_stats_counters(struct adapter *); /* Management and WOL Support */ static void igb_init_manageability(struct adapter *); static void igb_release_manageability(struct adapter *); static void igb_get_hw_control(struct adapter *); static void igb_release_hw_control(struct adapter *); static void igb_enable_wakeup(device_t); static void igb_led_func(void *, int); static int igb_irq_fast(void *); static void igb_msix_que(void *); static void igb_msix_link(void *); static void igb_handle_que(void *context, int pending); static void igb_handle_link(void *context, int pending); static void igb_handle_link_locked(struct adapter *); static void igb_set_sysctl_value(struct adapter *, const char *, const char *, int *, int); static int igb_set_flowcntl(SYSCTL_HANDLER_ARGS); static int igb_sysctl_dmac(SYSCTL_HANDLER_ARGS); static int igb_sysctl_eee(SYSCTL_HANDLER_ARGS); #ifdef DEVICE_POLLING static poll_handler_t igb_poll; #endif /* POLLING */ /********************************************************************* * FreeBSD Device Interface Entry Points *********************************************************************/ static device_method_t igb_methods[] = { /* Device interface */ DEVMETHOD(device_probe, igb_probe), DEVMETHOD(device_attach, igb_attach), DEVMETHOD(device_detach, igb_detach), DEVMETHOD(device_shutdown, igb_shutdown), DEVMETHOD(device_suspend, igb_suspend), DEVMETHOD(device_resume, igb_resume), DEVMETHOD_END }; static driver_t igb_driver = { "igb", igb_methods, sizeof(struct adapter), }; static devclass_t igb_devclass; DRIVER_MODULE(igb, pci, igb_driver, igb_devclass, 0, 0); MODULE_DEPEND(igb, pci, 1, 1, 1); MODULE_DEPEND(igb, ether, 1, 1, 1); #ifdef DEV_NETMAP MODULE_DEPEND(igb, netmap, 1, 1, 1); #endif /* DEV_NETMAP */ /********************************************************************* * Tunable default values. *********************************************************************/ static SYSCTL_NODE(_hw, OID_AUTO, igb, CTLFLAG_RD, 0, "IGB driver parameters"); /* Descriptor defaults */ static int igb_rxd = IGB_DEFAULT_RXD; static int igb_txd = IGB_DEFAULT_TXD; SYSCTL_INT(_hw_igb, OID_AUTO, rxd, CTLFLAG_RDTUN, &igb_rxd, 0, "Number of receive descriptors per queue"); SYSCTL_INT(_hw_igb, OID_AUTO, txd, CTLFLAG_RDTUN, &igb_txd, 0, "Number of transmit descriptors per queue"); /* ** AIM: Adaptive Interrupt Moderation ** which means that the interrupt rate ** is varied over time based on the ** traffic for that interrupt vector */ static int igb_enable_aim = TRUE; SYSCTL_INT(_hw_igb, OID_AUTO, enable_aim, CTLFLAG_RWTUN, &igb_enable_aim, 0, "Enable adaptive interrupt moderation"); /* * MSIX should be the default for best performance, * but this allows it to be forced off for testing. */ static int igb_enable_msix = 1; SYSCTL_INT(_hw_igb, OID_AUTO, enable_msix, CTLFLAG_RDTUN, &igb_enable_msix, 0, "Enable MSI-X interrupts"); /* ** Tuneable Interrupt rate */ static int igb_max_interrupt_rate = 8000; SYSCTL_INT(_hw_igb, OID_AUTO, max_interrupt_rate, CTLFLAG_RDTUN, &igb_max_interrupt_rate, 0, "Maximum interrupts per second"); #ifndef IGB_LEGACY_TX /* ** Tuneable number of buffers in the buf-ring (drbr_xxx) */ static int igb_buf_ring_size = IGB_BR_SIZE; SYSCTL_INT(_hw_igb, OID_AUTO, buf_ring_size, CTLFLAG_RDTUN, &igb_buf_ring_size, 0, "Size of the bufring"); #endif /* ** Header split causes the packet header to ** be dma'd to a separate mbuf from the payload. ** this can have memory alignment benefits. But ** another plus is that small packets often fit ** into the header and thus use no cluster. Its ** a very workload dependent type feature. */ static int igb_header_split = FALSE; SYSCTL_INT(_hw_igb, OID_AUTO, header_split, CTLFLAG_RDTUN, &igb_header_split, 0, "Enable receive mbuf header split"); /* ** This will autoconfigure based on the ** number of CPUs and max supported ** MSIX messages if left at 0. */ static int igb_num_queues = 0; SYSCTL_INT(_hw_igb, OID_AUTO, num_queues, CTLFLAG_RDTUN, &igb_num_queues, 0, "Number of queues to configure, 0 indicates autoconfigure"); /* ** Global variable to store last used CPU when binding queues ** to CPUs in igb_allocate_msix. Starts at CPU_FIRST and increments when a ** queue is bound to a cpu. */ static int igb_last_bind_cpu = -1; /* How many packets rxeof tries to clean at a time */ static int igb_rx_process_limit = 100; SYSCTL_INT(_hw_igb, OID_AUTO, rx_process_limit, CTLFLAG_RDTUN, &igb_rx_process_limit, 0, "Maximum number of received packets to process at a time, -1 means unlimited"); /* How many packets txeof tries to clean at a time */ static int igb_tx_process_limit = -1; SYSCTL_INT(_hw_igb, OID_AUTO, tx_process_limit, CTLFLAG_RDTUN, &igb_tx_process_limit, 0, "Maximum number of sent packets to process at a time, -1 means unlimited"); #ifdef DEV_NETMAP /* see ixgbe.c for details */ #include #endif /* DEV_NETMAP */ /********************************************************************* * Device identification routine * * igb_probe determines if the driver should be loaded on * adapter based on PCI vendor/device id of the adapter. * * return BUS_PROBE_DEFAULT on success, positive on failure *********************************************************************/ static int igb_probe(device_t dev) { char adapter_name[256]; uint16_t pci_vendor_id = 0; uint16_t pci_device_id = 0; uint16_t pci_subvendor_id = 0; uint16_t pci_subdevice_id = 0; igb_vendor_info_t *ent; INIT_DEBUGOUT("igb_probe: begin"); pci_vendor_id = pci_get_vendor(dev); if (pci_vendor_id != IGB_INTEL_VENDOR_ID) return (ENXIO); pci_device_id = pci_get_device(dev); pci_subvendor_id = pci_get_subvendor(dev); pci_subdevice_id = pci_get_subdevice(dev); ent = igb_vendor_info_array; while (ent->vendor_id != 0) { if ((pci_vendor_id == ent->vendor_id) && (pci_device_id == ent->device_id) && ((pci_subvendor_id == ent->subvendor_id) || (ent->subvendor_id == 0)) && ((pci_subdevice_id == ent->subdevice_id) || (ent->subdevice_id == 0))) { sprintf(adapter_name, "%s, Version - %s", igb_strings[ent->index], igb_driver_version); device_set_desc_copy(dev, adapter_name); return (BUS_PROBE_DEFAULT); } ent++; } return (ENXIO); } /********************************************************************* * Device initialization routine * * The attach entry point is called when the driver is being loaded. * This routine identifies the type of hardware, allocates all resources * and initializes the hardware. * * return 0 on success, positive on failure *********************************************************************/ static int igb_attach(device_t dev) { struct adapter *adapter; int error = 0; u16 eeprom_data; INIT_DEBUGOUT("igb_attach: begin"); if (resource_disabled("igb", device_get_unit(dev))) { device_printf(dev, "Disabled by device hint\n"); return (ENXIO); } adapter = device_get_softc(dev); adapter->dev = adapter->osdep.dev = dev; IGB_CORE_LOCK_INIT(adapter, device_get_nameunit(dev)); /* SYSCTLs */ SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO, "nvm", CTLTYPE_INT|CTLFLAG_RW, adapter, 0, igb_sysctl_nvm_info, "I", "NVM Information"); igb_set_sysctl_value(adapter, "enable_aim", "Interrupt Moderation", &adapter->enable_aim, igb_enable_aim); SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO, "fc", CTLTYPE_INT|CTLFLAG_RW, adapter, 0, igb_set_flowcntl, "I", "Flow Control"); callout_init_mtx(&adapter->timer, &adapter->core_mtx, 0); /* Determine hardware and mac info */ igb_identify_hardware(adapter); /* Setup PCI resources */ if (igb_allocate_pci_resources(adapter)) { device_printf(dev, "Allocation of PCI resources failed\n"); error = ENXIO; goto err_pci; } /* Do Shared Code initialization */ if (e1000_setup_init_funcs(&adapter->hw, TRUE)) { device_printf(dev, "Setup of Shared code failed\n"); error = ENXIO; goto err_pci; } e1000_get_bus_info(&adapter->hw); /* Sysctls for limiting the amount of work done in the taskqueues */ igb_set_sysctl_value(adapter, "rx_processing_limit", "max number of rx packets to process", &adapter->rx_process_limit, igb_rx_process_limit); igb_set_sysctl_value(adapter, "tx_processing_limit", "max number of tx packets to process", &adapter->tx_process_limit, igb_tx_process_limit); /* * Validate number of transmit and receive descriptors. It * must not exceed hardware maximum, and must be multiple * of E1000_DBA_ALIGN. */ if (((igb_txd * sizeof(struct e1000_tx_desc)) % IGB_DBA_ALIGN) != 0 || (igb_txd > IGB_MAX_TXD) || (igb_txd < IGB_MIN_TXD)) { device_printf(dev, "Using %d TX descriptors instead of %d!\n", IGB_DEFAULT_TXD, igb_txd); adapter->num_tx_desc = IGB_DEFAULT_TXD; } else adapter->num_tx_desc = igb_txd; if (((igb_rxd * sizeof(struct e1000_rx_desc)) % IGB_DBA_ALIGN) != 0 || (igb_rxd > IGB_MAX_RXD) || (igb_rxd < IGB_MIN_RXD)) { device_printf(dev, "Using %d RX descriptors instead of %d!\n", IGB_DEFAULT_RXD, igb_rxd); adapter->num_rx_desc = IGB_DEFAULT_RXD; } else adapter->num_rx_desc = igb_rxd; adapter->hw.mac.autoneg = DO_AUTO_NEG; adapter->hw.phy.autoneg_wait_to_complete = FALSE; adapter->hw.phy.autoneg_advertised = AUTONEG_ADV_DEFAULT; /* Copper options */ if (adapter->hw.phy.media_type == e1000_media_type_copper) { adapter->hw.phy.mdix = AUTO_ALL_MODES; adapter->hw.phy.disable_polarity_correction = FALSE; adapter->hw.phy.ms_type = IGB_MASTER_SLAVE; } /* * Set the frame limits assuming * standard ethernet sized frames. */ adapter->max_frame_size = ETHERMTU + ETHER_HDR_LEN + ETHERNET_FCS_SIZE; /* ** Allocate and Setup Queues */ if (igb_allocate_queues(adapter)) { error = ENOMEM; goto err_pci; } /* Allocate the appropriate stats memory */ if (adapter->vf_ifp) { adapter->stats = (struct e1000_vf_stats *)malloc(sizeof \ (struct e1000_vf_stats), M_DEVBUF, M_NOWAIT | M_ZERO); igb_vf_init_stats(adapter); } else adapter->stats = (struct e1000_hw_stats *)malloc(sizeof \ (struct e1000_hw_stats), M_DEVBUF, M_NOWAIT | M_ZERO); if (adapter->stats == NULL) { device_printf(dev, "Can not allocate stats memory\n"); error = ENOMEM; goto err_late; } /* Allocate multicast array memory. */ adapter->mta = malloc(sizeof(u8) * ETH_ADDR_LEN * MAX_NUM_MULTICAST_ADDRESSES, M_DEVBUF, M_NOWAIT); if (adapter->mta == NULL) { device_printf(dev, "Can not allocate multicast setup array\n"); error = ENOMEM; goto err_late; } /* Some adapter-specific advanced features */ if (adapter->hw.mac.type >= e1000_i350) { SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO, "dmac", CTLTYPE_INT|CTLFLAG_RW, adapter, 0, igb_sysctl_dmac, "I", "DMA Coalesce"); SYSCTL_ADD_PROC(device_get_sysctl_ctx(dev), SYSCTL_CHILDREN(device_get_sysctl_tree(dev)), OID_AUTO, "eee_disabled", CTLTYPE_INT|CTLFLAG_RW, adapter, 0, igb_sysctl_eee, "I", "Disable Energy Efficient Ethernet"); if (adapter->hw.phy.media_type == e1000_media_type_copper) { if (adapter->hw.mac.type == e1000_i354) e1000_set_eee_i354(&adapter->hw, TRUE, TRUE); else e1000_set_eee_i350(&adapter->hw, TRUE, TRUE); } } /* ** Start from a known state, this is ** important in reading the nvm and ** mac from that. */ e1000_reset_hw(&adapter->hw); /* Make sure we have a good EEPROM before we read from it */ if (((adapter->hw.mac.type != e1000_i210) && (adapter->hw.mac.type != e1000_i211)) && (e1000_validate_nvm_checksum(&adapter->hw) < 0)) { /* ** Some PCI-E parts fail the first check due to ** the link being in sleep state, call it again, ** if it fails a second time its a real issue. */ if (e1000_validate_nvm_checksum(&adapter->hw) < 0) { device_printf(dev, "The EEPROM Checksum Is Not Valid\n"); error = EIO; goto err_late; } } /* ** Copy the permanent MAC address out of the EEPROM */ if (e1000_read_mac_addr(&adapter->hw) < 0) { device_printf(dev, "EEPROM read error while reading MAC" " address\n"); error = EIO; goto err_late; } /* Check its sanity */ if (!igb_is_valid_ether_addr(adapter->hw.mac.addr)) { device_printf(dev, "Invalid MAC address\n"); error = EIO; goto err_late; } /* Setup OS specific network interface */ if (igb_setup_interface(dev, adapter) != 0) goto err_late; /* Now get a good starting state */ igb_reset(adapter); /* Initialize statistics */ igb_update_stats_counters(adapter); adapter->hw.mac.get_link_status = 1; igb_update_link_status(adapter); /* Indicate SOL/IDER usage */ if (e1000_check_reset_block(&adapter->hw)) device_printf(dev, "PHY reset is blocked due to SOL/IDER session.\n"); /* Determine if we have to control management hardware */ adapter->has_manage = e1000_enable_mng_pass_thru(&adapter->hw); /* * Setup Wake-on-Lan */ /* APME bit in EEPROM is mapped to WUC.APME */ eeprom_data = E1000_READ_REG(&adapter->hw, E1000_WUC) & E1000_WUC_APME; if (eeprom_data) adapter->wol = E1000_WUFC_MAG; /* Register for VLAN events */ adapter->vlan_attach = EVENTHANDLER_REGISTER(vlan_config, igb_register_vlan, adapter, EVENTHANDLER_PRI_FIRST); adapter->vlan_detach = EVENTHANDLER_REGISTER(vlan_unconfig, igb_unregister_vlan, adapter, EVENTHANDLER_PRI_FIRST); igb_add_hw_stats(adapter); /* Tell the stack that the interface is not active */ adapter->ifp->if_drv_flags &= ~IFF_DRV_RUNNING; adapter->ifp->if_drv_flags |= IFF_DRV_OACTIVE; adapter->led_dev = led_create(igb_led_func, adapter, device_get_nameunit(dev)); /* ** Configure Interrupts */ if ((adapter->msix > 1) && (igb_enable_msix)) error = igb_allocate_msix(adapter); else /* MSI or Legacy */ error = igb_allocate_legacy(adapter); if (error) goto err_late; #ifdef DEV_NETMAP igb_netmap_attach(adapter); #endif /* DEV_NETMAP */ INIT_DEBUGOUT("igb_attach: end"); return (0); err_late: if (igb_detach(dev) == 0) /* igb_detach() already did the cleanup */ return(error); igb_free_transmit_structures(adapter); igb_free_receive_structures(adapter); igb_release_hw_control(adapter); err_pci: igb_free_pci_resources(adapter); if (adapter->ifp != NULL) if_free(adapter->ifp); free(adapter->mta, M_DEVBUF); IGB_CORE_LOCK_DESTROY(adapter); return (error); } /********************************************************************* * Device removal routine * * The detach entry point is called when the driver is being removed. * This routine stops the adapter and deallocates all the resources * that were allocated for driver operation. * * return 0 on success, positive on failure *********************************************************************/ static int igb_detach(device_t dev) { struct adapter *adapter = device_get_softc(dev); struct ifnet *ifp = adapter->ifp; INIT_DEBUGOUT("igb_detach: begin"); /* Make sure VLANS are not using driver */ if (adapter->ifp->if_vlantrunk != NULL) { device_printf(dev,"Vlan in use, detach first\n"); return (EBUSY); } ether_ifdetach(adapter->ifp); if (adapter->led_dev != NULL) led_destroy(adapter->led_dev); #ifdef DEVICE_POLLING if (ifp->if_capenable & IFCAP_POLLING) ether_poll_deregister(ifp); #endif IGB_CORE_LOCK(adapter); adapter->in_detach = 1; igb_stop(adapter); IGB_CORE_UNLOCK(adapter); e1000_phy_hw_reset(&adapter->hw); /* Give control back to firmware */ igb_release_manageability(adapter); igb_release_hw_control(adapter); if (adapter->wol) { E1000_WRITE_REG(&adapter->hw, E1000_WUC, E1000_WUC_PME_EN); E1000_WRITE_REG(&adapter->hw, E1000_WUFC, adapter->wol); igb_enable_wakeup(dev); } /* Unregister VLAN events */ if (adapter->vlan_attach != NULL) EVENTHANDLER_DEREGISTER(vlan_config, adapter->vlan_attach); if (adapter->vlan_detach != NULL) EVENTHANDLER_DEREGISTER(vlan_unconfig, adapter->vlan_detach); callout_drain(&adapter->timer); #ifdef DEV_NETMAP netmap_detach(adapter->ifp); #endif /* DEV_NETMAP */ igb_free_pci_resources(adapter); bus_generic_detach(dev); if_free(ifp); igb_free_transmit_structures(adapter); igb_free_receive_structures(adapter); if (adapter->mta != NULL) free(adapter->mta, M_DEVBUF); IGB_CORE_LOCK_DESTROY(adapter); return (0); } /********************************************************************* * * Shutdown entry point * **********************************************************************/ static int igb_shutdown(device_t dev) { return igb_suspend(dev); } /* * Suspend/resume device methods. */ static int igb_suspend(device_t dev) { struct adapter *adapter = device_get_softc(dev); IGB_CORE_LOCK(adapter); igb_stop(adapter); igb_release_manageability(adapter); igb_release_hw_control(adapter); if (adapter->wol) { E1000_WRITE_REG(&adapter->hw, E1000_WUC, E1000_WUC_PME_EN); E1000_WRITE_REG(&adapter->hw, E1000_WUFC, adapter->wol); igb_enable_wakeup(dev); } IGB_CORE_UNLOCK(adapter); return bus_generic_suspend(dev); } static int igb_resume(device_t dev) { struct adapter *adapter = device_get_softc(dev); struct tx_ring *txr = adapter->tx_rings; struct ifnet *ifp = adapter->ifp; IGB_CORE_LOCK(adapter); igb_init_locked(adapter); igb_init_manageability(adapter); if ((ifp->if_flags & IFF_UP) && (ifp->if_drv_flags & IFF_DRV_RUNNING) && adapter->link_active) { for (int i = 0; i < adapter->num_queues; i++, txr++) { IGB_TX_LOCK(txr); #ifndef IGB_LEGACY_TX /* Process the stack queue only if not depleted */ if (((txr->queue_status & IGB_QUEUE_DEPLETED) == 0) && !drbr_empty(ifp, txr->br)) igb_mq_start_locked(ifp, txr); #else if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) igb_start_locked(txr, ifp); #endif IGB_TX_UNLOCK(txr); } } IGB_CORE_UNLOCK(adapter); return bus_generic_resume(dev); } #ifdef IGB_LEGACY_TX /********************************************************************* * Transmit entry point * * igb_start is called by the stack to initiate a transmit. * The driver will remain in this routine as long as there are * packets to transmit and transmit resources are available. * In case resources are not available stack is notified and * the packet is requeued. **********************************************************************/ static void igb_start_locked(struct tx_ring *txr, struct ifnet *ifp) { struct adapter *adapter = ifp->if_softc; struct mbuf *m_head; IGB_TX_LOCK_ASSERT(txr); if ((ifp->if_drv_flags & (IFF_DRV_RUNNING|IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) return; if (!adapter->link_active) return; /* Call cleanup if number of TX descriptors low */ if (txr->tx_avail <= IGB_TX_CLEANUP_THRESHOLD) igb_txeof(txr); while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) { if (txr->tx_avail <= IGB_MAX_SCATTER) { txr->queue_status |= IGB_QUEUE_DEPLETED; break; } IFQ_DRV_DEQUEUE(&ifp->if_snd, m_head); if (m_head == NULL) break; /* * Encapsulation can modify our pointer, and or make it * NULL on failure. In that event, we can't requeue. */ if (igb_xmit(txr, &m_head)) { if (m_head != NULL) IFQ_DRV_PREPEND(&ifp->if_snd, m_head); if (txr->tx_avail <= IGB_MAX_SCATTER) txr->queue_status |= IGB_QUEUE_DEPLETED; break; } /* Send a copy of the frame to the BPF listener */ ETHER_BPF_MTAP(ifp, m_head); /* Set watchdog on */ txr->watchdog_time = ticks; txr->queue_status |= IGB_QUEUE_WORKING; } } /* * Legacy TX driver routine, called from the * stack, always uses tx[0], and spins for it. * Should not be used with multiqueue tx */ static void igb_start(struct ifnet *ifp) { struct adapter *adapter = ifp->if_softc; struct tx_ring *txr = adapter->tx_rings; if (ifp->if_drv_flags & IFF_DRV_RUNNING) { IGB_TX_LOCK(txr); igb_start_locked(txr, ifp); IGB_TX_UNLOCK(txr); } return; } #else /* ~IGB_LEGACY_TX */ /* ** Multiqueue Transmit Entry: ** quick turnaround to the stack ** */ static int igb_mq_start(struct ifnet *ifp, struct mbuf *m) { struct adapter *adapter = ifp->if_softc; struct igb_queue *que; struct tx_ring *txr; int i, err = 0; #ifdef RSS uint32_t bucket_id; #endif /* Which queue to use */ /* * When doing RSS, map it to the same outbound queue * as the incoming flow would be mapped to. * * If everything is setup correctly, it should be the * same bucket that the current CPU we're on is. */ if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) { #ifdef RSS if (rss_hash2bucket(m->m_pkthdr.flowid, M_HASHTYPE_GET(m), &bucket_id) == 0) { /* XXX TODO: spit out something if bucket_id > num_queues? */ i = bucket_id % adapter->num_queues; } else { #endif i = m->m_pkthdr.flowid % adapter->num_queues; #ifdef RSS } #endif } else { i = curcpu % adapter->num_queues; } txr = &adapter->tx_rings[i]; que = &adapter->queues[i]; err = drbr_enqueue(ifp, txr->br, m); if (err) return (err); if (IGB_TX_TRYLOCK(txr)) { igb_mq_start_locked(ifp, txr); IGB_TX_UNLOCK(txr); } else taskqueue_enqueue(que->tq, &txr->txq_task); return (0); } static int igb_mq_start_locked(struct ifnet *ifp, struct tx_ring *txr) { struct adapter *adapter = txr->adapter; struct mbuf *next; int err = 0, enq = 0; IGB_TX_LOCK_ASSERT(txr); if (((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) || adapter->link_active == 0) return (ENETDOWN); /* Process the queue */ while ((next = drbr_peek(ifp, txr->br)) != NULL) { if ((err = igb_xmit(txr, &next)) != 0) { if (next == NULL) { /* It was freed, move forward */ drbr_advance(ifp, txr->br); } else { /* * Still have one left, it may not be * the same since the transmit function * may have changed it. */ drbr_putback(ifp, txr->br, next); } break; } drbr_advance(ifp, txr->br); enq++; if (next->m_flags & M_MCAST && adapter->vf_ifp) if_inc_counter(ifp, IFCOUNTER_OMCASTS, 1); ETHER_BPF_MTAP(ifp, next); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) break; } if (enq > 0) { /* Set the watchdog */ txr->queue_status |= IGB_QUEUE_WORKING; txr->watchdog_time = ticks; } if (txr->tx_avail <= IGB_TX_CLEANUP_THRESHOLD) igb_txeof(txr); if (txr->tx_avail <= IGB_MAX_SCATTER) txr->queue_status |= IGB_QUEUE_DEPLETED; return (err); } /* * Called from a taskqueue to drain queued transmit packets. */ static void igb_deferred_mq_start(void *arg, int pending) { struct tx_ring *txr = arg; struct adapter *adapter = txr->adapter; struct ifnet *ifp = adapter->ifp; IGB_TX_LOCK(txr); if (!drbr_empty(ifp, txr->br)) igb_mq_start_locked(ifp, txr); IGB_TX_UNLOCK(txr); } /* ** Flush all ring buffers */ static void igb_qflush(struct ifnet *ifp) { struct adapter *adapter = ifp->if_softc; struct tx_ring *txr = adapter->tx_rings; struct mbuf *m; for (int i = 0; i < adapter->num_queues; i++, txr++) { IGB_TX_LOCK(txr); while ((m = buf_ring_dequeue_sc(txr->br)) != NULL) m_freem(m); IGB_TX_UNLOCK(txr); } if_qflush(ifp); } #endif /* ~IGB_LEGACY_TX */ /********************************************************************* * Ioctl entry point * * igb_ioctl is called when the user wants to configure the * interface. * * return 0 on success, positive on failure **********************************************************************/ static int igb_ioctl(struct ifnet *ifp, u_long command, caddr_t data) { struct adapter *adapter = ifp->if_softc; struct ifreq *ifr = (struct ifreq *)data; #if defined(INET) || defined(INET6) struct ifaddr *ifa = (struct ifaddr *)data; #endif bool avoid_reset = FALSE; int error = 0; if (adapter->in_detach) return (error); switch (command) { case SIOCSIFADDR: #ifdef INET if (ifa->ifa_addr->sa_family == AF_INET) avoid_reset = TRUE; #endif #ifdef INET6 if (ifa->ifa_addr->sa_family == AF_INET6) avoid_reset = TRUE; #endif /* ** Calling init results in link renegotiation, ** so we avoid doing it when possible. */ if (avoid_reset) { ifp->if_flags |= IFF_UP; if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) igb_init(adapter); #ifdef INET if (!(ifp->if_flags & IFF_NOARP)) arp_ifinit(ifp, ifa); #endif } else error = ether_ioctl(ifp, command, data); break; case SIOCSIFMTU: { int max_frame_size; IOCTL_DEBUGOUT("ioctl rcv'd: SIOCSIFMTU (Set Interface MTU)"); IGB_CORE_LOCK(adapter); max_frame_size = 9234; if (ifr->ifr_mtu > max_frame_size - ETHER_HDR_LEN - ETHER_CRC_LEN) { IGB_CORE_UNLOCK(adapter); error = EINVAL; break; } ifp->if_mtu = ifr->ifr_mtu; adapter->max_frame_size = ifp->if_mtu + ETHER_HDR_LEN + ETHER_CRC_LEN; igb_init_locked(adapter); IGB_CORE_UNLOCK(adapter); break; } case SIOCSIFFLAGS: IOCTL_DEBUGOUT("ioctl rcv'd:\ SIOCSIFFLAGS (Set Interface Flags)"); IGB_CORE_LOCK(adapter); if (ifp->if_flags & IFF_UP) { if ((ifp->if_drv_flags & IFF_DRV_RUNNING)) { if ((ifp->if_flags ^ adapter->if_flags) & (IFF_PROMISC | IFF_ALLMULTI)) { igb_disable_promisc(adapter); igb_set_promisc(adapter); } } else igb_init_locked(adapter); } else if (ifp->if_drv_flags & IFF_DRV_RUNNING) igb_stop(adapter); adapter->if_flags = ifp->if_flags; IGB_CORE_UNLOCK(adapter); break; case SIOCADDMULTI: case SIOCDELMULTI: IOCTL_DEBUGOUT("ioctl rcv'd: SIOC(ADD|DEL)MULTI"); if (ifp->if_drv_flags & IFF_DRV_RUNNING) { IGB_CORE_LOCK(adapter); igb_disable_intr(adapter); igb_set_multi(adapter); #ifdef DEVICE_POLLING if (!(ifp->if_capenable & IFCAP_POLLING)) #endif igb_enable_intr(adapter); IGB_CORE_UNLOCK(adapter); } break; case SIOCSIFMEDIA: /* Check SOL/IDER usage */ IGB_CORE_LOCK(adapter); if (e1000_check_reset_block(&adapter->hw)) { IGB_CORE_UNLOCK(adapter); device_printf(adapter->dev, "Media change is" " blocked due to SOL/IDER session.\n"); break; } IGB_CORE_UNLOCK(adapter); case SIOCGIFMEDIA: IOCTL_DEBUGOUT("ioctl rcv'd: \ SIOCxIFMEDIA (Get/Set Interface Media)"); error = ifmedia_ioctl(ifp, ifr, &adapter->media, command); break; case SIOCSIFCAP: { int mask, reinit; IOCTL_DEBUGOUT("ioctl rcv'd: SIOCSIFCAP (Set Capabilities)"); reinit = 0; mask = ifr->ifr_reqcap ^ ifp->if_capenable; #ifdef DEVICE_POLLING if (mask & IFCAP_POLLING) { if (ifr->ifr_reqcap & IFCAP_POLLING) { error = ether_poll_register(igb_poll, ifp); if (error) return (error); IGB_CORE_LOCK(adapter); igb_disable_intr(adapter); ifp->if_capenable |= IFCAP_POLLING; IGB_CORE_UNLOCK(adapter); } else { error = ether_poll_deregister(ifp); /* Enable interrupt even in error case */ IGB_CORE_LOCK(adapter); igb_enable_intr(adapter); ifp->if_capenable &= ~IFCAP_POLLING; IGB_CORE_UNLOCK(adapter); } } #endif #if __FreeBSD_version >= 1000000 /* HW cannot turn these on/off separately */ if (mask & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6)) { ifp->if_capenable ^= IFCAP_RXCSUM; ifp->if_capenable ^= IFCAP_RXCSUM_IPV6; reinit = 1; } if (mask & IFCAP_TXCSUM) { ifp->if_capenable ^= IFCAP_TXCSUM; reinit = 1; } if (mask & IFCAP_TXCSUM_IPV6) { ifp->if_capenable ^= IFCAP_TXCSUM_IPV6; reinit = 1; } #else if (mask & IFCAP_HWCSUM) { ifp->if_capenable ^= IFCAP_HWCSUM; reinit = 1; } #endif if (mask & IFCAP_TSO4) { ifp->if_capenable ^= IFCAP_TSO4; reinit = 1; } if (mask & IFCAP_TSO6) { ifp->if_capenable ^= IFCAP_TSO6; reinit = 1; } if (mask & IFCAP_VLAN_HWTAGGING) { ifp->if_capenable ^= IFCAP_VLAN_HWTAGGING; reinit = 1; } if (mask & IFCAP_VLAN_HWFILTER) { ifp->if_capenable ^= IFCAP_VLAN_HWFILTER; reinit = 1; } if (mask & IFCAP_VLAN_HWTSO) { ifp->if_capenable ^= IFCAP_VLAN_HWTSO; reinit = 1; } if (mask & IFCAP_LRO) { ifp->if_capenable ^= IFCAP_LRO; reinit = 1; } if (reinit && (ifp->if_drv_flags & IFF_DRV_RUNNING)) igb_init(adapter); VLAN_CAPABILITIES(ifp); break; } default: error = ether_ioctl(ifp, command, data); break; } return (error); } /********************************************************************* * Init entry point * * This routine is used in two ways. It is used by the stack as * init entry point in network interface structure. It is also used * by the driver as a hw/sw initialization routine to get to a * consistent state. * * return 0 on success, positive on failure **********************************************************************/ static void igb_init_locked(struct adapter *adapter) { struct ifnet *ifp = adapter->ifp; device_t dev = adapter->dev; INIT_DEBUGOUT("igb_init: begin"); IGB_CORE_LOCK_ASSERT(adapter); igb_disable_intr(adapter); callout_stop(&adapter->timer); /* Get the latest mac address, User can use a LAA */ bcopy(IF_LLADDR(adapter->ifp), adapter->hw.mac.addr, ETHER_ADDR_LEN); /* Put the address into the Receive Address Array */ e1000_rar_set(&adapter->hw, adapter->hw.mac.addr, 0); igb_reset(adapter); igb_update_link_status(adapter); E1000_WRITE_REG(&adapter->hw, E1000_VET, ETHERTYPE_VLAN); /* Set hardware offload abilities */ ifp->if_hwassist = 0; if (ifp->if_capenable & IFCAP_TXCSUM) { #if __FreeBSD_version >= 1000000 ifp->if_hwassist |= (CSUM_IP_TCP | CSUM_IP_UDP); if (adapter->hw.mac.type != e1000_82575) ifp->if_hwassist |= CSUM_IP_SCTP; #else ifp->if_hwassist |= (CSUM_TCP | CSUM_UDP); #if __FreeBSD_version >= 800000 if (adapter->hw.mac.type != e1000_82575) ifp->if_hwassist |= CSUM_SCTP; #endif #endif } #if __FreeBSD_version >= 1000000 if (ifp->if_capenable & IFCAP_TXCSUM_IPV6) { ifp->if_hwassist |= (CSUM_IP6_TCP | CSUM_IP6_UDP); if (adapter->hw.mac.type != e1000_82575) ifp->if_hwassist |= CSUM_IP6_SCTP; } #endif if (ifp->if_capenable & IFCAP_TSO) ifp->if_hwassist |= CSUM_TSO; /* Clear bad data from Rx FIFOs */ e1000_rx_fifo_flush_82575(&adapter->hw); /* Configure for OS presence */ igb_init_manageability(adapter); /* Prepare transmit descriptors and buffers */ igb_setup_transmit_structures(adapter); igb_initialize_transmit_units(adapter); /* Setup Multicast table */ igb_set_multi(adapter); /* ** Figure out the desired mbuf pool ** for doing jumbo/packetsplit */ if (adapter->max_frame_size <= 2048) adapter->rx_mbuf_sz = MCLBYTES; else if (adapter->max_frame_size <= 4096) adapter->rx_mbuf_sz = MJUMPAGESIZE; else adapter->rx_mbuf_sz = MJUM9BYTES; /* Prepare receive descriptors and buffers */ if (igb_setup_receive_structures(adapter)) { device_printf(dev, "Could not setup receive structures\n"); return; } igb_initialize_receive_units(adapter); /* Enable VLAN support */ if (ifp->if_capenable & IFCAP_VLAN_HWTAGGING) igb_setup_vlan_hw_support(adapter); /* Don't lose promiscuous settings */ igb_set_promisc(adapter); ifp->if_drv_flags |= IFF_DRV_RUNNING; ifp->if_drv_flags &= ~IFF_DRV_OACTIVE; callout_reset(&adapter->timer, hz, igb_local_timer, adapter); e1000_clear_hw_cntrs_base_generic(&adapter->hw); if (adapter->msix > 1) /* Set up queue routing */ igb_configure_queues(adapter); /* this clears any pending interrupts */ E1000_READ_REG(&adapter->hw, E1000_ICR); #ifdef DEVICE_POLLING /* * Only enable interrupts if we are not polling, make sure * they are off otherwise. */ if (ifp->if_capenable & IFCAP_POLLING) igb_disable_intr(adapter); else #endif /* DEVICE_POLLING */ { igb_enable_intr(adapter); E1000_WRITE_REG(&adapter->hw, E1000_ICS, E1000_ICS_LSC); } /* Set Energy Efficient Ethernet */ if (adapter->hw.phy.media_type == e1000_media_type_copper) { if (adapter->hw.mac.type == e1000_i354) e1000_set_eee_i354(&adapter->hw, TRUE, TRUE); else e1000_set_eee_i350(&adapter->hw, TRUE, TRUE); } } static void igb_init(void *arg) { struct adapter *adapter = arg; IGB_CORE_LOCK(adapter); igb_init_locked(adapter); IGB_CORE_UNLOCK(adapter); } static void igb_handle_que(void *context, int pending) { struct igb_queue *que = context; struct adapter *adapter = que->adapter; struct tx_ring *txr = que->txr; struct ifnet *ifp = adapter->ifp; if (ifp->if_drv_flags & IFF_DRV_RUNNING) { bool more; more = igb_rxeof(que, adapter->rx_process_limit, NULL); IGB_TX_LOCK(txr); igb_txeof(txr); #ifndef IGB_LEGACY_TX /* Process the stack queue only if not depleted */ if (((txr->queue_status & IGB_QUEUE_DEPLETED) == 0) && !drbr_empty(ifp, txr->br)) igb_mq_start_locked(ifp, txr); #else if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) igb_start_locked(txr, ifp); #endif IGB_TX_UNLOCK(txr); /* Do we need another? */ if (more) { taskqueue_enqueue(que->tq, &que->que_task); return; } } #ifdef DEVICE_POLLING if (ifp->if_capenable & IFCAP_POLLING) return; #endif /* Reenable this interrupt */ if (que->eims) E1000_WRITE_REG(&adapter->hw, E1000_EIMS, que->eims); else igb_enable_intr(adapter); } /* Deal with link in a sleepable context */ static void igb_handle_link(void *context, int pending) { struct adapter *adapter = context; IGB_CORE_LOCK(adapter); igb_handle_link_locked(adapter); IGB_CORE_UNLOCK(adapter); } static void igb_handle_link_locked(struct adapter *adapter) { struct tx_ring *txr = adapter->tx_rings; struct ifnet *ifp = adapter->ifp; IGB_CORE_LOCK_ASSERT(adapter); adapter->hw.mac.get_link_status = 1; igb_update_link_status(adapter); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) && adapter->link_active) { for (int i = 0; i < adapter->num_queues; i++, txr++) { IGB_TX_LOCK(txr); #ifndef IGB_LEGACY_TX /* Process the stack queue only if not depleted */ if (((txr->queue_status & IGB_QUEUE_DEPLETED) == 0) && !drbr_empty(ifp, txr->br)) igb_mq_start_locked(ifp, txr); #else if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) igb_start_locked(txr, ifp); #endif IGB_TX_UNLOCK(txr); } } } /********************************************************************* * * MSI/Legacy Deferred * Interrupt Service routine * *********************************************************************/ static int igb_irq_fast(void *arg) { struct adapter *adapter = arg; struct igb_queue *que = adapter->queues; u32 reg_icr; reg_icr = E1000_READ_REG(&adapter->hw, E1000_ICR); /* Hot eject? */ if (reg_icr == 0xffffffff) return FILTER_STRAY; /* Definitely not our interrupt. */ if (reg_icr == 0x0) return FILTER_STRAY; if ((reg_icr & E1000_ICR_INT_ASSERTED) == 0) return FILTER_STRAY; /* * Mask interrupts until the taskqueue is finished running. This is * cheap, just assume that it is needed. This also works around the * MSI message reordering errata on certain systems. */ igb_disable_intr(adapter); taskqueue_enqueue(que->tq, &que->que_task); /* Link status change */ if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC)) taskqueue_enqueue(que->tq, &adapter->link_task); if (reg_icr & E1000_ICR_RXO) adapter->rx_overruns++; return FILTER_HANDLED; } #ifdef DEVICE_POLLING #if __FreeBSD_version >= 800000 #define POLL_RETURN_COUNT(a) (a) static int #else #define POLL_RETURN_COUNT(a) static void #endif igb_poll(struct ifnet *ifp, enum poll_cmd cmd, int count) { struct adapter *adapter = ifp->if_softc; struct igb_queue *que; struct tx_ring *txr; u32 reg_icr, rx_done = 0; u32 loop = IGB_MAX_LOOP; bool more; IGB_CORE_LOCK(adapter); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) { IGB_CORE_UNLOCK(adapter); return POLL_RETURN_COUNT(rx_done); } if (cmd == POLL_AND_CHECK_STATUS) { reg_icr = E1000_READ_REG(&adapter->hw, E1000_ICR); /* Link status change */ if (reg_icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC)) igb_handle_link_locked(adapter); if (reg_icr & E1000_ICR_RXO) adapter->rx_overruns++; } IGB_CORE_UNLOCK(adapter); for (int i = 0; i < adapter->num_queues; i++) { que = &adapter->queues[i]; txr = que->txr; igb_rxeof(que, count, &rx_done); IGB_TX_LOCK(txr); do { more = igb_txeof(txr); } while (loop-- && more); #ifndef IGB_LEGACY_TX if (!drbr_empty(ifp, txr->br)) igb_mq_start_locked(ifp, txr); #else if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) igb_start_locked(txr, ifp); #endif IGB_TX_UNLOCK(txr); } return POLL_RETURN_COUNT(rx_done); } #endif /* DEVICE_POLLING */ /********************************************************************* * * MSIX Que Interrupt Service routine * **********************************************************************/ static void igb_msix_que(void *arg) { struct igb_queue *que = arg; struct adapter *adapter = que->adapter; struct ifnet *ifp = adapter->ifp; struct tx_ring *txr = que->txr; struct rx_ring *rxr = que->rxr; u32 newitr = 0; bool more_rx; /* Ignore spurious interrupts */ if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) return; E1000_WRITE_REG(&adapter->hw, E1000_EIMC, que->eims); ++que->irqs; IGB_TX_LOCK(txr); igb_txeof(txr); #ifndef IGB_LEGACY_TX /* Process the stack queue only if not depleted */ if (((txr->queue_status & IGB_QUEUE_DEPLETED) == 0) && !drbr_empty(ifp, txr->br)) igb_mq_start_locked(ifp, txr); #else if (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) igb_start_locked(txr, ifp); #endif IGB_TX_UNLOCK(txr); more_rx = igb_rxeof(que, adapter->rx_process_limit, NULL); if (adapter->enable_aim == FALSE) goto no_calc; /* ** Do Adaptive Interrupt Moderation: ** - Write out last calculated setting ** - Calculate based on average size over ** the last interval. */ if (que->eitr_setting) E1000_WRITE_REG(&adapter->hw, E1000_EITR(que->msix), que->eitr_setting); que->eitr_setting = 0; /* Idle, do nothing */ if ((txr->bytes == 0) && (rxr->bytes == 0)) goto no_calc; /* Used half Default if sub-gig */ if (adapter->link_speed != 1000) newitr = IGB_DEFAULT_ITR / 2; else { if ((txr->bytes) && (txr->packets)) newitr = txr->bytes/txr->packets; if ((rxr->bytes) && (rxr->packets)) newitr = max(newitr, (rxr->bytes / rxr->packets)); newitr += 24; /* account for hardware frame, crc */ /* set an upper boundary */ newitr = min(newitr, 3000); /* Be nice to the mid range */ if ((newitr > 300) && (newitr < 1200)) newitr = (newitr / 3); else newitr = (newitr / 2); } newitr &= 0x7FFC; /* Mask invalid bits */ if (adapter->hw.mac.type == e1000_82575) newitr |= newitr << 16; else newitr |= E1000_EITR_CNT_IGNR; /* save for next interrupt */ que->eitr_setting = newitr; /* Reset state */ txr->bytes = 0; txr->packets = 0; rxr->bytes = 0; rxr->packets = 0; no_calc: /* Schedule a clean task if needed*/ if (more_rx) taskqueue_enqueue(que->tq, &que->que_task); else /* Reenable this interrupt */ E1000_WRITE_REG(&adapter->hw, E1000_EIMS, que->eims); return; } /********************************************************************* * * MSIX Link Interrupt Service routine * **********************************************************************/ static void igb_msix_link(void *arg) { struct adapter *adapter = arg; u32 icr; ++adapter->link_irq; icr = E1000_READ_REG(&adapter->hw, E1000_ICR); if (!(icr & E1000_ICR_LSC)) goto spurious; igb_handle_link(adapter, 0); spurious: /* Rearm */ E1000_WRITE_REG(&adapter->hw, E1000_IMS, E1000_IMS_LSC); E1000_WRITE_REG(&adapter->hw, E1000_EIMS, adapter->link_mask); return; } /********************************************************************* * * Media Ioctl callback * * This routine is called whenever the user queries the status of * the interface using ifconfig. * **********************************************************************/ static void igb_media_status(struct ifnet *ifp, struct ifmediareq *ifmr) { struct adapter *adapter = ifp->if_softc; INIT_DEBUGOUT("igb_media_status: begin"); IGB_CORE_LOCK(adapter); igb_update_link_status(adapter); ifmr->ifm_status = IFM_AVALID; ifmr->ifm_active = IFM_ETHER; if (!adapter->link_active) { IGB_CORE_UNLOCK(adapter); return; } ifmr->ifm_status |= IFM_ACTIVE; switch (adapter->link_speed) { case 10: ifmr->ifm_active |= IFM_10_T; break; case 100: /* ** Support for 100Mb SFP - these are Fiber ** but the media type appears as serdes */ if (adapter->hw.phy.media_type == e1000_media_type_internal_serdes) ifmr->ifm_active |= IFM_100_FX; else ifmr->ifm_active |= IFM_100_TX; break; case 1000: ifmr->ifm_active |= IFM_1000_T; break; case 2500: ifmr->ifm_active |= IFM_2500_SX; break; } if (adapter->link_duplex == FULL_DUPLEX) ifmr->ifm_active |= IFM_FDX; else ifmr->ifm_active |= IFM_HDX; IGB_CORE_UNLOCK(adapter); } /********************************************************************* * * Media Ioctl callback * * This routine is called when the user changes speed/duplex using * media/mediopt option with ifconfig. * **********************************************************************/ static int igb_media_change(struct ifnet *ifp) { struct adapter *adapter = ifp->if_softc; struct ifmedia *ifm = &adapter->media; INIT_DEBUGOUT("igb_media_change: begin"); if (IFM_TYPE(ifm->ifm_media) != IFM_ETHER) return (EINVAL); IGB_CORE_LOCK(adapter); switch (IFM_SUBTYPE(ifm->ifm_media)) { case IFM_AUTO: adapter->hw.mac.autoneg = DO_AUTO_NEG; adapter->hw.phy.autoneg_advertised = AUTONEG_ADV_DEFAULT; break; case IFM_1000_LX: case IFM_1000_SX: case IFM_1000_T: adapter->hw.mac.autoneg = DO_AUTO_NEG; adapter->hw.phy.autoneg_advertised = ADVERTISE_1000_FULL; break; case IFM_100_TX: adapter->hw.mac.autoneg = FALSE; adapter->hw.phy.autoneg_advertised = 0; if ((ifm->ifm_media & IFM_GMASK) == IFM_FDX) adapter->hw.mac.forced_speed_duplex = ADVERTISE_100_FULL; else adapter->hw.mac.forced_speed_duplex = ADVERTISE_100_HALF; break; case IFM_10_T: adapter->hw.mac.autoneg = FALSE; adapter->hw.phy.autoneg_advertised = 0; if ((ifm->ifm_media & IFM_GMASK) == IFM_FDX) adapter->hw.mac.forced_speed_duplex = ADVERTISE_10_FULL; else adapter->hw.mac.forced_speed_duplex = ADVERTISE_10_HALF; break; default: device_printf(adapter->dev, "Unsupported media type\n"); } igb_init_locked(adapter); IGB_CORE_UNLOCK(adapter); return (0); } /********************************************************************* * * This routine maps the mbufs to Advanced TX descriptors. * **********************************************************************/ static int igb_xmit(struct tx_ring *txr, struct mbuf **m_headp) { struct adapter *adapter = txr->adapter; u32 olinfo_status = 0, cmd_type_len; int i, j, error, nsegs; int first; bool remap = TRUE; struct mbuf *m_head; bus_dma_segment_t segs[IGB_MAX_SCATTER]; bus_dmamap_t map; struct igb_tx_buf *txbuf; union e1000_adv_tx_desc *txd = NULL; m_head = *m_headp; /* Basic descriptor defines */ cmd_type_len = (E1000_ADVTXD_DTYP_DATA | E1000_ADVTXD_DCMD_IFCS | E1000_ADVTXD_DCMD_DEXT); if (m_head->m_flags & M_VLANTAG) cmd_type_len |= E1000_ADVTXD_DCMD_VLE; /* * Important to capture the first descriptor * used because it will contain the index of * the one we tell the hardware to report back */ first = txr->next_avail_desc; txbuf = &txr->tx_buffers[first]; map = txbuf->map; /* * Map the packet for DMA. */ retry: error = bus_dmamap_load_mbuf_sg(txr->txtag, map, *m_headp, segs, &nsegs, BUS_DMA_NOWAIT); if (__predict_false(error)) { struct mbuf *m; switch (error) { case EFBIG: /* Try it again? - one try */ if (remap == TRUE) { remap = FALSE; m = m_collapse(*m_headp, M_NOWAIT, IGB_MAX_SCATTER); if (m == NULL) { adapter->mbuf_defrag_failed++; m_freem(*m_headp); *m_headp = NULL; return (ENOBUFS); } *m_headp = m; goto retry; } else return (error); default: txr->no_tx_dma_setup++; m_freem(*m_headp); *m_headp = NULL; return (error); } } /* Make certain there are enough descriptors */ if (txr->tx_avail < (nsegs + 2)) { txr->no_desc_avail++; bus_dmamap_unload(txr->txtag, map); return (ENOBUFS); } m_head = *m_headp; /* ** Set up the appropriate offload context ** this will consume the first descriptor */ error = igb_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status); if (__predict_false(error)) { m_freem(*m_headp); *m_headp = NULL; return (error); } /* 82575 needs the queue index added */ if (adapter->hw.mac.type == e1000_82575) olinfo_status |= txr->me << 4; i = txr->next_avail_desc; for (j = 0; j < nsegs; j++) { bus_size_t seglen; bus_addr_t segaddr; txbuf = &txr->tx_buffers[i]; txd = &txr->tx_base[i]; seglen = segs[j].ds_len; segaddr = htole64(segs[j].ds_addr); txd->read.buffer_addr = segaddr; txd->read.cmd_type_len = htole32(E1000_TXD_CMD_IFCS | cmd_type_len | seglen); txd->read.olinfo_status = htole32(olinfo_status); if (++i == txr->num_desc) i = 0; } txd->read.cmd_type_len |= htole32(E1000_TXD_CMD_EOP | E1000_TXD_CMD_RS); txr->tx_avail -= nsegs; txr->next_avail_desc = i; txbuf->m_head = m_head; /* ** Here we swap the map so the last descriptor, ** which gets the completion interrupt has the ** real map, and the first descriptor gets the ** unused map from this descriptor. */ txr->tx_buffers[first].map = txbuf->map; txbuf->map = map; bus_dmamap_sync(txr->txtag, map, BUS_DMASYNC_PREWRITE); /* Set the EOP descriptor that will be marked done */ txbuf = &txr->tx_buffers[first]; txbuf->eop = txd; bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* * Advance the Transmit Descriptor Tail (Tdt), this tells the * hardware that this frame is available to transmit. */ ++txr->total_packets; E1000_WRITE_REG(&adapter->hw, E1000_TDT(txr->me), i); return (0); } static void igb_set_promisc(struct adapter *adapter) { struct ifnet *ifp = adapter->ifp; struct e1000_hw *hw = &adapter->hw; u32 reg; if (adapter->vf_ifp) { e1000_promisc_set_vf(hw, e1000_promisc_enabled); return; } reg = E1000_READ_REG(hw, E1000_RCTL); if (ifp->if_flags & IFF_PROMISC) { reg |= (E1000_RCTL_UPE | E1000_RCTL_MPE); E1000_WRITE_REG(hw, E1000_RCTL, reg); } else if (ifp->if_flags & IFF_ALLMULTI) { reg |= E1000_RCTL_MPE; reg &= ~E1000_RCTL_UPE; E1000_WRITE_REG(hw, E1000_RCTL, reg); } } static void igb_disable_promisc(struct adapter *adapter) { struct e1000_hw *hw = &adapter->hw; struct ifnet *ifp = adapter->ifp; u32 reg; int mcnt = 0; if (adapter->vf_ifp) { e1000_promisc_set_vf(hw, e1000_promisc_disabled); return; } reg = E1000_READ_REG(hw, E1000_RCTL); reg &= (~E1000_RCTL_UPE); if (ifp->if_flags & IFF_ALLMULTI) mcnt = MAX_NUM_MULTICAST_ADDRESSES; else { struct ifmultiaddr *ifma; #if __FreeBSD_version < 800000 IF_ADDR_LOCK(ifp); #else if_maddr_rlock(ifp); #endif TAILQ_FOREACH(ifma, &ifp->if_multiaddrs, ifma_link) { if (ifma->ifma_addr->sa_family != AF_LINK) continue; if (mcnt == MAX_NUM_MULTICAST_ADDRESSES) break; mcnt++; } #if __FreeBSD_version < 800000 IF_ADDR_UNLOCK(ifp); #else if_maddr_runlock(ifp); #endif } /* Don't disable if in MAX groups */ if (mcnt < MAX_NUM_MULTICAST_ADDRESSES) reg &= (~E1000_RCTL_MPE); E1000_WRITE_REG(hw, E1000_RCTL, reg); } /********************************************************************* * Multicast Update * * This routine is called whenever multicast address list is updated. * **********************************************************************/ static void igb_set_multi(struct adapter *adapter) { struct ifnet *ifp = adapter->ifp; struct ifmultiaddr *ifma; u32 reg_rctl = 0; u8 *mta; int mcnt = 0; IOCTL_DEBUGOUT("igb_set_multi: begin"); mta = adapter->mta; bzero(mta, sizeof(uint8_t) * ETH_ADDR_LEN * MAX_NUM_MULTICAST_ADDRESSES); #if __FreeBSD_version < 800000 IF_ADDR_LOCK(ifp); #else if_maddr_rlock(ifp); #endif TAILQ_FOREACH(ifma, &ifp->if_multiaddrs, ifma_link) { if (ifma->ifma_addr->sa_family != AF_LINK) continue; if (mcnt == MAX_NUM_MULTICAST_ADDRESSES) break; bcopy(LLADDR((struct sockaddr_dl *)ifma->ifma_addr), &mta[mcnt * ETH_ADDR_LEN], ETH_ADDR_LEN); mcnt++; } #if __FreeBSD_version < 800000 IF_ADDR_UNLOCK(ifp); #else if_maddr_runlock(ifp); #endif if (mcnt >= MAX_NUM_MULTICAST_ADDRESSES) { reg_rctl = E1000_READ_REG(&adapter->hw, E1000_RCTL); reg_rctl |= E1000_RCTL_MPE; E1000_WRITE_REG(&adapter->hw, E1000_RCTL, reg_rctl); } else e1000_update_mc_addr_list(&adapter->hw, mta, mcnt); } /********************************************************************* * Timer routine: * This routine checks for link status, * updates statistics, and does the watchdog. * **********************************************************************/ static void igb_local_timer(void *arg) { struct adapter *adapter = arg; device_t dev = adapter->dev; struct ifnet *ifp = adapter->ifp; struct tx_ring *txr = adapter->tx_rings; struct igb_queue *que = adapter->queues; int hung = 0, busy = 0; IGB_CORE_LOCK_ASSERT(adapter); igb_update_link_status(adapter); igb_update_stats_counters(adapter); /* ** Check the TX queues status ** - central locked handling of OACTIVE ** - watchdog only if all queues show hung */ for (int i = 0; i < adapter->num_queues; i++, que++, txr++) { if ((txr->queue_status & IGB_QUEUE_HUNG) && (adapter->pause_frames == 0)) ++hung; if (txr->queue_status & IGB_QUEUE_DEPLETED) ++busy; if ((txr->queue_status & IGB_QUEUE_IDLE) == 0) taskqueue_enqueue(que->tq, &que->que_task); } if (hung == adapter->num_queues) goto timeout; if (busy == adapter->num_queues) ifp->if_drv_flags |= IFF_DRV_OACTIVE; else if ((ifp->if_drv_flags & IFF_DRV_OACTIVE) && (busy < adapter->num_queues)) ifp->if_drv_flags &= ~IFF_DRV_OACTIVE; adapter->pause_frames = 0; callout_reset(&adapter->timer, hz, igb_local_timer, adapter); #ifndef DEVICE_POLLING /* Schedule all queue interrupts - deadlock protection */ E1000_WRITE_REG(&adapter->hw, E1000_EICS, adapter->que_mask); #endif return; timeout: device_printf(adapter->dev, "Watchdog timeout -- resetting\n"); device_printf(dev,"Queue(%d) tdh = %d, hw tdt = %d\n", txr->me, E1000_READ_REG(&adapter->hw, E1000_TDH(txr->me)), E1000_READ_REG(&adapter->hw, E1000_TDT(txr->me))); device_printf(dev,"TX(%d) desc avail = %d," "Next TX to Clean = %d\n", txr->me, txr->tx_avail, txr->next_to_clean); adapter->ifp->if_drv_flags &= ~IFF_DRV_RUNNING; adapter->watchdog_events++; igb_init_locked(adapter); } static void igb_update_link_status(struct adapter *adapter) { struct e1000_hw *hw = &adapter->hw; struct e1000_fc_info *fc = &hw->fc; struct ifnet *ifp = adapter->ifp; device_t dev = adapter->dev; struct tx_ring *txr = adapter->tx_rings; u32 link_check, thstat, ctrl; char *flowctl = NULL; link_check = thstat = ctrl = 0; /* Get the cached link value or read for real */ switch (hw->phy.media_type) { case e1000_media_type_copper: if (hw->mac.get_link_status) { /* Do the work to read phy */ e1000_check_for_link(hw); link_check = !hw->mac.get_link_status; } else link_check = TRUE; break; case e1000_media_type_fiber: e1000_check_for_link(hw); link_check = (E1000_READ_REG(hw, E1000_STATUS) & E1000_STATUS_LU); break; case e1000_media_type_internal_serdes: e1000_check_for_link(hw); link_check = adapter->hw.mac.serdes_has_link; break; /* VF device is type_unknown */ case e1000_media_type_unknown: e1000_check_for_link(hw); link_check = !hw->mac.get_link_status; /* Fall thru */ default: break; } /* Check for thermal downshift or shutdown */ if (hw->mac.type == e1000_i350) { thstat = E1000_READ_REG(hw, E1000_THSTAT); ctrl = E1000_READ_REG(hw, E1000_CTRL_EXT); } /* Get the flow control for display */ switch (fc->current_mode) { case e1000_fc_rx_pause: flowctl = "RX"; break; case e1000_fc_tx_pause: flowctl = "TX"; break; case e1000_fc_full: flowctl = "Full"; break; case e1000_fc_none: default: flowctl = "None"; break; } /* Now we check if a transition has happened */ if (link_check && (adapter->link_active == 0)) { e1000_get_speed_and_duplex(&adapter->hw, &adapter->link_speed, &adapter->link_duplex); if (bootverbose) device_printf(dev, "Link is up %d Mbps %s," " Flow Control: %s\n", adapter->link_speed, ((adapter->link_duplex == FULL_DUPLEX) ? "Full Duplex" : "Half Duplex"), flowctl); adapter->link_active = 1; ifp->if_baudrate = adapter->link_speed * 1000000; if ((ctrl & E1000_CTRL_EXT_LINK_MODE_GMII) && (thstat & E1000_THSTAT_LINK_THROTTLE)) device_printf(dev, "Link: thermal downshift\n"); /* Delay Link Up for Phy update */ if (((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211)) && (hw->phy.id == I210_I_PHY_ID)) msec_delay(I210_LINK_DELAY); /* Reset if the media type changed. */ if (hw->dev_spec._82575.media_changed) { hw->dev_spec._82575.media_changed = false; adapter->flags |= IGB_MEDIA_RESET; igb_reset(adapter); } /* This can sleep */ if_link_state_change(ifp, LINK_STATE_UP); } else if (!link_check && (adapter->link_active == 1)) { ifp->if_baudrate = adapter->link_speed = 0; adapter->link_duplex = 0; if (bootverbose) device_printf(dev, "Link is Down\n"); if ((ctrl & E1000_CTRL_EXT_LINK_MODE_GMII) && (thstat & E1000_THSTAT_PWR_DOWN)) device_printf(dev, "Link: thermal shutdown\n"); adapter->link_active = 0; /* This can sleep */ if_link_state_change(ifp, LINK_STATE_DOWN); /* Reset queue state */ for (int i = 0; i < adapter->num_queues; i++, txr++) txr->queue_status = IGB_QUEUE_IDLE; } } /********************************************************************* * * This routine disables all traffic on the adapter by issuing a * global reset on the MAC and deallocates TX/RX buffers. * **********************************************************************/ static void igb_stop(void *arg) { struct adapter *adapter = arg; struct ifnet *ifp = adapter->ifp; struct tx_ring *txr = adapter->tx_rings; IGB_CORE_LOCK_ASSERT(adapter); INIT_DEBUGOUT("igb_stop: begin"); igb_disable_intr(adapter); callout_stop(&adapter->timer); /* Tell the stack that the interface is no longer active */ ifp->if_drv_flags &= ~IFF_DRV_RUNNING; ifp->if_drv_flags |= IFF_DRV_OACTIVE; /* Disarm watchdog timer. */ for (int i = 0; i < adapter->num_queues; i++, txr++) { IGB_TX_LOCK(txr); txr->queue_status = IGB_QUEUE_IDLE; IGB_TX_UNLOCK(txr); } e1000_reset_hw(&adapter->hw); E1000_WRITE_REG(&adapter->hw, E1000_WUC, 0); e1000_led_off(&adapter->hw); e1000_cleanup_led(&adapter->hw); } /********************************************************************* * * Determine hardware revision. * **********************************************************************/ static void igb_identify_hardware(struct adapter *adapter) { device_t dev = adapter->dev; /* Make sure our PCI config space has the necessary stuff set */ pci_enable_busmaster(dev); adapter->hw.bus.pci_cmd_word = pci_read_config(dev, PCIR_COMMAND, 2); /* Save off the information about this board */ adapter->hw.vendor_id = pci_get_vendor(dev); adapter->hw.device_id = pci_get_device(dev); adapter->hw.revision_id = pci_read_config(dev, PCIR_REVID, 1); adapter->hw.subsystem_vendor_id = pci_read_config(dev, PCIR_SUBVEND_0, 2); adapter->hw.subsystem_device_id = pci_read_config(dev, PCIR_SUBDEV_0, 2); /* Set MAC type early for PCI setup */ e1000_set_mac_type(&adapter->hw); /* Are we a VF device? */ if ((adapter->hw.mac.type == e1000_vfadapt) || (adapter->hw.mac.type == e1000_vfadapt_i350)) adapter->vf_ifp = 1; else adapter->vf_ifp = 0; } static int igb_allocate_pci_resources(struct adapter *adapter) { device_t dev = adapter->dev; int rid; rid = PCIR_BAR(0); adapter->pci_mem = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE); if (adapter->pci_mem == NULL) { device_printf(dev, "Unable to allocate bus resource: memory\n"); return (ENXIO); } adapter->osdep.mem_bus_space_tag = rman_get_bustag(adapter->pci_mem); adapter->osdep.mem_bus_space_handle = rman_get_bushandle(adapter->pci_mem); adapter->hw.hw_addr = (u8 *)&adapter->osdep.mem_bus_space_handle; adapter->num_queues = 1; /* Defaults for Legacy or MSI */ /* This will setup either MSI/X or MSI */ adapter->msix = igb_setup_msix(adapter); adapter->hw.back = &adapter->osdep; return (0); } /********************************************************************* * * Setup the Legacy or MSI Interrupt handler * **********************************************************************/ static int igb_allocate_legacy(struct adapter *adapter) { device_t dev = adapter->dev; struct igb_queue *que = adapter->queues; #ifndef IGB_LEGACY_TX struct tx_ring *txr = adapter->tx_rings; #endif int error, rid = 0; /* Turn off all interrupts */ E1000_WRITE_REG(&adapter->hw, E1000_IMC, 0xffffffff); /* MSI RID is 1 */ if (adapter->msix == 1) rid = 1; /* We allocate a single interrupt resource */ adapter->res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &rid, RF_SHAREABLE | RF_ACTIVE); if (adapter->res == NULL) { device_printf(dev, "Unable to allocate bus resource: " "interrupt\n"); return (ENXIO); } #ifndef IGB_LEGACY_TX TASK_INIT(&txr->txq_task, 0, igb_deferred_mq_start, txr); #endif /* * Try allocating a fast interrupt and the associated deferred * processing contexts. */ TASK_INIT(&que->que_task, 0, igb_handle_que, que); /* Make tasklet for deferred link handling */ TASK_INIT(&adapter->link_task, 0, igb_handle_link, adapter); que->tq = taskqueue_create_fast("igb_taskq", M_NOWAIT, taskqueue_thread_enqueue, &que->tq); taskqueue_start_threads(&que->tq, 1, PI_NET, "%s taskq", device_get_nameunit(adapter->dev)); if ((error = bus_setup_intr(dev, adapter->res, INTR_TYPE_NET | INTR_MPSAFE, igb_irq_fast, NULL, adapter, &adapter->tag)) != 0) { device_printf(dev, "Failed to register fast interrupt " "handler: %d\n", error); taskqueue_free(que->tq); que->tq = NULL; return (error); } return (0); } /********************************************************************* * * Setup the MSIX Queue Interrupt handlers: * **********************************************************************/ static int igb_allocate_msix(struct adapter *adapter) { device_t dev = adapter->dev; struct igb_queue *que = adapter->queues; int error, rid, vector = 0; int cpu_id = 0; #ifdef RSS cpuset_t cpu_mask; #endif /* Be sure to start with all interrupts disabled */ E1000_WRITE_REG(&adapter->hw, E1000_IMC, ~0); E1000_WRITE_FLUSH(&adapter->hw); #ifdef RSS /* * If we're doing RSS, the number of queues needs to * match the number of RSS buckets that are configured. * * + If there's more queues than RSS buckets, we'll end * up with queues that get no traffic. * * + If there's more RSS buckets than queues, we'll end * up having multiple RSS buckets map to the same queue, * so there'll be some contention. */ if (adapter->num_queues != rss_getnumbuckets()) { device_printf(dev, "%s: number of queues (%d) != number of RSS buckets (%d)" "; performance will be impacted.\n", __func__, adapter->num_queues, rss_getnumbuckets()); } #endif for (int i = 0; i < adapter->num_queues; i++, vector++, que++) { rid = vector +1; que->res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &rid, RF_SHAREABLE | RF_ACTIVE); if (que->res == NULL) { device_printf(dev, "Unable to allocate bus resource: " "MSIX Queue Interrupt\n"); return (ENXIO); } error = bus_setup_intr(dev, que->res, INTR_TYPE_NET | INTR_MPSAFE, NULL, igb_msix_que, que, &que->tag); if (error) { que->res = NULL; device_printf(dev, "Failed to register Queue handler"); return (error); } #if __FreeBSD_version >= 800504 bus_describe_intr(dev, que->res, que->tag, "que %d", i); #endif que->msix = vector; if (adapter->hw.mac.type == e1000_82575) que->eims = E1000_EICR_TX_QUEUE0 << i; else que->eims = 1 << vector; #ifdef RSS /* * The queue ID is used as the RSS layer bucket ID. * We look up the queue ID -> RSS CPU ID and select * that. */ cpu_id = rss_getcpu(i % rss_getnumbuckets()); #else /* * Bind the msix vector, and thus the * rings to the corresponding cpu. * * This just happens to match the default RSS round-robin * bucket -> queue -> CPU allocation. */ if (adapter->num_queues > 1) { if (igb_last_bind_cpu < 0) igb_last_bind_cpu = CPU_FIRST(); cpu_id = igb_last_bind_cpu; } #endif if (adapter->num_queues > 1) { bus_bind_intr(dev, que->res, cpu_id); #ifdef RSS device_printf(dev, "Bound queue %d to RSS bucket %d\n", i, cpu_id); #else device_printf(dev, "Bound queue %d to cpu %d\n", i, cpu_id); #endif } #ifndef IGB_LEGACY_TX TASK_INIT(&que->txr->txq_task, 0, igb_deferred_mq_start, que->txr); #endif /* Make tasklet for deferred handling */ TASK_INIT(&que->que_task, 0, igb_handle_que, que); que->tq = taskqueue_create("igb_que", M_NOWAIT, taskqueue_thread_enqueue, &que->tq); if (adapter->num_queues > 1) { /* * Only pin the taskqueue thread to a CPU if * RSS is in use. * * This again just happens to match the default RSS * round-robin bucket -> queue -> CPU allocation. */ #ifdef RSS CPU_SETOF(cpu_id, &cpu_mask); taskqueue_start_threads_cpuset(&que->tq, 1, PI_NET, &cpu_mask, "%s que (bucket %d)", device_get_nameunit(adapter->dev), cpu_id); #else taskqueue_start_threads(&que->tq, 1, PI_NET, "%s que (qid %d)", device_get_nameunit(adapter->dev), cpu_id); #endif } else { taskqueue_start_threads(&que->tq, 1, PI_NET, "%s que", device_get_nameunit(adapter->dev)); } /* Finally update the last bound CPU id */ if (adapter->num_queues > 1) igb_last_bind_cpu = CPU_NEXT(igb_last_bind_cpu); } /* And Link */ rid = vector + 1; adapter->res = bus_alloc_resource_any(dev, SYS_RES_IRQ, &rid, RF_SHAREABLE | RF_ACTIVE); if (adapter->res == NULL) { device_printf(dev, "Unable to allocate bus resource: " "MSIX Link Interrupt\n"); return (ENXIO); } if ((error = bus_setup_intr(dev, adapter->res, INTR_TYPE_NET | INTR_MPSAFE, NULL, igb_msix_link, adapter, &adapter->tag)) != 0) { device_printf(dev, "Failed to register Link handler"); return (error); } #if __FreeBSD_version >= 800504 bus_describe_intr(dev, adapter->res, adapter->tag, "link"); #endif adapter->linkvec = vector; return (0); } static void igb_configure_queues(struct adapter *adapter) { struct e1000_hw *hw = &adapter->hw; struct igb_queue *que; u32 tmp, ivar = 0, newitr = 0; /* First turn on RSS capability */ if (adapter->hw.mac.type != e1000_82575) E1000_WRITE_REG(hw, E1000_GPIE, E1000_GPIE_MSIX_MODE | E1000_GPIE_EIAME | E1000_GPIE_PBA | E1000_GPIE_NSICR); /* Turn on MSIX */ switch (adapter->hw.mac.type) { case e1000_82580: case e1000_i350: case e1000_i354: case e1000_i210: case e1000_i211: case e1000_vfadapt: case e1000_vfadapt_i350: /* RX entries */ for (int i = 0; i < adapter->num_queues; i++) { u32 index = i >> 1; ivar = E1000_READ_REG_ARRAY(hw, E1000_IVAR0, index); que = &adapter->queues[i]; if (i & 1) { ivar &= 0xFF00FFFF; ivar |= (que->msix | E1000_IVAR_VALID) << 16; } else { ivar &= 0xFFFFFF00; ivar |= que->msix | E1000_IVAR_VALID; } E1000_WRITE_REG_ARRAY(hw, E1000_IVAR0, index, ivar); } /* TX entries */ for (int i = 0; i < adapter->num_queues; i++) { u32 index = i >> 1; ivar = E1000_READ_REG_ARRAY(hw, E1000_IVAR0, index); que = &adapter->queues[i]; if (i & 1) { ivar &= 0x00FFFFFF; ivar |= (que->msix | E1000_IVAR_VALID) << 24; } else { ivar &= 0xFFFF00FF; ivar |= (que->msix | E1000_IVAR_VALID) << 8; } E1000_WRITE_REG_ARRAY(hw, E1000_IVAR0, index, ivar); adapter->que_mask |= que->eims; } /* And for the link interrupt */ ivar = (adapter->linkvec | E1000_IVAR_VALID) << 8; adapter->link_mask = 1 << adapter->linkvec; E1000_WRITE_REG(hw, E1000_IVAR_MISC, ivar); break; case e1000_82576: /* RX entries */ for (int i = 0; i < adapter->num_queues; i++) { u32 index = i & 0x7; /* Each IVAR has two entries */ ivar = E1000_READ_REG_ARRAY(hw, E1000_IVAR0, index); que = &adapter->queues[i]; if (i < 8) { ivar &= 0xFFFFFF00; ivar |= que->msix | E1000_IVAR_VALID; } else { ivar &= 0xFF00FFFF; ivar |= (que->msix | E1000_IVAR_VALID) << 16; } E1000_WRITE_REG_ARRAY(hw, E1000_IVAR0, index, ivar); adapter->que_mask |= que->eims; } /* TX entries */ for (int i = 0; i < adapter->num_queues; i++) { u32 index = i & 0x7; /* Each IVAR has two entries */ ivar = E1000_READ_REG_ARRAY(hw, E1000_IVAR0, index); que = &adapter->queues[i]; if (i < 8) { ivar &= 0xFFFF00FF; ivar |= (que->msix | E1000_IVAR_VALID) << 8; } else { ivar &= 0x00FFFFFF; ivar |= (que->msix | E1000_IVAR_VALID) << 24; } E1000_WRITE_REG_ARRAY(hw, E1000_IVAR0, index, ivar); adapter->que_mask |= que->eims; } /* And for the link interrupt */ ivar = (adapter->linkvec | E1000_IVAR_VALID) << 8; adapter->link_mask = 1 << adapter->linkvec; E1000_WRITE_REG(hw, E1000_IVAR_MISC, ivar); break; case e1000_82575: /* enable MSI-X support*/ tmp = E1000_READ_REG(hw, E1000_CTRL_EXT); tmp |= E1000_CTRL_EXT_PBA_CLR; /* Auto-Mask interrupts upon ICR read. */ tmp |= E1000_CTRL_EXT_EIAME; tmp |= E1000_CTRL_EXT_IRCA; E1000_WRITE_REG(hw, E1000_CTRL_EXT, tmp); /* Queues */ for (int i = 0; i < adapter->num_queues; i++) { que = &adapter->queues[i]; tmp = E1000_EICR_RX_QUEUE0 << i; tmp |= E1000_EICR_TX_QUEUE0 << i; que->eims = tmp; E1000_WRITE_REG_ARRAY(hw, E1000_MSIXBM(0), i, que->eims); adapter->que_mask |= que->eims; } /* Link */ E1000_WRITE_REG(hw, E1000_MSIXBM(adapter->linkvec), E1000_EIMS_OTHER); adapter->link_mask |= E1000_EIMS_OTHER; default: break; } /* Set the starting interrupt rate */ if (igb_max_interrupt_rate > 0) newitr = (4000000 / igb_max_interrupt_rate) & 0x7FFC; if (hw->mac.type == e1000_82575) newitr |= newitr << 16; else newitr |= E1000_EITR_CNT_IGNR; for (int i = 0; i < adapter->num_queues; i++) { que = &adapter->queues[i]; E1000_WRITE_REG(hw, E1000_EITR(que->msix), newitr); } return; } static void igb_free_pci_resources(struct adapter *adapter) { struct igb_queue *que = adapter->queues; device_t dev = adapter->dev; int rid; /* ** There is a slight possibility of a failure mode ** in attach that will result in entering this function ** before interrupt resources have been initialized, and ** in that case we do not want to execute the loops below ** We can detect this reliably by the state of the adapter ** res pointer. */ if (adapter->res == NULL) goto mem; /* * First release all the interrupt resources: */ for (int i = 0; i < adapter->num_queues; i++, que++) { rid = que->msix + 1; if (que->tag != NULL) { bus_teardown_intr(dev, que->res, que->tag); que->tag = NULL; } if (que->res != NULL) bus_release_resource(dev, SYS_RES_IRQ, rid, que->res); } /* Clean the Legacy or Link interrupt last */ if (adapter->linkvec) /* we are doing MSIX */ rid = adapter->linkvec + 1; else (adapter->msix != 0) ? (rid = 1):(rid = 0); que = adapter->queues; if (adapter->tag != NULL) { taskqueue_drain(que->tq, &adapter->link_task); bus_teardown_intr(dev, adapter->res, adapter->tag); adapter->tag = NULL; } if (adapter->res != NULL) bus_release_resource(dev, SYS_RES_IRQ, rid, adapter->res); for (int i = 0; i < adapter->num_queues; i++, que++) { if (que->tq != NULL) { #ifndef IGB_LEGACY_TX taskqueue_drain(que->tq, &que->txr->txq_task); #endif taskqueue_drain(que->tq, &que->que_task); taskqueue_free(que->tq); } } mem: if (adapter->msix) pci_release_msi(dev); if (adapter->msix_mem != NULL) bus_release_resource(dev, SYS_RES_MEMORY, adapter->memrid, adapter->msix_mem); if (adapter->pci_mem != NULL) bus_release_resource(dev, SYS_RES_MEMORY, PCIR_BAR(0), adapter->pci_mem); } /* * Setup Either MSI/X or MSI */ static int igb_setup_msix(struct adapter *adapter) { device_t dev = adapter->dev; int bar, want, queues, msgs, maxqueues; /* tuneable override */ if (igb_enable_msix == 0) goto msi; /* First try MSI/X */ msgs = pci_msix_count(dev); if (msgs == 0) goto msi; /* ** Some new devices, as with ixgbe, now may ** use a different BAR, so we need to keep ** track of which is used. */ adapter->memrid = PCIR_BAR(IGB_MSIX_BAR); bar = pci_read_config(dev, adapter->memrid, 4); if (bar == 0) /* use next bar */ adapter->memrid += 4; adapter->msix_mem = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &adapter->memrid, RF_ACTIVE); if (adapter->msix_mem == NULL) { /* May not be enabled */ device_printf(adapter->dev, "Unable to map MSIX table \n"); goto msi; } queues = (mp_ncpus > (msgs-1)) ? (msgs-1) : mp_ncpus; /* Override via tuneable */ if (igb_num_queues != 0) queues = igb_num_queues; #ifdef RSS /* If we're doing RSS, clamp at the number of RSS buckets */ if (queues > rss_getnumbuckets()) queues = rss_getnumbuckets(); #endif /* Sanity check based on HW */ switch (adapter->hw.mac.type) { case e1000_82575: maxqueues = 4; break; case e1000_82576: case e1000_82580: case e1000_i350: case e1000_i354: maxqueues = 8; break; case e1000_i210: maxqueues = 4; break; case e1000_i211: maxqueues = 2; break; default: /* VF interfaces */ maxqueues = 1; break; } /* Final clamp on the actual hardware capability */ if (queues > maxqueues) queues = maxqueues; /* ** One vector (RX/TX pair) per queue ** plus an additional for Link interrupt */ want = queues + 1; if (msgs >= want) msgs = want; else { device_printf(adapter->dev, "MSIX Configuration Problem, " "%d vectors configured, but %d queues wanted!\n", msgs, want); goto msi; } if ((pci_alloc_msix(dev, &msgs) == 0) && (msgs == want)) { device_printf(adapter->dev, "Using MSIX interrupts with %d vectors\n", msgs); adapter->num_queues = queues; return (msgs); } /* ** If MSIX alloc failed or provided us with ** less than needed, free and fall through to MSI */ pci_release_msi(dev); msi: if (adapter->msix_mem != NULL) { bus_release_resource(dev, SYS_RES_MEMORY, PCIR_BAR(IGB_MSIX_BAR), adapter->msix_mem); adapter->msix_mem = NULL; } msgs = 1; if (pci_alloc_msi(dev, &msgs) == 0) { device_printf(adapter->dev," Using an MSI interrupt\n"); return (msgs); } device_printf(adapter->dev," Using a Legacy interrupt\n"); return (0); } /********************************************************************* * * Initialize the DMA Coalescing feature * **********************************************************************/ static void igb_init_dmac(struct adapter *adapter, u32 pba) { device_t dev = adapter->dev; struct e1000_hw *hw = &adapter->hw; u32 dmac, reg = ~E1000_DMACR_DMAC_EN; u16 hwm; if (hw->mac.type == e1000_i211) return; if (hw->mac.type > e1000_82580) { if (adapter->dmac == 0) { /* Disabling it */ E1000_WRITE_REG(hw, E1000_DMACR, reg); return; } else device_printf(dev, "DMA Coalescing enabled\n"); /* Set starting threshold */ E1000_WRITE_REG(hw, E1000_DMCTXTH, 0); hwm = 64 * pba - adapter->max_frame_size / 16; if (hwm < 64 * (pba - 6)) hwm = 64 * (pba - 6); reg = E1000_READ_REG(hw, E1000_FCRTC); reg &= ~E1000_FCRTC_RTH_COAL_MASK; reg |= ((hwm << E1000_FCRTC_RTH_COAL_SHIFT) & E1000_FCRTC_RTH_COAL_MASK); E1000_WRITE_REG(hw, E1000_FCRTC, reg); dmac = pba - adapter->max_frame_size / 512; if (dmac < pba - 10) dmac = pba - 10; reg = E1000_READ_REG(hw, E1000_DMACR); reg &= ~E1000_DMACR_DMACTHR_MASK; reg = ((dmac << E1000_DMACR_DMACTHR_SHIFT) & E1000_DMACR_DMACTHR_MASK); /* transition to L0x or L1 if available..*/ reg |= (E1000_DMACR_DMAC_EN | E1000_DMACR_DMAC_LX_MASK); /* Check if status is 2.5Gb backplane connection * before configuration of watchdog timer, which is * in msec values in 12.8usec intervals * watchdog timer= msec values in 32usec intervals * for non 2.5Gb connection */ if (hw->mac.type == e1000_i354) { int status = E1000_READ_REG(hw, E1000_STATUS); if ((status & E1000_STATUS_2P5_SKU) && (!(status & E1000_STATUS_2P5_SKU_OVER))) reg |= ((adapter->dmac * 5) >> 6); else reg |= (adapter->dmac >> 5); } else { reg |= (adapter->dmac >> 5); } E1000_WRITE_REG(hw, E1000_DMACR, reg); E1000_WRITE_REG(hw, E1000_DMCRTRH, 0); /* Set the interval before transition */ reg = E1000_READ_REG(hw, E1000_DMCTLX); if (hw->mac.type == e1000_i350) reg |= IGB_DMCTLX_DCFLUSH_DIS; /* ** in 2.5Gb connection, TTLX unit is 0.4 usec ** which is 0x4*2 = 0xA. But delay is still 4 usec */ if (hw->mac.type == e1000_i354) { int status = E1000_READ_REG(hw, E1000_STATUS); if ((status & E1000_STATUS_2P5_SKU) && (!(status & E1000_STATUS_2P5_SKU_OVER))) reg |= 0xA; else reg |= 0x4; } else { reg |= 0x4; } E1000_WRITE_REG(hw, E1000_DMCTLX, reg); /* free space in tx packet buffer to wake from DMA coal */ E1000_WRITE_REG(hw, E1000_DMCTXTH, (IGB_TXPBSIZE - (2 * adapter->max_frame_size)) >> 6); /* make low power state decision controlled by DMA coal */ reg = E1000_READ_REG(hw, E1000_PCIEMISC); reg &= ~E1000_PCIEMISC_LX_DECISION; E1000_WRITE_REG(hw, E1000_PCIEMISC, reg); } else if (hw->mac.type == e1000_82580) { u32 reg = E1000_READ_REG(hw, E1000_PCIEMISC); E1000_WRITE_REG(hw, E1000_PCIEMISC, reg & ~E1000_PCIEMISC_LX_DECISION); E1000_WRITE_REG(hw, E1000_DMACR, 0); } } /********************************************************************* * * Set up an fresh starting state * **********************************************************************/ static void igb_reset(struct adapter *adapter) { device_t dev = adapter->dev; struct e1000_hw *hw = &adapter->hw; struct e1000_fc_info *fc = &hw->fc; struct ifnet *ifp = adapter->ifp; u32 pba = 0; u16 hwm; INIT_DEBUGOUT("igb_reset: begin"); /* Let the firmware know the OS is in control */ igb_get_hw_control(adapter); /* * Packet Buffer Allocation (PBA) * Writing PBA sets the receive portion of the buffer * the remainder is used for the transmit buffer. */ switch (hw->mac.type) { case e1000_82575: pba = E1000_PBA_32K; break; case e1000_82576: case e1000_vfadapt: pba = E1000_READ_REG(hw, E1000_RXPBS); pba &= E1000_RXPBS_SIZE_MASK_82576; break; case e1000_82580: case e1000_i350: case e1000_i354: case e1000_vfadapt_i350: pba = E1000_READ_REG(hw, E1000_RXPBS); pba = e1000_rxpbs_adjust_82580(pba); break; case e1000_i210: case e1000_i211: pba = E1000_PBA_34K; default: break; } /* Special needs in case of Jumbo frames */ if ((hw->mac.type == e1000_82575) && (ifp->if_mtu > ETHERMTU)) { u32 tx_space, min_tx, min_rx; pba = E1000_READ_REG(hw, E1000_PBA); tx_space = pba >> 16; pba &= 0xffff; min_tx = (adapter->max_frame_size + sizeof(struct e1000_tx_desc) - ETHERNET_FCS_SIZE) * 2; min_tx = roundup2(min_tx, 1024); min_tx >>= 10; min_rx = adapter->max_frame_size; min_rx = roundup2(min_rx, 1024); min_rx >>= 10; if (tx_space < min_tx && ((min_tx - tx_space) < pba)) { pba = pba - (min_tx - tx_space); /* * if short on rx space, rx wins * and must trump tx adjustment */ if (pba < min_rx) pba = min_rx; } E1000_WRITE_REG(hw, E1000_PBA, pba); } INIT_DEBUGOUT1("igb_init: pba=%dK",pba); /* * These parameters control the automatic generation (Tx) and * response (Rx) to Ethernet PAUSE frames. * - High water mark should allow for at least two frames to be * received after sending an XOFF. * - Low water mark works best when it is very near the high water mark. * This allows the receiver to restart by sending XON when it has * drained a bit. */ hwm = min(((pba << 10) * 9 / 10), ((pba << 10) - 2 * adapter->max_frame_size)); if (hw->mac.type < e1000_82576) { fc->high_water = hwm & 0xFFF8; /* 8-byte granularity */ fc->low_water = fc->high_water - 8; } else { fc->high_water = hwm & 0xFFF0; /* 16-byte granularity */ fc->low_water = fc->high_water - 16; } fc->pause_time = IGB_FC_PAUSE_TIME; fc->send_xon = TRUE; if (adapter->fc) fc->requested_mode = adapter->fc; else fc->requested_mode = e1000_fc_default; /* Issue a global reset */ e1000_reset_hw(hw); E1000_WRITE_REG(hw, E1000_WUC, 0); /* Reset for AutoMediaDetect */ if (adapter->flags & IGB_MEDIA_RESET) { e1000_setup_init_funcs(hw, TRUE); e1000_get_bus_info(hw); adapter->flags &= ~IGB_MEDIA_RESET; } if (e1000_init_hw(hw) < 0) device_printf(dev, "Hardware Initialization Failed\n"); /* Setup DMA Coalescing */ igb_init_dmac(adapter, pba); E1000_WRITE_REG(&adapter->hw, E1000_VET, ETHERTYPE_VLAN); e1000_get_phy_info(hw); e1000_check_for_link(hw); return; } /********************************************************************* * * Setup networking device structure and register an interface. * **********************************************************************/ static int igb_setup_interface(device_t dev, struct adapter *adapter) { struct ifnet *ifp; INIT_DEBUGOUT("igb_setup_interface: begin"); ifp = adapter->ifp = if_alloc(IFT_ETHER); if (ifp == NULL) { device_printf(dev, "can not allocate ifnet structure\n"); return (-1); } if_initname(ifp, device_get_name(dev), device_get_unit(dev)); ifp->if_init = igb_init; ifp->if_softc = adapter; ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; ifp->if_ioctl = igb_ioctl; ifp->if_get_counter = igb_get_counter; /* TSO parameters */ ifp->if_hw_tsomax = IP_MAXPACKET; ifp->if_hw_tsomaxsegcount = IGB_MAX_SCATTER; ifp->if_hw_tsomaxsegsize = IGB_TSO_SEG_SIZE; #ifndef IGB_LEGACY_TX ifp->if_transmit = igb_mq_start; ifp->if_qflush = igb_qflush; #else ifp->if_start = igb_start; IFQ_SET_MAXLEN(&ifp->if_snd, adapter->num_tx_desc - 1); ifp->if_snd.ifq_drv_maxlen = adapter->num_tx_desc - 1; IFQ_SET_READY(&ifp->if_snd); #endif ether_ifattach(ifp, adapter->hw.mac.addr); ifp->if_capabilities = ifp->if_capenable = 0; ifp->if_capabilities = IFCAP_HWCSUM | IFCAP_VLAN_HWCSUM; #if __FreeBSD_version >= 1000000 ifp->if_capabilities |= IFCAP_HWCSUM_IPV6; #endif ifp->if_capabilities |= IFCAP_TSO; ifp->if_capabilities |= IFCAP_JUMBO_MTU; ifp->if_capenable = ifp->if_capabilities; /* Don't enable LRO by default */ ifp->if_capabilities |= IFCAP_LRO; #ifdef DEVICE_POLLING ifp->if_capabilities |= IFCAP_POLLING; #endif /* * Tell the upper layer(s) we * support full VLAN capability. */ ifp->if_hdrlen = sizeof(struct ether_vlan_header); ifp->if_capabilities |= IFCAP_VLAN_HWTAGGING | IFCAP_VLAN_HWTSO | IFCAP_VLAN_MTU; ifp->if_capenable |= IFCAP_VLAN_HWTAGGING | IFCAP_VLAN_HWTSO | IFCAP_VLAN_MTU; /* ** Don't turn this on by default, if vlans are ** created on another pseudo device (eg. lagg) ** then vlan events are not passed thru, breaking ** operation, but with HW FILTER off it works. If ** using vlans directly on the igb driver you can ** enable this and get full hardware tag filtering. */ ifp->if_capabilities |= IFCAP_VLAN_HWFILTER; /* * Specify the media types supported by this adapter and register * callbacks to update media and link information */ ifmedia_init(&adapter->media, IFM_IMASK, igb_media_change, igb_media_status); if ((adapter->hw.phy.media_type == e1000_media_type_fiber) || (adapter->hw.phy.media_type == e1000_media_type_internal_serdes)) { ifmedia_add(&adapter->media, IFM_ETHER | IFM_1000_SX | IFM_FDX, 0, NULL); ifmedia_add(&adapter->media, IFM_ETHER | IFM_1000_SX, 0, NULL); } else { ifmedia_add(&adapter->media, IFM_ETHER | IFM_10_T, 0, NULL); ifmedia_add(&adapter->media, IFM_ETHER | IFM_10_T | IFM_FDX, 0, NULL); ifmedia_add(&adapter->media, IFM_ETHER | IFM_100_TX, 0, NULL); ifmedia_add(&adapter->media, IFM_ETHER | IFM_100_TX | IFM_FDX, 0, NULL); if (adapter->hw.phy.type != e1000_phy_ife) { ifmedia_add(&adapter->media, IFM_ETHER | IFM_1000_T | IFM_FDX, 0, NULL); ifmedia_add(&adapter->media, IFM_ETHER | IFM_1000_T, 0, NULL); } } ifmedia_add(&adapter->media, IFM_ETHER | IFM_AUTO, 0, NULL); ifmedia_set(&adapter->media, IFM_ETHER | IFM_AUTO); return (0); } /* * Manage DMA'able memory. */ static void igb_dmamap_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error) { if (error) return; *(bus_addr_t *) arg = segs[0].ds_addr; } static int igb_dma_malloc(struct adapter *adapter, bus_size_t size, struct igb_dma_alloc *dma, int mapflags) { int error; error = bus_dma_tag_create(bus_get_dma_tag(adapter->dev), /* parent */ IGB_DBA_ALIGN, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ size, /* maxsize */ 1, /* nsegments */ size, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockarg */ &dma->dma_tag); if (error) { device_printf(adapter->dev, "%s: bus_dma_tag_create failed: %d\n", __func__, error); goto fail_0; } error = bus_dmamem_alloc(dma->dma_tag, (void**) &dma->dma_vaddr, BUS_DMA_NOWAIT | BUS_DMA_COHERENT, &dma->dma_map); if (error) { device_printf(adapter->dev, "%s: bus_dmamem_alloc(%ju) failed: %d\n", __func__, (uintmax_t)size, error); goto fail_2; } dma->dma_paddr = 0; error = bus_dmamap_load(dma->dma_tag, dma->dma_map, dma->dma_vaddr, size, igb_dmamap_cb, &dma->dma_paddr, mapflags | BUS_DMA_NOWAIT); if (error || dma->dma_paddr == 0) { device_printf(adapter->dev, "%s: bus_dmamap_load failed: %d\n", __func__, error); goto fail_3; } return (0); fail_3: bus_dmamap_unload(dma->dma_tag, dma->dma_map); fail_2: bus_dmamem_free(dma->dma_tag, dma->dma_vaddr, dma->dma_map); bus_dma_tag_destroy(dma->dma_tag); fail_0: dma->dma_tag = NULL; return (error); } static void igb_dma_free(struct adapter *adapter, struct igb_dma_alloc *dma) { if (dma->dma_tag == NULL) return; if (dma->dma_paddr != 0) { bus_dmamap_sync(dma->dma_tag, dma->dma_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(dma->dma_tag, dma->dma_map); dma->dma_paddr = 0; } if (dma->dma_vaddr != NULL) { bus_dmamem_free(dma->dma_tag, dma->dma_vaddr, dma->dma_map); dma->dma_vaddr = NULL; } bus_dma_tag_destroy(dma->dma_tag); dma->dma_tag = NULL; } /********************************************************************* * * Allocate memory for the transmit and receive rings, and then * the descriptors associated with each, called only once at attach. * **********************************************************************/ static int igb_allocate_queues(struct adapter *adapter) { device_t dev = adapter->dev; struct igb_queue *que = NULL; struct tx_ring *txr = NULL; struct rx_ring *rxr = NULL; int rsize, tsize, error = E1000_SUCCESS; int txconf = 0, rxconf = 0; /* First allocate the top level queue structs */ if (!(adapter->queues = (struct igb_queue *) malloc(sizeof(struct igb_queue) * adapter->num_queues, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate queue memory\n"); error = ENOMEM; goto fail; } /* Next allocate the TX ring struct memory */ if (!(adapter->tx_rings = (struct tx_ring *) malloc(sizeof(struct tx_ring) * adapter->num_queues, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate TX ring memory\n"); error = ENOMEM; goto tx_fail; } /* Now allocate the RX */ if (!(adapter->rx_rings = (struct rx_ring *) malloc(sizeof(struct rx_ring) * adapter->num_queues, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX ring memory\n"); error = ENOMEM; goto rx_fail; } tsize = roundup2(adapter->num_tx_desc * sizeof(union e1000_adv_tx_desc), IGB_DBA_ALIGN); /* * Now set up the TX queues, txconf is needed to handle the * possibility that things fail midcourse and we need to * undo memory gracefully */ for (int i = 0; i < adapter->num_queues; i++, txconf++) { /* Set up some basics */ txr = &adapter->tx_rings[i]; txr->adapter = adapter; txr->me = i; txr->num_desc = adapter->num_tx_desc; /* Initialize the TX lock */ snprintf(txr->mtx_name, sizeof(txr->mtx_name), "%s:tx(%d)", device_get_nameunit(dev), txr->me); mtx_init(&txr->tx_mtx, txr->mtx_name, NULL, MTX_DEF); if (igb_dma_malloc(adapter, tsize, &txr->txdma, BUS_DMA_NOWAIT)) { device_printf(dev, "Unable to allocate TX Descriptor memory\n"); error = ENOMEM; goto err_tx_desc; } txr->tx_base = (union e1000_adv_tx_desc *)txr->txdma.dma_vaddr; bzero((void *)txr->tx_base, tsize); /* Now allocate transmit buffers for the ring */ if (igb_allocate_transmit_buffers(txr)) { device_printf(dev, "Critical Failure setting up transmit buffers\n"); error = ENOMEM; goto err_tx_desc; } #ifndef IGB_LEGACY_TX /* Allocate a buf ring */ txr->br = buf_ring_alloc(igb_buf_ring_size, M_DEVBUF, M_WAITOK, &txr->tx_mtx); #endif } /* * Next the RX queues... */ rsize = roundup2(adapter->num_rx_desc * sizeof(union e1000_adv_rx_desc), IGB_DBA_ALIGN); for (int i = 0; i < adapter->num_queues; i++, rxconf++) { rxr = &adapter->rx_rings[i]; rxr->adapter = adapter; rxr->me = i; /* Initialize the RX lock */ snprintf(rxr->mtx_name, sizeof(rxr->mtx_name), "%s:rx(%d)", device_get_nameunit(dev), txr->me); mtx_init(&rxr->rx_mtx, rxr->mtx_name, NULL, MTX_DEF); if (igb_dma_malloc(adapter, rsize, &rxr->rxdma, BUS_DMA_NOWAIT)) { device_printf(dev, "Unable to allocate RxDescriptor memory\n"); error = ENOMEM; goto err_rx_desc; } rxr->rx_base = (union e1000_adv_rx_desc *)rxr->rxdma.dma_vaddr; bzero((void *)rxr->rx_base, rsize); /* Allocate receive buffers for the ring*/ if (igb_allocate_receive_buffers(rxr)) { device_printf(dev, "Critical Failure setting up receive buffers\n"); error = ENOMEM; goto err_rx_desc; } } /* ** Finally set up the queue holding structs */ for (int i = 0; i < adapter->num_queues; i++) { que = &adapter->queues[i]; que->adapter = adapter; que->txr = &adapter->tx_rings[i]; que->rxr = &adapter->rx_rings[i]; } return (0); err_rx_desc: for (rxr = adapter->rx_rings; rxconf > 0; rxr++, rxconf--) igb_dma_free(adapter, &rxr->rxdma); err_tx_desc: for (txr = adapter->tx_rings; txconf > 0; txr++, txconf--) igb_dma_free(adapter, &txr->txdma); free(adapter->rx_rings, M_DEVBUF); rx_fail: #ifndef IGB_LEGACY_TX buf_ring_free(txr->br, M_DEVBUF); #endif free(adapter->tx_rings, M_DEVBUF); tx_fail: free(adapter->queues, M_DEVBUF); fail: return (error); } /********************************************************************* * * Allocate memory for tx_buffer structures. The tx_buffer stores all * the information needed to transmit a packet on the wire. This is * called only once at attach, setup is done every reset. * **********************************************************************/ static int igb_allocate_transmit_buffers(struct tx_ring *txr) { struct adapter *adapter = txr->adapter; device_t dev = adapter->dev; struct igb_tx_buf *txbuf; int error, i; /* * Setup DMA descriptor areas. */ if ((error = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ IGB_TSO_SIZE, /* maxsize */ IGB_MAX_SCATTER, /* nsegments */ PAGE_SIZE, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txr->txtag))) { device_printf(dev,"Unable to allocate TX DMA tag\n"); goto fail; } if (!(txr->tx_buffers = (struct igb_tx_buf *) malloc(sizeof(struct igb_tx_buf) * adapter->num_tx_desc, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); error = ENOMEM; goto fail; } /* Create the descriptor buffer dma maps */ txbuf = txr->tx_buffers; for (i = 0; i < adapter->num_tx_desc; i++, txbuf++) { error = bus_dmamap_create(txr->txtag, 0, &txbuf->map); if (error != 0) { device_printf(dev, "Unable to create TX DMA map\n"); goto fail; } } return 0; fail: /* We free all, it handles case where we are in the middle */ igb_free_transmit_structures(adapter); return (error); } /********************************************************************* * * Initialize a transmit ring. * **********************************************************************/ static void igb_setup_transmit_ring(struct tx_ring *txr) { struct adapter *adapter = txr->adapter; struct igb_tx_buf *txbuf; int i; #ifdef DEV_NETMAP struct netmap_adapter *na = NA(adapter->ifp); struct netmap_slot *slot; #endif /* DEV_NETMAP */ /* Clear the old descriptor contents */ IGB_TX_LOCK(txr); #ifdef DEV_NETMAP slot = netmap_reset(na, NR_TX, txr->me, 0); #endif /* DEV_NETMAP */ bzero((void *)txr->tx_base, (sizeof(union e1000_adv_tx_desc)) * adapter->num_tx_desc); /* Reset indices */ txr->next_avail_desc = 0; txr->next_to_clean = 0; /* Free any existing tx buffers. */ txbuf = txr->tx_buffers; for (i = 0; i < adapter->num_tx_desc; i++, txbuf++) { if (txbuf->m_head != NULL) { bus_dmamap_sync(txr->txtag, txbuf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->txtag, txbuf->map); m_freem(txbuf->m_head); txbuf->m_head = NULL; } #ifdef DEV_NETMAP if (slot) { int si = netmap_idx_n2k(&na->tx_rings[txr->me], i); /* no need to set the address */ netmap_load_map(na, txr->txtag, txbuf->map, NMB(na, slot + si)); } #endif /* DEV_NETMAP */ /* clear the watch index */ txbuf->eop = NULL; } /* Set number of descriptors available */ txr->tx_avail = adapter->num_tx_desc; bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); IGB_TX_UNLOCK(txr); } /********************************************************************* * * Initialize all transmit rings. * **********************************************************************/ static void igb_setup_transmit_structures(struct adapter *adapter) { struct tx_ring *txr = adapter->tx_rings; for (int i = 0; i < adapter->num_queues; i++, txr++) igb_setup_transmit_ring(txr); return; } /********************************************************************* * * Enable transmit unit. * **********************************************************************/ static void igb_initialize_transmit_units(struct adapter *adapter) { struct tx_ring *txr = adapter->tx_rings; struct e1000_hw *hw = &adapter->hw; u32 tctl, txdctl; INIT_DEBUGOUT("igb_initialize_transmit_units: begin"); tctl = txdctl = 0; /* Setup the Tx Descriptor Rings */ for (int i = 0; i < adapter->num_queues; i++, txr++) { u64 bus_addr = txr->txdma.dma_paddr; E1000_WRITE_REG(hw, E1000_TDLEN(i), adapter->num_tx_desc * sizeof(struct e1000_tx_desc)); E1000_WRITE_REG(hw, E1000_TDBAH(i), (uint32_t)(bus_addr >> 32)); E1000_WRITE_REG(hw, E1000_TDBAL(i), (uint32_t)bus_addr); /* Setup the HW Tx Head and Tail descriptor pointers */ E1000_WRITE_REG(hw, E1000_TDT(i), 0); E1000_WRITE_REG(hw, E1000_TDH(i), 0); HW_DEBUGOUT2("Base = %x, Length = %x\n", E1000_READ_REG(hw, E1000_TDBAL(i)), E1000_READ_REG(hw, E1000_TDLEN(i))); txr->queue_status = IGB_QUEUE_IDLE; txdctl |= IGB_TX_PTHRESH; txdctl |= IGB_TX_HTHRESH << 8; txdctl |= IGB_TX_WTHRESH << 16; txdctl |= E1000_TXDCTL_QUEUE_ENABLE; E1000_WRITE_REG(hw, E1000_TXDCTL(i), txdctl); } if (adapter->vf_ifp) return; e1000_config_collision_dist(hw); /* Program the Transmit Control Register */ tctl = E1000_READ_REG(hw, E1000_TCTL); tctl &= ~E1000_TCTL_CT; tctl |= (E1000_TCTL_PSP | E1000_TCTL_RTLC | E1000_TCTL_EN | (E1000_COLLISION_THRESHOLD << E1000_CT_SHIFT)); /* This write will effectively turn on the transmit unit. */ E1000_WRITE_REG(hw, E1000_TCTL, tctl); } /********************************************************************* * * Free all transmit rings. * **********************************************************************/ static void igb_free_transmit_structures(struct adapter *adapter) { struct tx_ring *txr = adapter->tx_rings; for (int i = 0; i < adapter->num_queues; i++, txr++) { IGB_TX_LOCK(txr); igb_free_transmit_buffers(txr); igb_dma_free(adapter, &txr->txdma); IGB_TX_UNLOCK(txr); IGB_TX_LOCK_DESTROY(txr); } free(adapter->tx_rings, M_DEVBUF); } /********************************************************************* * * Free transmit ring related data structures. * **********************************************************************/ static void igb_free_transmit_buffers(struct tx_ring *txr) { struct adapter *adapter = txr->adapter; struct igb_tx_buf *tx_buffer; int i; INIT_DEBUGOUT("free_transmit_ring: begin"); if (txr->tx_buffers == NULL) return; tx_buffer = txr->tx_buffers; for (i = 0; i < adapter->num_tx_desc; i++, tx_buffer++) { if (tx_buffer->m_head != NULL) { bus_dmamap_sync(txr->txtag, tx_buffer->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->txtag, tx_buffer->map); m_freem(tx_buffer->m_head); tx_buffer->m_head = NULL; if (tx_buffer->map != NULL) { bus_dmamap_destroy(txr->txtag, tx_buffer->map); tx_buffer->map = NULL; } } else if (tx_buffer->map != NULL) { bus_dmamap_unload(txr->txtag, tx_buffer->map); bus_dmamap_destroy(txr->txtag, tx_buffer->map); tx_buffer->map = NULL; } } #ifndef IGB_LEGACY_TX if (txr->br != NULL) buf_ring_free(txr->br, M_DEVBUF); #endif if (txr->tx_buffers != NULL) { free(txr->tx_buffers, M_DEVBUF); txr->tx_buffers = NULL; } if (txr->txtag != NULL) { bus_dma_tag_destroy(txr->txtag); txr->txtag = NULL; } return; } /********************************************************************** * * Setup work for hardware segmentation offload (TSO) on * adapters using advanced tx descriptors * **********************************************************************/ static int igb_tso_setup(struct tx_ring *txr, struct mbuf *mp, u32 *cmd_type_len, u32 *olinfo_status) { struct adapter *adapter = txr->adapter; struct e1000_adv_tx_context_desc *TXD; u32 vlan_macip_lens = 0, type_tucmd_mlhl = 0; u32 mss_l4len_idx = 0, paylen; u16 vtag = 0, eh_type; int ctxd, ehdrlen, ip_hlen, tcp_hlen; struct ether_vlan_header *eh; #ifdef INET6 struct ip6_hdr *ip6; #endif #ifdef INET struct ip *ip; #endif struct tcphdr *th; /* * Determine where frame payload starts. * Jump over vlan headers if already present */ eh = mtod(mp, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; eh_type = eh->evl_proto; } else { ehdrlen = ETHER_HDR_LEN; eh_type = eh->evl_encap_proto; } switch (ntohs(eh_type)) { #ifdef INET6 case ETHERTYPE_IPV6: ip6 = (struct ip6_hdr *)(mp->m_data + ehdrlen); /* XXX-BZ For now we do not pretend to support ext. hdrs. */ if (ip6->ip6_nxt != IPPROTO_TCP) return (ENXIO); ip_hlen = sizeof(struct ip6_hdr); ip6 = (struct ip6_hdr *)(mp->m_data + ehdrlen); th = (struct tcphdr *)((caddr_t)ip6 + ip_hlen); th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0); type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV6; break; #endif #ifdef INET case ETHERTYPE_IP: ip = (struct ip *)(mp->m_data + ehdrlen); if (ip->ip_p != IPPROTO_TCP) return (ENXIO); ip->ip_sum = 0; ip_hlen = ip->ip_hl << 2; th = (struct tcphdr *)((caddr_t)ip + ip_hlen); th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV4; /* Tell transmit desc to also do IPv4 checksum. */ *olinfo_status |= E1000_TXD_POPTS_IXSM << 8; break; #endif default: panic("%s: CSUM_TSO but no supported IP version (0x%04x)", __func__, ntohs(eh_type)); break; } ctxd = txr->next_avail_desc; TXD = (struct e1000_adv_tx_context_desc *) &txr->tx_base[ctxd]; tcp_hlen = th->th_off << 2; /* This is used in the transmit desc in encap */ paylen = mp->m_pkthdr.len - ehdrlen - ip_hlen - tcp_hlen; /* VLAN MACLEN IPLEN */ if (mp->m_flags & M_VLANTAG) { vtag = htole16(mp->m_pkthdr.ether_vtag); vlan_macip_lens |= (vtag << E1000_ADVTXD_VLAN_SHIFT); } vlan_macip_lens |= ehdrlen << E1000_ADVTXD_MACLEN_SHIFT; vlan_macip_lens |= ip_hlen; TXD->vlan_macip_lens = htole32(vlan_macip_lens); /* ADV DTYPE TUCMD */ type_tucmd_mlhl |= E1000_ADVTXD_DCMD_DEXT | E1000_ADVTXD_DTYP_CTXT; type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_TCP; TXD->type_tucmd_mlhl = htole32(type_tucmd_mlhl); /* MSS L4LEN IDX */ mss_l4len_idx |= (mp->m_pkthdr.tso_segsz << E1000_ADVTXD_MSS_SHIFT); mss_l4len_idx |= (tcp_hlen << E1000_ADVTXD_L4LEN_SHIFT); /* 82575 needs the queue index added */ if (adapter->hw.mac.type == e1000_82575) mss_l4len_idx |= txr->me << 4; TXD->mss_l4len_idx = htole32(mss_l4len_idx); TXD->seqnum_seed = htole32(0); if (++ctxd == txr->num_desc) ctxd = 0; txr->tx_avail--; txr->next_avail_desc = ctxd; *cmd_type_len |= E1000_ADVTXD_DCMD_TSE; *olinfo_status |= E1000_TXD_POPTS_TXSM << 8; *olinfo_status |= paylen << E1000_ADVTXD_PAYLEN_SHIFT; ++txr->tso_tx; return (0); } /********************************************************************* * * Advanced Context Descriptor setup for VLAN, CSUM or TSO * **********************************************************************/ static int igb_tx_ctx_setup(struct tx_ring *txr, struct mbuf *mp, u32 *cmd_type_len, u32 *olinfo_status) { struct e1000_adv_tx_context_desc *TXD; struct adapter *adapter = txr->adapter; struct ether_vlan_header *eh; struct ip *ip; struct ip6_hdr *ip6; u32 vlan_macip_lens = 0, type_tucmd_mlhl = 0, mss_l4len_idx = 0; int ehdrlen, ip_hlen = 0; u16 etype; u8 ipproto = 0; int offload = TRUE; int ctxd = txr->next_avail_desc; u16 vtag = 0; /* First check if TSO is to be used */ if (mp->m_pkthdr.csum_flags & CSUM_TSO) return (igb_tso_setup(txr, mp, cmd_type_len, olinfo_status)); if ((mp->m_pkthdr.csum_flags & CSUM_OFFLOAD) == 0) offload = FALSE; /* Indicate the whole packet as payload when not doing TSO */ *olinfo_status |= mp->m_pkthdr.len << E1000_ADVTXD_PAYLEN_SHIFT; /* Now ready a context descriptor */ TXD = (struct e1000_adv_tx_context_desc *) &txr->tx_base[ctxd]; /* ** In advanced descriptors the vlan tag must ** be placed into the context descriptor. Hence ** we need to make one even if not doing offloads. */ if (mp->m_flags & M_VLANTAG) { vtag = htole16(mp->m_pkthdr.ether_vtag); vlan_macip_lens |= (vtag << E1000_ADVTXD_VLAN_SHIFT); } else if (offload == FALSE) /* ... no offload to do */ return (0); /* * Determine where frame payload starts. * Jump over vlan headers if already present, * helpful for QinQ too. */ eh = mtod(mp, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { etype = ntohs(eh->evl_proto); ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; } else { etype = ntohs(eh->evl_encap_proto); ehdrlen = ETHER_HDR_LEN; } /* Set the ether header length */ vlan_macip_lens |= ehdrlen << E1000_ADVTXD_MACLEN_SHIFT; switch (etype) { case ETHERTYPE_IP: ip = (struct ip *)(mp->m_data + ehdrlen); ip_hlen = ip->ip_hl << 2; ipproto = ip->ip_p; type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV4; break; case ETHERTYPE_IPV6: ip6 = (struct ip6_hdr *)(mp->m_data + ehdrlen); ip_hlen = sizeof(struct ip6_hdr); /* XXX-BZ this will go badly in case of ext hdrs. */ ipproto = ip6->ip6_nxt; type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_IPV6; break; default: offload = FALSE; break; } vlan_macip_lens |= ip_hlen; type_tucmd_mlhl |= E1000_ADVTXD_DCMD_DEXT | E1000_ADVTXD_DTYP_CTXT; switch (ipproto) { case IPPROTO_TCP: #if __FreeBSD_version >= 1000000 if (mp->m_pkthdr.csum_flags & (CSUM_IP_TCP | CSUM_IP6_TCP)) #else if (mp->m_pkthdr.csum_flags & CSUM_TCP) #endif type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_TCP; break; case IPPROTO_UDP: #if __FreeBSD_version >= 1000000 if (mp->m_pkthdr.csum_flags & (CSUM_IP_UDP | CSUM_IP6_UDP)) #else if (mp->m_pkthdr.csum_flags & CSUM_UDP) #endif type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_UDP; break; #if __FreeBSD_version >= 800000 case IPPROTO_SCTP: #if __FreeBSD_version >= 1000000 if (mp->m_pkthdr.csum_flags & (CSUM_IP_SCTP | CSUM_IP6_SCTP)) #else if (mp->m_pkthdr.csum_flags & CSUM_SCTP) #endif type_tucmd_mlhl |= E1000_ADVTXD_TUCMD_L4T_SCTP; break; #endif default: offload = FALSE; break; } if (offload) /* For the TX descriptor setup */ *olinfo_status |= E1000_TXD_POPTS_TXSM << 8; /* 82575 needs the queue index added */ if (adapter->hw.mac.type == e1000_82575) mss_l4len_idx = txr->me << 4; /* Now copy bits into descriptor */ TXD->vlan_macip_lens = htole32(vlan_macip_lens); TXD->type_tucmd_mlhl = htole32(type_tucmd_mlhl); TXD->seqnum_seed = htole32(0); TXD->mss_l4len_idx = htole32(mss_l4len_idx); /* We've consumed the first desc, adjust counters */ if (++ctxd == txr->num_desc) ctxd = 0; txr->next_avail_desc = ctxd; --txr->tx_avail; return (0); } /********************************************************************** * * Examine each tx_buffer in the used queue. If the hardware is done * processing the packet then free associated resources. The * tx_buffer is put back on the free queue. * * TRUE return means there's work in the ring to clean, FALSE its empty. **********************************************************************/ static bool igb_txeof(struct tx_ring *txr) { struct adapter *adapter = txr->adapter; #ifdef DEV_NETMAP struct ifnet *ifp = adapter->ifp; #endif /* DEV_NETMAP */ u32 work, processed = 0; int limit = adapter->tx_process_limit; struct igb_tx_buf *buf; union e1000_adv_tx_desc *txd; mtx_assert(&txr->tx_mtx, MA_OWNED); #ifdef DEV_NETMAP if (netmap_tx_irq(ifp, txr->me)) return (FALSE); #endif /* DEV_NETMAP */ if (txr->tx_avail == txr->num_desc) { txr->queue_status = IGB_QUEUE_IDLE; return FALSE; } /* Get work starting point */ work = txr->next_to_clean; buf = &txr->tx_buffers[work]; txd = &txr->tx_base[work]; work -= txr->num_desc; /* The distance to ring end */ bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); do { union e1000_adv_tx_desc *eop = buf->eop; if (eop == NULL) /* No work */ break; if ((eop->wb.status & E1000_TXD_STAT_DD) == 0) break; /* I/O not complete */ if (buf->m_head) { txr->bytes += buf->m_head->m_pkthdr.len; bus_dmamap_sync(txr->txtag, buf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->txtag, buf->map); m_freem(buf->m_head); buf->m_head = NULL; } buf->eop = NULL; ++txr->tx_avail; /* We clean the range if multi segment */ while (txd != eop) { ++txd; ++buf; ++work; /* wrap the ring? */ if (__predict_false(!work)) { work -= txr->num_desc; buf = txr->tx_buffers; txd = txr->tx_base; } if (buf->m_head) { txr->bytes += buf->m_head->m_pkthdr.len; bus_dmamap_sync(txr->txtag, buf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->txtag, buf->map); m_freem(buf->m_head); buf->m_head = NULL; } ++txr->tx_avail; buf->eop = NULL; } ++txr->packets; ++processed; txr->watchdog_time = ticks; /* Try the next packet */ ++txd; ++buf; ++work; /* reset with a wrap */ if (__predict_false(!work)) { work -= txr->num_desc; buf = txr->tx_buffers; txd = txr->tx_base; } prefetch(txd); } while (__predict_true(--limit)); bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); work += txr->num_desc; txr->next_to_clean = work; /* ** Watchdog calculation, we know there's ** work outstanding or the first return ** would have been taken, so none processed ** for too long indicates a hang. */ if ((!processed) && ((ticks - txr->watchdog_time) > IGB_WATCHDOG)) txr->queue_status |= IGB_QUEUE_HUNG; if (txr->tx_avail >= IGB_QUEUE_THRESHOLD) txr->queue_status &= ~IGB_QUEUE_DEPLETED; if (txr->tx_avail == txr->num_desc) { txr->queue_status = IGB_QUEUE_IDLE; return (FALSE); } return (TRUE); } /********************************************************************* * * Refresh mbuf buffers for RX descriptor rings * - now keeps its own state so discards due to resource * exhaustion are unnecessary, if an mbuf cannot be obtained * it just returns, keeping its placeholder, thus it can simply * be recalled to try again. * **********************************************************************/ static void igb_refresh_mbufs(struct rx_ring *rxr, int limit) { struct adapter *adapter = rxr->adapter; bus_dma_segment_t hseg[1]; bus_dma_segment_t pseg[1]; struct igb_rx_buf *rxbuf; struct mbuf *mh, *mp; int i, j, nsegs, error; bool refreshed = FALSE; i = j = rxr->next_to_refresh; /* ** Get one descriptor beyond ** our work mark to control ** the loop. */ if (++j == adapter->num_rx_desc) j = 0; while (j != limit) { rxbuf = &rxr->rx_buffers[i]; /* No hdr mbuf used with header split off */ if (rxr->hdr_split == FALSE) goto no_split; if (rxbuf->m_head == NULL) { mh = m_gethdr(M_NOWAIT, MT_DATA); if (mh == NULL) goto update; } else mh = rxbuf->m_head; mh->m_pkthdr.len = mh->m_len = MHLEN; mh->m_len = MHLEN; mh->m_flags |= M_PKTHDR; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->htag, rxbuf->hmap, mh, hseg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) { printf("Refresh mbufs: hdr dmamap load" " failure - %d\n", error); m_free(mh); rxbuf->m_head = NULL; goto update; } rxbuf->m_head = mh; bus_dmamap_sync(rxr->htag, rxbuf->hmap, BUS_DMASYNC_PREREAD); rxr->rx_base[i].read.hdr_addr = htole64(hseg[0].ds_addr); no_split: if (rxbuf->m_pack == NULL) { mp = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, adapter->rx_mbuf_sz); if (mp == NULL) goto update; } else mp = rxbuf->m_pack; mp->m_pkthdr.len = mp->m_len = adapter->rx_mbuf_sz; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->ptag, rxbuf->pmap, mp, pseg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) { printf("Refresh mbufs: payload dmamap load" " failure - %d\n", error); m_free(mp); rxbuf->m_pack = NULL; goto update; } rxbuf->m_pack = mp; bus_dmamap_sync(rxr->ptag, rxbuf->pmap, BUS_DMASYNC_PREREAD); rxr->rx_base[i].read.pkt_addr = htole64(pseg[0].ds_addr); refreshed = TRUE; /* I feel wefreshed :) */ i = j; /* our next is precalculated */ rxr->next_to_refresh = i; if (++j == adapter->num_rx_desc) j = 0; } update: if (refreshed) /* update tail */ E1000_WRITE_REG(&adapter->hw, E1000_RDT(rxr->me), rxr->next_to_refresh); return; } /********************************************************************* * * Allocate memory for rx_buffer structures. Since we use one * rx_buffer per received packet, the maximum number of rx_buffer's * that we'll need is equal to the number of receive descriptors * that we've allocated. * **********************************************************************/ static int igb_allocate_receive_buffers(struct rx_ring *rxr) { struct adapter *adapter = rxr->adapter; device_t dev = adapter->dev; struct igb_rx_buf *rxbuf; int i, bsize, error; bsize = sizeof(struct igb_rx_buf) * adapter->num_rx_desc; if (!(rxr->rx_buffers = (struct igb_rx_buf *) malloc(bsize, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate rx_buffer memory\n"); error = ENOMEM; goto fail; } if ((error = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MSIZE, /* maxsize */ 1, /* nsegments */ MSIZE, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &rxr->htag))) { device_printf(dev, "Unable to create RX DMA tag\n"); goto fail; } if ((error = bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MJUM9BYTES, /* maxsize */ 1, /* nsegments */ MJUM9BYTES, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &rxr->ptag))) { device_printf(dev, "Unable to create RX payload DMA tag\n"); goto fail; } for (i = 0; i < adapter->num_rx_desc; i++) { rxbuf = &rxr->rx_buffers[i]; error = bus_dmamap_create(rxr->htag, 0, &rxbuf->hmap); if (error) { device_printf(dev, "Unable to create RX head DMA maps\n"); goto fail; } error = bus_dmamap_create(rxr->ptag, 0, &rxbuf->pmap); if (error) { device_printf(dev, "Unable to create RX packet DMA maps\n"); goto fail; } } return (0); fail: /* Frees all, but can handle partial completion */ igb_free_receive_structures(adapter); return (error); } static void igb_free_receive_ring(struct rx_ring *rxr) { struct adapter *adapter = rxr->adapter; struct igb_rx_buf *rxbuf; for (int i = 0; i < adapter->num_rx_desc; i++) { rxbuf = &rxr->rx_buffers[i]; if (rxbuf->m_head != NULL) { bus_dmamap_sync(rxr->htag, rxbuf->hmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->htag, rxbuf->hmap); rxbuf->m_head->m_flags |= M_PKTHDR; m_freem(rxbuf->m_head); } if (rxbuf->m_pack != NULL) { bus_dmamap_sync(rxr->ptag, rxbuf->pmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->ptag, rxbuf->pmap); rxbuf->m_pack->m_flags |= M_PKTHDR; m_freem(rxbuf->m_pack); } rxbuf->m_head = NULL; rxbuf->m_pack = NULL; } } /********************************************************************* * * Initialize a receive ring and its buffers. * **********************************************************************/ static int igb_setup_receive_ring(struct rx_ring *rxr) { struct adapter *adapter; struct ifnet *ifp; device_t dev; struct igb_rx_buf *rxbuf; bus_dma_segment_t pseg[1], hseg[1]; struct lro_ctrl *lro = &rxr->lro; int rsize, nsegs, error = 0; #ifdef DEV_NETMAP struct netmap_adapter *na = NA(rxr->adapter->ifp); struct netmap_slot *slot; #endif /* DEV_NETMAP */ adapter = rxr->adapter; dev = adapter->dev; ifp = adapter->ifp; /* Clear the ring contents */ IGB_RX_LOCK(rxr); #ifdef DEV_NETMAP slot = netmap_reset(na, NR_RX, rxr->me, 0); #endif /* DEV_NETMAP */ rsize = roundup2(adapter->num_rx_desc * sizeof(union e1000_adv_rx_desc), IGB_DBA_ALIGN); bzero((void *)rxr->rx_base, rsize); /* ** Free current RX buffer structures and their mbufs */ igb_free_receive_ring(rxr); /* Configure for header split? */ if (igb_header_split) rxr->hdr_split = TRUE; /* Now replenish the ring mbufs */ for (int j = 0; j < adapter->num_rx_desc; ++j) { struct mbuf *mh, *mp; rxbuf = &rxr->rx_buffers[j]; #ifdef DEV_NETMAP if (slot) { /* slot sj is mapped to the j-th NIC-ring entry */ int sj = netmap_idx_n2k(&na->rx_rings[rxr->me], j); uint64_t paddr; void *addr; addr = PNMB(na, slot + sj, &paddr); netmap_load_map(na, rxr->ptag, rxbuf->pmap, addr); /* Update descriptor */ rxr->rx_base[j].read.pkt_addr = htole64(paddr); continue; } #endif /* DEV_NETMAP */ if (rxr->hdr_split == FALSE) goto skip_head; /* First the header */ rxbuf->m_head = m_gethdr(M_NOWAIT, MT_DATA); if (rxbuf->m_head == NULL) { error = ENOBUFS; goto fail; } m_adj(rxbuf->m_head, ETHER_ALIGN); mh = rxbuf->m_head; mh->m_len = mh->m_pkthdr.len = MHLEN; mh->m_flags |= M_PKTHDR; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->htag, rxbuf->hmap, rxbuf->m_head, hseg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) /* Nothing elegant to do here */ goto fail; bus_dmamap_sync(rxr->htag, rxbuf->hmap, BUS_DMASYNC_PREREAD); /* Update descriptor */ rxr->rx_base[j].read.hdr_addr = htole64(hseg[0].ds_addr); skip_head: /* Now the payload cluster */ rxbuf->m_pack = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, adapter->rx_mbuf_sz); if (rxbuf->m_pack == NULL) { error = ENOBUFS; goto fail; } mp = rxbuf->m_pack; mp->m_pkthdr.len = mp->m_len = adapter->rx_mbuf_sz; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->ptag, rxbuf->pmap, mp, pseg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) goto fail; bus_dmamap_sync(rxr->ptag, rxbuf->pmap, BUS_DMASYNC_PREREAD); /* Update descriptor */ rxr->rx_base[j].read.pkt_addr = htole64(pseg[0].ds_addr); } /* Setup our descriptor indices */ rxr->next_to_check = 0; rxr->next_to_refresh = adapter->num_rx_desc - 1; rxr->lro_enabled = FALSE; rxr->rx_split_packets = 0; rxr->rx_bytes = 0; rxr->fmp = NULL; rxr->lmp = NULL; bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* ** Now set up the LRO interface, we ** also only do head split when LRO ** is enabled, since so often they ** are undesirable in similar setups. */ if (ifp->if_capenable & IFCAP_LRO) { error = tcp_lro_init(lro); if (error) { device_printf(dev, "LRO Initialization failed!\n"); goto fail; } INIT_DEBUGOUT("RX LRO Initialized\n"); rxr->lro_enabled = TRUE; lro->ifp = adapter->ifp; } IGB_RX_UNLOCK(rxr); return (0); fail: igb_free_receive_ring(rxr); IGB_RX_UNLOCK(rxr); return (error); } /********************************************************************* * * Initialize all receive rings. * **********************************************************************/ static int igb_setup_receive_structures(struct adapter *adapter) { struct rx_ring *rxr = adapter->rx_rings; int i; for (i = 0; i < adapter->num_queues; i++, rxr++) if (igb_setup_receive_ring(rxr)) goto fail; return (0); fail: /* * Free RX buffers allocated so far, we will only handle * the rings that completed, the failing case will have * cleaned up for itself. 'i' is the endpoint. */ for (int j = 0; j < i; ++j) { rxr = &adapter->rx_rings[j]; IGB_RX_LOCK(rxr); igb_free_receive_ring(rxr); IGB_RX_UNLOCK(rxr); } return (ENOBUFS); } /* * Initialise the RSS mapping for NICs that support multiple transmit/ * receive rings. */ static void igb_initialise_rss_mapping(struct adapter *adapter) { struct e1000_hw *hw = &adapter->hw; int i; int queue_id; u32 reta; u32 rss_key[10], mrqc, shift = 0; /* XXX? */ if (adapter->hw.mac.type == e1000_82575) shift = 6; /* * The redirection table controls which destination * queue each bucket redirects traffic to. * Each DWORD represents four queues, with the LSB * being the first queue in the DWORD. * * This just allocates buckets to queues using round-robin * allocation. * * NOTE: It Just Happens to line up with the default * RSS allocation method. */ /* Warning FM follows */ reta = 0; for (i = 0; i < 128; i++) { #ifdef RSS queue_id = rss_get_indirection_to_bucket(i); /* * If we have more queues than buckets, we'll * end up mapping buckets to a subset of the * queues. * * If we have more buckets than queues, we'll * end up instead assigning multiple buckets * to queues. * * Both are suboptimal, but we need to handle * the case so we don't go out of bounds * indexing arrays and such. */ queue_id = queue_id % adapter->num_queues; #else queue_id = (i % adapter->num_queues); #endif /* Adjust if required */ queue_id = queue_id << shift; /* * The low 8 bits are for hash value (n+0); * The next 8 bits are for hash value (n+1), etc. */ reta = reta >> 8; reta = reta | ( ((uint32_t) queue_id) << 24); if ((i & 3) == 3) { E1000_WRITE_REG(hw, E1000_RETA(i >> 2), reta); reta = 0; } } /* Now fill in hash table */ /* * MRQC: Multiple Receive Queues Command * Set queuing to RSS control, number depends on the device. */ mrqc = E1000_MRQC_ENABLE_RSS_8Q; #ifdef RSS /* XXX ew typecasting */ rss_getkey((uint8_t *) &rss_key); #else arc4rand(&rss_key, sizeof(rss_key), 0); #endif for (i = 0; i < 10; i++) E1000_WRITE_REG_ARRAY(hw, E1000_RSSRK(0), i, rss_key[i]); /* * Configure the RSS fields to hash upon. */ mrqc |= (E1000_MRQC_RSS_FIELD_IPV4 | E1000_MRQC_RSS_FIELD_IPV4_TCP); mrqc |= (E1000_MRQC_RSS_FIELD_IPV6 | E1000_MRQC_RSS_FIELD_IPV6_TCP); mrqc |=( E1000_MRQC_RSS_FIELD_IPV4_UDP | E1000_MRQC_RSS_FIELD_IPV6_UDP); mrqc |=( E1000_MRQC_RSS_FIELD_IPV6_UDP_EX | E1000_MRQC_RSS_FIELD_IPV6_TCP_EX); E1000_WRITE_REG(hw, E1000_MRQC, mrqc); } /********************************************************************* * * Enable receive unit. * **********************************************************************/ static void igb_initialize_receive_units(struct adapter *adapter) { struct rx_ring *rxr = adapter->rx_rings; struct ifnet *ifp = adapter->ifp; struct e1000_hw *hw = &adapter->hw; u32 rctl, rxcsum, psize, srrctl = 0; INIT_DEBUGOUT("igb_initialize_receive_unit: begin"); /* * Make sure receives are disabled while setting * up the descriptor ring */ rctl = E1000_READ_REG(hw, E1000_RCTL); E1000_WRITE_REG(hw, E1000_RCTL, rctl & ~E1000_RCTL_EN); /* ** Set up for header split */ if (igb_header_split) { /* Use a standard mbuf for the header */ srrctl |= IGB_HDR_BUF << E1000_SRRCTL_BSIZEHDRSIZE_SHIFT; srrctl |= E1000_SRRCTL_DESCTYPE_HDR_SPLIT_ALWAYS; } else srrctl |= E1000_SRRCTL_DESCTYPE_ADV_ONEBUF; /* ** Set up for jumbo frames */ if (ifp->if_mtu > ETHERMTU) { rctl |= E1000_RCTL_LPE; if (adapter->rx_mbuf_sz == MJUMPAGESIZE) { srrctl |= 4096 >> E1000_SRRCTL_BSIZEPKT_SHIFT; rctl |= E1000_RCTL_SZ_4096 | E1000_RCTL_BSEX; } else if (adapter->rx_mbuf_sz > MJUMPAGESIZE) { srrctl |= 8192 >> E1000_SRRCTL_BSIZEPKT_SHIFT; rctl |= E1000_RCTL_SZ_8192 | E1000_RCTL_BSEX; } /* Set maximum packet len */ psize = adapter->max_frame_size; /* are we on a vlan? */ if (adapter->ifp->if_vlantrunk != NULL) psize += VLAN_TAG_SIZE; E1000_WRITE_REG(&adapter->hw, E1000_RLPML, psize); } else { rctl &= ~E1000_RCTL_LPE; srrctl |= 2048 >> E1000_SRRCTL_BSIZEPKT_SHIFT; rctl |= E1000_RCTL_SZ_2048; } /* * If TX flow control is disabled and there's >1 queue defined, * enable DROP. * * This drops frames rather than hanging the RX MAC for all queues. */ if ((adapter->num_queues > 1) && (adapter->fc == e1000_fc_none || adapter->fc == e1000_fc_rx_pause)) { srrctl |= E1000_SRRCTL_DROP_EN; } /* Setup the Base and Length of the Rx Descriptor Rings */ for (int i = 0; i < adapter->num_queues; i++, rxr++) { u64 bus_addr = rxr->rxdma.dma_paddr; u32 rxdctl; E1000_WRITE_REG(hw, E1000_RDLEN(i), adapter->num_rx_desc * sizeof(struct e1000_rx_desc)); E1000_WRITE_REG(hw, E1000_RDBAH(i), (uint32_t)(bus_addr >> 32)); E1000_WRITE_REG(hw, E1000_RDBAL(i), (uint32_t)bus_addr); E1000_WRITE_REG(hw, E1000_SRRCTL(i), srrctl); /* Enable this Queue */ rxdctl = E1000_READ_REG(hw, E1000_RXDCTL(i)); rxdctl |= E1000_RXDCTL_QUEUE_ENABLE; rxdctl &= 0xFFF00000; rxdctl |= IGB_RX_PTHRESH; rxdctl |= IGB_RX_HTHRESH << 8; rxdctl |= IGB_RX_WTHRESH << 16; E1000_WRITE_REG(hw, E1000_RXDCTL(i), rxdctl); } /* ** Setup for RX MultiQueue */ rxcsum = E1000_READ_REG(hw, E1000_RXCSUM); if (adapter->num_queues >1) { /* rss setup */ igb_initialise_rss_mapping(adapter); /* ** NOTE: Receive Full-Packet Checksum Offload ** is mutually exclusive with Multiqueue. However ** this is not the same as TCP/IP checksums which ** still work. */ rxcsum |= E1000_RXCSUM_PCSD; #if __FreeBSD_version >= 800000 /* For SCTP Offload */ if ((hw->mac.type != e1000_82575) && (ifp->if_capenable & IFCAP_RXCSUM)) rxcsum |= E1000_RXCSUM_CRCOFL; #endif } else { /* Non RSS setup */ if (ifp->if_capenable & IFCAP_RXCSUM) { rxcsum |= E1000_RXCSUM_IPPCSE; #if __FreeBSD_version >= 800000 if (adapter->hw.mac.type != e1000_82575) rxcsum |= E1000_RXCSUM_CRCOFL; #endif } else rxcsum &= ~E1000_RXCSUM_TUOFL; } E1000_WRITE_REG(hw, E1000_RXCSUM, rxcsum); /* Setup the Receive Control Register */ rctl &= ~(3 << E1000_RCTL_MO_SHIFT); rctl |= E1000_RCTL_EN | E1000_RCTL_BAM | E1000_RCTL_LBM_NO | E1000_RCTL_RDMTS_HALF | (hw->mac.mc_filter_type << E1000_RCTL_MO_SHIFT); /* Strip CRC bytes. */ rctl |= E1000_RCTL_SECRC; /* Make sure VLAN Filters are off */ rctl &= ~E1000_RCTL_VFE; /* Don't store bad packets */ rctl &= ~E1000_RCTL_SBP; /* Enable Receives */ E1000_WRITE_REG(hw, E1000_RCTL, rctl); /* * Setup the HW Rx Head and Tail Descriptor Pointers * - needs to be after enable */ for (int i = 0; i < adapter->num_queues; i++) { rxr = &adapter->rx_rings[i]; E1000_WRITE_REG(hw, E1000_RDH(i), rxr->next_to_check); #ifdef DEV_NETMAP /* * an init() while a netmap client is active must * preserve the rx buffers passed to userspace. * In this driver it means we adjust RDT to * something different from next_to_refresh * (which is not used in netmap mode). */ if (ifp->if_capenable & IFCAP_NETMAP) { struct netmap_adapter *na = NA(adapter->ifp); struct netmap_kring *kring = &na->rx_rings[i]; int t = rxr->next_to_refresh - nm_kr_rxspace(kring); if (t >= adapter->num_rx_desc) t -= adapter->num_rx_desc; else if (t < 0) t += adapter->num_rx_desc; E1000_WRITE_REG(hw, E1000_RDT(i), t); } else #endif /* DEV_NETMAP */ E1000_WRITE_REG(hw, E1000_RDT(i), rxr->next_to_refresh); } return; } /********************************************************************* * * Free receive rings. * **********************************************************************/ static void igb_free_receive_structures(struct adapter *adapter) { struct rx_ring *rxr = adapter->rx_rings; for (int i = 0; i < adapter->num_queues; i++, rxr++) { struct lro_ctrl *lro = &rxr->lro; igb_free_receive_buffers(rxr); tcp_lro_free(lro); igb_dma_free(adapter, &rxr->rxdma); } free(adapter->rx_rings, M_DEVBUF); } /********************************************************************* * * Free receive ring data structures. * **********************************************************************/ static void igb_free_receive_buffers(struct rx_ring *rxr) { struct adapter *adapter = rxr->adapter; struct igb_rx_buf *rxbuf; int i; INIT_DEBUGOUT("free_receive_structures: begin"); /* Cleanup any existing buffers */ if (rxr->rx_buffers != NULL) { for (i = 0; i < adapter->num_rx_desc; i++) { rxbuf = &rxr->rx_buffers[i]; if (rxbuf->m_head != NULL) { bus_dmamap_sync(rxr->htag, rxbuf->hmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->htag, rxbuf->hmap); rxbuf->m_head->m_flags |= M_PKTHDR; m_freem(rxbuf->m_head); } if (rxbuf->m_pack != NULL) { bus_dmamap_sync(rxr->ptag, rxbuf->pmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->ptag, rxbuf->pmap); rxbuf->m_pack->m_flags |= M_PKTHDR; m_freem(rxbuf->m_pack); } rxbuf->m_head = NULL; rxbuf->m_pack = NULL; if (rxbuf->hmap != NULL) { bus_dmamap_destroy(rxr->htag, rxbuf->hmap); rxbuf->hmap = NULL; } if (rxbuf->pmap != NULL) { bus_dmamap_destroy(rxr->ptag, rxbuf->pmap); rxbuf->pmap = NULL; } } if (rxr->rx_buffers != NULL) { free(rxr->rx_buffers, M_DEVBUF); rxr->rx_buffers = NULL; } } if (rxr->htag != NULL) { bus_dma_tag_destroy(rxr->htag); rxr->htag = NULL; } if (rxr->ptag != NULL) { bus_dma_tag_destroy(rxr->ptag); rxr->ptag = NULL; } } static __inline void igb_rx_discard(struct rx_ring *rxr, int i) { struct igb_rx_buf *rbuf; rbuf = &rxr->rx_buffers[i]; /* Partially received? Free the chain */ if (rxr->fmp != NULL) { rxr->fmp->m_flags |= M_PKTHDR; m_freem(rxr->fmp); rxr->fmp = NULL; rxr->lmp = NULL; } /* ** With advanced descriptors the writeback ** clobbers the buffer addrs, so its easier ** to just free the existing mbufs and take ** the normal refresh path to get new buffers ** and mapping. */ if (rbuf->m_head) { m_free(rbuf->m_head); rbuf->m_head = NULL; bus_dmamap_unload(rxr->htag, rbuf->hmap); } if (rbuf->m_pack) { m_free(rbuf->m_pack); rbuf->m_pack = NULL; bus_dmamap_unload(rxr->ptag, rbuf->pmap); } return; } static __inline void igb_rx_input(struct rx_ring *rxr, struct ifnet *ifp, struct mbuf *m, u32 ptype) { /* * ATM LRO is only for IPv4/TCP packets and TCP checksum of the packet * should be computed by hardware. Also it should not have VLAN tag in * ethernet header. */ if (rxr->lro_enabled && (ifp->if_capenable & IFCAP_VLAN_HWTAGGING) != 0 && (ptype & E1000_RXDADV_PKTTYPE_ETQF) == 0 && (ptype & (E1000_RXDADV_PKTTYPE_IPV4 | E1000_RXDADV_PKTTYPE_TCP)) == (E1000_RXDADV_PKTTYPE_IPV4 | E1000_RXDADV_PKTTYPE_TCP) && (m->m_pkthdr.csum_flags & (CSUM_DATA_VALID | CSUM_PSEUDO_HDR)) == (CSUM_DATA_VALID | CSUM_PSEUDO_HDR)) { /* * Send to the stack if: ** - LRO not enabled, or ** - no LRO resources, or ** - lro enqueue fails */ if (rxr->lro.lro_cnt != 0) if (tcp_lro_rx(&rxr->lro, m, 0) == 0) return; } IGB_RX_UNLOCK(rxr); (*ifp->if_input)(ifp, m); IGB_RX_LOCK(rxr); } /********************************************************************* * * This routine executes in interrupt context. It replenishes * the mbufs in the descriptor and sends data which has been * dma'ed into host memory to upper layer. * * We loop at most count times if count is > 0, or until done if * count < 0. * * Return TRUE if more to clean, FALSE otherwise *********************************************************************/ static bool igb_rxeof(struct igb_queue *que, int count, int *done) { struct adapter *adapter = que->adapter; struct rx_ring *rxr = que->rxr; struct ifnet *ifp = adapter->ifp; struct lro_ctrl *lro = &rxr->lro; int i, processed = 0, rxdone = 0; u32 ptype, staterr = 0; union e1000_adv_rx_desc *cur; IGB_RX_LOCK(rxr); /* Sync the ring. */ bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); #ifdef DEV_NETMAP if (netmap_rx_irq(ifp, rxr->me, &processed)) { IGB_RX_UNLOCK(rxr); return (FALSE); } #endif /* DEV_NETMAP */ /* Main clean loop */ for (i = rxr->next_to_check; count != 0;) { struct mbuf *sendmp, *mh, *mp; struct igb_rx_buf *rxbuf; u16 hlen, plen, hdr, vtag, pkt_info; bool eop = FALSE; cur = &rxr->rx_base[i]; staterr = le32toh(cur->wb.upper.status_error); if ((staterr & E1000_RXD_STAT_DD) == 0) break; if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) break; count--; sendmp = mh = mp = NULL; cur->wb.upper.status_error = 0; rxbuf = &rxr->rx_buffers[i]; plen = le16toh(cur->wb.upper.length); ptype = le32toh(cur->wb.lower.lo_dword.data) & IGB_PKTTYPE_MASK; if (((adapter->hw.mac.type == e1000_i350) || (adapter->hw.mac.type == e1000_i354)) && (staterr & E1000_RXDEXT_STATERR_LB)) vtag = be16toh(cur->wb.upper.vlan); else vtag = le16toh(cur->wb.upper.vlan); hdr = le16toh(cur->wb.lower.lo_dword.hs_rss.hdr_info); pkt_info = le16toh(cur->wb.lower.lo_dword.hs_rss.pkt_info); eop = ((staterr & E1000_RXD_STAT_EOP) == E1000_RXD_STAT_EOP); /* * Free the frame (all segments) if we're at EOP and * it's an error. * * The datasheet states that EOP + status is only valid for * the final segment in a multi-segment frame. */ if (eop && ((staterr & E1000_RXDEXT_ERR_FRAME_ERR_MASK) != 0)) { adapter->dropped_pkts++; ++rxr->rx_discarded; igb_rx_discard(rxr, i); goto next_desc; } /* ** The way the hardware is configured to ** split, it will ONLY use the header buffer ** when header split is enabled, otherwise we ** get normal behavior, ie, both header and ** payload are DMA'd into the payload buffer. ** ** The fmp test is to catch the case where a ** packet spans multiple descriptors, in that ** case only the first header is valid. */ if (rxr->hdr_split && rxr->fmp == NULL) { bus_dmamap_unload(rxr->htag, rxbuf->hmap); hlen = (hdr & E1000_RXDADV_HDRBUFLEN_MASK) >> E1000_RXDADV_HDRBUFLEN_SHIFT; if (hlen > IGB_HDR_BUF) hlen = IGB_HDR_BUF; mh = rxr->rx_buffers[i].m_head; mh->m_len = hlen; /* clear buf pointer for refresh */ rxbuf->m_head = NULL; /* ** Get the payload length, this ** could be zero if its a small ** packet. */ if (plen > 0) { mp = rxr->rx_buffers[i].m_pack; mp->m_len = plen; mh->m_next = mp; /* clear buf pointer */ rxbuf->m_pack = NULL; rxr->rx_split_packets++; } } else { /* ** Either no header split, or a ** secondary piece of a fragmented ** split packet. */ mh = rxr->rx_buffers[i].m_pack; mh->m_len = plen; /* clear buf info for refresh */ rxbuf->m_pack = NULL; } bus_dmamap_unload(rxr->ptag, rxbuf->pmap); ++processed; /* So we know when to refresh */ /* Initial frame - setup */ if (rxr->fmp == NULL) { mh->m_pkthdr.len = mh->m_len; /* Save the head of the chain */ rxr->fmp = mh; rxr->lmp = mh; if (mp != NULL) { /* Add payload if split */ mh->m_pkthdr.len += mp->m_len; rxr->lmp = mh->m_next; } } else { /* Chain mbuf's together */ rxr->lmp->m_next = mh; rxr->lmp = rxr->lmp->m_next; rxr->fmp->m_pkthdr.len += mh->m_len; } if (eop) { rxr->fmp->m_pkthdr.rcvif = ifp; rxr->rx_packets++; /* capture data for AIM */ rxr->packets++; rxr->bytes += rxr->fmp->m_pkthdr.len; rxr->rx_bytes += rxr->fmp->m_pkthdr.len; if ((ifp->if_capenable & IFCAP_RXCSUM) != 0) igb_rx_checksum(staterr, rxr->fmp, ptype); if ((ifp->if_capenable & IFCAP_VLAN_HWTAGGING) != 0 && (staterr & E1000_RXD_STAT_VP) != 0) { rxr->fmp->m_pkthdr.ether_vtag = vtag; rxr->fmp->m_flags |= M_VLANTAG; } /* * In case of multiqueue, we have RXCSUM.PCSD bit set * and never cleared. This means we have RSS hash * available to be used. */ if (adapter->num_queues > 1) { rxr->fmp->m_pkthdr.flowid = le32toh(cur->wb.lower.hi_dword.rss); switch (pkt_info & E1000_RXDADV_RSSTYPE_MASK) { case E1000_RXDADV_RSSTYPE_IPV4_TCP: M_HASHTYPE_SET(rxr->fmp, M_HASHTYPE_RSS_TCP_IPV4); break; case E1000_RXDADV_RSSTYPE_IPV4: M_HASHTYPE_SET(rxr->fmp, M_HASHTYPE_RSS_IPV4); break; case E1000_RXDADV_RSSTYPE_IPV6_TCP: M_HASHTYPE_SET(rxr->fmp, M_HASHTYPE_RSS_TCP_IPV6); break; case E1000_RXDADV_RSSTYPE_IPV6_EX: M_HASHTYPE_SET(rxr->fmp, M_HASHTYPE_RSS_IPV6_EX); break; case E1000_RXDADV_RSSTYPE_IPV6: M_HASHTYPE_SET(rxr->fmp, M_HASHTYPE_RSS_IPV6); break; case E1000_RXDADV_RSSTYPE_IPV6_TCP_EX: M_HASHTYPE_SET(rxr->fmp, M_HASHTYPE_RSS_TCP_IPV6_EX); break; default: /* XXX fallthrough */ M_HASHTYPE_SET(rxr->fmp, - M_HASHTYPE_OPAQUE); + M_HASHTYPE_OPAQUE_HASH); } } else { #ifndef IGB_LEGACY_TX rxr->fmp->m_pkthdr.flowid = que->msix; M_HASHTYPE_SET(rxr->fmp, M_HASHTYPE_OPAQUE); #endif } sendmp = rxr->fmp; /* Make sure to set M_PKTHDR. */ sendmp->m_flags |= M_PKTHDR; rxr->fmp = NULL; rxr->lmp = NULL; } next_desc: bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* Advance our pointers to the next descriptor. */ if (++i == adapter->num_rx_desc) i = 0; /* ** Send to the stack or LRO */ if (sendmp != NULL) { rxr->next_to_check = i; igb_rx_input(rxr, ifp, sendmp, ptype); i = rxr->next_to_check; rxdone++; } /* Every 8 descriptors we go to refresh mbufs */ if (processed == 8) { igb_refresh_mbufs(rxr, i); processed = 0; } } /* Catch any remainders */ if (igb_rx_unrefreshed(rxr)) igb_refresh_mbufs(rxr, i); rxr->next_to_check = i; /* * Flush any outstanding LRO work */ tcp_lro_flush_all(lro); if (done != NULL) *done += rxdone; IGB_RX_UNLOCK(rxr); return ((staterr & E1000_RXD_STAT_DD) ? TRUE : FALSE); } /********************************************************************* * * Verify that the hardware indicated that the checksum is valid. * Inform the stack about the status of checksum so that stack * doesn't spend time verifying the checksum. * *********************************************************************/ static void igb_rx_checksum(u32 staterr, struct mbuf *mp, u32 ptype) { u16 status = (u16)staterr; u8 errors = (u8) (staterr >> 24); int sctp; /* Ignore Checksum bit is set */ if (status & E1000_RXD_STAT_IXSM) { mp->m_pkthdr.csum_flags = 0; return; } if ((ptype & E1000_RXDADV_PKTTYPE_ETQF) == 0 && (ptype & E1000_RXDADV_PKTTYPE_SCTP) != 0) sctp = 1; else sctp = 0; if (status & E1000_RXD_STAT_IPCS) { /* Did it pass? */ if (!(errors & E1000_RXD_ERR_IPE)) { /* IP Checksum Good */ mp->m_pkthdr.csum_flags = CSUM_IP_CHECKED; mp->m_pkthdr.csum_flags |= CSUM_IP_VALID; } else mp->m_pkthdr.csum_flags = 0; } if (status & (E1000_RXD_STAT_TCPCS | E1000_RXD_STAT_UDPCS)) { u64 type = (CSUM_DATA_VALID | CSUM_PSEUDO_HDR); #if __FreeBSD_version >= 800000 if (sctp) /* reassign */ type = CSUM_SCTP_VALID; #endif /* Did it pass? */ if (!(errors & E1000_RXD_ERR_TCPE)) { mp->m_pkthdr.csum_flags |= type; if (sctp == 0) mp->m_pkthdr.csum_data = htons(0xffff); } } return; } /* * This routine is run via an vlan * config EVENT */ static void igb_register_vlan(void *arg, struct ifnet *ifp, u16 vtag) { struct adapter *adapter = ifp->if_softc; u32 index, bit; if (ifp->if_softc != arg) /* Not our event */ return; if ((vtag == 0) || (vtag > 4095)) /* Invalid */ return; IGB_CORE_LOCK(adapter); index = (vtag >> 5) & 0x7F; bit = vtag & 0x1F; adapter->shadow_vfta[index] |= (1 << bit); ++adapter->num_vlans; /* Change hw filter setting */ if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) igb_setup_vlan_hw_support(adapter); IGB_CORE_UNLOCK(adapter); } /* * This routine is run via an vlan * unconfig EVENT */ static void igb_unregister_vlan(void *arg, struct ifnet *ifp, u16 vtag) { struct adapter *adapter = ifp->if_softc; u32 index, bit; if (ifp->if_softc != arg) return; if ((vtag == 0) || (vtag > 4095)) /* Invalid */ return; IGB_CORE_LOCK(adapter); index = (vtag >> 5) & 0x7F; bit = vtag & 0x1F; adapter->shadow_vfta[index] &= ~(1 << bit); --adapter->num_vlans; /* Change hw filter setting */ if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) igb_setup_vlan_hw_support(adapter); IGB_CORE_UNLOCK(adapter); } static void igb_setup_vlan_hw_support(struct adapter *adapter) { struct e1000_hw *hw = &adapter->hw; struct ifnet *ifp = adapter->ifp; u32 reg; if (adapter->vf_ifp) { e1000_rlpml_set_vf(hw, adapter->max_frame_size + VLAN_TAG_SIZE); return; } reg = E1000_READ_REG(hw, E1000_CTRL); reg |= E1000_CTRL_VME; E1000_WRITE_REG(hw, E1000_CTRL, reg); /* Enable the Filter Table */ if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) { reg = E1000_READ_REG(hw, E1000_RCTL); reg &= ~E1000_RCTL_CFIEN; reg |= E1000_RCTL_VFE; E1000_WRITE_REG(hw, E1000_RCTL, reg); } /* Update the frame size */ E1000_WRITE_REG(&adapter->hw, E1000_RLPML, adapter->max_frame_size + VLAN_TAG_SIZE); /* Don't bother with table if no vlans */ if ((adapter->num_vlans == 0) || ((ifp->if_capenable & IFCAP_VLAN_HWFILTER) == 0)) return; /* ** A soft reset zero's out the VFTA, so ** we need to repopulate it now. */ for (int i = 0; i < IGB_VFTA_SIZE; i++) if (adapter->shadow_vfta[i] != 0) { if (adapter->vf_ifp) e1000_vfta_set_vf(hw, adapter->shadow_vfta[i], TRUE); else e1000_write_vfta(hw, i, adapter->shadow_vfta[i]); } } static void igb_enable_intr(struct adapter *adapter) { /* With RSS set up what to auto clear */ if (adapter->msix_mem) { u32 mask = (adapter->que_mask | adapter->link_mask); E1000_WRITE_REG(&adapter->hw, E1000_EIAC, mask); E1000_WRITE_REG(&adapter->hw, E1000_EIAM, mask); E1000_WRITE_REG(&adapter->hw, E1000_EIMS, mask); E1000_WRITE_REG(&adapter->hw, E1000_IMS, E1000_IMS_LSC); } else { E1000_WRITE_REG(&adapter->hw, E1000_IMS, IMS_ENABLE_MASK); } E1000_WRITE_FLUSH(&adapter->hw); return; } static void igb_disable_intr(struct adapter *adapter) { if (adapter->msix_mem) { E1000_WRITE_REG(&adapter->hw, E1000_EIMC, ~0); E1000_WRITE_REG(&adapter->hw, E1000_EIAC, 0); } E1000_WRITE_REG(&adapter->hw, E1000_IMC, ~0); E1000_WRITE_FLUSH(&adapter->hw); return; } /* * Bit of a misnomer, what this really means is * to enable OS management of the system... aka * to disable special hardware management features */ static void igb_init_manageability(struct adapter *adapter) { if (adapter->has_manage) { int manc2h = E1000_READ_REG(&adapter->hw, E1000_MANC2H); int manc = E1000_READ_REG(&adapter->hw, E1000_MANC); /* disable hardware interception of ARP */ manc &= ~(E1000_MANC_ARP_EN); /* enable receiving management packets to the host */ manc |= E1000_MANC_EN_MNG2HOST; manc2h |= 1 << 5; /* Mng Port 623 */ manc2h |= 1 << 6; /* Mng Port 664 */ E1000_WRITE_REG(&adapter->hw, E1000_MANC2H, manc2h); E1000_WRITE_REG(&adapter->hw, E1000_MANC, manc); } } /* * Give control back to hardware management * controller if there is one. */ static void igb_release_manageability(struct adapter *adapter) { if (adapter->has_manage) { int manc = E1000_READ_REG(&adapter->hw, E1000_MANC); /* re-enable hardware interception of ARP */ manc |= E1000_MANC_ARP_EN; manc &= ~E1000_MANC_EN_MNG2HOST; E1000_WRITE_REG(&adapter->hw, E1000_MANC, manc); } } /* * igb_get_hw_control sets CTRL_EXT:DRV_LOAD bit. * For ASF and Pass Through versions of f/w this means that * the driver is loaded. * */ static void igb_get_hw_control(struct adapter *adapter) { u32 ctrl_ext; if (adapter->vf_ifp) return; /* Let firmware know the driver has taken over */ ctrl_ext = E1000_READ_REG(&adapter->hw, E1000_CTRL_EXT); E1000_WRITE_REG(&adapter->hw, E1000_CTRL_EXT, ctrl_ext | E1000_CTRL_EXT_DRV_LOAD); } /* * igb_release_hw_control resets CTRL_EXT:DRV_LOAD bit. * For ASF and Pass Through versions of f/w this means that the * driver is no longer loaded. * */ static void igb_release_hw_control(struct adapter *adapter) { u32 ctrl_ext; if (adapter->vf_ifp) return; /* Let firmware taken over control of h/w */ ctrl_ext = E1000_READ_REG(&adapter->hw, E1000_CTRL_EXT); E1000_WRITE_REG(&adapter->hw, E1000_CTRL_EXT, ctrl_ext & ~E1000_CTRL_EXT_DRV_LOAD); } static int igb_is_valid_ether_addr(uint8_t *addr) { char zero_addr[6] = { 0, 0, 0, 0, 0, 0 }; if ((addr[0] & 1) || (!bcmp(addr, zero_addr, ETHER_ADDR_LEN))) { return (FALSE); } return (TRUE); } /* * Enable PCI Wake On Lan capability */ static void igb_enable_wakeup(device_t dev) { u16 cap, status; u8 id; /* First find the capabilities pointer*/ cap = pci_read_config(dev, PCIR_CAP_PTR, 2); /* Read the PM Capabilities */ id = pci_read_config(dev, cap, 1); if (id != PCIY_PMG) /* Something wrong */ return; /* OK, we have the power capabilities, so now get the status register */ cap += PCIR_POWER_STATUS; status = pci_read_config(dev, cap, 2); status |= PCIM_PSTAT_PME | PCIM_PSTAT_PMEENABLE; pci_write_config(dev, cap, status, 2); return; } static void igb_led_func(void *arg, int onoff) { struct adapter *adapter = arg; IGB_CORE_LOCK(adapter); if (onoff) { e1000_setup_led(&adapter->hw); e1000_led_on(&adapter->hw); } else { e1000_led_off(&adapter->hw); e1000_cleanup_led(&adapter->hw); } IGB_CORE_UNLOCK(adapter); } static uint64_t igb_get_vf_counter(if_t ifp, ift_counter cnt) { struct adapter *adapter; struct e1000_vf_stats *stats; #ifndef IGB_LEGACY_TX struct tx_ring *txr; uint64_t rv; #endif adapter = if_getsoftc(ifp); stats = (struct e1000_vf_stats *)adapter->stats; switch (cnt) { case IFCOUNTER_IPACKETS: return (stats->gprc); case IFCOUNTER_OPACKETS: return (stats->gptc); case IFCOUNTER_IBYTES: return (stats->gorc); case IFCOUNTER_OBYTES: return (stats->gotc); case IFCOUNTER_IMCASTS: return (stats->mprc); case IFCOUNTER_IERRORS: return (adapter->dropped_pkts); case IFCOUNTER_OERRORS: return (adapter->watchdog_events); #ifndef IGB_LEGACY_TX case IFCOUNTER_OQDROPS: rv = 0; txr = adapter->tx_rings; for (int i = 0; i < adapter->num_queues; i++, txr++) rv += txr->br->br_drops; return (rv); #endif default: return (if_get_counter_default(ifp, cnt)); } } static uint64_t igb_get_counter(if_t ifp, ift_counter cnt) { struct adapter *adapter; struct e1000_hw_stats *stats; #ifndef IGB_LEGACY_TX struct tx_ring *txr; uint64_t rv; #endif adapter = if_getsoftc(ifp); if (adapter->vf_ifp) return (igb_get_vf_counter(ifp, cnt)); stats = (struct e1000_hw_stats *)adapter->stats; switch (cnt) { case IFCOUNTER_IPACKETS: return (stats->gprc); case IFCOUNTER_OPACKETS: return (stats->gptc); case IFCOUNTER_IBYTES: return (stats->gorc); case IFCOUNTER_OBYTES: return (stats->gotc); case IFCOUNTER_IMCASTS: return (stats->mprc); case IFCOUNTER_OMCASTS: return (stats->mptc); case IFCOUNTER_IERRORS: return (adapter->dropped_pkts + stats->rxerrc + stats->crcerrs + stats->algnerrc + stats->ruc + stats->roc + stats->cexterr); case IFCOUNTER_OERRORS: return (stats->ecol + stats->latecol + adapter->watchdog_events); case IFCOUNTER_COLLISIONS: return (stats->colc); case IFCOUNTER_IQDROPS: return (stats->mpc); #ifndef IGB_LEGACY_TX case IFCOUNTER_OQDROPS: rv = 0; txr = adapter->tx_rings; for (int i = 0; i < adapter->num_queues; i++, txr++) rv += txr->br->br_drops; return (rv); #endif default: return (if_get_counter_default(ifp, cnt)); } } /********************************************************************** * * Update the board statistics counters. * **********************************************************************/ static void igb_update_stats_counters(struct adapter *adapter) { struct e1000_hw *hw = &adapter->hw; struct e1000_hw_stats *stats; /* ** The virtual function adapter has only a ** small controlled set of stats, do only ** those and return. */ if (adapter->vf_ifp) { igb_update_vf_stats_counters(adapter); return; } stats = (struct e1000_hw_stats *)adapter->stats; if (adapter->hw.phy.media_type == e1000_media_type_copper || (E1000_READ_REG(hw, E1000_STATUS) & E1000_STATUS_LU)) { stats->symerrs += E1000_READ_REG(hw,E1000_SYMERRS); stats->sec += E1000_READ_REG(hw, E1000_SEC); } stats->crcerrs += E1000_READ_REG(hw, E1000_CRCERRS); stats->mpc += E1000_READ_REG(hw, E1000_MPC); stats->scc += E1000_READ_REG(hw, E1000_SCC); stats->ecol += E1000_READ_REG(hw, E1000_ECOL); stats->mcc += E1000_READ_REG(hw, E1000_MCC); stats->latecol += E1000_READ_REG(hw, E1000_LATECOL); stats->colc += E1000_READ_REG(hw, E1000_COLC); stats->dc += E1000_READ_REG(hw, E1000_DC); stats->rlec += E1000_READ_REG(hw, E1000_RLEC); stats->xonrxc += E1000_READ_REG(hw, E1000_XONRXC); stats->xontxc += E1000_READ_REG(hw, E1000_XONTXC); /* ** For watchdog management we need to know if we have been ** paused during the last interval, so capture that here. */ adapter->pause_frames = E1000_READ_REG(&adapter->hw, E1000_XOFFRXC); stats->xoffrxc += adapter->pause_frames; stats->xofftxc += E1000_READ_REG(hw, E1000_XOFFTXC); stats->fcruc += E1000_READ_REG(hw, E1000_FCRUC); stats->prc64 += E1000_READ_REG(hw, E1000_PRC64); stats->prc127 += E1000_READ_REG(hw, E1000_PRC127); stats->prc255 += E1000_READ_REG(hw, E1000_PRC255); stats->prc511 += E1000_READ_REG(hw, E1000_PRC511); stats->prc1023 += E1000_READ_REG(hw, E1000_PRC1023); stats->prc1522 += E1000_READ_REG(hw, E1000_PRC1522); stats->gprc += E1000_READ_REG(hw, E1000_GPRC); stats->bprc += E1000_READ_REG(hw, E1000_BPRC); stats->mprc += E1000_READ_REG(hw, E1000_MPRC); stats->gptc += E1000_READ_REG(hw, E1000_GPTC); /* For the 64-bit byte counters the low dword must be read first. */ /* Both registers clear on the read of the high dword */ stats->gorc += E1000_READ_REG(hw, E1000_GORCL) + ((u64)E1000_READ_REG(hw, E1000_GORCH) << 32); stats->gotc += E1000_READ_REG(hw, E1000_GOTCL) + ((u64)E1000_READ_REG(hw, E1000_GOTCH) << 32); stats->rnbc += E1000_READ_REG(hw, E1000_RNBC); stats->ruc += E1000_READ_REG(hw, E1000_RUC); stats->rfc += E1000_READ_REG(hw, E1000_RFC); stats->roc += E1000_READ_REG(hw, E1000_ROC); stats->rjc += E1000_READ_REG(hw, E1000_RJC); stats->mgprc += E1000_READ_REG(hw, E1000_MGTPRC); stats->mgpdc += E1000_READ_REG(hw, E1000_MGTPDC); stats->mgptc += E1000_READ_REG(hw, E1000_MGTPTC); stats->tor += E1000_READ_REG(hw, E1000_TORL) + ((u64)E1000_READ_REG(hw, E1000_TORH) << 32); stats->tot += E1000_READ_REG(hw, E1000_TOTL) + ((u64)E1000_READ_REG(hw, E1000_TOTH) << 32); stats->tpr += E1000_READ_REG(hw, E1000_TPR); stats->tpt += E1000_READ_REG(hw, E1000_TPT); stats->ptc64 += E1000_READ_REG(hw, E1000_PTC64); stats->ptc127 += E1000_READ_REG(hw, E1000_PTC127); stats->ptc255 += E1000_READ_REG(hw, E1000_PTC255); stats->ptc511 += E1000_READ_REG(hw, E1000_PTC511); stats->ptc1023 += E1000_READ_REG(hw, E1000_PTC1023); stats->ptc1522 += E1000_READ_REG(hw, E1000_PTC1522); stats->mptc += E1000_READ_REG(hw, E1000_MPTC); stats->bptc += E1000_READ_REG(hw, E1000_BPTC); /* Interrupt Counts */ stats->iac += E1000_READ_REG(hw, E1000_IAC); stats->icrxptc += E1000_READ_REG(hw, E1000_ICRXPTC); stats->icrxatc += E1000_READ_REG(hw, E1000_ICRXATC); stats->ictxptc += E1000_READ_REG(hw, E1000_ICTXPTC); stats->ictxatc += E1000_READ_REG(hw, E1000_ICTXATC); stats->ictxqec += E1000_READ_REG(hw, E1000_ICTXQEC); stats->ictxqmtc += E1000_READ_REG(hw, E1000_ICTXQMTC); stats->icrxdmtc += E1000_READ_REG(hw, E1000_ICRXDMTC); stats->icrxoc += E1000_READ_REG(hw, E1000_ICRXOC); /* Host to Card Statistics */ stats->cbtmpc += E1000_READ_REG(hw, E1000_CBTMPC); stats->htdpmc += E1000_READ_REG(hw, E1000_HTDPMC); stats->cbrdpc += E1000_READ_REG(hw, E1000_CBRDPC); stats->cbrmpc += E1000_READ_REG(hw, E1000_CBRMPC); stats->rpthc += E1000_READ_REG(hw, E1000_RPTHC); stats->hgptc += E1000_READ_REG(hw, E1000_HGPTC); stats->htcbdpc += E1000_READ_REG(hw, E1000_HTCBDPC); stats->hgorc += (E1000_READ_REG(hw, E1000_HGORCL) + ((u64)E1000_READ_REG(hw, E1000_HGORCH) << 32)); stats->hgotc += (E1000_READ_REG(hw, E1000_HGOTCL) + ((u64)E1000_READ_REG(hw, E1000_HGOTCH) << 32)); stats->lenerrs += E1000_READ_REG(hw, E1000_LENERRS); stats->scvpc += E1000_READ_REG(hw, E1000_SCVPC); stats->hrmpc += E1000_READ_REG(hw, E1000_HRMPC); stats->algnerrc += E1000_READ_REG(hw, E1000_ALGNERRC); stats->rxerrc += E1000_READ_REG(hw, E1000_RXERRC); stats->tncrs += E1000_READ_REG(hw, E1000_TNCRS); stats->cexterr += E1000_READ_REG(hw, E1000_CEXTERR); stats->tsctc += E1000_READ_REG(hw, E1000_TSCTC); stats->tsctfc += E1000_READ_REG(hw, E1000_TSCTFC); /* Driver specific counters */ adapter->device_control = E1000_READ_REG(hw, E1000_CTRL); adapter->rx_control = E1000_READ_REG(hw, E1000_RCTL); adapter->int_mask = E1000_READ_REG(hw, E1000_IMS); adapter->eint_mask = E1000_READ_REG(hw, E1000_EIMS); adapter->packet_buf_alloc_tx = ((E1000_READ_REG(hw, E1000_PBA) & 0xffff0000) >> 16); adapter->packet_buf_alloc_rx = (E1000_READ_REG(hw, E1000_PBA) & 0xffff); } /********************************************************************** * * Initialize the VF board statistics counters. * **********************************************************************/ static void igb_vf_init_stats(struct adapter *adapter) { struct e1000_hw *hw = &adapter->hw; struct e1000_vf_stats *stats; stats = (struct e1000_vf_stats *)adapter->stats; if (stats == NULL) return; stats->last_gprc = E1000_READ_REG(hw, E1000_VFGPRC); stats->last_gorc = E1000_READ_REG(hw, E1000_VFGORC); stats->last_gptc = E1000_READ_REG(hw, E1000_VFGPTC); stats->last_gotc = E1000_READ_REG(hw, E1000_VFGOTC); stats->last_mprc = E1000_READ_REG(hw, E1000_VFMPRC); } /********************************************************************** * * Update the VF board statistics counters. * **********************************************************************/ static void igb_update_vf_stats_counters(struct adapter *adapter) { struct e1000_hw *hw = &adapter->hw; struct e1000_vf_stats *stats; if (adapter->link_speed == 0) return; stats = (struct e1000_vf_stats *)adapter->stats; UPDATE_VF_REG(E1000_VFGPRC, stats->last_gprc, stats->gprc); UPDATE_VF_REG(E1000_VFGORC, stats->last_gorc, stats->gorc); UPDATE_VF_REG(E1000_VFGPTC, stats->last_gptc, stats->gptc); UPDATE_VF_REG(E1000_VFGOTC, stats->last_gotc, stats->gotc); UPDATE_VF_REG(E1000_VFMPRC, stats->last_mprc, stats->mprc); } /* Export a single 32-bit register via a read-only sysctl. */ static int igb_sysctl_reg_handler(SYSCTL_HANDLER_ARGS) { struct adapter *adapter; u_int val; adapter = oidp->oid_arg1; val = E1000_READ_REG(&adapter->hw, oidp->oid_arg2); return (sysctl_handle_int(oidp, &val, 0, req)); } /* ** Tuneable interrupt rate handler */ static int igb_sysctl_interrupt_rate_handler(SYSCTL_HANDLER_ARGS) { struct igb_queue *que = ((struct igb_queue *)oidp->oid_arg1); int error; u32 reg, usec, rate; reg = E1000_READ_REG(&que->adapter->hw, E1000_EITR(que->msix)); usec = ((reg & 0x7FFC) >> 2); if (usec > 0) rate = 1000000 / usec; else rate = 0; error = sysctl_handle_int(oidp, &rate, 0, req); if (error || !req->newptr) return error; return 0; } /* * Add sysctl variables, one per statistic, to the system. */ static void igb_add_hw_stats(struct adapter *adapter) { device_t dev = adapter->dev; struct tx_ring *txr = adapter->tx_rings; struct rx_ring *rxr = adapter->rx_rings; struct sysctl_ctx_list *ctx = device_get_sysctl_ctx(dev); struct sysctl_oid *tree = device_get_sysctl_tree(dev); struct sysctl_oid_list *child = SYSCTL_CHILDREN(tree); struct e1000_hw_stats *stats = adapter->stats; struct sysctl_oid *stat_node, *queue_node, *int_node, *host_node; struct sysctl_oid_list *stat_list, *queue_list, *int_list, *host_list; #define QUEUE_NAME_LEN 32 char namebuf[QUEUE_NAME_LEN]; /* Driver Statistics */ SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "dropped", CTLFLAG_RD, &adapter->dropped_pkts, "Driver dropped packets"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "link_irq", CTLFLAG_RD, &adapter->link_irq, "Link MSIX IRQ Handled"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "mbuf_defrag_fail", CTLFLAG_RD, &adapter->mbuf_defrag_failed, "Defragmenting mbuf chain failed"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "tx_dma_fail", CTLFLAG_RD, &adapter->no_tx_dma_setup, "Driver tx dma failure in xmit"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "rx_overruns", CTLFLAG_RD, &adapter->rx_overruns, "RX overruns"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "watchdog_timeouts", CTLFLAG_RD, &adapter->watchdog_events, "Watchdog timeouts"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "device_control", CTLFLAG_RD, &adapter->device_control, "Device Control Register"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "rx_control", CTLFLAG_RD, &adapter->rx_control, "Receiver Control Register"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "interrupt_mask", CTLFLAG_RD, &adapter->int_mask, "Interrupt Mask"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "extended_int_mask", CTLFLAG_RD, &adapter->eint_mask, "Extended Interrupt Mask"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "tx_buf_alloc", CTLFLAG_RD, &adapter->packet_buf_alloc_tx, "Transmit Buffer Packet Allocation"); SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "rx_buf_alloc", CTLFLAG_RD, &adapter->packet_buf_alloc_rx, "Receive Buffer Packet Allocation"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "fc_high_water", CTLFLAG_RD, &adapter->hw.fc.high_water, 0, "Flow Control High Watermark"); SYSCTL_ADD_UINT(ctx, child, OID_AUTO, "fc_low_water", CTLFLAG_RD, &adapter->hw.fc.low_water, 0, "Flow Control Low Watermark"); for (int i = 0; i < adapter->num_queues; i++, rxr++, txr++) { struct lro_ctrl *lro = &rxr->lro; snprintf(namebuf, QUEUE_NAME_LEN, "queue%d", i); queue_node = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, namebuf, CTLFLAG_RD, NULL, "Queue Name"); queue_list = SYSCTL_CHILDREN(queue_node); SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, "interrupt_rate", CTLTYPE_UINT | CTLFLAG_RD, &adapter->queues[i], sizeof(&adapter->queues[i]), igb_sysctl_interrupt_rate_handler, "IU", "Interrupt Rate"); SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, "txd_head", CTLTYPE_UINT | CTLFLAG_RD, adapter, E1000_TDH(txr->me), igb_sysctl_reg_handler, "IU", "Transmit Descriptor Head"); SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, "txd_tail", CTLTYPE_UINT | CTLFLAG_RD, adapter, E1000_TDT(txr->me), igb_sysctl_reg_handler, "IU", "Transmit Descriptor Tail"); SYSCTL_ADD_QUAD(ctx, queue_list, OID_AUTO, "no_desc_avail", CTLFLAG_RD, &txr->no_desc_avail, "Queue Descriptors Unavailable"); SYSCTL_ADD_UQUAD(ctx, queue_list, OID_AUTO, "tx_packets", CTLFLAG_RD, &txr->total_packets, "Queue Packets Transmitted"); SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, "rxd_head", CTLTYPE_UINT | CTLFLAG_RD, adapter, E1000_RDH(rxr->me), igb_sysctl_reg_handler, "IU", "Receive Descriptor Head"); SYSCTL_ADD_PROC(ctx, queue_list, OID_AUTO, "rxd_tail", CTLTYPE_UINT | CTLFLAG_RD, adapter, E1000_RDT(rxr->me), igb_sysctl_reg_handler, "IU", "Receive Descriptor Tail"); SYSCTL_ADD_QUAD(ctx, queue_list, OID_AUTO, "rx_packets", CTLFLAG_RD, &rxr->rx_packets, "Queue Packets Received"); SYSCTL_ADD_QUAD(ctx, queue_list, OID_AUTO, "rx_bytes", CTLFLAG_RD, &rxr->rx_bytes, "Queue Bytes Received"); SYSCTL_ADD_U64(ctx, queue_list, OID_AUTO, "lro_queued", CTLFLAG_RD, &lro->lro_queued, 0, "LRO Queued"); SYSCTL_ADD_U64(ctx, queue_list, OID_AUTO, "lro_flushed", CTLFLAG_RD, &lro->lro_flushed, 0, "LRO Flushed"); } /* MAC stats get their own sub node */ stat_node = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "mac_stats", CTLFLAG_RD, NULL, "MAC Statistics"); stat_list = SYSCTL_CHILDREN(stat_node); /* ** VF adapter has a very limited set of stats ** since its not managing the metal, so to speak. */ if (adapter->vf_ifp) { SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "good_pkts_recvd", CTLFLAG_RD, &stats->gprc, "Good Packets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "good_pkts_txd", CTLFLAG_RD, &stats->gptc, "Good Packets Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "good_octets_recvd", CTLFLAG_RD, &stats->gorc, "Good Octets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "good_octets_txd", CTLFLAG_RD, &stats->gotc, "Good Octets Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "mcast_pkts_recvd", CTLFLAG_RD, &stats->mprc, "Multicast Packets Received"); return; } SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "excess_coll", CTLFLAG_RD, &stats->ecol, "Excessive collisions"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "single_coll", CTLFLAG_RD, &stats->scc, "Single collisions"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "multiple_coll", CTLFLAG_RD, &stats->mcc, "Multiple collisions"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "late_coll", CTLFLAG_RD, &stats->latecol, "Late collisions"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "collision_count", CTLFLAG_RD, &stats->colc, "Collision Count"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "symbol_errors", CTLFLAG_RD, &stats->symerrs, "Symbol Errors"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "sequence_errors", CTLFLAG_RD, &stats->sec, "Sequence Errors"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "defer_count", CTLFLAG_RD, &stats->dc, "Defer Count"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "missed_packets", CTLFLAG_RD, &stats->mpc, "Missed Packets"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "recv_length_errors", CTLFLAG_RD, &stats->rlec, "Receive Length Errors"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "recv_no_buff", CTLFLAG_RD, &stats->rnbc, "Receive No Buffers"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "recv_undersize", CTLFLAG_RD, &stats->ruc, "Receive Undersize"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "recv_fragmented", CTLFLAG_RD, &stats->rfc, "Fragmented Packets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "recv_oversize", CTLFLAG_RD, &stats->roc, "Oversized Packets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "recv_jabber", CTLFLAG_RD, &stats->rjc, "Recevied Jabber"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "recv_errs", CTLFLAG_RD, &stats->rxerrc, "Receive Errors"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "crc_errs", CTLFLAG_RD, &stats->crcerrs, "CRC errors"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "alignment_errs", CTLFLAG_RD, &stats->algnerrc, "Alignment Errors"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tx_no_crs", CTLFLAG_RD, &stats->tncrs, "Transmit with No CRS"); /* On 82575 these are collision counts */ SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "coll_ext_errs", CTLFLAG_RD, &stats->cexterr, "Collision/Carrier extension errors"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "xon_recvd", CTLFLAG_RD, &stats->xonrxc, "XON Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "xon_txd", CTLFLAG_RD, &stats->xontxc, "XON Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "xoff_recvd", CTLFLAG_RD, &stats->xoffrxc, "XOFF Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "xoff_txd", CTLFLAG_RD, &stats->xofftxc, "XOFF Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "unsupported_fc_recvd", CTLFLAG_RD, &stats->fcruc, "Unsupported Flow Control Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "mgmt_pkts_recvd", CTLFLAG_RD, &stats->mgprc, "Management Packets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "mgmt_pkts_drop", CTLFLAG_RD, &stats->mgpdc, "Management Packets Dropped"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "mgmt_pkts_txd", CTLFLAG_RD, &stats->mgptc, "Management Packets Transmitted"); /* Packet Reception Stats */ SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "total_pkts_recvd", CTLFLAG_RD, &stats->tpr, "Total Packets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "good_pkts_recvd", CTLFLAG_RD, &stats->gprc, "Good Packets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "bcast_pkts_recvd", CTLFLAG_RD, &stats->bprc, "Broadcast Packets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "mcast_pkts_recvd", CTLFLAG_RD, &stats->mprc, "Multicast Packets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "rx_frames_64", CTLFLAG_RD, &stats->prc64, "64 byte frames received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "rx_frames_65_127", CTLFLAG_RD, &stats->prc127, "65-127 byte frames received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "rx_frames_128_255", CTLFLAG_RD, &stats->prc255, "128-255 byte frames received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "rx_frames_256_511", CTLFLAG_RD, &stats->prc511, "256-511 byte frames received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "rx_frames_512_1023", CTLFLAG_RD, &stats->prc1023, "512-1023 byte frames received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "rx_frames_1024_1522", CTLFLAG_RD, &stats->prc1522, "1023-1522 byte frames received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "good_octets_recvd", CTLFLAG_RD, &stats->gorc, "Good Octets Received"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "total_octets_recvd", CTLFLAG_RD, &stats->tor, "Total Octets Received"); /* Packet Transmission Stats */ SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "good_octets_txd", CTLFLAG_RD, &stats->gotc, "Good Octets Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "total_octets_txd", CTLFLAG_RD, &stats->tot, "Total Octets Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "total_pkts_txd", CTLFLAG_RD, &stats->tpt, "Total Packets Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "good_pkts_txd", CTLFLAG_RD, &stats->gptc, "Good Packets Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "bcast_pkts_txd", CTLFLAG_RD, &stats->bptc, "Broadcast Packets Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "mcast_pkts_txd", CTLFLAG_RD, &stats->mptc, "Multicast Packets Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tx_frames_64", CTLFLAG_RD, &stats->ptc64, "64 byte frames transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tx_frames_65_127", CTLFLAG_RD, &stats->ptc127, "65-127 byte frames transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tx_frames_128_255", CTLFLAG_RD, &stats->ptc255, "128-255 byte frames transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tx_frames_256_511", CTLFLAG_RD, &stats->ptc511, "256-511 byte frames transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tx_frames_512_1023", CTLFLAG_RD, &stats->ptc1023, "512-1023 byte frames transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tx_frames_1024_1522", CTLFLAG_RD, &stats->ptc1522, "1024-1522 byte frames transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tso_txd", CTLFLAG_RD, &stats->tsctc, "TSO Contexts Transmitted"); SYSCTL_ADD_QUAD(ctx, stat_list, OID_AUTO, "tso_ctx_fail", CTLFLAG_RD, &stats->tsctfc, "TSO Contexts Failed"); /* Interrupt Stats */ int_node = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "interrupts", CTLFLAG_RD, NULL, "Interrupt Statistics"); int_list = SYSCTL_CHILDREN(int_node); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "asserts", CTLFLAG_RD, &stats->iac, "Interrupt Assertion Count"); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "rx_pkt_timer", CTLFLAG_RD, &stats->icrxptc, "Interrupt Cause Rx Pkt Timer Expire Count"); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "rx_abs_timer", CTLFLAG_RD, &stats->icrxatc, "Interrupt Cause Rx Abs Timer Expire Count"); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "tx_pkt_timer", CTLFLAG_RD, &stats->ictxptc, "Interrupt Cause Tx Pkt Timer Expire Count"); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "tx_abs_timer", CTLFLAG_RD, &stats->ictxatc, "Interrupt Cause Tx Abs Timer Expire Count"); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "tx_queue_empty", CTLFLAG_RD, &stats->ictxqec, "Interrupt Cause Tx Queue Empty Count"); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "tx_queue_min_thresh", CTLFLAG_RD, &stats->ictxqmtc, "Interrupt Cause Tx Queue Min Thresh Count"); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "rx_desc_min_thresh", CTLFLAG_RD, &stats->icrxdmtc, "Interrupt Cause Rx Desc Min Thresh Count"); SYSCTL_ADD_QUAD(ctx, int_list, OID_AUTO, "rx_overrun", CTLFLAG_RD, &stats->icrxoc, "Interrupt Cause Receiver Overrun Count"); /* Host to Card Stats */ host_node = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "host", CTLFLAG_RD, NULL, "Host to Card Statistics"); host_list = SYSCTL_CHILDREN(host_node); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "breaker_tx_pkt", CTLFLAG_RD, &stats->cbtmpc, "Circuit Breaker Tx Packet Count"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "host_tx_pkt_discard", CTLFLAG_RD, &stats->htdpmc, "Host Transmit Discarded Packets"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "rx_pkt", CTLFLAG_RD, &stats->rpthc, "Rx Packets To Host"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "breaker_rx_pkts", CTLFLAG_RD, &stats->cbrmpc, "Circuit Breaker Rx Packet Count"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "breaker_rx_pkt_drop", CTLFLAG_RD, &stats->cbrdpc, "Circuit Breaker Rx Dropped Count"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "tx_good_pkt", CTLFLAG_RD, &stats->hgptc, "Host Good Packets Tx Count"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "breaker_tx_pkt_drop", CTLFLAG_RD, &stats->htcbdpc, "Host Tx Circuit Breaker Dropped Count"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "rx_good_bytes", CTLFLAG_RD, &stats->hgorc, "Host Good Octets Received Count"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "tx_good_bytes", CTLFLAG_RD, &stats->hgotc, "Host Good Octets Transmit Count"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "length_errors", CTLFLAG_RD, &stats->lenerrs, "Length Errors"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "serdes_violation_pkt", CTLFLAG_RD, &stats->scvpc, "SerDes/SGMII Code Violation Pkt Count"); SYSCTL_ADD_QUAD(ctx, host_list, OID_AUTO, "header_redir_missed", CTLFLAG_RD, &stats->hrmpc, "Header Redirection Missed Packet Count"); } /********************************************************************** * * This routine provides a way to dump out the adapter eeprom, * often a useful debug/service tool. This only dumps the first * 32 words, stuff that matters is in that extent. * **********************************************************************/ static int igb_sysctl_nvm_info(SYSCTL_HANDLER_ARGS) { struct adapter *adapter; int error; int result; result = -1; error = sysctl_handle_int(oidp, &result, 0, req); if (error || !req->newptr) return (error); /* * This value will cause a hex dump of the * first 32 16-bit words of the EEPROM to * the screen. */ if (result == 1) { adapter = (struct adapter *)arg1; igb_print_nvm_info(adapter); } return (error); } static void igb_print_nvm_info(struct adapter *adapter) { u16 eeprom_data; int i, j, row = 0; /* Its a bit crude, but it gets the job done */ printf("\nInterface EEPROM Dump:\n"); printf("Offset\n0x0000 "); for (i = 0, j = 0; i < 32; i++, j++) { if (j == 8) { /* Make the offset block */ j = 0; ++row; printf("\n0x00%x0 ",row); } e1000_read_nvm(&adapter->hw, i, 1, &eeprom_data); printf("%04x ", eeprom_data); } printf("\n"); } static void igb_set_sysctl_value(struct adapter *adapter, const char *name, const char *description, int *limit, int value) { *limit = value; SYSCTL_ADD_INT(device_get_sysctl_ctx(adapter->dev), SYSCTL_CHILDREN(device_get_sysctl_tree(adapter->dev)), OID_AUTO, name, CTLFLAG_RW, limit, value, description); } /* ** Set flow control using sysctl: ** Flow control values: ** 0 - off ** 1 - rx pause ** 2 - tx pause ** 3 - full */ static int igb_set_flowcntl(SYSCTL_HANDLER_ARGS) { int error; static int input = 3; /* default is full */ struct adapter *adapter = (struct adapter *) arg1; error = sysctl_handle_int(oidp, &input, 0, req); if ((error) || (req->newptr == NULL)) return (error); switch (input) { case e1000_fc_rx_pause: case e1000_fc_tx_pause: case e1000_fc_full: case e1000_fc_none: adapter->hw.fc.requested_mode = input; adapter->fc = input; break; default: /* Do nothing */ return (error); } adapter->hw.fc.current_mode = adapter->hw.fc.requested_mode; e1000_force_mac_fc(&adapter->hw); /* XXX TODO: update DROP_EN on each RX queue if appropriate */ return (error); } /* ** Manage DMA Coalesce: ** Control values: ** 0/1 - off/on ** Legal timer values are: ** 250,500,1000-10000 in thousands */ static int igb_sysctl_dmac(SYSCTL_HANDLER_ARGS) { struct adapter *adapter = (struct adapter *) arg1; int error; error = sysctl_handle_int(oidp, &adapter->dmac, 0, req); if ((error) || (req->newptr == NULL)) return (error); switch (adapter->dmac) { case 0: /* Disabling */ break; case 1: /* Just enable and use default */ adapter->dmac = 1000; break; case 250: case 500: case 1000: case 2000: case 3000: case 4000: case 5000: case 6000: case 7000: case 8000: case 9000: case 10000: /* Legal values - allow */ break; default: /* Do nothing, illegal value */ adapter->dmac = 0; return (EINVAL); } /* Reinit the interface */ igb_init(adapter); return (error); } /* ** Manage Energy Efficient Ethernet: ** Control values: ** 0/1 - enabled/disabled */ static int igb_sysctl_eee(SYSCTL_HANDLER_ARGS) { struct adapter *adapter = (struct adapter *) arg1; int error, value; value = adapter->hw.dev_spec._82575.eee_disable; error = sysctl_handle_int(oidp, &value, 0, req); if (error || req->newptr == NULL) return (error); IGB_CORE_LOCK(adapter); adapter->hw.dev_spec._82575.eee_disable = (value != 0); igb_init_locked(adapter); IGB_CORE_UNLOCK(adapter); return (0); } Index: projects/vnet/sys/dev/gpio/gpiobus.c =================================================================== --- projects/vnet/sys/dev/gpio/gpiobus.c (revision 301546) +++ projects/vnet/sys/dev/gpio/gpiobus.c (revision 301547) @@ -1,849 +1,874 @@ /*- * Copyright (c) 2009 Oleksandr Tymoshenko * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include "gpiobus_if.h" #undef GPIOBUS_DEBUG #ifdef GPIOBUS_DEBUG #define dprintf printf #else #define dprintf(x, arg...) #endif static void gpiobus_print_pins(struct gpiobus_ivar *, char *, size_t); static int gpiobus_parse_pins(struct gpiobus_softc *, device_t, int); static int gpiobus_probe(device_t); static int gpiobus_attach(device_t); static int gpiobus_detach(device_t); static int gpiobus_suspend(device_t); static int gpiobus_resume(device_t); static void gpiobus_probe_nomatch(device_t, device_t); static int gpiobus_print_child(device_t, device_t); static int gpiobus_child_location_str(device_t, device_t, char *, size_t); static int gpiobus_child_pnpinfo_str(device_t, device_t, char *, size_t); static device_t gpiobus_add_child(device_t, u_int, const char *, int); static void gpiobus_hinted_child(device_t, const char *, int); /* * GPIOBUS interface */ static int gpiobus_acquire_bus(device_t, device_t, int); static void gpiobus_release_bus(device_t, device_t); static int gpiobus_pin_setflags(device_t, device_t, uint32_t, uint32_t); static int gpiobus_pin_getflags(device_t, device_t, uint32_t, uint32_t*); static int gpiobus_pin_getcaps(device_t, device_t, uint32_t, uint32_t*); static int gpiobus_pin_set(device_t, device_t, uint32_t, unsigned int); static int gpiobus_pin_get(device_t, device_t, uint32_t, unsigned int*); static int gpiobus_pin_toggle(device_t, device_t, uint32_t); /* * XXX -> Move me to better place - gpio_subr.c? * Also, this function must be changed when interrupt configuration * data will be moved into struct resource. */ #ifdef INTRNG +static void +gpio_destruct_map_data(struct intr_map_data *map_data) +{ + + KASSERT(map_data->type == INTR_MAP_DATA_GPIO, + ("%s: bad map_data type %d", __func__, map_data->type)); + + free(map_data, M_DEVBUF); +} + struct resource * gpio_alloc_intr_resource(device_t consumer_dev, int *rid, u_int alloc_flags, gpio_pin_t pin, uint32_t intr_mode) { - u_int irqnum; + int rv; + u_int irq; + struct intr_map_data_gpio *gpio_data; + struct resource *res; - /* - * Allocate new fictitious interrupt number and store configuration - * into it. - */ - irqnum = intr_gpio_map_irq(pin->dev, pin->pin, pin->flags, intr_mode); - if (irqnum == INTR_IRQ_INVALID) + gpio_data = malloc(sizeof(*gpio_data), M_DEVBUF, M_WAITOK | M_ZERO); + gpio_data->hdr.type = INTR_MAP_DATA_GPIO; + gpio_data->hdr.destruct = gpio_destruct_map_data; + gpio_data->gpio_pin_num = pin->pin; + gpio_data->gpio_pin_flags = pin->flags; + gpio_data->gpio_intr_mode = intr_mode; + + rv = intr_map_irq(pin->dev, 0, (struct intr_map_data *)gpio_data, + &irq); + if (rv != 0) { + gpio_destruct_map_data((struct intr_map_data *)gpio_data); return (NULL); + } - return (bus_alloc_resource(consumer_dev, SYS_RES_IRQ, rid, - irqnum, irqnum, 1, alloc_flags)); + res = bus_alloc_resource(consumer_dev, SYS_RES_IRQ, rid, irq, irq, 1, + alloc_flags); + if (res == NULL) { + gpio_destruct_map_data((struct intr_map_data *)gpio_data); + return (NULL); + } + rman_set_virtual(res, gpio_data); + return (res); } #else struct resource * gpio_alloc_intr_resource(device_t consumer_dev, int *rid, u_int alloc_flags, gpio_pin_t pin, uint32_t intr_mode) { return (NULL); } #endif int gpio_check_flags(uint32_t caps, uint32_t flags) { /* Check for unwanted flags. */ if ((flags & caps) == 0 || (flags & caps) != flags) return (EINVAL); /* Cannot mix input/output together. */ if (flags & GPIO_PIN_INPUT && flags & GPIO_PIN_OUTPUT) return (EINVAL); /* Cannot mix pull-up/pull-down together. */ if (flags & GPIO_PIN_PULLUP && flags & GPIO_PIN_PULLDOWN) return (EINVAL); return (0); } static void gpiobus_print_pins(struct gpiobus_ivar *devi, char *buf, size_t buflen) { char tmp[128]; int i, range_start, range_stop, need_coma; if (devi->npins == 0) return; need_coma = 0; range_start = range_stop = devi->pins[0]; for (i = 1; i < devi->npins; i++) { if (devi->pins[i] != (range_stop + 1)) { if (need_coma) strlcat(buf, ",", buflen); memset(tmp, 0, sizeof(tmp)); if (range_start != range_stop) snprintf(tmp, sizeof(tmp) - 1, "%d-%d", range_start, range_stop); else snprintf(tmp, sizeof(tmp) - 1, "%d", range_start); strlcat(buf, tmp, buflen); range_start = range_stop = devi->pins[i]; need_coma = 1; } else range_stop++; } if (need_coma) strlcat(buf, ",", buflen); memset(tmp, 0, sizeof(tmp)); if (range_start != range_stop) snprintf(tmp, sizeof(tmp) - 1, "%d-%d", range_start, range_stop); else snprintf(tmp, sizeof(tmp) - 1, "%d", range_start); strlcat(buf, tmp, buflen); } device_t gpiobus_attach_bus(device_t dev) { device_t busdev; busdev = device_add_child(dev, "gpiobus", -1); if (busdev == NULL) return (NULL); if (device_add_child(dev, "gpioc", -1) == NULL) { device_delete_child(dev, busdev); return (NULL); } #ifdef FDT ofw_gpiobus_register_provider(dev); #endif bus_generic_attach(dev); return (busdev); } int gpiobus_detach_bus(device_t dev) { int err; #ifdef FDT ofw_gpiobus_unregister_provider(dev); #endif err = bus_generic_detach(dev); if (err != 0) return (err); return (device_delete_children(dev)); } int gpiobus_init_softc(device_t dev) { struct gpiobus_softc *sc; sc = GPIOBUS_SOFTC(dev); sc->sc_busdev = dev; sc->sc_dev = device_get_parent(dev); sc->sc_intr_rman.rm_type = RMAN_ARRAY; sc->sc_intr_rman.rm_descr = "GPIO Interrupts"; if (rman_init(&sc->sc_intr_rman) != 0 || rman_manage_region(&sc->sc_intr_rman, 0, ~0) != 0) panic("%s: failed to set up rman.", __func__); if (GPIO_PIN_MAX(sc->sc_dev, &sc->sc_npins) != 0) return (ENXIO); KASSERT(sc->sc_npins >= 0, ("GPIO device with no pins")); /* Pins = GPIO_PIN_MAX() + 1 */ sc->sc_npins++; sc->sc_pins = malloc(sizeof(*sc->sc_pins) * sc->sc_npins, M_DEVBUF, M_NOWAIT | M_ZERO); if (sc->sc_pins == NULL) return (ENOMEM); /* Initialize the bus lock. */ GPIOBUS_LOCK_INIT(sc); return (0); } int gpiobus_alloc_ivars(struct gpiobus_ivar *devi) { /* Allocate pins and flags memory. */ devi->pins = malloc(sizeof(uint32_t) * devi->npins, M_DEVBUF, M_NOWAIT | M_ZERO); if (devi->pins == NULL) return (ENOMEM); devi->flags = malloc(sizeof(uint32_t) * devi->npins, M_DEVBUF, M_NOWAIT | M_ZERO); if (devi->flags == NULL) { free(devi->pins, M_DEVBUF); return (ENOMEM); } return (0); } void gpiobus_free_ivars(struct gpiobus_ivar *devi) { if (devi->flags) { free(devi->flags, M_DEVBUF); devi->flags = NULL; } if (devi->pins) { free(devi->pins, M_DEVBUF); devi->pins = NULL; } } int gpiobus_acquire_pin(device_t bus, uint32_t pin) { struct gpiobus_softc *sc; sc = device_get_softc(bus); /* Consistency check. */ if (pin >= sc->sc_npins) { device_printf(bus, "invalid pin %d, max: %d\n", pin, sc->sc_npins - 1); return (-1); } /* Mark pin as mapped and give warning if it's already mapped. */ if (sc->sc_pins[pin].mapped) { device_printf(bus, "warning: pin %d is already mapped\n", pin); return (-1); } sc->sc_pins[pin].mapped = 1; return (0); } /* Release mapped pin */ int gpiobus_release_pin(device_t bus, uint32_t pin) { struct gpiobus_softc *sc; sc = device_get_softc(bus); /* Consistency check. */ if (pin >= sc->sc_npins) { device_printf(bus, "gpiobus_acquire_pin: invalid pin %d, max=%d\n", pin, sc->sc_npins - 1); return (-1); } if (!sc->sc_pins[pin].mapped) { device_printf(bus, "gpiobus_acquire_pin: pin %d is not mapped\n", pin); return (-1); } sc->sc_pins[pin].mapped = 0; return (0); } static int gpiobus_parse_pins(struct gpiobus_softc *sc, device_t child, int mask) { struct gpiobus_ivar *devi = GPIOBUS_IVAR(child); int i, npins; npins = 0; for (i = 0; i < 32; i++) { if (mask & (1 << i)) npins++; } if (npins == 0) { device_printf(child, "empty pin mask\n"); return (EINVAL); } devi->npins = npins; if (gpiobus_alloc_ivars(devi) != 0) { device_printf(child, "cannot allocate device ivars\n"); return (EINVAL); } npins = 0; for (i = 0; i < 32; i++) { if ((mask & (1 << i)) == 0) continue; /* Reserve the GPIO pin. */ if (gpiobus_acquire_pin(sc->sc_busdev, i) != 0) { gpiobus_free_ivars(devi); return (EINVAL); } devi->pins[npins++] = i; /* Use the child name as pin name. */ GPIOBUS_PIN_SETNAME(sc->sc_busdev, i, device_get_nameunit(child)); } return (0); } static int gpiobus_probe(device_t dev) { device_set_desc(dev, "GPIO bus"); return (BUS_PROBE_GENERIC); } static int gpiobus_attach(device_t dev) { int err; err = gpiobus_init_softc(dev); if (err != 0) return (err); /* * Get parent's pins and mark them as unmapped */ bus_generic_probe(dev); bus_enumerate_hinted_children(dev); return (bus_generic_attach(dev)); } /* * Since this is not a self-enumerating bus, and since we always add * children in attach, we have to always delete children here. */ static int gpiobus_detach(device_t dev) { struct gpiobus_softc *sc; struct gpiobus_ivar *devi; device_t *devlist; int i, err, ndevs; sc = GPIOBUS_SOFTC(dev); KASSERT(mtx_initialized(&sc->sc_mtx), ("gpiobus mutex not initialized")); GPIOBUS_LOCK_DESTROY(sc); if ((err = bus_generic_detach(dev)) != 0) return (err); if ((err = device_get_children(dev, &devlist, &ndevs)) != 0) return (err); for (i = 0; i < ndevs; i++) { devi = GPIOBUS_IVAR(devlist[i]); gpiobus_free_ivars(devi); resource_list_free(&devi->rl); free(devi, M_DEVBUF); device_delete_child(dev, devlist[i]); } free(devlist, M_TEMP); rman_fini(&sc->sc_intr_rman); if (sc->sc_pins) { for (i = 0; i < sc->sc_npins; i++) { if (sc->sc_pins[i].name != NULL) free(sc->sc_pins[i].name, M_DEVBUF); sc->sc_pins[i].name = NULL; } free(sc->sc_pins, M_DEVBUF); sc->sc_pins = NULL; } return (0); } static int gpiobus_suspend(device_t dev) { return (bus_generic_suspend(dev)); } static int gpiobus_resume(device_t dev) { return (bus_generic_resume(dev)); } static void gpiobus_probe_nomatch(device_t dev, device_t child) { char pins[128]; struct gpiobus_ivar *devi; devi = GPIOBUS_IVAR(child); memset(pins, 0, sizeof(pins)); gpiobus_print_pins(devi, pins, sizeof(pins)); if (devi->npins > 1) device_printf(dev, " at pins %s", pins); else device_printf(dev, " at pin %s", pins); resource_list_print_type(&devi->rl, "irq", SYS_RES_IRQ, "%jd"); printf("\n"); } static int gpiobus_print_child(device_t dev, device_t child) { char pins[128]; int retval = 0; struct gpiobus_ivar *devi; devi = GPIOBUS_IVAR(child); memset(pins, 0, sizeof(pins)); retval += bus_print_child_header(dev, child); if (devi->npins > 0) { if (devi->npins > 1) retval += printf(" at pins "); else retval += printf(" at pin "); gpiobus_print_pins(devi, pins, sizeof(pins)); retval += printf("%s", pins); } resource_list_print_type(&devi->rl, "irq", SYS_RES_IRQ, "%jd"); retval += bus_print_child_footer(dev, child); return (retval); } static int gpiobus_child_location_str(device_t bus, device_t child, char *buf, size_t buflen) { struct gpiobus_ivar *devi; devi = GPIOBUS_IVAR(child); if (devi->npins > 1) strlcpy(buf, "pins=", buflen); else strlcpy(buf, "pin=", buflen); gpiobus_print_pins(devi, buf, buflen); return (0); } static int gpiobus_child_pnpinfo_str(device_t bus, device_t child, char *buf, size_t buflen) { *buf = '\0'; return (0); } static device_t gpiobus_add_child(device_t dev, u_int order, const char *name, int unit) { device_t child; struct gpiobus_ivar *devi; child = device_add_child_ordered(dev, order, name, unit); if (child == NULL) return (child); devi = malloc(sizeof(struct gpiobus_ivar), M_DEVBUF, M_NOWAIT | M_ZERO); if (devi == NULL) { device_delete_child(dev, child); return (NULL); } resource_list_init(&devi->rl); device_set_ivars(child, devi); return (child); } static void gpiobus_hinted_child(device_t bus, const char *dname, int dunit) { struct gpiobus_softc *sc = GPIOBUS_SOFTC(bus); struct gpiobus_ivar *devi; device_t child; int irq, pins; child = BUS_ADD_CHILD(bus, 0, dname, dunit); devi = GPIOBUS_IVAR(child); resource_int_value(dname, dunit, "pins", &pins); if (gpiobus_parse_pins(sc, child, pins)) { resource_list_free(&devi->rl); free(devi, M_DEVBUF); device_delete_child(bus, child); } if (resource_int_value(dname, dunit, "irq", &irq) == 0) { if (bus_set_resource(child, SYS_RES_IRQ, 0, irq, 1) != 0) device_printf(bus, "warning: bus_set_resource() failed\n"); } } static int gpiobus_set_resource(device_t dev, device_t child, int type, int rid, rman_res_t start, rman_res_t count) { struct gpiobus_ivar *devi; struct resource_list_entry *rle; dprintf("%s: entry (%p, %p, %d, %d, %p, %ld)\n", __func__, dev, child, type, rid, (void *)(intptr_t)start, count); devi = GPIOBUS_IVAR(child); rle = resource_list_add(&devi->rl, type, rid, start, start + count - 1, count); if (rle == NULL) return (ENXIO); return (0); } static struct resource * gpiobus_alloc_resource(device_t bus, device_t child, int type, int *rid, rman_res_t start, rman_res_t end, rman_res_t count, u_int flags) { struct gpiobus_softc *sc; struct resource *rv; struct resource_list *rl; struct resource_list_entry *rle; int isdefault; if (type != SYS_RES_IRQ) return (NULL); isdefault = (RMAN_IS_DEFAULT_RANGE(start, end) && count == 1); rle = NULL; if (isdefault) { rl = BUS_GET_RESOURCE_LIST(bus, child); if (rl == NULL) return (NULL); rle = resource_list_find(rl, type, *rid); if (rle == NULL) return (NULL); if (rle->res != NULL) panic("%s: resource entry is busy", __func__); start = rle->start; count = rle->count; end = rle->end; } sc = device_get_softc(bus); rv = rman_reserve_resource(&sc->sc_intr_rman, start, end, count, flags, child); if (rv == NULL) return (NULL); rman_set_rid(rv, *rid); if ((flags & RF_ACTIVE) != 0 && bus_activate_resource(child, type, *rid, rv) != 0) { rman_release_resource(rv); return (NULL); } return (rv); } static int gpiobus_release_resource(device_t bus __unused, device_t child, int type, int rid, struct resource *r) { int error; if (rman_get_flags(r) & RF_ACTIVE) { error = bus_deactivate_resource(child, type, rid, r); if (error) return (error); } return (rman_release_resource(r)); } static struct resource_list * gpiobus_get_resource_list(device_t bus __unused, device_t child) { struct gpiobus_ivar *ivar; ivar = GPIOBUS_IVAR(child); return (&ivar->rl); } static int gpiobus_acquire_bus(device_t busdev, device_t child, int how) { struct gpiobus_softc *sc; sc = device_get_softc(busdev); GPIOBUS_ASSERT_UNLOCKED(sc); GPIOBUS_LOCK(sc); if (sc->sc_owner != NULL) { if (sc->sc_owner == child) panic("%s: %s still owns the bus.", device_get_nameunit(busdev), device_get_nameunit(child)); if (how == GPIOBUS_DONTWAIT) { GPIOBUS_UNLOCK(sc); return (EWOULDBLOCK); } while (sc->sc_owner != NULL) mtx_sleep(sc, &sc->sc_mtx, 0, "gpiobuswait", 0); } sc->sc_owner = child; GPIOBUS_UNLOCK(sc); return (0); } static void gpiobus_release_bus(device_t busdev, device_t child) { struct gpiobus_softc *sc; sc = device_get_softc(busdev); GPIOBUS_ASSERT_UNLOCKED(sc); GPIOBUS_LOCK(sc); if (sc->sc_owner == NULL) panic("%s: %s releasing unowned bus.", device_get_nameunit(busdev), device_get_nameunit(child)); if (sc->sc_owner != child) panic("%s: %s trying to release bus owned by %s", device_get_nameunit(busdev), device_get_nameunit(child), device_get_nameunit(sc->sc_owner)); sc->sc_owner = NULL; wakeup(sc); GPIOBUS_UNLOCK(sc); } static int gpiobus_pin_setflags(device_t dev, device_t child, uint32_t pin, uint32_t flags) { struct gpiobus_softc *sc = GPIOBUS_SOFTC(dev); struct gpiobus_ivar *devi = GPIOBUS_IVAR(child); uint32_t caps; if (pin >= devi->npins) return (EINVAL); if (GPIO_PIN_GETCAPS(sc->sc_dev, devi->pins[pin], &caps) != 0) return (EINVAL); if (gpio_check_flags(caps, flags) != 0) return (EINVAL); return (GPIO_PIN_SETFLAGS(sc->sc_dev, devi->pins[pin], flags)); } static int gpiobus_pin_getflags(device_t dev, device_t child, uint32_t pin, uint32_t *flags) { struct gpiobus_softc *sc = GPIOBUS_SOFTC(dev); struct gpiobus_ivar *devi = GPIOBUS_IVAR(child); if (pin >= devi->npins) return (EINVAL); return GPIO_PIN_GETFLAGS(sc->sc_dev, devi->pins[pin], flags); } static int gpiobus_pin_getcaps(device_t dev, device_t child, uint32_t pin, uint32_t *caps) { struct gpiobus_softc *sc = GPIOBUS_SOFTC(dev); struct gpiobus_ivar *devi = GPIOBUS_IVAR(child); if (pin >= devi->npins) return (EINVAL); return GPIO_PIN_GETCAPS(sc->sc_dev, devi->pins[pin], caps); } static int gpiobus_pin_set(device_t dev, device_t child, uint32_t pin, unsigned int value) { struct gpiobus_softc *sc = GPIOBUS_SOFTC(dev); struct gpiobus_ivar *devi = GPIOBUS_IVAR(child); if (pin >= devi->npins) return (EINVAL); return GPIO_PIN_SET(sc->sc_dev, devi->pins[pin], value); } static int gpiobus_pin_get(device_t dev, device_t child, uint32_t pin, unsigned int *value) { struct gpiobus_softc *sc = GPIOBUS_SOFTC(dev); struct gpiobus_ivar *devi = GPIOBUS_IVAR(child); if (pin >= devi->npins) return (EINVAL); return GPIO_PIN_GET(sc->sc_dev, devi->pins[pin], value); } static int gpiobus_pin_toggle(device_t dev, device_t child, uint32_t pin) { struct gpiobus_softc *sc = GPIOBUS_SOFTC(dev); struct gpiobus_ivar *devi = GPIOBUS_IVAR(child); if (pin >= devi->npins) return (EINVAL); return GPIO_PIN_TOGGLE(sc->sc_dev, devi->pins[pin]); } static int gpiobus_pin_getname(device_t dev, uint32_t pin, char *name) { struct gpiobus_softc *sc; sc = GPIOBUS_SOFTC(dev); if (pin > sc->sc_npins) return (EINVAL); /* Did we have a name for this pin ? */ if (sc->sc_pins[pin].name != NULL) { memcpy(name, sc->sc_pins[pin].name, GPIOMAXNAME); return (0); } /* Return the default pin name. */ return (GPIO_PIN_GETNAME(device_get_parent(dev), pin, name)); } static int gpiobus_pin_setname(device_t dev, uint32_t pin, const char *name) { struct gpiobus_softc *sc; sc = GPIOBUS_SOFTC(dev); if (pin > sc->sc_npins) return (EINVAL); if (name == NULL) return (EINVAL); /* Save the pin name. */ if (sc->sc_pins[pin].name == NULL) sc->sc_pins[pin].name = malloc(GPIOMAXNAME, M_DEVBUF, M_WAITOK | M_ZERO); strlcpy(sc->sc_pins[pin].name, name, GPIOMAXNAME); return (0); } static device_method_t gpiobus_methods[] = { /* Device interface */ DEVMETHOD(device_probe, gpiobus_probe), DEVMETHOD(device_attach, gpiobus_attach), DEVMETHOD(device_detach, gpiobus_detach), DEVMETHOD(device_shutdown, bus_generic_shutdown), DEVMETHOD(device_suspend, gpiobus_suspend), DEVMETHOD(device_resume, gpiobus_resume), /* Bus interface */ DEVMETHOD(bus_setup_intr, bus_generic_setup_intr), DEVMETHOD(bus_config_intr, bus_generic_config_intr), DEVMETHOD(bus_teardown_intr, bus_generic_teardown_intr), DEVMETHOD(bus_set_resource, gpiobus_set_resource), DEVMETHOD(bus_alloc_resource, gpiobus_alloc_resource), DEVMETHOD(bus_release_resource, gpiobus_release_resource), DEVMETHOD(bus_activate_resource, bus_generic_activate_resource), DEVMETHOD(bus_deactivate_resource, bus_generic_deactivate_resource), DEVMETHOD(bus_get_resource_list, gpiobus_get_resource_list), DEVMETHOD(bus_add_child, gpiobus_add_child), DEVMETHOD(bus_probe_nomatch, gpiobus_probe_nomatch), DEVMETHOD(bus_print_child, gpiobus_print_child), DEVMETHOD(bus_child_pnpinfo_str, gpiobus_child_pnpinfo_str), DEVMETHOD(bus_child_location_str, gpiobus_child_location_str), DEVMETHOD(bus_hinted_child, gpiobus_hinted_child), /* GPIO protocol */ DEVMETHOD(gpiobus_acquire_bus, gpiobus_acquire_bus), DEVMETHOD(gpiobus_release_bus, gpiobus_release_bus), DEVMETHOD(gpiobus_pin_getflags, gpiobus_pin_getflags), DEVMETHOD(gpiobus_pin_getcaps, gpiobus_pin_getcaps), DEVMETHOD(gpiobus_pin_setflags, gpiobus_pin_setflags), DEVMETHOD(gpiobus_pin_get, gpiobus_pin_get), DEVMETHOD(gpiobus_pin_set, gpiobus_pin_set), DEVMETHOD(gpiobus_pin_toggle, gpiobus_pin_toggle), DEVMETHOD(gpiobus_pin_getname, gpiobus_pin_getname), DEVMETHOD(gpiobus_pin_setname, gpiobus_pin_setname), DEVMETHOD_END }; driver_t gpiobus_driver = { "gpiobus", gpiobus_methods, sizeof(struct gpiobus_softc) }; devclass_t gpiobus_devclass; DRIVER_MODULE(gpiobus, gpio, gpiobus_driver, gpiobus_devclass, 0, 0); MODULE_VERSION(gpiobus, 1); Index: projects/vnet/sys/dev/gpio/gpiobusvar.h =================================================================== --- projects/vnet/sys/dev/gpio/gpiobusvar.h (revision 301546) +++ projects/vnet/sys/dev/gpio/gpiobusvar.h (revision 301547) @@ -1,144 +1,151 @@ /*- * Copyright (c) 2009 Oleksandr Tymoshenko * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ * */ #ifndef __GPIOBUS_H__ #define __GPIOBUS_H__ #include "opt_platform.h" #include #include #include #ifdef FDT #include #include #endif #include "gpio_if.h" #ifdef FDT #define GPIOBUS_IVAR(d) (struct gpiobus_ivar *) \ &((struct ofw_gpiobus_devinfo *)device_get_ivars(d))->opd_dinfo #else #define GPIOBUS_IVAR(d) (struct gpiobus_ivar *) device_get_ivars(d) #endif #define GPIOBUS_SOFTC(d) (struct gpiobus_softc *) device_get_softc(d) #define GPIOBUS_LOCK(_sc) mtx_lock(&(_sc)->sc_mtx) #define GPIOBUS_UNLOCK(_sc) mtx_unlock(&(_sc)->sc_mtx) #define GPIOBUS_LOCK_INIT(_sc) mtx_init(&_sc->sc_mtx, \ device_get_nameunit(_sc->sc_dev), "gpiobus", MTX_DEF) #define GPIOBUS_LOCK_DESTROY(_sc) mtx_destroy(&_sc->sc_mtx) #define GPIOBUS_ASSERT_LOCKED(_sc) mtx_assert(&_sc->sc_mtx, MA_OWNED) #define GPIOBUS_ASSERT_UNLOCKED(_sc) mtx_assert(&_sc->sc_mtx, MA_NOTOWNED) #define GPIOBUS_WAIT 1 #define GPIOBUS_DONTWAIT 2 /* Use default interrupt mode - for gpio_alloc_intr_resource */ #define GPIO_INTR_CONFORM GPIO_INTR_NONE struct gpiobus_pin_data { int mapped; /* pin is mapped/reserved. */ char *name; /* pin name. */ }; +struct intr_map_data_gpio { + struct intr_map_data hdr; + u_int gpio_pin_num; + u_int gpio_pin_flags; + u_int gpio_intr_mode; +}; + struct gpiobus_softc { struct mtx sc_mtx; /* bus mutex */ struct rman sc_intr_rman; /* isr resources */ device_t sc_busdev; /* bus device */ device_t sc_owner; /* bus owner */ device_t sc_dev; /* driver device */ int sc_npins; /* total pins on bus */ struct gpiobus_pin_data *sc_pins; /* pin data */ }; struct gpiobus_pin { device_t dev; /* gpio device */ uint32_t flags; /* pin flags */ uint32_t pin; /* pin number */ }; typedef struct gpiobus_pin *gpio_pin_t; struct gpiobus_ivar { struct resource_list rl; /* isr resource list */ uint32_t npins; /* pins total */ uint32_t *flags; /* pins flags */ uint32_t *pins; /* pins map */ }; #ifdef FDT struct ofw_gpiobus_devinfo { struct gpiobus_ivar opd_dinfo; struct ofw_bus_devinfo opd_obdinfo; }; static __inline int gpio_map_gpios(device_t bus, phandle_t dev, phandle_t gparent, int gcells, pcell_t *gpios, uint32_t *pin, uint32_t *flags) { return (GPIO_MAP_GPIOS(bus, dev, gparent, gcells, gpios, pin, flags)); } device_t ofw_gpiobus_add_fdt_child(device_t, const char *, phandle_t); int ofw_gpiobus_parse_gpios(device_t, char *, struct gpiobus_pin **); void ofw_gpiobus_register_provider(device_t); void ofw_gpiobus_unregister_provider(device_t); /* Consumers interface. */ int gpio_pin_get_by_ofw_name(device_t consumer, phandle_t node, char *name, gpio_pin_t *gpio); int gpio_pin_get_by_ofw_idx(device_t consumer, phandle_t node, int idx, gpio_pin_t *gpio); int gpio_pin_get_by_ofw_property(device_t consumer, phandle_t node, char *name, gpio_pin_t *gpio); void gpio_pin_release(gpio_pin_t gpio); int gpio_pin_getcaps(gpio_pin_t pin, uint32_t *caps); int gpio_pin_is_active(gpio_pin_t pin, bool *active); int gpio_pin_set_active(gpio_pin_t pin, bool active); int gpio_pin_setflags(gpio_pin_t pin, uint32_t flags); #endif struct resource *gpio_alloc_intr_resource(device_t consumer_dev, int *rid, u_int alloc_flags, gpio_pin_t pin, uint32_t intr_mode); int gpio_check_flags(uint32_t, uint32_t); device_t gpiobus_attach_bus(device_t); int gpiobus_detach_bus(device_t); int gpiobus_init_softc(device_t); int gpiobus_alloc_ivars(struct gpiobus_ivar *); void gpiobus_free_ivars(struct gpiobus_ivar *); int gpiobus_acquire_pin(device_t, uint32_t); int gpiobus_release_pin(device_t, uint32_t); extern driver_t gpiobus_driver; #endif /* __GPIOBUS_H__ */ Index: projects/vnet/sys/dev/hyperv/netvsc/hv_netvsc_drv_freebsd.c =================================================================== --- projects/vnet/sys/dev/hyperv/netvsc/hv_netvsc_drv_freebsd.c (revision 301546) +++ projects/vnet/sys/dev/hyperv/netvsc/hv_netvsc_drv_freebsd.c (revision 301547) @@ -1,3003 +1,3003 @@ /*- * Copyright (c) 2010-2012 Citrix Inc. * Copyright (c) 2009-2012,2016 Microsoft Corp. * Copyright (c) 2012 NetApp Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (c) 2004-2006 Kip Macy * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_inet6.h" #include "opt_inet.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "hv_net_vsc.h" #include "hv_rndis.h" #include "hv_rndis_filter.h" #define hv_chan_rxr hv_chan_priv1 #define hv_chan_txr hv_chan_priv2 /* Short for Hyper-V network interface */ #define NETVSC_DEVNAME "hn" /* * It looks like offset 0 of buf is reserved to hold the softc pointer. * The sc pointer evidently not needed, and is not presently populated. * The packet offset is where the netvsc_packet starts in the buffer. */ #define HV_NV_SC_PTR_OFFSET_IN_BUF 0 #define HV_NV_PACKET_OFFSET_IN_BUF 16 /* YYY should get it from the underlying channel */ #define HN_TX_DESC_CNT 512 #define HN_LROENT_CNT_DEF 128 #define HN_RING_CNT_DEF_MAX 8 #define HN_RNDIS_MSG_LEN \ (sizeof(rndis_msg) + \ RNDIS_HASHVAL_PPI_SIZE + \ RNDIS_VLAN_PPI_SIZE + \ RNDIS_TSO_PPI_SIZE + \ RNDIS_CSUM_PPI_SIZE) #define HN_RNDIS_MSG_BOUNDARY PAGE_SIZE #define HN_RNDIS_MSG_ALIGN CACHE_LINE_SIZE #define HN_TX_DATA_BOUNDARY PAGE_SIZE #define HN_TX_DATA_MAXSIZE IP_MAXPACKET #define HN_TX_DATA_SEGSIZE PAGE_SIZE #define HN_TX_DATA_SEGCNT_MAX \ (NETVSC_PACKET_MAXPAGE - HV_RF_NUM_TX_RESERVED_PAGE_BUFS) #define HN_DIRECT_TX_SIZE_DEF 128 #define HN_EARLY_TXEOF_THRESH 8 struct hn_txdesc { #ifndef HN_USE_TXDESC_BUFRING SLIST_ENTRY(hn_txdesc) link; #endif struct mbuf *m; struct hn_tx_ring *txr; int refs; uint32_t flags; /* HN_TXD_FLAG_ */ netvsc_packet netvsc_pkt; /* XXX to be removed */ bus_dmamap_t data_dmap; bus_addr_t rndis_msg_paddr; rndis_msg *rndis_msg; bus_dmamap_t rndis_msg_dmap; }; #define HN_TXD_FLAG_ONLIST 0x1 #define HN_TXD_FLAG_DMAMAP 0x2 /* * Only enable UDP checksum offloading when it is on 2012R2 or * later. UDP checksum offloading doesn't work on earlier * Windows releases. */ #define HN_CSUM_ASSIST_WIN8 (CSUM_IP | CSUM_TCP) #define HN_CSUM_ASSIST (CSUM_IP | CSUM_UDP | CSUM_TCP) #define HN_LRO_LENLIM_MULTIRX_DEF (12 * ETHERMTU) #define HN_LRO_LENLIM_DEF (25 * ETHERMTU) /* YYY 2*MTU is a bit rough, but should be good enough. */ #define HN_LRO_LENLIM_MIN(ifp) (2 * (ifp)->if_mtu) #define HN_LRO_ACKCNT_DEF 1 /* * Be aware that this sleepable mutex will exhibit WITNESS errors when * certain TCP and ARP code paths are taken. This appears to be a * well-known condition, as all other drivers checked use a sleeping * mutex to protect their transmit paths. * Also Be aware that mutexes do not play well with semaphores, and there * is a conflicting semaphore in a certain channel code path. */ #define NV_LOCK_INIT(_sc, _name) \ mtx_init(&(_sc)->hn_lock, _name, MTX_NETWORK_LOCK, MTX_DEF) #define NV_LOCK(_sc) mtx_lock(&(_sc)->hn_lock) #define NV_LOCK_ASSERT(_sc) mtx_assert(&(_sc)->hn_lock, MA_OWNED) #define NV_UNLOCK(_sc) mtx_unlock(&(_sc)->hn_lock) #define NV_LOCK_DESTROY(_sc) mtx_destroy(&(_sc)->hn_lock) /* * Globals */ int hv_promisc_mode = 0; /* normal mode by default */ SYSCTL_NODE(_hw, OID_AUTO, hn, CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, "Hyper-V network interface"); /* Trust tcp segements verification on host side. */ static int hn_trust_hosttcp = 1; SYSCTL_INT(_hw_hn, OID_AUTO, trust_hosttcp, CTLFLAG_RDTUN, &hn_trust_hosttcp, 0, "Trust tcp segement verification on host side, " "when csum info is missing (global setting)"); /* Trust udp datagrams verification on host side. */ static int hn_trust_hostudp = 1; SYSCTL_INT(_hw_hn, OID_AUTO, trust_hostudp, CTLFLAG_RDTUN, &hn_trust_hostudp, 0, "Trust udp datagram verification on host side, " "when csum info is missing (global setting)"); /* Trust ip packets verification on host side. */ static int hn_trust_hostip = 1; SYSCTL_INT(_hw_hn, OID_AUTO, trust_hostip, CTLFLAG_RDTUN, &hn_trust_hostip, 0, "Trust ip packet verification on host side, " "when csum info is missing (global setting)"); #if __FreeBSD_version >= 1100045 /* Limit TSO burst size */ static int hn_tso_maxlen = 0; SYSCTL_INT(_hw_hn, OID_AUTO, tso_maxlen, CTLFLAG_RDTUN, &hn_tso_maxlen, 0, "TSO burst limit"); #endif /* Limit chimney send size */ static int hn_tx_chimney_size = 0; SYSCTL_INT(_hw_hn, OID_AUTO, tx_chimney_size, CTLFLAG_RDTUN, &hn_tx_chimney_size, 0, "Chimney send packet size limit"); /* Limit the size of packet for direct transmission */ static int hn_direct_tx_size = HN_DIRECT_TX_SIZE_DEF; SYSCTL_INT(_hw_hn, OID_AUTO, direct_tx_size, CTLFLAG_RDTUN, &hn_direct_tx_size, 0, "Size of the packet for direct transmission"); #if defined(INET) || defined(INET6) #if __FreeBSD_version >= 1100095 static int hn_lro_entry_count = HN_LROENT_CNT_DEF; SYSCTL_INT(_hw_hn, OID_AUTO, lro_entry_count, CTLFLAG_RDTUN, &hn_lro_entry_count, 0, "LRO entry count"); #endif #endif static int hn_share_tx_taskq = 0; SYSCTL_INT(_hw_hn, OID_AUTO, share_tx_taskq, CTLFLAG_RDTUN, &hn_share_tx_taskq, 0, "Enable shared TX taskqueue"); static struct taskqueue *hn_tx_taskq; #ifndef HN_USE_TXDESC_BUFRING static int hn_use_txdesc_bufring = 0; #else static int hn_use_txdesc_bufring = 1; #endif SYSCTL_INT(_hw_hn, OID_AUTO, use_txdesc_bufring, CTLFLAG_RD, &hn_use_txdesc_bufring, 0, "Use buf_ring for TX descriptors"); static int hn_bind_tx_taskq = -1; SYSCTL_INT(_hw_hn, OID_AUTO, bind_tx_taskq, CTLFLAG_RDTUN, &hn_bind_tx_taskq, 0, "Bind TX taskqueue to the specified cpu"); static int hn_use_if_start = 0; SYSCTL_INT(_hw_hn, OID_AUTO, use_if_start, CTLFLAG_RDTUN, &hn_use_if_start, 0, "Use if_start TX method"); static int hn_chan_cnt = 0; SYSCTL_INT(_hw_hn, OID_AUTO, chan_cnt, CTLFLAG_RDTUN, &hn_chan_cnt, 0, "# of channels to use; each channel has one RX ring and one TX ring"); static int hn_tx_ring_cnt = 0; SYSCTL_INT(_hw_hn, OID_AUTO, tx_ring_cnt, CTLFLAG_RDTUN, &hn_tx_ring_cnt, 0, "# of TX rings to use"); static int hn_tx_swq_depth = 0; SYSCTL_INT(_hw_hn, OID_AUTO, tx_swq_depth, CTLFLAG_RDTUN, &hn_tx_swq_depth, 0, "Depth of IFQ or BUFRING"); static u_int hn_cpu_index; /* * Forward declarations */ static void hn_stop(hn_softc_t *sc); static void hn_ifinit_locked(hn_softc_t *sc); static void hn_ifinit(void *xsc); static int hn_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data); static int hn_start_locked(struct hn_tx_ring *txr, int len); static void hn_start(struct ifnet *ifp); static void hn_start_txeof(struct hn_tx_ring *); static int hn_ifmedia_upd(struct ifnet *ifp); static void hn_ifmedia_sts(struct ifnet *ifp, struct ifmediareq *ifmr); #if __FreeBSD_version >= 1100099 static int hn_lro_lenlim_sysctl(SYSCTL_HANDLER_ARGS); static int hn_lro_ackcnt_sysctl(SYSCTL_HANDLER_ARGS); #endif static int hn_trust_hcsum_sysctl(SYSCTL_HANDLER_ARGS); static int hn_tx_chimney_size_sysctl(SYSCTL_HANDLER_ARGS); static int hn_rx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS); static int hn_rx_stat_u64_sysctl(SYSCTL_HANDLER_ARGS); static int hn_tx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS); static int hn_tx_conf_int_sysctl(SYSCTL_HANDLER_ARGS); static int hn_check_iplen(const struct mbuf *, int); static int hn_create_tx_ring(struct hn_softc *, int); static void hn_destroy_tx_ring(struct hn_tx_ring *); static int hn_create_tx_data(struct hn_softc *, int); static void hn_destroy_tx_data(struct hn_softc *); static void hn_start_taskfunc(void *, int); static void hn_start_txeof_taskfunc(void *, int); static void hn_stop_tx_tasks(struct hn_softc *); static int hn_encap(struct hn_tx_ring *, struct hn_txdesc *, struct mbuf **); static void hn_create_rx_data(struct hn_softc *sc, int); static void hn_destroy_rx_data(struct hn_softc *sc); static void hn_set_tx_chimney_size(struct hn_softc *, int); static void hn_channel_attach(struct hn_softc *, struct hv_vmbus_channel *); static void hn_subchan_attach(struct hn_softc *, struct hv_vmbus_channel *); static int hn_transmit(struct ifnet *, struct mbuf *); static void hn_xmit_qflush(struct ifnet *); static int hn_xmit(struct hn_tx_ring *, int); static void hn_xmit_txeof(struct hn_tx_ring *); static void hn_xmit_taskfunc(void *, int); static void hn_xmit_txeof_taskfunc(void *, int); #if __FreeBSD_version >= 1100099 static void hn_set_lro_lenlim(struct hn_softc *sc, int lenlim) { int i; for (i = 0; i < sc->hn_rx_ring_inuse; ++i) sc->hn_rx_ring[i].hn_lro.lro_length_lim = lenlim; } #endif static int hn_get_txswq_depth(const struct hn_tx_ring *txr) { KASSERT(txr->hn_txdesc_cnt > 0, ("tx ring is not setup yet")); if (hn_tx_swq_depth < txr->hn_txdesc_cnt) return txr->hn_txdesc_cnt; return hn_tx_swq_depth; } static int hn_ifmedia_upd(struct ifnet *ifp __unused) { return EOPNOTSUPP; } static void hn_ifmedia_sts(struct ifnet *ifp, struct ifmediareq *ifmr) { struct hn_softc *sc = ifp->if_softc; ifmr->ifm_status = IFM_AVALID; ifmr->ifm_active = IFM_ETHER; if (!sc->hn_carrier) { ifmr->ifm_active |= IFM_NONE; return; } ifmr->ifm_status |= IFM_ACTIVE; ifmr->ifm_active |= IFM_10G_T | IFM_FDX; } /* {F8615163-DF3E-46c5-913F-F2D2F965ED0E} */ static const hv_guid g_net_vsc_device_type = { .data = {0x63, 0x51, 0x61, 0xF8, 0x3E, 0xDF, 0xc5, 0x46, 0x91, 0x3F, 0xF2, 0xD2, 0xF9, 0x65, 0xED, 0x0E} }; /* * Standard probe entry point. * */ static int netvsc_probe(device_t dev) { const char *p; p = vmbus_get_type(dev); if (!memcmp(p, &g_net_vsc_device_type.data, sizeof(hv_guid))) { device_set_desc(dev, "Hyper-V Network Interface"); if (bootverbose) printf("Netvsc probe... DONE \n"); return (BUS_PROBE_DEFAULT); } return (ENXIO); } /* * Standard attach entry point. * * Called when the driver is loaded. It allocates needed resources, * and initializes the "hardware" and software. */ static int netvsc_attach(device_t dev) { struct hv_device *device_ctx = vmbus_get_devctx(dev); struct hv_vmbus_channel *pri_chan; netvsc_device_info device_info; hn_softc_t *sc; int unit = device_get_unit(dev); struct ifnet *ifp = NULL; int error, ring_cnt, tx_ring_cnt; #if __FreeBSD_version >= 1100045 int tso_maxlen; #endif sc = device_get_softc(dev); sc->hn_unit = unit; sc->hn_dev = dev; if (hn_tx_taskq == NULL) { sc->hn_tx_taskq = taskqueue_create("hn_tx", M_WAITOK, taskqueue_thread_enqueue, &sc->hn_tx_taskq); if (hn_bind_tx_taskq >= 0) { int cpu = hn_bind_tx_taskq; cpuset_t cpu_set; if (cpu > mp_ncpus - 1) cpu = mp_ncpus - 1; CPU_SETOF(cpu, &cpu_set); taskqueue_start_threads_cpuset(&sc->hn_tx_taskq, 1, PI_NET, &cpu_set, "%s tx", device_get_nameunit(dev)); } else { taskqueue_start_threads(&sc->hn_tx_taskq, 1, PI_NET, "%s tx", device_get_nameunit(dev)); } } else { sc->hn_tx_taskq = hn_tx_taskq; } NV_LOCK_INIT(sc, "NetVSCLock"); sc->hn_dev_obj = device_ctx; ifp = sc->hn_ifp = if_alloc(IFT_ETHER); ifp->if_softc = sc; if_initname(ifp, device_get_name(dev), device_get_unit(dev)); /* * Figure out the # of RX rings (ring_cnt) and the # of TX rings * to use (tx_ring_cnt). * * NOTE: * The # of RX rings to use is same as the # of channels to use. */ ring_cnt = hn_chan_cnt; if (ring_cnt <= 0) { /* Default */ ring_cnt = mp_ncpus; if (ring_cnt > HN_RING_CNT_DEF_MAX) ring_cnt = HN_RING_CNT_DEF_MAX; } else if (ring_cnt > mp_ncpus) { ring_cnt = mp_ncpus; } tx_ring_cnt = hn_tx_ring_cnt; if (tx_ring_cnt <= 0 || tx_ring_cnt > ring_cnt) tx_ring_cnt = ring_cnt; if (hn_use_if_start) { /* ifnet.if_start only needs one TX ring. */ tx_ring_cnt = 1; } /* * Set the leader CPU for channels. */ sc->hn_cpu = atomic_fetchadd_int(&hn_cpu_index, ring_cnt) % mp_ncpus; error = hn_create_tx_data(sc, tx_ring_cnt); if (error) goto failed; hn_create_rx_data(sc, ring_cnt); /* * Associate the first TX/RX ring w/ the primary channel. */ pri_chan = device_ctx->channel; KASSERT(HV_VMBUS_CHAN_ISPRIMARY(pri_chan), ("not primary channel")); KASSERT(pri_chan->offer_msg.offer.sub_channel_index == 0, ("primary channel subidx %u", pri_chan->offer_msg.offer.sub_channel_index)); hn_channel_attach(sc, pri_chan); ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; ifp->if_ioctl = hn_ioctl; ifp->if_init = hn_ifinit; /* needed by hv_rf_on_device_add() code */ ifp->if_mtu = ETHERMTU; if (hn_use_if_start) { int qdepth = hn_get_txswq_depth(&sc->hn_tx_ring[0]); ifp->if_start = hn_start; IFQ_SET_MAXLEN(&ifp->if_snd, qdepth); ifp->if_snd.ifq_drv_maxlen = qdepth - 1; IFQ_SET_READY(&ifp->if_snd); } else { ifp->if_transmit = hn_transmit; ifp->if_qflush = hn_xmit_qflush; } ifmedia_init(&sc->hn_media, 0, hn_ifmedia_upd, hn_ifmedia_sts); ifmedia_add(&sc->hn_media, IFM_ETHER | IFM_AUTO, 0, NULL); ifmedia_set(&sc->hn_media, IFM_ETHER | IFM_AUTO); /* XXX ifmedia_set really should do this for us */ sc->hn_media.ifm_media = sc->hn_media.ifm_cur->ifm_media; /* * Tell upper layers that we support full VLAN capability. */ ifp->if_hdrlen = sizeof(struct ether_vlan_header); ifp->if_capabilities |= IFCAP_VLAN_HWTAGGING | IFCAP_VLAN_MTU | IFCAP_HWCSUM | IFCAP_TSO | IFCAP_LRO; ifp->if_capenable |= IFCAP_VLAN_HWTAGGING | IFCAP_VLAN_MTU | IFCAP_HWCSUM | IFCAP_TSO | IFCAP_LRO; ifp->if_hwassist = sc->hn_tx_ring[0].hn_csum_assist | CSUM_TSO; error = hv_rf_on_device_add(device_ctx, &device_info, ring_cnt); if (error) goto failed; KASSERT(sc->net_dev->num_channel > 0 && sc->net_dev->num_channel <= sc->hn_rx_ring_inuse, ("invalid channel count %u, should be less than %d", sc->net_dev->num_channel, sc->hn_rx_ring_inuse)); /* * Set the # of TX/RX rings that could be used according to * the # of channels that host offered. */ if (sc->hn_tx_ring_inuse > sc->net_dev->num_channel) sc->hn_tx_ring_inuse = sc->net_dev->num_channel; sc->hn_rx_ring_inuse = sc->net_dev->num_channel; device_printf(dev, "%d TX ring, %d RX ring\n", sc->hn_tx_ring_inuse, sc->hn_rx_ring_inuse); if (sc->net_dev->num_channel > 1) { struct hv_vmbus_channel **subchan; int subchan_cnt = sc->net_dev->num_channel - 1; int i; /* Wait for sub-channels setup to complete. */ subchan = vmbus_get_subchan(pri_chan, subchan_cnt); /* Attach the sub-channels. */ for (i = 0; i < subchan_cnt; ++i) { /* NOTE: Calling order is critical. */ hn_subchan_attach(sc, subchan[i]); hv_nv_subchan_attach(subchan[i]); } /* Release the sub-channels */ vmbus_rel_subchan(subchan, subchan_cnt); device_printf(dev, "%d sub-channels setup done\n", subchan_cnt); } #if __FreeBSD_version >= 1100099 if (sc->hn_rx_ring_inuse > 1) { /* * Reduce TCP segment aggregation limit for multiple * RX rings to increase ACK timeliness. */ hn_set_lro_lenlim(sc, HN_LRO_LENLIM_MULTIRX_DEF); } #endif if (device_info.link_state == 0) { sc->hn_carrier = 1; } #if __FreeBSD_version >= 1100045 tso_maxlen = hn_tso_maxlen; if (tso_maxlen <= 0 || tso_maxlen > IP_MAXPACKET) tso_maxlen = IP_MAXPACKET; ifp->if_hw_tsomaxsegcount = HN_TX_DATA_SEGCNT_MAX; ifp->if_hw_tsomaxsegsize = PAGE_SIZE; ifp->if_hw_tsomax = tso_maxlen - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN); #endif ether_ifattach(ifp, device_info.mac_addr); #if __FreeBSD_version >= 1100045 if_printf(ifp, "TSO: %u/%u/%u\n", ifp->if_hw_tsomax, ifp->if_hw_tsomaxsegcount, ifp->if_hw_tsomaxsegsize); #endif sc->hn_tx_chimney_max = sc->net_dev->send_section_size; hn_set_tx_chimney_size(sc, sc->hn_tx_chimney_max); if (hn_tx_chimney_size > 0 && hn_tx_chimney_size < sc->hn_tx_chimney_max) hn_set_tx_chimney_size(sc, hn_tx_chimney_size); return (0); failed: hn_destroy_tx_data(sc); if (ifp != NULL) if_free(ifp); return (error); } /* * Standard detach entry point */ static int netvsc_detach(device_t dev) { struct hn_softc *sc = device_get_softc(dev); struct hv_device *hv_device = vmbus_get_devctx(dev); if (bootverbose) printf("netvsc_detach\n"); /* * XXXKYS: Need to clean up all our * driver state; this is the driver * unloading. */ /* * XXXKYS: Need to stop outgoing traffic and unregister * the netdevice. */ hv_rf_on_device_remove(hv_device, HV_RF_NV_DESTROY_CHANNEL); hn_stop_tx_tasks(sc); ifmedia_removeall(&sc->hn_media); hn_destroy_rx_data(sc); hn_destroy_tx_data(sc); if (sc->hn_tx_taskq != hn_tx_taskq) taskqueue_free(sc->hn_tx_taskq); return (0); } /* * Standard shutdown entry point */ static int netvsc_shutdown(device_t dev) { return (0); } static __inline int hn_txdesc_dmamap_load(struct hn_tx_ring *txr, struct hn_txdesc *txd, struct mbuf **m_head, bus_dma_segment_t *segs, int *nsegs) { struct mbuf *m = *m_head; int error; error = bus_dmamap_load_mbuf_sg(txr->hn_tx_data_dtag, txd->data_dmap, m, segs, nsegs, BUS_DMA_NOWAIT); if (error == EFBIG) { struct mbuf *m_new; m_new = m_collapse(m, M_NOWAIT, HN_TX_DATA_SEGCNT_MAX); if (m_new == NULL) return ENOBUFS; else *m_head = m = m_new; txr->hn_tx_collapsed++; error = bus_dmamap_load_mbuf_sg(txr->hn_tx_data_dtag, txd->data_dmap, m, segs, nsegs, BUS_DMA_NOWAIT); } if (!error) { bus_dmamap_sync(txr->hn_tx_data_dtag, txd->data_dmap, BUS_DMASYNC_PREWRITE); txd->flags |= HN_TXD_FLAG_DMAMAP; } return error; } static __inline void hn_txdesc_dmamap_unload(struct hn_tx_ring *txr, struct hn_txdesc *txd) { if (txd->flags & HN_TXD_FLAG_DMAMAP) { bus_dmamap_sync(txr->hn_tx_data_dtag, txd->data_dmap, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->hn_tx_data_dtag, txd->data_dmap); txd->flags &= ~HN_TXD_FLAG_DMAMAP; } } static __inline int hn_txdesc_put(struct hn_tx_ring *txr, struct hn_txdesc *txd) { KASSERT((txd->flags & HN_TXD_FLAG_ONLIST) == 0, ("put an onlist txd %#x", txd->flags)); KASSERT(txd->refs > 0, ("invalid txd refs %d", txd->refs)); if (atomic_fetchadd_int(&txd->refs, -1) != 1) return 0; hn_txdesc_dmamap_unload(txr, txd); if (txd->m != NULL) { m_freem(txd->m); txd->m = NULL; } txd->flags |= HN_TXD_FLAG_ONLIST; #ifndef HN_USE_TXDESC_BUFRING mtx_lock_spin(&txr->hn_txlist_spin); KASSERT(txr->hn_txdesc_avail >= 0 && txr->hn_txdesc_avail < txr->hn_txdesc_cnt, ("txdesc_put: invalid txd avail %d", txr->hn_txdesc_avail)); txr->hn_txdesc_avail++; SLIST_INSERT_HEAD(&txr->hn_txlist, txd, link); mtx_unlock_spin(&txr->hn_txlist_spin); #else atomic_add_int(&txr->hn_txdesc_avail, 1); buf_ring_enqueue(txr->hn_txdesc_br, txd); #endif return 1; } static __inline struct hn_txdesc * hn_txdesc_get(struct hn_tx_ring *txr) { struct hn_txdesc *txd; #ifndef HN_USE_TXDESC_BUFRING mtx_lock_spin(&txr->hn_txlist_spin); txd = SLIST_FIRST(&txr->hn_txlist); if (txd != NULL) { KASSERT(txr->hn_txdesc_avail > 0, ("txdesc_get: invalid txd avail %d", txr->hn_txdesc_avail)); txr->hn_txdesc_avail--; SLIST_REMOVE_HEAD(&txr->hn_txlist, link); } mtx_unlock_spin(&txr->hn_txlist_spin); #else txd = buf_ring_dequeue_sc(txr->hn_txdesc_br); #endif if (txd != NULL) { #ifdef HN_USE_TXDESC_BUFRING atomic_subtract_int(&txr->hn_txdesc_avail, 1); #endif KASSERT(txd->m == NULL && txd->refs == 0 && (txd->flags & HN_TXD_FLAG_ONLIST), ("invalid txd")); txd->flags &= ~HN_TXD_FLAG_ONLIST; txd->refs = 1; } return txd; } static __inline void hn_txdesc_hold(struct hn_txdesc *txd) { /* 0->1 transition will never work */ KASSERT(txd->refs > 0, ("invalid refs %d", txd->refs)); atomic_add_int(&txd->refs, 1); } static __inline void hn_txeof(struct hn_tx_ring *txr) { txr->hn_has_txeof = 0; txr->hn_txeof(txr); } static void hn_tx_done(struct hv_vmbus_channel *chan, void *xpkt) { netvsc_packet *packet = xpkt; struct hn_txdesc *txd; struct hn_tx_ring *txr; txd = (struct hn_txdesc *)(uintptr_t) packet->compl.send.send_completion_tid; txr = txd->txr; KASSERT(txr->hn_chan == chan, ("channel mismatch, on channel%u, should be channel%u", chan->offer_msg.offer.sub_channel_index, txr->hn_chan->offer_msg.offer.sub_channel_index)); txr->hn_has_txeof = 1; hn_txdesc_put(txr, txd); ++txr->hn_txdone_cnt; if (txr->hn_txdone_cnt >= HN_EARLY_TXEOF_THRESH) { txr->hn_txdone_cnt = 0; if (txr->hn_oactive) hn_txeof(txr); } } void netvsc_channel_rollup(struct hv_vmbus_channel *chan) { struct hn_tx_ring *txr = chan->hv_chan_txr; #if defined(INET) || defined(INET6) struct hn_rx_ring *rxr = chan->hv_chan_rxr; tcp_lro_flush_all(&rxr->hn_lro); #endif /* * NOTE: * 'txr' could be NULL, if multiple channels and * ifnet.if_start method are enabled. */ if (txr == NULL || !txr->hn_has_txeof) return; txr->hn_txdone_cnt = 0; hn_txeof(txr); } /* * NOTE: * If this function fails, then both txd and m_head0 will be freed. */ static int hn_encap(struct hn_tx_ring *txr, struct hn_txdesc *txd, struct mbuf **m_head0) { bus_dma_segment_t segs[HN_TX_DATA_SEGCNT_MAX]; int error, nsegs, i; struct mbuf *m_head = *m_head0; netvsc_packet *packet; rndis_msg *rndis_mesg; rndis_packet *rndis_pkt; rndis_per_packet_info *rppi; struct rndis_hash_value *hash_value; uint32_t rndis_msg_size; packet = &txd->netvsc_pkt; packet->is_data_pkt = TRUE; packet->tot_data_buf_len = m_head->m_pkthdr.len; /* * extension points to the area reserved for the * rndis_filter_packet, which is placed just after * the netvsc_packet (and rppi struct, if present; * length is updated later). */ rndis_mesg = txd->rndis_msg; /* XXX not necessary */ memset(rndis_mesg, 0, HN_RNDIS_MSG_LEN); rndis_mesg->ndis_msg_type = REMOTE_NDIS_PACKET_MSG; rndis_pkt = &rndis_mesg->msg.packet; rndis_pkt->data_offset = sizeof(rndis_packet); rndis_pkt->data_length = packet->tot_data_buf_len; rndis_pkt->per_pkt_info_offset = sizeof(rndis_packet); rndis_msg_size = RNDIS_MESSAGE_SIZE(rndis_packet); /* * Set the hash value for this packet, so that the host could * dispatch the TX done event for this packet back to this TX * ring's channel. */ rndis_msg_size += RNDIS_HASHVAL_PPI_SIZE; rppi = hv_set_rppi_data(rndis_mesg, RNDIS_HASHVAL_PPI_SIZE, nbl_hash_value); hash_value = (struct rndis_hash_value *)((uint8_t *)rppi + rppi->per_packet_info_offset); hash_value->hash_value = txr->hn_tx_idx; if (m_head->m_flags & M_VLANTAG) { ndis_8021q_info *rppi_vlan_info; rndis_msg_size += RNDIS_VLAN_PPI_SIZE; rppi = hv_set_rppi_data(rndis_mesg, RNDIS_VLAN_PPI_SIZE, ieee_8021q_info); rppi_vlan_info = (ndis_8021q_info *)((uint8_t *)rppi + rppi->per_packet_info_offset); rppi_vlan_info->u1.s1.vlan_id = m_head->m_pkthdr.ether_vtag & 0xfff; } if (m_head->m_pkthdr.csum_flags & CSUM_TSO) { rndis_tcp_tso_info *tso_info; struct ether_vlan_header *eh; int ether_len; /* * XXX need m_pullup and use mtodo */ eh = mtod(m_head, struct ether_vlan_header*); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) ether_len = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; else ether_len = ETHER_HDR_LEN; rndis_msg_size += RNDIS_TSO_PPI_SIZE; rppi = hv_set_rppi_data(rndis_mesg, RNDIS_TSO_PPI_SIZE, tcp_large_send_info); tso_info = (rndis_tcp_tso_info *)((uint8_t *)rppi + rppi->per_packet_info_offset); tso_info->lso_v2_xmit.type = RNDIS_TCP_LARGE_SEND_OFFLOAD_V2_TYPE; #ifdef INET if (m_head->m_pkthdr.csum_flags & CSUM_IP_TSO) { struct ip *ip = (struct ip *)(m_head->m_data + ether_len); unsigned long iph_len = ip->ip_hl << 2; struct tcphdr *th = (struct tcphdr *)((caddr_t)ip + iph_len); tso_info->lso_v2_xmit.ip_version = RNDIS_TCP_LARGE_SEND_OFFLOAD_IPV4; ip->ip_len = 0; ip->ip_sum = 0; th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); } #endif #if defined(INET6) && defined(INET) else #endif #ifdef INET6 { struct ip6_hdr *ip6 = (struct ip6_hdr *) (m_head->m_data + ether_len); struct tcphdr *th = (struct tcphdr *)(ip6 + 1); tso_info->lso_v2_xmit.ip_version = RNDIS_TCP_LARGE_SEND_OFFLOAD_IPV6; ip6->ip6_plen = 0; th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0); } #endif tso_info->lso_v2_xmit.tcp_header_offset = 0; tso_info->lso_v2_xmit.mss = m_head->m_pkthdr.tso_segsz; } else if (m_head->m_pkthdr.csum_flags & txr->hn_csum_assist) { rndis_tcp_ip_csum_info *csum_info; rndis_msg_size += RNDIS_CSUM_PPI_SIZE; rppi = hv_set_rppi_data(rndis_mesg, RNDIS_CSUM_PPI_SIZE, tcpip_chksum_info); csum_info = (rndis_tcp_ip_csum_info *)((uint8_t *)rppi + rppi->per_packet_info_offset); csum_info->xmit.is_ipv4 = 1; if (m_head->m_pkthdr.csum_flags & CSUM_IP) csum_info->xmit.ip_header_csum = 1; if (m_head->m_pkthdr.csum_flags & CSUM_TCP) { csum_info->xmit.tcp_csum = 1; csum_info->xmit.tcp_header_offset = 0; } else if (m_head->m_pkthdr.csum_flags & CSUM_UDP) { csum_info->xmit.udp_csum = 1; } } rndis_mesg->msg_len = packet->tot_data_buf_len + rndis_msg_size; packet->tot_data_buf_len = rndis_mesg->msg_len; /* * Chimney send, if the packet could fit into one chimney buffer. */ if (packet->tot_data_buf_len < txr->hn_tx_chimney_size) { netvsc_dev *net_dev = txr->hn_sc->net_dev; uint32_t send_buf_section_idx; txr->hn_tx_chimney_tried++; send_buf_section_idx = hv_nv_get_next_send_section(net_dev); if (send_buf_section_idx != NVSP_1_CHIMNEY_SEND_INVALID_SECTION_INDEX) { uint8_t *dest = ((uint8_t *)net_dev->send_buf + (send_buf_section_idx * net_dev->send_section_size)); memcpy(dest, rndis_mesg, rndis_msg_size); dest += rndis_msg_size; m_copydata(m_head, 0, m_head->m_pkthdr.len, dest); packet->send_buf_section_idx = send_buf_section_idx; packet->send_buf_section_size = packet->tot_data_buf_len; packet->page_buf_count = 0; txr->hn_tx_chimney++; goto done; } } error = hn_txdesc_dmamap_load(txr, txd, &m_head, segs, &nsegs); if (error) { int freed; /* * This mbuf is not linked w/ the txd yet, so free it now. */ m_freem(m_head); *m_head0 = NULL; freed = hn_txdesc_put(txr, txd); KASSERT(freed != 0, ("fail to free txd upon txdma error")); txr->hn_txdma_failed++; if_inc_counter(txr->hn_sc->hn_ifp, IFCOUNTER_OERRORS, 1); return error; } *m_head0 = m_head; packet->page_buf_count = nsegs + HV_RF_NUM_TX_RESERVED_PAGE_BUFS; /* send packet with page buffer */ packet->page_buffers[0].pfn = atop(txd->rndis_msg_paddr); packet->page_buffers[0].offset = txd->rndis_msg_paddr & PAGE_MASK; packet->page_buffers[0].length = rndis_msg_size; /* * Fill the page buffers with mbuf info starting at index * HV_RF_NUM_TX_RESERVED_PAGE_BUFS. */ for (i = 0; i < nsegs; ++i) { hv_vmbus_page_buffer *pb = &packet->page_buffers[ i + HV_RF_NUM_TX_RESERVED_PAGE_BUFS]; pb->pfn = atop(segs[i].ds_addr); pb->offset = segs[i].ds_addr & PAGE_MASK; pb->length = segs[i].ds_len; } packet->send_buf_section_idx = NVSP_1_CHIMNEY_SEND_INVALID_SECTION_INDEX; packet->send_buf_section_size = 0; done: txd->m = m_head; /* Set the completion routine */ packet->compl.send.on_send_completion = hn_tx_done; packet->compl.send.send_completion_context = packet; packet->compl.send.send_completion_tid = (uint64_t)(uintptr_t)txd; return 0; } /* * NOTE: * If this function fails, then txd will be freed, but the mbuf * associated w/ the txd will _not_ be freed. */ static int hn_send_pkt(struct ifnet *ifp, struct hn_tx_ring *txr, struct hn_txdesc *txd) { int error, send_failed = 0; again: /* * Make sure that txd is not freed before ETHER_BPF_MTAP. */ hn_txdesc_hold(txd); error = hv_nv_on_send(txr->hn_chan, &txd->netvsc_pkt); if (!error) { ETHER_BPF_MTAP(ifp, txd->m); if_inc_counter(ifp, IFCOUNTER_OPACKETS, 1); if (!hn_use_if_start) { if_inc_counter(ifp, IFCOUNTER_OBYTES, txd->m->m_pkthdr.len); if (txd->m->m_flags & M_MCAST) if_inc_counter(ifp, IFCOUNTER_OMCASTS, 1); } txr->hn_pkts++; } hn_txdesc_put(txr, txd); if (__predict_false(error)) { int freed; /* * This should "really rarely" happen. * * XXX Too many RX to be acked or too many sideband * commands to run? Ask netvsc_channel_rollup() * to kick start later. */ txr->hn_has_txeof = 1; if (!send_failed) { txr->hn_send_failed++; send_failed = 1; /* * Try sending again after set hn_has_txeof; * in case that we missed the last * netvsc_channel_rollup(). */ goto again; } if_printf(ifp, "send failed\n"); /* * Caller will perform further processing on the * associated mbuf, so don't free it in hn_txdesc_put(); * only unload it from the DMA map in hn_txdesc_put(), * if it was loaded. */ txd->m = NULL; freed = hn_txdesc_put(txr, txd); KASSERT(freed != 0, ("fail to free txd upon send error")); txr->hn_send_failed++; } return error; } /* * Start a transmit of one or more packets */ static int hn_start_locked(struct hn_tx_ring *txr, int len) { struct hn_softc *sc = txr->hn_sc; struct ifnet *ifp = sc->hn_ifp; KASSERT(hn_use_if_start, ("hn_start_locked is called, when if_start is disabled")); KASSERT(txr == &sc->hn_tx_ring[0], ("not the first TX ring")); mtx_assert(&txr->hn_tx_lock, MA_OWNED); if ((ifp->if_drv_flags & (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)) != IFF_DRV_RUNNING) return 0; while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) { struct hn_txdesc *txd; struct mbuf *m_head; int error; IFQ_DRV_DEQUEUE(&ifp->if_snd, m_head); if (m_head == NULL) break; if (len > 0 && m_head->m_pkthdr.len > len) { /* * This sending could be time consuming; let callers * dispatch this packet sending (and sending of any * following up packets) to tx taskqueue. */ IFQ_DRV_PREPEND(&ifp->if_snd, m_head); return 1; } txd = hn_txdesc_get(txr); if (txd == NULL) { txr->hn_no_txdescs++; IFQ_DRV_PREPEND(&ifp->if_snd, m_head); atomic_set_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); break; } error = hn_encap(txr, txd, &m_head); if (error) { /* Both txd and m_head are freed */ continue; } error = hn_send_pkt(ifp, txr, txd); if (__predict_false(error)) { /* txd is freed, but m_head is not */ IFQ_DRV_PREPEND(&ifp->if_snd, m_head); atomic_set_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); break; } } return 0; } /* * Link up/down notification */ void netvsc_linkstatus_callback(struct hv_device *device_obj, uint32_t status) { hn_softc_t *sc = device_get_softc(device_obj->device); if (status == 1) { sc->hn_carrier = 1; } else { sc->hn_carrier = 0; } } /* * Append the specified data to the indicated mbuf chain, * Extend the mbuf chain if the new data does not fit in * existing space. * * This is a minor rewrite of m_append() from sys/kern/uipc_mbuf.c. * There should be an equivalent in the kernel mbuf code, * but there does not appear to be one yet. * * Differs from m_append() in that additional mbufs are * allocated with cluster size MJUMPAGESIZE, and filled * accordingly. * * Return 1 if able to complete the job; otherwise 0. */ static int hv_m_append(struct mbuf *m0, int len, c_caddr_t cp) { struct mbuf *m, *n; int remainder, space; for (m = m0; m->m_next != NULL; m = m->m_next) ; remainder = len; space = M_TRAILINGSPACE(m); if (space > 0) { /* * Copy into available space. */ if (space > remainder) space = remainder; bcopy(cp, mtod(m, caddr_t) + m->m_len, space); m->m_len += space; cp += space; remainder -= space; } while (remainder > 0) { /* * Allocate a new mbuf; could check space * and allocate a cluster instead. */ n = m_getjcl(M_NOWAIT, m->m_type, 0, MJUMPAGESIZE); if (n == NULL) break; n->m_len = min(MJUMPAGESIZE, remainder); bcopy(cp, mtod(n, caddr_t), n->m_len); cp += n->m_len; remainder -= n->m_len; m->m_next = n; m = n; } if (m0->m_flags & M_PKTHDR) m0->m_pkthdr.len += len - remainder; return (remainder == 0); } /* * Called when we receive a data packet from the "wire" on the * specified device * * Note: This is no longer used as a callback */ int netvsc_recv(struct hv_vmbus_channel *chan, netvsc_packet *packet, const rndis_tcp_ip_csum_info *csum_info, const struct rndis_hash_info *hash_info, const struct rndis_hash_value *hash_value) { struct hn_rx_ring *rxr = chan->hv_chan_rxr; struct ifnet *ifp = rxr->hn_ifp; struct mbuf *m_new; int size, do_lro = 0, do_csum = 1; + int hash_type = M_HASHTYPE_OPAQUE_HASH; if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) return (0); /* * Bail out if packet contains more data than configured MTU. */ if (packet->tot_data_buf_len > (ifp->if_mtu + ETHER_HDR_LEN)) { return (0); } else if (packet->tot_data_buf_len <= MHLEN) { m_new = m_gethdr(M_NOWAIT, MT_DATA); if (m_new == NULL) { if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1); return (0); } memcpy(mtod(m_new, void *), packet->data, packet->tot_data_buf_len); m_new->m_pkthdr.len = m_new->m_len = packet->tot_data_buf_len; rxr->hn_small_pkts++; } else { /* * Get an mbuf with a cluster. For packets 2K or less, * get a standard 2K cluster. For anything larger, get a * 4K cluster. Any buffers larger than 4K can cause problems * if looped around to the Hyper-V TX channel, so avoid them. */ size = MCLBYTES; if (packet->tot_data_buf_len > MCLBYTES) { /* 4096 */ size = MJUMPAGESIZE; } m_new = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, size); if (m_new == NULL) { if_inc_counter(ifp, IFCOUNTER_IQDROPS, 1); return (0); } hv_m_append(m_new, packet->tot_data_buf_len, packet->data); } m_new->m_pkthdr.rcvif = ifp; if (__predict_false((ifp->if_capenable & IFCAP_RXCSUM) == 0)) do_csum = 0; /* receive side checksum offload */ if (csum_info != NULL) { /* IP csum offload */ if (csum_info->receive.ip_csum_succeeded && do_csum) { m_new->m_pkthdr.csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID); rxr->hn_csum_ip++; } /* TCP/UDP csum offload */ if ((csum_info->receive.tcp_csum_succeeded || csum_info->receive.udp_csum_succeeded) && do_csum) { m_new->m_pkthdr.csum_flags |= (CSUM_DATA_VALID | CSUM_PSEUDO_HDR); m_new->m_pkthdr.csum_data = 0xffff; if (csum_info->receive.tcp_csum_succeeded) rxr->hn_csum_tcp++; else rxr->hn_csum_udp++; } if (csum_info->receive.ip_csum_succeeded && csum_info->receive.tcp_csum_succeeded) do_lro = 1; } else { const struct ether_header *eh; uint16_t etype; int hoff; hoff = sizeof(*eh); if (m_new->m_len < hoff) goto skip; eh = mtod(m_new, struct ether_header *); etype = ntohs(eh->ether_type); if (etype == ETHERTYPE_VLAN) { const struct ether_vlan_header *evl; hoff = sizeof(*evl); if (m_new->m_len < hoff) goto skip; evl = mtod(m_new, struct ether_vlan_header *); etype = ntohs(evl->evl_proto); } if (etype == ETHERTYPE_IP) { int pr; pr = hn_check_iplen(m_new, hoff); if (pr == IPPROTO_TCP) { if (do_csum && (rxr->hn_trust_hcsum & HN_TRUST_HCSUM_TCP)) { rxr->hn_csum_trusted++; m_new->m_pkthdr.csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR); m_new->m_pkthdr.csum_data = 0xffff; } do_lro = 1; } else if (pr == IPPROTO_UDP) { if (do_csum && (rxr->hn_trust_hcsum & HN_TRUST_HCSUM_UDP)) { rxr->hn_csum_trusted++; m_new->m_pkthdr.csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR); m_new->m_pkthdr.csum_data = 0xffff; } } else if (pr != IPPROTO_DONE && do_csum && (rxr->hn_trust_hcsum & HN_TRUST_HCSUM_IP)) { rxr->hn_csum_trusted++; m_new->m_pkthdr.csum_flags |= (CSUM_IP_CHECKED | CSUM_IP_VALID); } } } skip: if ((packet->vlan_tci != 0) && (ifp->if_capenable & IFCAP_VLAN_HWTAGGING) != 0) { m_new->m_pkthdr.ether_vtag = packet->vlan_tci; m_new->m_flags |= M_VLANTAG; } if (hash_info != NULL && hash_value != NULL) { - int hash_type = M_HASHTYPE_OPAQUE; - rxr->hn_rss_pkts++; m_new->m_pkthdr.flowid = hash_value->hash_value; if ((hash_info->hash_info & NDIS_HASH_FUNCTION_MASK) == NDIS_HASH_FUNCTION_TOEPLITZ) { uint32_t type = (hash_info->hash_info & NDIS_HASH_TYPE_MASK); switch (type) { case NDIS_HASH_IPV4: hash_type = M_HASHTYPE_RSS_IPV4; break; case NDIS_HASH_TCP_IPV4: hash_type = M_HASHTYPE_RSS_TCP_IPV4; break; case NDIS_HASH_IPV6: hash_type = M_HASHTYPE_RSS_IPV6; break; case NDIS_HASH_IPV6_EX: hash_type = M_HASHTYPE_RSS_IPV6_EX; break; case NDIS_HASH_TCP_IPV6: hash_type = M_HASHTYPE_RSS_TCP_IPV6; break; case NDIS_HASH_TCP_IPV6_EX: hash_type = M_HASHTYPE_RSS_TCP_IPV6_EX; break; } } - M_HASHTYPE_SET(m_new, hash_type); } else { - if (hash_value != NULL) + if (hash_value != NULL) { m_new->m_pkthdr.flowid = hash_value->hash_value; - else + } else { m_new->m_pkthdr.flowid = rxr->hn_rx_idx; - M_HASHTYPE_SET(m_new, M_HASHTYPE_OPAQUE); + hash_type = M_HASHTYPE_OPAQUE; + } } + M_HASHTYPE_SET(m_new, hash_type); /* * Note: Moved RX completion back to hv_nv_on_receive() so all * messages (not just data messages) will trigger a response. */ if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); rxr->hn_pkts++; if ((ifp->if_capenable & IFCAP_LRO) && do_lro) { #if defined(INET) || defined(INET6) struct lro_ctrl *lro = &rxr->hn_lro; if (lro->lro_cnt) { rxr->hn_lro_tried++; if (tcp_lro_rx(lro, m_new, 0) == 0) { /* DONE! */ return 0; } } #endif } /* We're not holding the lock here, so don't release it */ (*ifp->if_input)(ifp, m_new); return (0); } /* * Rules for using sc->temp_unusable: * 1. sc->temp_unusable can only be read or written while holding NV_LOCK() * 2. code reading sc->temp_unusable under NV_LOCK(), and finding * sc->temp_unusable set, must release NV_LOCK() and exit * 3. to retain exclusive control of the interface, * sc->temp_unusable must be set by code before releasing NV_LOCK() * 4. only code setting sc->temp_unusable can clear sc->temp_unusable * 5. code setting sc->temp_unusable must eventually clear sc->temp_unusable */ /* * Standard ioctl entry point. Called when the user wants to configure * the interface. */ static int hn_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data) { hn_softc_t *sc = ifp->if_softc; struct ifreq *ifr = (struct ifreq *)data; #ifdef INET struct ifaddr *ifa = (struct ifaddr *)data; #endif netvsc_device_info device_info; struct hv_device *hn_dev; int mask, error = 0; int retry_cnt = 500; switch(cmd) { case SIOCSIFADDR: #ifdef INET if (ifa->ifa_addr->sa_family == AF_INET) { ifp->if_flags |= IFF_UP; if (!(ifp->if_drv_flags & IFF_DRV_RUNNING)) hn_ifinit(sc); arp_ifinit(ifp, ifa); } else #endif error = ether_ioctl(ifp, cmd, data); break; case SIOCSIFMTU: hn_dev = vmbus_get_devctx(sc->hn_dev); /* Check MTU value change */ if (ifp->if_mtu == ifr->ifr_mtu) break; if (ifr->ifr_mtu > NETVSC_MAX_CONFIGURABLE_MTU) { error = EINVAL; break; } /* Obtain and record requested MTU */ ifp->if_mtu = ifr->ifr_mtu; #if __FreeBSD_version >= 1100099 /* * Make sure that LRO aggregation length limit is still * valid, after the MTU change. */ NV_LOCK(sc); if (sc->hn_rx_ring[0].hn_lro.lro_length_lim < HN_LRO_LENLIM_MIN(ifp)) hn_set_lro_lenlim(sc, HN_LRO_LENLIM_MIN(ifp)); NV_UNLOCK(sc); #endif do { NV_LOCK(sc); if (!sc->temp_unusable) { sc->temp_unusable = TRUE; retry_cnt = -1; } NV_UNLOCK(sc); if (retry_cnt > 0) { retry_cnt--; DELAY(5 * 1000); } } while (retry_cnt > 0); if (retry_cnt == 0) { error = EINVAL; break; } /* We must remove and add back the device to cause the new * MTU to take effect. This includes tearing down, but not * deleting the channel, then bringing it back up. */ error = hv_rf_on_device_remove(hn_dev, HV_RF_NV_RETAIN_CHANNEL); if (error) { NV_LOCK(sc); sc->temp_unusable = FALSE; NV_UNLOCK(sc); break; } error = hv_rf_on_device_add(hn_dev, &device_info, sc->hn_rx_ring_inuse); if (error) { NV_LOCK(sc); sc->temp_unusable = FALSE; NV_UNLOCK(sc); break; } sc->hn_tx_chimney_max = sc->net_dev->send_section_size; if (sc->hn_tx_ring[0].hn_tx_chimney_size > sc->hn_tx_chimney_max) hn_set_tx_chimney_size(sc, sc->hn_tx_chimney_max); hn_ifinit_locked(sc); NV_LOCK(sc); sc->temp_unusable = FALSE; NV_UNLOCK(sc); break; case SIOCSIFFLAGS: do { NV_LOCK(sc); if (!sc->temp_unusable) { sc->temp_unusable = TRUE; retry_cnt = -1; } NV_UNLOCK(sc); if (retry_cnt > 0) { retry_cnt--; DELAY(5 * 1000); } } while (retry_cnt > 0); if (retry_cnt == 0) { error = EINVAL; break; } if (ifp->if_flags & IFF_UP) { /* * If only the state of the PROMISC flag changed, * then just use the 'set promisc mode' command * instead of reinitializing the entire NIC. Doing * a full re-init means reloading the firmware and * waiting for it to start up, which may take a * second or two. */ #ifdef notyet /* Fixme: Promiscuous mode? */ if (ifp->if_drv_flags & IFF_DRV_RUNNING && ifp->if_flags & IFF_PROMISC && !(sc->hn_if_flags & IFF_PROMISC)) { /* do something here for Hyper-V */ } else if (ifp->if_drv_flags & IFF_DRV_RUNNING && !(ifp->if_flags & IFF_PROMISC) && sc->hn_if_flags & IFF_PROMISC) { /* do something here for Hyper-V */ } else #endif hn_ifinit_locked(sc); } else { if (ifp->if_drv_flags & IFF_DRV_RUNNING) { hn_stop(sc); } } NV_LOCK(sc); sc->temp_unusable = FALSE; NV_UNLOCK(sc); sc->hn_if_flags = ifp->if_flags; error = 0; break; case SIOCSIFCAP: NV_LOCK(sc); mask = ifr->ifr_reqcap ^ ifp->if_capenable; if (mask & IFCAP_TXCSUM) { ifp->if_capenable ^= IFCAP_TXCSUM; if (ifp->if_capenable & IFCAP_TXCSUM) { ifp->if_hwassist |= sc->hn_tx_ring[0].hn_csum_assist; } else { ifp->if_hwassist &= ~sc->hn_tx_ring[0].hn_csum_assist; } } if (mask & IFCAP_RXCSUM) ifp->if_capenable ^= IFCAP_RXCSUM; if (mask & IFCAP_LRO) ifp->if_capenable ^= IFCAP_LRO; if (mask & IFCAP_TSO4) { ifp->if_capenable ^= IFCAP_TSO4; if (ifp->if_capenable & IFCAP_TSO4) ifp->if_hwassist |= CSUM_IP_TSO; else ifp->if_hwassist &= ~CSUM_IP_TSO; } if (mask & IFCAP_TSO6) { ifp->if_capenable ^= IFCAP_TSO6; if (ifp->if_capenable & IFCAP_TSO6) ifp->if_hwassist |= CSUM_IP6_TSO; else ifp->if_hwassist &= ~CSUM_IP6_TSO; } NV_UNLOCK(sc); error = 0; break; case SIOCADDMULTI: case SIOCDELMULTI: #ifdef notyet /* Fixme: Multicast mode? */ if (ifp->if_drv_flags & IFF_DRV_RUNNING) { NV_LOCK(sc); netvsc_setmulti(sc); NV_UNLOCK(sc); error = 0; } #endif error = EINVAL; break; case SIOCSIFMEDIA: case SIOCGIFMEDIA: error = ifmedia_ioctl(ifp, ifr, &sc->hn_media, cmd); break; default: error = ether_ioctl(ifp, cmd, data); break; } return (error); } /* * */ static void hn_stop(hn_softc_t *sc) { struct ifnet *ifp; int ret, i; struct hv_device *device_ctx = vmbus_get_devctx(sc->hn_dev); ifp = sc->hn_ifp; if (bootverbose) printf(" Closing Device ...\n"); atomic_clear_int(&ifp->if_drv_flags, (IFF_DRV_RUNNING | IFF_DRV_OACTIVE)); for (i = 0; i < sc->hn_tx_ring_inuse; ++i) sc->hn_tx_ring[i].hn_oactive = 0; if_link_state_change(ifp, LINK_STATE_DOWN); sc->hn_initdone = 0; ret = hv_rf_on_close(device_ctx); } /* * FreeBSD transmit entry point */ static void hn_start(struct ifnet *ifp) { struct hn_softc *sc = ifp->if_softc; struct hn_tx_ring *txr = &sc->hn_tx_ring[0]; if (txr->hn_sched_tx) goto do_sched; if (mtx_trylock(&txr->hn_tx_lock)) { int sched; sched = hn_start_locked(txr, txr->hn_direct_tx_size); mtx_unlock(&txr->hn_tx_lock); if (!sched) return; } do_sched: taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task); } static void hn_start_txeof(struct hn_tx_ring *txr) { struct hn_softc *sc = txr->hn_sc; struct ifnet *ifp = sc->hn_ifp; KASSERT(txr == &sc->hn_tx_ring[0], ("not the first TX ring")); if (txr->hn_sched_tx) goto do_sched; if (mtx_trylock(&txr->hn_tx_lock)) { int sched; atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); sched = hn_start_locked(txr, txr->hn_direct_tx_size); mtx_unlock(&txr->hn_tx_lock); if (sched) { taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task); } } else { do_sched: /* * Release the OACTIVE earlier, with the hope, that * others could catch up. The task will clear the * flag again with the hn_tx_lock to avoid possible * races. */ atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_txeof_task); } } /* * */ static void hn_ifinit_locked(hn_softc_t *sc) { struct ifnet *ifp; struct hv_device *device_ctx = vmbus_get_devctx(sc->hn_dev); int ret, i; ifp = sc->hn_ifp; if (ifp->if_drv_flags & IFF_DRV_RUNNING) { return; } hv_promisc_mode = 1; ret = hv_rf_on_open(device_ctx); if (ret != 0) { return; } else { sc->hn_initdone = 1; } atomic_clear_int(&ifp->if_drv_flags, IFF_DRV_OACTIVE); for (i = 0; i < sc->hn_tx_ring_inuse; ++i) sc->hn_tx_ring[i].hn_oactive = 0; atomic_set_int(&ifp->if_drv_flags, IFF_DRV_RUNNING); if_link_state_change(ifp, LINK_STATE_UP); } /* * */ static void hn_ifinit(void *xsc) { hn_softc_t *sc = xsc; NV_LOCK(sc); if (sc->temp_unusable) { NV_UNLOCK(sc); return; } sc->temp_unusable = TRUE; NV_UNLOCK(sc); hn_ifinit_locked(sc); NV_LOCK(sc); sc->temp_unusable = FALSE; NV_UNLOCK(sc); } #ifdef LATER /* * */ static void hn_watchdog(struct ifnet *ifp) { hn_softc_t *sc; sc = ifp->if_softc; printf("hn%d: watchdog timeout -- resetting\n", sc->hn_unit); hn_ifinit(sc); /*???*/ if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); } #endif #if __FreeBSD_version >= 1100099 static int hn_lro_lenlim_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; unsigned int lenlim; int error; lenlim = sc->hn_rx_ring[0].hn_lro.lro_length_lim; error = sysctl_handle_int(oidp, &lenlim, 0, req); if (error || req->newptr == NULL) return error; if (lenlim < HN_LRO_LENLIM_MIN(sc->hn_ifp) || lenlim > TCP_LRO_LENGTH_MAX) return EINVAL; NV_LOCK(sc); hn_set_lro_lenlim(sc, lenlim); NV_UNLOCK(sc); return 0; } static int hn_lro_ackcnt_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ackcnt, error, i; /* * lro_ackcnt_lim is append count limit, * +1 to turn it into aggregation limit. */ ackcnt = sc->hn_rx_ring[0].hn_lro.lro_ackcnt_lim + 1; error = sysctl_handle_int(oidp, &ackcnt, 0, req); if (error || req->newptr == NULL) return error; if (ackcnt < 2 || ackcnt > (TCP_LRO_ACKCNT_MAX + 1)) return EINVAL; /* * Convert aggregation limit back to append * count limit. */ --ackcnt; NV_LOCK(sc); for (i = 0; i < sc->hn_rx_ring_inuse; ++i) sc->hn_rx_ring[i].hn_lro.lro_ackcnt_lim = ackcnt; NV_UNLOCK(sc); return 0; } #endif static int hn_trust_hcsum_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int hcsum = arg2; int on, error, i; on = 0; if (sc->hn_rx_ring[0].hn_trust_hcsum & hcsum) on = 1; error = sysctl_handle_int(oidp, &on, 0, req); if (error || req->newptr == NULL) return error; NV_LOCK(sc); for (i = 0; i < sc->hn_rx_ring_inuse; ++i) { struct hn_rx_ring *rxr = &sc->hn_rx_ring[i]; if (on) rxr->hn_trust_hcsum |= hcsum; else rxr->hn_trust_hcsum &= ~hcsum; } NV_UNLOCK(sc); return 0; } static int hn_tx_chimney_size_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int chimney_size, error; chimney_size = sc->hn_tx_ring[0].hn_tx_chimney_size; error = sysctl_handle_int(oidp, &chimney_size, 0, req); if (error || req->newptr == NULL) return error; if (chimney_size > sc->hn_tx_chimney_max || chimney_size <= 0) return EINVAL; hn_set_tx_chimney_size(sc, chimney_size); return 0; } static int hn_rx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error; struct hn_rx_ring *rxr; u_long stat; stat = 0; for (i = 0; i < sc->hn_rx_ring_inuse; ++i) { rxr = &sc->hn_rx_ring[i]; stat += *((u_long *)((uint8_t *)rxr + ofs)); } error = sysctl_handle_long(oidp, &stat, 0, req); if (error || req->newptr == NULL) return error; /* Zero out this stat. */ for (i = 0; i < sc->hn_rx_ring_inuse; ++i) { rxr = &sc->hn_rx_ring[i]; *((u_long *)((uint8_t *)rxr + ofs)) = 0; } return 0; } static int hn_rx_stat_u64_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error; struct hn_rx_ring *rxr; uint64_t stat; stat = 0; for (i = 0; i < sc->hn_rx_ring_inuse; ++i) { rxr = &sc->hn_rx_ring[i]; stat += *((uint64_t *)((uint8_t *)rxr + ofs)); } error = sysctl_handle_64(oidp, &stat, 0, req); if (error || req->newptr == NULL) return error; /* Zero out this stat. */ for (i = 0; i < sc->hn_rx_ring_inuse; ++i) { rxr = &sc->hn_rx_ring[i]; *((uint64_t *)((uint8_t *)rxr + ofs)) = 0; } return 0; } static int hn_tx_stat_ulong_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error; struct hn_tx_ring *txr; u_long stat; stat = 0; for (i = 0; i < sc->hn_tx_ring_inuse; ++i) { txr = &sc->hn_tx_ring[i]; stat += *((u_long *)((uint8_t *)txr + ofs)); } error = sysctl_handle_long(oidp, &stat, 0, req); if (error || req->newptr == NULL) return error; /* Zero out this stat. */ for (i = 0; i < sc->hn_tx_ring_inuse; ++i) { txr = &sc->hn_tx_ring[i]; *((u_long *)((uint8_t *)txr + ofs)) = 0; } return 0; } static int hn_tx_conf_int_sysctl(SYSCTL_HANDLER_ARGS) { struct hn_softc *sc = arg1; int ofs = arg2, i, error, conf; struct hn_tx_ring *txr; txr = &sc->hn_tx_ring[0]; conf = *((int *)((uint8_t *)txr + ofs)); error = sysctl_handle_int(oidp, &conf, 0, req); if (error || req->newptr == NULL) return error; NV_LOCK(sc); for (i = 0; i < sc->hn_tx_ring_inuse; ++i) { txr = &sc->hn_tx_ring[i]; *((int *)((uint8_t *)txr + ofs)) = conf; } NV_UNLOCK(sc); return 0; } static int hn_check_iplen(const struct mbuf *m, int hoff) { const struct ip *ip; int len, iphlen, iplen; const struct tcphdr *th; int thoff; /* TCP data offset */ len = hoff + sizeof(struct ip); /* The packet must be at least the size of an IP header. */ if (m->m_pkthdr.len < len) return IPPROTO_DONE; /* The fixed IP header must reside completely in the first mbuf. */ if (m->m_len < len) return IPPROTO_DONE; ip = mtodo(m, hoff); /* Bound check the packet's stated IP header length. */ iphlen = ip->ip_hl << 2; if (iphlen < sizeof(struct ip)) /* minimum header length */ return IPPROTO_DONE; /* The full IP header must reside completely in the one mbuf. */ if (m->m_len < hoff + iphlen) return IPPROTO_DONE; iplen = ntohs(ip->ip_len); /* * Check that the amount of data in the buffers is as * at least much as the IP header would have us expect. */ if (m->m_pkthdr.len < hoff + iplen) return IPPROTO_DONE; /* * Ignore IP fragments. */ if (ntohs(ip->ip_off) & (IP_OFFMASK | IP_MF)) return IPPROTO_DONE; /* * The TCP/IP or UDP/IP header must be entirely contained within * the first fragment of a packet. */ switch (ip->ip_p) { case IPPROTO_TCP: if (iplen < iphlen + sizeof(struct tcphdr)) return IPPROTO_DONE; if (m->m_len < hoff + iphlen + sizeof(struct tcphdr)) return IPPROTO_DONE; th = (const struct tcphdr *)((const uint8_t *)ip + iphlen); thoff = th->th_off << 2; if (thoff < sizeof(struct tcphdr) || thoff + iphlen > iplen) return IPPROTO_DONE; if (m->m_len < hoff + iphlen + thoff) return IPPROTO_DONE; break; case IPPROTO_UDP: if (iplen < iphlen + sizeof(struct udphdr)) return IPPROTO_DONE; if (m->m_len < hoff + iphlen + sizeof(struct udphdr)) return IPPROTO_DONE; break; default: if (iplen < iphlen) return IPPROTO_DONE; break; } return ip->ip_p; } static void hn_create_rx_data(struct hn_softc *sc, int ring_cnt) { struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx; device_t dev = sc->hn_dev; #if defined(INET) || defined(INET6) #if __FreeBSD_version >= 1100095 int lroent_cnt; #endif #endif int i; sc->hn_rx_ring_cnt = ring_cnt; sc->hn_rx_ring_inuse = sc->hn_rx_ring_cnt; sc->hn_rx_ring = malloc(sizeof(struct hn_rx_ring) * sc->hn_rx_ring_cnt, M_NETVSC, M_WAITOK | M_ZERO); #if defined(INET) || defined(INET6) #if __FreeBSD_version >= 1100095 lroent_cnt = hn_lro_entry_count; if (lroent_cnt < TCP_LRO_ENTRIES) lroent_cnt = TCP_LRO_ENTRIES; device_printf(dev, "LRO: entry count %d\n", lroent_cnt); #endif #endif /* INET || INET6 */ ctx = device_get_sysctl_ctx(dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(dev)); /* Create dev.hn.UNIT.rx sysctl tree */ sc->hn_rx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "rx", CTLFLAG_RD | CTLFLAG_MPSAFE, 0, ""); for (i = 0; i < sc->hn_rx_ring_cnt; ++i) { struct hn_rx_ring *rxr = &sc->hn_rx_ring[i]; if (hn_trust_hosttcp) rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_TCP; if (hn_trust_hostudp) rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_UDP; if (hn_trust_hostip) rxr->hn_trust_hcsum |= HN_TRUST_HCSUM_IP; rxr->hn_ifp = sc->hn_ifp; rxr->hn_rx_idx = i; /* * Initialize LRO. */ #if defined(INET) || defined(INET6) #if __FreeBSD_version >= 1100095 tcp_lro_init_args(&rxr->hn_lro, sc->hn_ifp, lroent_cnt, 0); #else tcp_lro_init(&rxr->hn_lro); rxr->hn_lro.ifp = sc->hn_ifp; #endif #if __FreeBSD_version >= 1100099 rxr->hn_lro.lro_length_lim = HN_LRO_LENLIM_DEF; rxr->hn_lro.lro_ackcnt_lim = HN_LRO_ACKCNT_DEF; #endif #endif /* INET || INET6 */ if (sc->hn_rx_sysctl_tree != NULL) { char name[16]; /* * Create per RX ring sysctl tree: * dev.hn.UNIT.rx.RINGID */ snprintf(name, sizeof(name), "%d", i); rxr->hn_rx_sysctl_tree = SYSCTL_ADD_NODE(ctx, SYSCTL_CHILDREN(sc->hn_rx_sysctl_tree), OID_AUTO, name, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, ""); if (rxr->hn_rx_sysctl_tree != NULL) { SYSCTL_ADD_ULONG(ctx, SYSCTL_CHILDREN(rxr->hn_rx_sysctl_tree), OID_AUTO, "packets", CTLFLAG_RW, &rxr->hn_pkts, "# of packets received"); SYSCTL_ADD_ULONG(ctx, SYSCTL_CHILDREN(rxr->hn_rx_sysctl_tree), OID_AUTO, "rss_pkts", CTLFLAG_RW, &rxr->hn_rss_pkts, "# of packets w/ RSS info received"); } } } SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_queued", CTLTYPE_U64 | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_lro.lro_queued), hn_rx_stat_u64_sysctl, "LU", "LRO queued"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_flushed", CTLTYPE_U64 | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_lro.lro_flushed), hn_rx_stat_u64_sysctl, "LU", "LRO flushed"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_tried", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_lro_tried), hn_rx_stat_ulong_sysctl, "LU", "# of LRO tries"); #if __FreeBSD_version >= 1100099 SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_length_lim", CTLTYPE_UINT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_lro_lenlim_sysctl, "IU", "Max # of data bytes to be aggregated by LRO"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "lro_ackcnt_lim", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_lro_ackcnt_sysctl, "I", "Max # of ACKs to be aggregated by LRO"); #endif SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hosttcp", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_TCP, hn_trust_hcsum_sysctl, "I", "Trust tcp segement verification on host side, " "when csum info is missing"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hostudp", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_UDP, hn_trust_hcsum_sysctl, "I", "Trust udp datagram verification on host side, " "when csum info is missing"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "trust_hostip", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, HN_TRUST_HCSUM_IP, hn_trust_hcsum_sysctl, "I", "Trust ip packet verification on host side, " "when csum info is missing"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_ip", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_csum_ip), hn_rx_stat_ulong_sysctl, "LU", "RXCSUM IP"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_tcp", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_csum_tcp), hn_rx_stat_ulong_sysctl, "LU", "RXCSUM TCP"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_udp", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_csum_udp), hn_rx_stat_ulong_sysctl, "LU", "RXCSUM UDP"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "csum_trusted", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_csum_trusted), hn_rx_stat_ulong_sysctl, "LU", "# of packets that we trust host's csum verification"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "small_pkts", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_rx_ring, hn_small_pkts), hn_rx_stat_ulong_sysctl, "LU", "# of small packets received"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rx_ring_cnt", CTLFLAG_RD, &sc->hn_rx_ring_cnt, 0, "# created RX rings"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "rx_ring_inuse", CTLFLAG_RD, &sc->hn_rx_ring_inuse, 0, "# used RX rings"); } static void hn_destroy_rx_data(struct hn_softc *sc) { #if defined(INET) || defined(INET6) int i; #endif if (sc->hn_rx_ring_cnt == 0) return; #if defined(INET) || defined(INET6) for (i = 0; i < sc->hn_rx_ring_cnt; ++i) tcp_lro_free(&sc->hn_rx_ring[i].hn_lro); #endif free(sc->hn_rx_ring, M_NETVSC); sc->hn_rx_ring = NULL; sc->hn_rx_ring_cnt = 0; sc->hn_rx_ring_inuse = 0; } static int hn_create_tx_ring(struct hn_softc *sc, int id) { struct hn_tx_ring *txr = &sc->hn_tx_ring[id]; bus_dma_tag_t parent_dtag; int error, i; txr->hn_sc = sc; txr->hn_tx_idx = id; #ifndef HN_USE_TXDESC_BUFRING mtx_init(&txr->hn_txlist_spin, "hn txlist", NULL, MTX_SPIN); #endif mtx_init(&txr->hn_tx_lock, "hn tx", NULL, MTX_DEF); txr->hn_txdesc_cnt = HN_TX_DESC_CNT; txr->hn_txdesc = malloc(sizeof(struct hn_txdesc) * txr->hn_txdesc_cnt, M_NETVSC, M_WAITOK | M_ZERO); #ifndef HN_USE_TXDESC_BUFRING SLIST_INIT(&txr->hn_txlist); #else txr->hn_txdesc_br = buf_ring_alloc(txr->hn_txdesc_cnt, M_NETVSC, M_WAITOK, &txr->hn_tx_lock); #endif txr->hn_tx_taskq = sc->hn_tx_taskq; if (hn_use_if_start) { txr->hn_txeof = hn_start_txeof; TASK_INIT(&txr->hn_tx_task, 0, hn_start_taskfunc, txr); TASK_INIT(&txr->hn_txeof_task, 0, hn_start_txeof_taskfunc, txr); } else { int br_depth; txr->hn_txeof = hn_xmit_txeof; TASK_INIT(&txr->hn_tx_task, 0, hn_xmit_taskfunc, txr); TASK_INIT(&txr->hn_txeof_task, 0, hn_xmit_txeof_taskfunc, txr); br_depth = hn_get_txswq_depth(txr); txr->hn_mbuf_br = buf_ring_alloc(br_depth, M_NETVSC, M_WAITOK, &txr->hn_tx_lock); } txr->hn_direct_tx_size = hn_direct_tx_size; if (hv_vmbus_protocal_version >= HV_VMBUS_VERSION_WIN8_1) txr->hn_csum_assist = HN_CSUM_ASSIST; else txr->hn_csum_assist = HN_CSUM_ASSIST_WIN8; /* * Always schedule transmission instead of trying to do direct * transmission. This one gives the best performance so far. */ txr->hn_sched_tx = 1; parent_dtag = bus_get_dma_tag(sc->hn_dev); /* DMA tag for RNDIS messages. */ error = bus_dma_tag_create(parent_dtag, /* parent */ HN_RNDIS_MSG_ALIGN, /* alignment */ HN_RNDIS_MSG_BOUNDARY, /* boundary */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ HN_RNDIS_MSG_LEN, /* maxsize */ 1, /* nsegments */ HN_RNDIS_MSG_LEN, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txr->hn_tx_rndis_dtag); if (error) { device_printf(sc->hn_dev, "failed to create rndis dmatag\n"); return error; } /* DMA tag for data. */ error = bus_dma_tag_create(parent_dtag, /* parent */ 1, /* alignment */ HN_TX_DATA_BOUNDARY, /* boundary */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ HN_TX_DATA_MAXSIZE, /* maxsize */ HN_TX_DATA_SEGCNT_MAX, /* nsegments */ HN_TX_DATA_SEGSIZE, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txr->hn_tx_data_dtag); if (error) { device_printf(sc->hn_dev, "failed to create data dmatag\n"); return error; } for (i = 0; i < txr->hn_txdesc_cnt; ++i) { struct hn_txdesc *txd = &txr->hn_txdesc[i]; txd->txr = txr; /* * Allocate and load RNDIS messages. */ error = bus_dmamem_alloc(txr->hn_tx_rndis_dtag, (void **)&txd->rndis_msg, BUS_DMA_WAITOK | BUS_DMA_COHERENT, &txd->rndis_msg_dmap); if (error) { device_printf(sc->hn_dev, "failed to allocate rndis_msg, %d\n", i); return error; } error = bus_dmamap_load(txr->hn_tx_rndis_dtag, txd->rndis_msg_dmap, txd->rndis_msg, HN_RNDIS_MSG_LEN, hyperv_dma_map_paddr, &txd->rndis_msg_paddr, BUS_DMA_NOWAIT); if (error) { device_printf(sc->hn_dev, "failed to load rndis_msg, %d\n", i); bus_dmamem_free(txr->hn_tx_rndis_dtag, txd->rndis_msg, txd->rndis_msg_dmap); return error; } /* DMA map for TX data. */ error = bus_dmamap_create(txr->hn_tx_data_dtag, 0, &txd->data_dmap); if (error) { device_printf(sc->hn_dev, "failed to allocate tx data dmamap\n"); bus_dmamap_unload(txr->hn_tx_rndis_dtag, txd->rndis_msg_dmap); bus_dmamem_free(txr->hn_tx_rndis_dtag, txd->rndis_msg, txd->rndis_msg_dmap); return error; } /* All set, put it to list */ txd->flags |= HN_TXD_FLAG_ONLIST; #ifndef HN_USE_TXDESC_BUFRING SLIST_INSERT_HEAD(&txr->hn_txlist, txd, link); #else buf_ring_enqueue(txr->hn_txdesc_br, txd); #endif } txr->hn_txdesc_avail = txr->hn_txdesc_cnt; if (sc->hn_tx_sysctl_tree != NULL) { struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx; char name[16]; /* * Create per TX ring sysctl tree: * dev.hn.UNIT.tx.RINGID */ ctx = device_get_sysctl_ctx(sc->hn_dev); child = SYSCTL_CHILDREN(sc->hn_tx_sysctl_tree); snprintf(name, sizeof(name), "%d", id); txr->hn_tx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, name, CTLFLAG_RD | CTLFLAG_MPSAFE, 0, ""); if (txr->hn_tx_sysctl_tree != NULL) { child = SYSCTL_CHILDREN(txr->hn_tx_sysctl_tree); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "txdesc_avail", CTLFLAG_RD, &txr->hn_txdesc_avail, 0, "# of available TX descs"); if (!hn_use_if_start) { SYSCTL_ADD_INT(ctx, child, OID_AUTO, "oactive", CTLFLAG_RD, &txr->hn_oactive, 0, "over active"); } SYSCTL_ADD_ULONG(ctx, child, OID_AUTO, "packets", CTLFLAG_RW, &txr->hn_pkts, "# of packets transmitted"); } } return 0; } static void hn_txdesc_dmamap_destroy(struct hn_txdesc *txd) { struct hn_tx_ring *txr = txd->txr; KASSERT(txd->m == NULL, ("still has mbuf installed")); KASSERT((txd->flags & HN_TXD_FLAG_DMAMAP) == 0, ("still dma mapped")); bus_dmamap_unload(txr->hn_tx_rndis_dtag, txd->rndis_msg_dmap); bus_dmamem_free(txr->hn_tx_rndis_dtag, txd->rndis_msg, txd->rndis_msg_dmap); bus_dmamap_destroy(txr->hn_tx_data_dtag, txd->data_dmap); } static void hn_destroy_tx_ring(struct hn_tx_ring *txr) { struct hn_txdesc *txd; if (txr->hn_txdesc == NULL) return; #ifndef HN_USE_TXDESC_BUFRING while ((txd = SLIST_FIRST(&txr->hn_txlist)) != NULL) { SLIST_REMOVE_HEAD(&txr->hn_txlist, link); hn_txdesc_dmamap_destroy(txd); } #else mtx_lock(&txr->hn_tx_lock); while ((txd = buf_ring_dequeue_sc(txr->hn_txdesc_br)) != NULL) hn_txdesc_dmamap_destroy(txd); mtx_unlock(&txr->hn_tx_lock); #endif if (txr->hn_tx_data_dtag != NULL) bus_dma_tag_destroy(txr->hn_tx_data_dtag); if (txr->hn_tx_rndis_dtag != NULL) bus_dma_tag_destroy(txr->hn_tx_rndis_dtag); #ifdef HN_USE_TXDESC_BUFRING buf_ring_free(txr->hn_txdesc_br, M_NETVSC); #endif free(txr->hn_txdesc, M_NETVSC); txr->hn_txdesc = NULL; if (txr->hn_mbuf_br != NULL) buf_ring_free(txr->hn_mbuf_br, M_NETVSC); #ifndef HN_USE_TXDESC_BUFRING mtx_destroy(&txr->hn_txlist_spin); #endif mtx_destroy(&txr->hn_tx_lock); } static int hn_create_tx_data(struct hn_softc *sc, int ring_cnt) { struct sysctl_oid_list *child; struct sysctl_ctx_list *ctx; int i; sc->hn_tx_ring_cnt = ring_cnt; sc->hn_tx_ring_inuse = sc->hn_tx_ring_cnt; sc->hn_tx_ring = malloc(sizeof(struct hn_tx_ring) * sc->hn_tx_ring_cnt, M_NETVSC, M_WAITOK | M_ZERO); ctx = device_get_sysctl_ctx(sc->hn_dev); child = SYSCTL_CHILDREN(device_get_sysctl_tree(sc->hn_dev)); /* Create dev.hn.UNIT.tx sysctl tree */ sc->hn_tx_sysctl_tree = SYSCTL_ADD_NODE(ctx, child, OID_AUTO, "tx", CTLFLAG_RD | CTLFLAG_MPSAFE, 0, ""); for (i = 0; i < sc->hn_tx_ring_cnt; ++i) { int error; error = hn_create_tx_ring(sc, i); if (error) return error; } SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "no_txdescs", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_no_txdescs), hn_tx_stat_ulong_sysctl, "LU", "# of times short of TX descs"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "send_failed", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_send_failed), hn_tx_stat_ulong_sysctl, "LU", "# of hyper-v sending failure"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "txdma_failed", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_txdma_failed), hn_tx_stat_ulong_sysctl, "LU", "# of TX DMA failure"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_collapsed", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_tx_collapsed), hn_tx_stat_ulong_sysctl, "LU", "# of TX mbuf collapsed"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_tx_chimney), hn_tx_stat_ulong_sysctl, "LU", "# of chimney send"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney_tried", CTLTYPE_ULONG | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_tx_chimney_tried), hn_tx_stat_ulong_sysctl, "LU", "# of chimney send tries"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "txdesc_cnt", CTLFLAG_RD, &sc->hn_tx_ring[0].hn_txdesc_cnt, 0, "# of total TX descs"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_chimney_max", CTLFLAG_RD, &sc->hn_tx_chimney_max, 0, "Chimney send packet size upper boundary"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "tx_chimney_size", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, 0, hn_tx_chimney_size_sysctl, "I", "Chimney send packet size limit"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "direct_tx_size", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_direct_tx_size), hn_tx_conf_int_sysctl, "I", "Size of the packet for direct transmission"); SYSCTL_ADD_PROC(ctx, child, OID_AUTO, "sched_tx", CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, sc, __offsetof(struct hn_tx_ring, hn_sched_tx), hn_tx_conf_int_sysctl, "I", "Always schedule transmission " "instead of doing direct transmission"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_ring_cnt", CTLFLAG_RD, &sc->hn_tx_ring_cnt, 0, "# created TX rings"); SYSCTL_ADD_INT(ctx, child, OID_AUTO, "tx_ring_inuse", CTLFLAG_RD, &sc->hn_tx_ring_inuse, 0, "# used TX rings"); return 0; } static void hn_set_tx_chimney_size(struct hn_softc *sc, int chimney_size) { int i; NV_LOCK(sc); for (i = 0; i < sc->hn_tx_ring_inuse; ++i) sc->hn_tx_ring[i].hn_tx_chimney_size = chimney_size; NV_UNLOCK(sc); } static void hn_destroy_tx_data(struct hn_softc *sc) { int i; if (sc->hn_tx_ring_cnt == 0) return; for (i = 0; i < sc->hn_tx_ring_cnt; ++i) hn_destroy_tx_ring(&sc->hn_tx_ring[i]); free(sc->hn_tx_ring, M_NETVSC); sc->hn_tx_ring = NULL; sc->hn_tx_ring_cnt = 0; sc->hn_tx_ring_inuse = 0; } static void hn_start_taskfunc(void *xtxr, int pending __unused) { struct hn_tx_ring *txr = xtxr; mtx_lock(&txr->hn_tx_lock); hn_start_locked(txr, 0); mtx_unlock(&txr->hn_tx_lock); } static void hn_start_txeof_taskfunc(void *xtxr, int pending __unused) { struct hn_tx_ring *txr = xtxr; mtx_lock(&txr->hn_tx_lock); atomic_clear_int(&txr->hn_sc->hn_ifp->if_drv_flags, IFF_DRV_OACTIVE); hn_start_locked(txr, 0); mtx_unlock(&txr->hn_tx_lock); } static void hn_stop_tx_tasks(struct hn_softc *sc) { int i; for (i = 0; i < sc->hn_tx_ring_inuse; ++i) { struct hn_tx_ring *txr = &sc->hn_tx_ring[i]; taskqueue_drain(txr->hn_tx_taskq, &txr->hn_tx_task); taskqueue_drain(txr->hn_tx_taskq, &txr->hn_txeof_task); } } static int hn_xmit(struct hn_tx_ring *txr, int len) { struct hn_softc *sc = txr->hn_sc; struct ifnet *ifp = sc->hn_ifp; struct mbuf *m_head; mtx_assert(&txr->hn_tx_lock, MA_OWNED); KASSERT(hn_use_if_start == 0, ("hn_xmit is called, when if_start is enabled")); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0 || txr->hn_oactive) return 0; while ((m_head = drbr_peek(ifp, txr->hn_mbuf_br)) != NULL) { struct hn_txdesc *txd; int error; if (len > 0 && m_head->m_pkthdr.len > len) { /* * This sending could be time consuming; let callers * dispatch this packet sending (and sending of any * following up packets) to tx taskqueue. */ drbr_putback(ifp, txr->hn_mbuf_br, m_head); return 1; } txd = hn_txdesc_get(txr); if (txd == NULL) { txr->hn_no_txdescs++; drbr_putback(ifp, txr->hn_mbuf_br, m_head); txr->hn_oactive = 1; break; } error = hn_encap(txr, txd, &m_head); if (error) { /* Both txd and m_head are freed; discard */ drbr_advance(ifp, txr->hn_mbuf_br); continue; } error = hn_send_pkt(ifp, txr, txd); if (__predict_false(error)) { /* txd is freed, but m_head is not */ drbr_putback(ifp, txr->hn_mbuf_br, m_head); txr->hn_oactive = 1; break; } /* Sent */ drbr_advance(ifp, txr->hn_mbuf_br); } return 0; } static int hn_transmit(struct ifnet *ifp, struct mbuf *m) { struct hn_softc *sc = ifp->if_softc; struct hn_tx_ring *txr; int error, idx = 0; /* * Select the TX ring based on flowid */ if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) idx = m->m_pkthdr.flowid % sc->hn_tx_ring_inuse; txr = &sc->hn_tx_ring[idx]; error = drbr_enqueue(ifp, txr->hn_mbuf_br, m); if (error) { if_inc_counter(ifp, IFCOUNTER_OQDROPS, 1); return error; } if (txr->hn_oactive) return 0; if (txr->hn_sched_tx) goto do_sched; if (mtx_trylock(&txr->hn_tx_lock)) { int sched; sched = hn_xmit(txr, txr->hn_direct_tx_size); mtx_unlock(&txr->hn_tx_lock); if (!sched) return 0; } do_sched: taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task); return 0; } static void hn_xmit_qflush(struct ifnet *ifp) { struct hn_softc *sc = ifp->if_softc; int i; for (i = 0; i < sc->hn_tx_ring_inuse; ++i) { struct hn_tx_ring *txr = &sc->hn_tx_ring[i]; struct mbuf *m; mtx_lock(&txr->hn_tx_lock); while ((m = buf_ring_dequeue_sc(txr->hn_mbuf_br)) != NULL) m_freem(m); mtx_unlock(&txr->hn_tx_lock); } if_qflush(ifp); } static void hn_xmit_txeof(struct hn_tx_ring *txr) { if (txr->hn_sched_tx) goto do_sched; if (mtx_trylock(&txr->hn_tx_lock)) { int sched; txr->hn_oactive = 0; sched = hn_xmit(txr, txr->hn_direct_tx_size); mtx_unlock(&txr->hn_tx_lock); if (sched) { taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_tx_task); } } else { do_sched: /* * Release the oactive earlier, with the hope, that * others could catch up. The task will clear the * oactive again with the hn_tx_lock to avoid possible * races. */ txr->hn_oactive = 0; taskqueue_enqueue(txr->hn_tx_taskq, &txr->hn_txeof_task); } } static void hn_xmit_taskfunc(void *xtxr, int pending __unused) { struct hn_tx_ring *txr = xtxr; mtx_lock(&txr->hn_tx_lock); hn_xmit(txr, 0); mtx_unlock(&txr->hn_tx_lock); } static void hn_xmit_txeof_taskfunc(void *xtxr, int pending __unused) { struct hn_tx_ring *txr = xtxr; mtx_lock(&txr->hn_tx_lock); txr->hn_oactive = 0; hn_xmit(txr, 0); mtx_unlock(&txr->hn_tx_lock); } static void hn_channel_attach(struct hn_softc *sc, struct hv_vmbus_channel *chan) { struct hn_rx_ring *rxr; int idx; idx = chan->offer_msg.offer.sub_channel_index; KASSERT(idx >= 0 && idx < sc->hn_rx_ring_inuse, ("invalid channel index %d, should > 0 && < %d", idx, sc->hn_rx_ring_inuse)); rxr = &sc->hn_rx_ring[idx]; KASSERT((rxr->hn_rx_flags & HN_RX_FLAG_ATTACHED) == 0, ("RX ring %d already attached", idx)); rxr->hn_rx_flags |= HN_RX_FLAG_ATTACHED; chan->hv_chan_rxr = rxr; if (bootverbose) { if_printf(sc->hn_ifp, "link RX ring %d to channel%u\n", idx, chan->offer_msg.child_rel_id); } if (idx < sc->hn_tx_ring_inuse) { struct hn_tx_ring *txr = &sc->hn_tx_ring[idx]; KASSERT((txr->hn_tx_flags & HN_TX_FLAG_ATTACHED) == 0, ("TX ring %d already attached", idx)); txr->hn_tx_flags |= HN_TX_FLAG_ATTACHED; chan->hv_chan_txr = txr; txr->hn_chan = chan; if (bootverbose) { if_printf(sc->hn_ifp, "link TX ring %d to channel%u\n", idx, chan->offer_msg.child_rel_id); } } /* Bind channel to a proper CPU */ vmbus_channel_cpu_set(chan, (sc->hn_cpu + idx) % mp_ncpus); } static void hn_subchan_attach(struct hn_softc *sc, struct hv_vmbus_channel *chan) { KASSERT(!HV_VMBUS_CHAN_ISPRIMARY(chan), ("subchannel callback on primary channel")); KASSERT(chan->offer_msg.offer.sub_channel_index > 0, ("invalid channel subidx %u", chan->offer_msg.offer.sub_channel_index)); hn_channel_attach(sc, chan); } static void hn_tx_taskq_create(void *arg __unused) { if (!hn_share_tx_taskq) return; hn_tx_taskq = taskqueue_create("hn_tx", M_WAITOK, taskqueue_thread_enqueue, &hn_tx_taskq); if (hn_bind_tx_taskq >= 0) { int cpu = hn_bind_tx_taskq; cpuset_t cpu_set; if (cpu > mp_ncpus - 1) cpu = mp_ncpus - 1; CPU_SETOF(cpu, &cpu_set); taskqueue_start_threads_cpuset(&hn_tx_taskq, 1, PI_NET, &cpu_set, "hn tx"); } else { taskqueue_start_threads(&hn_tx_taskq, 1, PI_NET, "hn tx"); } } SYSINIT(hn_txtq_create, SI_SUB_DRIVERS, SI_ORDER_FIRST, hn_tx_taskq_create, NULL); static void hn_tx_taskq_destroy(void *arg __unused) { if (hn_tx_taskq != NULL) taskqueue_free(hn_tx_taskq); } SYSUNINIT(hn_txtq_destroy, SI_SUB_DRIVERS, SI_ORDER_FIRST, hn_tx_taskq_destroy, NULL); static device_method_t netvsc_methods[] = { /* Device interface */ DEVMETHOD(device_probe, netvsc_probe), DEVMETHOD(device_attach, netvsc_attach), DEVMETHOD(device_detach, netvsc_detach), DEVMETHOD(device_shutdown, netvsc_shutdown), { 0, 0 } }; static driver_t netvsc_driver = { NETVSC_DEVNAME, netvsc_methods, sizeof(hn_softc_t) }; static devclass_t netvsc_devclass; DRIVER_MODULE(hn, vmbus, netvsc_driver, netvsc_devclass, 0, 0); MODULE_VERSION(hn, 1); MODULE_DEPEND(hn, vmbus, 1, 1, 1); Index: projects/vnet/sys/dev/ixgbe/ix_txrx.c =================================================================== --- projects/vnet/sys/dev/ixgbe/ix_txrx.c (revision 301546) +++ projects/vnet/sys/dev/ixgbe/ix_txrx.c (revision 301547) @@ -1,2303 +1,2303 @@ /****************************************************************************** Copyright (c) 2001-2015, Intel Corporation All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ******************************************************************************/ /*$FreeBSD$*/ #ifndef IXGBE_STANDALONE_BUILD #include "opt_inet.h" #include "opt_inet6.h" #include "opt_rss.h" #endif #include "ixgbe.h" #ifdef RSS #include #include #endif #ifdef DEV_NETMAP #include #include #include extern int ix_crcstrip; #endif /* ** HW RSC control: ** this feature only works with ** IPv4, and only on 82599 and later. ** Also this will cause IP forwarding to ** fail and that can't be controlled by ** the stack as LRO can. For all these ** reasons I've deemed it best to leave ** this off and not bother with a tuneable ** interface, this would need to be compiled ** to enable. */ static bool ixgbe_rsc_enable = FALSE; #ifdef IXGBE_FDIR /* ** For Flow Director: this is the ** number of TX packets we sample ** for the filter pool, this means ** every 20th packet will be probed. ** ** This feature can be disabled by ** setting this to 0. */ static int atr_sample_rate = 20; #endif /********************************************************************* * Local Function prototypes *********************************************************************/ static void ixgbe_setup_transmit_ring(struct tx_ring *); static void ixgbe_free_transmit_buffers(struct tx_ring *); static int ixgbe_setup_receive_ring(struct rx_ring *); static void ixgbe_free_receive_buffers(struct rx_ring *); static void ixgbe_rx_checksum(u32, struct mbuf *, u32); static void ixgbe_refresh_mbufs(struct rx_ring *, int); static int ixgbe_xmit(struct tx_ring *, struct mbuf **); static int ixgbe_tx_ctx_setup(struct tx_ring *, struct mbuf *, u32 *, u32 *); static int ixgbe_tso_setup(struct tx_ring *, struct mbuf *, u32 *, u32 *); #ifdef IXGBE_FDIR static void ixgbe_atr(struct tx_ring *, struct mbuf *); #endif static __inline void ixgbe_rx_discard(struct rx_ring *, int); static __inline void ixgbe_rx_input(struct rx_ring *, struct ifnet *, struct mbuf *, u32); #ifdef IXGBE_LEGACY_TX /********************************************************************* * Transmit entry point * * ixgbe_start is called by the stack to initiate a transmit. * The driver will remain in this routine as long as there are * packets to transmit and transmit resources are available. * In case resources are not available stack is notified and * the packet is requeued. **********************************************************************/ void ixgbe_start_locked(struct tx_ring *txr, struct ifnet * ifp) { struct mbuf *m_head; struct adapter *adapter = txr->adapter; IXGBE_TX_LOCK_ASSERT(txr); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) return; if (!adapter->link_active) return; while (!IFQ_DRV_IS_EMPTY(&ifp->if_snd)) { if (txr->tx_avail <= IXGBE_QUEUE_MIN_FREE) break; IFQ_DRV_DEQUEUE(&ifp->if_snd, m_head); if (m_head == NULL) break; if (ixgbe_xmit(txr, &m_head)) { if (m_head != NULL) IFQ_DRV_PREPEND(&ifp->if_snd, m_head); break; } /* Send a copy of the frame to the BPF listener */ ETHER_BPF_MTAP(ifp, m_head); } return; } /* * Legacy TX start - called by the stack, this * always uses the first tx ring, and should * not be used with multiqueue tx enabled. */ void ixgbe_start(struct ifnet *ifp) { struct adapter *adapter = ifp->if_softc; struct tx_ring *txr = adapter->tx_rings; if (ifp->if_drv_flags & IFF_DRV_RUNNING) { IXGBE_TX_LOCK(txr); ixgbe_start_locked(txr, ifp); IXGBE_TX_UNLOCK(txr); } return; } #else /* ! IXGBE_LEGACY_TX */ /* ** Multiqueue Transmit Entry Point ** (if_transmit function) */ int ixgbe_mq_start(struct ifnet *ifp, struct mbuf *m) { struct adapter *adapter = ifp->if_softc; struct ix_queue *que; struct tx_ring *txr; int i, err = 0; #ifdef RSS uint32_t bucket_id; #endif /* * When doing RSS, map it to the same outbound queue * as the incoming flow would be mapped to. * * If everything is setup correctly, it should be the * same bucket that the current CPU we're on is. */ if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) { #ifdef RSS if (rss_hash2bucket(m->m_pkthdr.flowid, M_HASHTYPE_GET(m), &bucket_id) == 0) { i = bucket_id % adapter->num_queues; #ifdef IXGBE_DEBUG if (bucket_id > adapter->num_queues) if_printf(ifp, "bucket_id (%d) > num_queues " "(%d)\n", bucket_id, adapter->num_queues); #endif } else #endif i = m->m_pkthdr.flowid % adapter->num_queues; } else i = curcpu % adapter->num_queues; /* Check for a hung queue and pick alternative */ if (((1 << i) & adapter->active_queues) == 0) i = ffsl(adapter->active_queues); txr = &adapter->tx_rings[i]; que = &adapter->queues[i]; err = drbr_enqueue(ifp, txr->br, m); if (err) return (err); if (IXGBE_TX_TRYLOCK(txr)) { ixgbe_mq_start_locked(ifp, txr); IXGBE_TX_UNLOCK(txr); } else taskqueue_enqueue(que->tq, &txr->txq_task); return (0); } int ixgbe_mq_start_locked(struct ifnet *ifp, struct tx_ring *txr) { struct adapter *adapter = txr->adapter; struct mbuf *next; int enqueued = 0, err = 0; if (((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) || adapter->link_active == 0) return (ENETDOWN); /* Process the queue */ #if __FreeBSD_version < 901504 next = drbr_dequeue(ifp, txr->br); while (next != NULL) { if ((err = ixgbe_xmit(txr, &next)) != 0) { if (next != NULL) err = drbr_enqueue(ifp, txr->br, next); #else while ((next = drbr_peek(ifp, txr->br)) != NULL) { if ((err = ixgbe_xmit(txr, &next)) != 0) { if (next == NULL) { drbr_advance(ifp, txr->br); } else { drbr_putback(ifp, txr->br, next); } #endif break; } #if __FreeBSD_version >= 901504 drbr_advance(ifp, txr->br); #endif enqueued++; #if 0 // this is VF-only #if __FreeBSD_version >= 1100036 /* * Since we're looking at the tx ring, we can check * to see if we're a VF by examing our tail register * address. */ if (txr->tail < IXGBE_TDT(0) && next->m_flags & M_MCAST) if_inc_counter(ifp, IFCOUNTER_OMCASTS, 1); #endif #endif /* Send a copy of the frame to the BPF listener */ ETHER_BPF_MTAP(ifp, next); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) break; #if __FreeBSD_version < 901504 next = drbr_dequeue(ifp, txr->br); #endif } if (txr->tx_avail < IXGBE_TX_CLEANUP_THRESHOLD) ixgbe_txeof(txr); return (err); } /* * Called from a taskqueue to drain queued transmit packets. */ void ixgbe_deferred_mq_start(void *arg, int pending) { struct tx_ring *txr = arg; struct adapter *adapter = txr->adapter; struct ifnet *ifp = adapter->ifp; IXGBE_TX_LOCK(txr); if (!drbr_empty(ifp, txr->br)) ixgbe_mq_start_locked(ifp, txr); IXGBE_TX_UNLOCK(txr); } /* * Flush all ring buffers */ void ixgbe_qflush(struct ifnet *ifp) { struct adapter *adapter = ifp->if_softc; struct tx_ring *txr = adapter->tx_rings; struct mbuf *m; for (int i = 0; i < adapter->num_queues; i++, txr++) { IXGBE_TX_LOCK(txr); while ((m = buf_ring_dequeue_sc(txr->br)) != NULL) m_freem(m); IXGBE_TX_UNLOCK(txr); } if_qflush(ifp); } #endif /* IXGBE_LEGACY_TX */ /********************************************************************* * * This routine maps the mbufs to tx descriptors, allowing the * TX engine to transmit the packets. * - return 0 on success, positive on failure * **********************************************************************/ static int ixgbe_xmit(struct tx_ring *txr, struct mbuf **m_headp) { struct adapter *adapter = txr->adapter; u32 olinfo_status = 0, cmd_type_len; int i, j, error, nsegs; int first; bool remap = TRUE; struct mbuf *m_head; bus_dma_segment_t segs[adapter->num_segs]; bus_dmamap_t map; struct ixgbe_tx_buf *txbuf; union ixgbe_adv_tx_desc *txd = NULL; m_head = *m_headp; /* Basic descriptor defines */ cmd_type_len = (IXGBE_ADVTXD_DTYP_DATA | IXGBE_ADVTXD_DCMD_IFCS | IXGBE_ADVTXD_DCMD_DEXT); if (m_head->m_flags & M_VLANTAG) cmd_type_len |= IXGBE_ADVTXD_DCMD_VLE; /* * Important to capture the first descriptor * used because it will contain the index of * the one we tell the hardware to report back */ first = txr->next_avail_desc; txbuf = &txr->tx_buffers[first]; map = txbuf->map; /* * Map the packet for DMA. */ retry: error = bus_dmamap_load_mbuf_sg(txr->txtag, map, *m_headp, segs, &nsegs, BUS_DMA_NOWAIT); if (__predict_false(error)) { struct mbuf *m; switch (error) { case EFBIG: /* Try it again? - one try */ if (remap == TRUE) { remap = FALSE; /* * XXX: m_defrag will choke on * non-MCLBYTES-sized clusters */ m = m_defrag(*m_headp, M_NOWAIT); if (m == NULL) { adapter->mbuf_defrag_failed++; m_freem(*m_headp); *m_headp = NULL; return (ENOBUFS); } *m_headp = m; goto retry; } else return (error); case ENOMEM: txr->no_tx_dma_setup++; return (error); default: txr->no_tx_dma_setup++; m_freem(*m_headp); *m_headp = NULL; return (error); } } /* Make certain there are enough descriptors */ if (txr->tx_avail < (nsegs + 2)) { txr->no_desc_avail++; bus_dmamap_unload(txr->txtag, map); return (ENOBUFS); } m_head = *m_headp; /* * Set up the appropriate offload context * this will consume the first descriptor */ error = ixgbe_tx_ctx_setup(txr, m_head, &cmd_type_len, &olinfo_status); if (__predict_false(error)) { if (error == ENOBUFS) *m_headp = NULL; return (error); } #ifdef IXGBE_FDIR /* Do the flow director magic */ if ((txr->atr_sample) && (!adapter->fdir_reinit)) { ++txr->atr_count; if (txr->atr_count >= atr_sample_rate) { ixgbe_atr(txr, m_head); txr->atr_count = 0; } } #endif olinfo_status |= IXGBE_ADVTXD_CC; i = txr->next_avail_desc; for (j = 0; j < nsegs; j++) { bus_size_t seglen; bus_addr_t segaddr; txbuf = &txr->tx_buffers[i]; txd = &txr->tx_base[i]; seglen = segs[j].ds_len; segaddr = htole64(segs[j].ds_addr); txd->read.buffer_addr = segaddr; txd->read.cmd_type_len = htole32(txr->txd_cmd | cmd_type_len |seglen); txd->read.olinfo_status = htole32(olinfo_status); if (++i == txr->num_desc) i = 0; } txd->read.cmd_type_len |= htole32(IXGBE_TXD_CMD_EOP | IXGBE_TXD_CMD_RS); txr->tx_avail -= nsegs; txr->next_avail_desc = i; txbuf->m_head = m_head; /* * Here we swap the map so the last descriptor, * which gets the completion interrupt has the * real map, and the first descriptor gets the * unused map from this descriptor. */ txr->tx_buffers[first].map = txbuf->map; txbuf->map = map; bus_dmamap_sync(txr->txtag, map, BUS_DMASYNC_PREWRITE); /* Set the EOP descriptor that will be marked done */ txbuf = &txr->tx_buffers[first]; txbuf->eop = txd; bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* * Advance the Transmit Descriptor Tail (Tdt), this tells the * hardware that this frame is available to transmit. */ ++txr->total_packets; IXGBE_WRITE_REG(&adapter->hw, txr->tail, i); /* Mark queue as having work */ if (txr->busy == 0) txr->busy = 1; return (0); } /********************************************************************* * * Allocate memory for tx_buffer structures. The tx_buffer stores all * the information needed to transmit a packet on the wire. This is * called only once at attach, setup is done every reset. * **********************************************************************/ int ixgbe_allocate_transmit_buffers(struct tx_ring *txr) { struct adapter *adapter = txr->adapter; device_t dev = adapter->dev; struct ixgbe_tx_buf *txbuf; int error, i; /* * Setup DMA descriptor areas. */ if ((error = bus_dma_tag_create( bus_get_dma_tag(adapter->dev), /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ IXGBE_TSO_SIZE, /* maxsize */ adapter->num_segs, /* nsegments */ PAGE_SIZE, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txr->txtag))) { device_printf(dev,"Unable to allocate TX DMA tag\n"); goto fail; } if (!(txr->tx_buffers = (struct ixgbe_tx_buf *) malloc(sizeof(struct ixgbe_tx_buf) * adapter->num_tx_desc, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); error = ENOMEM; goto fail; } /* Create the descriptor buffer dma maps */ txbuf = txr->tx_buffers; for (i = 0; i < adapter->num_tx_desc; i++, txbuf++) { error = bus_dmamap_create(txr->txtag, 0, &txbuf->map); if (error != 0) { device_printf(dev, "Unable to create TX DMA map\n"); goto fail; } } return 0; fail: /* We free all, it handles case where we are in the middle */ ixgbe_free_transmit_structures(adapter); return (error); } /********************************************************************* * * Initialize a transmit ring. * **********************************************************************/ static void ixgbe_setup_transmit_ring(struct tx_ring *txr) { struct adapter *adapter = txr->adapter; struct ixgbe_tx_buf *txbuf; #ifdef DEV_NETMAP struct netmap_adapter *na = NA(adapter->ifp); struct netmap_slot *slot; #endif /* DEV_NETMAP */ /* Clear the old ring contents */ IXGBE_TX_LOCK(txr); #ifdef DEV_NETMAP /* * (under lock): if in netmap mode, do some consistency * checks and set slot to entry 0 of the netmap ring. */ slot = netmap_reset(na, NR_TX, txr->me, 0); #endif /* DEV_NETMAP */ bzero((void *)txr->tx_base, (sizeof(union ixgbe_adv_tx_desc)) * adapter->num_tx_desc); /* Reset indices */ txr->next_avail_desc = 0; txr->next_to_clean = 0; /* Free any existing tx buffers. */ txbuf = txr->tx_buffers; for (int i = 0; i < txr->num_desc; i++, txbuf++) { if (txbuf->m_head != NULL) { bus_dmamap_sync(txr->txtag, txbuf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->txtag, txbuf->map); m_freem(txbuf->m_head); txbuf->m_head = NULL; } #ifdef DEV_NETMAP /* * In netmap mode, set the map for the packet buffer. * NOTE: Some drivers (not this one) also need to set * the physical buffer address in the NIC ring. * Slots in the netmap ring (indexed by "si") are * kring->nkr_hwofs positions "ahead" wrt the * corresponding slot in the NIC ring. In some drivers * (not here) nkr_hwofs can be negative. Function * netmap_idx_n2k() handles wraparounds properly. */ if (slot) { int si = netmap_idx_n2k(&na->tx_rings[txr->me], i); netmap_load_map(na, txr->txtag, txbuf->map, NMB(na, slot + si)); } #endif /* DEV_NETMAP */ /* Clear the EOP descriptor pointer */ txbuf->eop = NULL; } #ifdef IXGBE_FDIR /* Set the rate at which we sample packets */ if (adapter->hw.mac.type != ixgbe_mac_82598EB) txr->atr_sample = atr_sample_rate; #endif /* Set number of descriptors available */ txr->tx_avail = adapter->num_tx_desc; bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); IXGBE_TX_UNLOCK(txr); } /********************************************************************* * * Initialize all transmit rings. * **********************************************************************/ int ixgbe_setup_transmit_structures(struct adapter *adapter) { struct tx_ring *txr = adapter->tx_rings; for (int i = 0; i < adapter->num_queues; i++, txr++) ixgbe_setup_transmit_ring(txr); return (0); } /********************************************************************* * * Free all transmit rings. * **********************************************************************/ void ixgbe_free_transmit_structures(struct adapter *adapter) { struct tx_ring *txr = adapter->tx_rings; for (int i = 0; i < adapter->num_queues; i++, txr++) { IXGBE_TX_LOCK(txr); ixgbe_free_transmit_buffers(txr); ixgbe_dma_free(adapter, &txr->txdma); IXGBE_TX_UNLOCK(txr); IXGBE_TX_LOCK_DESTROY(txr); } free(adapter->tx_rings, M_DEVBUF); } /********************************************************************* * * Free transmit ring related data structures. * **********************************************************************/ static void ixgbe_free_transmit_buffers(struct tx_ring *txr) { struct adapter *adapter = txr->adapter; struct ixgbe_tx_buf *tx_buffer; int i; INIT_DEBUGOUT("ixgbe_free_transmit_ring: begin"); if (txr->tx_buffers == NULL) return; tx_buffer = txr->tx_buffers; for (i = 0; i < adapter->num_tx_desc; i++, tx_buffer++) { if (tx_buffer->m_head != NULL) { bus_dmamap_sync(txr->txtag, tx_buffer->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->txtag, tx_buffer->map); m_freem(tx_buffer->m_head); tx_buffer->m_head = NULL; if (tx_buffer->map != NULL) { bus_dmamap_destroy(txr->txtag, tx_buffer->map); tx_buffer->map = NULL; } } else if (tx_buffer->map != NULL) { bus_dmamap_unload(txr->txtag, tx_buffer->map); bus_dmamap_destroy(txr->txtag, tx_buffer->map); tx_buffer->map = NULL; } } #ifdef IXGBE_LEGACY_TX if (txr->br != NULL) buf_ring_free(txr->br, M_DEVBUF); #endif if (txr->tx_buffers != NULL) { free(txr->tx_buffers, M_DEVBUF); txr->tx_buffers = NULL; } if (txr->txtag != NULL) { bus_dma_tag_destroy(txr->txtag); txr->txtag = NULL; } return; } /********************************************************************* * * Advanced Context Descriptor setup for VLAN, CSUM or TSO * **********************************************************************/ static int ixgbe_tx_ctx_setup(struct tx_ring *txr, struct mbuf *mp, u32 *cmd_type_len, u32 *olinfo_status) { struct adapter *adapter = txr->adapter; struct ixgbe_adv_tx_context_desc *TXD; struct ether_vlan_header *eh; #ifdef INET struct ip *ip; #endif #ifdef INET6 struct ip6_hdr *ip6; #endif u32 vlan_macip_lens = 0, type_tucmd_mlhl = 0; int ehdrlen, ip_hlen = 0; u16 etype; u8 ipproto = 0; int offload = TRUE; int ctxd = txr->next_avail_desc; u16 vtag = 0; caddr_t l3d; /* First check if TSO is to be used */ if (mp->m_pkthdr.csum_flags & (CSUM_IP_TSO|CSUM_IP6_TSO)) return (ixgbe_tso_setup(txr, mp, cmd_type_len, olinfo_status)); if ((mp->m_pkthdr.csum_flags & CSUM_OFFLOAD) == 0) offload = FALSE; /* Indicate the whole packet as payload when not doing TSO */ *olinfo_status |= mp->m_pkthdr.len << IXGBE_ADVTXD_PAYLEN_SHIFT; /* Now ready a context descriptor */ TXD = (struct ixgbe_adv_tx_context_desc *) &txr->tx_base[ctxd]; /* ** In advanced descriptors the vlan tag must ** be placed into the context descriptor. Hence ** we need to make one even if not doing offloads. */ if (mp->m_flags & M_VLANTAG) { vtag = htole16(mp->m_pkthdr.ether_vtag); vlan_macip_lens |= (vtag << IXGBE_ADVTXD_VLAN_SHIFT); } else if (!IXGBE_IS_X550VF(adapter) && (offload == FALSE)) return (0); /* * Determine where frame payload starts. * Jump over vlan headers if already present, * helpful for QinQ too. */ eh = mtod(mp, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { etype = ntohs(eh->evl_proto); ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; } else { etype = ntohs(eh->evl_encap_proto); ehdrlen = ETHER_HDR_LEN; } /* Set the ether header length */ vlan_macip_lens |= ehdrlen << IXGBE_ADVTXD_MACLEN_SHIFT; if (offload == FALSE) goto no_offloads; /* * If the first mbuf only includes the ethernet header, jump to the next one * XXX: This assumes the stack splits mbufs containing headers on header boundaries * XXX: And assumes the entire IP header is contained in one mbuf */ if (mp->m_len == ehdrlen && mp->m_next) l3d = mtod(mp->m_next, caddr_t); else l3d = mtod(mp, caddr_t) + ehdrlen; switch (etype) { #ifdef INET case ETHERTYPE_IP: ip = (struct ip *)(l3d); ip_hlen = ip->ip_hl << 2; ipproto = ip->ip_p; type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4; /* Insert IPv4 checksum into data descriptors */ if (mp->m_pkthdr.csum_flags & CSUM_IP) { ip->ip_sum = 0; *olinfo_status |= IXGBE_TXD_POPTS_IXSM << 8; } break; #endif #ifdef INET6 case ETHERTYPE_IPV6: ip6 = (struct ip6_hdr *)(l3d); ip_hlen = sizeof(struct ip6_hdr); ipproto = ip6->ip6_nxt; type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6; break; #endif default: offload = FALSE; break; } vlan_macip_lens |= ip_hlen; /* No support for offloads for non-L4 next headers */ switch (ipproto) { case IPPROTO_TCP: if (mp->m_pkthdr.csum_flags & (CSUM_IP_TCP | CSUM_IP6_TCP)) type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP; else offload = false; break; case IPPROTO_UDP: if (mp->m_pkthdr.csum_flags & (CSUM_IP_UDP | CSUM_IP6_UDP)) type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_UDP; else offload = false; break; case IPPROTO_SCTP: if (mp->m_pkthdr.csum_flags & (CSUM_IP_SCTP | CSUM_IP6_SCTP)) type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_SCTP; else offload = false; break; default: offload = false; break; } if (offload) /* Insert L4 checksum into data descriptors */ *olinfo_status |= IXGBE_TXD_POPTS_TXSM << 8; no_offloads: type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT; /* Now copy bits into descriptor */ TXD->vlan_macip_lens = htole32(vlan_macip_lens); TXD->type_tucmd_mlhl = htole32(type_tucmd_mlhl); TXD->seqnum_seed = htole32(0); TXD->mss_l4len_idx = htole32(0); /* We've consumed the first desc, adjust counters */ if (++ctxd == txr->num_desc) ctxd = 0; txr->next_avail_desc = ctxd; --txr->tx_avail; return (0); } /********************************************************************** * * Setup work for hardware segmentation offload (TSO) on * adapters using advanced tx descriptors * **********************************************************************/ static int ixgbe_tso_setup(struct tx_ring *txr, struct mbuf *mp, u32 *cmd_type_len, u32 *olinfo_status) { struct ixgbe_adv_tx_context_desc *TXD; u32 vlan_macip_lens = 0, type_tucmd_mlhl = 0; u32 mss_l4len_idx = 0, paylen; u16 vtag = 0, eh_type; int ctxd, ehdrlen, ip_hlen, tcp_hlen; struct ether_vlan_header *eh; #ifdef INET6 struct ip6_hdr *ip6; #endif #ifdef INET struct ip *ip; #endif struct tcphdr *th; /* * Determine where frame payload starts. * Jump over vlan headers if already present */ eh = mtod(mp, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; eh_type = eh->evl_proto; } else { ehdrlen = ETHER_HDR_LEN; eh_type = eh->evl_encap_proto; } switch (ntohs(eh_type)) { #ifdef INET6 case ETHERTYPE_IPV6: ip6 = (struct ip6_hdr *)(mp->m_data + ehdrlen); /* XXX-BZ For now we do not pretend to support ext. hdrs. */ if (ip6->ip6_nxt != IPPROTO_TCP) return (ENXIO); ip_hlen = sizeof(struct ip6_hdr); ip6 = (struct ip6_hdr *)(mp->m_data + ehdrlen); th = (struct tcphdr *)((caddr_t)ip6 + ip_hlen); th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0); type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV6; break; #endif #ifdef INET case ETHERTYPE_IP: ip = (struct ip *)(mp->m_data + ehdrlen); if (ip->ip_p != IPPROTO_TCP) return (ENXIO); ip->ip_sum = 0; ip_hlen = ip->ip_hl << 2; th = (struct tcphdr *)((caddr_t)ip + ip_hlen); th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_IPV4; /* Tell transmit desc to also do IPv4 checksum. */ *olinfo_status |= IXGBE_TXD_POPTS_IXSM << 8; break; #endif default: panic("%s: CSUM_TSO but no supported IP version (0x%04x)", __func__, ntohs(eh_type)); break; } ctxd = txr->next_avail_desc; TXD = (struct ixgbe_adv_tx_context_desc *) &txr->tx_base[ctxd]; tcp_hlen = th->th_off << 2; /* This is used in the transmit desc in encap */ paylen = mp->m_pkthdr.len - ehdrlen - ip_hlen - tcp_hlen; /* VLAN MACLEN IPLEN */ if (mp->m_flags & M_VLANTAG) { vtag = htole16(mp->m_pkthdr.ether_vtag); vlan_macip_lens |= (vtag << IXGBE_ADVTXD_VLAN_SHIFT); } vlan_macip_lens |= ehdrlen << IXGBE_ADVTXD_MACLEN_SHIFT; vlan_macip_lens |= ip_hlen; TXD->vlan_macip_lens = htole32(vlan_macip_lens); /* ADV DTYPE TUCMD */ type_tucmd_mlhl |= IXGBE_ADVTXD_DCMD_DEXT | IXGBE_ADVTXD_DTYP_CTXT; type_tucmd_mlhl |= IXGBE_ADVTXD_TUCMD_L4T_TCP; TXD->type_tucmd_mlhl = htole32(type_tucmd_mlhl); /* MSS L4LEN IDX */ mss_l4len_idx |= (mp->m_pkthdr.tso_segsz << IXGBE_ADVTXD_MSS_SHIFT); mss_l4len_idx |= (tcp_hlen << IXGBE_ADVTXD_L4LEN_SHIFT); TXD->mss_l4len_idx = htole32(mss_l4len_idx); TXD->seqnum_seed = htole32(0); if (++ctxd == txr->num_desc) ctxd = 0; txr->tx_avail--; txr->next_avail_desc = ctxd; *cmd_type_len |= IXGBE_ADVTXD_DCMD_TSE; *olinfo_status |= IXGBE_TXD_POPTS_TXSM << 8; *olinfo_status |= paylen << IXGBE_ADVTXD_PAYLEN_SHIFT; ++txr->tso_tx; return (0); } /********************************************************************** * * Examine each tx_buffer in the used queue. If the hardware is done * processing the packet then free associated resources. The * tx_buffer is put back on the free queue. * **********************************************************************/ void ixgbe_txeof(struct tx_ring *txr) { struct adapter *adapter = txr->adapter; #ifdef DEV_NETMAP struct ifnet *ifp = adapter->ifp; #endif u32 work, processed = 0; u32 limit = adapter->tx_process_limit; struct ixgbe_tx_buf *buf; union ixgbe_adv_tx_desc *txd; mtx_assert(&txr->tx_mtx, MA_OWNED); #ifdef DEV_NETMAP if (ifp->if_capenable & IFCAP_NETMAP) { struct netmap_adapter *na = NA(ifp); struct netmap_kring *kring = &na->tx_rings[txr->me]; txd = txr->tx_base; bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_POSTREAD); /* * In netmap mode, all the work is done in the context * of the client thread. Interrupt handlers only wake up * clients, which may be sleeping on individual rings * or on a global resource for all rings. * To implement tx interrupt mitigation, we wake up the client * thread roughly every half ring, even if the NIC interrupts * more frequently. This is implemented as follows: * - ixgbe_txsync() sets kring->nr_kflags with the index of * the slot that should wake up the thread (nkr_num_slots * means the user thread should not be woken up); * - the driver ignores tx interrupts unless netmap_mitigate=0 * or the slot has the DD bit set. */ if (!netmap_mitigate || (kring->nr_kflags < kring->nkr_num_slots && txd[kring->nr_kflags].wb.status & IXGBE_TXD_STAT_DD)) { netmap_tx_irq(ifp, txr->me); } return; } #endif /* DEV_NETMAP */ if (txr->tx_avail == txr->num_desc) { txr->busy = 0; return; } /* Get work starting point */ work = txr->next_to_clean; buf = &txr->tx_buffers[work]; txd = &txr->tx_base[work]; work -= txr->num_desc; /* The distance to ring end */ bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_POSTREAD); do { union ixgbe_adv_tx_desc *eop = buf->eop; if (eop == NULL) /* No work */ break; if ((eop->wb.status & IXGBE_TXD_STAT_DD) == 0) break; /* I/O not complete */ if (buf->m_head) { txr->bytes += buf->m_head->m_pkthdr.len; bus_dmamap_sync(txr->txtag, buf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->txtag, buf->map); m_freem(buf->m_head); buf->m_head = NULL; } buf->eop = NULL; ++txr->tx_avail; /* We clean the range if multi segment */ while (txd != eop) { ++txd; ++buf; ++work; /* wrap the ring? */ if (__predict_false(!work)) { work -= txr->num_desc; buf = txr->tx_buffers; txd = txr->tx_base; } if (buf->m_head) { txr->bytes += buf->m_head->m_pkthdr.len; bus_dmamap_sync(txr->txtag, buf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(txr->txtag, buf->map); m_freem(buf->m_head); buf->m_head = NULL; } ++txr->tx_avail; buf->eop = NULL; } ++txr->packets; ++processed; /* Try the next packet */ ++txd; ++buf; ++work; /* reset with a wrap */ if (__predict_false(!work)) { work -= txr->num_desc; buf = txr->tx_buffers; txd = txr->tx_base; } prefetch(txd); } while (__predict_true(--limit)); bus_dmamap_sync(txr->txdma.dma_tag, txr->txdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); work += txr->num_desc; txr->next_to_clean = work; /* ** Queue Hang detection, we know there's ** work outstanding or the first return ** would have been taken, so increment busy ** if nothing managed to get cleaned, then ** in local_timer it will be checked and ** marked as HUNG if it exceeds a MAX attempt. */ if ((processed == 0) && (txr->busy != IXGBE_QUEUE_HUNG)) ++txr->busy; /* ** If anything gets cleaned we reset state to 1, ** note this will turn off HUNG if its set. */ if (processed) txr->busy = 1; if (txr->tx_avail == txr->num_desc) txr->busy = 0; return; } #ifdef IXGBE_FDIR /* ** This routine parses packet headers so that Flow ** Director can make a hashed filter table entry ** allowing traffic flows to be identified and kept ** on the same cpu. This would be a performance ** hit, but we only do it at IXGBE_FDIR_RATE of ** packets. */ static void ixgbe_atr(struct tx_ring *txr, struct mbuf *mp) { struct adapter *adapter = txr->adapter; struct ix_queue *que; struct ip *ip; struct tcphdr *th; struct udphdr *uh; struct ether_vlan_header *eh; union ixgbe_atr_hash_dword input = {.dword = 0}; union ixgbe_atr_hash_dword common = {.dword = 0}; int ehdrlen, ip_hlen; u16 etype; eh = mtod(mp, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { ehdrlen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; etype = eh->evl_proto; } else { ehdrlen = ETHER_HDR_LEN; etype = eh->evl_encap_proto; } /* Only handling IPv4 */ if (etype != htons(ETHERTYPE_IP)) return; ip = (struct ip *)(mp->m_data + ehdrlen); ip_hlen = ip->ip_hl << 2; /* check if we're UDP or TCP */ switch (ip->ip_p) { case IPPROTO_TCP: th = (struct tcphdr *)((caddr_t)ip + ip_hlen); /* src and dst are inverted */ common.port.dst ^= th->th_sport; common.port.src ^= th->th_dport; input.formatted.flow_type ^= IXGBE_ATR_FLOW_TYPE_TCPV4; break; case IPPROTO_UDP: uh = (struct udphdr *)((caddr_t)ip + ip_hlen); /* src and dst are inverted */ common.port.dst ^= uh->uh_sport; common.port.src ^= uh->uh_dport; input.formatted.flow_type ^= IXGBE_ATR_FLOW_TYPE_UDPV4; break; default: return; } input.formatted.vlan_id = htobe16(mp->m_pkthdr.ether_vtag); if (mp->m_pkthdr.ether_vtag) common.flex_bytes ^= htons(ETHERTYPE_VLAN); else common.flex_bytes ^= etype; common.ip ^= ip->ip_src.s_addr ^ ip->ip_dst.s_addr; que = &adapter->queues[txr->me]; /* ** This assumes the Rx queue and Tx ** queue are bound to the same CPU */ ixgbe_fdir_add_signature_filter_82599(&adapter->hw, input, common, que->msix); } #endif /* IXGBE_FDIR */ /* ** Used to detect a descriptor that has ** been merged by Hardware RSC. */ static inline u32 ixgbe_rsc_count(union ixgbe_adv_rx_desc *rx) { return (le32toh(rx->wb.lower.lo_dword.data) & IXGBE_RXDADV_RSCCNT_MASK) >> IXGBE_RXDADV_RSCCNT_SHIFT; } /********************************************************************* * * Initialize Hardware RSC (LRO) feature on 82599 * for an RX ring, this is toggled by the LRO capability * even though it is transparent to the stack. * * NOTE: since this HW feature only works with IPV4 and * our testing has shown soft LRO to be as effective * I have decided to disable this by default. * **********************************************************************/ static void ixgbe_setup_hw_rsc(struct rx_ring *rxr) { struct adapter *adapter = rxr->adapter; struct ixgbe_hw *hw = &adapter->hw; u32 rscctrl, rdrxctl; /* If turning LRO/RSC off we need to disable it */ if ((adapter->ifp->if_capenable & IFCAP_LRO) == 0) { rscctrl = IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxr->me)); rscctrl &= ~IXGBE_RSCCTL_RSCEN; return; } rdrxctl = IXGBE_READ_REG(hw, IXGBE_RDRXCTL); rdrxctl &= ~IXGBE_RDRXCTL_RSCFRSTSIZE; #ifdef DEV_NETMAP /* crcstrip is optional in netmap */ if (adapter->ifp->if_capenable & IFCAP_NETMAP && !ix_crcstrip) #endif /* DEV_NETMAP */ rdrxctl |= IXGBE_RDRXCTL_CRCSTRIP; rdrxctl |= IXGBE_RDRXCTL_RSCACKC; IXGBE_WRITE_REG(hw, IXGBE_RDRXCTL, rdrxctl); rscctrl = IXGBE_READ_REG(hw, IXGBE_RSCCTL(rxr->me)); rscctrl |= IXGBE_RSCCTL_RSCEN; /* ** Limit the total number of descriptors that ** can be combined, so it does not exceed 64K */ if (rxr->mbuf_sz == MCLBYTES) rscctrl |= IXGBE_RSCCTL_MAXDESC_16; else if (rxr->mbuf_sz == MJUMPAGESIZE) rscctrl |= IXGBE_RSCCTL_MAXDESC_8; else if (rxr->mbuf_sz == MJUM9BYTES) rscctrl |= IXGBE_RSCCTL_MAXDESC_4; else /* Using 16K cluster */ rscctrl |= IXGBE_RSCCTL_MAXDESC_1; IXGBE_WRITE_REG(hw, IXGBE_RSCCTL(rxr->me), rscctrl); /* Enable TCP header recognition */ IXGBE_WRITE_REG(hw, IXGBE_PSRTYPE(0), (IXGBE_READ_REG(hw, IXGBE_PSRTYPE(0)) | IXGBE_PSRTYPE_TCPHDR)); /* Disable RSC for ACK packets */ IXGBE_WRITE_REG(hw, IXGBE_RSCDBU, (IXGBE_RSCDBU_RSCACKDIS | IXGBE_READ_REG(hw, IXGBE_RSCDBU))); rxr->hw_rsc = TRUE; } /********************************************************************* * * Refresh mbuf buffers for RX descriptor rings * - now keeps its own state so discards due to resource * exhaustion are unnecessary, if an mbuf cannot be obtained * it just returns, keeping its placeholder, thus it can simply * be recalled to try again. * **********************************************************************/ static void ixgbe_refresh_mbufs(struct rx_ring *rxr, int limit) { struct adapter *adapter = rxr->adapter; bus_dma_segment_t seg[1]; struct ixgbe_rx_buf *rxbuf; struct mbuf *mp; int i, j, nsegs, error; bool refreshed = FALSE; i = j = rxr->next_to_refresh; /* Control the loop with one beyond */ if (++j == rxr->num_desc) j = 0; while (j != limit) { rxbuf = &rxr->rx_buffers[i]; if (rxbuf->buf == NULL) { mp = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rxr->mbuf_sz); if (mp == NULL) goto update; if (adapter->max_frame_size <= (MCLBYTES - ETHER_ALIGN)) m_adj(mp, ETHER_ALIGN); } else mp = rxbuf->buf; mp->m_pkthdr.len = mp->m_len = rxr->mbuf_sz; /* If we're dealing with an mbuf that was copied rather * than replaced, there's no need to go through busdma. */ if ((rxbuf->flags & IXGBE_RX_COPY) == 0) { /* Get the memory mapping */ bus_dmamap_unload(rxr->ptag, rxbuf->pmap); error = bus_dmamap_load_mbuf_sg(rxr->ptag, rxbuf->pmap, mp, seg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) { printf("Refresh mbufs: payload dmamap load" " failure - %d\n", error); m_free(mp); rxbuf->buf = NULL; goto update; } rxbuf->buf = mp; bus_dmamap_sync(rxr->ptag, rxbuf->pmap, BUS_DMASYNC_PREREAD); rxbuf->addr = rxr->rx_base[i].read.pkt_addr = htole64(seg[0].ds_addr); } else { rxr->rx_base[i].read.pkt_addr = rxbuf->addr; rxbuf->flags &= ~IXGBE_RX_COPY; } refreshed = TRUE; /* Next is precalculated */ i = j; rxr->next_to_refresh = i; if (++j == rxr->num_desc) j = 0; } update: if (refreshed) /* Update hardware tail index */ IXGBE_WRITE_REG(&adapter->hw, rxr->tail, rxr->next_to_refresh); return; } /********************************************************************* * * Allocate memory for rx_buffer structures. Since we use one * rx_buffer per received packet, the maximum number of rx_buffer's * that we'll need is equal to the number of receive descriptors * that we've allocated. * **********************************************************************/ int ixgbe_allocate_receive_buffers(struct rx_ring *rxr) { struct adapter *adapter = rxr->adapter; device_t dev = adapter->dev; struct ixgbe_rx_buf *rxbuf; int bsize, error; bsize = sizeof(struct ixgbe_rx_buf) * rxr->num_desc; if (!(rxr->rx_buffers = (struct ixgbe_rx_buf *) malloc(bsize, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate rx_buffer memory\n"); error = ENOMEM; goto fail; } if ((error = bus_dma_tag_create(bus_get_dma_tag(dev), /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MJUM16BYTES, /* maxsize */ 1, /* nsegments */ MJUM16BYTES, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &rxr->ptag))) { device_printf(dev, "Unable to create RX DMA tag\n"); goto fail; } for (int i = 0; i < rxr->num_desc; i++, rxbuf++) { rxbuf = &rxr->rx_buffers[i]; error = bus_dmamap_create(rxr->ptag, 0, &rxbuf->pmap); if (error) { device_printf(dev, "Unable to create RX dma map\n"); goto fail; } } return (0); fail: /* Frees all, but can handle partial completion */ ixgbe_free_receive_structures(adapter); return (error); } static void ixgbe_free_receive_ring(struct rx_ring *rxr) { struct ixgbe_rx_buf *rxbuf; for (int i = 0; i < rxr->num_desc; i++) { rxbuf = &rxr->rx_buffers[i]; if (rxbuf->buf != NULL) { bus_dmamap_sync(rxr->ptag, rxbuf->pmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->ptag, rxbuf->pmap); rxbuf->buf->m_flags |= M_PKTHDR; m_freem(rxbuf->buf); rxbuf->buf = NULL; rxbuf->flags = 0; } } } /********************************************************************* * * Initialize a receive ring and its buffers. * **********************************************************************/ static int ixgbe_setup_receive_ring(struct rx_ring *rxr) { struct adapter *adapter; struct ifnet *ifp; device_t dev; struct ixgbe_rx_buf *rxbuf; bus_dma_segment_t seg[1]; struct lro_ctrl *lro = &rxr->lro; int rsize, nsegs, error = 0; #ifdef DEV_NETMAP struct netmap_adapter *na = NA(rxr->adapter->ifp); struct netmap_slot *slot; #endif /* DEV_NETMAP */ adapter = rxr->adapter; ifp = adapter->ifp; dev = adapter->dev; /* Clear the ring contents */ IXGBE_RX_LOCK(rxr); #ifdef DEV_NETMAP /* same as in ixgbe_setup_transmit_ring() */ slot = netmap_reset(na, NR_RX, rxr->me, 0); #endif /* DEV_NETMAP */ rsize = roundup2(adapter->num_rx_desc * sizeof(union ixgbe_adv_rx_desc), DBA_ALIGN); bzero((void *)rxr->rx_base, rsize); /* Cache the size */ rxr->mbuf_sz = adapter->rx_mbuf_sz; /* Free current RX buffer structs and their mbufs */ ixgbe_free_receive_ring(rxr); /* Now replenish the mbufs */ for (int j = 0; j != rxr->num_desc; ++j) { struct mbuf *mp; rxbuf = &rxr->rx_buffers[j]; #ifdef DEV_NETMAP /* * In netmap mode, fill the map and set the buffer * address in the NIC ring, considering the offset * between the netmap and NIC rings (see comment in * ixgbe_setup_transmit_ring() ). No need to allocate * an mbuf, so end the block with a continue; */ if (slot) { int sj = netmap_idx_n2k(&na->rx_rings[rxr->me], j); uint64_t paddr; void *addr; addr = PNMB(na, slot + sj, &paddr); netmap_load_map(na, rxr->ptag, rxbuf->pmap, addr); /* Update descriptor and the cached value */ rxr->rx_base[j].read.pkt_addr = htole64(paddr); rxbuf->addr = htole64(paddr); continue; } #endif /* DEV_NETMAP */ rxbuf->flags = 0; rxbuf->buf = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, adapter->rx_mbuf_sz); if (rxbuf->buf == NULL) { error = ENOBUFS; goto fail; } mp = rxbuf->buf; mp->m_pkthdr.len = mp->m_len = rxr->mbuf_sz; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->ptag, rxbuf->pmap, mp, seg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) goto fail; bus_dmamap_sync(rxr->ptag, rxbuf->pmap, BUS_DMASYNC_PREREAD); /* Update the descriptor and the cached value */ rxr->rx_base[j].read.pkt_addr = htole64(seg[0].ds_addr); rxbuf->addr = htole64(seg[0].ds_addr); } /* Setup our descriptor indices */ rxr->next_to_check = 0; rxr->next_to_refresh = 0; rxr->lro_enabled = FALSE; rxr->rx_copies = 0; rxr->rx_bytes = 0; rxr->vtag_strip = FALSE; bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* ** Now set up the LRO interface: */ if (ixgbe_rsc_enable) ixgbe_setup_hw_rsc(rxr); else if (ifp->if_capenable & IFCAP_LRO) { int err = tcp_lro_init(lro); if (err) { device_printf(dev, "LRO Initialization failed!\n"); goto fail; } INIT_DEBUGOUT("RX Soft LRO Initialized\n"); rxr->lro_enabled = TRUE; lro->ifp = adapter->ifp; } IXGBE_RX_UNLOCK(rxr); return (0); fail: ixgbe_free_receive_ring(rxr); IXGBE_RX_UNLOCK(rxr); return (error); } /********************************************************************* * * Initialize all receive rings. * **********************************************************************/ int ixgbe_setup_receive_structures(struct adapter *adapter) { struct rx_ring *rxr = adapter->rx_rings; int j; for (j = 0; j < adapter->num_queues; j++, rxr++) if (ixgbe_setup_receive_ring(rxr)) goto fail; return (0); fail: /* * Free RX buffers allocated so far, we will only handle * the rings that completed, the failing case will have * cleaned up for itself. 'j' failed, so its the terminus. */ for (int i = 0; i < j; ++i) { rxr = &adapter->rx_rings[i]; ixgbe_free_receive_ring(rxr); } return (ENOBUFS); } /********************************************************************* * * Free all receive rings. * **********************************************************************/ void ixgbe_free_receive_structures(struct adapter *adapter) { struct rx_ring *rxr = adapter->rx_rings; INIT_DEBUGOUT("ixgbe_free_receive_structures: begin"); for (int i = 0; i < adapter->num_queues; i++, rxr++) { struct lro_ctrl *lro = &rxr->lro; ixgbe_free_receive_buffers(rxr); /* Free LRO memory */ tcp_lro_free(lro); /* Free the ring memory as well */ ixgbe_dma_free(adapter, &rxr->rxdma); } free(adapter->rx_rings, M_DEVBUF); } /********************************************************************* * * Free receive ring data structures * **********************************************************************/ void ixgbe_free_receive_buffers(struct rx_ring *rxr) { struct adapter *adapter = rxr->adapter; struct ixgbe_rx_buf *rxbuf; INIT_DEBUGOUT("ixgbe_free_receive_buffers: begin"); /* Cleanup any existing buffers */ if (rxr->rx_buffers != NULL) { for (int i = 0; i < adapter->num_rx_desc; i++) { rxbuf = &rxr->rx_buffers[i]; if (rxbuf->buf != NULL) { bus_dmamap_sync(rxr->ptag, rxbuf->pmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->ptag, rxbuf->pmap); rxbuf->buf->m_flags |= M_PKTHDR; m_freem(rxbuf->buf); } rxbuf->buf = NULL; if (rxbuf->pmap != NULL) { bus_dmamap_destroy(rxr->ptag, rxbuf->pmap); rxbuf->pmap = NULL; } } if (rxr->rx_buffers != NULL) { free(rxr->rx_buffers, M_DEVBUF); rxr->rx_buffers = NULL; } } if (rxr->ptag != NULL) { bus_dma_tag_destroy(rxr->ptag); rxr->ptag = NULL; } return; } static __inline void ixgbe_rx_input(struct rx_ring *rxr, struct ifnet *ifp, struct mbuf *m, u32 ptype) { /* * ATM LRO is only for IP/TCP packets and TCP checksum of the packet * should be computed by hardware. Also it should not have VLAN tag in * ethernet header. In case of IPv6 we do not yet support ext. hdrs. */ if (rxr->lro_enabled && (ifp->if_capenable & IFCAP_VLAN_HWTAGGING) != 0 && (ptype & IXGBE_RXDADV_PKTTYPE_ETQF) == 0 && ((ptype & (IXGBE_RXDADV_PKTTYPE_IPV4 | IXGBE_RXDADV_PKTTYPE_TCP)) == (IXGBE_RXDADV_PKTTYPE_IPV4 | IXGBE_RXDADV_PKTTYPE_TCP) || (ptype & (IXGBE_RXDADV_PKTTYPE_IPV6 | IXGBE_RXDADV_PKTTYPE_TCP)) == (IXGBE_RXDADV_PKTTYPE_IPV6 | IXGBE_RXDADV_PKTTYPE_TCP)) && (m->m_pkthdr.csum_flags & (CSUM_DATA_VALID | CSUM_PSEUDO_HDR)) == (CSUM_DATA_VALID | CSUM_PSEUDO_HDR)) { /* * Send to the stack if: ** - LRO not enabled, or ** - no LRO resources, or ** - lro enqueue fails */ if (rxr->lro.lro_cnt != 0) if (tcp_lro_rx(&rxr->lro, m, 0) == 0) return; } IXGBE_RX_UNLOCK(rxr); (*ifp->if_input)(ifp, m); IXGBE_RX_LOCK(rxr); } static __inline void ixgbe_rx_discard(struct rx_ring *rxr, int i) { struct ixgbe_rx_buf *rbuf; rbuf = &rxr->rx_buffers[i]; /* ** With advanced descriptors the writeback ** clobbers the buffer addrs, so its easier ** to just free the existing mbufs and take ** the normal refresh path to get new buffers ** and mapping. */ if (rbuf->fmp != NULL) {/* Partial chain ? */ rbuf->fmp->m_flags |= M_PKTHDR; m_freem(rbuf->fmp); rbuf->fmp = NULL; rbuf->buf = NULL; /* rbuf->buf is part of fmp's chain */ } else if (rbuf->buf) { m_free(rbuf->buf); rbuf->buf = NULL; } bus_dmamap_unload(rxr->ptag, rbuf->pmap); rbuf->flags = 0; return; } /********************************************************************* * * This routine executes in interrupt context. It replenishes * the mbufs in the descriptor and sends data which has been * dma'ed into host memory to upper layer. * * Return TRUE for more work, FALSE for all clean. *********************************************************************/ bool ixgbe_rxeof(struct ix_queue *que) { struct adapter *adapter = que->adapter; struct rx_ring *rxr = que->rxr; struct ifnet *ifp = adapter->ifp; struct lro_ctrl *lro = &rxr->lro; int i, nextp, processed = 0; u32 staterr = 0; u32 count = adapter->rx_process_limit; union ixgbe_adv_rx_desc *cur; struct ixgbe_rx_buf *rbuf, *nbuf; u16 pkt_info; IXGBE_RX_LOCK(rxr); #ifdef DEV_NETMAP /* Same as the txeof routine: wakeup clients on intr. */ if (netmap_rx_irq(ifp, rxr->me, &processed)) { IXGBE_RX_UNLOCK(rxr); return (FALSE); } #endif /* DEV_NETMAP */ for (i = rxr->next_to_check; count != 0;) { struct mbuf *sendmp, *mp; u32 rsc, ptype; u16 len; u16 vtag = 0; bool eop; /* Sync the ring. */ bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); cur = &rxr->rx_base[i]; staterr = le32toh(cur->wb.upper.status_error); pkt_info = le16toh(cur->wb.lower.lo_dword.hs_rss.pkt_info); if ((staterr & IXGBE_RXD_STAT_DD) == 0) break; if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) break; count--; sendmp = NULL; nbuf = NULL; rsc = 0; cur->wb.upper.status_error = 0; rbuf = &rxr->rx_buffers[i]; mp = rbuf->buf; len = le16toh(cur->wb.upper.length); ptype = le32toh(cur->wb.lower.lo_dword.data) & IXGBE_RXDADV_PKTTYPE_MASK; eop = ((staterr & IXGBE_RXD_STAT_EOP) != 0); /* Make sure bad packets are discarded */ if (eop && (staterr & IXGBE_RXDADV_ERR_FRAME_ERR_MASK) != 0) { #if __FreeBSD_version >= 1100036 if (IXGBE_IS_VF(adapter)) if_inc_counter(ifp, IFCOUNTER_IERRORS, 1); #endif rxr->rx_discarded++; ixgbe_rx_discard(rxr, i); goto next_desc; } /* ** On 82599 which supports a hardware ** LRO (called HW RSC), packets need ** not be fragmented across sequential ** descriptors, rather the next descriptor ** is indicated in bits of the descriptor. ** This also means that we might proceses ** more than one packet at a time, something ** that has never been true before, it ** required eliminating global chain pointers ** in favor of what we are doing here. -jfv */ if (!eop) { /* ** Figure out the next descriptor ** of this frame. */ if (rxr->hw_rsc == TRUE) { rsc = ixgbe_rsc_count(cur); rxr->rsc_num += (rsc - 1); } if (rsc) { /* Get hardware index */ nextp = ((staterr & IXGBE_RXDADV_NEXTP_MASK) >> IXGBE_RXDADV_NEXTP_SHIFT); } else { /* Just sequential */ nextp = i + 1; if (nextp == adapter->num_rx_desc) nextp = 0; } nbuf = &rxr->rx_buffers[nextp]; prefetch(nbuf); } /* ** Rather than using the fmp/lmp global pointers ** we now keep the head of a packet chain in the ** buffer struct and pass this along from one ** descriptor to the next, until we get EOP. */ mp->m_len = len; /* ** See if there is a stored head ** that determines what we are */ sendmp = rbuf->fmp; if (sendmp != NULL) { /* secondary frag */ rbuf->buf = rbuf->fmp = NULL; mp->m_flags &= ~M_PKTHDR; sendmp->m_pkthdr.len += mp->m_len; } else { /* * Optimize. This might be a small packet, * maybe just a TCP ACK. Do a fast copy that * is cache aligned into a new mbuf, and * leave the old mbuf+cluster for re-use. */ if (eop && len <= IXGBE_RX_COPY_LEN) { sendmp = m_gethdr(M_NOWAIT, MT_DATA); if (sendmp != NULL) { sendmp->m_data += IXGBE_RX_COPY_ALIGN; ixgbe_bcopy(mp->m_data, sendmp->m_data, len); sendmp->m_len = len; rxr->rx_copies++; rbuf->flags |= IXGBE_RX_COPY; } } if (sendmp == NULL) { rbuf->buf = rbuf->fmp = NULL; sendmp = mp; } /* first desc of a non-ps chain */ sendmp->m_flags |= M_PKTHDR; sendmp->m_pkthdr.len = mp->m_len; } ++processed; /* Pass the head pointer on */ if (eop == 0) { nbuf->fmp = sendmp; sendmp = NULL; mp->m_next = nbuf->buf; } else { /* Sending this frame */ sendmp->m_pkthdr.rcvif = ifp; rxr->rx_packets++; /* capture data for AIM */ rxr->bytes += sendmp->m_pkthdr.len; rxr->rx_bytes += sendmp->m_pkthdr.len; /* Process vlan info */ if ((rxr->vtag_strip) && (staterr & IXGBE_RXD_STAT_VP)) vtag = le16toh(cur->wb.upper.vlan); if (vtag) { sendmp->m_pkthdr.ether_vtag = vtag; sendmp->m_flags |= M_VLANTAG; } if ((ifp->if_capenable & IFCAP_RXCSUM) != 0) ixgbe_rx_checksum(staterr, sendmp, ptype); /* * In case of multiqueue, we have RXCSUM.PCSD bit set * and never cleared. This means we have RSS hash * available to be used. */ if (adapter->num_queues > 1) { sendmp->m_pkthdr.flowid = le32toh(cur->wb.lower.hi_dword.rss); switch (pkt_info & IXGBE_RXDADV_RSSTYPE_MASK) { case IXGBE_RXDADV_RSSTYPE_IPV4: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_IPV4); break; case IXGBE_RXDADV_RSSTYPE_IPV4_TCP: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_TCP_IPV4); break; case IXGBE_RXDADV_RSSTYPE_IPV6: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_IPV6); break; case IXGBE_RXDADV_RSSTYPE_IPV6_TCP: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_TCP_IPV6); break; case IXGBE_RXDADV_RSSTYPE_IPV6_EX: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_IPV6_EX); break; case IXGBE_RXDADV_RSSTYPE_IPV6_TCP_EX: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_TCP_IPV6_EX); break; #if __FreeBSD_version > 1100000 case IXGBE_RXDADV_RSSTYPE_IPV4_UDP: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_UDP_IPV4); break; case IXGBE_RXDADV_RSSTYPE_IPV6_UDP: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_UDP_IPV6); break; case IXGBE_RXDADV_RSSTYPE_IPV6_UDP_EX: M_HASHTYPE_SET(sendmp, M_HASHTYPE_RSS_UDP_IPV6_EX); break; #endif default: M_HASHTYPE_SET(sendmp, - M_HASHTYPE_OPAQUE); + M_HASHTYPE_OPAQUE_HASH); } } else { sendmp->m_pkthdr.flowid = que->msix; M_HASHTYPE_SET(sendmp, M_HASHTYPE_OPAQUE); } } next_desc: bus_dmamap_sync(rxr->rxdma.dma_tag, rxr->rxdma.dma_map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* Advance our pointers to the next descriptor. */ if (++i == rxr->num_desc) i = 0; /* Now send to the stack or do LRO */ if (sendmp != NULL) { rxr->next_to_check = i; ixgbe_rx_input(rxr, ifp, sendmp, ptype); i = rxr->next_to_check; } /* Every 8 descriptors we go to refresh mbufs */ if (processed == 8) { ixgbe_refresh_mbufs(rxr, i); processed = 0; } } /* Refresh any remaining buf structs */ if (ixgbe_rx_unrefreshed(rxr)) ixgbe_refresh_mbufs(rxr, i); rxr->next_to_check = i; /* * Flush any outstanding LRO work */ tcp_lro_flush_all(lro); IXGBE_RX_UNLOCK(rxr); /* ** Still have cleaning to do? */ if ((staterr & IXGBE_RXD_STAT_DD) != 0) return (TRUE); else return (FALSE); } /********************************************************************* * * Verify that the hardware indicated that the checksum is valid. * Inform the stack about the status of checksum so that stack * doesn't spend time verifying the checksum. * *********************************************************************/ static void ixgbe_rx_checksum(u32 staterr, struct mbuf * mp, u32 ptype) { u16 status = (u16) staterr; u8 errors = (u8) (staterr >> 24); bool sctp = false; if ((ptype & IXGBE_RXDADV_PKTTYPE_ETQF) == 0 && (ptype & IXGBE_RXDADV_PKTTYPE_SCTP) != 0) sctp = true; /* IPv4 checksum */ if (status & IXGBE_RXD_STAT_IPCS) { mp->m_pkthdr.csum_flags |= CSUM_L3_CALC; /* IP Checksum Good */ if (!(errors & IXGBE_RXD_ERR_IPE)) mp->m_pkthdr.csum_flags |= CSUM_L3_VALID; } /* TCP/UDP/SCTP checksum */ if (status & IXGBE_RXD_STAT_L4CS) { mp->m_pkthdr.csum_flags |= CSUM_L4_CALC; if (!(errors & IXGBE_RXD_ERR_TCPE)) { mp->m_pkthdr.csum_flags |= CSUM_L4_VALID; if (!sctp) mp->m_pkthdr.csum_data = htons(0xffff); } } } /******************************************************************** * Manage DMA'able memory. *******************************************************************/ static void ixgbe_dmamap_cb(void *arg, bus_dma_segment_t * segs, int nseg, int error) { if (error) return; *(bus_addr_t *) arg = segs->ds_addr; return; } int ixgbe_dma_malloc(struct adapter *adapter, bus_size_t size, struct ixgbe_dma_alloc *dma, int mapflags) { device_t dev = adapter->dev; int r; r = bus_dma_tag_create(bus_get_dma_tag(adapter->dev), /* parent */ DBA_ALIGN, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ size, /* maxsize */ 1, /* nsegments */ size, /* maxsegsize */ BUS_DMA_ALLOCNOW, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &dma->dma_tag); if (r != 0) { device_printf(dev,"ixgbe_dma_malloc: bus_dma_tag_create failed; " "error %u\n", r); goto fail_0; } r = bus_dmamem_alloc(dma->dma_tag, (void **)&dma->dma_vaddr, BUS_DMA_NOWAIT, &dma->dma_map); if (r != 0) { device_printf(dev,"ixgbe_dma_malloc: bus_dmamem_alloc failed; " "error %u\n", r); goto fail_1; } r = bus_dmamap_load(dma->dma_tag, dma->dma_map, dma->dma_vaddr, size, ixgbe_dmamap_cb, &dma->dma_paddr, mapflags | BUS_DMA_NOWAIT); if (r != 0) { device_printf(dev,"ixgbe_dma_malloc: bus_dmamap_load failed; " "error %u\n", r); goto fail_2; } dma->dma_size = size; return (0); fail_2: bus_dmamem_free(dma->dma_tag, dma->dma_vaddr, dma->dma_map); fail_1: bus_dma_tag_destroy(dma->dma_tag); fail_0: dma->dma_tag = NULL; return (r); } void ixgbe_dma_free(struct adapter *adapter, struct ixgbe_dma_alloc *dma) { bus_dmamap_sync(dma->dma_tag, dma->dma_map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(dma->dma_tag, dma->dma_map); bus_dmamem_free(dma->dma_tag, dma->dma_vaddr, dma->dma_map); bus_dma_tag_destroy(dma->dma_tag); } /********************************************************************* * * Allocate memory for the transmit and receive rings, and then * the descriptors associated with each, called only once at attach. * **********************************************************************/ int ixgbe_allocate_queues(struct adapter *adapter) { device_t dev = adapter->dev; struct ix_queue *que; struct tx_ring *txr; struct rx_ring *rxr; int rsize, tsize, error = IXGBE_SUCCESS; int txconf = 0, rxconf = 0; #ifdef PCI_IOV enum ixgbe_iov_mode iov_mode; #endif /* First allocate the top level queue structs */ if (!(adapter->queues = (struct ix_queue *) malloc(sizeof(struct ix_queue) * adapter->num_queues, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate queue memory\n"); error = ENOMEM; goto fail; } /* First allocate the TX ring struct memory */ if (!(adapter->tx_rings = (struct tx_ring *) malloc(sizeof(struct tx_ring) * adapter->num_queues, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate TX ring memory\n"); error = ENOMEM; goto tx_fail; } /* Next allocate the RX */ if (!(adapter->rx_rings = (struct rx_ring *) malloc(sizeof(struct rx_ring) * adapter->num_queues, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate RX ring memory\n"); error = ENOMEM; goto rx_fail; } /* For the ring itself */ tsize = roundup2(adapter->num_tx_desc * sizeof(union ixgbe_adv_tx_desc), DBA_ALIGN); #ifdef PCI_IOV iov_mode = ixgbe_get_iov_mode(adapter); adapter->pool = ixgbe_max_vfs(iov_mode); #else adapter->pool = 0; #endif /* * Now set up the TX queues, txconf is needed to handle the * possibility that things fail midcourse and we need to * undo memory gracefully */ for (int i = 0; i < adapter->num_queues; i++, txconf++) { /* Set up some basics */ txr = &adapter->tx_rings[i]; txr->adapter = adapter; #ifdef PCI_IOV txr->me = ixgbe_pf_que_index(iov_mode, i); #else txr->me = i; #endif txr->num_desc = adapter->num_tx_desc; /* Initialize the TX side lock */ snprintf(txr->mtx_name, sizeof(txr->mtx_name), "%s:tx(%d)", device_get_nameunit(dev), txr->me); mtx_init(&txr->tx_mtx, txr->mtx_name, NULL, MTX_DEF); if (ixgbe_dma_malloc(adapter, tsize, &txr->txdma, BUS_DMA_NOWAIT)) { device_printf(dev, "Unable to allocate TX Descriptor memory\n"); error = ENOMEM; goto err_tx_desc; } txr->tx_base = (union ixgbe_adv_tx_desc *)txr->txdma.dma_vaddr; bzero((void *)txr->tx_base, tsize); /* Now allocate transmit buffers for the ring */ if (ixgbe_allocate_transmit_buffers(txr)) { device_printf(dev, "Critical Failure setting up transmit buffers\n"); error = ENOMEM; goto err_tx_desc; } #ifndef IXGBE_LEGACY_TX /* Allocate a buf ring */ txr->br = buf_ring_alloc(IXGBE_BR_SIZE, M_DEVBUF, M_WAITOK, &txr->tx_mtx); if (txr->br == NULL) { device_printf(dev, "Critical Failure setting up buf ring\n"); error = ENOMEM; goto err_tx_desc; } #endif } /* * Next the RX queues... */ rsize = roundup2(adapter->num_rx_desc * sizeof(union ixgbe_adv_rx_desc), DBA_ALIGN); for (int i = 0; i < adapter->num_queues; i++, rxconf++) { rxr = &adapter->rx_rings[i]; /* Set up some basics */ rxr->adapter = adapter; #ifdef PCI_IOV rxr->me = ixgbe_pf_que_index(iov_mode, i); #else rxr->me = i; #endif rxr->num_desc = adapter->num_rx_desc; /* Initialize the RX side lock */ snprintf(rxr->mtx_name, sizeof(rxr->mtx_name), "%s:rx(%d)", device_get_nameunit(dev), rxr->me); mtx_init(&rxr->rx_mtx, rxr->mtx_name, NULL, MTX_DEF); if (ixgbe_dma_malloc(adapter, rsize, &rxr->rxdma, BUS_DMA_NOWAIT)) { device_printf(dev, "Unable to allocate RxDescriptor memory\n"); error = ENOMEM; goto err_rx_desc; } rxr->rx_base = (union ixgbe_adv_rx_desc *)rxr->rxdma.dma_vaddr; bzero((void *)rxr->rx_base, rsize); /* Allocate receive buffers for the ring*/ if (ixgbe_allocate_receive_buffers(rxr)) { device_printf(dev, "Critical Failure setting up receive buffers\n"); error = ENOMEM; goto err_rx_desc; } } /* ** Finally set up the queue holding structs */ for (int i = 0; i < adapter->num_queues; i++) { que = &adapter->queues[i]; que->adapter = adapter; que->me = i; que->txr = &adapter->tx_rings[i]; que->rxr = &adapter->rx_rings[i]; } return (0); err_rx_desc: for (rxr = adapter->rx_rings; rxconf > 0; rxr++, rxconf--) ixgbe_dma_free(adapter, &rxr->rxdma); err_tx_desc: for (txr = adapter->tx_rings; txconf > 0; txr++, txconf--) ixgbe_dma_free(adapter, &txr->txdma); free(adapter->rx_rings, M_DEVBUF); rx_fail: free(adapter->tx_rings, M_DEVBUF); tx_fail: free(adapter->queues, M_DEVBUF); fail: return (error); } Index: projects/vnet/sys/dev/ixl/ixl_txrx.c =================================================================== --- projects/vnet/sys/dev/ixl/ixl_txrx.c (revision 301546) +++ projects/vnet/sys/dev/ixl/ixl_txrx.c (revision 301547) @@ -1,1831 +1,1831 @@ /****************************************************************************** Copyright (c) 2013-2015, Intel Corporation All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of the Intel Corporation nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ******************************************************************************/ /*$FreeBSD$*/ /* ** IXL driver TX/RX Routines: ** This was seperated to allow usage by ** both the BASE and the VF drivers. */ #ifndef IXL_STANDALONE_BUILD #include "opt_inet.h" #include "opt_inet6.h" #include "opt_rss.h" #endif #include "ixl.h" #ifdef RSS #include #endif /* Local Prototypes */ static void ixl_rx_checksum(struct mbuf *, u32, u32, u8); static void ixl_refresh_mbufs(struct ixl_queue *, int); static int ixl_xmit(struct ixl_queue *, struct mbuf **); static int ixl_tx_setup_offload(struct ixl_queue *, struct mbuf *, u32 *, u32 *); static bool ixl_tso_setup(struct ixl_queue *, struct mbuf *); static __inline void ixl_rx_discard(struct rx_ring *, int); static __inline void ixl_rx_input(struct rx_ring *, struct ifnet *, struct mbuf *, u8); #ifdef DEV_NETMAP #include #endif /* DEV_NETMAP */ /* ** Multiqueue Transmit driver */ int ixl_mq_start(struct ifnet *ifp, struct mbuf *m) { struct ixl_vsi *vsi = ifp->if_softc; struct ixl_queue *que; struct tx_ring *txr; int err, i; #ifdef RSS u32 bucket_id; #endif /* ** Which queue to use: ** ** When doing RSS, map it to the same outbound ** queue as the incoming flow would be mapped to. ** If everything is setup correctly, it should be ** the same bucket that the current CPU we're on is. */ if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) { #ifdef RSS if (rss_hash2bucket(m->m_pkthdr.flowid, M_HASHTYPE_GET(m), &bucket_id) == 0) { i = bucket_id % vsi->num_queues; } else #endif i = m->m_pkthdr.flowid % vsi->num_queues; } else i = curcpu % vsi->num_queues; /* ** This may not be perfect, but until something ** better comes along it will keep from scheduling ** on stalled queues. */ if (((1 << i) & vsi->active_queues) == 0) i = ffsl(vsi->active_queues); que = &vsi->queues[i]; txr = &que->txr; err = drbr_enqueue(ifp, txr->br, m); if (err) return (err); if (IXL_TX_TRYLOCK(txr)) { ixl_mq_start_locked(ifp, txr); IXL_TX_UNLOCK(txr); } else taskqueue_enqueue(que->tq, &que->tx_task); return (0); } int ixl_mq_start_locked(struct ifnet *ifp, struct tx_ring *txr) { struct ixl_queue *que = txr->que; struct ixl_vsi *vsi = que->vsi; struct mbuf *next; int err = 0; if (((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) || vsi->link_active == 0) return (ENETDOWN); /* Process the transmit queue */ while ((next = drbr_peek(ifp, txr->br)) != NULL) { if ((err = ixl_xmit(que, &next)) != 0) { if (next == NULL) drbr_advance(ifp, txr->br); else drbr_putback(ifp, txr->br, next); break; } drbr_advance(ifp, txr->br); /* Send a copy of the frame to the BPF listener */ ETHER_BPF_MTAP(ifp, next); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) break; } if (txr->avail < IXL_TX_CLEANUP_THRESHOLD) ixl_txeof(que); return (err); } /* * Called from a taskqueue to drain queued transmit packets. */ void ixl_deferred_mq_start(void *arg, int pending) { struct ixl_queue *que = arg; struct tx_ring *txr = &que->txr; struct ixl_vsi *vsi = que->vsi; struct ifnet *ifp = vsi->ifp; IXL_TX_LOCK(txr); if (!drbr_empty(ifp, txr->br)) ixl_mq_start_locked(ifp, txr); IXL_TX_UNLOCK(txr); } /* ** Flush all queue ring buffers */ void ixl_qflush(struct ifnet *ifp) { struct ixl_vsi *vsi = ifp->if_softc; for (int i = 0; i < vsi->num_queues; i++) { struct ixl_queue *que = &vsi->queues[i]; struct tx_ring *txr = &que->txr; struct mbuf *m; IXL_TX_LOCK(txr); while ((m = buf_ring_dequeue_sc(txr->br)) != NULL) m_freem(m); IXL_TX_UNLOCK(txr); } if_qflush(ifp); } /* ** Find mbuf chains passed to the driver ** that are 'sparse', using more than 8 ** mbufs to deliver an mss-size chunk of data */ static inline bool ixl_tso_detect_sparse(struct mbuf *mp) { struct mbuf *m; int num = 0, mss; bool ret = FALSE; mss = mp->m_pkthdr.tso_segsz; for (m = mp->m_next; m != NULL; m = m->m_next) { num++; mss -= m->m_len; if (mss < 1) break; if (m->m_next == NULL) break; } if (num > IXL_SPARSE_CHAIN) ret = TRUE; return (ret); } /********************************************************************* * * This routine maps the mbufs to tx descriptors, allowing the * TX engine to transmit the packets. * - return 0 on success, positive on failure * **********************************************************************/ #define IXL_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS) static int ixl_xmit(struct ixl_queue *que, struct mbuf **m_headp) { struct ixl_vsi *vsi = que->vsi; struct i40e_hw *hw = vsi->hw; struct tx_ring *txr = &que->txr; struct ixl_tx_buf *buf; struct i40e_tx_desc *txd = NULL; struct mbuf *m_head, *m; int i, j, error, nsegs, maxsegs; int first, last = 0; u16 vtag = 0; u32 cmd, off; bus_dmamap_t map; bus_dma_tag_t tag; bus_dma_segment_t segs[IXL_MAX_TSO_SEGS]; cmd = off = 0; m_head = *m_headp; /* * Important to capture the first descriptor * used because it will contain the index of * the one we tell the hardware to report back */ first = txr->next_avail; buf = &txr->buffers[first]; map = buf->map; tag = txr->tx_tag; maxsegs = IXL_MAX_TX_SEGS; if (m_head->m_pkthdr.csum_flags & CSUM_TSO) { /* Use larger mapping for TSO */ tag = txr->tso_tag; maxsegs = IXL_MAX_TSO_SEGS; if (ixl_tso_detect_sparse(m_head)) { m = m_defrag(m_head, M_NOWAIT); if (m == NULL) { m_freem(*m_headp); *m_headp = NULL; return (ENOBUFS); } *m_headp = m; } } /* * Map the packet for DMA. */ error = bus_dmamap_load_mbuf_sg(tag, map, *m_headp, segs, &nsegs, BUS_DMA_NOWAIT); if (error == EFBIG) { struct mbuf *m; m = m_defrag(*m_headp, M_NOWAIT); if (m == NULL) { que->mbuf_defrag_failed++; m_freem(*m_headp); *m_headp = NULL; return (ENOBUFS); } *m_headp = m; /* Try it again */ error = bus_dmamap_load_mbuf_sg(tag, map, *m_headp, segs, &nsegs, BUS_DMA_NOWAIT); if (error == ENOMEM) { que->tx_dma_setup++; return (error); } else if (error != 0) { que->tx_dma_setup++; m_freem(*m_headp); *m_headp = NULL; return (error); } } else if (error == ENOMEM) { que->tx_dma_setup++; return (error); } else if (error != 0) { que->tx_dma_setup++; m_freem(*m_headp); *m_headp = NULL; return (error); } /* Make certain there are enough descriptors */ if (nsegs > txr->avail - 2) { txr->no_desc++; error = ENOBUFS; goto xmit_fail; } m_head = *m_headp; /* Set up the TSO/CSUM offload */ if (m_head->m_pkthdr.csum_flags & CSUM_OFFLOAD) { error = ixl_tx_setup_offload(que, m_head, &cmd, &off); if (error) goto xmit_fail; } cmd |= I40E_TX_DESC_CMD_ICRC; /* Grab the VLAN tag */ if (m_head->m_flags & M_VLANTAG) { cmd |= I40E_TX_DESC_CMD_IL2TAG1; vtag = htole16(m_head->m_pkthdr.ether_vtag); } i = txr->next_avail; for (j = 0; j < nsegs; j++) { bus_size_t seglen; buf = &txr->buffers[i]; buf->tag = tag; /* Keep track of the type tag */ txd = &txr->base[i]; seglen = segs[j].ds_len; txd->buffer_addr = htole64(segs[j].ds_addr); txd->cmd_type_offset_bsz = htole64(I40E_TX_DESC_DTYPE_DATA | ((u64)cmd << I40E_TXD_QW1_CMD_SHIFT) | ((u64)off << I40E_TXD_QW1_OFFSET_SHIFT) | ((u64)seglen << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) | ((u64)vtag << I40E_TXD_QW1_L2TAG1_SHIFT)); last = i; /* descriptor that will get completion IRQ */ if (++i == que->num_desc) i = 0; buf->m_head = NULL; buf->eop_index = -1; } /* Set the last descriptor for report */ txd->cmd_type_offset_bsz |= htole64(((u64)IXL_TXD_CMD << I40E_TXD_QW1_CMD_SHIFT)); txr->avail -= nsegs; txr->next_avail = i; buf->m_head = m_head; /* Swap the dma map between the first and last descriptor */ txr->buffers[first].map = buf->map; buf->map = map; bus_dmamap_sync(tag, map, BUS_DMASYNC_PREWRITE); /* Set the index of the descriptor that will be marked done */ buf = &txr->buffers[first]; buf->eop_index = last; bus_dmamap_sync(txr->dma.tag, txr->dma.map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* * Advance the Transmit Descriptor Tail (Tdt), this tells the * hardware that this frame is available to transmit. */ ++txr->total_packets; wr32(hw, txr->tail, i); /* Mark outstanding work */ if (que->busy == 0) que->busy = 1; return (0); xmit_fail: bus_dmamap_unload(tag, buf->map); return (error); } /********************************************************************* * * Allocate memory for tx_buffer structures. The tx_buffer stores all * the information needed to transmit a packet on the wire. This is * called only once at attach, setup is done every reset. * **********************************************************************/ int ixl_allocate_tx_data(struct ixl_queue *que) { struct tx_ring *txr = &que->txr; struct ixl_vsi *vsi = que->vsi; device_t dev = vsi->dev; struct ixl_tx_buf *buf; int error = 0; /* * Setup DMA descriptor areas. */ if ((error = bus_dma_tag_create(NULL, /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ IXL_TSO_SIZE, /* maxsize */ IXL_MAX_TX_SEGS, /* nsegments */ PAGE_SIZE, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txr->tx_tag))) { device_printf(dev,"Unable to allocate TX DMA tag\n"); goto fail; } /* Make a special tag for TSO */ if ((error = bus_dma_tag_create(NULL, /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ IXL_TSO_SIZE, /* maxsize */ IXL_MAX_TSO_SEGS, /* nsegments */ PAGE_SIZE, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &txr->tso_tag))) { device_printf(dev,"Unable to allocate TX TSO DMA tag\n"); goto fail; } if (!(txr->buffers = (struct ixl_tx_buf *) malloc(sizeof(struct ixl_tx_buf) * que->num_desc, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate tx_buffer memory\n"); error = ENOMEM; goto fail; } /* Create the descriptor buffer default dma maps */ buf = txr->buffers; for (int i = 0; i < que->num_desc; i++, buf++) { buf->tag = txr->tx_tag; error = bus_dmamap_create(buf->tag, 0, &buf->map); if (error != 0) { device_printf(dev, "Unable to create TX DMA map\n"); goto fail; } } fail: return (error); } /********************************************************************* * * (Re)Initialize a queue transmit ring. * - called by init, it clears the descriptor ring, * and frees any stale mbufs * **********************************************************************/ void ixl_init_tx_ring(struct ixl_queue *que) { #ifdef DEV_NETMAP struct netmap_adapter *na = NA(que->vsi->ifp); struct netmap_slot *slot; #endif /* DEV_NETMAP */ struct tx_ring *txr = &que->txr; struct ixl_tx_buf *buf; /* Clear the old ring contents */ IXL_TX_LOCK(txr); #ifdef DEV_NETMAP /* * (under lock): if in netmap mode, do some consistency * checks and set slot to entry 0 of the netmap ring. */ slot = netmap_reset(na, NR_TX, que->me, 0); #endif /* DEV_NETMAP */ bzero((void *)txr->base, (sizeof(struct i40e_tx_desc)) * que->num_desc); /* Reset indices */ txr->next_avail = 0; txr->next_to_clean = 0; #ifdef IXL_FDIR /* Initialize flow director */ txr->atr_rate = ixl_atr_rate; txr->atr_count = 0; #endif /* Free any existing tx mbufs. */ buf = txr->buffers; for (int i = 0; i < que->num_desc; i++, buf++) { if (buf->m_head != NULL) { bus_dmamap_sync(buf->tag, buf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(buf->tag, buf->map); m_freem(buf->m_head); buf->m_head = NULL; } #ifdef DEV_NETMAP /* * In netmap mode, set the map for the packet buffer. * NOTE: Some drivers (not this one) also need to set * the physical buffer address in the NIC ring. * netmap_idx_n2k() maps a nic index, i, into the corresponding * netmap slot index, si */ if (slot) { int si = netmap_idx_n2k(&na->tx_rings[que->me], i); netmap_load_map(na, buf->tag, buf->map, NMB(na, slot + si)); } #endif /* DEV_NETMAP */ /* Clear the EOP index */ buf->eop_index = -1; } /* Set number of descriptors available */ txr->avail = que->num_desc; bus_dmamap_sync(txr->dma.tag, txr->dma.map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); IXL_TX_UNLOCK(txr); } /********************************************************************* * * Free transmit ring related data structures. * **********************************************************************/ void ixl_free_que_tx(struct ixl_queue *que) { struct tx_ring *txr = &que->txr; struct ixl_tx_buf *buf; INIT_DBG_IF(que->vsi->ifp, "queue %d: begin", que->me); for (int i = 0; i < que->num_desc; i++) { buf = &txr->buffers[i]; if (buf->m_head != NULL) { bus_dmamap_sync(buf->tag, buf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(buf->tag, buf->map); m_freem(buf->m_head); buf->m_head = NULL; if (buf->map != NULL) { bus_dmamap_destroy(buf->tag, buf->map); buf->map = NULL; } } else if (buf->map != NULL) { bus_dmamap_unload(buf->tag, buf->map); bus_dmamap_destroy(buf->tag, buf->map); buf->map = NULL; } } if (txr->br != NULL) buf_ring_free(txr->br, M_DEVBUF); if (txr->buffers != NULL) { free(txr->buffers, M_DEVBUF); txr->buffers = NULL; } if (txr->tx_tag != NULL) { bus_dma_tag_destroy(txr->tx_tag); txr->tx_tag = NULL; } if (txr->tso_tag != NULL) { bus_dma_tag_destroy(txr->tso_tag); txr->tso_tag = NULL; } INIT_DBG_IF(que->vsi->ifp, "queue %d: end", que->me); return; } /********************************************************************* * * Setup descriptor for hw offloads * **********************************************************************/ static int ixl_tx_setup_offload(struct ixl_queue *que, struct mbuf *mp, u32 *cmd, u32 *off) { struct ether_vlan_header *eh; #ifdef INET struct ip *ip = NULL; #endif struct tcphdr *th = NULL; #ifdef INET6 struct ip6_hdr *ip6; #endif int elen, ip_hlen = 0, tcp_hlen; u16 etype; u8 ipproto = 0; bool tso = FALSE; /* Set up the TSO context descriptor if required */ if (mp->m_pkthdr.csum_flags & CSUM_TSO) { tso = ixl_tso_setup(que, mp); if (tso) ++que->tso; else return (ENXIO); } /* * Determine where frame payload starts. * Jump over vlan headers if already present, * helpful for QinQ too. */ eh = mtod(mp, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { etype = ntohs(eh->evl_proto); elen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; } else { etype = ntohs(eh->evl_encap_proto); elen = ETHER_HDR_LEN; } switch (etype) { #ifdef INET case ETHERTYPE_IP: ip = (struct ip *)(mp->m_data + elen); ip_hlen = ip->ip_hl << 2; ipproto = ip->ip_p; th = (struct tcphdr *)((caddr_t)ip + ip_hlen); /* The IP checksum must be recalculated with TSO */ if (tso) *cmd |= I40E_TX_DESC_CMD_IIPT_IPV4_CSUM; else *cmd |= I40E_TX_DESC_CMD_IIPT_IPV4; break; #endif #ifdef INET6 case ETHERTYPE_IPV6: ip6 = (struct ip6_hdr *)(mp->m_data + elen); ip_hlen = sizeof(struct ip6_hdr); ipproto = ip6->ip6_nxt; th = (struct tcphdr *)((caddr_t)ip6 + ip_hlen); *cmd |= I40E_TX_DESC_CMD_IIPT_IPV6; break; #endif default: break; } *off |= (elen >> 1) << I40E_TX_DESC_LENGTH_MACLEN_SHIFT; *off |= (ip_hlen >> 2) << I40E_TX_DESC_LENGTH_IPLEN_SHIFT; switch (ipproto) { case IPPROTO_TCP: tcp_hlen = th->th_off << 2; if (mp->m_pkthdr.csum_flags & (CSUM_TCP|CSUM_TCP_IPV6)) { *cmd |= I40E_TX_DESC_CMD_L4T_EOFT_TCP; *off |= (tcp_hlen >> 2) << I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT; } #ifdef IXL_FDIR ixl_atr(que, th, etype); #endif break; case IPPROTO_UDP: if (mp->m_pkthdr.csum_flags & (CSUM_UDP|CSUM_UDP_IPV6)) { *cmd |= I40E_TX_DESC_CMD_L4T_EOFT_UDP; *off |= (sizeof(struct udphdr) >> 2) << I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT; } break; case IPPROTO_SCTP: if (mp->m_pkthdr.csum_flags & (CSUM_SCTP|CSUM_SCTP_IPV6)) { *cmd |= I40E_TX_DESC_CMD_L4T_EOFT_SCTP; *off |= (sizeof(struct sctphdr) >> 2) << I40E_TX_DESC_LENGTH_L4_FC_LEN_SHIFT; } /* Fall Thru */ default: break; } return (0); } /********************************************************************** * * Setup context for hardware segmentation offload (TSO) * **********************************************************************/ static bool ixl_tso_setup(struct ixl_queue *que, struct mbuf *mp) { struct tx_ring *txr = &que->txr; struct i40e_tx_context_desc *TXD; struct ixl_tx_buf *buf; u32 cmd, mss, type, tsolen; u16 etype; int idx, elen, ip_hlen, tcp_hlen; struct ether_vlan_header *eh; #ifdef INET struct ip *ip; #endif #ifdef INET6 struct ip6_hdr *ip6; #endif #if defined(INET6) || defined(INET) struct tcphdr *th; #endif u64 type_cmd_tso_mss; /* * Determine where frame payload starts. * Jump over vlan headers if already present */ eh = mtod(mp, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { elen = ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN; etype = eh->evl_proto; } else { elen = ETHER_HDR_LEN; etype = eh->evl_encap_proto; } switch (ntohs(etype)) { #ifdef INET6 case ETHERTYPE_IPV6: ip6 = (struct ip6_hdr *)(mp->m_data + elen); if (ip6->ip6_nxt != IPPROTO_TCP) return (ENXIO); ip_hlen = sizeof(struct ip6_hdr); th = (struct tcphdr *)((caddr_t)ip6 + ip_hlen); th->th_sum = in6_cksum_pseudo(ip6, 0, IPPROTO_TCP, 0); tcp_hlen = th->th_off << 2; /* * The corresponding flag is set by the stack in the IPv4 * TSO case, but not in IPv6 (at least in FreeBSD 10.2). * So, set it here because the rest of the flow requires it. */ mp->m_pkthdr.csum_flags |= CSUM_TCP_IPV6; break; #endif #ifdef INET case ETHERTYPE_IP: ip = (struct ip *)(mp->m_data + elen); if (ip->ip_p != IPPROTO_TCP) return (ENXIO); ip->ip_sum = 0; ip_hlen = ip->ip_hl << 2; th = (struct tcphdr *)((caddr_t)ip + ip_hlen); th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(IPPROTO_TCP)); tcp_hlen = th->th_off << 2; break; #endif default: printf("%s: CSUM_TSO but no supported IP version (0x%04x)", __func__, ntohs(etype)); return FALSE; } /* Ensure we have at least the IP+TCP header in the first mbuf. */ if (mp->m_len < elen + ip_hlen + sizeof(struct tcphdr)) return FALSE; idx = txr->next_avail; buf = &txr->buffers[idx]; TXD = (struct i40e_tx_context_desc *) &txr->base[idx]; tsolen = mp->m_pkthdr.len - (elen + ip_hlen + tcp_hlen); type = I40E_TX_DESC_DTYPE_CONTEXT; cmd = I40E_TX_CTX_DESC_TSO; mss = mp->m_pkthdr.tso_segsz; type_cmd_tso_mss = ((u64)type << I40E_TXD_CTX_QW1_DTYPE_SHIFT) | ((u64)cmd << I40E_TXD_CTX_QW1_CMD_SHIFT) | ((u64)tsolen << I40E_TXD_CTX_QW1_TSO_LEN_SHIFT) | ((u64)mss << I40E_TXD_CTX_QW1_MSS_SHIFT); TXD->type_cmd_tso_mss = htole64(type_cmd_tso_mss); TXD->tunneling_params = htole32(0); buf->m_head = NULL; buf->eop_index = -1; if (++idx == que->num_desc) idx = 0; txr->avail--; txr->next_avail = idx; return TRUE; } /* ** ixl_get_tx_head - Retrieve the value from the ** location the HW records its HEAD index */ static inline u32 ixl_get_tx_head(struct ixl_queue *que) { struct tx_ring *txr = &que->txr; void *head = &txr->base[que->num_desc]; return LE32_TO_CPU(*(volatile __le32 *)head); } /********************************************************************** * * Examine each tx_buffer in the used queue. If the hardware is done * processing the packet then free associated resources. The * tx_buffer is put back on the free queue. * **********************************************************************/ bool ixl_txeof(struct ixl_queue *que) { struct tx_ring *txr = &que->txr; u32 first, last, head, done, processed; struct ixl_tx_buf *buf; struct i40e_tx_desc *tx_desc, *eop_desc; mtx_assert(&txr->mtx, MA_OWNED); #ifdef DEV_NETMAP // XXX todo: implement moderation if (netmap_tx_irq(que->vsi->ifp, que->me)) return FALSE; #endif /* DEF_NETMAP */ /* These are not the descriptors you seek, move along :) */ if (txr->avail == que->num_desc) { que->busy = 0; return FALSE; } processed = 0; first = txr->next_to_clean; buf = &txr->buffers[first]; tx_desc = (struct i40e_tx_desc *)&txr->base[first]; last = buf->eop_index; if (last == -1) return FALSE; eop_desc = (struct i40e_tx_desc *)&txr->base[last]; /* Get the Head WB value */ head = ixl_get_tx_head(que); /* ** Get the index of the first descriptor ** BEYOND the EOP and call that 'done'. ** I do this so the comparison in the ** inner while loop below can be simple */ if (++last == que->num_desc) last = 0; done = last; bus_dmamap_sync(txr->dma.tag, txr->dma.map, BUS_DMASYNC_POSTREAD); /* ** The HEAD index of the ring is written in a ** defined location, this rather than a done bit ** is what is used to keep track of what must be ** 'cleaned'. */ while (first != head) { /* We clean the range of the packet */ while (first != done) { ++txr->avail; ++processed; if (buf->m_head) { txr->bytes += /* for ITR adjustment */ buf->m_head->m_pkthdr.len; txr->tx_bytes += /* for TX stats */ buf->m_head->m_pkthdr.len; bus_dmamap_sync(buf->tag, buf->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(buf->tag, buf->map); m_freem(buf->m_head); buf->m_head = NULL; buf->map = NULL; } buf->eop_index = -1; if (++first == que->num_desc) first = 0; buf = &txr->buffers[first]; tx_desc = &txr->base[first]; } ++txr->packets; /* See if there is more work now */ last = buf->eop_index; if (last != -1) { eop_desc = &txr->base[last]; /* Get next done point */ if (++last == que->num_desc) last = 0; done = last; } else break; } bus_dmamap_sync(txr->dma.tag, txr->dma.map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); txr->next_to_clean = first; /* ** Hang detection, we know there's ** work outstanding or the first return ** would have been taken, so indicate an ** unsuccessful pass, in local_timer if ** the value is too great the queue will ** be considered hung. If anything has been ** cleaned then reset the state. */ if ((processed == 0) && (que->busy != IXL_QUEUE_HUNG)) ++que->busy; if (processed) que->busy = 1; /* Note this turns off HUNG */ /* * If there are no pending descriptors, clear the timeout. */ if (txr->avail == que->num_desc) { que->busy = 0; return FALSE; } return TRUE; } /********************************************************************* * * Refresh mbuf buffers for RX descriptor rings * - now keeps its own state so discards due to resource * exhaustion are unnecessary, if an mbuf cannot be obtained * it just returns, keeping its placeholder, thus it can simply * be recalled to try again. * **********************************************************************/ static void ixl_refresh_mbufs(struct ixl_queue *que, int limit) { struct ixl_vsi *vsi = que->vsi; struct rx_ring *rxr = &que->rxr; bus_dma_segment_t hseg[1]; bus_dma_segment_t pseg[1]; struct ixl_rx_buf *buf; struct mbuf *mh, *mp; int i, j, nsegs, error; bool refreshed = FALSE; i = j = rxr->next_refresh; /* Control the loop with one beyond */ if (++j == que->num_desc) j = 0; while (j != limit) { buf = &rxr->buffers[i]; if (rxr->hdr_split == FALSE) goto no_split; if (buf->m_head == NULL) { mh = m_gethdr(M_NOWAIT, MT_DATA); if (mh == NULL) goto update; } else mh = buf->m_head; mh->m_pkthdr.len = mh->m_len = MHLEN; mh->m_len = MHLEN; mh->m_flags |= M_PKTHDR; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->htag, buf->hmap, mh, hseg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) { printf("Refresh mbufs: hdr dmamap load" " failure - %d\n", error); m_free(mh); buf->m_head = NULL; goto update; } buf->m_head = mh; bus_dmamap_sync(rxr->htag, buf->hmap, BUS_DMASYNC_PREREAD); rxr->base[i].read.hdr_addr = htole64(hseg[0].ds_addr); no_split: if (buf->m_pack == NULL) { mp = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rxr->mbuf_sz); if (mp == NULL) goto update; } else mp = buf->m_pack; mp->m_pkthdr.len = mp->m_len = rxr->mbuf_sz; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->ptag, buf->pmap, mp, pseg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) { printf("Refresh mbufs: payload dmamap load" " failure - %d\n", error); m_free(mp); buf->m_pack = NULL; goto update; } buf->m_pack = mp; bus_dmamap_sync(rxr->ptag, buf->pmap, BUS_DMASYNC_PREREAD); rxr->base[i].read.pkt_addr = htole64(pseg[0].ds_addr); /* Used only when doing header split */ rxr->base[i].read.hdr_addr = 0; refreshed = TRUE; /* Next is precalculated */ i = j; rxr->next_refresh = i; if (++j == que->num_desc) j = 0; } update: if (refreshed) /* Update hardware tail index */ wr32(vsi->hw, rxr->tail, rxr->next_refresh); return; } /********************************************************************* * * Allocate memory for rx_buffer structures. Since we use one * rx_buffer per descriptor, the maximum number of rx_buffer's * that we'll need is equal to the number of receive descriptors * that we've defined. * **********************************************************************/ int ixl_allocate_rx_data(struct ixl_queue *que) { struct rx_ring *rxr = &que->rxr; struct ixl_vsi *vsi = que->vsi; device_t dev = vsi->dev; struct ixl_rx_buf *buf; int i, bsize, error; bsize = sizeof(struct ixl_rx_buf) * que->num_desc; if (!(rxr->buffers = (struct ixl_rx_buf *) malloc(bsize, M_DEVBUF, M_NOWAIT | M_ZERO))) { device_printf(dev, "Unable to allocate rx_buffer memory\n"); error = ENOMEM; return (error); } if ((error = bus_dma_tag_create(NULL, /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MSIZE, /* maxsize */ 1, /* nsegments */ MSIZE, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &rxr->htag))) { device_printf(dev, "Unable to create RX DMA htag\n"); return (error); } if ((error = bus_dma_tag_create(NULL, /* parent */ 1, 0, /* alignment, bounds */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MJUM16BYTES, /* maxsize */ 1, /* nsegments */ MJUM16BYTES, /* maxsegsize */ 0, /* flags */ NULL, /* lockfunc */ NULL, /* lockfuncarg */ &rxr->ptag))) { device_printf(dev, "Unable to create RX DMA ptag\n"); return (error); } for (i = 0; i < que->num_desc; i++) { buf = &rxr->buffers[i]; error = bus_dmamap_create(rxr->htag, BUS_DMA_NOWAIT, &buf->hmap); if (error) { device_printf(dev, "Unable to create RX head map\n"); break; } error = bus_dmamap_create(rxr->ptag, BUS_DMA_NOWAIT, &buf->pmap); if (error) { device_printf(dev, "Unable to create RX pkt map\n"); break; } } return (error); } /********************************************************************* * * (Re)Initialize the queue receive ring and its buffers. * **********************************************************************/ int ixl_init_rx_ring(struct ixl_queue *que) { struct rx_ring *rxr = &que->rxr; struct ixl_vsi *vsi = que->vsi; #if defined(INET6) || defined(INET) struct ifnet *ifp = vsi->ifp; struct lro_ctrl *lro = &rxr->lro; #endif struct ixl_rx_buf *buf; bus_dma_segment_t pseg[1], hseg[1]; int rsize, nsegs, error = 0; #ifdef DEV_NETMAP struct netmap_adapter *na = NA(que->vsi->ifp); struct netmap_slot *slot; #endif /* DEV_NETMAP */ IXL_RX_LOCK(rxr); #ifdef DEV_NETMAP /* same as in ixl_init_tx_ring() */ slot = netmap_reset(na, NR_RX, que->me, 0); #endif /* DEV_NETMAP */ /* Clear the ring contents */ rsize = roundup2(que->num_desc * sizeof(union i40e_rx_desc), DBA_ALIGN); bzero((void *)rxr->base, rsize); /* Cleanup any existing buffers */ for (int i = 0; i < que->num_desc; i++) { buf = &rxr->buffers[i]; if (buf->m_head != NULL) { bus_dmamap_sync(rxr->htag, buf->hmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->htag, buf->hmap); buf->m_head->m_flags |= M_PKTHDR; m_freem(buf->m_head); } if (buf->m_pack != NULL) { bus_dmamap_sync(rxr->ptag, buf->pmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->ptag, buf->pmap); buf->m_pack->m_flags |= M_PKTHDR; m_freem(buf->m_pack); } buf->m_head = NULL; buf->m_pack = NULL; } /* header split is off */ rxr->hdr_split = FALSE; /* Now replenish the mbufs */ for (int j = 0; j != que->num_desc; ++j) { struct mbuf *mh, *mp; buf = &rxr->buffers[j]; #ifdef DEV_NETMAP /* * In netmap mode, fill the map and set the buffer * address in the NIC ring, considering the offset * between the netmap and NIC rings (see comment in * ixgbe_setup_transmit_ring() ). No need to allocate * an mbuf, so end the block with a continue; */ if (slot) { int sj = netmap_idx_n2k(&na->rx_rings[que->me], j); uint64_t paddr; void *addr; addr = PNMB(na, slot + sj, &paddr); netmap_load_map(na, rxr->dma.tag, buf->pmap, addr); /* Update descriptor and the cached value */ rxr->base[j].read.pkt_addr = htole64(paddr); rxr->base[j].read.hdr_addr = 0; continue; } #endif /* DEV_NETMAP */ /* ** Don't allocate mbufs if not ** doing header split, its wasteful */ if (rxr->hdr_split == FALSE) goto skip_head; /* First the header */ buf->m_head = m_gethdr(M_NOWAIT, MT_DATA); if (buf->m_head == NULL) { error = ENOBUFS; goto fail; } m_adj(buf->m_head, ETHER_ALIGN); mh = buf->m_head; mh->m_len = mh->m_pkthdr.len = MHLEN; mh->m_flags |= M_PKTHDR; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->htag, buf->hmap, buf->m_head, hseg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) /* Nothing elegant to do here */ goto fail; bus_dmamap_sync(rxr->htag, buf->hmap, BUS_DMASYNC_PREREAD); /* Update descriptor */ rxr->base[j].read.hdr_addr = htole64(hseg[0].ds_addr); skip_head: /* Now the payload cluster */ buf->m_pack = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rxr->mbuf_sz); if (buf->m_pack == NULL) { error = ENOBUFS; goto fail; } mp = buf->m_pack; mp->m_pkthdr.len = mp->m_len = rxr->mbuf_sz; /* Get the memory mapping */ error = bus_dmamap_load_mbuf_sg(rxr->ptag, buf->pmap, mp, pseg, &nsegs, BUS_DMA_NOWAIT); if (error != 0) goto fail; bus_dmamap_sync(rxr->ptag, buf->pmap, BUS_DMASYNC_PREREAD); /* Update descriptor */ rxr->base[j].read.pkt_addr = htole64(pseg[0].ds_addr); rxr->base[j].read.hdr_addr = 0; } /* Setup our descriptor indices */ rxr->next_check = 0; rxr->next_refresh = 0; rxr->lro_enabled = FALSE; rxr->split = 0; rxr->bytes = 0; rxr->discard = FALSE; wr32(vsi->hw, rxr->tail, que->num_desc - 1); ixl_flush(vsi->hw); #if defined(INET6) || defined(INET) /* ** Now set up the LRO interface: */ if (ifp->if_capenable & IFCAP_LRO) { int err = tcp_lro_init(lro); if (err) { if_printf(ifp, "queue %d: LRO Initialization failed!\n", que->me); goto fail; } INIT_DBG_IF(ifp, "queue %d: RX Soft LRO Initialized", que->me); rxr->lro_enabled = TRUE; lro->ifp = vsi->ifp; } #endif bus_dmamap_sync(rxr->dma.tag, rxr->dma.map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); fail: IXL_RX_UNLOCK(rxr); return (error); } /********************************************************************* * * Free station receive ring data structures * **********************************************************************/ void ixl_free_que_rx(struct ixl_queue *que) { struct rx_ring *rxr = &que->rxr; struct ixl_rx_buf *buf; INIT_DBG_IF(que->vsi->ifp, "queue %d: begin", que->me); /* Cleanup any existing buffers */ if (rxr->buffers != NULL) { for (int i = 0; i < que->num_desc; i++) { buf = &rxr->buffers[i]; if (buf->m_head != NULL) { bus_dmamap_sync(rxr->htag, buf->hmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->htag, buf->hmap); buf->m_head->m_flags |= M_PKTHDR; m_freem(buf->m_head); } if (buf->m_pack != NULL) { bus_dmamap_sync(rxr->ptag, buf->pmap, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(rxr->ptag, buf->pmap); buf->m_pack->m_flags |= M_PKTHDR; m_freem(buf->m_pack); } buf->m_head = NULL; buf->m_pack = NULL; if (buf->hmap != NULL) { bus_dmamap_destroy(rxr->htag, buf->hmap); buf->hmap = NULL; } if (buf->pmap != NULL) { bus_dmamap_destroy(rxr->ptag, buf->pmap); buf->pmap = NULL; } } if (rxr->buffers != NULL) { free(rxr->buffers, M_DEVBUF); rxr->buffers = NULL; } } if (rxr->htag != NULL) { bus_dma_tag_destroy(rxr->htag); rxr->htag = NULL; } if (rxr->ptag != NULL) { bus_dma_tag_destroy(rxr->ptag); rxr->ptag = NULL; } INIT_DBG_IF(que->vsi->ifp, "queue %d: end", que->me); return; } static __inline void ixl_rx_input(struct rx_ring *rxr, struct ifnet *ifp, struct mbuf *m, u8 ptype) { #if defined(INET6) || defined(INET) /* * ATM LRO is only for IPv4/TCP packets and TCP checksum of the packet * should be computed by hardware. Also it should not have VLAN tag in * ethernet header. */ if (rxr->lro_enabled && (ifp->if_capenable & IFCAP_VLAN_HWTAGGING) != 0 && (m->m_pkthdr.csum_flags & (CSUM_DATA_VALID | CSUM_PSEUDO_HDR)) == (CSUM_DATA_VALID | CSUM_PSEUDO_HDR)) { /* * Send to the stack if: ** - LRO not enabled, or ** - no LRO resources, or ** - lro enqueue fails */ if (rxr->lro.lro_cnt != 0) if (tcp_lro_rx(&rxr->lro, m, 0) == 0) return; } #endif IXL_RX_UNLOCK(rxr); (*ifp->if_input)(ifp, m); IXL_RX_LOCK(rxr); } static __inline void ixl_rx_discard(struct rx_ring *rxr, int i) { struct ixl_rx_buf *rbuf; rbuf = &rxr->buffers[i]; if (rbuf->fmp != NULL) {/* Partial chain ? */ rbuf->fmp->m_flags |= M_PKTHDR; m_freem(rbuf->fmp); rbuf->fmp = NULL; } /* ** With advanced descriptors the writeback ** clobbers the buffer addrs, so its easier ** to just free the existing mbufs and take ** the normal refresh path to get new buffers ** and mapping. */ if (rbuf->m_head) { m_free(rbuf->m_head); rbuf->m_head = NULL; } if (rbuf->m_pack) { m_free(rbuf->m_pack); rbuf->m_pack = NULL; } return; } #ifdef RSS /* ** i40e_ptype_to_hash: parse the packet type ** to determine the appropriate hash. */ static inline int ixl_ptype_to_hash(u8 ptype) { struct i40e_rx_ptype_decoded decoded; u8 ex = 0; decoded = decode_rx_desc_ptype(ptype); ex = decoded.outer_frag; if (!decoded.known) - return M_HASHTYPE_OPAQUE; + return M_HASHTYPE_OPAQUE_HASH; if (decoded.outer_ip == I40E_RX_PTYPE_OUTER_L2) - return M_HASHTYPE_OPAQUE; + return M_HASHTYPE_OPAQUE_HASH; /* Note: anything that gets to this point is IP */ if (decoded.outer_ip_ver == I40E_RX_PTYPE_OUTER_IPV6) { switch (decoded.inner_prot) { case I40E_RX_PTYPE_INNER_PROT_TCP: if (ex) return M_HASHTYPE_RSS_TCP_IPV6_EX; else return M_HASHTYPE_RSS_TCP_IPV6; case I40E_RX_PTYPE_INNER_PROT_UDP: if (ex) return M_HASHTYPE_RSS_UDP_IPV6_EX; else return M_HASHTYPE_RSS_UDP_IPV6; default: if (ex) return M_HASHTYPE_RSS_IPV6_EX; else return M_HASHTYPE_RSS_IPV6; } } if (decoded.outer_ip_ver == I40E_RX_PTYPE_OUTER_IPV4) { switch (decoded.inner_prot) { case I40E_RX_PTYPE_INNER_PROT_TCP: return M_HASHTYPE_RSS_TCP_IPV4; case I40E_RX_PTYPE_INNER_PROT_UDP: if (ex) return M_HASHTYPE_RSS_UDP_IPV4_EX; else return M_HASHTYPE_RSS_UDP_IPV4; default: return M_HASHTYPE_RSS_IPV4; } } /* We should never get here!! */ - return M_HASHTYPE_OPAQUE; + return M_HASHTYPE_OPAQUE_HASH; } #endif /* RSS */ /********************************************************************* * * This routine executes in interrupt context. It replenishes * the mbufs in the descriptor and sends data which has been * dma'ed into host memory to upper layer. * * We loop at most count times if count is > 0, or until done if * count < 0. * * Return TRUE for more work, FALSE for all clean. *********************************************************************/ bool ixl_rxeof(struct ixl_queue *que, int count) { struct ixl_vsi *vsi = que->vsi; struct rx_ring *rxr = &que->rxr; struct ifnet *ifp = vsi->ifp; #if defined(INET6) || defined(INET) struct lro_ctrl *lro = &rxr->lro; #endif int i, nextp, processed = 0; union i40e_rx_desc *cur; struct ixl_rx_buf *rbuf, *nbuf; IXL_RX_LOCK(rxr); #ifdef DEV_NETMAP if (netmap_rx_irq(ifp, que->me, &count)) { IXL_RX_UNLOCK(rxr); return (FALSE); } #endif /* DEV_NETMAP */ for (i = rxr->next_check; count != 0;) { struct mbuf *sendmp, *mh, *mp; u32 rsc, status, error; u16 hlen, plen, vtag; u64 qword; u8 ptype; bool eop; /* Sync the ring. */ bus_dmamap_sync(rxr->dma.tag, rxr->dma.map, BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE); cur = &rxr->base[i]; qword = le64toh(cur->wb.qword1.status_error_len); status = (qword & I40E_RXD_QW1_STATUS_MASK) >> I40E_RXD_QW1_STATUS_SHIFT; error = (qword & I40E_RXD_QW1_ERROR_MASK) >> I40E_RXD_QW1_ERROR_SHIFT; plen = (qword & I40E_RXD_QW1_LENGTH_PBUF_MASK) >> I40E_RXD_QW1_LENGTH_PBUF_SHIFT; hlen = (qword & I40E_RXD_QW1_LENGTH_HBUF_MASK) >> I40E_RXD_QW1_LENGTH_HBUF_SHIFT; ptype = (qword & I40E_RXD_QW1_PTYPE_MASK) >> I40E_RXD_QW1_PTYPE_SHIFT; if ((status & (1 << I40E_RX_DESC_STATUS_DD_SHIFT)) == 0) { ++rxr->not_done; break; } if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) break; count--; sendmp = NULL; nbuf = NULL; rsc = 0; cur->wb.qword1.status_error_len = 0; rbuf = &rxr->buffers[i]; mh = rbuf->m_head; mp = rbuf->m_pack; eop = (status & (1 << I40E_RX_DESC_STATUS_EOF_SHIFT)); if (status & (1 << I40E_RX_DESC_STATUS_L2TAG1P_SHIFT)) vtag = le16toh(cur->wb.qword0.lo_dword.l2tag1); else vtag = 0; /* ** Make sure bad packets are discarded, ** note that only EOP descriptor has valid ** error results. */ if (eop && (error & (1 << I40E_RX_DESC_ERROR_RXE_SHIFT))) { rxr->desc_errs++; ixl_rx_discard(rxr, i); goto next_desc; } /* Prefetch the next buffer */ if (!eop) { nextp = i + 1; if (nextp == que->num_desc) nextp = 0; nbuf = &rxr->buffers[nextp]; prefetch(nbuf); } /* ** The header mbuf is ONLY used when header ** split is enabled, otherwise we get normal ** behavior, ie, both header and payload ** are DMA'd into the payload buffer. ** ** Rather than using the fmp/lmp global pointers ** we now keep the head of a packet chain in the ** buffer struct and pass this along from one ** descriptor to the next, until we get EOP. */ if (rxr->hdr_split && (rbuf->fmp == NULL)) { if (hlen > IXL_RX_HDR) hlen = IXL_RX_HDR; mh->m_len = hlen; mh->m_flags |= M_PKTHDR; mh->m_next = NULL; mh->m_pkthdr.len = mh->m_len; /* Null buf pointer so it is refreshed */ rbuf->m_head = NULL; /* ** Check the payload length, this ** could be zero if its a small ** packet. */ if (plen > 0) { mp->m_len = plen; mp->m_next = NULL; mp->m_flags &= ~M_PKTHDR; mh->m_next = mp; mh->m_pkthdr.len += mp->m_len; /* Null buf pointer so it is refreshed */ rbuf->m_pack = NULL; rxr->split++; } /* ** Now create the forward ** chain so when complete ** we wont have to. */ if (eop == 0) { /* stash the chain head */ nbuf->fmp = mh; /* Make forward chain */ if (plen) mp->m_next = nbuf->m_pack; else mh->m_next = nbuf->m_pack; } else { /* Singlet, prepare to send */ sendmp = mh; if (vtag) { sendmp->m_pkthdr.ether_vtag = vtag; sendmp->m_flags |= M_VLANTAG; } } } else { /* ** Either no header split, or a ** secondary piece of a fragmented ** split packet. */ mp->m_len = plen; /* ** See if there is a stored head ** that determines what we are */ sendmp = rbuf->fmp; rbuf->m_pack = rbuf->fmp = NULL; if (sendmp != NULL) /* secondary frag */ sendmp->m_pkthdr.len += mp->m_len; else { /* first desc of a non-ps chain */ sendmp = mp; sendmp->m_flags |= M_PKTHDR; sendmp->m_pkthdr.len = mp->m_len; if (vtag) { sendmp->m_pkthdr.ether_vtag = vtag; sendmp->m_flags |= M_VLANTAG; } } /* Pass the head pointer on */ if (eop == 0) { nbuf->fmp = sendmp; sendmp = NULL; mp->m_next = nbuf->m_pack; } } ++processed; /* Sending this frame? */ if (eop) { sendmp->m_pkthdr.rcvif = ifp; /* gather stats */ rxr->rx_packets++; rxr->rx_bytes += sendmp->m_pkthdr.len; /* capture data for dynamic ITR adjustment */ rxr->packets++; rxr->bytes += sendmp->m_pkthdr.len; if ((ifp->if_capenable & IFCAP_RXCSUM) != 0) ixl_rx_checksum(sendmp, status, error, ptype); #ifdef RSS sendmp->m_pkthdr.flowid = le32toh(cur->wb.qword0.hi_dword.rss); M_HASHTYPE_SET(sendmp, ixl_ptype_to_hash(ptype)); #else sendmp->m_pkthdr.flowid = que->msix; M_HASHTYPE_SET(sendmp, M_HASHTYPE_OPAQUE); #endif } next_desc: bus_dmamap_sync(rxr->dma.tag, rxr->dma.map, BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE); /* Advance our pointers to the next descriptor. */ if (++i == que->num_desc) i = 0; /* Now send to the stack or do LRO */ if (sendmp != NULL) { rxr->next_check = i; ixl_rx_input(rxr, ifp, sendmp, ptype); i = rxr->next_check; } /* Every 8 descriptors we go to refresh mbufs */ if (processed == 8) { ixl_refresh_mbufs(que, i); processed = 0; } } /* Refresh any remaining buf structs */ if (ixl_rx_unrefreshed(que)) ixl_refresh_mbufs(que, i); rxr->next_check = i; #if defined(INET6) || defined(INET) /* * Flush any outstanding LRO work */ tcp_lro_flush_all(lro); #endif IXL_RX_UNLOCK(rxr); return (FALSE); } /********************************************************************* * * Verify that the hardware indicated that the checksum is valid. * Inform the stack about the status of checksum so that stack * doesn't spend time verifying the checksum. * *********************************************************************/ static void ixl_rx_checksum(struct mbuf * mp, u32 status, u32 error, u8 ptype) { struct i40e_rx_ptype_decoded decoded; decoded = decode_rx_desc_ptype(ptype); /* Errors? */ if (error & ((1 << I40E_RX_DESC_ERROR_IPE_SHIFT) | (1 << I40E_RX_DESC_ERROR_L4E_SHIFT))) { mp->m_pkthdr.csum_flags = 0; return; } /* IPv6 with extension headers likely have bad csum */ if (decoded.outer_ip == I40E_RX_PTYPE_OUTER_IP && decoded.outer_ip_ver == I40E_RX_PTYPE_OUTER_IPV6) if (status & (1 << I40E_RX_DESC_STATUS_IPV6EXADD_SHIFT)) { mp->m_pkthdr.csum_flags = 0; return; } /* IP Checksum Good */ mp->m_pkthdr.csum_flags = CSUM_IP_CHECKED; mp->m_pkthdr.csum_flags |= CSUM_IP_VALID; if (status & (1 << I40E_RX_DESC_STATUS_L3L4P_SHIFT)) { mp->m_pkthdr.csum_flags |= (CSUM_DATA_VALID | CSUM_PSEUDO_HDR); mp->m_pkthdr.csum_data |= htons(0xffff); } return; } #if __FreeBSD_version >= 1100000 uint64_t ixl_get_counter(if_t ifp, ift_counter cnt) { struct ixl_vsi *vsi; vsi = if_getsoftc(ifp); switch (cnt) { case IFCOUNTER_IPACKETS: return (vsi->ipackets); case IFCOUNTER_IERRORS: return (vsi->ierrors); case IFCOUNTER_OPACKETS: return (vsi->opackets); case IFCOUNTER_OERRORS: return (vsi->oerrors); case IFCOUNTER_COLLISIONS: /* Collisions are by standard impossible in 40G/10G Ethernet */ return (0); case IFCOUNTER_IBYTES: return (vsi->ibytes); case IFCOUNTER_OBYTES: return (vsi->obytes); case IFCOUNTER_IMCASTS: return (vsi->imcasts); case IFCOUNTER_OMCASTS: return (vsi->omcasts); case IFCOUNTER_IQDROPS: return (vsi->iqdrops); case IFCOUNTER_OQDROPS: return (vsi->oqdrops); case IFCOUNTER_NOPROTO: return (vsi->noproto); default: return (if_get_counter_default(ifp, cnt)); } } #endif Index: projects/vnet/sys/dev/mlx5/driver.h =================================================================== --- projects/vnet/sys/dev/mlx5/driver.h (revision 301546) +++ projects/vnet/sys/dev/mlx5/driver.h (revision 301547) @@ -1,943 +1,944 @@ /*- * Copyright (c) 2013-2015, Mellanox Technologies, Ltd. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS `AS IS' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef MLX5_DRIVER_H #define MLX5_DRIVER_H #include #include #include #include #include +#include #include #include #include #include #include #include enum { MLX5_BOARD_ID_LEN = 64, MLX5_MAX_NAME_LEN = 16, }; enum { /* one minute for the sake of bringup. Generally, commands must always * complete and we may need to increase this timeout value */ MLX5_CMD_TIMEOUT_MSEC = 7200 * 1000, MLX5_CMD_WQ_MAX_NAME = 32, }; enum { CMD_OWNER_SW = 0x0, CMD_OWNER_HW = 0x1, CMD_STATUS_SUCCESS = 0, }; enum mlx5_sqp_t { MLX5_SQP_SMI = 0, MLX5_SQP_GSI = 1, MLX5_SQP_IEEE_1588 = 2, MLX5_SQP_SNIFFER = 3, MLX5_SQP_SYNC_UMR = 4, }; enum { MLX5_MAX_PORTS = 2, }; enum { MLX5_EQ_VEC_PAGES = 0, MLX5_EQ_VEC_CMD = 1, MLX5_EQ_VEC_ASYNC = 2, MLX5_EQ_VEC_COMP_BASE, }; enum { MLX5_MAX_IRQ_NAME = 32 }; enum { MLX5_ATOMIC_MODE_IB_COMP = 1 << 16, MLX5_ATOMIC_MODE_CX = 2 << 16, MLX5_ATOMIC_MODE_8B = 3 << 16, MLX5_ATOMIC_MODE_16B = 4 << 16, MLX5_ATOMIC_MODE_32B = 5 << 16, MLX5_ATOMIC_MODE_64B = 6 << 16, MLX5_ATOMIC_MODE_128B = 7 << 16, MLX5_ATOMIC_MODE_256B = 8 << 16, }; enum { MLX5_REG_QETCR = 0x4005, MLX5_REG_QPDP = 0x4007, MLX5_REG_QTCT = 0x400A, MLX5_REG_PCAP = 0x5001, MLX5_REG_PMTU = 0x5003, MLX5_REG_PTYS = 0x5004, MLX5_REG_PAOS = 0x5006, MLX5_REG_PFCC = 0x5007, MLX5_REG_PPCNT = 0x5008, MLX5_REG_PMAOS = 0x5012, MLX5_REG_PUDE = 0x5009, MLX5_REG_PPTB = 0x500B, MLX5_REG_PBMC = 0x500C, MLX5_REG_PMPE = 0x5010, MLX5_REG_PELC = 0x500e, MLX5_REG_PVLC = 0x500f, MLX5_REG_PMLP = 0x5002, MLX5_REG_NODE_DESC = 0x6001, MLX5_REG_HOST_ENDIANNESS = 0x7004, MLX5_REG_MCIA = 0x9014, }; enum dbg_rsc_type { MLX5_DBG_RSC_QP, MLX5_DBG_RSC_EQ, MLX5_DBG_RSC_CQ, }; struct mlx5_field_desc { struct dentry *dent; int i; }; struct mlx5_rsc_debug { struct mlx5_core_dev *dev; void *object; enum dbg_rsc_type type; struct dentry *root; struct mlx5_field_desc fields[0]; }; enum mlx5_dev_event { MLX5_DEV_EVENT_SYS_ERROR, MLX5_DEV_EVENT_PORT_UP, MLX5_DEV_EVENT_PORT_DOWN, MLX5_DEV_EVENT_PORT_INITIALIZED, MLX5_DEV_EVENT_LID_CHANGE, MLX5_DEV_EVENT_PKEY_CHANGE, MLX5_DEV_EVENT_GUID_CHANGE, MLX5_DEV_EVENT_CLIENT_REREG, MLX5_DEV_EVENT_VPORT_CHANGE, }; enum mlx5_port_status { MLX5_PORT_UP = 1 << 0, MLX5_PORT_DOWN = 1 << 1, }; enum mlx5_link_mode { MLX5_1000BASE_CX_SGMII = 0, MLX5_1000BASE_KX = 1, MLX5_10GBASE_CX4 = 2, MLX5_10GBASE_KX4 = 3, MLX5_10GBASE_KR = 4, MLX5_20GBASE_KR2 = 5, MLX5_40GBASE_CR4 = 6, MLX5_40GBASE_KR4 = 7, MLX5_56GBASE_R4 = 8, MLX5_10GBASE_CR = 12, MLX5_10GBASE_SR = 13, MLX5_10GBASE_ER = 14, MLX5_40GBASE_SR4 = 15, MLX5_40GBASE_LR4 = 16, MLX5_100GBASE_CR4 = 20, MLX5_100GBASE_SR4 = 21, MLX5_100GBASE_KR4 = 22, MLX5_100GBASE_LR4 = 23, MLX5_100BASE_TX = 24, MLX5_1000BASE_T = 25, MLX5_10GBASE_T = 26, MLX5_25GBASE_CR = 27, MLX5_25GBASE_KR = 28, MLX5_25GBASE_SR = 29, MLX5_50GBASE_CR2 = 30, MLX5_50GBASE_KR2 = 31, MLX5_LINK_MODES_NUMBER, }; #define MLX5_PROT_MASK(link_mode) (1 << link_mode) struct mlx5_uuar_info { struct mlx5_uar *uars; int num_uars; int num_low_latency_uuars; unsigned long *bitmap; unsigned int *count; struct mlx5_bf *bfs; /* * protect uuar allocation data structs */ struct mutex lock; u32 ver; }; struct mlx5_bf { void __iomem *reg; void __iomem *regreg; int buf_size; struct mlx5_uar *uar; unsigned long offset; int need_lock; /* protect blue flame buffer selection when needed */ spinlock_t lock; /* serialize 64 bit writes when done as two 32 bit accesses */ spinlock_t lock32; int uuarn; }; struct mlx5_cmd_first { __be32 data[4]; }; struct mlx5_cmd_msg { struct list_head list; struct cache_ent *cache; u32 len; struct mlx5_cmd_first first; struct mlx5_cmd_mailbox *next; }; struct mlx5_cmd_debug { struct dentry *dbg_root; struct dentry *dbg_in; struct dentry *dbg_out; struct dentry *dbg_outlen; struct dentry *dbg_status; struct dentry *dbg_run; void *in_msg; void *out_msg; u8 status; u16 inlen; u16 outlen; }; struct cache_ent { /* protect block chain allocations */ spinlock_t lock; struct list_head head; }; struct cmd_msg_cache { struct cache_ent large; struct cache_ent med; }; struct mlx5_cmd_stats { u64 sum; u64 n; struct dentry *root; struct dentry *avg; struct dentry *count; /* protect command average calculations */ spinlock_t lock; }; struct mlx5_cmd { void *cmd_alloc_buf; dma_addr_t alloc_dma; int alloc_size; void *cmd_buf; dma_addr_t dma; u16 cmdif_rev; u8 log_sz; u8 log_stride; int max_reg_cmds; int events; u32 __iomem *vector; /* protect command queue allocations */ spinlock_t alloc_lock; /* protect token allocations */ spinlock_t token_lock; u8 token; unsigned long bitmask; char wq_name[MLX5_CMD_WQ_MAX_NAME]; struct workqueue_struct *wq; struct semaphore sem; struct semaphore pages_sem; int mode; struct mlx5_cmd_work_ent *ent_arr[MLX5_MAX_COMMANDS]; struct pci_pool *pool; struct mlx5_cmd_debug dbg; struct cmd_msg_cache cache; int checksum_disabled; struct mlx5_cmd_stats stats[MLX5_CMD_OP_MAX]; int moving_to_polling; }; struct mlx5_port_caps { int gid_table_len; int pkey_table_len; u8 ext_port_cap; }; struct mlx5_cmd_mailbox { void *buf; dma_addr_t dma; struct mlx5_cmd_mailbox *next; }; struct mlx5_buf_list { void *buf; dma_addr_t map; }; struct mlx5_buf { struct mlx5_buf_list direct; struct mlx5_buf_list *page_list; int nbufs; int npages; int size; u8 page_shift; }; struct mlx5_eq { struct mlx5_core_dev *dev; __be32 __iomem *doorbell; u32 cons_index; struct mlx5_buf buf; int size; u8 irqn; u8 eqn; int nent; u64 mask; struct list_head list; int index; struct mlx5_rsc_debug *dbg; }; struct mlx5_core_psv { u32 psv_idx; struct psv_layout { u32 pd; u16 syndrome; u16 reserved; u16 bg; u16 app_tag; u32 ref_tag; } psv; }; struct mlx5_core_sig_ctx { struct mlx5_core_psv psv_memory; struct mlx5_core_psv psv_wire; #if (__FreeBSD_version >= 1100000) struct ib_sig_err err_item; #endif bool sig_status_checked; bool sig_err_exists; u32 sigerr_count; }; struct mlx5_core_mr { u64 iova; u64 size; u32 key; u32 pd; }; enum mlx5_res_type { MLX5_RES_QP, MLX5_RES_SRQ, MLX5_RES_XSRQ, }; struct mlx5_core_rsc_common { enum mlx5_res_type res; atomic_t refcount; struct completion free; }; struct mlx5_core_srq { struct mlx5_core_rsc_common common; /* must be first */ u32 srqn; int max; int max_gs; int max_avail_gather; int wqe_shift; void (*event)(struct mlx5_core_srq *, int); atomic_t refcount; struct completion free; }; struct mlx5_eq_table { void __iomem *update_ci; void __iomem *update_arm_ci; struct list_head comp_eqs_list; struct mlx5_eq pages_eq; struct mlx5_eq async_eq; struct mlx5_eq cmd_eq; int num_comp_vectors; /* protect EQs list */ spinlock_t lock; }; struct mlx5_uar { u32 index; struct list_head bf_list; unsigned free_bf_bmap; void __iomem *bf_map; void __iomem *map; }; struct mlx5_core_health { struct mlx5_health_buffer __iomem *health; __be32 __iomem *health_counter; struct timer_list timer; struct list_head list; u32 prev; int miss_counter; }; #define MLX5_CQ_LINEAR_ARRAY_SIZE 1024 struct mlx5_cq_linear_array_entry { spinlock_t lock; struct mlx5_core_cq * volatile cq; }; struct mlx5_cq_table { /* protect radix tree */ spinlock_t lock; struct radix_tree_root tree; struct mlx5_cq_linear_array_entry linear_array[MLX5_CQ_LINEAR_ARRAY_SIZE]; }; struct mlx5_qp_table { /* protect radix tree */ spinlock_t lock; struct radix_tree_root tree; }; struct mlx5_srq_table { /* protect radix tree */ spinlock_t lock; struct radix_tree_root tree; }; struct mlx5_mr_table { /* protect radix tree */ rwlock_t lock; struct radix_tree_root tree; }; struct mlx5_irq_info { char name[MLX5_MAX_IRQ_NAME]; }; struct mlx5_priv { char name[MLX5_MAX_NAME_LEN]; struct mlx5_eq_table eq_table; struct msix_entry *msix_arr; struct mlx5_irq_info *irq_info; struct mlx5_uuar_info uuari; MLX5_DECLARE_DOORBELL_LOCK(cq_uar_lock); struct io_mapping *bf_mapping; /* pages stuff */ struct workqueue_struct *pg_wq; struct rb_root page_root; int fw_pages; int reg_pages; struct list_head free_list; struct mlx5_core_health health; struct mlx5_srq_table srq_table; /* start: qp staff */ struct mlx5_qp_table qp_table; struct dentry *qp_debugfs; struct dentry *eq_debugfs; struct dentry *cq_debugfs; struct dentry *cmdif_debugfs; /* end: qp staff */ /* start: cq staff */ struct mlx5_cq_table cq_table; /* end: cq staff */ /* start: mr staff */ struct mlx5_mr_table mr_table; /* end: mr staff */ /* start: alloc staff */ int numa_node; struct mutex pgdir_mutex; struct list_head pgdir_list; /* end: alloc staff */ struct dentry *dbg_root; /* protect mkey key part */ spinlock_t mkey_lock; u8 mkey_key; struct list_head dev_list; struct list_head ctx_list; spinlock_t ctx_lock; }; struct mlx5_special_contexts { int resd_lkey; }; struct mlx5_core_dev { struct pci_dev *pdev; char board_id[MLX5_BOARD_ID_LEN]; struct mlx5_cmd cmd; struct mlx5_port_caps port_caps[MLX5_MAX_PORTS]; u32 hca_caps_cur[MLX5_CAP_NUM][MLX5_UN_SZ_DW(hca_cap_union)]; u32 hca_caps_max[MLX5_CAP_NUM][MLX5_UN_SZ_DW(hca_cap_union)]; struct mlx5_init_seg __iomem *iseg; void (*event) (struct mlx5_core_dev *dev, enum mlx5_dev_event event, unsigned long param); struct mlx5_priv priv; struct mlx5_profile *profile; atomic_t num_qps; u32 issi; struct mlx5_special_contexts special_contexts; unsigned int module_status[MLX5_MAX_PORTS]; }; enum { MLX5_WOL_DISABLE = 0, MLX5_WOL_SECURED_MAGIC = 1 << 1, MLX5_WOL_MAGIC = 1 << 2, MLX5_WOL_ARP = 1 << 3, MLX5_WOL_BROADCAST = 1 << 4, MLX5_WOL_MULTICAST = 1 << 5, MLX5_WOL_UNICAST = 1 << 6, MLX5_WOL_PHY_ACTIVITY = 1 << 7, }; struct mlx5_db { __be32 *db; union { struct mlx5_db_pgdir *pgdir; struct mlx5_ib_user_db_page *user_page; } u; dma_addr_t dma; int index; }; struct mlx5_net_counters { u64 packets; u64 octets; }; struct mlx5_ptys_reg { u8 local_port; u8 proto_mask; u32 eth_proto_cap; u16 ib_link_width_cap; u16 ib_proto_cap; u32 eth_proto_admin; u16 ib_link_width_admin; u16 ib_proto_admin; u32 eth_proto_oper; u16 ib_link_width_oper; u16 ib_proto_oper; u32 eth_proto_lp_advertise; }; struct mlx5_pvlc_reg { u8 local_port; u8 vl_hw_cap; u8 vl_admin; u8 vl_operational; }; struct mlx5_pmtu_reg { u8 local_port; u16 max_mtu; u16 admin_mtu; u16 oper_mtu; }; struct mlx5_vport_counters { struct mlx5_net_counters received_errors; struct mlx5_net_counters transmit_errors; struct mlx5_net_counters received_ib_unicast; struct mlx5_net_counters transmitted_ib_unicast; struct mlx5_net_counters received_ib_multicast; struct mlx5_net_counters transmitted_ib_multicast; struct mlx5_net_counters received_eth_broadcast; struct mlx5_net_counters transmitted_eth_broadcast; struct mlx5_net_counters received_eth_unicast; struct mlx5_net_counters transmitted_eth_unicast; struct mlx5_net_counters received_eth_multicast; struct mlx5_net_counters transmitted_eth_multicast; }; enum { MLX5_DB_PER_PAGE = PAGE_SIZE / L1_CACHE_BYTES, }; enum { MLX5_COMP_EQ_SIZE = 1024, }; enum { MLX5_PTYS_IB = 1 << 0, MLX5_PTYS_EN = 1 << 2, }; struct mlx5_db_pgdir { struct list_head list; DECLARE_BITMAP(bitmap, MLX5_DB_PER_PAGE); __be32 *db_page; dma_addr_t db_dma; }; typedef void (*mlx5_cmd_cbk_t)(int status, void *context); struct mlx5_cmd_work_ent { struct mlx5_cmd_msg *in; struct mlx5_cmd_msg *out; void *uout; int uout_size; mlx5_cmd_cbk_t callback; void *context; int idx; struct completion done; struct mlx5_cmd *cmd; struct work_struct work; struct mlx5_cmd_layout *lay; int ret; int page_queue; u8 status; u8 token; u64 ts1; u64 ts2; u16 op; }; struct mlx5_pas { u64 pa; u8 log_sz; }; static inline void *mlx5_buf_offset(struct mlx5_buf *buf, int offset) { if (likely(BITS_PER_LONG == 64 || buf->nbufs == 1)) return buf->direct.buf + offset; else return buf->page_list[offset >> PAGE_SHIFT].buf + (offset & (PAGE_SIZE - 1)); } extern struct workqueue_struct *mlx5_core_wq; #define STRUCT_FIELD(header, field) \ .struct_offset_bytes = offsetof(struct ib_unpacked_ ## header, field), \ .struct_size_bytes = sizeof((struct ib_unpacked_ ## header *)0)->field static inline struct mlx5_core_dev *pci2mlx5_core_dev(struct pci_dev *pdev) { return pci_get_drvdata(pdev); } extern struct dentry *mlx5_debugfs_root; static inline u16 fw_rev_maj(struct mlx5_core_dev *dev) { return ioread32be(&dev->iseg->fw_rev) & 0xffff; } static inline u16 fw_rev_min(struct mlx5_core_dev *dev) { return ioread32be(&dev->iseg->fw_rev) >> 16; } static inline u16 fw_rev_sub(struct mlx5_core_dev *dev) { return ioread32be(&dev->iseg->cmdif_rev_fw_sub) & 0xffff; } static inline u16 cmdif_rev_get(struct mlx5_core_dev *dev) { return ioread32be(&dev->iseg->cmdif_rev_fw_sub) >> 16; } static inline int mlx5_get_gid_table_len(u16 param) { if (param > 4) { printf("M4_CORE_DRV_NAME: WARN: ""gid table length is zero\n"); return 0; } return 8 * (1 << param); } static inline void *mlx5_vzalloc(unsigned long size) { void *rtn; rtn = kzalloc(size, GFP_KERNEL | __GFP_NOWARN); return rtn; } static inline u32 mlx5_base_mkey(const u32 key) { return key & 0xffffff00u; } int mlx5_cmd_init(struct mlx5_core_dev *dev); void mlx5_cmd_cleanup(struct mlx5_core_dev *dev); void mlx5_cmd_use_events(struct mlx5_core_dev *dev); void mlx5_cmd_use_polling(struct mlx5_core_dev *dev); int mlx5_cmd_status_to_err(struct mlx5_outbox_hdr *hdr); int mlx5_cmd_status_to_err_v2(void *ptr); int mlx5_core_get_caps(struct mlx5_core_dev *dev, enum mlx5_cap_type cap_type, enum mlx5_cap_mode cap_mode); int mlx5_cmd_exec(struct mlx5_core_dev *dev, void *in, int in_size, void *out, int out_size); int mlx5_cmd_exec_cb(struct mlx5_core_dev *dev, void *in, int in_size, void *out, int out_size, mlx5_cmd_cbk_t callback, void *context); int mlx5_cmd_alloc_uar(struct mlx5_core_dev *dev, u32 *uarn); int mlx5_cmd_free_uar(struct mlx5_core_dev *dev, u32 uarn); int mlx5_alloc_uuars(struct mlx5_core_dev *dev, struct mlx5_uuar_info *uuari); int mlx5_free_uuars(struct mlx5_core_dev *dev, struct mlx5_uuar_info *uuari); int mlx5_alloc_map_uar(struct mlx5_core_dev *mdev, struct mlx5_uar *uar); void mlx5_unmap_free_uar(struct mlx5_core_dev *mdev, struct mlx5_uar *uar); void mlx5_health_cleanup(void); void __init mlx5_health_init(void); void mlx5_start_health_poll(struct mlx5_core_dev *dev); void mlx5_stop_health_poll(struct mlx5_core_dev *dev); int mlx5_buf_alloc_node(struct mlx5_core_dev *dev, int size, int max_direct, struct mlx5_buf *buf, int node); int mlx5_buf_alloc(struct mlx5_core_dev *dev, int size, int max_direct, struct mlx5_buf *buf); void mlx5_buf_free(struct mlx5_core_dev *dev, struct mlx5_buf *buf); int mlx5_core_create_srq(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq, struct mlx5_create_srq_mbox_in *in, int inlen, int is_xrc); int mlx5_core_destroy_srq(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq); int mlx5_core_query_srq(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq, struct mlx5_query_srq_mbox_out *out); int mlx5_core_query_vendor_id(struct mlx5_core_dev *mdev, u32 *vendor_id); int mlx5_core_arm_srq(struct mlx5_core_dev *dev, struct mlx5_core_srq *srq, u16 lwm, int is_srq); void mlx5_init_mr_table(struct mlx5_core_dev *dev); void mlx5_cleanup_mr_table(struct mlx5_core_dev *dev); int mlx5_core_create_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr, struct mlx5_create_mkey_mbox_in *in, int inlen, mlx5_cmd_cbk_t callback, void *context, struct mlx5_create_mkey_mbox_out *out); int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr); int mlx5_core_query_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr, struct mlx5_query_mkey_mbox_out *out, int outlen); int mlx5_core_dump_fill_mkey(struct mlx5_core_dev *dev, struct mlx5_core_mr *mr, u32 *mkey); int mlx5_core_alloc_pd(struct mlx5_core_dev *dev, u32 *pdn); int mlx5_core_dealloc_pd(struct mlx5_core_dev *dev, u32 pdn); int mlx5_core_mad_ifc(struct mlx5_core_dev *dev, void *inb, void *outb, u16 opmod, u8 port); void mlx5_pagealloc_init(struct mlx5_core_dev *dev); void mlx5_pagealloc_cleanup(struct mlx5_core_dev *dev); int mlx5_pagealloc_start(struct mlx5_core_dev *dev); void mlx5_pagealloc_stop(struct mlx5_core_dev *dev); void mlx5_core_req_pages_handler(struct mlx5_core_dev *dev, u16 func_id, s32 npages); int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot); int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev); void mlx5_register_debugfs(void); void mlx5_unregister_debugfs(void); int mlx5_eq_init(struct mlx5_core_dev *dev); void mlx5_eq_cleanup(struct mlx5_core_dev *dev); void mlx5_fill_page_array(struct mlx5_buf *buf, __be64 *pas); void mlx5_cq_completion(struct mlx5_core_dev *dev, u32 cqn); void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int event_type); void mlx5_srq_event(struct mlx5_core_dev *dev, u32 srqn, int event_type); struct mlx5_core_srq *mlx5_core_get_srq(struct mlx5_core_dev *dev, u32 srqn); void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, unsigned long vector); void mlx5_cq_event(struct mlx5_core_dev *dev, u32 cqn, int event_type); int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx, int nent, u64 mask, const char *name, struct mlx5_uar *uar); int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq); int mlx5_start_eqs(struct mlx5_core_dev *dev); int mlx5_stop_eqs(struct mlx5_core_dev *dev); int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, int *eqn, int *irqn); int mlx5_core_attach_mcg(struct mlx5_core_dev *dev, union ib_gid *mgid, u32 qpn); int mlx5_core_detach_mcg(struct mlx5_core_dev *dev, union ib_gid *mgid, u32 qpn); int mlx5_qp_debugfs_init(struct mlx5_core_dev *dev); void mlx5_qp_debugfs_cleanup(struct mlx5_core_dev *dev); int mlx5_core_access_reg(struct mlx5_core_dev *dev, void *data_in, int size_in, void *data_out, int size_out, u16 reg_num, int arg, int write); int mlx5_set_port_caps(struct mlx5_core_dev *dev, u8 port_num, u32 caps); int mlx5_query_port_ptys(struct mlx5_core_dev *dev, u32 *ptys, int ptys_size, int proto_mask); int mlx5_query_port_proto_cap(struct mlx5_core_dev *dev, u32 *proto_cap, int proto_mask); int mlx5_query_port_proto_admin(struct mlx5_core_dev *dev, u32 *proto_admin, int proto_mask); int mlx5_set_port_proto(struct mlx5_core_dev *dev, u32 proto_admin, int proto_mask); int mlx5_set_port_status(struct mlx5_core_dev *dev, enum mlx5_port_status status); int mlx5_query_port_status(struct mlx5_core_dev *dev, u8 *status); int mlx5_set_port_pause(struct mlx5_core_dev *dev, u32 port, u32 rx_pause, u32 tx_pause); int mlx5_query_port_pause(struct mlx5_core_dev *dev, u32 port, u32 *rx_pause, u32 *tx_pause); int mlx5_set_port_mtu(struct mlx5_core_dev *dev, int mtu); int mlx5_query_port_max_mtu(struct mlx5_core_dev *dev, int *max_mtu); int mlx5_query_port_oper_mtu(struct mlx5_core_dev *dev, int *oper_mtu); unsigned int mlx5_query_module_status(struct mlx5_core_dev *dev, int module_num); int mlx5_query_module_num(struct mlx5_core_dev *dev, int *module_num); int mlx5_query_eeprom(struct mlx5_core_dev *dev, int i2c_addr, int page_num, int device_addr, int size, int module_num, u32 *data, int *size_read); int mlx5_debug_eq_add(struct mlx5_core_dev *dev, struct mlx5_eq *eq); void mlx5_debug_eq_remove(struct mlx5_core_dev *dev, struct mlx5_eq *eq); int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct mlx5_eq *eq, struct mlx5_query_eq_mbox_out *out, int outlen); int mlx5_eq_debugfs_init(struct mlx5_core_dev *dev); void mlx5_eq_debugfs_cleanup(struct mlx5_core_dev *dev); int mlx5_cq_debugfs_init(struct mlx5_core_dev *dev); void mlx5_cq_debugfs_cleanup(struct mlx5_core_dev *dev); int mlx5_db_alloc(struct mlx5_core_dev *dev, struct mlx5_db *db); int mlx5_db_alloc_node(struct mlx5_core_dev *dev, struct mlx5_db *db, int node); void mlx5_db_free(struct mlx5_core_dev *dev, struct mlx5_db *db); const char *mlx5_command_str(int command); int mlx5_cmdif_debugfs_init(struct mlx5_core_dev *dev); void mlx5_cmdif_debugfs_cleanup(struct mlx5_core_dev *dev); int mlx5_core_create_psv(struct mlx5_core_dev *dev, u32 pdn, int npsvs, u32 *sig_index); int mlx5_core_destroy_psv(struct mlx5_core_dev *dev, int psv_num); void mlx5_core_put_rsc(struct mlx5_core_rsc_common *common); u8 mlx5_is_wol_supported(struct mlx5_core_dev *dev); int mlx5_set_wol(struct mlx5_core_dev *dev, u8 wol_mode); int mlx5_query_wol(struct mlx5_core_dev *dev, u8 *wol_mode); int mlx5_core_access_pvlc(struct mlx5_core_dev *dev, struct mlx5_pvlc_reg *pvlc, int write); int mlx5_core_access_ptys(struct mlx5_core_dev *dev, struct mlx5_ptys_reg *ptys, int write); int mlx5_core_access_pmtu(struct mlx5_core_dev *dev, struct mlx5_pmtu_reg *pmtu, int write); int mlx5_vxlan_udp_port_add(struct mlx5_core_dev *dev, u16 port); int mlx5_vxlan_udp_port_delete(struct mlx5_core_dev *dev, u16 port); int mlx5_query_port_cong_status(struct mlx5_core_dev *mdev, int protocol, int priority, int *is_enable); int mlx5_modify_port_cong_status(struct mlx5_core_dev *mdev, int protocol, int priority, int enable); int mlx5_query_port_cong_params(struct mlx5_core_dev *mdev, int protocol, void *out, int out_size); int mlx5_modify_port_cong_params(struct mlx5_core_dev *mdev, void *in, int in_size); int mlx5_query_port_cong_statistics(struct mlx5_core_dev *mdev, int clear, void *out, int out_size); static inline u32 mlx5_mkey_to_idx(u32 mkey) { return mkey >> 8; } static inline u32 mlx5_idx_to_mkey(u32 mkey_idx) { return mkey_idx << 8; } static inline u8 mlx5_mkey_variant(u32 mkey) { return mkey & 0xff; } enum { MLX5_PROF_MASK_QP_SIZE = (u64)1 << 0, MLX5_PROF_MASK_MR_CACHE = (u64)1 << 1, }; enum { MAX_MR_CACHE_ENTRIES = 16, }; enum { MLX5_INTERFACE_PROTOCOL_IB = 0, MLX5_INTERFACE_PROTOCOL_ETH = 1, }; struct mlx5_interface { void * (*add)(struct mlx5_core_dev *dev); void (*remove)(struct mlx5_core_dev *dev, void *context); void (*event)(struct mlx5_core_dev *dev, void *context, enum mlx5_dev_event event, unsigned long param); void * (*get_dev)(void *context); int protocol; struct list_head list; }; void *mlx5_get_protocol_dev(struct mlx5_core_dev *mdev, int protocol); int mlx5_register_interface(struct mlx5_interface *intf); void mlx5_unregister_interface(struct mlx5_interface *intf); struct mlx5_profile { u64 mask; u8 log_max_qp; struct { int size; int limit; } mr_cache[MAX_MR_CACHE_ENTRIES]; }; #define MLX5_EEPROM_MAX_BYTES 32 #define MLX5_EEPROM_IDENTIFIER_BYTE_MASK 0x000000ff #define MLX5_EEPROM_REVISION_ID_BYTE_MASK 0x0000ff00 #define MLX5_EEPROM_PAGE_3_VALID_BIT_MASK 0x00040000 #endif /* MLX5_DRIVER_H */ Index: projects/vnet/sys/dev/mlx5/mlx5_core/mlx5_vport.c =================================================================== --- projects/vnet/sys/dev/mlx5/mlx5_core/mlx5_vport.c (revision 301546) +++ projects/vnet/sys/dev/mlx5/mlx5_core/mlx5_vport.c (revision 301547) @@ -1,922 +1,1280 @@ /*- * Copyright (c) 2013-2015, Mellanox Technologies, Ltd. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS `AS IS' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include #include #include #include "mlx5_core.h" u8 mlx5_query_vport_state(struct mlx5_core_dev *mdev, u8 opmod) { u32 in[MLX5_ST_SZ_DW(query_vport_state_in)]; u32 out[MLX5_ST_SZ_DW(query_vport_state_out)]; int err; memset(in, 0, sizeof(in)); MLX5_SET(query_vport_state_in, in, opcode, MLX5_CMD_OP_QUERY_VPORT_STATE); MLX5_SET(query_vport_state_in, in, op_mod, opmod); err = mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, sizeof(out)); if (err) mlx5_core_warn(mdev, "MLX5_CMD_OP_QUERY_VPORT_STATE failed\n"); return MLX5_GET(query_vport_state_out, out, state); } EXPORT_SYMBOL_GPL(mlx5_query_vport_state); static int mlx5_query_nic_vport_context(struct mlx5_core_dev *mdev, u32 vport, u32 *out, int outlen) { u32 in[MLX5_ST_SZ_DW(query_nic_vport_context_in)]; memset(in, 0, sizeof(in)); MLX5_SET(query_nic_vport_context_in, in, opcode, MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT); MLX5_SET(query_nic_vport_context_in, in, vport_number, vport); if (vport) MLX5_SET(query_nic_vport_context_in, in, other_vport, 1); return mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, outlen); } int mlx5_vport_alloc_q_counter(struct mlx5_core_dev *mdev, int *counter_set_id) { u32 in[MLX5_ST_SZ_DW(alloc_q_counter_in)]; u32 out[MLX5_ST_SZ_DW(alloc_q_counter_in)]; int err; memset(in, 0, sizeof(in)); memset(out, 0, sizeof(out)); MLX5_SET(alloc_q_counter_in, in, opcode, MLX5_CMD_OP_ALLOC_Q_COUNTER); err = mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, sizeof(out)); if (err) return err; *counter_set_id = MLX5_GET(alloc_q_counter_out, out, counter_set_id); return err; } int mlx5_vport_dealloc_q_counter(struct mlx5_core_dev *mdev, int counter_set_id) { u32 in[MLX5_ST_SZ_DW(dealloc_q_counter_in)]; u32 out[MLX5_ST_SZ_DW(dealloc_q_counter_out)]; memset(in, 0, sizeof(in)); memset(out, 0, sizeof(out)); MLX5_SET(dealloc_q_counter_in, in, opcode, MLX5_CMD_OP_DEALLOC_Q_COUNTER); MLX5_SET(dealloc_q_counter_in, in, counter_set_id, counter_set_id); return mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, sizeof(out)); } static int mlx5_vport_query_q_counter(struct mlx5_core_dev *mdev, int counter_set_id, int reset, void *out, int out_size) { u32 in[MLX5_ST_SZ_DW(query_q_counter_in)]; memset(in, 0, sizeof(in)); MLX5_SET(query_q_counter_in, in, opcode, MLX5_CMD_OP_QUERY_Q_COUNTER); MLX5_SET(query_q_counter_in, in, clear, reset); MLX5_SET(query_q_counter_in, in, counter_set_id, counter_set_id); return mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, out_size); } int mlx5_vport_query_out_of_rx_buffer(struct mlx5_core_dev *mdev, int counter_set_id, u32 *out_of_rx_buffer) { u32 out[MLX5_ST_SZ_DW(query_q_counter_out)]; int err; memset(out, 0, sizeof(out)); err = mlx5_vport_query_q_counter(mdev, counter_set_id, 0, out, sizeof(out)); if (err) return err; *out_of_rx_buffer = MLX5_GET(query_q_counter_out, out, out_of_buffer); return err; } int mlx5_query_nic_vport_mac_address(struct mlx5_core_dev *mdev, u32 vport, u8 *addr) { u32 *out; int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out); u8 *out_addr; int err; out = mlx5_vzalloc(outlen); if (!out) return -ENOMEM; out_addr = MLX5_ADDR_OF(query_nic_vport_context_out, out, nic_vport_context.permanent_address); err = mlx5_query_nic_vport_context(mdev, vport, out, outlen); if (err) goto out; ether_addr_copy(addr, &out_addr[2]); out: kvfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_mac_address); int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev, u64 *system_image_guid) { u32 *out; int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out); int err; out = mlx5_vzalloc(outlen); if (!out) return -ENOMEM; err = mlx5_query_nic_vport_context(mdev, 0, out, outlen); if (err) goto out; *system_image_guid = MLX5_GET64(query_nic_vport_context_out, out, nic_vport_context.system_image_guid); out: kvfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_system_image_guid); int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid) { u32 *out; int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out); int err; out = mlx5_vzalloc(outlen); if (!out) return -ENOMEM; err = mlx5_query_nic_vport_context(mdev, 0, out, outlen); if (err) goto out; *node_guid = MLX5_GET64(query_nic_vport_context_out, out, nic_vport_context.node_guid); out: kvfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_node_guid); int mlx5_query_nic_vport_port_guid(struct mlx5_core_dev *mdev, u64 *port_guid) { u32 *out; int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out); int err; out = mlx5_vzalloc(outlen); if (!out) return -ENOMEM; err = mlx5_query_nic_vport_context(mdev, 0, out, outlen); if (err) goto out; *port_guid = MLX5_GET64(query_nic_vport_context_out, out, nic_vport_context.port_guid); out: kvfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_port_guid); int mlx5_query_nic_vport_qkey_viol_cntr(struct mlx5_core_dev *mdev, u16 *qkey_viol_cntr) { u32 *out; int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out); int err; out = mlx5_vzalloc(outlen); if (!out) return -ENOMEM; err = mlx5_query_nic_vport_context(mdev, 0, out, outlen); if (err) goto out; *qkey_viol_cntr = MLX5_GET(query_nic_vport_context_out, out, nic_vport_context.qkey_violation_counter); out: kvfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_qkey_viol_cntr); static int mlx5_modify_nic_vport_context(struct mlx5_core_dev *mdev, void *in, int inlen) { u32 out[MLX5_ST_SZ_DW(modify_nic_vport_context_out)]; MLX5_SET(modify_nic_vport_context_in, in, opcode, MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT); memset(out, 0, sizeof(out)); return mlx5_cmd_exec_check_status(mdev, in, inlen, out, sizeof(out)); } static int mlx5_nic_vport_enable_disable_roce(struct mlx5_core_dev *mdev, int enable_disable) { void *in; int inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in); int err; in = mlx5_vzalloc(inlen); if (!in) { mlx5_core_warn(mdev, "failed to allocate inbox\n"); return -ENOMEM; } MLX5_SET(modify_nic_vport_context_in, in, field_select.roce_en, 1); MLX5_SET(modify_nic_vport_context_in, in, nic_vport_context.roce_en, enable_disable); err = mlx5_modify_nic_vport_context(mdev, in, inlen); kvfree(in); return err; } int mlx5_set_nic_vport_current_mac(struct mlx5_core_dev *mdev, int vport, bool other_vport, u8 *addr) { void *in; int inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in) + MLX5_ST_SZ_BYTES(mac_address_layout); u8 *mac_layout; u8 *mac_ptr; int err; in = mlx5_vzalloc(inlen); if (!in) { mlx5_core_warn(mdev, "failed to allocate inbox\n"); return -ENOMEM; } MLX5_SET(modify_nic_vport_context_in, in, opcode, MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT); MLX5_SET(modify_nic_vport_context_in, in, vport_number, vport); MLX5_SET(modify_nic_vport_context_in, in, other_vport, other_vport); MLX5_SET(modify_nic_vport_context_in, in, field_select.addresses_list, 1); MLX5_SET(modify_nic_vport_context_in, in, nic_vport_context.allowed_list_type, MLX5_NIC_VPORT_LIST_TYPE_UC); MLX5_SET(modify_nic_vport_context_in, in, nic_vport_context.allowed_list_size, 1); mac_layout = (u8 *)MLX5_ADDR_OF(modify_nic_vport_context_in, in, nic_vport_context.current_uc_mac_address); mac_ptr = (u8 *)MLX5_ADDR_OF(mac_address_layout, mac_layout, mac_addr_47_32); ether_addr_copy(mac_ptr, addr); err = mlx5_modify_nic_vport_context(mdev, in, inlen); kvfree(in); return err; } EXPORT_SYMBOL_GPL(mlx5_set_nic_vport_current_mac); int mlx5_set_nic_vport_vlan_list(struct mlx5_core_dev *dev, u32 vport, u16 *vlan_list, int list_len) { void *in, *ctx; int i, err; int inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in) + MLX5_ST_SZ_BYTES(vlan_layout) * (int)list_len; int max_list_size = 1 << MLX5_CAP_GEN_MAX(dev, log_max_vlan_list); if (list_len > max_list_size) { mlx5_core_warn(dev, "Requested list size (%d) > (%d) max_list_size\n", list_len, max_list_size); return -ENOSPC; } in = mlx5_vzalloc(inlen); if (!in) { mlx5_core_warn(dev, "failed to allocate inbox\n"); return -ENOMEM; } MLX5_SET(modify_nic_vport_context_in, in, vport_number, vport); if (vport) MLX5_SET(modify_nic_vport_context_in, in, other_vport, 1); MLX5_SET(modify_nic_vport_context_in, in, field_select.addresses_list, 1); ctx = MLX5_ADDR_OF(modify_nic_vport_context_in, in, nic_vport_context); MLX5_SET(nic_vport_context, ctx, allowed_list_type, MLX5_NIC_VPORT_LIST_TYPE_VLAN); MLX5_SET(nic_vport_context, ctx, allowed_list_size, list_len); for (i = 0; i < list_len; i++) { u8 *vlan_lout = MLX5_ADDR_OF(nic_vport_context, ctx, current_uc_mac_address[i]); MLX5_SET(vlan_layout, vlan_lout, vlan, vlan_list[i]); } err = mlx5_modify_nic_vport_context(dev, in, inlen); kvfree(in); return err; } EXPORT_SYMBOL_GPL(mlx5_set_nic_vport_vlan_list); int mlx5_set_nic_vport_mc_list(struct mlx5_core_dev *mdev, int vport, u64 *addr_list, size_t addr_list_len) { void *in, *ctx; int inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in) + MLX5_ST_SZ_BYTES(mac_address_layout) * (int)addr_list_len; int err; size_t i; int max_list_sz = 1 << MLX5_CAP_GEN_MAX(mdev, log_max_current_mc_list); if ((int)addr_list_len > max_list_sz) { mlx5_core_warn(mdev, "Requested list size (%d) > (%d) max_list_size\n", (int)addr_list_len, max_list_sz); return -ENOSPC; } in = mlx5_vzalloc(inlen); if (!in) { mlx5_core_warn(mdev, "failed to allocate inbox\n"); return -ENOMEM; } MLX5_SET(modify_nic_vport_context_in, in, vport_number, vport); if (vport) MLX5_SET(modify_nic_vport_context_in, in, other_vport, 1); MLX5_SET(modify_nic_vport_context_in, in, field_select.addresses_list, 1); ctx = MLX5_ADDR_OF(modify_nic_vport_context_in, in, nic_vport_context); MLX5_SET(nic_vport_context, ctx, allowed_list_type, MLX5_NIC_VPORT_LIST_TYPE_MC); MLX5_SET(nic_vport_context, ctx, allowed_list_size, addr_list_len); for (i = 0; i < addr_list_len; i++) { u8 *mac_lout = (u8 *)MLX5_ADDR_OF(nic_vport_context, ctx, current_uc_mac_address[i]); u8 *mac_ptr = (u8 *)MLX5_ADDR_OF(mac_address_layout, mac_lout, mac_addr_47_32); ether_addr_copy(mac_ptr, (u8 *)&addr_list[i]); } err = mlx5_modify_nic_vport_context(mdev, in, inlen); kvfree(in); return err; } EXPORT_SYMBOL_GPL(mlx5_set_nic_vport_mc_list); int mlx5_set_nic_vport_promisc(struct mlx5_core_dev *mdev, int vport, bool promisc_mc, bool promisc_uc, bool promisc_all) { u8 in[MLX5_ST_SZ_BYTES(modify_nic_vport_context_in)]; u8 *ctx = MLX5_ADDR_OF(modify_nic_vport_context_in, in, nic_vport_context); memset(in, 0, MLX5_ST_SZ_BYTES(modify_nic_vport_context_in)); MLX5_SET(modify_nic_vport_context_in, in, vport_number, vport); if (vport) MLX5_SET(modify_nic_vport_context_in, in, other_vport, 1); MLX5_SET(modify_nic_vport_context_in, in, field_select.promisc, 1); if (promisc_mc) MLX5_SET(nic_vport_context, ctx, promisc_mc, 1); if (promisc_uc) MLX5_SET(nic_vport_context, ctx, promisc_uc, 1); if (promisc_all) MLX5_SET(nic_vport_context, ctx, promisc_all, 1); return mlx5_modify_nic_vport_context(mdev, in, sizeof(in)); } EXPORT_SYMBOL_GPL(mlx5_set_nic_vport_promisc); + +int mlx5_query_nic_vport_mac_list(struct mlx5_core_dev *dev, + u32 vport, + enum mlx5_list_type list_type, + u8 addr_list[][ETH_ALEN], + int *list_size) +{ + u32 in[MLX5_ST_SZ_DW(query_nic_vport_context_in)]; + void *nic_vport_ctx; + int max_list_size; + int req_list_size; + u8 *mac_addr; + int out_sz; + void *out; + int err; + int i; + + req_list_size = *list_size; + + max_list_size = (list_type == MLX5_NIC_VPORT_LIST_TYPE_UC) ? + 1 << MLX5_CAP_GEN_MAX(dev, log_max_current_uc_list) : + 1 << MLX5_CAP_GEN_MAX(dev, log_max_current_mc_list); + + if (req_list_size > max_list_size) { + mlx5_core_warn(dev, "Requested list size (%d) > (%d) max_list_size\n", + req_list_size, max_list_size); + req_list_size = max_list_size; + } + + out_sz = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in) + + req_list_size * MLX5_ST_SZ_BYTES(mac_address_layout); + + memset(in, 0, sizeof(in)); + out = kzalloc(out_sz, GFP_KERNEL); + if (!out) + return -ENOMEM; + + MLX5_SET(query_nic_vport_context_in, in, opcode, + MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT); + MLX5_SET(query_nic_vport_context_in, in, allowed_list_type, list_type); + MLX5_SET(query_nic_vport_context_in, in, vport_number, vport); + + if (vport) + MLX5_SET(query_nic_vport_context_in, in, other_vport, 1); + + err = mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, out_sz); + if (err) + goto out; + + nic_vport_ctx = MLX5_ADDR_OF(query_nic_vport_context_out, out, + nic_vport_context); + req_list_size = MLX5_GET(nic_vport_context, nic_vport_ctx, + allowed_list_size); + + *list_size = req_list_size; + for (i = 0; i < req_list_size; i++) { + mac_addr = MLX5_ADDR_OF(nic_vport_context, + nic_vport_ctx, + current_uc_mac_address[i]) + 2; + ether_addr_copy(addr_list[i], mac_addr); + } +out: + kfree(out); + return err; +} +EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_mac_list); + +int mlx5_modify_nic_vport_mac_list(struct mlx5_core_dev *dev, + enum mlx5_list_type list_type, + u8 addr_list[][ETH_ALEN], + int list_size) +{ + u32 out[MLX5_ST_SZ_DW(modify_nic_vport_context_out)]; + void *nic_vport_ctx; + int max_list_size; + int in_sz; + void *in; + int err; + int i; + + max_list_size = list_type == MLX5_NIC_VPORT_LIST_TYPE_UC ? + 1 << MLX5_CAP_GEN(dev, log_max_current_uc_list) : + 1 << MLX5_CAP_GEN(dev, log_max_current_mc_list); + + if (list_size > max_list_size) + return -ENOSPC; + + in_sz = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in) + + list_size * MLX5_ST_SZ_BYTES(mac_address_layout); + + memset(out, 0, sizeof(out)); + in = kzalloc(in_sz, GFP_KERNEL); + if (!in) + return -ENOMEM; + + MLX5_SET(modify_nic_vport_context_in, in, opcode, + MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT); + MLX5_SET(modify_nic_vport_context_in, in, + field_select.addresses_list, 1); + + nic_vport_ctx = MLX5_ADDR_OF(modify_nic_vport_context_in, in, + nic_vport_context); + + MLX5_SET(nic_vport_context, nic_vport_ctx, + allowed_list_type, list_type); + MLX5_SET(nic_vport_context, nic_vport_ctx, + allowed_list_size, list_size); + + for (i = 0; i < list_size; i++) { + u8 *curr_mac = MLX5_ADDR_OF(nic_vport_context, + nic_vport_ctx, + current_uc_mac_address[i]) + 2; + ether_addr_copy(curr_mac, addr_list[i]); + } + + err = mlx5_cmd_exec_check_status(dev, in, in_sz, out, sizeof(out)); + kfree(in); + return err; +} +EXPORT_SYMBOL_GPL(mlx5_modify_nic_vport_mac_list); + +int mlx5_query_nic_vport_vlan_list(struct mlx5_core_dev *dev, + u32 vport, + u16 *vlan_list, + int *list_size) +{ + u32 in[MLX5_ST_SZ_DW(query_nic_vport_context_in)]; + void *nic_vport_ctx; + int max_list_size; + int req_list_size; + int out_sz; + void *out; + void *vlan_addr; + int err; + int i; + + req_list_size = *list_size; + + max_list_size = 1 << MLX5_CAP_GEN_MAX(dev, log_max_vlan_list); + + if (req_list_size > max_list_size) { + mlx5_core_warn(dev, "Requested list size (%d) > (%d) max_list_size\n", + req_list_size, max_list_size); + req_list_size = max_list_size; + } + + out_sz = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in) + + req_list_size * MLX5_ST_SZ_BYTES(vlan_layout); + + memset(in, 0, sizeof(in)); + out = kzalloc(out_sz, GFP_KERNEL); + if (!out) + return -ENOMEM; + + MLX5_SET(query_nic_vport_context_in, in, opcode, + MLX5_CMD_OP_QUERY_NIC_VPORT_CONTEXT); + MLX5_SET(query_nic_vport_context_in, in, allowed_list_type, + MLX5_NIC_VPORT_CONTEXT_ALLOWED_LIST_TYPE_VLAN_LIST); + MLX5_SET(query_nic_vport_context_in, in, vport_number, vport); + + if (vport) + MLX5_SET(query_nic_vport_context_in, in, other_vport, 1); + + err = mlx5_cmd_exec_check_status(dev, in, sizeof(in), out, out_sz); + if (err) + goto out; + + nic_vport_ctx = MLX5_ADDR_OF(query_nic_vport_context_out, out, + nic_vport_context); + req_list_size = MLX5_GET(nic_vport_context, nic_vport_ctx, + allowed_list_size); + + *list_size = req_list_size; + for (i = 0; i < req_list_size; i++) { + vlan_addr = MLX5_ADDR_OF(nic_vport_context, nic_vport_ctx, + current_uc_mac_address[i]); + vlan_list[i] = MLX5_GET(vlan_layout, vlan_addr, vlan); + } +out: + kfree(out); + return err; +} +EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_vlan_list); + +int mlx5_modify_nic_vport_vlans(struct mlx5_core_dev *dev, + u16 vlans[], + int list_size) +{ + u32 out[MLX5_ST_SZ_DW(modify_nic_vport_context_out)]; + void *nic_vport_ctx; + int max_list_size; + int in_sz; + void *in; + int err; + int i; + + max_list_size = 1 << MLX5_CAP_GEN(dev, log_max_vlan_list); + + if (list_size > max_list_size) + return -ENOSPC; + + in_sz = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in) + + list_size * MLX5_ST_SZ_BYTES(vlan_layout); + + memset(out, 0, sizeof(out)); + in = kzalloc(in_sz, GFP_KERNEL); + if (!in) + return -ENOMEM; + + MLX5_SET(modify_nic_vport_context_in, in, opcode, + MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT); + MLX5_SET(modify_nic_vport_context_in, in, + field_select.addresses_list, 1); + + nic_vport_ctx = MLX5_ADDR_OF(modify_nic_vport_context_in, in, + nic_vport_context); + + MLX5_SET(nic_vport_context, nic_vport_ctx, + allowed_list_type, MLX5_NIC_VPORT_LIST_TYPE_VLAN); + MLX5_SET(nic_vport_context, nic_vport_ctx, + allowed_list_size, list_size); + + for (i = 0; i < list_size; i++) { + void *vlan_addr = MLX5_ADDR_OF(nic_vport_context, + nic_vport_ctx, + current_uc_mac_address[i]); + MLX5_SET(vlan_layout, vlan_addr, vlan, vlans[i]); + } + + err = mlx5_cmd_exec_check_status(dev, in, in_sz, out, sizeof(out)); + kfree(in); + return err; +} +EXPORT_SYMBOL_GPL(mlx5_modify_nic_vport_vlans); + int mlx5_set_nic_vport_permanent_mac(struct mlx5_core_dev *mdev, int vport, u8 *addr) { void *in; int inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in); u8 *mac_ptr; int err; in = mlx5_vzalloc(inlen); if (!in) { mlx5_core_warn(mdev, "failed to allocate inbox\n"); return -ENOMEM; } MLX5_SET(modify_nic_vport_context_in, in, opcode, MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT); MLX5_SET(modify_nic_vport_context_in, in, vport_number, vport); MLX5_SET(modify_nic_vport_context_in, in, other_vport, 1); MLX5_SET(modify_nic_vport_context_in, in, field_select.permanent_address, 1); mac_ptr = (u8 *)MLX5_ADDR_OF(modify_nic_vport_context_in, in, nic_vport_context.permanent_address.mac_addr_47_32); ether_addr_copy(mac_ptr, addr); err = mlx5_modify_nic_vport_context(mdev, in, inlen); kvfree(in); return err; } EXPORT_SYMBOL_GPL(mlx5_set_nic_vport_permanent_mac); int mlx5_nic_vport_enable_roce(struct mlx5_core_dev *mdev) { return mlx5_nic_vport_enable_disable_roce(mdev, 1); } EXPORT_SYMBOL_GPL(mlx5_nic_vport_enable_roce); int mlx5_nic_vport_disable_roce(struct mlx5_core_dev *mdev) { return mlx5_nic_vport_enable_disable_roce(mdev, 0); } EXPORT_SYMBOL_GPL(mlx5_nic_vport_disable_roce); int mlx5_query_hca_vport_context(struct mlx5_core_dev *mdev, u8 port_num, u8 vport_num, u32 *out, int outlen) { u32 in[MLX5_ST_SZ_DW(query_hca_vport_context_in)]; int is_group_manager; is_group_manager = MLX5_CAP_GEN(mdev, vport_group_manager); memset(in, 0, sizeof(in)); MLX5_SET(query_hca_vport_context_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_VPORT_CONTEXT); if (vport_num) { if (is_group_manager) { MLX5_SET(query_hca_vport_context_in, in, other_vport, 1); MLX5_SET(query_hca_vport_context_in, in, vport_number, vport_num); } else { return -EPERM; } } if (MLX5_CAP_GEN(mdev, num_ports) == 2) MLX5_SET(query_hca_vport_context_in, in, port_num, port_num); return mlx5_cmd_exec_check_status(mdev, in, sizeof(in), out, outlen); } int mlx5_query_hca_vport_system_image_guid(struct mlx5_core_dev *mdev, u64 *system_image_guid) { u32 *out; int outlen = MLX5_ST_SZ_BYTES(query_hca_vport_context_out); int err; out = mlx5_vzalloc(outlen); if (!out) return -ENOMEM; err = mlx5_query_hca_vport_context(mdev, 1, 0, out, outlen); if (err) goto out; *system_image_guid = MLX5_GET64(query_hca_vport_context_out, out, hca_vport_context.system_image_guid); out: kvfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_hca_vport_system_image_guid); int mlx5_query_hca_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid) { u32 *out; int outlen = MLX5_ST_SZ_BYTES(query_hca_vport_context_out); int err; out = mlx5_vzalloc(outlen); if (!out) return -ENOMEM; err = mlx5_query_hca_vport_context(mdev, 1, 0, out, outlen); if (err) goto out; *node_guid = MLX5_GET64(query_hca_vport_context_out, out, hca_vport_context.node_guid); out: kvfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_hca_vport_node_guid); int mlx5_query_hca_vport_gid(struct mlx5_core_dev *dev, u8 port_num, u16 vport_num, u16 gid_index, union ib_gid *gid) { int in_sz = MLX5_ST_SZ_BYTES(query_hca_vport_gid_in); int out_sz = MLX5_ST_SZ_BYTES(query_hca_vport_gid_out); int is_group_manager; void *out = NULL; void *in = NULL; union ib_gid *tmp; int tbsz; int nout; int err; is_group_manager = MLX5_CAP_GEN(dev, vport_group_manager); tbsz = mlx5_get_gid_table_len(MLX5_CAP_GEN(dev, gid_table_size)); if (gid_index > tbsz && gid_index != 0xffff) return -EINVAL; if (gid_index == 0xffff) nout = tbsz; else nout = 1; out_sz += nout * sizeof(*gid); in = mlx5_vzalloc(in_sz); out = mlx5_vzalloc(out_sz); if (!in || !out) { err = -ENOMEM; goto out; } MLX5_SET(query_hca_vport_gid_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_VPORT_GID); if (vport_num) { if (is_group_manager) { MLX5_SET(query_hca_vport_gid_in, in, vport_number, vport_num); MLX5_SET(query_hca_vport_gid_in, in, other_vport, 1); } else { err = -EPERM; goto out; } } MLX5_SET(query_hca_vport_gid_in, in, gid_index, gid_index); if (MLX5_CAP_GEN(dev, num_ports) == 2) MLX5_SET(query_hca_vport_gid_in, in, port_num, port_num); err = mlx5_cmd_exec(dev, in, in_sz, out, out_sz); if (err) goto out; err = mlx5_cmd_status_to_err_v2(out); if (err) goto out; tmp = (union ib_gid *)MLX5_ADDR_OF(query_hca_vport_gid_out, out, gid); gid->global.subnet_prefix = tmp->global.subnet_prefix; gid->global.interface_id = tmp->global.interface_id; out: kvfree(in); kvfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_hca_vport_gid); int mlx5_query_hca_vport_pkey(struct mlx5_core_dev *dev, u8 other_vport, u8 port_num, u16 vf_num, u16 pkey_index, u16 *pkey) { int in_sz = MLX5_ST_SZ_BYTES(query_hca_vport_pkey_in); int out_sz = MLX5_ST_SZ_BYTES(query_hca_vport_pkey_out); int is_group_manager; void *out = NULL; void *in = NULL; void *pkarr; int nout; int tbsz; int err; int i; is_group_manager = MLX5_CAP_GEN(dev, vport_group_manager); tbsz = mlx5_to_sw_pkey_sz(MLX5_CAP_GEN(dev, pkey_table_size)); if (pkey_index > tbsz && pkey_index != 0xffff) return -EINVAL; if (pkey_index == 0xffff) nout = tbsz; else nout = 1; out_sz += nout * MLX5_ST_SZ_BYTES(pkey); in = kzalloc(in_sz, GFP_KERNEL); out = kzalloc(out_sz, GFP_KERNEL); MLX5_SET(query_hca_vport_pkey_in, in, opcode, MLX5_CMD_OP_QUERY_HCA_VPORT_PKEY); if (other_vport) { if (is_group_manager) { MLX5_SET(query_hca_vport_pkey_in, in, vport_number, vf_num); MLX5_SET(query_hca_vport_pkey_in, in, other_vport, 1); } else { err = -EPERM; goto out; } } MLX5_SET(query_hca_vport_pkey_in, in, pkey_index, pkey_index); if (MLX5_CAP_GEN(dev, num_ports) == 2) MLX5_SET(query_hca_vport_pkey_in, in, port_num, port_num); err = mlx5_cmd_exec(dev, in, in_sz, out, out_sz); if (err) goto out; err = mlx5_cmd_status_to_err_v2(out); if (err) goto out; pkarr = MLX5_ADDR_OF(query_hca_vport_pkey_out, out, pkey); for (i = 0; i < nout; i++, pkey++, pkarr += MLX5_ST_SZ_BYTES(pkey)) *pkey = MLX5_GET_PR(pkey, pkarr, pkey); out: kfree(in); kfree(out); return err; } EXPORT_SYMBOL_GPL(mlx5_query_hca_vport_pkey); static int mlx5_modify_eswitch_vport_context(struct mlx5_core_dev *mdev, u16 vport, void *in, int inlen) { u32 out[MLX5_ST_SZ_DW(modify_esw_vport_context_out)]; int err; memset(out, 0, sizeof(out)); MLX5_SET(modify_esw_vport_context_in, in, vport_number, vport); if (vport) MLX5_SET(modify_esw_vport_context_in, in, other_vport, 1); MLX5_SET(modify_esw_vport_context_in, in, opcode, MLX5_CMD_OP_MODIFY_ESW_VPORT_CONTEXT); err = mlx5_cmd_exec_check_status(mdev, in, inlen, out, sizeof(out)); if (err) mlx5_core_warn(mdev, "MLX5_CMD_OP_MODIFY_ESW_VPORT_CONTEXT failed\n"); return err; } int mlx5_set_eswitch_cvlan_info(struct mlx5_core_dev *mdev, u8 vport, u8 insert_mode, u8 strip_mode, u16 vlan, u8 cfi, u8 pcp) { u32 in[MLX5_ST_SZ_DW(modify_esw_vport_context_in)]; memset(in, 0, sizeof(in)); if (insert_mode != MLX5_MODIFY_ESW_VPORT_CONTEXT_CVLAN_INSERT_NONE) { MLX5_SET(modify_esw_vport_context_in, in, esw_vport_context.cvlan_cfi, cfi); MLX5_SET(modify_esw_vport_context_in, in, esw_vport_context.cvlan_pcp, pcp); MLX5_SET(modify_esw_vport_context_in, in, esw_vport_context.cvlan_id, vlan); } MLX5_SET(modify_esw_vport_context_in, in, esw_vport_context.vport_cvlan_insert, insert_mode); MLX5_SET(modify_esw_vport_context_in, in, esw_vport_context.vport_cvlan_strip, strip_mode); MLX5_SET(modify_esw_vport_context_in, in, field_select, MLX5_MODIFY_ESW_VPORT_CONTEXT_FIELD_SELECT_CVLAN_STRIP | MLX5_MODIFY_ESW_VPORT_CONTEXT_FIELD_SELECT_CVLAN_INSERT); return mlx5_modify_eswitch_vport_context(mdev, vport, in, sizeof(in)); } EXPORT_SYMBOL_GPL(mlx5_set_eswitch_cvlan_info); + +int mlx5_arm_vport_context_events(struct mlx5_core_dev *mdev, + u8 vport, + u32 events_mask) +{ + u32 *in; + u32 inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in); + void *nic_vport_ctx; + int err; + + in = mlx5_vzalloc(inlen); + if (!in) + return -ENOMEM; + + MLX5_SET(modify_nic_vport_context_in, + in, + opcode, + MLX5_CMD_OP_MODIFY_NIC_VPORT_CONTEXT); + MLX5_SET(modify_nic_vport_context_in, + in, + field_select.change_event, + 1); + MLX5_SET(modify_nic_vport_context_in, in, vport_number, vport); + if (vport) + MLX5_SET(modify_nic_vport_context_in, in, other_vport, 1); + nic_vport_ctx = MLX5_ADDR_OF(modify_nic_vport_context_in, + in, + nic_vport_context); + + MLX5_SET(nic_vport_context, nic_vport_ctx, arm_change_event, 1); + + if (events_mask & MLX5_UC_ADDR_CHANGE) + MLX5_SET(nic_vport_context, + nic_vport_ctx, + event_on_uc_address_change, + 1); + if (events_mask & MLX5_MC_ADDR_CHANGE) + MLX5_SET(nic_vport_context, + nic_vport_ctx, + event_on_mc_address_change, + 1); + if (events_mask & MLX5_VLAN_CHANGE) + MLX5_SET(nic_vport_context, + nic_vport_ctx, + event_on_vlan_change, + 1); + if (events_mask & MLX5_PROMISC_CHANGE) + MLX5_SET(nic_vport_context, + nic_vport_ctx, + event_on_promisc_change, + 1); + if (events_mask & MLX5_MTU_CHANGE) + MLX5_SET(nic_vport_context, + nic_vport_ctx, + event_on_mtu, + 1); + + err = mlx5_modify_nic_vport_context(mdev, in, inlen); + + kvfree(in); + return err; +} +EXPORT_SYMBOL_GPL(mlx5_arm_vport_context_events); + +int mlx5_query_vport_promisc(struct mlx5_core_dev *mdev, + u32 vport, + u8 *promisc_uc, + u8 *promisc_mc, + u8 *promisc_all) +{ + u32 *out; + int outlen = MLX5_ST_SZ_BYTES(query_nic_vport_context_out); + int err; + + out = kzalloc(outlen, GFP_KERNEL); + if (!out) + return -ENOMEM; + + err = mlx5_query_nic_vport_context(mdev, vport, out, outlen); + if (err) + goto out; + + *promisc_uc = MLX5_GET(query_nic_vport_context_out, out, + nic_vport_context.promisc_uc); + *promisc_mc = MLX5_GET(query_nic_vport_context_out, out, + nic_vport_context.promisc_mc); + *promisc_all = MLX5_GET(query_nic_vport_context_out, out, + nic_vport_context.promisc_all); + +out: + kfree(out); + return err; +} +EXPORT_SYMBOL_GPL(mlx5_query_nic_vport_promisc); + +int mlx5_modify_nic_vport_promisc(struct mlx5_core_dev *mdev, + int promisc_uc, + int promisc_mc, + int promisc_all) +{ + void *in; + int inlen = MLX5_ST_SZ_BYTES(modify_nic_vport_context_in); + int err; + + in = mlx5_vzalloc(inlen); + if (!in) { + mlx5_core_err(mdev, "failed to allocate inbox\n"); + return -ENOMEM; + } + + MLX5_SET(modify_nic_vport_context_in, in, field_select.promisc, 1); + MLX5_SET(modify_nic_vport_context_in, in, + nic_vport_context.promisc_uc, promisc_uc); + MLX5_SET(modify_nic_vport_context_in, in, + nic_vport_context.promisc_mc, promisc_mc); + MLX5_SET(modify_nic_vport_context_in, in, + nic_vport_context.promisc_all, promisc_all); + + err = mlx5_modify_nic_vport_context(mdev, in, inlen); + kvfree(in); + return err; +} +EXPORT_SYMBOL_GPL(mlx5_modify_nic_vport_promisc); int mlx5_query_vport_counter(struct mlx5_core_dev *dev, u8 port_num, u16 vport_num, void *out, int out_size) { int in_sz = MLX5_ST_SZ_BYTES(query_vport_counter_in); int is_group_manager; void *in; int err; is_group_manager = MLX5_CAP_GEN(dev, vport_group_manager); in = mlx5_vzalloc(in_sz); if (!in) return -ENOMEM; MLX5_SET(query_vport_counter_in, in, opcode, MLX5_CMD_OP_QUERY_VPORT_COUNTER); if (vport_num) { if (is_group_manager) { MLX5_SET(query_vport_counter_in, in, other_vport, 1); MLX5_SET(query_vport_counter_in, in, vport_number, vport_num); } else { err = -EPERM; goto ex; } } if (MLX5_CAP_GEN(dev, num_ports) == 2) MLX5_SET(query_vport_counter_in, in, port_num, port_num); err = mlx5_cmd_exec(dev, in, in_sz, out, out_size); if (err) goto ex; err = mlx5_cmd_status_to_err_v2(out); if (err) goto ex; ex: kvfree(in); return err; } EXPORT_SYMBOL_GPL(mlx5_query_vport_counter); int mlx5_get_vport_counters(struct mlx5_core_dev *dev, u8 port_num, struct mlx5_vport_counters *vc) { int out_sz = MLX5_ST_SZ_BYTES(query_vport_counter_out); void *out; int err; out = mlx5_vzalloc(out_sz); if (!out) return -ENOMEM; err = mlx5_query_vport_counter(dev, port_num, 0, out, out_sz); if (err) goto ex; vc->received_errors.packets = MLX5_GET64(query_vport_counter_out, out, received_errors.packets); vc->received_errors.octets = MLX5_GET64(query_vport_counter_out, out, received_errors.octets); vc->transmit_errors.packets = MLX5_GET64(query_vport_counter_out, out, transmit_errors.packets); vc->transmit_errors.octets = MLX5_GET64(query_vport_counter_out, out, transmit_errors.octets); vc->received_ib_unicast.packets = MLX5_GET64(query_vport_counter_out, out, received_ib_unicast.packets); vc->received_ib_unicast.octets = MLX5_GET64(query_vport_counter_out, out, received_ib_unicast.octets); vc->transmitted_ib_unicast.packets = MLX5_GET64(query_vport_counter_out, out, transmitted_ib_unicast.packets); vc->transmitted_ib_unicast.octets = MLX5_GET64(query_vport_counter_out, out, transmitted_ib_unicast.octets); vc->received_ib_multicast.packets = MLX5_GET64(query_vport_counter_out, out, received_ib_multicast.packets); vc->received_ib_multicast.octets = MLX5_GET64(query_vport_counter_out, out, received_ib_multicast.octets); vc->transmitted_ib_multicast.packets = MLX5_GET64(query_vport_counter_out, out, transmitted_ib_multicast.packets); vc->transmitted_ib_multicast.octets = MLX5_GET64(query_vport_counter_out, out, transmitted_ib_multicast.octets); vc->received_eth_broadcast.packets = MLX5_GET64(query_vport_counter_out, out, received_eth_broadcast.packets); vc->received_eth_broadcast.octets = MLX5_GET64(query_vport_counter_out, out, received_eth_broadcast.octets); vc->transmitted_eth_broadcast.packets = MLX5_GET64(query_vport_counter_out, out, transmitted_eth_broadcast.packets); vc->transmitted_eth_broadcast.octets = MLX5_GET64(query_vport_counter_out, out, transmitted_eth_broadcast.octets); vc->received_eth_unicast.octets = MLX5_GET64(query_vport_counter_out, out, received_eth_unicast.octets); vc->received_eth_unicast.packets = MLX5_GET64(query_vport_counter_out, out, received_eth_unicast.packets); vc->transmitted_eth_unicast.octets = MLX5_GET64(query_vport_counter_out, out, transmitted_eth_unicast.octets); vc->transmitted_eth_unicast.packets = MLX5_GET64(query_vport_counter_out, out, transmitted_eth_unicast.packets); vc->received_eth_multicast.octets = MLX5_GET64(query_vport_counter_out, out, received_eth_multicast.octets); vc->received_eth_multicast.packets = MLX5_GET64(query_vport_counter_out, out, received_eth_multicast.packets); vc->transmitted_eth_multicast.octets = MLX5_GET64(query_vport_counter_out, out, transmitted_eth_multicast.octets); vc->transmitted_eth_multicast.packets = MLX5_GET64(query_vport_counter_out, out, transmitted_eth_multicast.packets); ex: kvfree(out); return err; } Index: projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_flow_table.c =================================================================== --- projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_flow_table.c (revision 301546) +++ projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_flow_table.c (revision 301547) @@ -1,867 +1,999 @@ /*- * Copyright (c) 2015 Mellanox Technologies. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS `AS IS' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include "en.h" #include #include enum { MLX5E_FULLMATCH = 0, MLX5E_ALLMULTI = 1, MLX5E_PROMISC = 2, }; enum { MLX5E_UC = 0, MLX5E_MC_IPV4 = 1, MLX5E_MC_IPV6 = 2, MLX5E_MC_OTHER = 3, }; enum { MLX5E_ACTION_NONE = 0, MLX5E_ACTION_ADD = 1, MLX5E_ACTION_DEL = 2, }; struct mlx5e_eth_addr_hash_node { LIST_ENTRY(mlx5e_eth_addr_hash_node) hlist; u8 action; struct mlx5e_eth_addr_info ai; }; static inline int mlx5e_hash_eth_addr(const u8 * addr) { return (addr[5]); } static void mlx5e_add_eth_addr_to_hash(struct mlx5e_eth_addr_hash_head *hash, const u8 * addr) { struct mlx5e_eth_addr_hash_node *hn; int ix = mlx5e_hash_eth_addr(addr); LIST_FOREACH(hn, &hash[ix], hlist) { if (bcmp(hn->ai.addr, addr, ETHER_ADDR_LEN) == 0) { if (hn->action == MLX5E_ACTION_DEL) hn->action = MLX5E_ACTION_NONE; return; } } hn = malloc(sizeof(*hn), M_MLX5EN, M_NOWAIT | M_ZERO); if (hn == NULL) return; ether_addr_copy(hn->ai.addr, addr); hn->action = MLX5E_ACTION_ADD; LIST_INSERT_HEAD(&hash[ix], hn, hlist); } static void mlx5e_del_eth_addr_from_hash(struct mlx5e_eth_addr_hash_node *hn) { LIST_REMOVE(hn, hlist); free(hn, M_MLX5EN); } static void mlx5e_del_eth_addr_from_flow_table(struct mlx5e_priv *priv, struct mlx5e_eth_addr_info *ai) { void *ft = priv->ft.main; if (ai->tt_vec & (1 << MLX5E_TT_IPV6_TCP)) mlx5_del_flow_table_entry(ft, ai->ft_ix[MLX5E_TT_IPV6_TCP]); if (ai->tt_vec & (1 << MLX5E_TT_IPV4_TCP)) mlx5_del_flow_table_entry(ft, ai->ft_ix[MLX5E_TT_IPV4_TCP]); if (ai->tt_vec & (1 << MLX5E_TT_IPV6_UDP)) mlx5_del_flow_table_entry(ft, ai->ft_ix[MLX5E_TT_IPV6_UDP]); if (ai->tt_vec & (1 << MLX5E_TT_IPV4_UDP)) mlx5_del_flow_table_entry(ft, ai->ft_ix[MLX5E_TT_IPV4_UDP]); if (ai->tt_vec & (1 << MLX5E_TT_IPV6)) mlx5_del_flow_table_entry(ft, ai->ft_ix[MLX5E_TT_IPV6]); if (ai->tt_vec & (1 << MLX5E_TT_IPV4)) mlx5_del_flow_table_entry(ft, ai->ft_ix[MLX5E_TT_IPV4]); if (ai->tt_vec & (1 << MLX5E_TT_ANY)) mlx5_del_flow_table_entry(ft, ai->ft_ix[MLX5E_TT_ANY]); } static int mlx5e_get_eth_addr_type(const u8 * addr) { if (ETHER_IS_MULTICAST(addr) == 0) return (MLX5E_UC); if ((addr[0] == 0x01) && (addr[1] == 0x00) && (addr[2] == 0x5e) && !(addr[3] & 0x80)) return (MLX5E_MC_IPV4); if ((addr[0] == 0x33) && (addr[1] == 0x33)) return (MLX5E_MC_IPV6); return (MLX5E_MC_OTHER); } static u32 mlx5e_get_tt_vec(struct mlx5e_eth_addr_info *ai, int type) { int eth_addr_type; u32 ret; switch (type) { case MLX5E_FULLMATCH: eth_addr_type = mlx5e_get_eth_addr_type(ai->addr); switch (eth_addr_type) { case MLX5E_UC: ret = (1 << MLX5E_TT_IPV4_TCP) | (1 << MLX5E_TT_IPV6_TCP) | (1 << MLX5E_TT_IPV4_UDP) | (1 << MLX5E_TT_IPV6_UDP) | (1 << MLX5E_TT_IPV4) | (1 << MLX5E_TT_IPV6) | (1 << MLX5E_TT_ANY) | 0; break; case MLX5E_MC_IPV4: ret = (1 << MLX5E_TT_IPV4_UDP) | (1 << MLX5E_TT_IPV4) | 0; break; case MLX5E_MC_IPV6: ret = (1 << MLX5E_TT_IPV6_UDP) | (1 << MLX5E_TT_IPV6) | 0; break; default: ret = (1 << MLX5E_TT_ANY) | 0; break; } break; case MLX5E_ALLMULTI: ret = (1 << MLX5E_TT_IPV4_UDP) | (1 << MLX5E_TT_IPV6_UDP) | (1 << MLX5E_TT_IPV4) | (1 << MLX5E_TT_IPV6) | (1 << MLX5E_TT_ANY) | 0; break; default: /* MLX5E_PROMISC */ ret = (1 << MLX5E_TT_IPV4_TCP) | (1 << MLX5E_TT_IPV6_TCP) | (1 << MLX5E_TT_IPV4_UDP) | (1 << MLX5E_TT_IPV6_UDP) | (1 << MLX5E_TT_IPV4) | (1 << MLX5E_TT_IPV6) | (1 << MLX5E_TT_ANY) | 0; break; } return (ret); } static int mlx5e_add_eth_addr_rule_sub(struct mlx5e_priv *priv, struct mlx5e_eth_addr_info *ai, int type, void *flow_context, void *match_criteria) { u8 match_criteria_enable = 0; void *match_value; void *dest; u8 *dmac; u8 *match_criteria_dmac; void *ft = priv->ft.main; u32 *tirn = priv->tirn; u32 tt_vec; int err; match_value = MLX5_ADDR_OF(flow_context, flow_context, match_value); dmac = MLX5_ADDR_OF(fte_match_param, match_value, outer_headers.dmac_47_16); match_criteria_dmac = MLX5_ADDR_OF(fte_match_param, match_criteria, outer_headers.dmac_47_16); dest = MLX5_ADDR_OF(flow_context, flow_context, destination); MLX5_SET(flow_context, flow_context, action, MLX5_FLOW_CONTEXT_ACTION_FWD_DEST); MLX5_SET(flow_context, flow_context, destination_list_size, 1); MLX5_SET(dest_format_struct, dest, destination_type, MLX5_FLOW_CONTEXT_DEST_TYPE_TIR); switch (type) { case MLX5E_FULLMATCH: match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; memset(match_criteria_dmac, 0xff, ETH_ALEN); ether_addr_copy(dmac, ai->addr); break; case MLX5E_ALLMULTI: match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; match_criteria_dmac[0] = 0x01; dmac[0] = 0x01; break; case MLX5E_PROMISC: break; default: break; } tt_vec = mlx5e_get_tt_vec(ai, type); if (tt_vec & (1 << MLX5E_TT_ANY)) { MLX5_SET(dest_format_struct, dest, destination_id, tirn[MLX5E_TT_ANY]); err = mlx5_add_flow_table_entry(ft, match_criteria_enable, match_criteria, flow_context, &ai->ft_ix[MLX5E_TT_ANY]); if (err) { mlx5e_del_eth_addr_from_flow_table(priv, ai); return (err); } ai->tt_vec |= (1 << MLX5E_TT_ANY); } match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.ethertype); if (tt_vec & (1 << MLX5E_TT_IPV4)) { MLX5_SET(fte_match_param, match_value, outer_headers.ethertype, ETHERTYPE_IP); MLX5_SET(dest_format_struct, dest, destination_id, tirn[MLX5E_TT_IPV4]); err = mlx5_add_flow_table_entry(ft, match_criteria_enable, match_criteria, flow_context, &ai->ft_ix[MLX5E_TT_IPV4]); if (err) { mlx5e_del_eth_addr_from_flow_table(priv, ai); return (err); } ai->tt_vec |= (1 << MLX5E_TT_IPV4); } if (tt_vec & (1 << MLX5E_TT_IPV6)) { MLX5_SET(fte_match_param, match_value, outer_headers.ethertype, ETHERTYPE_IPV6); MLX5_SET(dest_format_struct, dest, destination_id, tirn[MLX5E_TT_IPV6]); err = mlx5_add_flow_table_entry(ft, match_criteria_enable, match_criteria, flow_context, &ai->ft_ix[MLX5E_TT_IPV6]); if (err) { mlx5e_del_eth_addr_from_flow_table(priv, ai); return (err); } ai->tt_vec |= (1 << MLX5E_TT_IPV6); } MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.ip_protocol); MLX5_SET(fte_match_param, match_value, outer_headers.ip_protocol, IPPROTO_UDP); if (tt_vec & (1 << MLX5E_TT_IPV4_UDP)) { MLX5_SET(fte_match_param, match_value, outer_headers.ethertype, ETHERTYPE_IP); MLX5_SET(dest_format_struct, dest, destination_id, tirn[MLX5E_TT_IPV4_UDP]); err = mlx5_add_flow_table_entry(ft, match_criteria_enable, match_criteria, flow_context, &ai->ft_ix[MLX5E_TT_IPV4_UDP]); if (err) { mlx5e_del_eth_addr_from_flow_table(priv, ai); return (err); } ai->tt_vec |= (1 << MLX5E_TT_IPV4_UDP); } if (tt_vec & (1 << MLX5E_TT_IPV6_UDP)) { MLX5_SET(fte_match_param, match_value, outer_headers.ethertype, ETHERTYPE_IPV6); MLX5_SET(dest_format_struct, dest, destination_id, tirn[MLX5E_TT_IPV6_UDP]); err = mlx5_add_flow_table_entry(ft, match_criteria_enable, match_criteria, flow_context, &ai->ft_ix[MLX5E_TT_IPV6_UDP]); if (err) { mlx5e_del_eth_addr_from_flow_table(priv, ai); return (err); } ai->tt_vec |= (1 << MLX5E_TT_IPV6_UDP); } MLX5_SET(fte_match_param, match_value, outer_headers.ip_protocol, IPPROTO_TCP); if (tt_vec & (1 << MLX5E_TT_IPV4_TCP)) { MLX5_SET(fte_match_param, match_value, outer_headers.ethertype, ETHERTYPE_IP); MLX5_SET(dest_format_struct, dest, destination_id, tirn[MLX5E_TT_IPV4_TCP]); err = mlx5_add_flow_table_entry(ft, match_criteria_enable, match_criteria, flow_context, &ai->ft_ix[MLX5E_TT_IPV4_TCP]); if (err) { mlx5e_del_eth_addr_from_flow_table(priv, ai); return (err); } ai->tt_vec |= (1 << MLX5E_TT_IPV4_TCP); } if (tt_vec & (1 << MLX5E_TT_IPV6_TCP)) { MLX5_SET(fte_match_param, match_value, outer_headers.ethertype, ETHERTYPE_IPV6); MLX5_SET(dest_format_struct, dest, destination_id, tirn[MLX5E_TT_IPV6_TCP]); err = mlx5_add_flow_table_entry(ft, match_criteria_enable, match_criteria, flow_context, &ai->ft_ix[MLX5E_TT_IPV6_TCP]); if (err) { mlx5e_del_eth_addr_from_flow_table(priv, ai); return (err); } ai->tt_vec |= (1 << MLX5E_TT_IPV6_TCP); } return (0); } static int mlx5e_add_eth_addr_rule(struct mlx5e_priv *priv, struct mlx5e_eth_addr_info *ai, int type) { u32 *flow_context; u32 *match_criteria; int err; flow_context = mlx5_vzalloc(MLX5_ST_SZ_BYTES(flow_context) + MLX5_ST_SZ_BYTES(dest_format_struct)); match_criteria = mlx5_vzalloc(MLX5_ST_SZ_BYTES(fte_match_param)); if (!flow_context || !match_criteria) { if_printf(priv->ifp, "%s: alloc failed\n", __func__); err = -ENOMEM; goto add_eth_addr_rule_out; } err = mlx5e_add_eth_addr_rule_sub(priv, ai, type, flow_context, match_criteria); if (err) if_printf(priv->ifp, "%s: failed\n", __func__); add_eth_addr_rule_out: kvfree(match_criteria); kvfree(flow_context); return (err); } +static int mlx5e_vport_context_update_vlans(struct mlx5e_priv *priv) +{ + struct ifnet *ifp = priv->ifp; + int max_list_size; + int list_size; + u16 *vlans; + int vlan; + int err; + int i; + + list_size = 0; + for_each_set_bit(vlan, priv->vlan.active_vlans, VLAN_N_VID) + list_size++; + + max_list_size = 1 << MLX5_CAP_GEN(priv->mdev, log_max_vlan_list); + + if (list_size > max_list_size) { + if_printf(ifp, + "ifnet vlans list size (%d) > (%d) max vport list size, some vlans will be dropped\n", + list_size, max_list_size); + list_size = max_list_size; + } + + vlans = kcalloc(list_size, sizeof(*vlans), GFP_KERNEL); + if (!vlans) + return -ENOMEM; + + i = 0; + for_each_set_bit(vlan, priv->vlan.active_vlans, VLAN_N_VID) { + if (i >= list_size) + break; + vlans[i++] = vlan; + } + + err = mlx5_modify_nic_vport_vlans(priv->mdev, vlans, list_size); + if (err) + if_printf(ifp, "Failed to modify vport vlans list err(%d)\n", + err); + + kfree(vlans); + return err; +} + enum mlx5e_vlan_rule_type { MLX5E_VLAN_RULE_TYPE_UNTAGGED, MLX5E_VLAN_RULE_TYPE_ANY_VID, MLX5E_VLAN_RULE_TYPE_MATCH_VID, }; static int mlx5e_add_vlan_rule(struct mlx5e_priv *priv, enum mlx5e_vlan_rule_type rule_type, u16 vid) { u8 match_criteria_enable = 0; u32 *flow_context; void *match_value; void *dest; u32 *match_criteria; u32 *ft_ix; int err; flow_context = mlx5_vzalloc(MLX5_ST_SZ_BYTES(flow_context) + MLX5_ST_SZ_BYTES(dest_format_struct)); match_criteria = mlx5_vzalloc(MLX5_ST_SZ_BYTES(fte_match_param)); if (!flow_context || !match_criteria) { if_printf(priv->ifp, "%s: alloc failed\n", __func__); err = -ENOMEM; goto add_vlan_rule_out; } match_value = MLX5_ADDR_OF(flow_context, flow_context, match_value); dest = MLX5_ADDR_OF(flow_context, flow_context, destination); MLX5_SET(flow_context, flow_context, action, MLX5_FLOW_CONTEXT_ACTION_FWD_DEST); MLX5_SET(flow_context, flow_context, destination_list_size, 1); MLX5_SET(dest_format_struct, dest, destination_type, MLX5_FLOW_CONTEXT_DEST_TYPE_FLOW_TABLE); MLX5_SET(dest_format_struct, dest, destination_id, mlx5_get_flow_table_id(priv->ft.main)); match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.vlan_tag); switch (rule_type) { case MLX5E_VLAN_RULE_TYPE_UNTAGGED: ft_ix = &priv->vlan.untagged_rule_ft_ix; break; case MLX5E_VLAN_RULE_TYPE_ANY_VID: ft_ix = &priv->vlan.any_vlan_rule_ft_ix; MLX5_SET(fte_match_param, match_value, outer_headers.vlan_tag, 1); break; default: /* MLX5E_VLAN_RULE_TYPE_MATCH_VID */ ft_ix = &priv->vlan.active_vlans_ft_ix[vid]; MLX5_SET(fte_match_param, match_value, outer_headers.vlan_tag, 1); MLX5_SET_TO_ONES(fte_match_param, match_criteria, outer_headers.first_vid); MLX5_SET(fte_match_param, match_value, outer_headers.first_vid, vid); + mlx5e_vport_context_update_vlans(priv); break; } err = mlx5_add_flow_table_entry(priv->ft.vlan, match_criteria_enable, match_criteria, flow_context, ft_ix); if (err) if_printf(priv->ifp, "%s: failed\n", __func__); add_vlan_rule_out: kvfree(match_criteria); kvfree(flow_context); return (err); } static void mlx5e_del_vlan_rule(struct mlx5e_priv *priv, enum mlx5e_vlan_rule_type rule_type, u16 vid) { switch (rule_type) { case MLX5E_VLAN_RULE_TYPE_UNTAGGED: mlx5_del_flow_table_entry(priv->ft.vlan, priv->vlan.untagged_rule_ft_ix); break; case MLX5E_VLAN_RULE_TYPE_ANY_VID: mlx5_del_flow_table_entry(priv->ft.vlan, priv->vlan.any_vlan_rule_ft_ix); break; case MLX5E_VLAN_RULE_TYPE_MATCH_VID: mlx5_del_flow_table_entry(priv->ft.vlan, priv->vlan.active_vlans_ft_ix[vid]); + mlx5e_vport_context_update_vlans(priv); break; } } void mlx5e_enable_vlan_filter(struct mlx5e_priv *priv) { if (priv->vlan.filter_disabled) { priv->vlan.filter_disabled = false; if (test_bit(MLX5E_STATE_OPENED, &priv->state)) mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_ANY_VID, 0); } } void mlx5e_disable_vlan_filter(struct mlx5e_priv *priv) { if (!priv->vlan.filter_disabled) { priv->vlan.filter_disabled = true; if (test_bit(MLX5E_STATE_OPENED, &priv->state)) mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_ANY_VID, 0); } } void mlx5e_vlan_rx_add_vid(void *arg, struct ifnet *ifp, u16 vid) { struct mlx5e_priv *priv = arg; if (ifp != priv->ifp) return; PRIV_LOCK(priv); set_bit(vid, priv->vlan.active_vlans); if (test_bit(MLX5E_STATE_OPENED, &priv->state)) mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_MATCH_VID, vid); PRIV_UNLOCK(priv); } void mlx5e_vlan_rx_kill_vid(void *arg, struct ifnet *ifp, u16 vid) { struct mlx5e_priv *priv = arg; if (ifp != priv->ifp) return; PRIV_LOCK(priv); clear_bit(vid, priv->vlan.active_vlans); if (test_bit(MLX5E_STATE_OPENED, &priv->state)) mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_MATCH_VID, vid); PRIV_UNLOCK(priv); } int mlx5e_add_all_vlan_rules(struct mlx5e_priv *priv) { u16 vid; int err; for_each_set_bit(vid, priv->vlan.active_vlans, VLAN_N_VID) { err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_MATCH_VID, vid); if (err) return (err); } err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0); if (err) return (err); if (priv->vlan.filter_disabled) { err = mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_ANY_VID, 0); if (err) return (err); } return (0); } void mlx5e_del_all_vlan_rules(struct mlx5e_priv *priv) { u16 vid; if (priv->vlan.filter_disabled) mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_ANY_VID, 0); mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_UNTAGGED, 0); for_each_set_bit(vid, priv->vlan.active_vlans, VLAN_N_VID) mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_MATCH_VID, vid); } #define mlx5e_for_each_hash_node(hn, tmp, hash, i) \ for (i = 0; i < MLX5E_ETH_ADDR_HASH_SIZE; i++) \ LIST_FOREACH_SAFE(hn, &(hash)[i], hlist, tmp) static void mlx5e_execute_action(struct mlx5e_priv *priv, struct mlx5e_eth_addr_hash_node *hn) { switch (hn->action) { case MLX5E_ACTION_ADD: mlx5e_add_eth_addr_rule(priv, &hn->ai, MLX5E_FULLMATCH); hn->action = MLX5E_ACTION_NONE; break; case MLX5E_ACTION_DEL: mlx5e_del_eth_addr_from_flow_table(priv, &hn->ai); mlx5e_del_eth_addr_from_hash(hn); break; default: break; } } static void mlx5e_sync_ifp_addr(struct mlx5e_priv *priv) { struct ifnet *ifp = priv->ifp; struct ifaddr *ifa; struct ifmultiaddr *ifma; /* XXX adding this entry might not be needed */ mlx5e_add_eth_addr_to_hash(priv->eth_addr.if_uc, LLADDR((struct sockaddr_dl *)(ifp->if_addr->ifa_addr))); if_addr_rlock(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_LINK) continue; mlx5e_add_eth_addr_to_hash(priv->eth_addr.if_uc, LLADDR((struct sockaddr_dl *)ifa->ifa_addr)); } if_addr_runlock(ifp); if_maddr_rlock(ifp); TAILQ_FOREACH(ifma, &ifp->if_multiaddrs, ifma_link) { if (ifma->ifma_addr->sa_family != AF_LINK) continue; mlx5e_add_eth_addr_to_hash(priv->eth_addr.if_mc, LLADDR((struct sockaddr_dl *)ifma->ifma_addr)); } if_maddr_runlock(ifp); } +static void mlx5e_fill_addr_array(struct mlx5e_priv *priv, int list_type, + u8 addr_array[][ETH_ALEN], int size) +{ + bool is_uc = (list_type == MLX5_NIC_VPORT_LIST_TYPE_UC); + struct ifnet *ifp = priv->ifp; + struct mlx5e_eth_addr_hash_node *hn; + struct mlx5e_eth_addr_hash_head *addr_list; + struct mlx5e_eth_addr_hash_node *tmp; + int i = 0; + int hi; + + addr_list = is_uc ? priv->eth_addr.if_uc : priv->eth_addr.if_mc; + + if (is_uc) /* Make sure our own address is pushed first */ + ether_addr_copy(addr_array[i++], IF_LLADDR(ifp)); + else if (priv->eth_addr.broadcast_enabled) + ether_addr_copy(addr_array[i++], ifp->if_broadcastaddr); + + mlx5e_for_each_hash_node(hn, tmp, addr_list, hi) { + if (ether_addr_equal(IF_LLADDR(ifp), hn->ai.addr)) + continue; + if (i >= size) + break; + ether_addr_copy(addr_array[i++], hn->ai.addr); + } +} + +static void mlx5e_vport_context_update_addr_list(struct mlx5e_priv *priv, + int list_type) +{ + bool is_uc = (list_type == MLX5_NIC_VPORT_LIST_TYPE_UC); + struct mlx5e_eth_addr_hash_node *hn; + u8 (*addr_array)[ETH_ALEN] = NULL; + struct mlx5e_eth_addr_hash_head *addr_list; + struct mlx5e_eth_addr_hash_node *tmp; + int max_size; + int size; + int err; + int hi; + + size = is_uc ? 0 : (priv->eth_addr.broadcast_enabled ? 1 : 0); + max_size = is_uc ? + 1 << MLX5_CAP_GEN(priv->mdev, log_max_current_uc_list) : + 1 << MLX5_CAP_GEN(priv->mdev, log_max_current_mc_list); + + addr_list = is_uc ? priv->eth_addr.if_uc : priv->eth_addr.if_mc; + mlx5e_for_each_hash_node(hn, tmp, addr_list, hi) + size++; + + if (size > max_size) { + if_printf(priv->ifp, + "ifp %s list size (%d) > (%d) max vport list size, some addresses will be dropped\n", + is_uc ? "UC" : "MC", size, max_size); + size = max_size; + } + + if (size) { + addr_array = kcalloc(size, ETH_ALEN, GFP_KERNEL); + if (!addr_array) { + err = -ENOMEM; + goto out; + } + mlx5e_fill_addr_array(priv, list_type, addr_array, size); + } + + err = mlx5_modify_nic_vport_mac_list(priv->mdev, list_type, addr_array, size); +out: + if (err) + if_printf(priv->ifp, + "Failed to modify vport %s list err(%d)\n", + is_uc ? "UC" : "MC", err); + kfree(addr_array); +} + +static void mlx5e_vport_context_update(struct mlx5e_priv *priv) +{ + struct mlx5e_eth_addr_db *ea = &priv->eth_addr; + + mlx5e_vport_context_update_addr_list(priv, MLX5_NIC_VPORT_LIST_TYPE_UC); + mlx5e_vport_context_update_addr_list(priv, MLX5_NIC_VPORT_LIST_TYPE_MC); + mlx5_modify_nic_vport_promisc(priv->mdev, 0, + ea->allmulti_enabled, + ea->promisc_enabled); +} + static void mlx5e_apply_ifp_addr(struct mlx5e_priv *priv) { struct mlx5e_eth_addr_hash_node *hn; struct mlx5e_eth_addr_hash_node *tmp; int i; mlx5e_for_each_hash_node(hn, tmp, priv->eth_addr.if_uc, i) mlx5e_execute_action(priv, hn); mlx5e_for_each_hash_node(hn, tmp, priv->eth_addr.if_mc, i) mlx5e_execute_action(priv, hn); } static void mlx5e_handle_ifp_addr(struct mlx5e_priv *priv) { struct mlx5e_eth_addr_hash_node *hn; struct mlx5e_eth_addr_hash_node *tmp; int i; mlx5e_for_each_hash_node(hn, tmp, priv->eth_addr.if_uc, i) hn->action = MLX5E_ACTION_DEL; mlx5e_for_each_hash_node(hn, tmp, priv->eth_addr.if_mc, i) hn->action = MLX5E_ACTION_DEL; if (test_bit(MLX5E_STATE_OPENED, &priv->state)) mlx5e_sync_ifp_addr(priv); mlx5e_apply_ifp_addr(priv); } void mlx5e_set_rx_mode_core(struct mlx5e_priv *priv) { struct mlx5e_eth_addr_db *ea = &priv->eth_addr; struct ifnet *ndev = priv->ifp; bool rx_mode_enable = test_bit(MLX5E_STATE_OPENED, &priv->state); bool promisc_enabled = rx_mode_enable && (ndev->if_flags & IFF_PROMISC); bool allmulti_enabled = rx_mode_enable && (ndev->if_flags & IFF_ALLMULTI); bool broadcast_enabled = rx_mode_enable; bool enable_promisc = !ea->promisc_enabled && promisc_enabled; bool disable_promisc = ea->promisc_enabled && !promisc_enabled; bool enable_allmulti = !ea->allmulti_enabled && allmulti_enabled; bool disable_allmulti = ea->allmulti_enabled && !allmulti_enabled; bool enable_broadcast = !ea->broadcast_enabled && broadcast_enabled; bool disable_broadcast = ea->broadcast_enabled && !broadcast_enabled; /* update broadcast address */ ether_addr_copy(priv->eth_addr.broadcast.addr, priv->ifp->if_broadcastaddr); if (enable_promisc) mlx5e_add_eth_addr_rule(priv, &ea->promisc, MLX5E_PROMISC); if (enable_allmulti) mlx5e_add_eth_addr_rule(priv, &ea->allmulti, MLX5E_ALLMULTI); if (enable_broadcast) mlx5e_add_eth_addr_rule(priv, &ea->broadcast, MLX5E_FULLMATCH); mlx5e_handle_ifp_addr(priv); if (disable_broadcast) mlx5e_del_eth_addr_from_flow_table(priv, &ea->broadcast); if (disable_allmulti) mlx5e_del_eth_addr_from_flow_table(priv, &ea->allmulti); if (disable_promisc) mlx5e_del_eth_addr_from_flow_table(priv, &ea->promisc); ea->promisc_enabled = promisc_enabled; ea->allmulti_enabled = allmulti_enabled; ea->broadcast_enabled = broadcast_enabled; + + mlx5e_vport_context_update(priv); } void mlx5e_set_rx_mode_work(struct work_struct *work) { struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv, set_rx_mode_work); PRIV_LOCK(priv); if (test_bit(MLX5E_STATE_OPENED, &priv->state)) mlx5e_set_rx_mode_core(priv); PRIV_UNLOCK(priv); } static int mlx5e_create_main_flow_table(struct mlx5e_priv *priv) { struct mlx5_flow_table_group *g; u8 *dmac; g = malloc(9 * sizeof(*g), M_MLX5EN, M_WAITOK | M_ZERO); if (g == NULL) return (-ENOMEM); g[0].log_sz = 2; g[0].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; MLX5_SET_TO_ONES(fte_match_param, g[0].match_criteria, outer_headers.ethertype); MLX5_SET_TO_ONES(fte_match_param, g[0].match_criteria, outer_headers.ip_protocol); g[1].log_sz = 1; g[1].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; MLX5_SET_TO_ONES(fte_match_param, g[1].match_criteria, outer_headers.ethertype); g[2].log_sz = 0; g[3].log_sz = 14; g[3].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; dmac = MLX5_ADDR_OF(fte_match_param, g[3].match_criteria, outer_headers.dmac_47_16); memset(dmac, 0xff, ETH_ALEN); MLX5_SET_TO_ONES(fte_match_param, g[3].match_criteria, outer_headers.ethertype); MLX5_SET_TO_ONES(fte_match_param, g[3].match_criteria, outer_headers.ip_protocol); g[4].log_sz = 13; g[4].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; dmac = MLX5_ADDR_OF(fte_match_param, g[4].match_criteria, outer_headers.dmac_47_16); memset(dmac, 0xff, ETH_ALEN); MLX5_SET_TO_ONES(fte_match_param, g[4].match_criteria, outer_headers.ethertype); g[5].log_sz = 11; g[5].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; dmac = MLX5_ADDR_OF(fte_match_param, g[5].match_criteria, outer_headers.dmac_47_16); memset(dmac, 0xff, ETH_ALEN); g[6].log_sz = 2; g[6].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; dmac = MLX5_ADDR_OF(fte_match_param, g[6].match_criteria, outer_headers.dmac_47_16); dmac[0] = 0x01; MLX5_SET_TO_ONES(fte_match_param, g[6].match_criteria, outer_headers.ethertype); MLX5_SET_TO_ONES(fte_match_param, g[6].match_criteria, outer_headers.ip_protocol); g[7].log_sz = 1; g[7].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; dmac = MLX5_ADDR_OF(fte_match_param, g[7].match_criteria, outer_headers.dmac_47_16); dmac[0] = 0x01; MLX5_SET_TO_ONES(fte_match_param, g[7].match_criteria, outer_headers.ethertype); g[8].log_sz = 0; g[8].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; dmac = MLX5_ADDR_OF(fte_match_param, g[8].match_criteria, outer_headers.dmac_47_16); dmac[0] = 0x01; priv->ft.main = mlx5_create_flow_table(priv->mdev, 1, MLX5_FLOW_TABLE_TYPE_NIC_RCV, 0, 9, g); free(g, M_MLX5EN); return (priv->ft.main ? 0 : -ENOMEM); } static void mlx5e_destroy_main_flow_table(struct mlx5e_priv *priv) { mlx5_destroy_flow_table(priv->ft.main); priv->ft.main = NULL; } static int mlx5e_create_vlan_flow_table(struct mlx5e_priv *priv) { struct mlx5_flow_table_group *g; g = malloc(2 * sizeof(*g), M_MLX5EN, M_WAITOK | M_ZERO); if (g == NULL) return (-ENOMEM); g[0].log_sz = 12; g[0].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; MLX5_SET_TO_ONES(fte_match_param, g[0].match_criteria, outer_headers.vlan_tag); MLX5_SET_TO_ONES(fte_match_param, g[0].match_criteria, outer_headers.first_vid); /* untagged + any vlan id */ g[1].log_sz = 1; g[1].match_criteria_enable = MLX5_MATCH_OUTER_HEADERS; MLX5_SET_TO_ONES(fte_match_param, g[1].match_criteria, outer_headers.vlan_tag); priv->ft.vlan = mlx5_create_flow_table(priv->mdev, 0, MLX5_FLOW_TABLE_TYPE_NIC_RCV, 0, 2, g); free(g, M_MLX5EN); return (priv->ft.vlan ? 0 : -ENOMEM); } static void mlx5e_destroy_vlan_flow_table(struct mlx5e_priv *priv) { mlx5_destroy_flow_table(priv->ft.vlan); priv->ft.vlan = NULL; } int mlx5e_open_flow_table(struct mlx5e_priv *priv) { int err; err = mlx5e_create_main_flow_table(priv); if (err) return (err); err = mlx5e_create_vlan_flow_table(priv); if (err) goto err_destroy_main_flow_table; return (0); err_destroy_main_flow_table: mlx5e_destroy_main_flow_table(priv); return (err); } void mlx5e_close_flow_table(struct mlx5e_priv *priv) { mlx5e_destroy_vlan_flow_table(priv); mlx5e_destroy_main_flow_table(priv); } Index: projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_main.c =================================================================== --- projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_main.c (revision 301546) +++ projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_main.c (revision 301547) @@ -1,3183 +1,3190 @@ /*- * Copyright (c) 2015 Mellanox Technologies. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS `AS IS' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include "en.h" #include #include #define ETH_DRIVER_VERSION "3.1.0-dev" char mlx5e_version[] = "Mellanox Ethernet driver" " (" ETH_DRIVER_VERSION ")"; struct mlx5e_rq_param { u32 rqc [MLX5_ST_SZ_DW(rqc)]; struct mlx5_wq_param wq; }; struct mlx5e_sq_param { u32 sqc [MLX5_ST_SZ_DW(sqc)]; struct mlx5_wq_param wq; }; struct mlx5e_cq_param { u32 cqc [MLX5_ST_SZ_DW(cqc)]; struct mlx5_wq_param wq; u16 eq_ix; }; struct mlx5e_channel_param { struct mlx5e_rq_param rq; struct mlx5e_sq_param sq; struct mlx5e_cq_param rx_cq; struct mlx5e_cq_param tx_cq; }; static const struct { u32 subtype; u64 baudrate; } mlx5e_mode_table[MLX5E_LINK_MODES_NUMBER] = { [MLX5E_1000BASE_CX_SGMII] = { .subtype = IFM_1000_CX_SGMII, .baudrate = IF_Mbps(1000ULL), }, [MLX5E_1000BASE_KX] = { .subtype = IFM_1000_KX, .baudrate = IF_Mbps(1000ULL), }, [MLX5E_10GBASE_CX4] = { .subtype = IFM_10G_CX4, .baudrate = IF_Gbps(10ULL), }, [MLX5E_10GBASE_KX4] = { .subtype = IFM_10G_KX4, .baudrate = IF_Gbps(10ULL), }, [MLX5E_10GBASE_KR] = { .subtype = IFM_10G_KR, .baudrate = IF_Gbps(10ULL), }, [MLX5E_20GBASE_KR2] = { .subtype = IFM_20G_KR2, .baudrate = IF_Gbps(20ULL), }, [MLX5E_40GBASE_CR4] = { .subtype = IFM_40G_CR4, .baudrate = IF_Gbps(40ULL), }, [MLX5E_40GBASE_KR4] = { .subtype = IFM_40G_KR4, .baudrate = IF_Gbps(40ULL), }, [MLX5E_56GBASE_R4] = { .subtype = IFM_56G_R4, .baudrate = IF_Gbps(56ULL), }, [MLX5E_10GBASE_CR] = { .subtype = IFM_10G_CR1, .baudrate = IF_Gbps(10ULL), }, [MLX5E_10GBASE_SR] = { .subtype = IFM_10G_SR, .baudrate = IF_Gbps(10ULL), }, [MLX5E_10GBASE_LR] = { .subtype = IFM_10G_LR, .baudrate = IF_Gbps(10ULL), }, [MLX5E_40GBASE_SR4] = { .subtype = IFM_40G_SR4, .baudrate = IF_Gbps(40ULL), }, [MLX5E_40GBASE_LR4] = { .subtype = IFM_40G_LR4, .baudrate = IF_Gbps(40ULL), }, [MLX5E_100GBASE_CR4] = { .subtype = IFM_100G_CR4, .baudrate = IF_Gbps(100ULL), }, [MLX5E_100GBASE_SR4] = { .subtype = IFM_100G_SR4, .baudrate = IF_Gbps(100ULL), }, [MLX5E_100GBASE_KR4] = { .subtype = IFM_100G_KR4, .baudrate = IF_Gbps(100ULL), }, [MLX5E_100GBASE_LR4] = { .subtype = IFM_100G_LR4, .baudrate = IF_Gbps(100ULL), }, [MLX5E_100BASE_TX] = { .subtype = IFM_100_TX, .baudrate = IF_Mbps(100ULL), }, [MLX5E_100BASE_T] = { .subtype = IFM_100_T, .baudrate = IF_Mbps(100ULL), }, [MLX5E_10GBASE_T] = { .subtype = IFM_10G_T, .baudrate = IF_Gbps(10ULL), }, [MLX5E_25GBASE_CR] = { .subtype = IFM_25G_CR, .baudrate = IF_Gbps(25ULL), }, [MLX5E_25GBASE_KR] = { .subtype = IFM_25G_KR, .baudrate = IF_Gbps(25ULL), }, [MLX5E_25GBASE_SR] = { .subtype = IFM_25G_SR, .baudrate = IF_Gbps(25ULL), }, [MLX5E_50GBASE_CR2] = { .subtype = IFM_50G_CR2, .baudrate = IF_Gbps(50ULL), }, [MLX5E_50GBASE_KR2] = { .subtype = IFM_50G_KR2, .baudrate = IF_Gbps(50ULL), }, }; MALLOC_DEFINE(M_MLX5EN, "MLX5EN", "MLX5 Ethernet"); static void mlx5e_update_carrier(struct mlx5e_priv *priv) { struct mlx5_core_dev *mdev = priv->mdev; u32 out[MLX5_ST_SZ_DW(ptys_reg)]; u32 eth_proto_oper; int error; u8 port_state; u8 i; port_state = mlx5_query_vport_state(mdev, MLX5_QUERY_VPORT_STATE_IN_OP_MOD_VNIC_VPORT); if (port_state == VPORT_STATE_UP) { priv->media_status_last |= IFM_ACTIVE; } else { priv->media_status_last &= ~IFM_ACTIVE; priv->media_active_last = IFM_ETHER; if_link_state_change(priv->ifp, LINK_STATE_DOWN); return; } error = mlx5_query_port_ptys(mdev, out, sizeof(out), MLX5_PTYS_EN); if (error) { priv->media_active_last = IFM_ETHER; priv->ifp->if_baudrate = 1; if_printf(priv->ifp, "%s: query port ptys failed: 0x%x\n", __func__, error); return; } eth_proto_oper = MLX5_GET(ptys_reg, out, eth_proto_oper); for (i = 0; i != MLX5E_LINK_MODES_NUMBER; i++) { if (mlx5e_mode_table[i].baudrate == 0) continue; if (MLX5E_PROT_MASK(i) & eth_proto_oper) { priv->ifp->if_baudrate = mlx5e_mode_table[i].baudrate; priv->media_active_last = mlx5e_mode_table[i].subtype | IFM_ETHER | IFM_FDX; } } if_link_state_change(priv->ifp, LINK_STATE_UP); } static void mlx5e_media_status(struct ifnet *dev, struct ifmediareq *ifmr) { struct mlx5e_priv *priv = dev->if_softc; ifmr->ifm_status = priv->media_status_last; ifmr->ifm_active = priv->media_active_last | (priv->params.rx_pauseframe_control ? IFM_ETH_RXPAUSE : 0) | (priv->params.tx_pauseframe_control ? IFM_ETH_TXPAUSE : 0); } static u32 mlx5e_find_link_mode(u32 subtype) { u32 i; u32 link_mode = 0; for (i = 0; i < MLX5E_LINK_MODES_NUMBER; ++i) { if (mlx5e_mode_table[i].baudrate == 0) continue; if (mlx5e_mode_table[i].subtype == subtype) link_mode |= MLX5E_PROT_MASK(i); } return (link_mode); } static int mlx5e_media_change(struct ifnet *dev) { struct mlx5e_priv *priv = dev->if_softc; struct mlx5_core_dev *mdev = priv->mdev; u32 eth_proto_cap; u32 link_mode; int was_opened; int locked; int error; locked = PRIV_LOCKED(priv); if (!locked) PRIV_LOCK(priv); if (IFM_TYPE(priv->media.ifm_media) != IFM_ETHER) { error = EINVAL; goto done; } link_mode = mlx5e_find_link_mode(IFM_SUBTYPE(priv->media.ifm_media)); /* query supported capabilities */ error = mlx5_query_port_proto_cap(mdev, ð_proto_cap, MLX5_PTYS_EN); if (error != 0) { if_printf(dev, "Query port media capability failed\n"); goto done; } /* check for autoselect */ if (IFM_SUBTYPE(priv->media.ifm_media) == IFM_AUTO) { link_mode = eth_proto_cap; if (link_mode == 0) { if_printf(dev, "Port media capability is zero\n"); error = EINVAL; goto done; } } else { link_mode = link_mode & eth_proto_cap; if (link_mode == 0) { if_printf(dev, "Not supported link mode requested\n"); error = EINVAL; goto done; } } /* update pauseframe control bits */ priv->params.rx_pauseframe_control = (priv->media.ifm_media & IFM_ETH_RXPAUSE) ? 1 : 0; priv->params.tx_pauseframe_control = (priv->media.ifm_media & IFM_ETH_TXPAUSE) ? 1 : 0; /* check if device is opened */ was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state); /* reconfigure the hardware */ mlx5_set_port_status(mdev, MLX5_PORT_DOWN); mlx5_set_port_proto(mdev, link_mode, MLX5_PTYS_EN); mlx5_set_port_pause(mdev, 1, priv->params.rx_pauseframe_control, priv->params.tx_pauseframe_control); if (was_opened) mlx5_set_port_status(mdev, MLX5_PORT_UP); done: if (!locked) PRIV_UNLOCK(priv); return (error); } static void mlx5e_update_carrier_work(struct work_struct *work) { struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv, update_carrier_work); PRIV_LOCK(priv); if (test_bit(MLX5E_STATE_OPENED, &priv->state)) mlx5e_update_carrier(priv); PRIV_UNLOCK(priv); } static void mlx5e_update_pport_counters(struct mlx5e_priv *priv) { struct mlx5_core_dev *mdev = priv->mdev; struct mlx5e_pport_stats *s = &priv->stats.pport; struct mlx5e_port_stats_debug *s_debug = &priv->stats.port_stats_debug; u32 *in; u32 *out; u64 *ptr; unsigned sz = MLX5_ST_SZ_BYTES(ppcnt_reg); unsigned x; unsigned y; in = mlx5_vzalloc(sz); out = mlx5_vzalloc(sz); if (in == NULL || out == NULL) goto free_out; ptr = (uint64_t *)MLX5_ADDR_OF(ppcnt_reg, out, counter_set); MLX5_SET(ppcnt_reg, in, local_port, 1); MLX5_SET(ppcnt_reg, in, grp, MLX5_IEEE_802_3_COUNTERS_GROUP); mlx5_core_access_reg(mdev, in, sz, out, sz, MLX5_REG_PPCNT, 0, 0); for (x = y = 0; x != MLX5E_PPORT_IEEE802_3_STATS_NUM; x++, y++) s->arg[y] = be64toh(ptr[x]); MLX5_SET(ppcnt_reg, in, grp, MLX5_RFC_2819_COUNTERS_GROUP); mlx5_core_access_reg(mdev, in, sz, out, sz, MLX5_REG_PPCNT, 0, 0); for (x = 0; x != MLX5E_PPORT_RFC2819_STATS_NUM; x++, y++) s->arg[y] = be64toh(ptr[x]); for (y = 0; x != MLX5E_PPORT_RFC2819_STATS_NUM + MLX5E_PPORT_RFC2819_STATS_DEBUG_NUM; x++, y++) s_debug->arg[y] = be64toh(ptr[x]); MLX5_SET(ppcnt_reg, in, grp, MLX5_RFC_2863_COUNTERS_GROUP); mlx5_core_access_reg(mdev, in, sz, out, sz, MLX5_REG_PPCNT, 0, 0); for (x = 0; x != MLX5E_PPORT_RFC2863_STATS_DEBUG_NUM; x++, y++) s_debug->arg[y] = be64toh(ptr[x]); MLX5_SET(ppcnt_reg, in, grp, MLX5_PHYSICAL_LAYER_COUNTERS_GROUP); mlx5_core_access_reg(mdev, in, sz, out, sz, MLX5_REG_PPCNT, 0, 0); for (x = 0; x != MLX5E_PPORT_PHYSICAL_LAYER_STATS_DEBUG_NUM; x++, y++) s_debug->arg[y] = be64toh(ptr[x]); free_out: kvfree(in); kvfree(out); } static void mlx5e_update_stats_work(struct work_struct *work) { struct mlx5e_priv *priv = container_of(work, struct mlx5e_priv, update_stats_work); struct mlx5_core_dev *mdev = priv->mdev; struct mlx5e_vport_stats *s = &priv->stats.vport; struct mlx5e_rq_stats *rq_stats; struct mlx5e_sq_stats *sq_stats; struct buf_ring *sq_br; #if (__FreeBSD_version < 1100000) struct ifnet *ifp = priv->ifp; #endif u32 in[MLX5_ST_SZ_DW(query_vport_counter_in)]; u32 *out; int outlen = MLX5_ST_SZ_BYTES(query_vport_counter_out); u64 tso_packets = 0; u64 tso_bytes = 0; u64 tx_queue_dropped = 0; u64 tx_defragged = 0; u64 tx_offload_none = 0; u64 lro_packets = 0; u64 lro_bytes = 0; u64 sw_lro_queued = 0; u64 sw_lro_flushed = 0; u64 rx_csum_none = 0; u64 rx_wqe_err = 0; u32 rx_out_of_buffer = 0; int i; int j; PRIV_LOCK(priv); out = mlx5_vzalloc(outlen); if (out == NULL) goto free_out; if (test_bit(MLX5E_STATE_OPENED, &priv->state) == 0) goto free_out; /* Collect firts the SW counters and then HW for consistency */ for (i = 0; i < priv->params.num_channels; i++) { struct mlx5e_rq *rq = &priv->channel[i]->rq; rq_stats = &priv->channel[i]->rq.stats; /* collect stats from LRO */ rq_stats->sw_lro_queued = rq->lro.lro_queued; rq_stats->sw_lro_flushed = rq->lro.lro_flushed; sw_lro_queued += rq_stats->sw_lro_queued; sw_lro_flushed += rq_stats->sw_lro_flushed; lro_packets += rq_stats->lro_packets; lro_bytes += rq_stats->lro_bytes; rx_csum_none += rq_stats->csum_none; rx_wqe_err += rq_stats->wqe_err; for (j = 0; j < priv->num_tc; j++) { sq_stats = &priv->channel[i]->sq[j].stats; sq_br = priv->channel[i]->sq[j].br; tso_packets += sq_stats->tso_packets; tso_bytes += sq_stats->tso_bytes; tx_queue_dropped += sq_stats->dropped; tx_queue_dropped += sq_br->br_drops; tx_defragged += sq_stats->defragged; tx_offload_none += sq_stats->csum_offload_none; } } /* update counters */ s->tso_packets = tso_packets; s->tso_bytes = tso_bytes; s->tx_queue_dropped = tx_queue_dropped; s->tx_defragged = tx_defragged; s->lro_packets = lro_packets; s->lro_bytes = lro_bytes; s->sw_lro_queued = sw_lro_queued; s->sw_lro_flushed = sw_lro_flushed; s->rx_csum_none = rx_csum_none; s->rx_wqe_err = rx_wqe_err; /* HW counters */ memset(in, 0, sizeof(in)); MLX5_SET(query_vport_counter_in, in, opcode, MLX5_CMD_OP_QUERY_VPORT_COUNTER); MLX5_SET(query_vport_counter_in, in, op_mod, 0); MLX5_SET(query_vport_counter_in, in, other_vport, 0); memset(out, 0, outlen); /* get number of out-of-buffer drops first */ if (mlx5_vport_query_out_of_rx_buffer(mdev, priv->counter_set_id, &rx_out_of_buffer)) goto free_out; /* accumulate difference into a 64-bit counter */ s->rx_out_of_buffer += (u64)(u32)(rx_out_of_buffer - s->rx_out_of_buffer_prev); s->rx_out_of_buffer_prev = rx_out_of_buffer; /* get port statistics */ if (mlx5_cmd_exec(mdev, in, sizeof(in), out, outlen)) goto free_out; #define MLX5_GET_CTR(out, x) \ MLX5_GET64(query_vport_counter_out, out, x) s->rx_error_packets = MLX5_GET_CTR(out, received_errors.packets); s->rx_error_bytes = MLX5_GET_CTR(out, received_errors.octets); s->tx_error_packets = MLX5_GET_CTR(out, transmit_errors.packets); s->tx_error_bytes = MLX5_GET_CTR(out, transmit_errors.octets); s->rx_unicast_packets = MLX5_GET_CTR(out, received_eth_unicast.packets); s->rx_unicast_bytes = MLX5_GET_CTR(out, received_eth_unicast.octets); s->tx_unicast_packets = MLX5_GET_CTR(out, transmitted_eth_unicast.packets); s->tx_unicast_bytes = MLX5_GET_CTR(out, transmitted_eth_unicast.octets); s->rx_multicast_packets = MLX5_GET_CTR(out, received_eth_multicast.packets); s->rx_multicast_bytes = MLX5_GET_CTR(out, received_eth_multicast.octets); s->tx_multicast_packets = MLX5_GET_CTR(out, transmitted_eth_multicast.packets); s->tx_multicast_bytes = MLX5_GET_CTR(out, transmitted_eth_multicast.octets); s->rx_broadcast_packets = MLX5_GET_CTR(out, received_eth_broadcast.packets); s->rx_broadcast_bytes = MLX5_GET_CTR(out, received_eth_broadcast.octets); s->tx_broadcast_packets = MLX5_GET_CTR(out, transmitted_eth_broadcast.packets); s->tx_broadcast_bytes = MLX5_GET_CTR(out, transmitted_eth_broadcast.octets); s->rx_packets = s->rx_unicast_packets + s->rx_multicast_packets + s->rx_broadcast_packets - s->rx_out_of_buffer; s->rx_bytes = s->rx_unicast_bytes + s->rx_multicast_bytes + s->rx_broadcast_bytes; s->tx_packets = s->tx_unicast_packets + s->tx_multicast_packets + s->tx_broadcast_packets; s->tx_bytes = s->tx_unicast_bytes + s->tx_multicast_bytes + s->tx_broadcast_bytes; /* Update calculated offload counters */ s->tx_csum_offload = s->tx_packets - tx_offload_none; s->rx_csum_good = s->rx_packets - s->rx_csum_none; /* Update per port counters */ mlx5e_update_pport_counters(priv); #if (__FreeBSD_version < 1100000) /* no get_counters interface in fbsd 10 */ ifp->if_ipackets = s->rx_packets; ifp->if_ierrors = s->rx_error_packets; ifp->if_iqdrops = s->rx_out_of_buffer; ifp->if_opackets = s->tx_packets; ifp->if_oerrors = s->tx_error_packets; ifp->if_snd.ifq_drops = s->tx_queue_dropped; ifp->if_ibytes = s->rx_bytes; ifp->if_obytes = s->tx_bytes; #endif free_out: kvfree(out); PRIV_UNLOCK(priv); } static void mlx5e_update_stats(void *arg) { struct mlx5e_priv *priv = arg; schedule_work(&priv->update_stats_work); callout_reset(&priv->watchdog, hz, &mlx5e_update_stats, priv); } static void mlx5e_async_event_sub(struct mlx5e_priv *priv, enum mlx5_dev_event event) { switch (event) { case MLX5_DEV_EVENT_PORT_UP: case MLX5_DEV_EVENT_PORT_DOWN: schedule_work(&priv->update_carrier_work); break; default: break; } } static void mlx5e_async_event(struct mlx5_core_dev *mdev, void *vpriv, enum mlx5_dev_event event, unsigned long param) { struct mlx5e_priv *priv = vpriv; mtx_lock(&priv->async_events_mtx); if (test_bit(MLX5E_STATE_ASYNC_EVENTS_ENABLE, &priv->state)) mlx5e_async_event_sub(priv, event); mtx_unlock(&priv->async_events_mtx); } static void mlx5e_enable_async_events(struct mlx5e_priv *priv) { set_bit(MLX5E_STATE_ASYNC_EVENTS_ENABLE, &priv->state); } static void mlx5e_disable_async_events(struct mlx5e_priv *priv) { mtx_lock(&priv->async_events_mtx); clear_bit(MLX5E_STATE_ASYNC_EVENTS_ENABLE, &priv->state); mtx_unlock(&priv->async_events_mtx); } static const char *mlx5e_rq_stats_desc[] = { MLX5E_RQ_STATS(MLX5E_STATS_DESC) }; static int mlx5e_create_rq(struct mlx5e_channel *c, struct mlx5e_rq_param *param, struct mlx5e_rq *rq) { struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; char buffer[16]; void *rqc = param->rqc; void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq); int wq_sz; int err; int i; /* Create DMA descriptor TAG */ if ((err = -bus_dma_tag_create( bus_get_dma_tag(mdev->pdev->dev.bsddev), 1, /* any alignment */ 0, /* no boundary */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MJUM16BYTES, /* maxsize */ 1, /* nsegments */ MJUM16BYTES, /* maxsegsize */ 0, /* flags */ NULL, NULL, /* lockfunc, lockfuncarg */ &rq->dma_tag))) goto done; err = mlx5_wq_ll_create(mdev, ¶m->wq, rqc_wq, &rq->wq, &rq->wq_ctrl); if (err) goto err_free_dma_tag; rq->wq.db = &rq->wq.db[MLX5_RCV_DBR]; if (priv->params.hw_lro_en) { rq->wqe_sz = priv->params.lro_wqe_sz; } else { rq->wqe_sz = MLX5E_SW2MB_MTU(priv->ifp->if_mtu); } if (rq->wqe_sz > MJUM16BYTES) { err = -ENOMEM; goto err_rq_wq_destroy; } else if (rq->wqe_sz > MJUM9BYTES) { rq->wqe_sz = MJUM16BYTES; } else if (rq->wqe_sz > MJUMPAGESIZE) { rq->wqe_sz = MJUM9BYTES; } else if (rq->wqe_sz > MCLBYTES) { rq->wqe_sz = MJUMPAGESIZE; } else { rq->wqe_sz = MCLBYTES; } wq_sz = mlx5_wq_ll_get_size(&rq->wq); rq->mbuf = malloc(wq_sz * sizeof(rq->mbuf[0]), M_MLX5EN, M_WAITOK | M_ZERO); if (rq->mbuf == NULL) { err = -ENOMEM; goto err_rq_wq_destroy; } for (i = 0; i != wq_sz; i++) { struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i); uint32_t byte_count = rq->wqe_sz - MLX5E_NET_IP_ALIGN; err = -bus_dmamap_create(rq->dma_tag, 0, &rq->mbuf[i].dma_map); if (err != 0) { while (i--) bus_dmamap_destroy(rq->dma_tag, rq->mbuf[i].dma_map); goto err_rq_mbuf_free; } wqe->data.lkey = c->mkey_be; wqe->data.byte_count = cpu_to_be32(byte_count | MLX5_HW_START_PADDING); } rq->pdev = c->pdev; rq->ifp = c->ifp; rq->channel = c; rq->ix = c->ix; snprintf(buffer, sizeof(buffer), "rxstat%d", c->ix); mlx5e_create_stats(&rq->stats.ctx, SYSCTL_CHILDREN(priv->sysctl_ifnet), buffer, mlx5e_rq_stats_desc, MLX5E_RQ_STATS_NUM, rq->stats.arg); #ifdef HAVE_TURBO_LRO if (tcp_tlro_init(&rq->lro, c->ifp, MLX5E_BUDGET_MAX) != 0) rq->lro.mbuf = NULL; #else if (tcp_lro_init(&rq->lro)) rq->lro.lro_cnt = 0; else rq->lro.ifp = c->ifp; #endif return (0); err_rq_mbuf_free: free(rq->mbuf, M_MLX5EN); err_rq_wq_destroy: mlx5_wq_destroy(&rq->wq_ctrl); err_free_dma_tag: bus_dma_tag_destroy(rq->dma_tag); done: return (err); } static void mlx5e_destroy_rq(struct mlx5e_rq *rq) { int wq_sz; int i; /* destroy all sysctl nodes */ sysctl_ctx_free(&rq->stats.ctx); /* free leftover LRO packets, if any */ #ifdef HAVE_TURBO_LRO tcp_tlro_free(&rq->lro); #else tcp_lro_free(&rq->lro); #endif wq_sz = mlx5_wq_ll_get_size(&rq->wq); for (i = 0; i != wq_sz; i++) { if (rq->mbuf[i].mbuf != NULL) { bus_dmamap_unload(rq->dma_tag, rq->mbuf[i].dma_map); m_freem(rq->mbuf[i].mbuf); } bus_dmamap_destroy(rq->dma_tag, rq->mbuf[i].dma_map); } free(rq->mbuf, M_MLX5EN); mlx5_wq_destroy(&rq->wq_ctrl); } static int mlx5e_enable_rq(struct mlx5e_rq *rq, struct mlx5e_rq_param *param) { struct mlx5e_channel *c = rq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; void *in; void *rqc; void *wq; int inlen; int err; inlen = MLX5_ST_SZ_BYTES(create_rq_in) + sizeof(u64) * rq->wq_ctrl.buf.npages; in = mlx5_vzalloc(inlen); if (in == NULL) return (-ENOMEM); rqc = MLX5_ADDR_OF(create_rq_in, in, ctx); wq = MLX5_ADDR_OF(rqc, rqc, wq); memcpy(rqc, param->rqc, sizeof(param->rqc)); MLX5_SET(rqc, rqc, cqn, c->rq.cq.mcq.cqn); MLX5_SET(rqc, rqc, state, MLX5_RQC_STATE_RST); MLX5_SET(rqc, rqc, flush_in_error_en, 1); if (priv->counter_set_id >= 0) MLX5_SET(rqc, rqc, counter_set_id, priv->counter_set_id); MLX5_SET(wq, wq, log_wq_pg_sz, rq->wq_ctrl.buf.page_shift - PAGE_SHIFT); MLX5_SET64(wq, wq, dbr_addr, rq->wq_ctrl.db.dma); mlx5_fill_page_array(&rq->wq_ctrl.buf, (__be64 *) MLX5_ADDR_OF(wq, wq, pas)); err = mlx5_core_create_rq(mdev, in, inlen, &rq->rqn); kvfree(in); return (err); } static int mlx5e_modify_rq(struct mlx5e_rq *rq, int curr_state, int next_state) { struct mlx5e_channel *c = rq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; void *in; void *rqc; int inlen; int err; inlen = MLX5_ST_SZ_BYTES(modify_rq_in); in = mlx5_vzalloc(inlen); if (in == NULL) return (-ENOMEM); rqc = MLX5_ADDR_OF(modify_rq_in, in, ctx); MLX5_SET(modify_rq_in, in, rqn, rq->rqn); MLX5_SET(modify_rq_in, in, rq_state, curr_state); MLX5_SET(rqc, rqc, state, next_state); err = mlx5_core_modify_rq(mdev, in, inlen); kvfree(in); return (err); } static void mlx5e_disable_rq(struct mlx5e_rq *rq) { struct mlx5e_channel *c = rq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; mlx5_core_destroy_rq(mdev, rq->rqn); } static int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq) { struct mlx5e_channel *c = rq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_wq_ll *wq = &rq->wq; int i; for (i = 0; i < 1000; i++) { if (wq->cur_sz >= priv->params.min_rx_wqes) return (0); msleep(4); } return (-ETIMEDOUT); } static int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_rq_param *param, struct mlx5e_rq *rq) { int err; err = mlx5e_create_rq(c, param, rq); if (err) return (err); err = mlx5e_enable_rq(rq, param); if (err) goto err_destroy_rq; err = mlx5e_modify_rq(rq, MLX5_RQC_STATE_RST, MLX5_RQC_STATE_RDY); if (err) goto err_disable_rq; c->rq.enabled = 1; return (0); err_disable_rq: mlx5e_disable_rq(rq); err_destroy_rq: mlx5e_destroy_rq(rq); return (err); } static void mlx5e_close_rq(struct mlx5e_rq *rq) { rq->enabled = 0; mlx5e_modify_rq(rq, MLX5_RQC_STATE_RDY, MLX5_RQC_STATE_ERR); } static void mlx5e_close_rq_wait(struct mlx5e_rq *rq) { /* wait till RQ is empty */ while (!mlx5_wq_ll_is_empty(&rq->wq)) { msleep(4); rq->cq.mcq.comp(&rq->cq.mcq); } mlx5e_disable_rq(rq); mlx5e_destroy_rq(rq); } static void mlx5e_free_sq_db(struct mlx5e_sq *sq) { int wq_sz = mlx5_wq_cyc_get_size(&sq->wq); int x; for (x = 0; x != wq_sz; x++) bus_dmamap_destroy(sq->dma_tag, sq->mbuf[x].dma_map); free(sq->mbuf, M_MLX5EN); } static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq) { int wq_sz = mlx5_wq_cyc_get_size(&sq->wq); int err; int x; sq->mbuf = malloc(wq_sz * sizeof(sq->mbuf[0]), M_MLX5EN, M_WAITOK | M_ZERO); if (sq->mbuf == NULL) return (-ENOMEM); /* Create DMA descriptor MAPs */ for (x = 0; x != wq_sz; x++) { err = -bus_dmamap_create(sq->dma_tag, 0, &sq->mbuf[x].dma_map); if (err != 0) { while (x--) bus_dmamap_destroy(sq->dma_tag, sq->mbuf[x].dma_map); free(sq->mbuf, M_MLX5EN); return (err); } } return (0); } static const char *mlx5e_sq_stats_desc[] = { MLX5E_SQ_STATS(MLX5E_STATS_DESC) }; static int mlx5e_create_sq(struct mlx5e_channel *c, int tc, struct mlx5e_sq_param *param, struct mlx5e_sq *sq) { struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; char buffer[16]; void *sqc = param->sqc; void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq); #ifdef RSS cpuset_t cpu_mask; int cpu_id; #endif int err; /* Create DMA descriptor TAG */ if ((err = -bus_dma_tag_create( bus_get_dma_tag(mdev->pdev->dev.bsddev), 1, /* any alignment */ 0, /* no boundary */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MLX5E_MAX_TX_PAYLOAD_SIZE, /* maxsize */ MLX5E_MAX_TX_MBUF_FRAGS, /* nsegments */ MLX5E_MAX_TX_MBUF_SIZE, /* maxsegsize */ 0, /* flags */ NULL, NULL, /* lockfunc, lockfuncarg */ &sq->dma_tag))) goto done; err = mlx5_alloc_map_uar(mdev, &sq->uar); if (err) goto err_free_dma_tag; err = mlx5_wq_cyc_create(mdev, ¶m->wq, sqc_wq, &sq->wq, &sq->wq_ctrl); if (err) goto err_unmap_free_uar; sq->wq.db = &sq->wq.db[MLX5_SND_DBR]; sq->uar_map = sq->uar.map; sq->uar_bf_map = sq->uar.bf_map; sq->bf_buf_size = (1 << MLX5_CAP_GEN(mdev, log_bf_reg_size)) / 2; err = mlx5e_alloc_sq_db(sq); if (err) goto err_sq_wq_destroy; sq->pdev = c->pdev; sq->mkey_be = c->mkey_be; sq->channel = c; sq->tc = tc; sq->br = buf_ring_alloc(MLX5E_SQ_TX_QUEUE_SIZE, M_MLX5EN, M_WAITOK, &sq->lock); if (sq->br == NULL) { if_printf(c->ifp, "%s: Failed allocating sq drbr buffer\n", __func__); err = -ENOMEM; goto err_free_sq_db; } sq->sq_tq = taskqueue_create_fast("mlx5e_que", M_WAITOK, taskqueue_thread_enqueue, &sq->sq_tq); if (sq->sq_tq == NULL) { if_printf(c->ifp, "%s: Failed allocating taskqueue\n", __func__); err = -ENOMEM; goto err_free_drbr; } TASK_INIT(&sq->sq_task, 0, mlx5e_tx_que, sq); #ifdef RSS cpu_id = rss_getcpu(c->ix % rss_getnumbuckets()); CPU_SETOF(cpu_id, &cpu_mask); taskqueue_start_threads_cpuset(&sq->sq_tq, 1, PI_NET, &cpu_mask, "%s TX SQ%d.%d CPU%d", c->ifp->if_xname, c->ix, tc, cpu_id); #else taskqueue_start_threads(&sq->sq_tq, 1, PI_NET, "%s TX SQ%d.%d", c->ifp->if_xname, c->ix, tc); #endif snprintf(buffer, sizeof(buffer), "txstat%dtc%d", c->ix, tc); mlx5e_create_stats(&sq->stats.ctx, SYSCTL_CHILDREN(priv->sysctl_ifnet), buffer, mlx5e_sq_stats_desc, MLX5E_SQ_STATS_NUM, sq->stats.arg); return (0); err_free_drbr: buf_ring_free(sq->br, M_MLX5EN); err_free_sq_db: mlx5e_free_sq_db(sq); err_sq_wq_destroy: mlx5_wq_destroy(&sq->wq_ctrl); err_unmap_free_uar: mlx5_unmap_free_uar(mdev, &sq->uar); err_free_dma_tag: bus_dma_tag_destroy(sq->dma_tag); done: return (err); } static void mlx5e_destroy_sq(struct mlx5e_sq *sq) { struct mlx5e_channel *c = sq->channel; struct mlx5e_priv *priv = c->priv; /* destroy all sysctl nodes */ sysctl_ctx_free(&sq->stats.ctx); mlx5e_free_sq_db(sq); mlx5_wq_destroy(&sq->wq_ctrl); mlx5_unmap_free_uar(priv->mdev, &sq->uar); taskqueue_drain(sq->sq_tq, &sq->sq_task); taskqueue_free(sq->sq_tq); buf_ring_free(sq->br, M_MLX5EN); } static int mlx5e_enable_sq(struct mlx5e_sq *sq, struct mlx5e_sq_param *param) { struct mlx5e_channel *c = sq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; void *in; void *sqc; void *wq; int inlen; int err; inlen = MLX5_ST_SZ_BYTES(create_sq_in) + sizeof(u64) * sq->wq_ctrl.buf.npages; in = mlx5_vzalloc(inlen); if (in == NULL) return (-ENOMEM); sqc = MLX5_ADDR_OF(create_sq_in, in, ctx); wq = MLX5_ADDR_OF(sqc, sqc, wq); memcpy(sqc, param->sqc, sizeof(param->sqc)); MLX5_SET(sqc, sqc, tis_num_0, priv->tisn[sq->tc]); MLX5_SET(sqc, sqc, cqn, c->sq[sq->tc].cq.mcq.cqn); MLX5_SET(sqc, sqc, state, MLX5_SQC_STATE_RST); MLX5_SET(sqc, sqc, tis_lst_sz, 1); MLX5_SET(sqc, sqc, flush_in_error_en, 1); MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_CYCLIC); MLX5_SET(wq, wq, uar_page, sq->uar.index); MLX5_SET(wq, wq, log_wq_pg_sz, sq->wq_ctrl.buf.page_shift - PAGE_SHIFT); MLX5_SET64(wq, wq, dbr_addr, sq->wq_ctrl.db.dma); mlx5_fill_page_array(&sq->wq_ctrl.buf, (__be64 *) MLX5_ADDR_OF(wq, wq, pas)); err = mlx5_core_create_sq(mdev, in, inlen, &sq->sqn); kvfree(in); return (err); } static int mlx5e_modify_sq(struct mlx5e_sq *sq, int curr_state, int next_state) { struct mlx5e_channel *c = sq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; void *in; void *sqc; int inlen; int err; inlen = MLX5_ST_SZ_BYTES(modify_sq_in); in = mlx5_vzalloc(inlen); if (in == NULL) return (-ENOMEM); sqc = MLX5_ADDR_OF(modify_sq_in, in, ctx); MLX5_SET(modify_sq_in, in, sqn, sq->sqn); MLX5_SET(modify_sq_in, in, sq_state, curr_state); MLX5_SET(sqc, sqc, state, next_state); err = mlx5_core_modify_sq(mdev, in, inlen); kvfree(in); return (err); } static void mlx5e_disable_sq(struct mlx5e_sq *sq) { struct mlx5e_channel *c = sq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; mlx5_core_destroy_sq(mdev, sq->sqn); } static int mlx5e_open_sq(struct mlx5e_channel *c, int tc, struct mlx5e_sq_param *param, struct mlx5e_sq *sq) { int err; err = mlx5e_create_sq(c, tc, param, sq); if (err) return (err); err = mlx5e_enable_sq(sq, param); if (err) goto err_destroy_sq; err = mlx5e_modify_sq(sq, MLX5_SQC_STATE_RST, MLX5_SQC_STATE_RDY); if (err) goto err_disable_sq; atomic_store_rel_int(&sq->queue_state, MLX5E_SQ_READY); return (0); err_disable_sq: mlx5e_disable_sq(sq); err_destroy_sq: mlx5e_destroy_sq(sq); return (err); } static void mlx5e_sq_send_nops_locked(struct mlx5e_sq *sq, int can_sleep) { /* fill up remainder with NOPs */ while (sq->cev_counter != 0) { while (!mlx5e_sq_has_room_for(sq, 1)) { if (can_sleep != 0) { mtx_unlock(&sq->lock); msleep(4); mtx_lock(&sq->lock); } else { goto done; } } /* send a single NOP */ mlx5e_send_nop(sq, 1); wmb(); } done: /* Check if we need to write the doorbell */ if (likely(sq->doorbell.d64 != 0)) { mlx5e_tx_notify_hw(sq, sq->doorbell.d32, 0); sq->doorbell.d64 = 0; } return; } void mlx5e_sq_cev_timeout(void *arg) { struct mlx5e_sq *sq = arg; mtx_assert(&sq->lock, MA_OWNED); /* check next state */ switch (sq->cev_next_state) { case MLX5E_CEV_STATE_SEND_NOPS: /* fill TX ring with NOPs, if any */ mlx5e_sq_send_nops_locked(sq, 0); /* check if completed */ if (sq->cev_counter == 0) { sq->cev_next_state = MLX5E_CEV_STATE_INITIAL; return; } break; default: /* send NOPs on next timeout */ sq->cev_next_state = MLX5E_CEV_STATE_SEND_NOPS; break; } /* restart timer */ callout_reset_curcpu(&sq->cev_callout, hz, mlx5e_sq_cev_timeout, sq); } static void mlx5e_close_sq_wait(struct mlx5e_sq *sq) { mtx_lock(&sq->lock); /* teardown event factor timer, if any */ sq->cev_next_state = MLX5E_CEV_STATE_HOLD_NOPS; callout_stop(&sq->cev_callout); /* send dummy NOPs in order to flush the transmit ring */ mlx5e_sq_send_nops_locked(sq, 1); mtx_unlock(&sq->lock); /* make sure it is safe to free the callout */ callout_drain(&sq->cev_callout); /* error out remaining requests */ mlx5e_modify_sq(sq, MLX5_SQC_STATE_RDY, MLX5_SQC_STATE_ERR); /* wait till SQ is empty */ mtx_lock(&sq->lock); while (sq->cc != sq->pc) { mtx_unlock(&sq->lock); msleep(4); sq->cq.mcq.comp(&sq->cq.mcq); mtx_lock(&sq->lock); } mtx_unlock(&sq->lock); mlx5e_disable_sq(sq); mlx5e_destroy_sq(sq); } static int mlx5e_create_cq(struct mlx5e_channel *c, struct mlx5e_cq_param *param, struct mlx5e_cq *cq, mlx5e_cq_comp_t *comp) { struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; struct mlx5_core_cq *mcq = &cq->mcq; int eqn_not_used; int irqn; int err; u32 i; param->wq.buf_numa_node = 0; param->wq.db_numa_node = 0; param->eq_ix = c->ix; err = mlx5_cqwq_create(mdev, ¶m->wq, param->cqc, &cq->wq, &cq->wq_ctrl); if (err) return (err); mlx5_vector2eqn(mdev, param->eq_ix, &eqn_not_used, &irqn); mcq->cqe_sz = 64; mcq->set_ci_db = cq->wq_ctrl.db.db; mcq->arm_db = cq->wq_ctrl.db.db + 1; *mcq->set_ci_db = 0; *mcq->arm_db = 0; mcq->vector = param->eq_ix; mcq->comp = comp; mcq->event = mlx5e_cq_error_event; mcq->irqn = irqn; mcq->uar = &priv->cq_uar; for (i = 0; i < mlx5_cqwq_get_size(&cq->wq); i++) { struct mlx5_cqe64 *cqe = mlx5_cqwq_get_wqe(&cq->wq, i); cqe->op_own = 0xf1; } cq->channel = c; return (0); } static void mlx5e_destroy_cq(struct mlx5e_cq *cq) { mlx5_wq_destroy(&cq->wq_ctrl); } static int mlx5e_enable_cq(struct mlx5e_cq *cq, struct mlx5e_cq_param *param, u8 moderation_mode) { struct mlx5e_channel *c = cq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; struct mlx5_core_cq *mcq = &cq->mcq; void *in; void *cqc; int inlen; int irqn_not_used; int eqn; int err; inlen = MLX5_ST_SZ_BYTES(create_cq_in) + sizeof(u64) * cq->wq_ctrl.buf.npages; in = mlx5_vzalloc(inlen); if (in == NULL) return (-ENOMEM); cqc = MLX5_ADDR_OF(create_cq_in, in, cq_context); memcpy(cqc, param->cqc, sizeof(param->cqc)); mlx5_fill_page_array(&cq->wq_ctrl.buf, (__be64 *) MLX5_ADDR_OF(create_cq_in, in, pas)); mlx5_vector2eqn(mdev, param->eq_ix, &eqn, &irqn_not_used); MLX5_SET(cqc, cqc, cq_period_mode, moderation_mode); MLX5_SET(cqc, cqc, c_eqn, eqn); MLX5_SET(cqc, cqc, uar_page, mcq->uar->index); MLX5_SET(cqc, cqc, log_page_size, cq->wq_ctrl.buf.page_shift - PAGE_SHIFT); MLX5_SET64(cqc, cqc, dbr_addr, cq->wq_ctrl.db.dma); err = mlx5_core_create_cq(mdev, mcq, in, inlen); kvfree(in); if (err) return (err); mlx5e_cq_arm(cq); return (0); } static void mlx5e_disable_cq(struct mlx5e_cq *cq) { struct mlx5e_channel *c = cq->channel; struct mlx5e_priv *priv = c->priv; struct mlx5_core_dev *mdev = priv->mdev; mlx5_core_destroy_cq(mdev, &cq->mcq); } static int mlx5e_open_cq(struct mlx5e_channel *c, struct mlx5e_cq_param *param, struct mlx5e_cq *cq, mlx5e_cq_comp_t *comp, u8 moderation_mode) { int err; err = mlx5e_create_cq(c, param, cq, comp); if (err) return (err); err = mlx5e_enable_cq(cq, param, moderation_mode); if (err) goto err_destroy_cq; return (0); err_destroy_cq: mlx5e_destroy_cq(cq); return (err); } static void mlx5e_close_cq(struct mlx5e_cq *cq) { mlx5e_disable_cq(cq); mlx5e_destroy_cq(cq); } static int mlx5e_open_tx_cqs(struct mlx5e_channel *c, struct mlx5e_channel_param *cparam) { u8 tx_moderation_mode; int err; int tc; switch (c->priv->params.tx_cq_moderation_mode) { case 0: tx_moderation_mode = MLX5_CQ_PERIOD_MODE_START_FROM_EQE; break; default: if (MLX5_CAP_GEN(c->priv->mdev, cq_period_start_from_cqe)) tx_moderation_mode = MLX5_CQ_PERIOD_MODE_START_FROM_CQE; else tx_moderation_mode = MLX5_CQ_PERIOD_MODE_START_FROM_EQE; break; } for (tc = 0; tc < c->num_tc; tc++) { /* open completion queue */ err = mlx5e_open_cq(c, &cparam->tx_cq, &c->sq[tc].cq, &mlx5e_tx_cq_comp, tx_moderation_mode); if (err) goto err_close_tx_cqs; } return (0); err_close_tx_cqs: for (tc--; tc >= 0; tc--) mlx5e_close_cq(&c->sq[tc].cq); return (err); } static void mlx5e_close_tx_cqs(struct mlx5e_channel *c) { int tc; for (tc = 0; tc < c->num_tc; tc++) mlx5e_close_cq(&c->sq[tc].cq); } static int mlx5e_open_sqs(struct mlx5e_channel *c, struct mlx5e_channel_param *cparam) { int err; int tc; for (tc = 0; tc < c->num_tc; tc++) { err = mlx5e_open_sq(c, tc, &cparam->sq, &c->sq[tc]); if (err) goto err_close_sqs; } return (0); err_close_sqs: for (tc--; tc >= 0; tc--) mlx5e_close_sq_wait(&c->sq[tc]); return (err); } static void mlx5e_close_sqs_wait(struct mlx5e_channel *c) { int tc; for (tc = 0; tc < c->num_tc; tc++) mlx5e_close_sq_wait(&c->sq[tc]); } static void mlx5e_chan_mtx_init(struct mlx5e_channel *c) { int tc; mtx_init(&c->rq.mtx, "mlx5rx", MTX_NETWORK_LOCK, MTX_DEF); for (tc = 0; tc < c->num_tc; tc++) { struct mlx5e_sq *sq = c->sq + tc; mtx_init(&sq->lock, "mlx5tx", MTX_NETWORK_LOCK, MTX_DEF); mtx_init(&sq->comp_lock, "mlx5comp", MTX_NETWORK_LOCK, MTX_DEF); callout_init_mtx(&sq->cev_callout, &sq->lock, 0); sq->cev_factor = c->priv->params_ethtool.tx_completion_fact; /* ensure the TX completion event factor is not zero */ if (sq->cev_factor == 0) sq->cev_factor = 1; } } static void mlx5e_chan_mtx_destroy(struct mlx5e_channel *c) { int tc; mtx_destroy(&c->rq.mtx); for (tc = 0; tc < c->num_tc; tc++) { mtx_destroy(&c->sq[tc].lock); mtx_destroy(&c->sq[tc].comp_lock); } } static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix, struct mlx5e_channel_param *cparam, struct mlx5e_channel *volatile *cp) { struct mlx5e_channel *c; u8 rx_moderation_mode; int err; c = malloc(sizeof(*c), M_MLX5EN, M_WAITOK | M_ZERO); if (c == NULL) return (-ENOMEM); c->priv = priv; c->ix = ix; c->cpu = 0; c->pdev = &priv->mdev->pdev->dev; c->ifp = priv->ifp; c->mkey_be = cpu_to_be32(priv->mr.key); c->num_tc = priv->num_tc; /* init mutexes */ mlx5e_chan_mtx_init(c); /* open transmit completion queue */ err = mlx5e_open_tx_cqs(c, cparam); if (err) goto err_free; switch (priv->params.rx_cq_moderation_mode) { case 0: rx_moderation_mode = MLX5_CQ_PERIOD_MODE_START_FROM_EQE; break; default: if (MLX5_CAP_GEN(priv->mdev, cq_period_start_from_cqe)) rx_moderation_mode = MLX5_CQ_PERIOD_MODE_START_FROM_CQE; else rx_moderation_mode = MLX5_CQ_PERIOD_MODE_START_FROM_EQE; break; } /* open receive completion queue */ err = mlx5e_open_cq(c, &cparam->rx_cq, &c->rq.cq, &mlx5e_rx_cq_comp, rx_moderation_mode); if (err) goto err_close_tx_cqs; err = mlx5e_open_sqs(c, cparam); if (err) goto err_close_rx_cq; err = mlx5e_open_rq(c, &cparam->rq, &c->rq); if (err) goto err_close_sqs; /* store channel pointer */ *cp = c; /* poll receive queue initially */ c->rq.cq.mcq.comp(&c->rq.cq.mcq); return (0); err_close_sqs: mlx5e_close_sqs_wait(c); err_close_rx_cq: mlx5e_close_cq(&c->rq.cq); err_close_tx_cqs: mlx5e_close_tx_cqs(c); err_free: /* destroy mutexes */ mlx5e_chan_mtx_destroy(c); free(c, M_MLX5EN); return (err); } static void mlx5e_close_channel(struct mlx5e_channel *volatile *pp) { struct mlx5e_channel *c = *pp; /* check if channel is already closed */ if (c == NULL) return; mlx5e_close_rq(&c->rq); } static void mlx5e_close_channel_wait(struct mlx5e_channel *volatile *pp) { struct mlx5e_channel *c = *pp; /* check if channel is already closed */ if (c == NULL) return; /* ensure channel pointer is no longer used */ *pp = NULL; mlx5e_close_rq_wait(&c->rq); mlx5e_close_sqs_wait(c); mlx5e_close_cq(&c->rq.cq); mlx5e_close_tx_cqs(c); /* destroy mutexes */ mlx5e_chan_mtx_destroy(c); free(c, M_MLX5EN); } static void mlx5e_build_rq_param(struct mlx5e_priv *priv, struct mlx5e_rq_param *param) { void *rqc = param->rqc; void *wq = MLX5_ADDR_OF(rqc, rqc, wq); MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_LINKED_LIST); MLX5_SET(wq, wq, end_padding_mode, MLX5_WQ_END_PAD_MODE_ALIGN); MLX5_SET(wq, wq, log_wq_stride, ilog2(sizeof(struct mlx5e_rx_wqe))); MLX5_SET(wq, wq, log_wq_sz, priv->params.log_rq_size); MLX5_SET(wq, wq, pd, priv->pdn); param->wq.buf_numa_node = 0; param->wq.db_numa_node = 0; param->wq.linear = 1; } static void mlx5e_build_sq_param(struct mlx5e_priv *priv, struct mlx5e_sq_param *param) { void *sqc = param->sqc; void *wq = MLX5_ADDR_OF(sqc, sqc, wq); MLX5_SET(wq, wq, log_wq_sz, priv->params.log_sq_size); MLX5_SET(wq, wq, log_wq_stride, ilog2(MLX5_SEND_WQE_BB)); MLX5_SET(wq, wq, pd, priv->pdn); param->wq.buf_numa_node = 0; param->wq.db_numa_node = 0; param->wq.linear = 1; } static void mlx5e_build_common_cq_param(struct mlx5e_priv *priv, struct mlx5e_cq_param *param) { void *cqc = param->cqc; MLX5_SET(cqc, cqc, uar_page, priv->cq_uar.index); } static void mlx5e_build_rx_cq_param(struct mlx5e_priv *priv, struct mlx5e_cq_param *param) { void *cqc = param->cqc; /* * TODO The sysctl to control on/off is a bool value for now, which means * we only support CSUM, once HASH is implemnted we'll need to address that. */ if (priv->params.cqe_zipping_en) { MLX5_SET(cqc, cqc, mini_cqe_res_format, MLX5_CQE_FORMAT_CSUM); MLX5_SET(cqc, cqc, cqe_compression_en, 1); } MLX5_SET(cqc, cqc, log_cq_size, priv->params.log_rq_size); MLX5_SET(cqc, cqc, cq_period, priv->params.rx_cq_moderation_usec); MLX5_SET(cqc, cqc, cq_max_count, priv->params.rx_cq_moderation_pkts); mlx5e_build_common_cq_param(priv, param); } static void mlx5e_build_tx_cq_param(struct mlx5e_priv *priv, struct mlx5e_cq_param *param) { void *cqc = param->cqc; MLX5_SET(cqc, cqc, log_cq_size, priv->params.log_sq_size); MLX5_SET(cqc, cqc, cq_period, priv->params.tx_cq_moderation_usec); MLX5_SET(cqc, cqc, cq_max_count, priv->params.tx_cq_moderation_pkts); mlx5e_build_common_cq_param(priv, param); } static void mlx5e_build_channel_param(struct mlx5e_priv *priv, struct mlx5e_channel_param *cparam) { memset(cparam, 0, sizeof(*cparam)); mlx5e_build_rq_param(priv, &cparam->rq); mlx5e_build_sq_param(priv, &cparam->sq); mlx5e_build_rx_cq_param(priv, &cparam->rx_cq); mlx5e_build_tx_cq_param(priv, &cparam->tx_cq); } static int mlx5e_open_channels(struct mlx5e_priv *priv) { struct mlx5e_channel_param cparam; void *ptr; int err; int i; int j; priv->channel = malloc(priv->params.num_channels * sizeof(struct mlx5e_channel *), M_MLX5EN, M_WAITOK | M_ZERO); if (priv->channel == NULL) return (-ENOMEM); mlx5e_build_channel_param(priv, &cparam); for (i = 0; i < priv->params.num_channels; i++) { err = mlx5e_open_channel(priv, i, &cparam, &priv->channel[i]); if (err) goto err_close_channels; } for (j = 0; j < priv->params.num_channels; j++) { err = mlx5e_wait_for_min_rx_wqes(&priv->channel[j]->rq); if (err) goto err_close_channels; } return (0); err_close_channels: for (i--; i >= 0; i--) { mlx5e_close_channel(&priv->channel[i]); mlx5e_close_channel_wait(&priv->channel[i]); } /* remove "volatile" attribute from "channel" pointer */ ptr = __DECONST(void *, priv->channel); priv->channel = NULL; free(ptr, M_MLX5EN); return (err); } static void mlx5e_close_channels(struct mlx5e_priv *priv) { void *ptr; int i; if (priv->channel == NULL) return; for (i = 0; i < priv->params.num_channels; i++) mlx5e_close_channel(&priv->channel[i]); for (i = 0; i < priv->params.num_channels; i++) mlx5e_close_channel_wait(&priv->channel[i]); /* remove "volatile" attribute from "channel" pointer */ ptr = __DECONST(void *, priv->channel); priv->channel = NULL; free(ptr, M_MLX5EN); } static int mlx5e_refresh_sq_params(struct mlx5e_priv *priv, struct mlx5e_sq *sq) { return (mlx5_core_modify_cq_moderation(priv->mdev, &sq->cq.mcq, priv->params.tx_cq_moderation_usec, priv->params.tx_cq_moderation_pkts)); } static int mlx5e_refresh_rq_params(struct mlx5e_priv *priv, struct mlx5e_rq *rq) { return (mlx5_core_modify_cq_moderation(priv->mdev, &rq->cq.mcq, priv->params.rx_cq_moderation_usec, priv->params.rx_cq_moderation_pkts)); } static int mlx5e_refresh_channel_params_sub(struct mlx5e_priv *priv, struct mlx5e_channel *c) { int err; int i; if (c == NULL) return (EINVAL); err = mlx5e_refresh_rq_params(priv, &c->rq); if (err) goto done; for (i = 0; i != c->num_tc; i++) { err = mlx5e_refresh_sq_params(priv, &c->sq[i]); if (err) goto done; } done: return (err); } int mlx5e_refresh_channel_params(struct mlx5e_priv *priv) { int i; if (priv->channel == NULL) return (EINVAL); for (i = 0; i < priv->params.num_channels; i++) { int err; err = mlx5e_refresh_channel_params_sub(priv, priv->channel[i]); if (err) return (err); } return (0); } static int mlx5e_open_tis(struct mlx5e_priv *priv, int tc) { struct mlx5_core_dev *mdev = priv->mdev; u32 in[MLX5_ST_SZ_DW(create_tis_in)]; void *tisc = MLX5_ADDR_OF(create_tis_in, in, ctx); memset(in, 0, sizeof(in)); MLX5_SET(tisc, tisc, prio, tc); MLX5_SET(tisc, tisc, transport_domain, priv->tdn); return (mlx5_core_create_tis(mdev, in, sizeof(in), &priv->tisn[tc])); } static void mlx5e_close_tis(struct mlx5e_priv *priv, int tc) { mlx5_core_destroy_tis(priv->mdev, priv->tisn[tc]); } static int mlx5e_open_tises(struct mlx5e_priv *priv) { int num_tc = priv->num_tc; int err; int tc; for (tc = 0; tc < num_tc; tc++) { err = mlx5e_open_tis(priv, tc); if (err) goto err_close_tises; } return (0); err_close_tises: for (tc--; tc >= 0; tc--) mlx5e_close_tis(priv, tc); return (err); } static void mlx5e_close_tises(struct mlx5e_priv *priv) { int num_tc = priv->num_tc; int tc; for (tc = 0; tc < num_tc; tc++) mlx5e_close_tis(priv, tc); } static int mlx5e_open_rqt(struct mlx5e_priv *priv) { struct mlx5_core_dev *mdev = priv->mdev; u32 *in; u32 out[MLX5_ST_SZ_DW(create_rqt_out)]; void *rqtc; int inlen; int err; int sz; int i; sz = 1 << priv->params.rx_hash_log_tbl_sz; inlen = MLX5_ST_SZ_BYTES(create_rqt_in) + sizeof(u32) * sz; in = mlx5_vzalloc(inlen); if (in == NULL) return (-ENOMEM); rqtc = MLX5_ADDR_OF(create_rqt_in, in, rqt_context); MLX5_SET(rqtc, rqtc, rqt_actual_size, sz); MLX5_SET(rqtc, rqtc, rqt_max_size, sz); for (i = 0; i < sz; i++) { int ix; #ifdef RSS ix = rss_get_indirection_to_bucket(i); #else ix = i; #endif /* ensure we don't overflow */ ix %= priv->params.num_channels; MLX5_SET(rqtc, rqtc, rq_num[i], priv->channel[ix]->rq.rqn); } MLX5_SET(create_rqt_in, in, opcode, MLX5_CMD_OP_CREATE_RQT); memset(out, 0, sizeof(out)); err = mlx5_cmd_exec_check_status(mdev, in, inlen, out, sizeof(out)); if (!err) priv->rqtn = MLX5_GET(create_rqt_out, out, rqtn); kvfree(in); return (err); } static void mlx5e_close_rqt(struct mlx5e_priv *priv) { u32 in[MLX5_ST_SZ_DW(destroy_rqt_in)]; u32 out[MLX5_ST_SZ_DW(destroy_rqt_out)]; memset(in, 0, sizeof(in)); MLX5_SET(destroy_rqt_in, in, opcode, MLX5_CMD_OP_DESTROY_RQT); MLX5_SET(destroy_rqt_in, in, rqtn, priv->rqtn); mlx5_cmd_exec_check_status(priv->mdev, in, sizeof(in), out, sizeof(out)); } static void mlx5e_build_tir_ctx(struct mlx5e_priv *priv, u32 * tirc, int tt) { void *hfso = MLX5_ADDR_OF(tirc, tirc, rx_hash_field_selector_outer); __be32 *hkey; MLX5_SET(tirc, tirc, transport_domain, priv->tdn); #define ROUGH_MAX_L2_L3_HDR_SZ 256 #define MLX5_HASH_IP (MLX5_HASH_FIELD_SEL_SRC_IP |\ MLX5_HASH_FIELD_SEL_DST_IP) #define MLX5_HASH_ALL (MLX5_HASH_FIELD_SEL_SRC_IP |\ MLX5_HASH_FIELD_SEL_DST_IP |\ MLX5_HASH_FIELD_SEL_L4_SPORT |\ MLX5_HASH_FIELD_SEL_L4_DPORT) #define MLX5_HASH_IP_IPSEC_SPI (MLX5_HASH_FIELD_SEL_SRC_IP |\ MLX5_HASH_FIELD_SEL_DST_IP |\ MLX5_HASH_FIELD_SEL_IPSEC_SPI) if (priv->params.hw_lro_en) { MLX5_SET(tirc, tirc, lro_enable_mask, MLX5_TIRC_LRO_ENABLE_MASK_IPV4_LRO | MLX5_TIRC_LRO_ENABLE_MASK_IPV6_LRO); MLX5_SET(tirc, tirc, lro_max_msg_sz, (priv->params.lro_wqe_sz - ROUGH_MAX_L2_L3_HDR_SZ) >> 8); /* TODO: add the option to choose timer value dynamically */ MLX5_SET(tirc, tirc, lro_timeout_period_usecs, MLX5_CAP_ETH(priv->mdev, lro_timer_supported_periods[2])); } /* setup parameters for hashing TIR type, if any */ switch (tt) { case MLX5E_TT_ANY: MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_DIRECT); MLX5_SET(tirc, tirc, inline_rqn, priv->channel[0]->rq.rqn); break; default: MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT); MLX5_SET(tirc, tirc, indirect_table, priv->rqtn); MLX5_SET(tirc, tirc, rx_hash_fn, MLX5_TIRC_RX_HASH_FN_HASH_TOEPLITZ); hkey = (__be32 *) MLX5_ADDR_OF(tirc, tirc, rx_hash_toeplitz_key); #ifdef RSS /* * The FreeBSD RSS implementation does currently not * support symmetric Toeplitz hashes: */ MLX5_SET(tirc, tirc, rx_hash_symmetric, 0); rss_getkey((uint8_t *)hkey); #else MLX5_SET(tirc, tirc, rx_hash_symmetric, 1); hkey[0] = cpu_to_be32(0xD181C62C); hkey[1] = cpu_to_be32(0xF7F4DB5B); hkey[2] = cpu_to_be32(0x1983A2FC); hkey[3] = cpu_to_be32(0x943E1ADB); hkey[4] = cpu_to_be32(0xD9389E6B); hkey[5] = cpu_to_be32(0xD1039C2C); hkey[6] = cpu_to_be32(0xA74499AD); hkey[7] = cpu_to_be32(0x593D56D9); hkey[8] = cpu_to_be32(0xF3253C06); hkey[9] = cpu_to_be32(0x2ADC1FFC); #endif break; } switch (tt) { case MLX5E_TT_IPV4_TCP: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV4); MLX5_SET(rx_hash_field_select, hfso, l4_prot_type, MLX5_L4_PROT_TYPE_TCP); #ifdef RSS if (!(rss_gethashconfig() & RSS_HASHTYPE_RSS_TCP_IPV4)) { MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP); } else #endif MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_ALL); break; case MLX5E_TT_IPV6_TCP: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV6); MLX5_SET(rx_hash_field_select, hfso, l4_prot_type, MLX5_L4_PROT_TYPE_TCP); #ifdef RSS if (!(rss_gethashconfig() & RSS_HASHTYPE_RSS_TCP_IPV6)) { MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP); } else #endif MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_ALL); break; case MLX5E_TT_IPV4_UDP: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV4); MLX5_SET(rx_hash_field_select, hfso, l4_prot_type, MLX5_L4_PROT_TYPE_UDP); #ifdef RSS if (!(rss_gethashconfig() & RSS_HASHTYPE_RSS_UDP_IPV4)) { MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP); } else #endif MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_ALL); break; case MLX5E_TT_IPV6_UDP: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV6); MLX5_SET(rx_hash_field_select, hfso, l4_prot_type, MLX5_L4_PROT_TYPE_UDP); #ifdef RSS if (!(rss_gethashconfig() & RSS_HASHTYPE_RSS_UDP_IPV6)) { MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP); } else #endif MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_ALL); break; case MLX5E_TT_IPV4_IPSEC_AH: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV4); MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP_IPSEC_SPI); break; case MLX5E_TT_IPV6_IPSEC_AH: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV6); MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP_IPSEC_SPI); break; case MLX5E_TT_IPV4_IPSEC_ESP: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV4); MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP_IPSEC_SPI); break; case MLX5E_TT_IPV6_IPSEC_ESP: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV6); MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP_IPSEC_SPI); break; case MLX5E_TT_IPV4: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV4); MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP); break; case MLX5E_TT_IPV6: MLX5_SET(rx_hash_field_select, hfso, l3_prot_type, MLX5_L3_PROT_TYPE_IPV6); MLX5_SET(rx_hash_field_select, hfso, selected_fields, MLX5_HASH_IP); break; default: break; } } static int mlx5e_open_tir(struct mlx5e_priv *priv, int tt) { struct mlx5_core_dev *mdev = priv->mdev; u32 *in; void *tirc; int inlen; int err; inlen = MLX5_ST_SZ_BYTES(create_tir_in); in = mlx5_vzalloc(inlen); if (in == NULL) return (-ENOMEM); tirc = MLX5_ADDR_OF(create_tir_in, in, tir_context); mlx5e_build_tir_ctx(priv, tirc, tt); err = mlx5_core_create_tir(mdev, in, inlen, &priv->tirn[tt]); kvfree(in); return (err); } static void mlx5e_close_tir(struct mlx5e_priv *priv, int tt) { mlx5_core_destroy_tir(priv->mdev, priv->tirn[tt]); } static int mlx5e_open_tirs(struct mlx5e_priv *priv) { int err; int i; for (i = 0; i < MLX5E_NUM_TT; i++) { err = mlx5e_open_tir(priv, i); if (err) goto err_close_tirs; } return (0); err_close_tirs: for (i--; i >= 0; i--) mlx5e_close_tir(priv, i); return (err); } static void mlx5e_close_tirs(struct mlx5e_priv *priv) { int i; for (i = 0; i < MLX5E_NUM_TT; i++) mlx5e_close_tir(priv, i); } /* * SW MTU does not include headers, * HW MTU includes all headers and checksums. */ static int mlx5e_set_dev_port_mtu(struct ifnet *ifp, int sw_mtu) { struct mlx5e_priv *priv = ifp->if_softc; struct mlx5_core_dev *mdev = priv->mdev; int hw_mtu; int err; err = mlx5_set_port_mtu(mdev, MLX5E_SW2HW_MTU(sw_mtu)); if (err) { if_printf(ifp, "%s: mlx5_set_port_mtu failed setting %d, err=%d\n", __func__, sw_mtu, err); return (err); } err = mlx5_query_port_oper_mtu(mdev, &hw_mtu); if (!err) { ifp->if_mtu = MLX5E_HW2SW_MTU(hw_mtu); if (ifp->if_mtu != sw_mtu) { if_printf(ifp, "Port MTU %d is different than " "ifp mtu %d\n", sw_mtu, (int)ifp->if_mtu); } } else { if_printf(ifp, "Query port MTU, after setting new " "MTU value, failed\n"); ifp->if_mtu = sw_mtu; } return (0); } int mlx5e_open_locked(struct ifnet *ifp) { struct mlx5e_priv *priv = ifp->if_softc; int err; /* check if already opened */ if (test_bit(MLX5E_STATE_OPENED, &priv->state) != 0) return (0); #ifdef RSS if (rss_getnumbuckets() > priv->params.num_channels) { if_printf(ifp, "NOTE: There are more RSS buckets(%u) than " "channels(%u) available\n", rss_getnumbuckets(), priv->params.num_channels); } #endif err = mlx5e_open_tises(priv); if (err) { if_printf(ifp, "%s: mlx5e_open_tises failed, %d\n", __func__, err); return (err); } err = mlx5_vport_alloc_q_counter(priv->mdev, &priv->counter_set_id); if (err) { if_printf(priv->ifp, "%s: mlx5_vport_alloc_q_counter failed: %d\n", __func__, err); goto err_close_tises; } err = mlx5e_open_channels(priv); if (err) { if_printf(ifp, "%s: mlx5e_open_channels failed, %d\n", __func__, err); goto err_dalloc_q_counter; } err = mlx5e_open_rqt(priv); if (err) { if_printf(ifp, "%s: mlx5e_open_rqt failed, %d\n", __func__, err); goto err_close_channels; } err = mlx5e_open_tirs(priv); if (err) { if_printf(ifp, "%s: mlx5e_open_tir failed, %d\n", __func__, err); goto err_close_rqls; } err = mlx5e_open_flow_table(priv); if (err) { if_printf(ifp, "%s: mlx5e_open_flow_table failed, %d\n", __func__, err); goto err_close_tirs; } err = mlx5e_add_all_vlan_rules(priv); if (err) { if_printf(ifp, "%s: mlx5e_add_all_vlan_rules failed, %d\n", __func__, err); goto err_close_flow_table; } set_bit(MLX5E_STATE_OPENED, &priv->state); mlx5e_update_carrier(priv); mlx5e_set_rx_mode_core(priv); return (0); err_close_flow_table: mlx5e_close_flow_table(priv); err_close_tirs: mlx5e_close_tirs(priv); err_close_rqls: mlx5e_close_rqt(priv); err_close_channels: mlx5e_close_channels(priv); err_dalloc_q_counter: mlx5_vport_dealloc_q_counter(priv->mdev, priv->counter_set_id); err_close_tises: mlx5e_close_tises(priv); return (err); } static void mlx5e_open(void *arg) { struct mlx5e_priv *priv = arg; PRIV_LOCK(priv); if (mlx5_set_port_status(priv->mdev, MLX5_PORT_UP)) if_printf(priv->ifp, "%s: Setting port status to up failed\n", __func__); mlx5e_open_locked(priv->ifp); priv->ifp->if_drv_flags |= IFF_DRV_RUNNING; PRIV_UNLOCK(priv); } int mlx5e_close_locked(struct ifnet *ifp) { struct mlx5e_priv *priv = ifp->if_softc; /* check if already closed */ if (test_bit(MLX5E_STATE_OPENED, &priv->state) == 0) return (0); clear_bit(MLX5E_STATE_OPENED, &priv->state); mlx5e_set_rx_mode_core(priv); mlx5e_del_all_vlan_rules(priv); if_link_state_change(priv->ifp, LINK_STATE_DOWN); mlx5e_close_flow_table(priv); mlx5e_close_tirs(priv); mlx5e_close_rqt(priv); mlx5e_close_channels(priv); mlx5_vport_dealloc_q_counter(priv->mdev, priv->counter_set_id); mlx5e_close_tises(priv); return (0); } #if (__FreeBSD_version >= 1100000) static uint64_t mlx5e_get_counter(struct ifnet *ifp, ift_counter cnt) { struct mlx5e_priv *priv = ifp->if_softc; u64 retval; /* PRIV_LOCK(priv); XXX not allowed */ switch (cnt) { case IFCOUNTER_IPACKETS: retval = priv->stats.vport.rx_packets; break; case IFCOUNTER_IERRORS: retval = priv->stats.vport.rx_error_packets; break; case IFCOUNTER_IQDROPS: retval = priv->stats.vport.rx_out_of_buffer; break; case IFCOUNTER_OPACKETS: retval = priv->stats.vport.tx_packets; break; case IFCOUNTER_OERRORS: retval = priv->stats.vport.tx_error_packets; break; case IFCOUNTER_IBYTES: retval = priv->stats.vport.rx_bytes; break; case IFCOUNTER_OBYTES: retval = priv->stats.vport.tx_bytes; break; case IFCOUNTER_IMCASTS: retval = priv->stats.vport.rx_multicast_packets; break; case IFCOUNTER_OMCASTS: retval = priv->stats.vport.tx_multicast_packets; break; case IFCOUNTER_OQDROPS: retval = priv->stats.vport.tx_queue_dropped; break; default: retval = if_get_counter_default(ifp, cnt); break; } /* PRIV_UNLOCK(priv); XXX not allowed */ return (retval); } #endif static void mlx5e_set_rx_mode(struct ifnet *ifp) { struct mlx5e_priv *priv = ifp->if_softc; schedule_work(&priv->set_rx_mode_work); } static int mlx5e_ioctl(struct ifnet *ifp, u_long command, caddr_t data) { struct mlx5e_priv *priv; struct ifreq *ifr; struct ifi2creq i2c; int error = 0; int mask = 0; int size_read = 0; int module_num; int max_mtu; uint8_t read_addr; priv = ifp->if_softc; /* check if detaching */ if (priv == NULL || priv->gone != 0) return (ENXIO); switch (command) { case SIOCSIFMTU: ifr = (struct ifreq *)data; PRIV_LOCK(priv); mlx5_query_port_max_mtu(priv->mdev, &max_mtu); if (ifr->ifr_mtu >= MLX5E_MTU_MIN && ifr->ifr_mtu <= MIN(MLX5E_MTU_MAX, max_mtu)) { int was_opened; was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state); if (was_opened) mlx5e_close_locked(ifp); /* set new MTU */ mlx5e_set_dev_port_mtu(ifp, ifr->ifr_mtu); if (was_opened) mlx5e_open_locked(ifp); } else { error = EINVAL; if_printf(ifp, "Invalid MTU value. Min val: %d, Max val: %d\n", MLX5E_MTU_MIN, MIN(MLX5E_MTU_MAX, max_mtu)); } PRIV_UNLOCK(priv); break; case SIOCSIFFLAGS: if ((ifp->if_flags & IFF_UP) && (ifp->if_drv_flags & IFF_DRV_RUNNING)) { mlx5e_set_rx_mode(ifp); break; } PRIV_LOCK(priv); if (ifp->if_flags & IFF_UP) { if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) { if (test_bit(MLX5E_STATE_OPENED, &priv->state) == 0) mlx5e_open_locked(ifp); ifp->if_drv_flags |= IFF_DRV_RUNNING; mlx5_set_port_status(priv->mdev, MLX5_PORT_UP); } } else { if (ifp->if_drv_flags & IFF_DRV_RUNNING) { mlx5_set_port_status(priv->mdev, MLX5_PORT_DOWN); if (test_bit(MLX5E_STATE_OPENED, &priv->state) != 0) mlx5e_close_locked(ifp); mlx5e_update_carrier(priv); ifp->if_drv_flags &= ~IFF_DRV_RUNNING; } } PRIV_UNLOCK(priv); break; case SIOCADDMULTI: case SIOCDELMULTI: mlx5e_set_rx_mode(ifp); break; case SIOCSIFMEDIA: case SIOCGIFMEDIA: case SIOCGIFXMEDIA: ifr = (struct ifreq *)data; error = ifmedia_ioctl(ifp, ifr, &priv->media, command); break; case SIOCSIFCAP: ifr = (struct ifreq *)data; PRIV_LOCK(priv); mask = ifr->ifr_reqcap ^ ifp->if_capenable; if (mask & IFCAP_TXCSUM) { ifp->if_capenable ^= IFCAP_TXCSUM; ifp->if_hwassist ^= (CSUM_TCP | CSUM_UDP | CSUM_IP); if (IFCAP_TSO4 & ifp->if_capenable && !(IFCAP_TXCSUM & ifp->if_capenable)) { ifp->if_capenable &= ~IFCAP_TSO4; ifp->if_hwassist &= ~CSUM_IP_TSO; if_printf(ifp, "tso4 disabled due to -txcsum.\n"); } } if (mask & IFCAP_TXCSUM_IPV6) { ifp->if_capenable ^= IFCAP_TXCSUM_IPV6; ifp->if_hwassist ^= (CSUM_UDP_IPV6 | CSUM_TCP_IPV6); if (IFCAP_TSO6 & ifp->if_capenable && !(IFCAP_TXCSUM_IPV6 & ifp->if_capenable)) { ifp->if_capenable &= ~IFCAP_TSO6; ifp->if_hwassist &= ~CSUM_IP6_TSO; if_printf(ifp, "tso6 disabled due to -txcsum6.\n"); } } if (mask & IFCAP_RXCSUM) ifp->if_capenable ^= IFCAP_RXCSUM; if (mask & IFCAP_RXCSUM_IPV6) ifp->if_capenable ^= IFCAP_RXCSUM_IPV6; if (mask & IFCAP_TSO4) { if (!(IFCAP_TSO4 & ifp->if_capenable) && !(IFCAP_TXCSUM & ifp->if_capenable)) { if_printf(ifp, "enable txcsum first.\n"); error = EAGAIN; goto out; } ifp->if_capenable ^= IFCAP_TSO4; ifp->if_hwassist ^= CSUM_IP_TSO; } if (mask & IFCAP_TSO6) { if (!(IFCAP_TSO6 & ifp->if_capenable) && !(IFCAP_TXCSUM_IPV6 & ifp->if_capenable)) { if_printf(ifp, "enable txcsum6 first.\n"); error = EAGAIN; goto out; } ifp->if_capenable ^= IFCAP_TSO6; ifp->if_hwassist ^= CSUM_IP6_TSO; } if (mask & IFCAP_VLAN_HWFILTER) { if (ifp->if_capenable & IFCAP_VLAN_HWFILTER) mlx5e_disable_vlan_filter(priv); else mlx5e_enable_vlan_filter(priv); ifp->if_capenable ^= IFCAP_VLAN_HWFILTER; } if (mask & IFCAP_VLAN_HWTAGGING) ifp->if_capenable ^= IFCAP_VLAN_HWTAGGING; if (mask & IFCAP_WOL_MAGIC) ifp->if_capenable ^= IFCAP_WOL_MAGIC; VLAN_CAPABILITIES(ifp); /* turn off LRO means also turn of HW LRO - if it's on */ if (mask & IFCAP_LRO) { int was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state); bool need_restart = false; ifp->if_capenable ^= IFCAP_LRO; if (!(ifp->if_capenable & IFCAP_LRO)) { if (priv->params.hw_lro_en) { priv->params.hw_lro_en = false; need_restart = true; /* Not sure this is the correct way */ priv->params_ethtool.hw_lro = priv->params.hw_lro_en; } } if (was_opened && need_restart) { mlx5e_close_locked(ifp); mlx5e_open_locked(ifp); } } out: PRIV_UNLOCK(priv); break; case SIOCGI2C: ifr = (struct ifreq *)data; /* * Copy from the user-space address ifr_data to the * kernel-space address i2c */ error = copyin(ifr->ifr_data, &i2c, sizeof(i2c)); if (error) break; if (i2c.len > sizeof(i2c.data)) { error = EINVAL; break; } PRIV_LOCK(priv); /* Get module_num which is required for the query_eeprom */ error = mlx5_query_module_num(priv->mdev, &module_num); if (error) { if_printf(ifp, "Query module num failed, eeprom " "reading is not supported\n"); error = EINVAL; goto err_i2c; } /* Check if module is present before doing an access */ if (mlx5_query_module_status(priv->mdev, module_num) != MLX5_MODULE_STATUS_PLUGGED) { error = EINVAL; goto err_i2c; } /* * Currently 0XA0 and 0xA2 are the only addresses permitted. * The internal conversion is as follows: */ if (i2c.dev_addr == 0xA0) read_addr = MLX5E_I2C_ADDR_LOW; else if (i2c.dev_addr == 0xA2) read_addr = MLX5E_I2C_ADDR_HIGH; else { if_printf(ifp, "Query eeprom failed, " "Invalid Address: %X\n", i2c.dev_addr); error = EINVAL; goto err_i2c; } error = mlx5_query_eeprom(priv->mdev, read_addr, MLX5E_EEPROM_LOW_PAGE, (uint32_t)i2c.offset, (uint32_t)i2c.len, module_num, (uint32_t *)i2c.data, &size_read); if (error) { if_printf(ifp, "Query eeprom failed, eeprom " "reading is not supported\n"); error = EINVAL; goto err_i2c; } if (i2c.len > MLX5_EEPROM_MAX_BYTES) { error = mlx5_query_eeprom(priv->mdev, read_addr, MLX5E_EEPROM_LOW_PAGE, (uint32_t)(i2c.offset + size_read), (uint32_t)(i2c.len - size_read), module_num, (uint32_t *)(i2c.data + size_read), &size_read); } if (error) { if_printf(ifp, "Query eeprom failed, eeprom " "reading is not supported\n"); error = EINVAL; goto err_i2c; } error = copyout(&i2c, ifr->ifr_data, sizeof(i2c)); err_i2c: PRIV_UNLOCK(priv); break; default: error = ether_ioctl(ifp, command, data); break; } return (error); } static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev) { /* * TODO: uncoment once FW really sets all these bits if * (!mdev->caps.eth.rss_ind_tbl_cap || !mdev->caps.eth.csum_cap || * !mdev->caps.eth.max_lso_cap || !mdev->caps.eth.vlan_cap || * !(mdev->caps.gen.flags & MLX5_DEV_CAP_FLAG_SCQE_BRK_MOD)) return * -ENOTSUPP; */ /* TODO: add more must-to-have features */ return (0); } static void mlx5e_build_ifp_priv(struct mlx5_core_dev *mdev, struct mlx5e_priv *priv, int num_comp_vectors) { /* * TODO: Consider link speed for setting "log_sq_size", * "log_rq_size" and "cq_moderation_xxx": */ priv->params.log_sq_size = MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE; priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE; priv->params.rx_cq_moderation_usec = MLX5_CAP_GEN(mdev, cq_period_start_from_cqe) ? MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC_FROM_CQE : MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_USEC; priv->params.rx_cq_moderation_mode = MLX5_CAP_GEN(mdev, cq_period_start_from_cqe) ? 1 : 0; priv->params.rx_cq_moderation_pkts = MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS; priv->params.tx_cq_moderation_usec = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_USEC; priv->params.tx_cq_moderation_pkts = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS; priv->params.min_rx_wqes = MLX5E_PARAMS_DEFAULT_MIN_RX_WQES; priv->params.rx_hash_log_tbl_sz = (order_base_2(num_comp_vectors) > MLX5E_PARAMS_DEFAULT_RX_HASH_LOG_TBL_SZ) ? order_base_2(num_comp_vectors) : MLX5E_PARAMS_DEFAULT_RX_HASH_LOG_TBL_SZ; priv->params.num_tc = 1; priv->params.default_vlan_prio = 0; priv->counter_set_id = -1; /* * hw lro is currently defaulted to off. when it won't anymore we * will consider the HW capability: "!!MLX5_CAP_ETH(mdev, lro_cap)" */ priv->params.hw_lro_en = false; priv->params.lro_wqe_sz = MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ; priv->params.cqe_zipping_en = !!MLX5_CAP_GEN(mdev, cqe_compression); priv->mdev = mdev; priv->params.num_channels = num_comp_vectors; priv->order_base_2_num_channels = order_base_2(num_comp_vectors); priv->queue_mapping_channel_mask = roundup_pow_of_two(num_comp_vectors) - 1; priv->num_tc = priv->params.num_tc; priv->default_vlan_prio = priv->params.default_vlan_prio; INIT_WORK(&priv->update_stats_work, mlx5e_update_stats_work); INIT_WORK(&priv->update_carrier_work, mlx5e_update_carrier_work); INIT_WORK(&priv->set_rx_mode_work, mlx5e_set_rx_mode_work); } static int mlx5e_create_mkey(struct mlx5e_priv *priv, u32 pdn, struct mlx5_core_mr *mr) { struct ifnet *ifp = priv->ifp; struct mlx5_core_dev *mdev = priv->mdev; struct mlx5_create_mkey_mbox_in *in; int err; in = mlx5_vzalloc(sizeof(*in)); if (in == NULL) { if_printf(ifp, "%s: failed to allocate inbox\n", __func__); return (-ENOMEM); } in->seg.flags = MLX5_PERM_LOCAL_WRITE | MLX5_PERM_LOCAL_READ | MLX5_ACCESS_MODE_PA; in->seg.flags_pd = cpu_to_be32(pdn | MLX5_MKEY_LEN64); in->seg.qpn_mkey7_0 = cpu_to_be32(0xffffff << 8); err = mlx5_core_create_mkey(mdev, mr, in, sizeof(*in), NULL, NULL, NULL); if (err) if_printf(ifp, "%s: mlx5_core_create_mkey failed, %d\n", __func__, err); kvfree(in); return (err); } static const char *mlx5e_vport_stats_desc[] = { MLX5E_VPORT_STATS(MLX5E_STATS_DESC) }; static const char *mlx5e_pport_stats_desc[] = { MLX5E_PPORT_STATS(MLX5E_STATS_DESC) }; static void mlx5e_priv_mtx_init(struct mlx5e_priv *priv) { mtx_init(&priv->async_events_mtx, "mlx5async", MTX_NETWORK_LOCK, MTX_DEF); sx_init(&priv->state_lock, "mlx5state"); callout_init_mtx(&priv->watchdog, &priv->async_events_mtx, 0); } static void mlx5e_priv_mtx_destroy(struct mlx5e_priv *priv) { mtx_destroy(&priv->async_events_mtx); sx_destroy(&priv->state_lock); } static int sysctl_firmware(SYSCTL_HANDLER_ARGS) { /* * %d.%d%.d the string format. * fw_rev_{maj,min,sub} return u16, 2^16 = 65536. * We need at most 5 chars to store that. * It also has: two "." and NULL at the end, which means we need 18 * (5*3 + 3) chars at most. */ char fw[18]; struct mlx5e_priv *priv = arg1; int error; snprintf(fw, sizeof(fw), "%d.%d.%d", fw_rev_maj(priv->mdev), fw_rev_min(priv->mdev), fw_rev_sub(priv->mdev)); error = sysctl_handle_string(oidp, fw, sizeof(fw), req); return (error); } static void mlx5e_add_hw_stats(struct mlx5e_priv *priv) { SYSCTL_ADD_PROC(&priv->sysctl_ctx, SYSCTL_CHILDREN(priv->sysctl_hw), OID_AUTO, "fw_version", CTLTYPE_STRING | CTLFLAG_RD, priv, 0, sysctl_firmware, "A", "HCA firmware version"); SYSCTL_ADD_STRING(&priv->sysctl_ctx, SYSCTL_CHILDREN(priv->sysctl_hw), OID_AUTO, "board_id", CTLFLAG_RD, priv->mdev->board_id, 0, "Board ID"); } static void mlx5e_setup_pauseframes(struct mlx5e_priv *priv) { #if (__FreeBSD_version < 1100000) char path[64]; #endif /* Only receiving pauseframes is enabled by default */ priv->params.tx_pauseframe_control = 0; priv->params.rx_pauseframe_control = 1; #if (__FreeBSD_version < 1100000) /* compute path for sysctl */ snprintf(path, sizeof(path), "dev.mce.%d.tx_pauseframe_control", device_get_unit(priv->mdev->pdev->dev.bsddev)); /* try to fetch tunable, if any */ TUNABLE_INT_FETCH(path, &priv->params.tx_pauseframe_control); /* compute path for sysctl */ snprintf(path, sizeof(path), "dev.mce.%d.rx_pauseframe_control", device_get_unit(priv->mdev->pdev->dev.bsddev)); /* try to fetch tunable, if any */ TUNABLE_INT_FETCH(path, &priv->params.rx_pauseframe_control); #endif /* register pausframe SYSCTLs */ SYSCTL_ADD_INT(&priv->sysctl_ctx, SYSCTL_CHILDREN(priv->sysctl_ifnet), OID_AUTO, "tx_pauseframe_control", CTLFLAG_RDTUN, &priv->params.tx_pauseframe_control, 0, "Set to enable TX pause frames. Clear to disable."); SYSCTL_ADD_INT(&priv->sysctl_ctx, SYSCTL_CHILDREN(priv->sysctl_ifnet), OID_AUTO, "rx_pauseframe_control", CTLFLAG_RDTUN, &priv->params.rx_pauseframe_control, 0, "Set to enable RX pause frames. Clear to disable."); /* range check */ priv->params.tx_pauseframe_control = priv->params.tx_pauseframe_control ? 1 : 0; priv->params.rx_pauseframe_control = priv->params.rx_pauseframe_control ? 1 : 0; /* update firmware */ mlx5_set_port_pause(priv->mdev, 1, priv->params.rx_pauseframe_control, priv->params.tx_pauseframe_control); } static void * mlx5e_create_ifp(struct mlx5_core_dev *mdev) { static volatile int mlx5_en_unit; struct ifnet *ifp; struct mlx5e_priv *priv; u8 dev_addr[ETHER_ADDR_LEN] __aligned(4); struct sysctl_oid_list *child; int ncv = mdev->priv.eq_table.num_comp_vectors; char unit[16]; int err; int i; u32 eth_proto_cap; if (mlx5e_check_required_hca_cap(mdev)) { mlx5_core_dbg(mdev, "mlx5e_check_required_hca_cap() failed\n"); return (NULL); } priv = malloc(sizeof(*priv), M_MLX5EN, M_WAITOK | M_ZERO); if (priv == NULL) { mlx5_core_err(mdev, "malloc() failed\n"); return (NULL); } mlx5e_priv_mtx_init(priv); ifp = priv->ifp = if_alloc(IFT_ETHER); if (ifp == NULL) { mlx5_core_err(mdev, "if_alloc() failed\n"); goto err_free_priv; } ifp->if_softc = priv; if_initname(ifp, "mce", atomic_fetchadd_int(&mlx5_en_unit, 1)); ifp->if_mtu = ETHERMTU; ifp->if_init = mlx5e_open; ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; ifp->if_ioctl = mlx5e_ioctl; ifp->if_transmit = mlx5e_xmit; ifp->if_qflush = if_qflush; #if (__FreeBSD_version >= 1100000) ifp->if_get_counter = mlx5e_get_counter; #endif ifp->if_snd.ifq_maxlen = ifqmaxlen; /* * Set driver features */ ifp->if_capabilities |= IFCAP_HWCSUM | IFCAP_HWCSUM_IPV6; ifp->if_capabilities |= IFCAP_VLAN_MTU | IFCAP_VLAN_HWTAGGING; ifp->if_capabilities |= IFCAP_VLAN_HWCSUM | IFCAP_VLAN_HWFILTER; ifp->if_capabilities |= IFCAP_LINKSTATE | IFCAP_JUMBO_MTU; ifp->if_capabilities |= IFCAP_LRO; ifp->if_capabilities |= IFCAP_TSO | IFCAP_VLAN_HWTSO; /* set TSO limits so that we don't have to drop TX packets */ ifp->if_hw_tsomax = MLX5E_MAX_TX_PAYLOAD_SIZE - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN); ifp->if_hw_tsomaxsegcount = MLX5E_MAX_TX_MBUF_FRAGS - 1 /* hdr */; ifp->if_hw_tsomaxsegsize = MLX5E_MAX_TX_MBUF_SIZE; ifp->if_capenable = ifp->if_capabilities; ifp->if_hwassist = 0; if (ifp->if_capenable & IFCAP_TSO) ifp->if_hwassist |= CSUM_TSO; if (ifp->if_capenable & IFCAP_TXCSUM) ifp->if_hwassist |= (CSUM_TCP | CSUM_UDP | CSUM_IP); if (ifp->if_capenable & IFCAP_TXCSUM_IPV6) ifp->if_hwassist |= (CSUM_UDP_IPV6 | CSUM_TCP_IPV6); /* ifnet sysctl tree */ sysctl_ctx_init(&priv->sysctl_ctx); priv->sysctl_ifnet = SYSCTL_ADD_NODE(&priv->sysctl_ctx, SYSCTL_STATIC_CHILDREN(_dev), OID_AUTO, ifp->if_dname, CTLFLAG_RD, 0, "MLX5 ethernet - interface name"); if (priv->sysctl_ifnet == NULL) { mlx5_core_err(mdev, "SYSCTL_ADD_NODE() failed\n"); goto err_free_sysctl; } snprintf(unit, sizeof(unit), "%d", ifp->if_dunit); priv->sysctl_ifnet = SYSCTL_ADD_NODE(&priv->sysctl_ctx, SYSCTL_CHILDREN(priv->sysctl_ifnet), OID_AUTO, unit, CTLFLAG_RD, 0, "MLX5 ethernet - interface unit"); if (priv->sysctl_ifnet == NULL) { mlx5_core_err(mdev, "SYSCTL_ADD_NODE() failed\n"); goto err_free_sysctl; } /* HW sysctl tree */ child = SYSCTL_CHILDREN(device_get_sysctl_tree(mdev->pdev->dev.bsddev)); priv->sysctl_hw = SYSCTL_ADD_NODE(&priv->sysctl_ctx, child, OID_AUTO, "hw", CTLFLAG_RD, 0, "MLX5 ethernet dev hw"); if (priv->sysctl_hw == NULL) { mlx5_core_err(mdev, "SYSCTL_ADD_NODE() failed\n"); goto err_free_sysctl; } mlx5e_build_ifp_priv(mdev, priv, ncv); err = mlx5_alloc_map_uar(mdev, &priv->cq_uar); if (err) { if_printf(ifp, "%s: mlx5_alloc_map_uar failed, %d\n", __func__, err); goto err_free_sysctl; } err = mlx5_core_alloc_pd(mdev, &priv->pdn); if (err) { if_printf(ifp, "%s: mlx5_core_alloc_pd failed, %d\n", __func__, err); goto err_unmap_free_uar; } err = mlx5_alloc_transport_domain(mdev, &priv->tdn); if (err) { if_printf(ifp, "%s: mlx5_alloc_transport_domain failed, %d\n", __func__, err); goto err_dealloc_pd; } err = mlx5e_create_mkey(priv, priv->pdn, &priv->mr); if (err) { if_printf(ifp, "%s: mlx5e_create_mkey failed, %d\n", __func__, err); goto err_dealloc_transport_domain; } mlx5_query_nic_vport_mac_address(priv->mdev, 0, dev_addr); + /* check if we should generate a random MAC address */ + if (MLX5_CAP_GEN(priv->mdev, vport_group_manager) == 0 && + is_zero_ether_addr(dev_addr)) { + random_ether_addr(dev_addr); + if_printf(ifp, "Assigned random MAC address\n"); + } + /* set default MTU */ mlx5e_set_dev_port_mtu(ifp, ifp->if_mtu); /* Set desc */ device_set_desc(mdev->pdev->dev.bsddev, mlx5e_version); /* Set default media status */ priv->media_status_last = IFM_AVALID; priv->media_active_last = IFM_ETHER | IFM_AUTO | IFM_ETH_RXPAUSE | IFM_FDX; /* setup default pauseframes configuration */ mlx5e_setup_pauseframes(priv); err = mlx5_query_port_proto_cap(mdev, ð_proto_cap, MLX5_PTYS_EN); if (err) { eth_proto_cap = 0; if_printf(ifp, "%s: Query port media capability failed, %d\n", __func__, err); } /* Setup supported medias */ ifmedia_init(&priv->media, IFM_IMASK | IFM_ETH_FMASK, mlx5e_media_change, mlx5e_media_status); for (i = 0; i < MLX5E_LINK_MODES_NUMBER; ++i) { if (mlx5e_mode_table[i].baudrate == 0) continue; if (MLX5E_PROT_MASK(i) & eth_proto_cap) { ifmedia_add(&priv->media, mlx5e_mode_table[i].subtype | IFM_ETHER, 0, NULL); ifmedia_add(&priv->media, mlx5e_mode_table[i].subtype | IFM_ETHER | IFM_FDX | IFM_ETH_RXPAUSE | IFM_ETH_TXPAUSE, 0, NULL); } } ifmedia_add(&priv->media, IFM_ETHER | IFM_AUTO, 0, NULL); ifmedia_add(&priv->media, IFM_ETHER | IFM_AUTO | IFM_FDX | IFM_ETH_RXPAUSE | IFM_ETH_TXPAUSE, 0, NULL); /* Set autoselect by default */ ifmedia_set(&priv->media, IFM_ETHER | IFM_AUTO | IFM_FDX | IFM_ETH_RXPAUSE | IFM_ETH_TXPAUSE); ether_ifattach(ifp, dev_addr); /* Register for VLAN events */ priv->vlan_attach = EVENTHANDLER_REGISTER(vlan_config, mlx5e_vlan_rx_add_vid, priv, EVENTHANDLER_PRI_FIRST); priv->vlan_detach = EVENTHANDLER_REGISTER(vlan_unconfig, mlx5e_vlan_rx_kill_vid, priv, EVENTHANDLER_PRI_FIRST); /* Link is down by default */ if_link_state_change(ifp, LINK_STATE_DOWN); mlx5e_enable_async_events(priv); mlx5e_add_hw_stats(priv); mlx5e_create_stats(&priv->stats.vport.ctx, SYSCTL_CHILDREN(priv->sysctl_ifnet), "vstats", mlx5e_vport_stats_desc, MLX5E_VPORT_STATS_NUM, priv->stats.vport.arg); mlx5e_create_stats(&priv->stats.pport.ctx, SYSCTL_CHILDREN(priv->sysctl_ifnet), "pstats", mlx5e_pport_stats_desc, MLX5E_PPORT_STATS_NUM, priv->stats.pport.arg); mlx5e_create_ethtool(priv); mtx_lock(&priv->async_events_mtx); mlx5e_update_stats(priv); mtx_unlock(&priv->async_events_mtx); return (priv); err_dealloc_transport_domain: mlx5_dealloc_transport_domain(mdev, priv->tdn); err_dealloc_pd: mlx5_core_dealloc_pd(mdev, priv->pdn); err_unmap_free_uar: mlx5_unmap_free_uar(mdev, &priv->cq_uar); err_free_sysctl: sysctl_ctx_free(&priv->sysctl_ctx); if_free(ifp); err_free_priv: mlx5e_priv_mtx_destroy(priv); free(priv, M_MLX5EN); return (NULL); } static void mlx5e_destroy_ifp(struct mlx5_core_dev *mdev, void *vpriv) { struct mlx5e_priv *priv = vpriv; struct ifnet *ifp = priv->ifp; /* don't allow more IOCTLs */ priv->gone = 1; /* XXX wait a bit to allow IOCTL handlers to complete */ pause("W", hz); /* stop watchdog timer */ callout_drain(&priv->watchdog); if (priv->vlan_attach != NULL) EVENTHANDLER_DEREGISTER(vlan_config, priv->vlan_attach); if (priv->vlan_detach != NULL) EVENTHANDLER_DEREGISTER(vlan_unconfig, priv->vlan_detach); /* make sure device gets closed */ PRIV_LOCK(priv); mlx5e_close_locked(ifp); PRIV_UNLOCK(priv); /* unregister device */ ifmedia_removeall(&priv->media); ether_ifdetach(ifp); if_free(ifp); /* destroy all remaining sysctl nodes */ if (priv->sysctl_debug) sysctl_ctx_free(&priv->stats.port_stats_debug.ctx); sysctl_ctx_free(&priv->stats.vport.ctx); sysctl_ctx_free(&priv->stats.pport.ctx); sysctl_ctx_free(&priv->sysctl_ctx); mlx5_core_destroy_mkey(priv->mdev, &priv->mr); mlx5_dealloc_transport_domain(priv->mdev, priv->tdn); mlx5_core_dealloc_pd(priv->mdev, priv->pdn); mlx5_unmap_free_uar(priv->mdev, &priv->cq_uar); mlx5e_disable_async_events(priv); flush_scheduled_work(); mlx5e_priv_mtx_destroy(priv); free(priv, M_MLX5EN); } static void * mlx5e_get_ifp(void *vpriv) { struct mlx5e_priv *priv = vpriv; return (priv->ifp); } static struct mlx5_interface mlx5e_interface = { .add = mlx5e_create_ifp, .remove = mlx5e_destroy_ifp, .event = mlx5e_async_event, .protocol = MLX5_INTERFACE_PROTOCOL_ETH, .get_dev = mlx5e_get_ifp, }; void mlx5e_init(void) { mlx5_register_interface(&mlx5e_interface); } void mlx5e_cleanup(void) { mlx5_unregister_interface(&mlx5e_interface); } module_init_order(mlx5e_init, SI_ORDER_THIRD); module_exit_order(mlx5e_cleanup, SI_ORDER_THIRD); #if (__FreeBSD_version >= 1100000) MODULE_DEPEND(mlx5en, linuxkpi, 1, 1, 1); #endif MODULE_DEPEND(mlx5en, mlx5, 1, 1, 1); MODULE_VERSION(mlx5en, 1); Index: projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_rx.c =================================================================== --- projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_rx.c (revision 301546) +++ projects/vnet/sys/dev/mlx5/mlx5_en/mlx5_en_rx.c (revision 301547) @@ -1,444 +1,444 @@ /*- * Copyright (c) 2015 Mellanox Technologies. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS `AS IS' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include "en.h" #include static inline int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix) { bus_dma_segment_t segs[1]; struct mbuf *mb; int nsegs; int err; if (rq->mbuf[ix].mbuf != NULL) return (0); mb = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rq->wqe_sz); if (unlikely(!mb)) return (-ENOMEM); /* set initial mbuf length */ mb->m_pkthdr.len = mb->m_len = rq->wqe_sz; /* get IP header aligned */ m_adj(mb, MLX5E_NET_IP_ALIGN); err = -bus_dmamap_load_mbuf_sg(rq->dma_tag, rq->mbuf[ix].dma_map, mb, segs, &nsegs, BUS_DMA_NOWAIT); if (err != 0) goto err_free_mbuf; if (unlikely(nsegs != 1)) { bus_dmamap_unload(rq->dma_tag, rq->mbuf[ix].dma_map); err = -ENOMEM; goto err_free_mbuf; } wqe->data.addr = cpu_to_be64(segs[0].ds_addr); rq->mbuf[ix].mbuf = mb; rq->mbuf[ix].data = mb->m_data; bus_dmamap_sync(rq->dma_tag, rq->mbuf[ix].dma_map, BUS_DMASYNC_PREREAD); return (0); err_free_mbuf: m_freem(mb); return (err); } static void mlx5e_post_rx_wqes(struct mlx5e_rq *rq) { if (unlikely(rq->enabled == 0)) return; while (!mlx5_wq_ll_is_full(&rq->wq)) { struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, rq->wq.head); if (unlikely(mlx5e_alloc_rx_wqe(rq, wqe, rq->wq.head))) break; mlx5_wq_ll_push(&rq->wq, be16_to_cpu(wqe->next.next_wqe_index)); } /* ensure wqes are visible to device before updating doorbell record */ wmb(); mlx5_wq_ll_update_db_record(&rq->wq); } static void mlx5e_lro_update_hdr(struct mbuf *mb, struct mlx5_cqe64 *cqe) { /* TODO: consider vlans, ip options, ... */ struct ether_header *eh; uint16_t eh_type; uint16_t tot_len; struct ip6_hdr *ip6 = NULL; struct ip *ip4 = NULL; struct tcphdr *th; uint32_t *ts_ptr; uint8_t l4_hdr_type; int tcp_ack; eh = mtod(mb, struct ether_header *); eh_type = ntohs(eh->ether_type); l4_hdr_type = get_cqe_l4_hdr_type(cqe); tcp_ack = ((CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA == l4_hdr_type) || (CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA == l4_hdr_type)); /* TODO: consider vlan */ tot_len = be32_to_cpu(cqe->byte_cnt) - ETHER_HDR_LEN; switch (eh_type) { case ETHERTYPE_IP: ip4 = (struct ip *)(eh + 1); th = (struct tcphdr *)(ip4 + 1); break; case ETHERTYPE_IPV6: ip6 = (struct ip6_hdr *)(eh + 1); th = (struct tcphdr *)(ip6 + 1); break; default: return; } ts_ptr = (uint32_t *)(th + 1); if (get_cqe_lro_tcppsh(cqe)) th->th_flags |= TH_PUSH; if (tcp_ack) { th->th_flags |= TH_ACK; th->th_ack = cqe->lro_ack_seq_num; th->th_win = cqe->lro_tcp_win; /* * FreeBSD handles only 32bit aligned timestamp right after * the TCP hdr * +--------+--------+--------+--------+ * | NOP | NOP | TSopt | 10 | * +--------+--------+--------+--------+ * | TSval timestamp | * +--------+--------+--------+--------+ * | TSecr timestamp | * +--------+--------+--------+--------+ */ if (get_cqe_lro_timestamp_valid(cqe) && (__predict_true(*ts_ptr) == ntohl(TCPOPT_NOP << 24 | TCPOPT_NOP << 16 | TCPOPT_TIMESTAMP << 8 | TCPOLEN_TIMESTAMP))) { /* * cqe->timestamp is 64bit long. * [0-31] - timestamp. * [32-64] - timestamp echo replay. */ ts_ptr[1] = *(uint32_t *)&cqe->timestamp; ts_ptr[2] = *((uint32_t *)&cqe->timestamp + 1); } } if (ip4) { ip4->ip_ttl = cqe->lro_min_ttl; ip4->ip_len = cpu_to_be16(tot_len); ip4->ip_sum = 0; ip4->ip_sum = in_cksum(mb, ip4->ip_hl << 2); } else { ip6->ip6_hlim = cqe->lro_min_ttl; ip6->ip6_plen = cpu_to_be16(tot_len - sizeof(struct ip6_hdr)); } /* TODO: handle tcp checksum */ } static inline void mlx5e_build_rx_mbuf(struct mlx5_cqe64 *cqe, struct mlx5e_rq *rq, struct mbuf *mb, u32 cqe_bcnt) { struct ifnet *ifp = rq->ifp; int lro_num_seg; /* HW LRO session aggregated packets counter */ lro_num_seg = be32_to_cpu(cqe->srqn) >> 24; if (lro_num_seg > 1) { mlx5e_lro_update_hdr(mb, cqe); rq->stats.lro_packets++; rq->stats.lro_bytes += cqe_bcnt; } mb->m_pkthdr.len = mb->m_len = cqe_bcnt; /* check if a Toeplitz hash was computed */ if (cqe->rss_hash_type != 0) { mb->m_pkthdr.flowid = be32_to_cpu(cqe->rss_hash_result); #ifdef RSS /* decode the RSS hash type */ switch (cqe->rss_hash_type & (CQE_RSS_DST_HTYPE_L4 | CQE_RSS_DST_HTYPE_IP)) { /* IPv4 */ case (CQE_RSS_DST_HTYPE_TCP | CQE_RSS_DST_HTYPE_IPV4): M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_TCP_IPV4); break; case (CQE_RSS_DST_HTYPE_UDP | CQE_RSS_DST_HTYPE_IPV4): M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_UDP_IPV4); break; case CQE_RSS_DST_HTYPE_IPV4: M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_IPV4); break; /* IPv6 */ case (CQE_RSS_DST_HTYPE_TCP | CQE_RSS_DST_HTYPE_IPV6): M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_TCP_IPV6); break; case (CQE_RSS_DST_HTYPE_UDP | CQE_RSS_DST_HTYPE_IPV6): M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_UDP_IPV6); break; case CQE_RSS_DST_HTYPE_IPV6: M_HASHTYPE_SET(mb, M_HASHTYPE_RSS_IPV6); break; default: /* Other */ - M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE_HASH); break; } #else - M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE_HASH); #endif } else { mb->m_pkthdr.flowid = rq->ix; M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE); } mb->m_pkthdr.rcvif = ifp; if (likely(ifp->if_capenable & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6)) && ((cqe->hds_ip_ext & (CQE_L2_OK | CQE_L3_OK | CQE_L4_OK)) == (CQE_L2_OK | CQE_L3_OK | CQE_L4_OK))) { mb->m_pkthdr.csum_flags = CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR; mb->m_pkthdr.csum_data = htons(0xffff); } else { rq->stats.csum_none++; } if (cqe_has_vlan(cqe)) { mb->m_pkthdr.ether_vtag = be16_to_cpu(cqe->vlan_info); mb->m_flags |= M_VLANTAG; } } static inline void mlx5e_read_cqe_slot(struct mlx5e_cq *cq, u32 cc, void *data) { memcpy(data, mlx5_cqwq_get_wqe(&cq->wq, (cc & cq->wq.sz_m1)), sizeof(struct mlx5_cqe64)); } static inline void mlx5e_write_cqe_slot(struct mlx5e_cq *cq, u32 cc, void *data) { memcpy(mlx5_cqwq_get_wqe(&cq->wq, cc & cq->wq.sz_m1), data, sizeof(struct mlx5_cqe64)); } static inline void mlx5e_decompress_cqe(struct mlx5e_cq *cq, struct mlx5_cqe64 *title, struct mlx5_mini_cqe8 *mini, u16 wqe_counter, int i) { /* * NOTE: The fields which are not set here are copied from the * initial and common title. See memcpy() in * mlx5e_write_cqe_slot(). */ title->byte_cnt = mini->byte_cnt; title->wqe_counter = cpu_to_be16((wqe_counter + i) & cq->wq.sz_m1); title->check_sum = mini->checksum; title->op_own = (title->op_own & 0xf0) | (((cq->wq.cc + i) >> cq->wq.log_sz) & 1); } #define MLX5E_MINI_ARRAY_SZ 8 /* Make sure structs are not packet differently */ CTASSERT(sizeof(struct mlx5_cqe64) == sizeof(struct mlx5_mini_cqe8) * MLX5E_MINI_ARRAY_SZ); static void mlx5e_decompress_cqes(struct mlx5e_cq *cq) { struct mlx5_mini_cqe8 mini_array[MLX5E_MINI_ARRAY_SZ]; struct mlx5_cqe64 title; u32 cqe_count; u32 i = 0; u16 title_wqe_counter; mlx5e_read_cqe_slot(cq, cq->wq.cc, &title); title_wqe_counter = be16_to_cpu(title.wqe_counter); cqe_count = be32_to_cpu(title.byte_cnt); /* Make sure we won't overflow */ KASSERT(cqe_count <= cq->wq.sz_m1, ("%s: cqe_count %u > cq->wq.sz_m1 %u", __func__, cqe_count, cq->wq.sz_m1)); mlx5e_read_cqe_slot(cq, cq->wq.cc + 1, mini_array); while (true) { mlx5e_decompress_cqe(cq, &title, &mini_array[i % MLX5E_MINI_ARRAY_SZ], title_wqe_counter, i); mlx5e_write_cqe_slot(cq, cq->wq.cc + i, &title); i++; if (i == cqe_count) break; if (i % MLX5E_MINI_ARRAY_SZ == 0) mlx5e_read_cqe_slot(cq, cq->wq.cc + i, mini_array); } } static int mlx5e_poll_rx_cq(struct mlx5e_rq *rq, int budget) { int i; for (i = 0; i < budget; i++) { struct mlx5e_rx_wqe *wqe; struct mlx5_cqe64 *cqe; struct mbuf *mb; __be16 wqe_counter_be; u16 wqe_counter; u32 byte_cnt; cqe = mlx5e_get_cqe(&rq->cq); if (!cqe) break; if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) mlx5e_decompress_cqes(&rq->cq); mlx5_cqwq_pop(&rq->cq.wq); wqe_counter_be = cqe->wqe_counter; wqe_counter = be16_to_cpu(wqe_counter_be); wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter); byte_cnt = be32_to_cpu(cqe->byte_cnt); bus_dmamap_sync(rq->dma_tag, rq->mbuf[wqe_counter].dma_map, BUS_DMASYNC_POSTREAD); if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) { rq->stats.wqe_err++; goto wq_ll_pop; } if (MHLEN >= byte_cnt && (mb = m_gethdr(M_NOWAIT, MT_DATA)) != NULL) { bcopy(rq->mbuf[wqe_counter].data, mtod(mb, caddr_t), byte_cnt); } else { mb = rq->mbuf[wqe_counter].mbuf; rq->mbuf[wqe_counter].mbuf = NULL; /* safety clear */ bus_dmamap_unload(rq->dma_tag, rq->mbuf[wqe_counter].dma_map); } mlx5e_build_rx_mbuf(cqe, rq, mb, byte_cnt); rq->stats.packets++; #ifdef HAVE_TURBO_LRO if (mb->m_pkthdr.csum_flags == 0 || (rq->ifp->if_capenable & IFCAP_LRO) == 0 || rq->lro.mbuf == NULL) { /* normal input */ rq->ifp->if_input(rq->ifp, mb); } else { tcp_tlro_rx(&rq->lro, mb); } #else if (mb->m_pkthdr.csum_flags == 0 || (rq->ifp->if_capenable & IFCAP_LRO) == 0 || rq->lro.lro_cnt == 0 || tcp_lro_rx(&rq->lro, mb, 0) != 0) { rq->ifp->if_input(rq->ifp, mb); } #endif wq_ll_pop: mlx5_wq_ll_pop(&rq->wq, wqe_counter_be, &wqe->next.next_wqe_index); } mlx5_cqwq_update_db_record(&rq->cq.wq); /* ensure cq space is freed before enabling more cqes */ wmb(); #ifndef HAVE_TURBO_LRO tcp_lro_flush_all(&rq->lro); #endif return (i); } void mlx5e_rx_cq_comp(struct mlx5_core_cq *mcq) { struct mlx5e_rq *rq = container_of(mcq, struct mlx5e_rq, cq.mcq); int i = 0; #ifdef HAVE_PER_CQ_EVENT_PACKET struct mbuf *mb = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, rq->wqe_sz); if (mb != NULL) { /* this code is used for debugging purpose only */ mb->m_pkthdr.len = mb->m_len = 15; memset(mb->m_data, 255, 14); mb->m_data[14] = rq->ix; mb->m_pkthdr.rcvif = rq->ifp; rq->ifp->if_input(rq->ifp, mb); } #endif mtx_lock(&rq->mtx); /* * Polling the entire CQ without posting new WQEs results in * lack of receive WQEs during heavy traffic scenarios. */ while (1) { if (mlx5e_poll_rx_cq(rq, MLX5E_RX_BUDGET_MAX) != MLX5E_RX_BUDGET_MAX) break; i += MLX5E_RX_BUDGET_MAX; if (i >= MLX5E_BUDGET_MAX) break; mlx5e_post_rx_wqes(rq); } mlx5e_post_rx_wqes(rq); mlx5e_cq_arm(&rq->cq); #ifdef HAVE_TURBO_LRO tcp_tlro_flush(&rq->lro, 1); #endif mtx_unlock(&rq->mtx); } Index: projects/vnet/sys/dev/mlx5/vport.h =================================================================== --- projects/vnet/sys/dev/mlx5/vport.h (revision 301546) +++ projects/vnet/sys/dev/mlx5/vport.h (revision 301547) @@ -1,81 +1,109 @@ /*- * Copyright (c) 2013-2015, Mellanox Technologies, Ltd. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS `AS IS' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef __MLX5_VPORT_H__ #define __MLX5_VPORT_H__ #include int mlx5_vport_alloc_q_counter(struct mlx5_core_dev *mdev, int *counter_set_id); int mlx5_vport_dealloc_q_counter(struct mlx5_core_dev *mdev, int counter_set_id); int mlx5_vport_query_out_of_rx_buffer(struct mlx5_core_dev *mdev, int counter_set_id, u32 *out_of_rx_buffer); u8 mlx5_query_vport_state(struct mlx5_core_dev *mdev, u8 opmod); +int mlx5_arm_vport_context_events(struct mlx5_core_dev *mdev, + u8 vport, + u32 events_mask); +int mlx5_query_vport_promisc(struct mlx5_core_dev *mdev, + u32 vport, + u8 *promisc_uc, + u8 *promisc_mc, + u8 *promisc_all); +int mlx5_modify_nic_vport_promisc(struct mlx5_core_dev *mdev, + int promisc_uc, + int promisc_mc, + int promisc_all); int mlx5_query_nic_vport_mac_address(struct mlx5_core_dev *mdev, u32 vport, u8 *addr); int mlx5_set_nic_vport_current_mac(struct mlx5_core_dev *mdev, int vport, bool other_vport, u8 *addr); int mlx5_set_nic_vport_vlan_list(struct mlx5_core_dev *dev, u32 vport, u16 *vlan_list, int list_len); int mlx5_set_nic_vport_mc_list(struct mlx5_core_dev *mdev, int vport, u64 *addr_list, size_t addr_list_len); int mlx5_set_nic_vport_promisc(struct mlx5_core_dev *mdev, int vport, bool promisc_mc, bool promisc_uc, bool promisc_all); +int mlx5_query_nic_vport_mac_list(struct mlx5_core_dev *dev, + u32 vport, + enum mlx5_list_type list_type, + u8 addr_list[][ETH_ALEN], + int *list_size); +int mlx5_modify_nic_vport_mac_list(struct mlx5_core_dev *dev, + enum mlx5_list_type list_type, + u8 addr_list[][ETH_ALEN], + int list_size); +int mlx5_query_nic_vport_vlan_list(struct mlx5_core_dev *dev, + u32 vport, + u16 *vlan_list, + int *list_size); +int mlx5_modify_nic_vport_vlans(struct mlx5_core_dev *dev, + u16 vlans[], + int list_size); int mlx5_set_nic_vport_permanent_mac(struct mlx5_core_dev *mdev, int vport, u8 *addr); int mlx5_nic_vport_enable_roce(struct mlx5_core_dev *mdev); int mlx5_nic_vport_disable_roce(struct mlx5_core_dev *mdev); int mlx5_query_nic_vport_system_image_guid(struct mlx5_core_dev *mdev, u64 *system_image_guid); int mlx5_query_nic_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid); int mlx5_query_nic_vport_port_guid(struct mlx5_core_dev *mdev, u64 *port_guid); int mlx5_query_nic_vport_qkey_viol_cntr(struct mlx5_core_dev *mdev, u16 *qkey_viol_cntr); int mlx5_query_hca_vport_node_guid(struct mlx5_core_dev *mdev, u64 *node_guid); int mlx5_query_hca_vport_system_image_guid(struct mlx5_core_dev *mdev, u64 *system_image_guid); int mlx5_query_hca_vport_context(struct mlx5_core_dev *mdev, u8 port_num, u8 vport_num, u32 *out, int outlen); int mlx5_query_hca_vport_pkey(struct mlx5_core_dev *dev, u8 other_vport, u8 port_num, u16 vf_num, u16 pkey_index, u16 *pkey); int mlx5_query_hca_vport_gid(struct mlx5_core_dev *dev, u8 port_num, u16 vport_num, u16 gid_index, union ib_gid *gid); int mlx5_set_eswitch_cvlan_info(struct mlx5_core_dev *mdev, u8 vport, u8 insert_mode, u8 strip_mode, u16 vlan, u8 cfi, u8 pcp); int mlx5_query_vport_counter(struct mlx5_core_dev *dev, u8 port_num, u16 vport_num, void *out, int out_size); int mlx5_get_vport_counters(struct mlx5_core_dev *dev, u8 port_num, struct mlx5_vport_counters *vc); #endif /* __MLX5_VPORT_H__ */ Index: projects/vnet/sys/dev/qlxgbe/ql_isr.c =================================================================== --- projects/vnet/sys/dev/qlxgbe/ql_isr.c (revision 301546) +++ projects/vnet/sys/dev/qlxgbe/ql_isr.c (revision 301547) @@ -1,924 +1,924 @@ /* * Copyright (c) 2013-2016 Qlogic Corporation * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ /* * File: ql_isr.c * Author : David C Somayajulu, Qlogic Corporation, Aliso Viejo, CA 92656. */ #include __FBSDID("$FreeBSD$"); #include "ql_os.h" #include "ql_hw.h" #include "ql_def.h" #include "ql_inline.h" #include "ql_ver.h" #include "ql_glbl.h" #include "ql_dbg.h" static void qla_replenish_normal_rx(qla_host_t *ha, qla_sds_t *sdsp, uint32_t r_idx); static void qla_rcv_error(qla_host_t *ha) { ha->flags.stop_rcv = 1; ha->qla_initiate_recovery = 1; } /* * Name: qla_rx_intr * Function: Handles normal ethernet frames received */ static void qla_rx_intr(qla_host_t *ha, qla_sgl_rcv_t *sgc, uint32_t sds_idx) { qla_rx_buf_t *rxb; struct mbuf *mp = NULL, *mpf = NULL, *mpl = NULL; struct ifnet *ifp = ha->ifp; qla_sds_t *sdsp; struct ether_vlan_header *eh; uint32_t i, rem_len = 0; uint32_t r_idx = 0; qla_rx_ring_t *rx_ring; if (ha->hw.num_rds_rings > 1) r_idx = sds_idx; ha->hw.rds[r_idx].count++; sdsp = &ha->hw.sds[sds_idx]; rx_ring = &ha->rx_ring[r_idx]; for (i = 0; i < sgc->num_handles; i++) { rxb = &rx_ring->rx_buf[sgc->handle[i] & 0x7FFF]; QL_ASSERT(ha, (rxb != NULL), ("%s: [sds_idx]=[%d] rxb != NULL\n", __func__,\ sds_idx)); if ((rxb == NULL) || QL_ERR_INJECT(ha, INJCT_RX_RXB_INVAL)) { /* log the error */ device_printf(ha->pci_dev, "%s invalid rxb[%d, %d, 0x%04x]\n", __func__, sds_idx, i, sgc->handle[i]); qla_rcv_error(ha); return; } mp = rxb->m_head; if (i == 0) mpf = mp; QL_ASSERT(ha, (mp != NULL), ("%s: [sds_idx]=[%d] mp != NULL\n", __func__,\ sds_idx)); bus_dmamap_sync(ha->rx_tag, rxb->map, BUS_DMASYNC_POSTREAD); rxb->m_head = NULL; rxb->next = sdsp->rxb_free; sdsp->rxb_free = rxb; sdsp->rx_free++; if ((mp == NULL) || QL_ERR_INJECT(ha, INJCT_RX_MP_NULL)) { /* log the error */ device_printf(ha->pci_dev, "%s mp == NULL [%d, %d, 0x%04x]\n", __func__, sds_idx, i, sgc->handle[i]); qla_rcv_error(ha); return; } if (i == 0) { mpl = mpf = mp; mp->m_flags |= M_PKTHDR; mp->m_pkthdr.len = sgc->pkt_length; mp->m_pkthdr.rcvif = ifp; rem_len = mp->m_pkthdr.len; } else { mp->m_flags &= ~M_PKTHDR; mpl->m_next = mp; mpl = mp; rem_len = rem_len - mp->m_len; } } mpl->m_len = rem_len; eh = mtod(mpf, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { uint32_t *data = (uint32_t *)eh; mpf->m_pkthdr.ether_vtag = ntohs(eh->evl_tag); mpf->m_flags |= M_VLANTAG; *(data + 3) = *(data + 2); *(data + 2) = *(data + 1); *(data + 1) = *data; m_adj(mpf, ETHER_VLAN_ENCAP_LEN); } if (sgc->chksum_status == Q8_STAT_DESC_STATUS_CHKSUM_OK) { mpf->m_pkthdr.csum_flags = CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR; mpf->m_pkthdr.csum_data = 0xFFFF; } else { mpf->m_pkthdr.csum_flags = 0; } if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); mpf->m_pkthdr.flowid = sgc->rss_hash; - M_HASHTYPE_SET(mpf, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(mpf, M_HASHTYPE_OPAQUE_HASH); (*ifp->if_input)(ifp, mpf); if (sdsp->rx_free > ha->std_replenish) qla_replenish_normal_rx(ha, sdsp, r_idx); return; } #define QLA_TCP_HDR_SIZE 20 #define QLA_TCP_TS_OPTION_SIZE 12 /* * Name: qla_lro_intr * Function: Handles normal ethernet frames received */ static int qla_lro_intr(qla_host_t *ha, qla_sgl_lro_t *sgc, uint32_t sds_idx) { qla_rx_buf_t *rxb; struct mbuf *mp = NULL, *mpf = NULL, *mpl = NULL; struct ifnet *ifp = ha->ifp; qla_sds_t *sdsp; struct ether_vlan_header *eh; uint32_t i, rem_len = 0, pkt_length, iplen; struct tcphdr *th; struct ip *ip = NULL; struct ip6_hdr *ip6 = NULL; uint16_t etype; uint32_t r_idx = 0; qla_rx_ring_t *rx_ring; if (ha->hw.num_rds_rings > 1) r_idx = sds_idx; ha->hw.rds[r_idx].count++; rx_ring = &ha->rx_ring[r_idx]; ha->lro_pkt_count++; sdsp = &ha->hw.sds[sds_idx]; pkt_length = sgc->payload_length + sgc->l4_offset; if (sgc->flags & Q8_LRO_COMP_TS) { pkt_length += QLA_TCP_HDR_SIZE + QLA_TCP_TS_OPTION_SIZE; } else { pkt_length += QLA_TCP_HDR_SIZE; } ha->lro_bytes += pkt_length; for (i = 0; i < sgc->num_handles; i++) { rxb = &rx_ring->rx_buf[sgc->handle[i] & 0x7FFF]; QL_ASSERT(ha, (rxb != NULL), ("%s: [sds_idx]=[%d] rxb != NULL\n", __func__,\ sds_idx)); if ((rxb == NULL) || QL_ERR_INJECT(ha, INJCT_LRO_RXB_INVAL)) { /* log the error */ device_printf(ha->pci_dev, "%s invalid rxb[%d, %d, 0x%04x]\n", __func__, sds_idx, i, sgc->handle[i]); qla_rcv_error(ha); return (0); } mp = rxb->m_head; if (i == 0) mpf = mp; QL_ASSERT(ha, (mp != NULL), ("%s: [sds_idx]=[%d] mp != NULL\n", __func__,\ sds_idx)); bus_dmamap_sync(ha->rx_tag, rxb->map, BUS_DMASYNC_POSTREAD); rxb->m_head = NULL; rxb->next = sdsp->rxb_free; sdsp->rxb_free = rxb; sdsp->rx_free++; if ((mp == NULL) || QL_ERR_INJECT(ha, INJCT_LRO_MP_NULL)) { /* log the error */ device_printf(ha->pci_dev, "%s mp == NULL [%d, %d, 0x%04x]\n", __func__, sds_idx, i, sgc->handle[i]); qla_rcv_error(ha); return (0); } if (i == 0) { mpl = mpf = mp; mp->m_flags |= M_PKTHDR; mp->m_pkthdr.len = pkt_length; mp->m_pkthdr.rcvif = ifp; rem_len = mp->m_pkthdr.len; } else { mp->m_flags &= ~M_PKTHDR; mpl->m_next = mp; mpl = mp; rem_len = rem_len - mp->m_len; } } mpl->m_len = rem_len; th = (struct tcphdr *)(mpf->m_data + sgc->l4_offset); if (sgc->flags & Q8_LRO_COMP_PUSH_BIT) th->th_flags |= TH_PUSH; m_adj(mpf, sgc->l2_offset); eh = mtod(mpf, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { uint32_t *data = (uint32_t *)eh; mpf->m_pkthdr.ether_vtag = ntohs(eh->evl_tag); mpf->m_flags |= M_VLANTAG; *(data + 3) = *(data + 2); *(data + 2) = *(data + 1); *(data + 1) = *data; m_adj(mpf, ETHER_VLAN_ENCAP_LEN); etype = ntohs(eh->evl_proto); } else { etype = ntohs(eh->evl_encap_proto); } if (etype == ETHERTYPE_IP) { ip = (struct ip *)(mpf->m_data + ETHER_HDR_LEN); iplen = (ip->ip_hl << 2) + (th->th_off << 2) + sgc->payload_length; ip->ip_len = htons(iplen); ha->ipv4_lro++; } else if (etype == ETHERTYPE_IPV6) { ip6 = (struct ip6_hdr *)(mpf->m_data + ETHER_HDR_LEN); iplen = (th->th_off << 2) + sgc->payload_length; ip6->ip6_plen = htons(iplen); ha->ipv6_lro++; } else { m_freem(mpf); if (sdsp->rx_free > ha->std_replenish) qla_replenish_normal_rx(ha, sdsp, r_idx); return 0; } mpf->m_pkthdr.csum_flags = CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR; mpf->m_pkthdr.csum_data = 0xFFFF; mpf->m_pkthdr.flowid = sgc->rss_hash; - M_HASHTYPE_SET(mpf, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(mpf, M_HASHTYPE_OPAQUE_HASH); if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); (*ifp->if_input)(ifp, mpf); if (sdsp->rx_free > ha->std_replenish) qla_replenish_normal_rx(ha, sdsp, r_idx); return (0); } static int qla_rcv_cont_sds(qla_host_t *ha, uint32_t sds_idx, uint32_t comp_idx, uint32_t dcount, uint16_t *handle, uint16_t *nhandles) { uint32_t i; uint16_t num_handles; q80_stat_desc_t *sdesc; uint32_t opcode; *nhandles = 0; dcount--; for (i = 0; i < dcount; i++) { comp_idx = (comp_idx + 1) & (NUM_STATUS_DESCRIPTORS-1); sdesc = (q80_stat_desc_t *) &ha->hw.sds[sds_idx].sds_ring_base[comp_idx]; opcode = Q8_STAT_DESC_OPCODE((sdesc->data[1])); if (!opcode) { device_printf(ha->pci_dev, "%s: opcode=0 %p %p\n", __func__, (void *)sdesc->data[0], (void *)sdesc->data[1]); return -1; } num_handles = Q8_SGL_STAT_DESC_NUM_HANDLES((sdesc->data[1])); if (!num_handles) { device_printf(ha->pci_dev, "%s: opcode=0 %p %p\n", __func__, (void *)sdesc->data[0], (void *)sdesc->data[1]); return -1; } if (QL_ERR_INJECT(ha, INJCT_NUM_HNDLE_INVALID)) num_handles = -1; switch (num_handles) { case 1: *handle++ = Q8_SGL_STAT_DESC_HANDLE1((sdesc->data[0])); break; case 2: *handle++ = Q8_SGL_STAT_DESC_HANDLE1((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE2((sdesc->data[0])); break; case 3: *handle++ = Q8_SGL_STAT_DESC_HANDLE1((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE2((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE3((sdesc->data[0])); break; case 4: *handle++ = Q8_SGL_STAT_DESC_HANDLE1((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE2((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE3((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE4((sdesc->data[0])); break; case 5: *handle++ = Q8_SGL_STAT_DESC_HANDLE1((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE2((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE3((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE4((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE5((sdesc->data[1])); break; case 6: *handle++ = Q8_SGL_STAT_DESC_HANDLE1((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE2((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE3((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE4((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE5((sdesc->data[1])); *handle++ = Q8_SGL_STAT_DESC_HANDLE6((sdesc->data[1])); break; case 7: *handle++ = Q8_SGL_STAT_DESC_HANDLE1((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE2((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE3((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE4((sdesc->data[0])); *handle++ = Q8_SGL_STAT_DESC_HANDLE5((sdesc->data[1])); *handle++ = Q8_SGL_STAT_DESC_HANDLE6((sdesc->data[1])); *handle++ = Q8_SGL_STAT_DESC_HANDLE7((sdesc->data[1])); break; default: device_printf(ha->pci_dev, "%s: invalid num handles %p %p\n", __func__, (void *)sdesc->data[0], (void *)sdesc->data[1]); QL_ASSERT(ha, (0),\ ("%s: %s [nh, sds, d0, d1]=[%d, %d, %p, %p]\n", __func__, "invalid num handles", sds_idx, num_handles, (void *)sdesc->data[0],(void *)sdesc->data[1])); qla_rcv_error(ha); return 0; } *nhandles = *nhandles + num_handles; } return 0; } /* * Name: qla_rcv_isr * Function: Main Interrupt Service Routine */ static uint32_t qla_rcv_isr(qla_host_t *ha, uint32_t sds_idx, uint32_t count) { device_t dev; qla_hw_t *hw; uint32_t comp_idx, c_idx = 0, desc_count = 0, opcode; volatile q80_stat_desc_t *sdesc, *sdesc0 = NULL; uint32_t ret = 0; qla_sgl_comp_t sgc; uint16_t nhandles; uint32_t sds_replenish_threshold = 0; dev = ha->pci_dev; hw = &ha->hw; hw->sds[sds_idx].rcv_active = 1; if (ha->flags.stop_rcv) { hw->sds[sds_idx].rcv_active = 0; return 0; } QL_DPRINT2(ha, (dev, "%s: [%d]enter\n", __func__, sds_idx)); /* * receive interrupts */ comp_idx = hw->sds[sds_idx].sdsr_next; while (count-- && !ha->flags.stop_rcv) { sdesc = (q80_stat_desc_t *) &hw->sds[sds_idx].sds_ring_base[comp_idx]; opcode = Q8_STAT_DESC_OPCODE((sdesc->data[1])); if (!opcode) break; hw->sds[sds_idx].intr_count++; switch (opcode) { case Q8_STAT_DESC_OPCODE_RCV_PKT: desc_count = 1; bzero(&sgc, sizeof(qla_sgl_comp_t)); sgc.rcv.pkt_length = Q8_STAT_DESC_TOTAL_LENGTH((sdesc->data[0])); sgc.rcv.num_handles = 1; sgc.rcv.handle[0] = Q8_STAT_DESC_HANDLE((sdesc->data[0])); sgc.rcv.chksum_status = Q8_STAT_DESC_STATUS((sdesc->data[1])); sgc.rcv.rss_hash = Q8_STAT_DESC_RSS_HASH((sdesc->data[0])); if (Q8_STAT_DESC_VLAN((sdesc->data[1]))) { sgc.rcv.vlan_tag = Q8_STAT_DESC_VLAN_ID((sdesc->data[1])); } qla_rx_intr(ha, &sgc.rcv, sds_idx); break; case Q8_STAT_DESC_OPCODE_SGL_RCV: desc_count = Q8_STAT_DESC_COUNT_SGL_RCV((sdesc->data[1])); if (desc_count > 1) { c_idx = (comp_idx + desc_count -1) & (NUM_STATUS_DESCRIPTORS-1); sdesc0 = (q80_stat_desc_t *) &hw->sds[sds_idx].sds_ring_base[c_idx]; if (Q8_STAT_DESC_OPCODE((sdesc0->data[1])) != Q8_STAT_DESC_OPCODE_CONT) { desc_count = 0; break; } } bzero(&sgc, sizeof(qla_sgl_comp_t)); sgc.rcv.pkt_length = Q8_STAT_DESC_TOTAL_LENGTH_SGL_RCV(\ (sdesc->data[0])); sgc.rcv.chksum_status = Q8_STAT_DESC_STATUS((sdesc->data[1])); sgc.rcv.rss_hash = Q8_STAT_DESC_RSS_HASH((sdesc->data[0])); if (Q8_STAT_DESC_VLAN((sdesc->data[1]))) { sgc.rcv.vlan_tag = Q8_STAT_DESC_VLAN_ID((sdesc->data[1])); } QL_ASSERT(ha, (desc_count <= 2) ,\ ("%s: [sds_idx, data0, data1]="\ "%d, %p, %p]\n", __func__, sds_idx,\ (void *)sdesc->data[0],\ (void *)sdesc->data[1])); sgc.rcv.num_handles = 1; sgc.rcv.handle[0] = Q8_STAT_DESC_HANDLE((sdesc->data[0])); if (qla_rcv_cont_sds(ha, sds_idx, comp_idx, desc_count, &sgc.rcv.handle[1], &nhandles)) { device_printf(dev, "%s: [sds_idx, dcount, data0, data1]=" "[%d, %d, 0x%llx, 0x%llx]\n", __func__, sds_idx, desc_count, (long long unsigned int)sdesc->data[0], (long long unsigned int)sdesc->data[1]); desc_count = 0; break; } sgc.rcv.num_handles += nhandles; qla_rx_intr(ha, &sgc.rcv, sds_idx); break; case Q8_STAT_DESC_OPCODE_SGL_LRO: desc_count = Q8_STAT_DESC_COUNT_SGL_LRO((sdesc->data[1])); if (desc_count > 1) { c_idx = (comp_idx + desc_count -1) & (NUM_STATUS_DESCRIPTORS-1); sdesc0 = (q80_stat_desc_t *) &hw->sds[sds_idx].sds_ring_base[c_idx]; if (Q8_STAT_DESC_OPCODE((sdesc0->data[1])) != Q8_STAT_DESC_OPCODE_CONT) { desc_count = 0; break; } } bzero(&sgc, sizeof(qla_sgl_comp_t)); sgc.lro.payload_length = Q8_STAT_DESC_TOTAL_LENGTH_SGL_RCV((sdesc->data[0])); sgc.lro.rss_hash = Q8_STAT_DESC_RSS_HASH((sdesc->data[0])); sgc.lro.num_handles = 1; sgc.lro.handle[0] = Q8_STAT_DESC_HANDLE((sdesc->data[0])); if (Q8_SGL_LRO_STAT_TS((sdesc->data[1]))) sgc.lro.flags |= Q8_LRO_COMP_TS; if (Q8_SGL_LRO_STAT_PUSH_BIT((sdesc->data[1]))) sgc.lro.flags |= Q8_LRO_COMP_PUSH_BIT; sgc.lro.l2_offset = Q8_SGL_LRO_STAT_L2_OFFSET((sdesc->data[1])); sgc.lro.l4_offset = Q8_SGL_LRO_STAT_L4_OFFSET((sdesc->data[1])); if (Q8_STAT_DESC_VLAN((sdesc->data[1]))) { sgc.lro.vlan_tag = Q8_STAT_DESC_VLAN_ID((sdesc->data[1])); } QL_ASSERT(ha, (desc_count <= 7) ,\ ("%s: [sds_idx, data0, data1]="\ "[%d, 0x%llx, 0x%llx]\n",\ __func__, sds_idx,\ (long long unsigned int)sdesc->data[0],\ (long long unsigned int)sdesc->data[1])); if (qla_rcv_cont_sds(ha, sds_idx, comp_idx, desc_count, &sgc.lro.handle[1], &nhandles)) { device_printf(dev, "%s: [sds_idx, data0, data1]="\ "[%d, 0x%llx, 0x%llx]\n",\ __func__, sds_idx,\ (long long unsigned int)sdesc->data[0],\ (long long unsigned int)sdesc->data[1]); desc_count = 0; break; } sgc.lro.num_handles += nhandles; if (qla_lro_intr(ha, &sgc.lro, sds_idx)) { device_printf(dev, "%s: [sds_idx, data0, data1]="\ "[%d, 0x%llx, 0x%llx]\n",\ __func__, sds_idx,\ (long long unsigned int)sdesc->data[0],\ (long long unsigned int)sdesc->data[1]); device_printf(dev, "%s: [comp_idx, c_idx, dcount, nhndls]="\ "[%d, %d, %d, %d]\n",\ __func__, comp_idx, c_idx, desc_count, sgc.lro.num_handles); if (desc_count > 1) { device_printf(dev, "%s: [sds_idx, data0, data1]="\ "[%d, 0x%llx, 0x%llx]\n",\ __func__, sds_idx,\ (long long unsigned int)sdesc0->data[0],\ (long long unsigned int)sdesc0->data[1]); } } break; default: device_printf(dev, "%s: default 0x%llx!\n", __func__, (long long unsigned int)sdesc->data[0]); break; } if (desc_count == 0) break; sds_replenish_threshold += desc_count; while (desc_count--) { sdesc->data[0] = 0ULL; sdesc->data[1] = 0ULL; comp_idx = (comp_idx + 1) & (NUM_STATUS_DESCRIPTORS-1); sdesc = (q80_stat_desc_t *) &hw->sds[sds_idx].sds_ring_base[comp_idx]; } if (sds_replenish_threshold > ha->hw.sds_cidx_thres) { sds_replenish_threshold = 0; if (hw->sds[sds_idx].sdsr_next != comp_idx) { QL_UPDATE_SDS_CONSUMER_INDEX(ha, sds_idx,\ comp_idx); } hw->sds[sds_idx].sdsr_next = comp_idx; } } if (ha->flags.stop_rcv) goto qla_rcv_isr_exit; if (hw->sds[sds_idx].sdsr_next != comp_idx) { QL_UPDATE_SDS_CONSUMER_INDEX(ha, sds_idx, comp_idx); } hw->sds[sds_idx].sdsr_next = comp_idx; sdesc = (q80_stat_desc_t *)&hw->sds[sds_idx].sds_ring_base[comp_idx]; opcode = Q8_STAT_DESC_OPCODE((sdesc->data[1])); if (opcode) ret = -1; qla_rcv_isr_exit: hw->sds[sds_idx].rcv_active = 0; return (ret); } void ql_mbx_isr(void *arg) { qla_host_t *ha; uint32_t data; uint32_t prev_link_state; ha = arg; if (ha == NULL) { device_printf(ha->pci_dev, "%s: arg == NULL\n", __func__); return; } data = READ_REG32(ha, Q8_FW_MBOX_CNTRL); if ((data & 0x3) != 0x1) { WRITE_REG32(ha, ha->hw.mbx_intr_mask_offset, 0); return; } data = READ_REG32(ha, Q8_FW_MBOX0); if ((data & 0xF000) != 0x8000) return; data = data & 0xFFFF; switch (data) { case 0x8001: /* It's an AEN */ ha->hw.cable_oui = READ_REG32(ha, (Q8_FW_MBOX0 + 4)); data = READ_REG32(ha, (Q8_FW_MBOX0 + 8)); ha->hw.cable_length = data & 0xFFFF; data = data >> 16; ha->hw.link_speed = data & 0xFFF; data = READ_REG32(ha, (Q8_FW_MBOX0 + 12)); prev_link_state = ha->hw.link_up; ha->hw.link_up = (((data & 0xFF) == 0) ? 0 : 1); if (prev_link_state != ha->hw.link_up) { if (ha->hw.link_up) if_link_state_change(ha->ifp, LINK_STATE_UP); else if_link_state_change(ha->ifp, LINK_STATE_DOWN); } ha->hw.module_type = ((data >> 8) & 0xFF); ha->hw.flags.fduplex = (((data & 0xFF0000) == 0) ? 0 : 1); ha->hw.flags.autoneg = (((data & 0xFF000000) == 0) ? 0 : 1); data = READ_REG32(ha, (Q8_FW_MBOX0 + 16)); ha->hw.flags.loopback_mode = data & 0x03; ha->hw.link_faults = (data >> 3) & 0xFF; break; case 0x8100: ha->hw.imd_compl=1; break; case 0x8101: ha->async_event = 1; ha->hw.aen_mb0 = 0x8101; ha->hw.aen_mb1 = READ_REG32(ha, (Q8_FW_MBOX0 + 4)); ha->hw.aen_mb2 = READ_REG32(ha, (Q8_FW_MBOX0 + 8)); ha->hw.aen_mb3 = READ_REG32(ha, (Q8_FW_MBOX0 + 12)); ha->hw.aen_mb4 = READ_REG32(ha, (Q8_FW_MBOX0 + 16)); break; case 0x8110: /* for now just dump the registers */ { uint32_t ombx[5]; ombx[0] = READ_REG32(ha, (Q8_FW_MBOX0 + 4)); ombx[1] = READ_REG32(ha, (Q8_FW_MBOX0 + 8)); ombx[2] = READ_REG32(ha, (Q8_FW_MBOX0 + 12)); ombx[3] = READ_REG32(ha, (Q8_FW_MBOX0 + 16)); ombx[4] = READ_REG32(ha, (Q8_FW_MBOX0 + 20)); device_printf(ha->pci_dev, "%s: " "0x%08x 0x%08x 0x%08x 0x%08x 0x%08x 0x%08x\n", __func__, data, ombx[0], ombx[1], ombx[2], ombx[3], ombx[4]); } break; case 0x8130: /* sfp insertion aen */ device_printf(ha->pci_dev, "%s: sfp inserted [0x%08x]\n", __func__, READ_REG32(ha, (Q8_FW_MBOX0 + 4))); break; case 0x8131: /* sfp removal aen */ device_printf(ha->pci_dev, "%s: sfp removed]\n", __func__); break; default: device_printf(ha->pci_dev, "%s: AEN[0x%08x]\n", __func__, data); break; } WRITE_REG32(ha, Q8_FW_MBOX_CNTRL, 0x0); WRITE_REG32(ha, ha->hw.mbx_intr_mask_offset, 0x0); return; } static void qla_replenish_normal_rx(qla_host_t *ha, qla_sds_t *sdsp, uint32_t r_idx) { qla_rx_buf_t *rxb; int count = sdsp->rx_free; uint32_t rx_next; qla_rdesc_t *rdesc; /* we can play with this value via a sysctl */ uint32_t replenish_thresh = ha->hw.rds_pidx_thres; rdesc = &ha->hw.rds[r_idx]; rx_next = rdesc->rx_next; while (count--) { rxb = sdsp->rxb_free; if (rxb == NULL) break; sdsp->rxb_free = rxb->next; sdsp->rx_free--; if (ql_get_mbuf(ha, rxb, NULL) == 0) { qla_set_hw_rcv_desc(ha, r_idx, rdesc->rx_in, rxb->handle, rxb->paddr, (rxb->m_head)->m_pkthdr.len); rdesc->rx_in++; if (rdesc->rx_in == NUM_RX_DESCRIPTORS) rdesc->rx_in = 0; rdesc->rx_next++; if (rdesc->rx_next == NUM_RX_DESCRIPTORS) rdesc->rx_next = 0; } else { device_printf(ha->pci_dev, "%s: ql_get_mbuf [0,(%d),(%d)] failed\n", __func__, rdesc->rx_in, rxb->handle); rxb->m_head = NULL; rxb->next = sdsp->rxb_free; sdsp->rxb_free = rxb; sdsp->rx_free++; break; } if (replenish_thresh-- == 0) { QL_UPDATE_RDS_PRODUCER_INDEX(ha, rdesc->prod_std, rdesc->rx_next); rx_next = rdesc->rx_next; replenish_thresh = ha->hw.rds_pidx_thres; } } if (rx_next != rdesc->rx_next) { QL_UPDATE_RDS_PRODUCER_INDEX(ha, rdesc->prod_std, rdesc->rx_next); } } void ql_isr(void *arg) { qla_ivec_t *ivec = arg; qla_host_t *ha ; int idx; qla_hw_t *hw; struct ifnet *ifp; uint32_t ret = 0; ha = ivec->ha; hw = &ha->hw; ifp = ha->ifp; if ((idx = ivec->sds_idx) >= ha->hw.num_sds_rings) return; if (idx == 0) taskqueue_enqueue(ha->tx_tq, &ha->tx_task); ret = qla_rcv_isr(ha, idx, -1); if (idx == 0) taskqueue_enqueue(ha->tx_tq, &ha->tx_task); if (!ha->flags.stop_rcv) { QL_ENABLE_INTERRUPTS(ha, idx); } return; } Index: projects/vnet/sys/dev/qlxge/qls_isr.c =================================================================== --- projects/vnet/sys/dev/qlxge/qls_isr.c (revision 301546) +++ projects/vnet/sys/dev/qlxge/qls_isr.c (revision 301547) @@ -1,396 +1,396 @@ /* * Copyright (c) 2013-2014 Qlogic Corporation * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ /* * File: qls_isr.c * Author : David C Somayajulu, Qlogic Corporation, Aliso Viejo, CA 92656. */ #include __FBSDID("$FreeBSD$"); #include "qls_os.h" #include "qls_hw.h" #include "qls_def.h" #include "qls_inline.h" #include "qls_ver.h" #include "qls_glbl.h" #include "qls_dbg.h" static void qls_tx_comp(qla_host_t *ha, uint32_t txr_idx, q81_tx_mac_comp_t *tx_comp) { qla_tx_buf_t *txb; uint32_t tx_idx = tx_comp->tid_lo; if (tx_idx >= NUM_TX_DESCRIPTORS) { ha->qla_initiate_recovery = 1; return; } txb = &ha->tx_ring[txr_idx].tx_buf[tx_idx]; if (txb->m_head) { if_inc_counter(ha->ifp, IFCOUNTER_OPACKETS, 1); bus_dmamap_sync(ha->tx_tag, txb->map, BUS_DMASYNC_POSTWRITE); bus_dmamap_unload(ha->tx_tag, txb->map); m_freem(txb->m_head); txb->m_head = NULL; } ha->tx_ring[txr_idx].txr_done++; if (ha->tx_ring[txr_idx].txr_done == NUM_TX_DESCRIPTORS) ha->tx_ring[txr_idx].txr_done = 0; } static void qls_replenish_rx(qla_host_t *ha, uint32_t r_idx) { qla_rx_buf_t *rxb; qla_rx_ring_t *rxr; int count; volatile q81_bq_addr_e_t *sbq_e; rxr = &ha->rx_ring[r_idx]; count = rxr->rx_free; sbq_e = rxr->sbq_vaddr; while (count--) { rxb = &rxr->rx_buf[rxr->sbq_next]; if (rxb->m_head == NULL) { if (qls_get_mbuf(ha, rxb, NULL) != 0) { device_printf(ha->pci_dev, "%s: qls_get_mbuf [0,%d,%d] failed\n", __func__, rxr->sbq_next, r_idx); rxb->m_head = NULL; break; } } if (rxb->m_head != NULL) { sbq_e[rxr->sbq_next].addr_lo = (uint32_t)rxb->paddr; sbq_e[rxr->sbq_next].addr_hi = (uint32_t)(rxb->paddr >> 32); rxr->sbq_next++; if (rxr->sbq_next == NUM_RX_DESCRIPTORS) rxr->sbq_next = 0; rxr->sbq_free++; rxr->rx_free--; } if (rxr->sbq_free == 16) { rxr->sbq_in += 16; rxr->sbq_in = rxr->sbq_in & (NUM_RX_DESCRIPTORS - 1); rxr->sbq_free = 0; Q81_WR_SBQ_PROD_IDX(r_idx, (rxr->sbq_in)); } } } static int qls_rx_comp(qla_host_t *ha, uint32_t rxr_idx, uint32_t cq_idx, q81_rx_t *cq_e) { qla_rx_buf_t *rxb; qla_rx_ring_t *rxr; device_t dev = ha->pci_dev; struct mbuf *mp = NULL; struct ifnet *ifp = ha->ifp; struct lro_ctrl *lro; struct ether_vlan_header *eh; rxr = &ha->rx_ring[rxr_idx]; lro = &rxr->lro; rxb = &rxr->rx_buf[rxr->rx_next]; if (!(cq_e->flags1 & Q81_RX_FLAGS1_DS)) { device_printf(dev, "%s: DS bit not set \n", __func__); return -1; } if (rxb->paddr != cq_e->b_paddr) { device_printf(dev, "%s: (rxb->paddr != cq_e->b_paddr)[%p, %p] \n", __func__, (void *)rxb->paddr, (void *)cq_e->b_paddr); Q81_SET_CQ_INVALID(cq_idx); ha->qla_initiate_recovery = 1; return(-1); } rxr->rx_int++; if ((cq_e->flags1 & Q81_RX_FLAGS1_ERR_MASK) == 0) { mp = rxb->m_head; rxb->m_head = NULL; if (mp == NULL) { device_printf(dev, "%s: mp == NULL\n", __func__); } else { mp->m_flags |= M_PKTHDR; mp->m_pkthdr.len = cq_e->length; mp->m_pkthdr.rcvif = ifp; mp->m_len = cq_e->length; eh = mtod(mp, struct ether_vlan_header *); if (eh->evl_encap_proto == htons(ETHERTYPE_VLAN)) { uint32_t *data = (uint32_t *)eh; mp->m_pkthdr.ether_vtag = ntohs(eh->evl_tag); mp->m_flags |= M_VLANTAG; *(data + 3) = *(data + 2); *(data + 2) = *(data + 1); *(data + 1) = *data; m_adj(mp, ETHER_VLAN_ENCAP_LEN); } if ((cq_e->flags1 & Q81_RX_FLAGS1_RSS_MATCH_MASK)) { rxr->rss_int++; mp->m_pkthdr.flowid = cq_e->rss; - M_HASHTYPE_SET(mp, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(mp, M_HASHTYPE_OPAQUE_HASH); } if (cq_e->flags0 & (Q81_RX_FLAGS0_TE | Q81_RX_FLAGS0_NU | Q81_RX_FLAGS0_IE)) { mp->m_pkthdr.csum_flags = 0; } else { mp->m_pkthdr.csum_flags = CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR; mp->m_pkthdr.csum_data = 0xFFFF; } if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1); if (lro->lro_cnt && (tcp_lro_rx(lro, mp, 0) == 0)) { /* LRO packet has been successfully queued */ } else { (*ifp->if_input)(ifp, mp); } } } else { device_printf(dev, "%s: err [0%08x]\n", __func__, cq_e->flags1); } rxr->rx_free++; rxr->rx_next++; if (rxr->rx_next == NUM_RX_DESCRIPTORS) rxr->rx_next = 0; if ((rxr->rx_free + rxr->sbq_free) >= 16) qls_replenish_rx(ha, rxr_idx); return 0; } static void qls_cq_isr(qla_host_t *ha, uint32_t cq_idx) { q81_cq_e_t *cq_e, *cq_b; uint32_t i, cq_comp_idx; int ret = 0, tx_comp_done = 0; struct lro_ctrl *lro; cq_b = ha->rx_ring[cq_idx].cq_base_vaddr; lro = &ha->rx_ring[cq_idx].lro; cq_comp_idx = *(ha->rx_ring[cq_idx].cqi_vaddr); i = ha->rx_ring[cq_idx].cq_next; while (i != cq_comp_idx) { cq_e = &cq_b[i]; switch (cq_e->opcode) { case Q81_IOCB_TX_MAC: case Q81_IOCB_TX_TSO: qls_tx_comp(ha, cq_idx, (q81_tx_mac_comp_t *)cq_e); tx_comp_done++; break; case Q81_IOCB_RX: ret = qls_rx_comp(ha, cq_idx, i, (q81_rx_t *)cq_e); break; case Q81_IOCB_MPI: case Q81_IOCB_SYS: default: device_printf(ha->pci_dev, "%s[%d %d 0x%x]: illegal \n", __func__, i, (*(ha->rx_ring[cq_idx].cqi_vaddr)), cq_e->opcode); qls_dump_buf32(ha, __func__, cq_e, (sizeof (q81_cq_e_t) >> 2)); break; } i++; if (i == NUM_CQ_ENTRIES) i = 0; if (ret) { break; } if (i == cq_comp_idx) { cq_comp_idx = *(ha->rx_ring[cq_idx].cqi_vaddr); } if (tx_comp_done) { taskqueue_enqueue(ha->tx_tq, &ha->tx_task); tx_comp_done = 0; } } tcp_lro_flush_all(lro); ha->rx_ring[cq_idx].cq_next = cq_comp_idx; if (!ret) { Q81_WR_CQ_CONS_IDX(cq_idx, (ha->rx_ring[cq_idx].cq_next)); } if (tx_comp_done) taskqueue_enqueue(ha->tx_tq, &ha->tx_task); return; } static void qls_mbx_isr(qla_host_t *ha) { uint32_t data; int i; device_t dev = ha->pci_dev; if (qls_mbx_rd_reg(ha, 0, &data) == 0) { if ((data & 0xF000) == 0x4000) { ha->mbox[0] = data; for (i = 1; i < Q81_NUM_MBX_REGISTERS; i++) { if (qls_mbx_rd_reg(ha, i, &data)) break; ha->mbox[i] = data; } ha->mbx_done = 1; } else if ((data & 0xF000) == 0x8000) { /* we have an AEN */ ha->aen[0] = data; for (i = 1; i < Q81_NUM_AEN_REGISTERS; i++) { if (qls_mbx_rd_reg(ha, i, &data)) break; ha->aen[i] = data; } device_printf(dev,"%s: AEN " "[0x%08x 0x%08x 0x%08x 0x%08x 0x%08x" " 0x%08x 0x%08x 0x%08x 0x%08x]\n", __func__, ha->aen[0], ha->aen[1], ha->aen[2], ha->aen[3], ha->aen[4], ha->aen[5], ha->aen[6], ha->aen[7], ha->aen[8]); switch ((ha->aen[0] & 0xFFFF)) { case 0x8011: ha->link_up = 1; break; case 0x8012: ha->link_up = 0; break; case 0x8130: ha->link_hw_info = ha->aen[1]; break; case 0x8131: ha->link_hw_info = 0; break; } } } WRITE_REG32(ha, Q81_CTL_HOST_CMD_STATUS, Q81_CTL_HCS_CMD_CLR_RTH_INTR); return; } void qls_isr(void *arg) { qla_ivec_t *ivec = arg; qla_host_t *ha; uint32_t status; uint32_t cq_idx; device_t dev; ha = ivec->ha; cq_idx = ivec->cq_idx; dev = ha->pci_dev; status = READ_REG32(ha, Q81_CTL_STATUS); if (status & Q81_CTL_STATUS_FE) { device_printf(dev, "%s fatal error\n", __func__); return; } if ((cq_idx == 0) && (status & Q81_CTL_STATUS_PI)) { qls_mbx_isr(ha); } status = READ_REG32(ha, Q81_CTL_INTR_STATUS1); if (status & ( 0x1 << cq_idx)) qls_cq_isr(ha, cq_idx); Q81_ENABLE_INTR(ha, cq_idx); return; } Index: projects/vnet/sys/kern/subr_intr.c =================================================================== --- projects/vnet/sys/kern/subr_intr.c (revision 301546) +++ projects/vnet/sys/kern/subr_intr.c (revision 301547) @@ -1,1544 +1,1392 @@ /*- * Copyright (c) 2015-2016 Svatopluk Kraus * Copyright (c) 2015-2016 Michal Meloun * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); /* * New-style Interrupt Framework * * TODO: - to support IPI (PPI) enabling on other CPUs if already started * - to complete things for removable PICs */ -#include "opt_acpi.h" #include "opt_ddb.h" #include "opt_hwpmc_hooks.h" -#include "opt_platform.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef HWPMC_HOOKS #include #endif #include #include #include #include #include #ifdef DDB #include #endif #include "pic_if.h" #include "msi_if.h" #define INTRNAME_LEN (2*MAXCOMLEN + 1) #ifdef DEBUG #define debugf(fmt, args...) do { printf("%s(): ", __func__); \ printf(fmt,##args); } while (0) #else #define debugf(fmt, args...) #endif MALLOC_DECLARE(M_INTRNG); MALLOC_DEFINE(M_INTRNG, "intr", "intr interrupt handling"); /* Main interrupt handler called from assembler -> 'hidden' for C code. */ void intr_irq_handler(struct trapframe *tf); /* Root interrupt controller stuff. */ device_t intr_irq_root_dev; static intr_irq_filter_t *irq_root_filter; static void *irq_root_arg; static u_int irq_root_ipicount; struct intr_pic_child { SLIST_ENTRY(intr_pic_child) pc_next; struct intr_pic *pc_pic; intr_child_irq_filter_t *pc_filter; void *pc_filter_arg; uintptr_t pc_start; uintptr_t pc_length; }; /* Interrupt controller definition. */ struct intr_pic { SLIST_ENTRY(intr_pic) pic_next; intptr_t pic_xref; /* hardware identification */ device_t pic_dev; #define FLAG_PIC (1 << 0) #define FLAG_MSI (1 << 1) u_int pic_flags; struct mtx pic_child_lock; SLIST_HEAD(, intr_pic_child) pic_children; }; static struct mtx pic_list_lock; static SLIST_HEAD(, intr_pic) pic_list; static struct intr_pic *pic_lookup(device_t dev, intptr_t xref); /* Interrupt source definition. */ static struct mtx isrc_table_lock; static struct intr_irqsrc *irq_sources[NIRQ]; u_int irq_next_free; -/* - * XXX - All stuff around struct intr_dev_data is considered as temporary - * until better place for storing struct intr_map_data will be find. - * - * For now, there are two global interrupt numbers spaces: - * <0, NIRQ) ... interrupts without config data - * managed in irq_sources[] - * IRQ_DDATA_BASE + <0, 2 * NIRQ) ... interrupts with config data - * managed in intr_ddata_tab[] - * - * Read intr_ddata_lookup() to see how these spaces are worked with. - * Note that each interrupt number from second space duplicates some number - * from first space at this moment. An interrupt number from first space can - * be duplicated even multiple times in second space. - */ -struct intr_dev_data { - device_t idd_dev; - intptr_t idd_xref; - u_int idd_irq; - struct intr_map_data * idd_data; - struct intr_irqsrc * idd_isrc; -}; - -static struct intr_dev_data *intr_ddata_tab[2 * NIRQ]; -static u_int intr_ddata_first_unused; - -#define IRQ_DDATA_BASE 10000 -CTASSERT(IRQ_DDATA_BASE > nitems(irq_sources)); - #ifdef SMP static boolean_t irq_assign_cpu = FALSE; #endif /* * - 2 counters for each I/O interrupt. * - MAXCPU counters for each IPI counters for SMP. */ #ifdef SMP #define INTRCNT_COUNT (NIRQ * 2 + INTR_IPI_COUNT * MAXCPU) #else #define INTRCNT_COUNT (NIRQ * 2) #endif /* Data for MI statistics reporting. */ u_long intrcnt[INTRCNT_COUNT]; char intrnames[INTRCNT_COUNT * INTRNAME_LEN]; size_t sintrcnt = sizeof(intrcnt); size_t sintrnames = sizeof(intrnames); static u_int intrcnt_index; /* * Interrupt framework initialization routine. */ static void intr_irq_init(void *dummy __unused) { SLIST_INIT(&pic_list); mtx_init(&pic_list_lock, "intr pic list", NULL, MTX_DEF); mtx_init(&isrc_table_lock, "intr isrc table", NULL, MTX_DEF); } SYSINIT(intr_irq_init, SI_SUB_INTR, SI_ORDER_FIRST, intr_irq_init, NULL); static void intrcnt_setname(const char *name, int index) { snprintf(intrnames + INTRNAME_LEN * index, INTRNAME_LEN, "%-*s", INTRNAME_LEN - 1, name); } /* * Update name for interrupt source with interrupt event. */ static void intrcnt_updatename(struct intr_irqsrc *isrc) { /* QQQ: What about stray counter name? */ mtx_assert(&isrc_table_lock, MA_OWNED); intrcnt_setname(isrc->isrc_event->ie_fullname, isrc->isrc_index); } /* * Virtualization for interrupt source interrupt counter increment. */ static inline void isrc_increment_count(struct intr_irqsrc *isrc) { if (isrc->isrc_flags & INTR_ISRCF_PPI) atomic_add_long(&isrc->isrc_count[0], 1); else isrc->isrc_count[0]++; } /* * Virtualization for interrupt source interrupt stray counter increment. */ static inline void isrc_increment_straycount(struct intr_irqsrc *isrc) { isrc->isrc_count[1]++; } /* * Virtualization for interrupt source interrupt name update. */ static void isrc_update_name(struct intr_irqsrc *isrc, const char *name) { char str[INTRNAME_LEN]; mtx_assert(&isrc_table_lock, MA_OWNED); if (name != NULL) { snprintf(str, INTRNAME_LEN, "%s: %s", isrc->isrc_name, name); intrcnt_setname(str, isrc->isrc_index); snprintf(str, INTRNAME_LEN, "stray %s: %s", isrc->isrc_name, name); intrcnt_setname(str, isrc->isrc_index + 1); } else { snprintf(str, INTRNAME_LEN, "%s:", isrc->isrc_name); intrcnt_setname(str, isrc->isrc_index); snprintf(str, INTRNAME_LEN, "stray %s:", isrc->isrc_name); intrcnt_setname(str, isrc->isrc_index + 1); } } /* * Virtualization for interrupt source interrupt counters setup. */ static void isrc_setup_counters(struct intr_irqsrc *isrc) { u_int index; /* * XXX - it does not work well with removable controllers and * interrupt sources !!! */ index = atomic_fetchadd_int(&intrcnt_index, 2); isrc->isrc_index = index; isrc->isrc_count = &intrcnt[index]; isrc_update_name(isrc, NULL); } /* * Virtualization for interrupt source interrupt counters release. */ static void isrc_release_counters(struct intr_irqsrc *isrc) { panic("%s: not implemented", __func__); } #ifdef SMP /* * Virtualization for interrupt source IPI counters setup. */ u_long * intr_ipi_setup_counters(const char *name) { u_int index, i; char str[INTRNAME_LEN]; index = atomic_fetchadd_int(&intrcnt_index, MAXCPU); for (i = 0; i < MAXCPU; i++) { snprintf(str, INTRNAME_LEN, "cpu%d:%s", i, name); intrcnt_setname(str, index + i); } return (&intrcnt[index]); } #endif /* * Main interrupt dispatch handler. It's called straight * from the assembler, where CPU interrupt is served. */ void intr_irq_handler(struct trapframe *tf) { struct trapframe * oldframe; struct thread * td; KASSERT(irq_root_filter != NULL, ("%s: no filter", __func__)); PCPU_INC(cnt.v_intr); critical_enter(); td = curthread; oldframe = td->td_intr_frame; td->td_intr_frame = tf; irq_root_filter(irq_root_arg); td->td_intr_frame = oldframe; critical_exit(); #ifdef HWPMC_HOOKS if (pmc_hook && TRAPF_USERMODE(tf) && (PCPU_GET(curthread)->td_pflags & TDP_CALLCHAIN)) pmc_hook(PCPU_GET(curthread), PMC_FN_USER_CALLCHAIN, tf); #endif } int intr_child_irq_handler(struct intr_pic *parent, uintptr_t irq) { struct intr_pic_child *child; bool found; found = false; mtx_lock_spin(&parent->pic_child_lock); SLIST_FOREACH(child, &parent->pic_children, pc_next) { if (child->pc_start <= irq && irq < (child->pc_start + child->pc_length)) { found = true; break; } } mtx_unlock_spin(&parent->pic_child_lock); if (found) return (child->pc_filter(child->pc_filter_arg, irq)); return (FILTER_STRAY); } /* * interrupt controller dispatch function for interrupts. It should * be called straight from the interrupt controller, when associated interrupt * source is learned. */ int intr_isrc_dispatch(struct intr_irqsrc *isrc, struct trapframe *tf) { KASSERT(isrc != NULL, ("%s: no source", __func__)); isrc_increment_count(isrc); #ifdef INTR_SOLO if (isrc->isrc_filter != NULL) { int error; error = isrc->isrc_filter(isrc->isrc_arg, tf); PIC_POST_FILTER(isrc->isrc_dev, isrc); if (error == FILTER_HANDLED) return (0); } else #endif if (isrc->isrc_event != NULL) { if (intr_event_handle(isrc->isrc_event, tf) == 0) return (0); } isrc_increment_straycount(isrc); return (EINVAL); } /* * Alloc unique interrupt number (resource handle) for interrupt source. * * There could be various strategies how to allocate free interrupt number * (resource handle) for new interrupt source. * * 1. Handles are always allocated forward, so handles are not recycled * immediately. However, if only one free handle left which is reused * constantly... */ static inline int isrc_alloc_irq(struct intr_irqsrc *isrc) { u_int maxirqs, irq; mtx_assert(&isrc_table_lock, MA_OWNED); maxirqs = nitems(irq_sources); if (irq_next_free >= maxirqs) return (ENOSPC); for (irq = irq_next_free; irq < maxirqs; irq++) { if (irq_sources[irq] == NULL) goto found; } for (irq = 0; irq < irq_next_free; irq++) { if (irq_sources[irq] == NULL) goto found; } irq_next_free = maxirqs; return (ENOSPC); found: isrc->isrc_irq = irq; irq_sources[irq] = isrc; irq_next_free = irq + 1; if (irq_next_free >= maxirqs) irq_next_free = 0; return (0); } /* * Free unique interrupt number (resource handle) from interrupt source. */ static inline int isrc_free_irq(struct intr_irqsrc *isrc) { mtx_assert(&isrc_table_lock, MA_OWNED); if (isrc->isrc_irq >= nitems(irq_sources)) return (EINVAL); if (irq_sources[isrc->isrc_irq] != isrc) return (EINVAL); irq_sources[isrc->isrc_irq] = NULL; isrc->isrc_irq = INTR_IRQ_INVALID; /* just to be safe */ return (0); } /* * Lookup interrupt source by interrupt number (resource handle). */ static inline struct intr_irqsrc * isrc_lookup(u_int irq) { if (irq < nitems(irq_sources)) return (irq_sources[irq]); return (NULL); } /* * Initialize interrupt source and register it into global interrupt table. */ int intr_isrc_register(struct intr_irqsrc *isrc, device_t dev, u_int flags, const char *fmt, ...) { int error; va_list ap; bzero(isrc, sizeof(struct intr_irqsrc)); isrc->isrc_dev = dev; isrc->isrc_irq = INTR_IRQ_INVALID; /* just to be safe */ isrc->isrc_flags = flags; va_start(ap, fmt); vsnprintf(isrc->isrc_name, INTR_ISRC_NAMELEN, fmt, ap); va_end(ap); mtx_lock(&isrc_table_lock); error = isrc_alloc_irq(isrc); if (error != 0) { mtx_unlock(&isrc_table_lock); return (error); } /* * Setup interrupt counters, but not for IPI sources. Those are setup * later and only for used ones (up to INTR_IPI_COUNT) to not exhaust * our counter pool. */ if ((isrc->isrc_flags & INTR_ISRCF_IPI) == 0) isrc_setup_counters(isrc); mtx_unlock(&isrc_table_lock); return (0); } /* * Deregister interrupt source from global interrupt table. */ int intr_isrc_deregister(struct intr_irqsrc *isrc) { int error; mtx_lock(&isrc_table_lock); if ((isrc->isrc_flags & INTR_ISRCF_IPI) == 0) isrc_release_counters(isrc); error = isrc_free_irq(isrc); mtx_unlock(&isrc_table_lock); return (error); } #ifdef SMP /* * A support function for a PIC to decide if provided ISRC should be inited * on given cpu. The logic of INTR_ISRCF_BOUND flag and isrc_cpu member of * struct intr_irqsrc is the following: * * If INTR_ISRCF_BOUND is set, the ISRC should be inited only on cpus * set in isrc_cpu. If not, the ISRC should be inited on every cpu and * isrc_cpu is kept consistent with it. Thus isrc_cpu is always correct. */ bool intr_isrc_init_on_cpu(struct intr_irqsrc *isrc, u_int cpu) { if (isrc->isrc_handlers == 0) return (false); if ((isrc->isrc_flags & (INTR_ISRCF_PPI | INTR_ISRCF_IPI)) == 0) return (false); if (isrc->isrc_flags & INTR_ISRCF_BOUND) return (CPU_ISSET(cpu, &isrc->isrc_cpu)); CPU_SET(cpu, &isrc->isrc_cpu); return (true); } #endif -static struct intr_dev_data * -intr_ddata_alloc(u_int extsize) -{ - struct intr_dev_data *ddata; - size_t size; - - size = sizeof(*ddata); - ddata = malloc(size + extsize, M_INTRNG, M_WAITOK | M_ZERO); - - mtx_lock(&isrc_table_lock); - if (intr_ddata_first_unused >= nitems(intr_ddata_tab)) { - mtx_unlock(&isrc_table_lock); - free(ddata, M_INTRNG); - return (NULL); - } - intr_ddata_tab[intr_ddata_first_unused] = ddata; - ddata->idd_irq = IRQ_DDATA_BASE + intr_ddata_first_unused++; - mtx_unlock(&isrc_table_lock); - - ddata->idd_data = (struct intr_map_data *)((uintptr_t)ddata + size); - return (ddata); -} - -static struct intr_irqsrc * -intr_ddata_lookup(u_int irq, struct intr_map_data **datap) -{ - int error; - struct intr_irqsrc *isrc; - struct intr_dev_data *ddata; - - isrc = isrc_lookup(irq); - if (isrc != NULL) { - if (datap != NULL) - *datap = NULL; - return (isrc); - } - - if (irq < IRQ_DDATA_BASE) - return (NULL); - - irq -= IRQ_DDATA_BASE; - if (irq >= nitems(intr_ddata_tab)) - return (NULL); - - ddata = intr_ddata_tab[irq]; - if (ddata->idd_isrc == NULL) { - error = intr_map_irq(ddata->idd_dev, ddata->idd_xref, - ddata->idd_data, &irq); - if (error != 0) - return (NULL); - ddata->idd_isrc = isrc_lookup(irq); - } - if (datap != NULL) - *datap = ddata->idd_data; - return (ddata->idd_isrc); -} - -#ifdef DEV_ACPI -/* - * Map interrupt source according to ACPI info into framework. If such mapping - * does not exist, create it. Return unique interrupt number (resource handle) - * associated with mapped interrupt source. - */ -u_int -intr_acpi_map_irq(device_t dev, u_int irq, enum intr_polarity pol, - enum intr_trigger trig) -{ - struct intr_map_data_acpi *daa; - struct intr_dev_data *ddata; - - ddata = intr_ddata_alloc(sizeof(struct intr_map_data_acpi)); - if (ddata == NULL) - return (INTR_IRQ_INVALID); /* no space left */ - - ddata->idd_dev = dev; - ddata->idd_data->type = INTR_MAP_DATA_ACPI; - - daa = (struct intr_map_data_acpi *)ddata->idd_data; - daa->irq = irq; - daa->pol = pol; - daa->trig = trig; - - return (ddata->idd_irq); -} -#endif - -/* - * Store GPIO interrupt decription in framework and return unique interrupt - * number (resource handle) associated with it. - */ -u_int -intr_gpio_map_irq(device_t dev, u_int pin_num, u_int pin_flags, u_int intr_mode) -{ - struct intr_dev_data *ddata; - struct intr_map_data_gpio *dag; - - ddata = intr_ddata_alloc(sizeof(struct intr_map_data_gpio)); - if (ddata == NULL) - return (INTR_IRQ_INVALID); /* no space left */ - - ddata->idd_dev = dev; - ddata->idd_data->type = INTR_MAP_DATA_GPIO; - - dag = (struct intr_map_data_gpio *)ddata->idd_data; - dag->gpio_pin_num = pin_num; - dag->gpio_pin_flags = pin_flags; - dag->gpio_intr_mode = intr_mode; - return (ddata->idd_irq); -} - #ifdef INTR_SOLO /* * Setup filter into interrupt source. */ static int iscr_setup_filter(struct intr_irqsrc *isrc, const char *name, intr_irq_filter_t *filter, void *arg, void **cookiep) { if (filter == NULL) return (EINVAL); mtx_lock(&isrc_table_lock); /* * Make sure that we do not mix the two ways * how we handle interrupt sources. */ if (isrc->isrc_filter != NULL || isrc->isrc_event != NULL) { mtx_unlock(&isrc_table_lock); return (EBUSY); } isrc->isrc_filter = filter; isrc->isrc_arg = arg; isrc_update_name(isrc, name); mtx_unlock(&isrc_table_lock); *cookiep = isrc; return (0); } #endif /* * Interrupt source pre_ithread method for MI interrupt framework. */ static void intr_isrc_pre_ithread(void *arg) { struct intr_irqsrc *isrc = arg; PIC_PRE_ITHREAD(isrc->isrc_dev, isrc); } /* * Interrupt source post_ithread method for MI interrupt framework. */ static void intr_isrc_post_ithread(void *arg) { struct intr_irqsrc *isrc = arg; PIC_POST_ITHREAD(isrc->isrc_dev, isrc); } /* * Interrupt source post_filter method for MI interrupt framework. */ static void intr_isrc_post_filter(void *arg) { struct intr_irqsrc *isrc = arg; PIC_POST_FILTER(isrc->isrc_dev, isrc); } /* * Interrupt source assign_cpu method for MI interrupt framework. */ static int intr_isrc_assign_cpu(void *arg, int cpu) { #ifdef SMP struct intr_irqsrc *isrc = arg; int error; if (isrc->isrc_dev != intr_irq_root_dev) return (EINVAL); mtx_lock(&isrc_table_lock); if (cpu == NOCPU) { CPU_ZERO(&isrc->isrc_cpu); isrc->isrc_flags &= ~INTR_ISRCF_BOUND; } else { CPU_SETOF(cpu, &isrc->isrc_cpu); isrc->isrc_flags |= INTR_ISRCF_BOUND; } /* * In NOCPU case, it's up to PIC to either leave ISRC on same CPU or * re-balance it to another CPU or enable it on more CPUs. However, * PIC is expected to change isrc_cpu appropriately to keep us well * informed if the call is successful. */ if (irq_assign_cpu) { error = PIC_BIND_INTR(isrc->isrc_dev, isrc); if (error) { CPU_ZERO(&isrc->isrc_cpu); mtx_unlock(&isrc_table_lock); return (error); } } mtx_unlock(&isrc_table_lock); return (0); #else return (EOPNOTSUPP); #endif } /* * Create interrupt event for interrupt source. */ static int isrc_event_create(struct intr_irqsrc *isrc) { struct intr_event *ie; int error; error = intr_event_create(&ie, isrc, 0, isrc->isrc_irq, intr_isrc_pre_ithread, intr_isrc_post_ithread, intr_isrc_post_filter, intr_isrc_assign_cpu, "%s:", isrc->isrc_name); if (error) return (error); mtx_lock(&isrc_table_lock); /* * Make sure that we do not mix the two ways * how we handle interrupt sources. Let contested event wins. */ #ifdef INTR_SOLO if (isrc->isrc_filter != NULL || isrc->isrc_event != NULL) { #else if (isrc->isrc_event != NULL) { #endif mtx_unlock(&isrc_table_lock); intr_event_destroy(ie); return (isrc->isrc_event != NULL ? EBUSY : 0); } isrc->isrc_event = ie; mtx_unlock(&isrc_table_lock); return (0); } #ifdef notyet /* * Destroy interrupt event for interrupt source. */ static void isrc_event_destroy(struct intr_irqsrc *isrc) { struct intr_event *ie; mtx_lock(&isrc_table_lock); ie = isrc->isrc_event; isrc->isrc_event = NULL; mtx_unlock(&isrc_table_lock); if (ie != NULL) intr_event_destroy(ie); } #endif /* * Add handler to interrupt source. */ static int isrc_add_handler(struct intr_irqsrc *isrc, const char *name, driver_filter_t filter, driver_intr_t handler, void *arg, enum intr_type flags, void **cookiep) { int error; if (isrc->isrc_event == NULL) { error = isrc_event_create(isrc); if (error) return (error); } error = intr_event_add_handler(isrc->isrc_event, name, filter, handler, arg, intr_priority(flags), flags, cookiep); if (error == 0) { mtx_lock(&isrc_table_lock); intrcnt_updatename(isrc); mtx_unlock(&isrc_table_lock); } return (error); } /* * Lookup interrupt controller locked. */ static inline struct intr_pic * pic_lookup_locked(device_t dev, intptr_t xref) { struct intr_pic *pic; mtx_assert(&pic_list_lock, MA_OWNED); if (dev == NULL && xref == 0) return (NULL); /* Note that pic->pic_dev is never NULL on registered PIC. */ SLIST_FOREACH(pic, &pic_list, pic_next) { if (dev == NULL) { if (xref == pic->pic_xref) return (pic); } else if (xref == 0 || pic->pic_xref == 0) { if (dev == pic->pic_dev) return (pic); } else if (xref == pic->pic_xref && dev == pic->pic_dev) return (pic); } return (NULL); } /* * Lookup interrupt controller. */ static struct intr_pic * pic_lookup(device_t dev, intptr_t xref) { struct intr_pic *pic; mtx_lock(&pic_list_lock); pic = pic_lookup_locked(dev, xref); mtx_unlock(&pic_list_lock); return (pic); } /* * Create interrupt controller. */ static struct intr_pic * pic_create(device_t dev, intptr_t xref) { struct intr_pic *pic; mtx_lock(&pic_list_lock); pic = pic_lookup_locked(dev, xref); if (pic != NULL) { mtx_unlock(&pic_list_lock); return (pic); } pic = malloc(sizeof(*pic), M_INTRNG, M_NOWAIT | M_ZERO); if (pic == NULL) { mtx_unlock(&pic_list_lock); return (NULL); } pic->pic_xref = xref; pic->pic_dev = dev; mtx_init(&pic->pic_child_lock, "pic child lock", NULL, MTX_SPIN); SLIST_INSERT_HEAD(&pic_list, pic, pic_next); mtx_unlock(&pic_list_lock); return (pic); } #ifdef notyet /* * Destroy interrupt controller. */ static void pic_destroy(device_t dev, intptr_t xref) { struct intr_pic *pic; mtx_lock(&pic_list_lock); pic = pic_lookup_locked(dev, xref); if (pic == NULL) { mtx_unlock(&pic_list_lock); return; } SLIST_REMOVE(&pic_list, pic, intr_pic, pic_next); mtx_unlock(&pic_list_lock); free(pic, M_INTRNG); } #endif /* * Register interrupt controller. */ struct intr_pic * intr_pic_register(device_t dev, intptr_t xref) { struct intr_pic *pic; if (dev == NULL) return (NULL); pic = pic_create(dev, xref); if (pic == NULL) return (NULL); pic->pic_flags |= FLAG_PIC; debugf("PIC %p registered for %s \n", pic, device_get_nameunit(dev), dev, xref); return (pic); } /* * Unregister interrupt controller. */ int intr_pic_deregister(device_t dev, intptr_t xref) { panic("%s: not implemented", __func__); } /* * Mark interrupt controller (itself) as a root one. * * Note that only an interrupt controller can really know its position * in interrupt controller's tree. So root PIC must claim itself as a root. * * In FDT case, according to ePAPR approved version 1.1 from 08 April 2011, * page 30: * "The root of the interrupt tree is determined when traversal * of the interrupt tree reaches an interrupt controller node without * an interrupts property and thus no explicit interrupt parent." */ int intr_pic_claim_root(device_t dev, intptr_t xref, intr_irq_filter_t *filter, void *arg, u_int ipicount) { struct intr_pic *pic; pic = pic_lookup(dev, xref); if (pic == NULL) { device_printf(dev, "not registered\n"); return (EINVAL); } KASSERT((pic->pic_flags & FLAG_PIC) != 0, ("%s: Found a non-PIC controller: %s", __func__, device_get_name(pic->pic_dev))); if (filter == NULL) { device_printf(dev, "filter missing\n"); return (EINVAL); } /* * Only one interrupt controllers could be on the root for now. * Note that we further suppose that there is not threaded interrupt * routine (handler) on the root. See intr_irq_handler(). */ if (intr_irq_root_dev != NULL) { device_printf(dev, "another root already set\n"); return (EBUSY); } intr_irq_root_dev = dev; irq_root_filter = filter; irq_root_arg = arg; irq_root_ipicount = ipicount; debugf("irq root set to %s\n", device_get_nameunit(dev)); return (0); } /* * Add a handler to manage a sub range of a parents interrupts. */ struct intr_pic * intr_pic_add_handler(device_t parent, struct intr_pic *pic, intr_child_irq_filter_t *filter, void *arg, uintptr_t start, uintptr_t length) { struct intr_pic *parent_pic; struct intr_pic_child *newchild; #ifdef INVARIANTS struct intr_pic_child *child; #endif parent_pic = pic_lookup(parent, 0); if (parent_pic == NULL) return (NULL); newchild = malloc(sizeof(*newchild), M_INTRNG, M_WAITOK | M_ZERO); newchild->pc_pic = pic; newchild->pc_filter = filter; newchild->pc_filter_arg = arg; newchild->pc_start = start; newchild->pc_length = length; mtx_lock_spin(&parent_pic->pic_child_lock); #ifdef INVARIANTS SLIST_FOREACH(child, &parent_pic->pic_children, pc_next) { KASSERT(child->pc_pic != pic, ("%s: Adding a child PIC twice", __func__)); } #endif SLIST_INSERT_HEAD(&parent_pic->pic_children, newchild, pc_next); mtx_unlock_spin(&parent_pic->pic_child_lock); return (pic); } int intr_map_irq(device_t dev, intptr_t xref, struct intr_map_data *data, u_int *irqp) { int error; struct intr_irqsrc *isrc; struct intr_pic *pic; if (data == NULL) return (EINVAL); pic = pic_lookup(dev, xref); if (pic == NULL) return (ESRCH); KASSERT((pic->pic_flags & FLAG_PIC) != 0, ("%s: Found a non-PIC controller: %s", __func__, device_get_name(pic->pic_dev))); error = PIC_MAP_INTR(pic->pic_dev, data, &isrc); if (error == 0) *irqp = isrc->isrc_irq; return (error); } int intr_alloc_irq(device_t dev, struct resource *res) { struct intr_map_data *data; struct intr_irqsrc *isrc; KASSERT(rman_get_start(res) == rman_get_end(res), ("%s: more interrupts in resource", __func__)); - data = rman_get_virtual(res); - if (data == NULL) - isrc = intr_ddata_lookup(rman_get_start(res), &data); - else - isrc = isrc_lookup(rman_get_start(res)); + isrc = isrc_lookup(rman_get_start(res)); if (isrc == NULL) return (EINVAL); + data = rman_get_virtual(res); return (PIC_ALLOC_INTR(isrc->isrc_dev, isrc, res, data)); } int intr_release_irq(device_t dev, struct resource *res) { struct intr_map_data *data; struct intr_irqsrc *isrc; KASSERT(rman_get_start(res) == rman_get_end(res), ("%s: more interrupts in resource", __func__)); - data = rman_get_virtual(res); - if (data == NULL) - isrc = intr_ddata_lookup(rman_get_start(res), &data); - else - isrc = isrc_lookup(rman_get_start(res)); + isrc = isrc_lookup(rman_get_start(res)); if (isrc == NULL) return (EINVAL); + data = rman_get_virtual(res); return (PIC_RELEASE_INTR(isrc->isrc_dev, isrc, res, data)); } int intr_setup_irq(device_t dev, struct resource *res, driver_filter_t filt, driver_intr_t hand, void *arg, int flags, void **cookiep) { int error; struct intr_map_data *data; struct intr_irqsrc *isrc; const char *name; KASSERT(rman_get_start(res) == rman_get_end(res), ("%s: more interrupts in resource", __func__)); - data = rman_get_virtual(res); - if (data == NULL) - isrc = intr_ddata_lookup(rman_get_start(res), &data); - else - isrc = isrc_lookup(rman_get_start(res)); + isrc = isrc_lookup(rman_get_start(res)); if (isrc == NULL) return (EINVAL); + data = rman_get_virtual(res); name = device_get_nameunit(dev); #ifdef INTR_SOLO /* * Standard handling is done through MI interrupt framework. However, * some interrupts could request solely own special handling. This * non standard handling can be used for interrupt controllers without * handler (filter only), so in case that interrupt controllers are * chained, MI interrupt framework is called only in leaf controller. * * Note that root interrupt controller routine is served as well, * however in intr_irq_handler(), i.e. main system dispatch routine. */ if (flags & INTR_SOLO && hand != NULL) { debugf("irq %u cannot solo on %s\n", irq, name); return (EINVAL); } if (flags & INTR_SOLO) { error = iscr_setup_filter(isrc, name, (intr_irq_filter_t *)filt, arg, cookiep); debugf("irq %u setup filter error %d on %s\n", irq, error, name); } else #endif { error = isrc_add_handler(isrc, name, filt, hand, arg, flags, cookiep); debugf("irq %u add handler error %d on %s\n", irq, error, name); } if (error != 0) return (error); mtx_lock(&isrc_table_lock); error = PIC_SETUP_INTR(isrc->isrc_dev, isrc, res, data); if (error == 0) { isrc->isrc_handlers++; if (isrc->isrc_handlers == 1) PIC_ENABLE_INTR(isrc->isrc_dev, isrc); } mtx_unlock(&isrc_table_lock); if (error != 0) intr_event_remove_handler(*cookiep); return (error); } int intr_teardown_irq(device_t dev, struct resource *res, void *cookie) { int error; struct intr_map_data *data; struct intr_irqsrc *isrc; KASSERT(rman_get_start(res) == rman_get_end(res), ("%s: more interrupts in resource", __func__)); - data = rman_get_virtual(res); - if (data == NULL) - isrc = intr_ddata_lookup(rman_get_start(res), &data); - else - isrc = isrc_lookup(rman_get_start(res)); + isrc = isrc_lookup(rman_get_start(res)); if (isrc == NULL || isrc->isrc_handlers == 0) return (EINVAL); + data = rman_get_virtual(res); + #ifdef INTR_SOLO if (isrc->isrc_filter != NULL) { if (isrc != cookie) return (EINVAL); mtx_lock(&isrc_table_lock); isrc->isrc_filter = NULL; isrc->isrc_arg = NULL; isrc->isrc_handlers = 0; PIC_DISABLE_INTR(isrc->isrc_dev, isrc); PIC_TEARDOWN_INTR(isrc->isrc_dev, isrc, res, data); isrc_update_name(isrc, NULL); mtx_unlock(&isrc_table_lock); return (0); } #endif if (isrc != intr_handler_source(cookie)) return (EINVAL); error = intr_event_remove_handler(cookie); if (error == 0) { mtx_lock(&isrc_table_lock); isrc->isrc_handlers--; if (isrc->isrc_handlers == 0) PIC_DISABLE_INTR(isrc->isrc_dev, isrc); PIC_TEARDOWN_INTR(isrc->isrc_dev, isrc, res, data); intrcnt_updatename(isrc); mtx_unlock(&isrc_table_lock); } return (error); } int intr_describe_irq(device_t dev, struct resource *res, void *cookie, const char *descr) { int error; struct intr_irqsrc *isrc; KASSERT(rman_get_start(res) == rman_get_end(res), ("%s: more interrupts in resource", __func__)); - isrc = intr_ddata_lookup(rman_get_start(res), NULL); + isrc = isrc_lookup(rman_get_start(res)); if (isrc == NULL || isrc->isrc_handlers == 0) return (EINVAL); #ifdef INTR_SOLO if (isrc->isrc_filter != NULL) { if (isrc != cookie) return (EINVAL); mtx_lock(&isrc_table_lock); isrc_update_name(isrc, descr); mtx_unlock(&isrc_table_lock); return (0); } #endif error = intr_event_describe_handler(isrc->isrc_event, cookie, descr); if (error == 0) { mtx_lock(&isrc_table_lock); intrcnt_updatename(isrc); mtx_unlock(&isrc_table_lock); } return (error); } #ifdef SMP int intr_bind_irq(device_t dev, struct resource *res, int cpu) { struct intr_irqsrc *isrc; KASSERT(rman_get_start(res) == rman_get_end(res), ("%s: more interrupts in resource", __func__)); - isrc = intr_ddata_lookup(rman_get_start(res), NULL); + isrc = isrc_lookup(rman_get_start(res)); if (isrc == NULL || isrc->isrc_handlers == 0) return (EINVAL); #ifdef INTR_SOLO if (isrc->isrc_filter != NULL) return (intr_isrc_assign_cpu(isrc, cpu)); #endif return (intr_event_bind(isrc->isrc_event, cpu)); } /* * Return the CPU that the next interrupt source should use. * For now just returns the next CPU according to round-robin. */ u_int intr_irq_next_cpu(u_int last_cpu, cpuset_t *cpumask) { if (!irq_assign_cpu || mp_ncpus == 1) return (PCPU_GET(cpuid)); do { last_cpu++; if (last_cpu > mp_maxid) last_cpu = 0; } while (!CPU_ISSET(last_cpu, cpumask)); return (last_cpu); } /* * Distribute all the interrupt sources among the available * CPUs once the AP's have been launched. */ static void intr_irq_shuffle(void *arg __unused) { struct intr_irqsrc *isrc; u_int i; if (mp_ncpus == 1) return; mtx_lock(&isrc_table_lock); irq_assign_cpu = TRUE; for (i = 0; i < NIRQ; i++) { isrc = irq_sources[i]; if (isrc == NULL || isrc->isrc_handlers == 0 || isrc->isrc_flags & (INTR_ISRCF_PPI | INTR_ISRCF_IPI)) continue; if (isrc->isrc_event != NULL && isrc->isrc_flags & INTR_ISRCF_BOUND && isrc->isrc_event->ie_cpu != CPU_FFS(&isrc->isrc_cpu) - 1) panic("%s: CPU inconsistency", __func__); if ((isrc->isrc_flags & INTR_ISRCF_BOUND) == 0) CPU_ZERO(&isrc->isrc_cpu); /* start again */ /* * We are in wicked position here if the following call fails * for bound ISRC. The best thing we can do is to clear * isrc_cpu so inconsistency with ie_cpu will be detectable. */ if (PIC_BIND_INTR(isrc->isrc_dev, isrc) != 0) CPU_ZERO(&isrc->isrc_cpu); } mtx_unlock(&isrc_table_lock); } SYSINIT(intr_irq_shuffle, SI_SUB_SMP, SI_ORDER_SECOND, intr_irq_shuffle, NULL); #else u_int intr_irq_next_cpu(u_int current_cpu, cpuset_t *cpumask) { return (PCPU_GET(cpuid)); } #endif /* * Register a MSI/MSI-X interrupt controller */ int intr_msi_register(device_t dev, intptr_t xref) { struct intr_pic *pic; if (dev == NULL) return (EINVAL); pic = pic_create(dev, xref); if (pic == NULL) return (ENOMEM); pic->pic_flags |= FLAG_MSI; debugf("PIC %p registered for %s \n", pic, device_get_nameunit(dev), dev, (uintmax_t)xref); return (0); } int intr_alloc_msi(device_t pci, device_t child, intptr_t xref, int count, int maxcount, int *irqs) { struct intr_irqsrc **isrc; struct intr_pic *pic; device_t pdev; int err, i; pic = pic_lookup(NULL, xref); if (pic == NULL) return (ESRCH); KASSERT((pic->pic_flags & FLAG_MSI) != 0, ("%s: Found a non-MSI controller: %s", __func__, device_get_name(pic->pic_dev))); isrc = malloc(sizeof(*isrc) * count, M_INTRNG, M_WAITOK); err = MSI_ALLOC_MSI(pic->pic_dev, child, count, maxcount, &pdev, isrc); if (err == 0) { for (i = 0; i < count; i++) { irqs[i] = isrc[i]->isrc_irq; } } free(isrc, M_INTRNG); return (err); } int intr_release_msi(device_t pci, device_t child, intptr_t xref, int count, int *irqs) { struct intr_irqsrc **isrc; struct intr_pic *pic; int i, err; pic = pic_lookup(NULL, xref); if (pic == NULL) return (ESRCH); KASSERT((pic->pic_flags & FLAG_MSI) != 0, ("%s: Found a non-MSI controller: %s", __func__, device_get_name(pic->pic_dev))); isrc = malloc(sizeof(*isrc) * count, M_INTRNG, M_WAITOK); for (i = 0; i < count; i++) { isrc[i] = isrc_lookup(irqs[i]); if (isrc == NULL) { free(isrc, M_INTRNG); return (EINVAL); } } err = MSI_RELEASE_MSI(pic->pic_dev, child, count, isrc); free(isrc, M_INTRNG); return (err); } int intr_alloc_msix(device_t pci, device_t child, intptr_t xref, int *irq) { struct intr_irqsrc *isrc; struct intr_pic *pic; device_t pdev; int err; pic = pic_lookup(NULL, xref); if (pic == NULL) return (ESRCH); KASSERT((pic->pic_flags & FLAG_MSI) != 0, ("%s: Found a non-MSI controller: %s", __func__, device_get_name(pic->pic_dev))); err = MSI_ALLOC_MSIX(pic->pic_dev, child, &pdev, &isrc); if (err != 0) return (err); *irq = isrc->isrc_irq; return (0); } int intr_release_msix(device_t pci, device_t child, intptr_t xref, int irq) { struct intr_irqsrc *isrc; struct intr_pic *pic; int err; pic = pic_lookup(NULL, xref); if (pic == NULL) return (ESRCH); KASSERT((pic->pic_flags & FLAG_MSI) != 0, ("%s: Found a non-MSI controller: %s", __func__, device_get_name(pic->pic_dev))); isrc = isrc_lookup(irq); if (isrc == NULL) return (EINVAL); err = MSI_RELEASE_MSIX(pic->pic_dev, child, isrc); return (err); } int intr_map_msi(device_t pci, device_t child, intptr_t xref, int irq, uint64_t *addr, uint32_t *data) { struct intr_irqsrc *isrc; struct intr_pic *pic; int err; pic = pic_lookup(NULL, xref); if (pic == NULL) return (ESRCH); KASSERT((pic->pic_flags & FLAG_MSI) != 0, ("%s: Found a non-MSI controller: %s", __func__, device_get_name(pic->pic_dev))); isrc = isrc_lookup(irq); if (isrc == NULL) return (EINVAL); err = MSI_MAP_MSI(pic->pic_dev, child, isrc, addr, data); return (err); } void dosoftints(void); void dosoftints(void) { } #ifdef SMP /* * Init interrupt controller on another CPU. */ void intr_pic_init_secondary(void) { /* * QQQ: Only root PIC is aware of other CPUs ??? */ KASSERT(intr_irq_root_dev != NULL, ("%s: no root attached", __func__)); //mtx_lock(&isrc_table_lock); PIC_INIT_SECONDARY(intr_irq_root_dev); //mtx_unlock(&isrc_table_lock); } #endif #ifdef DDB DB_SHOW_COMMAND(irqs, db_show_irqs) { u_int i, irqsum; u_long num; struct intr_irqsrc *isrc; for (irqsum = 0, i = 0; i < NIRQ; i++) { isrc = irq_sources[i]; if (isrc == NULL) continue; num = isrc->isrc_count != NULL ? isrc->isrc_count[0] : 0; db_printf("irq%-3u <%s>: cpu %02lx%s cnt %lu\n", i, isrc->isrc_name, isrc->isrc_cpu.__bits[0], isrc->isrc_flags & INTR_ISRCF_BOUND ? " (bound)" : "", num); irqsum += num; } db_printf("irq total %u\n", irqsum); } #endif Index: projects/vnet/sys/net/flowtable.c =================================================================== --- projects/vnet/sys/net/flowtable.c (revision 301546) +++ projects/vnet/sys/net/flowtable.c (revision 301547) @@ -1,1184 +1,1184 @@ /*- * Copyright (c) 2014 Gleb Smirnoff * Copyright (c) 2008-2010, BitGravity Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Neither the name of the BitGravity Corporation nor the names of its * contributors may be used to endorse or promote products derived from * this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ #include "opt_route.h" #include "opt_mpath.h" #include "opt_ddb.h" #include "opt_inet.h" #include "opt_inet6.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #endif #ifdef FLOWTABLE_HASH_ALL #include #include #include #endif #include #ifdef FLOWTABLE_HASH_ALL #define KEY_PORTS (sizeof(uint16_t) * 2) #define KEY_ADDRS 2 #else #define KEY_PORTS 0 #define KEY_ADDRS 1 #endif #ifdef INET6 #define KEY_ADDR_LEN sizeof(struct in6_addr) #else #define KEY_ADDR_LEN sizeof(struct in_addr) #endif #define KEYLEN ((KEY_ADDR_LEN * KEY_ADDRS + KEY_PORTS) / sizeof(uint32_t)) struct flentry { uint32_t f_hash; /* hash flowing forward */ uint32_t f_key[KEYLEN]; /* address(es and ports) */ uint32_t f_uptime; /* uptime at last access */ uint16_t f_fibnum; /* fib index */ #ifdef FLOWTABLE_HASH_ALL uint8_t f_proto; /* protocol */ uint8_t f_flags; /* stale? */ #define FL_STALE 1 #endif SLIST_ENTRY(flentry) f_next; /* pointer to collision entry */ struct rtentry *f_rt; /* rtentry for flow */ struct llentry *f_lle; /* llentry for flow */ }; #undef KEYLEN SLIST_HEAD(flist, flentry); /* Make sure we can use pcpu_zone_ptr for struct flist. */ CTASSERT(sizeof(struct flist) == sizeof(void *)); struct flowtable { counter_u64_t *ft_stat; int ft_size; /* * ft_table is a malloc(9)ed array of pointers. Pointers point to * memory from UMA_ZONE_PCPU zone. * ft_masks is per-cpu pointer itself. Each instance points * to a malloc(9)ed bitset, that is private to corresponding CPU. */ struct flist **ft_table; bitstr_t **ft_masks; bitstr_t *ft_tmpmask; }; #define FLOWSTAT_ADD(ft, name, v) \ counter_u64_add((ft)->ft_stat[offsetof(struct flowtable_stat, name) / sizeof(uint64_t)], (v)) #define FLOWSTAT_INC(ft, name) FLOWSTAT_ADD(ft, name, 1) static struct proc *flowcleanerproc; static uint32_t flow_hashjitter; static struct cv flowclean_f_cv; static struct cv flowclean_c_cv; static struct mtx flowclean_lock; static uint32_t flowclean_cycles; /* * TODO: * - add sysctls to resize && flush flow tables * - Add per flowtable sysctls for statistics and configuring timeouts * - add saturation counter to rtentry to support per-packet load-balancing * add flag to indicate round-robin flow, add list lookup from head for flows * - add sysctl / device node / syscall to support exporting and importing * of flows with flag to indicate that a flow was imported so should * not be considered for auto-cleaning * - support explicit connection state (currently only ad-hoc for DSR) * - idetach() cleanup for options VIMAGE builds. */ #ifdef INET static VNET_DEFINE(struct flowtable, ip4_ft); #define V_ip4_ft VNET(ip4_ft) #endif #ifdef INET6 static VNET_DEFINE(struct flowtable, ip6_ft); #define V_ip6_ft VNET(ip6_ft) #endif static uma_zone_t flow_zone; static VNET_DEFINE(int, flowtable_enable) = 1; #define V_flowtable_enable VNET(flowtable_enable) static SYSCTL_NODE(_net, OID_AUTO, flowtable, CTLFLAG_RD, NULL, "flowtable"); SYSCTL_INT(_net_flowtable, OID_AUTO, enable, CTLFLAG_VNET | CTLFLAG_RW, &VNET_NAME(flowtable_enable), 0, "enable flowtable caching."); SYSCTL_UMA_MAX(_net_flowtable, OID_AUTO, maxflows, CTLFLAG_RW, &flow_zone, "Maximum number of flows allowed"); static MALLOC_DEFINE(M_FTABLE, "flowtable", "flowtable hashes and bitstrings"); static struct flentry * flowtable_lookup_common(struct flowtable *, uint32_t *, int, uint32_t); #ifdef INET static struct flentry * flowtable_lookup_ipv4(struct mbuf *m, struct route *ro) { struct flentry *fle; struct sockaddr_in *sin; struct ip *ip; uint32_t fibnum; #ifdef FLOWTABLE_HASH_ALL uint32_t key[3]; int iphlen; uint16_t sport, dport; uint8_t proto; #endif ip = mtod(m, struct ip *); if (ip->ip_src.s_addr == ip->ip_dst.s_addr || (ntohl(ip->ip_dst.s_addr) >> IN_CLASSA_NSHIFT) == IN_LOOPBACKNET || (ntohl(ip->ip_src.s_addr) >> IN_CLASSA_NSHIFT) == IN_LOOPBACKNET) return (NULL); fibnum = M_GETFIB(m); #ifdef FLOWTABLE_HASH_ALL iphlen = ip->ip_hl << 2; proto = ip->ip_p; switch (proto) { case IPPROTO_TCP: { struct tcphdr *th; th = (struct tcphdr *)((char *)ip + iphlen); sport = th->th_sport; dport = th->th_dport; if (th->th_flags & (TH_RST|TH_FIN)) fibnum |= (FL_STALE << 24); break; } case IPPROTO_UDP: { struct udphdr *uh; uh = (struct udphdr *)((char *)ip + iphlen); sport = uh->uh_sport; dport = uh->uh_dport; break; } case IPPROTO_SCTP: { struct sctphdr *sh; sh = (struct sctphdr *)((char *)ip + iphlen); sport = sh->src_port; dport = sh->dest_port; /* XXXGL: handle stale? */ break; } default: sport = dport = 0; break; } key[0] = ip->ip_dst.s_addr; key[1] = ip->ip_src.s_addr; key[2] = (dport << 16) | sport; fibnum |= proto << 16; fle = flowtable_lookup_common(&V_ip4_ft, key, 3 * sizeof(uint32_t), fibnum); #else /* !FLOWTABLE_HASH_ALL */ fle = flowtable_lookup_common(&V_ip4_ft, (uint32_t *)&ip->ip_dst, sizeof(struct in_addr), fibnum); #endif /* FLOWTABLE_HASH_ALL */ if (fle == NULL) return (NULL); sin = (struct sockaddr_in *)&ro->ro_dst; sin->sin_family = AF_INET; sin->sin_len = sizeof(*sin); sin->sin_addr = ip->ip_dst; return (fle); } #endif /* INET */ #ifdef INET6 /* * PULLUP_TO(len, p, T) makes sure that len + sizeof(T) is contiguous, * then it sets p to point at the offset "len" in the mbuf. WARNING: the * pointer might become stale after other pullups (but we never use it * this way). */ #define PULLUP_TO(_len, p, T) \ do { \ int x = (_len) + sizeof(T); \ if ((m)->m_len < x) \ return (NULL); \ p = (mtod(m, char *) + (_len)); \ } while (0) #define TCP(p) ((struct tcphdr *)(p)) #define SCTP(p) ((struct sctphdr *)(p)) #define UDP(p) ((struct udphdr *)(p)) static struct flentry * flowtable_lookup_ipv6(struct mbuf *m, struct route *ro) { struct flentry *fle; struct sockaddr_in6 *sin6; struct ip6_hdr *ip6; uint32_t fibnum; #ifdef FLOWTABLE_HASH_ALL uint32_t key[9]; void *ulp; int hlen; uint16_t sport, dport; u_short offset; uint8_t proto; #else uint32_t key[4]; #endif ip6 = mtod(m, struct ip6_hdr *); if (in6_localaddr(&ip6->ip6_dst)) return (NULL); fibnum = M_GETFIB(m); #ifdef FLOWTABLE_HASH_ALL hlen = sizeof(struct ip6_hdr); proto = ip6->ip6_nxt; offset = sport = dport = 0; ulp = NULL; while (ulp == NULL) { switch (proto) { case IPPROTO_ICMPV6: case IPPROTO_OSPFIGP: case IPPROTO_PIM: case IPPROTO_CARP: case IPPROTO_ESP: case IPPROTO_NONE: ulp = ip6; break; case IPPROTO_TCP: PULLUP_TO(hlen, ulp, struct tcphdr); dport = TCP(ulp)->th_dport; sport = TCP(ulp)->th_sport; if (TCP(ulp)->th_flags & (TH_RST|TH_FIN)) fibnum |= (FL_STALE << 24); break; case IPPROTO_SCTP: PULLUP_TO(hlen, ulp, struct sctphdr); dport = SCTP(ulp)->src_port; sport = SCTP(ulp)->dest_port; /* XXXGL: handle stale? */ break; case IPPROTO_UDP: PULLUP_TO(hlen, ulp, struct udphdr); dport = UDP(ulp)->uh_dport; sport = UDP(ulp)->uh_sport; break; case IPPROTO_HOPOPTS: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_hbh); hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3; proto = ((struct ip6_hbh *)ulp)->ip6h_nxt; ulp = NULL; break; case IPPROTO_ROUTING: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_rthdr); hlen += (((struct ip6_rthdr *)ulp)->ip6r_len + 1) << 3; proto = ((struct ip6_rthdr *)ulp)->ip6r_nxt; ulp = NULL; break; case IPPROTO_FRAGMENT: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_frag); hlen += sizeof (struct ip6_frag); proto = ((struct ip6_frag *)ulp)->ip6f_nxt; offset = ((struct ip6_frag *)ulp)->ip6f_offlg & IP6F_OFF_MASK; ulp = NULL; break; case IPPROTO_DSTOPTS: /* RFC 2460 */ PULLUP_TO(hlen, ulp, struct ip6_hbh); hlen += (((struct ip6_hbh *)ulp)->ip6h_len + 1) << 3; proto = ((struct ip6_hbh *)ulp)->ip6h_nxt; ulp = NULL; break; case IPPROTO_AH: /* RFC 2402 */ PULLUP_TO(hlen, ulp, struct ip6_ext); hlen += (((struct ip6_ext *)ulp)->ip6e_len + 2) << 2; proto = ((struct ip6_ext *)ulp)->ip6e_nxt; ulp = NULL; break; default: PULLUP_TO(hlen, ulp, struct ip6_ext); break; } } bcopy(&ip6->ip6_dst, &key[0], sizeof(struct in6_addr)); bcopy(&ip6->ip6_src, &key[4], sizeof(struct in6_addr)); key[8] = (dport << 16) | sport; fibnum |= proto << 16; fle = flowtable_lookup_common(&V_ip6_ft, key, 9 * sizeof(uint32_t), fibnum); #else /* !FLOWTABLE_HASH_ALL */ bcopy(&ip6->ip6_dst, &key[0], sizeof(struct in6_addr)); fle = flowtable_lookup_common(&V_ip6_ft, key, sizeof(struct in6_addr), fibnum); #endif /* FLOWTABLE_HASH_ALL */ if (fle == NULL) return (NULL); sin6 = (struct sockaddr_in6 *)&ro->ro_dst; sin6->sin6_family = AF_INET6; sin6->sin6_len = sizeof(*sin6); bcopy(&ip6->ip6_dst, &sin6->sin6_addr, sizeof(struct in6_addr)); return (fle); } #endif /* INET6 */ static bitstr_t * flowtable_mask(struct flowtable *ft) { /* * flowtable_free_stale() calls w/o critical section, but * with sched_bind(). Since pointer is stable throughout * ft lifetime, it is safe, otherwise... * * CRITICAL_ASSERT(curthread); */ return (*(bitstr_t **)zpcpu_get(ft->ft_masks)); } static struct flist * flowtable_list(struct flowtable *ft, uint32_t hash) { CRITICAL_ASSERT(curthread); return (zpcpu_get(ft->ft_table[hash % ft->ft_size])); } static int flow_stale(struct flowtable *ft, struct flentry *fle, int maxidle) { if (((fle->f_rt->rt_flags & RTF_UP) == 0) || (fle->f_rt->rt_ifp == NULL) || !RT_LINK_IS_UP(fle->f_rt->rt_ifp) || (fle->f_lle->la_flags & LLE_VALID) == 0) return (1); if (time_uptime - fle->f_uptime > maxidle) return (1); #ifdef FLOWTABLE_HASH_ALL if (fle->f_flags & FL_STALE) return (1); #endif return (0); } static int flow_full(void) { int count, max; count = uma_zone_get_cur(flow_zone); max = uma_zone_get_max(flow_zone); return (count > (max - (max >> 3))); } static int flow_matches(struct flentry *fle, uint32_t *key, int keylen, uint32_t fibnum) { #ifdef FLOWTABLE_HASH_ALL uint8_t proto; proto = (fibnum >> 16) & 0xff; fibnum &= 0xffff; #endif CRITICAL_ASSERT(curthread); /* Microoptimization for IPv4: don't use bcmp(). */ if (((keylen == sizeof(uint32_t) && (fle->f_key[0] == key[0])) || (bcmp(fle->f_key, key, keylen) == 0)) && fibnum == fle->f_fibnum && #ifdef FLOWTABLE_HASH_ALL proto == fle->f_proto && #endif (fle->f_rt->rt_flags & RTF_UP) && fle->f_rt->rt_ifp != NULL && (fle->f_lle->la_flags & LLE_VALID)) return (1); return (0); } static struct flentry * flowtable_insert(struct flowtable *ft, uint32_t hash, uint32_t *key, int keylen, uint32_t fibnum0) { #ifdef INET6 struct route_in6 sro6; #endif #ifdef INET struct route sro; #endif struct route *ro = NULL; struct rtentry *rt; struct lltable *lt = NULL; struct llentry *lle; struct sockaddr_storage *l3addr; struct ifnet *ifp; struct flist *flist; struct flentry *fle, *iter; bitstr_t *mask; uint16_t fibnum = fibnum0; #ifdef FLOWTABLE_HASH_ALL uint8_t proto; proto = (fibnum0 >> 16) & 0xff; fibnum = fibnum0 & 0xffff; #endif /* * This bit of code ends up locking the * same route 3 times (just like ip_output + ether_output) * - at lookup * - in rt_check when called by arpresolve * - dropping the refcount for the rtentry * * This could be consolidated to one if we wrote a variant * of arpresolve with an rt_check variant that expected to * receive the route locked */ #ifdef INET if (ft == &V_ip4_ft) { struct sockaddr_in *sin; ro = &sro; bzero(&sro.ro_dst, sizeof(sro.ro_dst)); sin = (struct sockaddr_in *)&sro.ro_dst; sin->sin_family = AF_INET; sin->sin_len = sizeof(*sin); sin->sin_addr.s_addr = key[0]; } #endif #ifdef INET6 if (ft == &V_ip6_ft) { struct sockaddr_in6 *sin6; ro = (struct route *)&sro6; sin6 = &sro6.ro_dst; bzero(sin6, sizeof(*sin6)); sin6->sin6_family = AF_INET6; sin6->sin6_len = sizeof(*sin6); bcopy(key, &sin6->sin6_addr, sizeof(struct in6_addr)); } #endif ro->ro_rt = NULL; #ifdef RADIX_MPATH rtalloc_mpath_fib(ro, hash, fibnum); #else rtalloc_ign_fib(ro, 0, fibnum); #endif if (ro->ro_rt == NULL) return (NULL); rt = ro->ro_rt; ifp = rt->rt_ifp; if (ifp->if_flags & (IFF_POINTOPOINT | IFF_LOOPBACK)) { RTFREE(rt); return (NULL); } #ifdef INET if (ft == &V_ip4_ft) lt = LLTABLE(ifp); #endif #ifdef INET6 if (ft == &V_ip6_ft) lt = LLTABLE6(ifp); #endif if (rt->rt_flags & RTF_GATEWAY) l3addr = (struct sockaddr_storage *)rt->rt_gateway; else l3addr = (struct sockaddr_storage *)&ro->ro_dst; lle = llentry_alloc(ifp, lt, l3addr); if (lle == NULL) { RTFREE(rt); return (NULL); } /* Don't insert the entry if the ARP hasn't yet finished resolving. */ if ((lle->la_flags & LLE_VALID) == 0) { RTFREE(rt); LLE_FREE(lle); FLOWSTAT_INC(ft, ft_fail_lle_invalid); return (NULL); } fle = uma_zalloc(flow_zone, M_NOWAIT | M_ZERO); if (fle == NULL) { RTFREE(rt); LLE_FREE(lle); return (NULL); } fle->f_hash = hash; bcopy(key, &fle->f_key, keylen); fle->f_rt = rt; fle->f_lle = lle; fle->f_fibnum = fibnum; fle->f_uptime = time_uptime; #ifdef FLOWTABLE_HASH_ALL fle->f_proto = proto; fle->f_flags = fibnum0 >> 24; #endif critical_enter(); mask = flowtable_mask(ft); flist = flowtable_list(ft, hash); if (SLIST_EMPTY(flist)) { bit_set(mask, (hash % ft->ft_size)); SLIST_INSERT_HEAD(flist, fle, f_next); goto skip; } /* * find end of list and make sure that we were not * preempted by another thread handling this flow */ SLIST_FOREACH(iter, flist, f_next) { KASSERT(iter->f_hash % ft->ft_size == hash % ft->ft_size, ("%s: wrong hash", __func__)); if (flow_matches(iter, key, keylen, fibnum)) { /* * We probably migrated to an other CPU after * lookup in flowtable_lookup_common() failed. * It appeared that this CPU already has flow * entry. */ iter->f_uptime = time_uptime; #ifdef FLOWTABLE_HASH_ALL iter->f_flags |= fibnum >> 24; #endif critical_exit(); FLOWSTAT_INC(ft, ft_collisions); uma_zfree(flow_zone, fle); return (iter); } } SLIST_INSERT_HEAD(flist, fle, f_next); skip: critical_exit(); FLOWSTAT_INC(ft, ft_inserts); return (fle); } int flowtable_lookup(sa_family_t sa, struct mbuf *m, struct route *ro) { struct flentry *fle; struct llentry *lle; if (V_flowtable_enable == 0) return (ENXIO); switch (sa) { #ifdef INET case AF_INET: fle = flowtable_lookup_ipv4(m, ro); break; #endif #ifdef INET6 case AF_INET6: fle = flowtable_lookup_ipv6(m, ro); break; #endif default: panic("%s: sa %d", __func__, sa); } if (fle == NULL) return (EHOSTUNREACH); if (M_HASHTYPE_GET(m) == M_HASHTYPE_NONE) { - M_HASHTYPE_SET(m, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(m, M_HASHTYPE_OPAQUE_HASH); m->m_pkthdr.flowid = fle->f_hash; } ro->ro_rt = fle->f_rt; ro->ro_flags |= RT_NORTREF; lle = fle->f_lle; if (lle != NULL && (lle->la_flags & LLE_VALID)) ro->ro_lle = lle; /* share ref with fle->f_lle */ return (0); } static struct flentry * flowtable_lookup_common(struct flowtable *ft, uint32_t *key, int keylen, uint32_t fibnum) { struct flist *flist; struct flentry *fle; uint32_t hash; FLOWSTAT_INC(ft, ft_lookups); hash = jenkins_hash32(key, keylen / sizeof(uint32_t), flow_hashjitter); critical_enter(); flist = flowtable_list(ft, hash); SLIST_FOREACH(fle, flist, f_next) { KASSERT(fle->f_hash % ft->ft_size == hash % ft->ft_size, ("%s: wrong hash", __func__)); if (flow_matches(fle, key, keylen, fibnum)) { fle->f_uptime = time_uptime; #ifdef FLOWTABLE_HASH_ALL fle->f_flags |= fibnum >> 24; #endif critical_exit(); FLOWSTAT_INC(ft, ft_hits); return (fle); } } critical_exit(); FLOWSTAT_INC(ft, ft_misses); return (flowtable_insert(ft, hash, key, keylen, fibnum)); } static void flowtable_alloc(struct flowtable *ft) { ft->ft_table = malloc(ft->ft_size * sizeof(struct flist), M_FTABLE, M_WAITOK); for (int i = 0; i < ft->ft_size; i++) ft->ft_table[i] = uma_zalloc(pcpu_zone_ptr, M_WAITOK | M_ZERO); ft->ft_masks = uma_zalloc(pcpu_zone_ptr, M_WAITOK); for (int i = 0; i < mp_ncpus; i++) { bitstr_t **b; b = zpcpu_get_cpu(ft->ft_masks, i); *b = bit_alloc(ft->ft_size, M_FTABLE, M_WAITOK); } ft->ft_tmpmask = bit_alloc(ft->ft_size, M_FTABLE, M_WAITOK); } static void flowtable_free_stale(struct flowtable *ft, struct rtentry *rt, int maxidle) { struct flist *flist, freelist; struct flentry *fle, *fle1, *fleprev; bitstr_t *mask, *tmpmask; int curbit, tmpsize; SLIST_INIT(&freelist); mask = flowtable_mask(ft); tmpmask = ft->ft_tmpmask; tmpsize = ft->ft_size; memcpy(tmpmask, mask, ft->ft_size/8); curbit = 0; fleprev = NULL; /* pacify gcc */ /* * XXX Note to self, bit_ffs operates at the byte level * and thus adds gratuitous overhead */ bit_ffs(tmpmask, ft->ft_size, &curbit); while (curbit != -1) { if (curbit >= ft->ft_size || curbit < -1) { log(LOG_ALERT, "warning: bad curbit value %d \n", curbit); break; } FLOWSTAT_INC(ft, ft_free_checks); critical_enter(); flist = flowtable_list(ft, curbit); #ifdef DIAGNOSTIC if (SLIST_EMPTY(flist) && curbit > 0) { log(LOG_ALERT, "warning bit=%d set, but no fle found\n", curbit); } #endif SLIST_FOREACH_SAFE(fle, flist, f_next, fle1) { if (rt != NULL && fle->f_rt != rt) { fleprev = fle; continue; } if (!flow_stale(ft, fle, maxidle)) { fleprev = fle; continue; } if (fle == SLIST_FIRST(flist)) SLIST_REMOVE_HEAD(flist, f_next); else SLIST_REMOVE_AFTER(fleprev, f_next); SLIST_INSERT_HEAD(&freelist, fle, f_next); } if (SLIST_EMPTY(flist)) bit_clear(mask, curbit); critical_exit(); bit_clear(tmpmask, curbit); bit_ffs(tmpmask, tmpsize, &curbit); } SLIST_FOREACH_SAFE(fle, &freelist, f_next, fle1) { FLOWSTAT_INC(ft, ft_frees); if (fle->f_rt != NULL) RTFREE(fle->f_rt); if (fle->f_lle != NULL) LLE_FREE(fle->f_lle); uma_zfree(flow_zone, fle); } } static void flowtable_clean_vnet(struct flowtable *ft, struct rtentry *rt, int maxidle) { int i; CPU_FOREACH(i) { if (smp_started == 1) { thread_lock(curthread); sched_bind(curthread, i); thread_unlock(curthread); } flowtable_free_stale(ft, rt, maxidle); if (smp_started == 1) { thread_lock(curthread); sched_unbind(curthread); thread_unlock(curthread); } } } void flowtable_route_flush(sa_family_t sa, struct rtentry *rt) { struct flowtable *ft; switch (sa) { #ifdef INET case AF_INET: ft = &V_ip4_ft; break; #endif #ifdef INET6 case AF_INET6: ft = &V_ip6_ft; break; #endif default: panic("%s: sa %d", __func__, sa); } flowtable_clean_vnet(ft, rt, 0); } static void flowtable_cleaner(void) { VNET_ITERATOR_DECL(vnet_iter); struct thread *td; if (bootverbose) log(LOG_INFO, "flowtable cleaner started\n"); td = curthread; while (1) { uint32_t flowclean_freq, maxidle; /* * The maximum idle time, as well as frequency are arbitrary. */ if (flow_full()) maxidle = 5; else maxidle = 30; VNET_LIST_RLOCK(); VNET_FOREACH(vnet_iter) { CURVNET_SET(vnet_iter); #ifdef INET flowtable_clean_vnet(&V_ip4_ft, NULL, maxidle); #endif #ifdef INET6 flowtable_clean_vnet(&V_ip6_ft, NULL, maxidle); #endif CURVNET_RESTORE(); } VNET_LIST_RUNLOCK(); if (flow_full()) flowclean_freq = 4*hz; else flowclean_freq = 20*hz; mtx_lock(&flowclean_lock); thread_lock(td); sched_prio(td, PPAUSE); thread_unlock(td); flowclean_cycles++; cv_broadcast(&flowclean_f_cv); cv_timedwait(&flowclean_c_cv, &flowclean_lock, flowclean_freq); mtx_unlock(&flowclean_lock); } } static void flowtable_flush(void *unused __unused) { uint64_t start; mtx_lock(&flowclean_lock); start = flowclean_cycles; while (start == flowclean_cycles) { cv_broadcast(&flowclean_c_cv); cv_wait(&flowclean_f_cv, &flowclean_lock); } mtx_unlock(&flowclean_lock); } static struct kproc_desc flow_kp = { "flowcleaner", flowtable_cleaner, &flowcleanerproc }; SYSINIT(flowcleaner, SI_SUB_KTHREAD_IDLE, SI_ORDER_ANY, kproc_start, &flow_kp); static int flowtable_get_size(char *name) { int size; if (TUNABLE_INT_FETCH(name, &size)) { if (size < 256) size = 256; if (!powerof2(size)) { printf("%s must be power of 2\n", name); size = 2048; } } else { /* * round up to the next power of 2 */ size = 1 << fls((1024 + maxusers * 64) - 1); } return (size); } static void flowtable_init(const void *unused __unused) { flow_hashjitter = arc4random(); flow_zone = uma_zcreate("flows", sizeof(struct flentry), NULL, NULL, NULL, NULL, (64-1), UMA_ZONE_MAXBUCKET); uma_zone_set_max(flow_zone, 1024 + maxusers * 64 * mp_ncpus); cv_init(&flowclean_c_cv, "c_flowcleanwait"); cv_init(&flowclean_f_cv, "f_flowcleanwait"); mtx_init(&flowclean_lock, "flowclean lock", NULL, MTX_DEF); EVENTHANDLER_REGISTER(ifnet_departure_event, flowtable_flush, NULL, EVENTHANDLER_PRI_ANY); } SYSINIT(flowtable_init, SI_SUB_PROTO_BEGIN, SI_ORDER_FIRST, flowtable_init, NULL); #ifdef INET static SYSCTL_NODE(_net_flowtable, OID_AUTO, ip4, CTLFLAG_RD, NULL, "Flowtable for IPv4"); static VNET_PCPUSTAT_DEFINE(struct flowtable_stat, ip4_ftstat); VNET_PCPUSTAT_SYSINIT(ip4_ftstat); VNET_PCPUSTAT_SYSUNINIT(ip4_ftstat); SYSCTL_VNET_PCPUSTAT(_net_flowtable_ip4, OID_AUTO, stat, struct flowtable_stat, ip4_ftstat, "Flowtable statistics for IPv4 " "(struct flowtable_stat, net/flowtable.h)"); static void flowtable_init_vnet_v4(const void *unused __unused) { V_ip4_ft.ft_size = flowtable_get_size("net.flowtable.ip4.size"); V_ip4_ft.ft_stat = VNET(ip4_ftstat); flowtable_alloc(&V_ip4_ft); } VNET_SYSINIT(ft_vnet_v4, SI_SUB_PROTO_IFATTACHDOMAIN, SI_ORDER_ANY, flowtable_init_vnet_v4, NULL); #endif /* INET */ #ifdef INET6 static SYSCTL_NODE(_net_flowtable, OID_AUTO, ip6, CTLFLAG_RD, NULL, "Flowtable for IPv6"); static VNET_PCPUSTAT_DEFINE(struct flowtable_stat, ip6_ftstat); VNET_PCPUSTAT_SYSINIT(ip6_ftstat); VNET_PCPUSTAT_SYSUNINIT(ip6_ftstat); SYSCTL_VNET_PCPUSTAT(_net_flowtable_ip6, OID_AUTO, stat, struct flowtable_stat, ip6_ftstat, "Flowtable statistics for IPv6 " "(struct flowtable_stat, net/flowtable.h)"); static void flowtable_init_vnet_v6(const void *unused __unused) { V_ip6_ft.ft_size = flowtable_get_size("net.flowtable.ip6.size"); V_ip6_ft.ft_stat = VNET(ip6_ftstat); flowtable_alloc(&V_ip6_ft); } VNET_SYSINIT(flowtable_init_vnet_v6, SI_SUB_PROTO_IFATTACHDOMAIN, SI_ORDER_ANY, flowtable_init_vnet_v6, NULL); #endif /* INET6 */ #ifdef DDB static bitstr_t * flowtable_mask_pcpu(struct flowtable *ft, int cpuid) { return (zpcpu_get_cpu(*ft->ft_masks, cpuid)); } static struct flist * flowtable_list_pcpu(struct flowtable *ft, uint32_t hash, int cpuid) { return (zpcpu_get_cpu(&ft->ft_table[hash % ft->ft_size], cpuid)); } static void flow_show(struct flowtable *ft, struct flentry *fle) { int idle_time; int rt_valid, ifp_valid; volatile struct rtentry *rt; struct ifnet *ifp = NULL; uint32_t *hashkey = fle->f_key; idle_time = (int)(time_uptime - fle->f_uptime); rt = fle->f_rt; rt_valid = rt != NULL; if (rt_valid) ifp = rt->rt_ifp; ifp_valid = ifp != NULL; #ifdef INET if (ft == &V_ip4_ft) { char daddr[4*sizeof "123"]; #ifdef FLOWTABLE_HASH_ALL char saddr[4*sizeof "123"]; uint16_t sport, dport; #endif inet_ntoa_r(*(struct in_addr *) &hashkey[0], daddr); #ifdef FLOWTABLE_HASH_ALL inet_ntoa_r(*(struct in_addr *) &hashkey[1], saddr); dport = ntohs((uint16_t)(hashkey[2] >> 16)); sport = ntohs((uint16_t)(hashkey[2] & 0xffff)); db_printf("%s:%d->%s:%d", saddr, sport, daddr, dport); #else db_printf("%s ", daddr); #endif } #endif /* INET */ #ifdef INET6 if (ft == &V_ip6_ft) { #ifdef FLOWTABLE_HASH_ALL db_printf("\n\tkey=%08x:%08x:%08x%08x:%08x:%08x%08x:%08x:%08x", hashkey[0], hashkey[1], hashkey[2], hashkey[3], hashkey[4], hashkey[5], hashkey[6], hashkey[7], hashkey[8]); #else db_printf("\n\tkey=%08x:%08x:%08x ", hashkey[0], hashkey[1], hashkey[2]); #endif } #endif /* INET6 */ db_printf("hash=%08x idle_time=%03d" "\n\tfibnum=%02d rt=%p", fle->f_hash, idle_time, fle->f_fibnum, fle->f_rt); #ifdef FLOWTABLE_HASH_ALL if (fle->f_flags & FL_STALE) db_printf(" FL_STALE "); #endif if (rt_valid) { if (rt->rt_flags & RTF_UP) db_printf(" RTF_UP "); } if (ifp_valid) { if (ifp->if_flags & IFF_LOOPBACK) db_printf(" IFF_LOOPBACK "); if (ifp->if_flags & IFF_UP) db_printf(" IFF_UP "); if (ifp->if_flags & IFF_POINTOPOINT) db_printf(" IFF_POINTOPOINT "); } db_printf("\n"); } static void flowtable_show(struct flowtable *ft, int cpuid) { int curbit = 0; bitstr_t *mask, *tmpmask; if (cpuid != -1) db_printf("cpu: %d\n", cpuid); mask = flowtable_mask_pcpu(ft, cpuid); tmpmask = ft->ft_tmpmask; memcpy(tmpmask, mask, ft->ft_size/8); /* * XXX Note to self, bit_ffs operates at the byte level * and thus adds gratuitous overhead */ bit_ffs(tmpmask, ft->ft_size, &curbit); while (curbit != -1) { struct flist *flist; struct flentry *fle; if (curbit >= ft->ft_size || curbit < -1) { db_printf("warning: bad curbit value %d \n", curbit); break; } flist = flowtable_list_pcpu(ft, curbit, cpuid); SLIST_FOREACH(fle, flist, f_next) flow_show(ft, fle); bit_clear(tmpmask, curbit); bit_ffs(tmpmask, ft->ft_size, &curbit); } } static void flowtable_show_vnet(struct flowtable *ft) { int i; CPU_FOREACH(i) flowtable_show(ft, i); } DB_SHOW_COMMAND(flowtables, db_show_flowtables) { VNET_ITERATOR_DECL(vnet_iter); VNET_FOREACH(vnet_iter) { CURVNET_SET(vnet_iter); #ifdef VIMAGE db_printf("vnet %p\n", vnet_iter); #endif #ifdef INET printf("IPv4:\n"); flowtable_show_vnet(&V_ip4_ft); #endif #ifdef INET6 printf("IPv6:\n"); flowtable_show_vnet(&V_ip6_ft); #endif CURVNET_RESTORE(); } } #endif Index: projects/vnet/sys/net/if_vxlan.c =================================================================== --- projects/vnet/sys/net/if_vxlan.c (revision 301546) +++ projects/vnet/sys/net/if_vxlan.c (revision 301547) @@ -1,3090 +1,3089 @@ /*- * Copyright (c) 2014, Bryan Venteicher * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ #include "opt_inet.h" #include "opt_inet6.h" #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct vxlan_softc; LIST_HEAD(vxlan_softc_head, vxlan_softc); struct vxlan_socket_mc_info { union vxlan_sockaddr vxlsomc_saddr; union vxlan_sockaddr vxlsomc_gaddr; int vxlsomc_ifidx; int vxlsomc_users; }; #define VXLAN_SO_MC_MAX_GROUPS 32 #define VXLAN_SO_VNI_HASH_SHIFT 6 #define VXLAN_SO_VNI_HASH_SIZE (1 << VXLAN_SO_VNI_HASH_SHIFT) #define VXLAN_SO_VNI_HASH(_vni) ((_vni) % VXLAN_SO_VNI_HASH_SIZE) struct vxlan_socket { struct socket *vxlso_sock; struct rmlock vxlso_lock; u_int vxlso_refcnt; union vxlan_sockaddr vxlso_laddr; LIST_ENTRY(vxlan_socket) vxlso_entry; struct vxlan_softc_head vxlso_vni_hash[VXLAN_SO_VNI_HASH_SIZE]; struct vxlan_socket_mc_info vxlso_mc[VXLAN_SO_MC_MAX_GROUPS]; }; #define VXLAN_SO_RLOCK(_vso, _p) rm_rlock(&(_vso)->vxlso_lock, (_p)) #define VXLAN_SO_RUNLOCK(_vso, _p) rm_runlock(&(_vso)->vxlso_lock, (_p)) #define VXLAN_SO_WLOCK(_vso) rm_wlock(&(_vso)->vxlso_lock) #define VXLAN_SO_WUNLOCK(_vso) rm_wunlock(&(_vso)->vxlso_lock) #define VXLAN_SO_LOCK_ASSERT(_vso) \ rm_assert(&(_vso)->vxlso_lock, RA_LOCKED) #define VXLAN_SO_LOCK_WASSERT(_vso) \ rm_assert(&(_vso)->vxlso_lock, RA_WLOCKED) #define VXLAN_SO_ACQUIRE(_vso) refcount_acquire(&(_vso)->vxlso_refcnt) #define VXLAN_SO_RELEASE(_vso) refcount_release(&(_vso)->vxlso_refcnt) struct vxlan_ftable_entry { LIST_ENTRY(vxlan_ftable_entry) vxlfe_hash; uint16_t vxlfe_flags; uint8_t vxlfe_mac[ETHER_ADDR_LEN]; union vxlan_sockaddr vxlfe_raddr; time_t vxlfe_expire; }; #define VXLAN_FE_FLAG_DYNAMIC 0x01 #define VXLAN_FE_FLAG_STATIC 0x02 #define VXLAN_FE_IS_DYNAMIC(_fe) \ ((_fe)->vxlfe_flags & VXLAN_FE_FLAG_DYNAMIC) #define VXLAN_SC_FTABLE_SHIFT 9 #define VXLAN_SC_FTABLE_SIZE (1 << VXLAN_SC_FTABLE_SHIFT) #define VXLAN_SC_FTABLE_MASK (VXLAN_SC_FTABLE_SIZE - 1) #define VXLAN_SC_FTABLE_HASH(_sc, _mac) \ (vxlan_mac_hash(_sc, _mac) % VXLAN_SC_FTABLE_SIZE) LIST_HEAD(vxlan_ftable_head, vxlan_ftable_entry); struct vxlan_statistics { uint32_t ftable_nospace; uint32_t ftable_lock_upgrade_failed; }; struct vxlan_softc { struct ifnet *vxl_ifp; struct vxlan_socket *vxl_sock; uint32_t vxl_vni; union vxlan_sockaddr vxl_src_addr; union vxlan_sockaddr vxl_dst_addr; uint32_t vxl_flags; #define VXLAN_FLAG_INIT 0x0001 #define VXLAN_FLAG_TEARDOWN 0x0002 #define VXLAN_FLAG_LEARN 0x0004 uint32_t vxl_port_hash_key; uint16_t vxl_min_port; uint16_t vxl_max_port; uint8_t vxl_ttl; /* Lookup table from MAC address to forwarding entry. */ uint32_t vxl_ftable_cnt; uint32_t vxl_ftable_max; uint32_t vxl_ftable_timeout; uint32_t vxl_ftable_hash_key; struct vxlan_ftable_head *vxl_ftable; /* Derived from vxl_dst_addr. */ struct vxlan_ftable_entry vxl_default_fe; struct ip_moptions *vxl_im4o; struct ip6_moptions *vxl_im6o; struct rmlock vxl_lock; volatile u_int vxl_refcnt; int vxl_unit; int vxl_vso_mc_index; struct vxlan_statistics vxl_stats; struct sysctl_oid *vxl_sysctl_node; struct sysctl_ctx_list vxl_sysctl_ctx; struct callout vxl_callout; uint8_t vxl_hwaddr[ETHER_ADDR_LEN]; int vxl_mc_ifindex; struct ifnet *vxl_mc_ifp; char vxl_mc_ifname[IFNAMSIZ]; LIST_ENTRY(vxlan_softc) vxl_entry; LIST_ENTRY(vxlan_softc) vxl_ifdetach_list; }; #define VXLAN_RLOCK(_sc, _p) rm_rlock(&(_sc)->vxl_lock, (_p)) #define VXLAN_RUNLOCK(_sc, _p) rm_runlock(&(_sc)->vxl_lock, (_p)) #define VXLAN_WLOCK(_sc) rm_wlock(&(_sc)->vxl_lock) #define VXLAN_WUNLOCK(_sc) rm_wunlock(&(_sc)->vxl_lock) #define VXLAN_LOCK_WOWNED(_sc) rm_wowned(&(_sc)->vxl_lock) #define VXLAN_LOCK_ASSERT(_sc) rm_assert(&(_sc)->vxl_lock, RA_LOCKED) #define VXLAN_LOCK_WASSERT(_sc) rm_assert(&(_sc)->vxl_lock, RA_WLOCKED) #define VXLAN_UNLOCK(_sc, _p) do { \ if (VXLAN_LOCK_WOWNED(_sc)) \ VXLAN_WUNLOCK(_sc); \ else \ VXLAN_RUNLOCK(_sc, _p); \ } while (0) #define VXLAN_ACQUIRE(_sc) refcount_acquire(&(_sc)->vxl_refcnt) #define VXLAN_RELEASE(_sc) refcount_release(&(_sc)->vxl_refcnt) #define satoconstsin(sa) ((const struct sockaddr_in *)(sa)) #define satoconstsin6(sa) ((const struct sockaddr_in6 *)(sa)) struct vxlanudphdr { struct udphdr vxlh_udp; struct vxlan_header vxlh_hdr; } __packed; static int vxlan_ftable_addr_cmp(const uint8_t *, const uint8_t *); static void vxlan_ftable_init(struct vxlan_softc *); static void vxlan_ftable_fini(struct vxlan_softc *); static void vxlan_ftable_flush(struct vxlan_softc *, int); static void vxlan_ftable_expire(struct vxlan_softc *); static int vxlan_ftable_update_locked(struct vxlan_softc *, const struct sockaddr *, const uint8_t *, struct rm_priotracker *); static int vxlan_ftable_update(struct vxlan_softc *, const struct sockaddr *, const uint8_t *); static int vxlan_ftable_sysctl_dump(SYSCTL_HANDLER_ARGS); static struct vxlan_ftable_entry * vxlan_ftable_entry_alloc(void); static void vxlan_ftable_entry_free(struct vxlan_ftable_entry *); static void vxlan_ftable_entry_init(struct vxlan_softc *, struct vxlan_ftable_entry *, const uint8_t *, const struct sockaddr *, uint32_t); static void vxlan_ftable_entry_destroy(struct vxlan_softc *, struct vxlan_ftable_entry *); static int vxlan_ftable_entry_insert(struct vxlan_softc *, struct vxlan_ftable_entry *); static struct vxlan_ftable_entry * vxlan_ftable_entry_lookup(struct vxlan_softc *, const uint8_t *); static void vxlan_ftable_entry_dump(struct vxlan_ftable_entry *, struct sbuf *); static struct vxlan_socket * vxlan_socket_alloc(const union vxlan_sockaddr *); static void vxlan_socket_destroy(struct vxlan_socket *); static void vxlan_socket_release(struct vxlan_socket *); static struct vxlan_socket * vxlan_socket_lookup(union vxlan_sockaddr *vxlsa); static void vxlan_socket_insert(struct vxlan_socket *); static int vxlan_socket_init(struct vxlan_socket *, struct ifnet *); static int vxlan_socket_bind(struct vxlan_socket *, struct ifnet *); static int vxlan_socket_create(struct ifnet *, int, const union vxlan_sockaddr *, struct vxlan_socket **); static void vxlan_socket_ifdetach(struct vxlan_socket *, struct ifnet *, struct vxlan_softc_head *); static struct vxlan_socket * vxlan_socket_mc_lookup(const union vxlan_sockaddr *); static int vxlan_sockaddr_mc_info_match( const struct vxlan_socket_mc_info *, const union vxlan_sockaddr *, const union vxlan_sockaddr *, int); static int vxlan_socket_mc_join_group(struct vxlan_socket *, const union vxlan_sockaddr *, const union vxlan_sockaddr *, int *, union vxlan_sockaddr *); static int vxlan_socket_mc_leave_group(struct vxlan_socket *, const union vxlan_sockaddr *, const union vxlan_sockaddr *, int); static int vxlan_socket_mc_add_group(struct vxlan_socket *, const union vxlan_sockaddr *, const union vxlan_sockaddr *, int, int *); static void vxlan_socket_mc_release_group_by_idx(struct vxlan_socket *, int); static struct vxlan_softc * vxlan_socket_lookup_softc_locked(struct vxlan_socket *, uint32_t); static struct vxlan_softc * vxlan_socket_lookup_softc(struct vxlan_socket *, uint32_t); static int vxlan_socket_insert_softc(struct vxlan_socket *, struct vxlan_softc *); static void vxlan_socket_remove_softc(struct vxlan_socket *, struct vxlan_softc *); static struct ifnet * vxlan_multicast_if_ref(struct vxlan_softc *, int); static void vxlan_free_multicast(struct vxlan_softc *); static int vxlan_setup_multicast_interface(struct vxlan_softc *); static int vxlan_setup_multicast(struct vxlan_softc *); static int vxlan_setup_socket(struct vxlan_softc *); static void vxlan_setup_interface(struct vxlan_softc *); static int vxlan_valid_init_config(struct vxlan_softc *); static void vxlan_init_wait(struct vxlan_softc *); static void vxlan_init_complete(struct vxlan_softc *); static void vxlan_init(void *); static void vxlan_release(struct vxlan_softc *); static void vxlan_teardown_wait(struct vxlan_softc *); static void vxlan_teardown_complete(struct vxlan_softc *); static void vxlan_teardown_locked(struct vxlan_softc *); static void vxlan_teardown(struct vxlan_softc *); static void vxlan_ifdetach(struct vxlan_softc *, struct ifnet *, struct vxlan_softc_head *); static void vxlan_timer(void *); static int vxlan_ctrl_get_config(struct vxlan_softc *, void *); static int vxlan_ctrl_set_vni(struct vxlan_softc *, void *); static int vxlan_ctrl_set_local_addr(struct vxlan_softc *, void *); static int vxlan_ctrl_set_remote_addr(struct vxlan_softc *, void *); static int vxlan_ctrl_set_local_port(struct vxlan_softc *, void *); static int vxlan_ctrl_set_remote_port(struct vxlan_softc *, void *); static int vxlan_ctrl_set_port_range(struct vxlan_softc *, void *); static int vxlan_ctrl_set_ftable_timeout(struct vxlan_softc *, void *); static int vxlan_ctrl_set_ftable_max(struct vxlan_softc *, void *); static int vxlan_ctrl_set_multicast_if(struct vxlan_softc * , void *); static int vxlan_ctrl_set_ttl(struct vxlan_softc *, void *); static int vxlan_ctrl_set_learn(struct vxlan_softc *, void *); static int vxlan_ctrl_ftable_entry_add(struct vxlan_softc *, void *); static int vxlan_ctrl_ftable_entry_rem(struct vxlan_softc *, void *); static int vxlan_ctrl_flush(struct vxlan_softc *, void *); static int vxlan_ioctl_drvspec(struct vxlan_softc *, struct ifdrv *, int); static int vxlan_ioctl_ifflags(struct vxlan_softc *); static int vxlan_ioctl(struct ifnet *, u_long, caddr_t); #if defined(INET) || defined(INET6) static uint16_t vxlan_pick_source_port(struct vxlan_softc *, struct mbuf *); static void vxlan_encap_header(struct vxlan_softc *, struct mbuf *, int, uint16_t, uint16_t); #endif static int vxlan_encap4(struct vxlan_softc *, const union vxlan_sockaddr *, struct mbuf *); static int vxlan_encap6(struct vxlan_softc *, const union vxlan_sockaddr *, struct mbuf *); static int vxlan_transmit(struct ifnet *, struct mbuf *); static void vxlan_qflush(struct ifnet *); static void vxlan_rcv_udp_packet(struct mbuf *, int, struct inpcb *, const struct sockaddr *, void *); static int vxlan_input(struct vxlan_socket *, uint32_t, struct mbuf **, const struct sockaddr *); static void vxlan_set_default_config(struct vxlan_softc *); static int vxlan_set_user_config(struct vxlan_softc *, struct ifvxlanparam *); static int vxlan_clone_create(struct if_clone *, int, caddr_t); static void vxlan_clone_destroy(struct ifnet *); static uint32_t vxlan_mac_hash(struct vxlan_softc *, const uint8_t *); static void vxlan_fakeaddr(struct vxlan_softc *); static int vxlan_sockaddr_cmp(const union vxlan_sockaddr *, const struct sockaddr *); static void vxlan_sockaddr_copy(union vxlan_sockaddr *, const struct sockaddr *); static int vxlan_sockaddr_in_equal(const union vxlan_sockaddr *, const struct sockaddr *); static void vxlan_sockaddr_in_copy(union vxlan_sockaddr *, const struct sockaddr *); static int vxlan_sockaddr_supported(const union vxlan_sockaddr *, int); static int vxlan_sockaddr_in_any(const union vxlan_sockaddr *); static int vxlan_sockaddr_in_multicast(const union vxlan_sockaddr *); static int vxlan_can_change_config(struct vxlan_softc *); static int vxlan_check_vni(uint32_t); static int vxlan_check_ttl(int); static int vxlan_check_ftable_timeout(uint32_t); static int vxlan_check_ftable_max(uint32_t); static void vxlan_sysctl_setup(struct vxlan_softc *); static void vxlan_sysctl_destroy(struct vxlan_softc *); static int vxlan_tunable_int(struct vxlan_softc *, const char *, int); static void vxlan_ifdetach_event(void *, struct ifnet *); static void vxlan_load(void); static void vxlan_unload(void); static int vxlan_modevent(module_t, int, void *); static const char vxlan_name[] = "vxlan"; static MALLOC_DEFINE(M_VXLAN, vxlan_name, "Virtual eXtensible LAN Interface"); static struct if_clone *vxlan_cloner; static struct mtx vxlan_list_mtx; static LIST_HEAD(, vxlan_socket) vxlan_socket_list; static eventhandler_tag vxlan_ifdetach_event_tag; SYSCTL_DECL(_net_link); SYSCTL_NODE(_net_link, OID_AUTO, vxlan, CTLFLAG_RW, 0, "Virtual eXtensible Local Area Network"); static int vxlan_legacy_port = 0; TUNABLE_INT("net.link.vxlan.legacy_port", &vxlan_legacy_port); static int vxlan_reuse_port = 0; TUNABLE_INT("net.link.vxlan.reuse_port", &vxlan_reuse_port); /* Default maximum number of addresses in the forwarding table. */ #ifndef VXLAN_FTABLE_MAX #define VXLAN_FTABLE_MAX 2000 #endif /* Timeout (in seconds) of addresses learned in the forwarding table. */ #ifndef VXLAN_FTABLE_TIMEOUT #define VXLAN_FTABLE_TIMEOUT (20 * 60) #endif /* * Maximum timeout (in seconds) of addresses learned in the forwarding * table. */ #ifndef VXLAN_FTABLE_MAX_TIMEOUT #define VXLAN_FTABLE_MAX_TIMEOUT (60 * 60 * 24) #endif /* Number of seconds between pruning attempts of the forwarding table. */ #ifndef VXLAN_FTABLE_PRUNE #define VXLAN_FTABLE_PRUNE (5 * 60) #endif static int vxlan_ftable_prune_period = VXLAN_FTABLE_PRUNE; struct vxlan_control { int (*vxlc_func)(struct vxlan_softc *, void *); int vxlc_argsize; int vxlc_flags; #define VXLAN_CTRL_FLAG_COPYIN 0x01 #define VXLAN_CTRL_FLAG_COPYOUT 0x02 #define VXLAN_CTRL_FLAG_SUSER 0x04 }; static const struct vxlan_control vxlan_control_table[] = { [VXLAN_CMD_GET_CONFIG] = { vxlan_ctrl_get_config, sizeof(struct ifvxlancfg), VXLAN_CTRL_FLAG_COPYOUT }, [VXLAN_CMD_SET_VNI] = { vxlan_ctrl_set_vni, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_LOCAL_ADDR] = { vxlan_ctrl_set_local_addr, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_REMOTE_ADDR] = { vxlan_ctrl_set_remote_addr, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_LOCAL_PORT] = { vxlan_ctrl_set_local_port, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_REMOTE_PORT] = { vxlan_ctrl_set_remote_port, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_PORT_RANGE] = { vxlan_ctrl_set_port_range, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_FTABLE_TIMEOUT] = { vxlan_ctrl_set_ftable_timeout, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_FTABLE_MAX] = { vxlan_ctrl_set_ftable_max, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_MULTICAST_IF] = { vxlan_ctrl_set_multicast_if, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_TTL] = { vxlan_ctrl_set_ttl, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_SET_LEARN] = { vxlan_ctrl_set_learn, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_FTABLE_ENTRY_ADD] = { vxlan_ctrl_ftable_entry_add, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_FTABLE_ENTRY_REM] = { vxlan_ctrl_ftable_entry_rem, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, [VXLAN_CMD_FLUSH] = { vxlan_ctrl_flush, sizeof(struct ifvxlancmd), VXLAN_CTRL_FLAG_COPYIN | VXLAN_CTRL_FLAG_SUSER, }, }; static const int vxlan_control_table_size = nitems(vxlan_control_table); static int vxlan_ftable_addr_cmp(const uint8_t *a, const uint8_t *b) { int i, d; for (i = 0, d = 0; i < ETHER_ADDR_LEN && d == 0; i++) d = ((int)a[i]) - ((int)b[i]); return (d); } static void vxlan_ftable_init(struct vxlan_softc *sc) { int i; sc->vxl_ftable = malloc(sizeof(struct vxlan_ftable_head) * VXLAN_SC_FTABLE_SIZE, M_VXLAN, M_ZERO | M_WAITOK); for (i = 0; i < VXLAN_SC_FTABLE_SIZE; i++) LIST_INIT(&sc->vxl_ftable[i]); sc->vxl_ftable_hash_key = arc4random(); } static void vxlan_ftable_fini(struct vxlan_softc *sc) { int i; for (i = 0; i < VXLAN_SC_FTABLE_SIZE; i++) { KASSERT(LIST_EMPTY(&sc->vxl_ftable[i]), ("%s: vxlan %p ftable[%d] not empty", __func__, sc, i)); } MPASS(sc->vxl_ftable_cnt == 0); free(sc->vxl_ftable, M_VXLAN); sc->vxl_ftable = NULL; } static void vxlan_ftable_flush(struct vxlan_softc *sc, int all) { struct vxlan_ftable_entry *fe, *tfe; int i; for (i = 0; i < VXLAN_SC_FTABLE_SIZE; i++) { LIST_FOREACH_SAFE(fe, &sc->vxl_ftable[i], vxlfe_hash, tfe) { if (all || VXLAN_FE_IS_DYNAMIC(fe)) vxlan_ftable_entry_destroy(sc, fe); } } } static void vxlan_ftable_expire(struct vxlan_softc *sc) { struct vxlan_ftable_entry *fe, *tfe; int i; VXLAN_LOCK_WASSERT(sc); for (i = 0; i < VXLAN_SC_FTABLE_SIZE; i++) { LIST_FOREACH_SAFE(fe, &sc->vxl_ftable[i], vxlfe_hash, tfe) { if (VXLAN_FE_IS_DYNAMIC(fe) && time_uptime >= fe->vxlfe_expire) vxlan_ftable_entry_destroy(sc, fe); } } } static int vxlan_ftable_update_locked(struct vxlan_softc *sc, const struct sockaddr *sa, const uint8_t *mac, struct rm_priotracker *tracker) { union vxlan_sockaddr vxlsa; struct vxlan_ftable_entry *fe; int error; VXLAN_LOCK_ASSERT(sc); again: /* * A forwarding entry for this MAC address might already exist. If * so, update it, otherwise create a new one. We may have to upgrade * the lock if we have to change or create an entry. */ fe = vxlan_ftable_entry_lookup(sc, mac); if (fe != NULL) { fe->vxlfe_expire = time_uptime + sc->vxl_ftable_timeout; if (!VXLAN_FE_IS_DYNAMIC(fe) || vxlan_sockaddr_in_equal(&fe->vxlfe_raddr, sa)) return (0); if (!VXLAN_LOCK_WOWNED(sc)) { VXLAN_RUNLOCK(sc, tracker); VXLAN_WLOCK(sc); sc->vxl_stats.ftable_lock_upgrade_failed++; goto again; } vxlan_sockaddr_in_copy(&fe->vxlfe_raddr, sa); return (0); } if (!VXLAN_LOCK_WOWNED(sc)) { VXLAN_RUNLOCK(sc, tracker); VXLAN_WLOCK(sc); sc->vxl_stats.ftable_lock_upgrade_failed++; goto again; } if (sc->vxl_ftable_cnt >= sc->vxl_ftable_max) { sc->vxl_stats.ftable_nospace++; return (ENOSPC); } fe = vxlan_ftable_entry_alloc(); if (fe == NULL) return (ENOMEM); /* * The source port may be randomly select by the remove host, so * use the port of the default destination address. */ vxlan_sockaddr_copy(&vxlsa, sa); vxlsa.in4.sin_port = sc->vxl_dst_addr.in4.sin_port; vxlan_ftable_entry_init(sc, fe, mac, &vxlsa.sa, VXLAN_FE_FLAG_DYNAMIC); /* The prior lookup failed, so the insert should not. */ error = vxlan_ftable_entry_insert(sc, fe); MPASS(error == 0); return (0); } static int vxlan_ftable_update(struct vxlan_softc *sc, const struct sockaddr *sa, const uint8_t *mac) { struct rm_priotracker tracker; int error; VXLAN_RLOCK(sc, &tracker); error = vxlan_ftable_update_locked(sc, sa, mac, &tracker); VXLAN_UNLOCK(sc, &tracker); return (error); } static int vxlan_ftable_sysctl_dump(SYSCTL_HANDLER_ARGS) { struct rm_priotracker tracker; struct sbuf sb; struct vxlan_softc *sc; struct vxlan_ftable_entry *fe; size_t size; int i, error; /* * This is mostly intended for debugging during development. It is * not practical to dump an entire large table this way. */ sc = arg1; size = PAGE_SIZE; /* Calculate later. */ sbuf_new(&sb, NULL, size, SBUF_FIXEDLEN); sbuf_putc(&sb, '\n'); VXLAN_RLOCK(sc, &tracker); for (i = 0; i < VXLAN_SC_FTABLE_SIZE; i++) { LIST_FOREACH(fe, &sc->vxl_ftable[i], vxlfe_hash) { if (sbuf_error(&sb) != 0) break; vxlan_ftable_entry_dump(fe, &sb); } } VXLAN_RUNLOCK(sc, &tracker); if (sbuf_len(&sb) == 1) sbuf_setpos(&sb, 0); sbuf_finish(&sb); error = sysctl_handle_string(oidp, sbuf_data(&sb), sbuf_len(&sb), req); sbuf_delete(&sb); return (error); } static struct vxlan_ftable_entry * vxlan_ftable_entry_alloc(void) { struct vxlan_ftable_entry *fe; fe = malloc(sizeof(*fe), M_VXLAN, M_ZERO | M_NOWAIT); return (fe); } static void vxlan_ftable_entry_free(struct vxlan_ftable_entry *fe) { free(fe, M_VXLAN); } static void vxlan_ftable_entry_init(struct vxlan_softc *sc, struct vxlan_ftable_entry *fe, const uint8_t *mac, const struct sockaddr *sa, uint32_t flags) { fe->vxlfe_flags = flags; fe->vxlfe_expire = time_uptime + sc->vxl_ftable_timeout; memcpy(fe->vxlfe_mac, mac, ETHER_ADDR_LEN); vxlan_sockaddr_copy(&fe->vxlfe_raddr, sa); } static void vxlan_ftable_entry_destroy(struct vxlan_softc *sc, struct vxlan_ftable_entry *fe) { sc->vxl_ftable_cnt--; LIST_REMOVE(fe, vxlfe_hash); vxlan_ftable_entry_free(fe); } static int vxlan_ftable_entry_insert(struct vxlan_softc *sc, struct vxlan_ftable_entry *fe) { struct vxlan_ftable_entry *lfe; uint32_t hash; int dir; VXLAN_LOCK_WASSERT(sc); hash = VXLAN_SC_FTABLE_HASH(sc, fe->vxlfe_mac); lfe = LIST_FIRST(&sc->vxl_ftable[hash]); if (lfe == NULL) { LIST_INSERT_HEAD(&sc->vxl_ftable[hash], fe, vxlfe_hash); goto out; } do { dir = vxlan_ftable_addr_cmp(fe->vxlfe_mac, lfe->vxlfe_mac); if (dir == 0) return (EEXIST); if (dir > 0) { LIST_INSERT_BEFORE(lfe, fe, vxlfe_hash); goto out; } else if (LIST_NEXT(lfe, vxlfe_hash) == NULL) { LIST_INSERT_AFTER(lfe, fe, vxlfe_hash); goto out; } else lfe = LIST_NEXT(lfe, vxlfe_hash); } while (lfe != NULL); out: sc->vxl_ftable_cnt++; return (0); } static struct vxlan_ftable_entry * vxlan_ftable_entry_lookup(struct vxlan_softc *sc, const uint8_t *mac) { struct vxlan_ftable_entry *fe; uint32_t hash; int dir; VXLAN_LOCK_ASSERT(sc); hash = VXLAN_SC_FTABLE_HASH(sc, mac); LIST_FOREACH(fe, &sc->vxl_ftable[hash], vxlfe_hash) { dir = vxlan_ftable_addr_cmp(fe->vxlfe_mac, mac); if (dir == 0) return (fe); if (dir > 0) break; } return (NULL); } static void vxlan_ftable_entry_dump(struct vxlan_ftable_entry *fe, struct sbuf *sb) { char buf[64]; const union vxlan_sockaddr *sa; const void *addr; int i, len, af, width; sa = &fe->vxlfe_raddr; af = sa->sa.sa_family; len = sbuf_len(sb); sbuf_printf(sb, "%c 0x%02X ", VXLAN_FE_IS_DYNAMIC(fe) ? 'D' : 'S', fe->vxlfe_flags); for (i = 0; i < ETHER_ADDR_LEN - 1; i++) sbuf_printf(sb, "%02X:", fe->vxlfe_mac[i]); sbuf_printf(sb, "%02X ", fe->vxlfe_mac[i]); if (af == AF_INET) { addr = &sa->in4.sin_addr; width = INET_ADDRSTRLEN - 1; } else { addr = &sa->in6.sin6_addr; width = INET6_ADDRSTRLEN - 1; } inet_ntop(af, addr, buf, sizeof(buf)); sbuf_printf(sb, "%*s ", width, buf); sbuf_printf(sb, "%08jd", (intmax_t)fe->vxlfe_expire); sbuf_putc(sb, '\n'); /* Truncate a partial line. */ if (sbuf_error(sb) != 0) sbuf_setpos(sb, len); } static struct vxlan_socket * vxlan_socket_alloc(const union vxlan_sockaddr *sa) { struct vxlan_socket *vso; int i; vso = malloc(sizeof(*vso), M_VXLAN, M_WAITOK | M_ZERO); rm_init(&vso->vxlso_lock, "vxlansorm"); refcount_init(&vso->vxlso_refcnt, 0); for (i = 0; i < VXLAN_SO_VNI_HASH_SIZE; i++) LIST_INIT(&vso->vxlso_vni_hash[i]); vso->vxlso_laddr = *sa; return (vso); } static void vxlan_socket_destroy(struct vxlan_socket *vso) { struct socket *so; struct vxlan_socket_mc_info *mc; int i; for (i = 0; i < VXLAN_SO_MC_MAX_GROUPS; i++) { mc = &vso->vxlso_mc[i]; KASSERT(mc->vxlsomc_gaddr.sa.sa_family == AF_UNSPEC, ("%s: socket %p mc[%d] still has address", __func__, vso, i)); } for (i = 0; i < VXLAN_SO_VNI_HASH_SIZE; i++) { KASSERT(LIST_EMPTY(&vso->vxlso_vni_hash[i]), ("%s: socket %p vni_hash[%d] not empty", __func__, vso, i)); } so = vso->vxlso_sock; if (so != NULL) { vso->vxlso_sock = NULL; soclose(so); } rm_destroy(&vso->vxlso_lock); free(vso, M_VXLAN); } static void vxlan_socket_release(struct vxlan_socket *vso) { int destroy; mtx_lock(&vxlan_list_mtx); destroy = VXLAN_SO_RELEASE(vso); if (destroy != 0) LIST_REMOVE(vso, vxlso_entry); mtx_unlock(&vxlan_list_mtx); if (destroy != 0) vxlan_socket_destroy(vso); } static struct vxlan_socket * vxlan_socket_lookup(union vxlan_sockaddr *vxlsa) { struct vxlan_socket *vso; mtx_lock(&vxlan_list_mtx); LIST_FOREACH(vso, &vxlan_socket_list, vxlso_entry) { if (vxlan_sockaddr_cmp(&vso->vxlso_laddr, &vxlsa->sa) == 0) { VXLAN_SO_ACQUIRE(vso); break; } } mtx_unlock(&vxlan_list_mtx); return (vso); } static void vxlan_socket_insert(struct vxlan_socket *vso) { mtx_lock(&vxlan_list_mtx); VXLAN_SO_ACQUIRE(vso); LIST_INSERT_HEAD(&vxlan_socket_list, vso, vxlso_entry); mtx_unlock(&vxlan_list_mtx); } static int vxlan_socket_init(struct vxlan_socket *vso, struct ifnet *ifp) { struct thread *td; int error; td = curthread; error = socreate(vso->vxlso_laddr.sa.sa_family, &vso->vxlso_sock, SOCK_DGRAM, IPPROTO_UDP, td->td_ucred, td); if (error) { if_printf(ifp, "cannot create socket: %d\n", error); return (error); } error = udp_set_kernel_tunneling(vso->vxlso_sock, vxlan_rcv_udp_packet, NULL, vso); if (error) { if_printf(ifp, "cannot set tunneling function: %d\n", error); return (error); } if (vxlan_reuse_port != 0) { struct sockopt sopt; int val = 1; bzero(&sopt, sizeof(sopt)); sopt.sopt_dir = SOPT_SET; sopt.sopt_level = IPPROTO_IP; sopt.sopt_name = SO_REUSEPORT; sopt.sopt_val = &val; sopt.sopt_valsize = sizeof(val); error = sosetopt(vso->vxlso_sock, &sopt); if (error) { if_printf(ifp, "cannot set REUSEADDR socket opt: %d\n", error); return (error); } } return (0); } static int vxlan_socket_bind(struct vxlan_socket *vso, struct ifnet *ifp) { union vxlan_sockaddr laddr; struct thread *td; int error; td = curthread; laddr = vso->vxlso_laddr; error = sobind(vso->vxlso_sock, &laddr.sa, td); if (error) { if (error != EADDRINUSE) if_printf(ifp, "cannot bind socket: %d\n", error); return (error); } return (0); } static int vxlan_socket_create(struct ifnet *ifp, int multicast, const union vxlan_sockaddr *saddr, struct vxlan_socket **vsop) { union vxlan_sockaddr laddr; struct vxlan_socket *vso; int error; laddr = *saddr; /* * If this socket will be multicast, then only the local port * must be specified when binding. */ if (multicast != 0) { if (VXLAN_SOCKADDR_IS_IPV4(&laddr)) laddr.in4.sin_addr.s_addr = INADDR_ANY; #ifdef INET6 else laddr.in6.sin6_addr = in6addr_any; #endif } vso = vxlan_socket_alloc(&laddr); if (vso == NULL) return (ENOMEM); error = vxlan_socket_init(vso, ifp); if (error) goto fail; error = vxlan_socket_bind(vso, ifp); if (error) goto fail; /* * There is a small window between the bind completing and * inserting the socket, so that a concurrent create may fail. * Let's not worry about that for now. */ vxlan_socket_insert(vso); *vsop = vso; return (0); fail: vxlan_socket_destroy(vso); return (error); } static void vxlan_socket_ifdetach(struct vxlan_socket *vso, struct ifnet *ifp, struct vxlan_softc_head *list) { struct rm_priotracker tracker; struct vxlan_softc *sc; int i; VXLAN_SO_RLOCK(vso, &tracker); for (i = 0; i < VXLAN_SO_VNI_HASH_SIZE; i++) { LIST_FOREACH(sc, &vso->vxlso_vni_hash[i], vxl_entry) vxlan_ifdetach(sc, ifp, list); } VXLAN_SO_RUNLOCK(vso, &tracker); } static struct vxlan_socket * vxlan_socket_mc_lookup(const union vxlan_sockaddr *vxlsa) { struct vxlan_socket *vso; union vxlan_sockaddr laddr; laddr = *vxlsa; if (VXLAN_SOCKADDR_IS_IPV4(&laddr)) laddr.in4.sin_addr.s_addr = INADDR_ANY; #ifdef INET6 else laddr.in6.sin6_addr = in6addr_any; #endif vso = vxlan_socket_lookup(&laddr); return (vso); } static int vxlan_sockaddr_mc_info_match(const struct vxlan_socket_mc_info *mc, const union vxlan_sockaddr *group, const union vxlan_sockaddr *local, int ifidx) { if (!vxlan_sockaddr_in_any(local) && !vxlan_sockaddr_in_equal(&mc->vxlsomc_saddr, &local->sa)) return (0); if (!vxlan_sockaddr_in_equal(&mc->vxlsomc_gaddr, &group->sa)) return (0); if (ifidx != 0 && ifidx != mc->vxlsomc_ifidx) return (0); return (1); } static int vxlan_socket_mc_join_group(struct vxlan_socket *vso, const union vxlan_sockaddr *group, const union vxlan_sockaddr *local, int *ifidx, union vxlan_sockaddr *source) { struct sockopt sopt; int error; *source = *local; if (VXLAN_SOCKADDR_IS_IPV4(group)) { struct ip_mreq mreq; mreq.imr_multiaddr = group->in4.sin_addr; mreq.imr_interface = local->in4.sin_addr; bzero(&sopt, sizeof(sopt)); sopt.sopt_dir = SOPT_SET; sopt.sopt_level = IPPROTO_IP; sopt.sopt_name = IP_ADD_MEMBERSHIP; sopt.sopt_val = &mreq; sopt.sopt_valsize = sizeof(mreq); error = sosetopt(vso->vxlso_sock, &sopt); if (error) return (error); /* * BMV: Ideally, there would be a formal way for us to get * the local interface that was selected based on the * imr_interface address. We could then update *ifidx so * vxlan_sockaddr_mc_info_match() would return a match for * later creates that explicitly set the multicast interface. * * If we really need to, we can of course look in the INP's * membership list: * sotoinpcb(vso->vxlso_sock)->inp_moptions-> * imo_membership[]->inm_ifp * similarly to imo_match_group(). */ source->in4.sin_addr = local->in4.sin_addr; } else if (VXLAN_SOCKADDR_IS_IPV6(group)) { struct ipv6_mreq mreq; mreq.ipv6mr_multiaddr = group->in6.sin6_addr; mreq.ipv6mr_interface = *ifidx; bzero(&sopt, sizeof(sopt)); sopt.sopt_dir = SOPT_SET; sopt.sopt_level = IPPROTO_IPV6; sopt.sopt_name = IPV6_JOIN_GROUP; sopt.sopt_val = &mreq; sopt.sopt_valsize = sizeof(mreq); error = sosetopt(vso->vxlso_sock, &sopt); if (error) return (error); /* * BMV: As with IPv4, we would really like to know what * interface in6p_lookup_mcast_ifp() selected. */ } else error = EAFNOSUPPORT; return (error); } static int vxlan_socket_mc_leave_group(struct vxlan_socket *vso, const union vxlan_sockaddr *group, const union vxlan_sockaddr *source, int ifidx) { struct sockopt sopt; int error; bzero(&sopt, sizeof(sopt)); sopt.sopt_dir = SOPT_SET; if (VXLAN_SOCKADDR_IS_IPV4(group)) { struct ip_mreq mreq; mreq.imr_multiaddr = group->in4.sin_addr; mreq.imr_interface = source->in4.sin_addr; sopt.sopt_level = IPPROTO_IP; sopt.sopt_name = IP_DROP_MEMBERSHIP; sopt.sopt_val = &mreq; sopt.sopt_valsize = sizeof(mreq); error = sosetopt(vso->vxlso_sock, &sopt); } else if (VXLAN_SOCKADDR_IS_IPV6(group)) { struct ipv6_mreq mreq; mreq.ipv6mr_multiaddr = group->in6.sin6_addr; mreq.ipv6mr_interface = ifidx; sopt.sopt_level = IPPROTO_IPV6; sopt.sopt_name = IPV6_LEAVE_GROUP; sopt.sopt_val = &mreq; sopt.sopt_valsize = sizeof(mreq); error = sosetopt(vso->vxlso_sock, &sopt); } else error = EAFNOSUPPORT; return (error); } static int vxlan_socket_mc_add_group(struct vxlan_socket *vso, const union vxlan_sockaddr *group, const union vxlan_sockaddr *local, int ifidx, int *idx) { union vxlan_sockaddr source; struct vxlan_socket_mc_info *mc; int i, empty, error; /* * Within a socket, the same multicast group may be used by multiple * interfaces, each with a different network identifier. But a socket * may only join a multicast group once, so keep track of the users * here. */ VXLAN_SO_WLOCK(vso); for (empty = 0, i = 0; i < VXLAN_SO_MC_MAX_GROUPS; i++) { mc = &vso->vxlso_mc[i]; if (mc->vxlsomc_gaddr.sa.sa_family == AF_UNSPEC) { empty++; continue; } if (vxlan_sockaddr_mc_info_match(mc, group, local, ifidx)) goto out; } VXLAN_SO_WUNLOCK(vso); if (empty == 0) return (ENOSPC); error = vxlan_socket_mc_join_group(vso, group, local, &ifidx, &source); if (error) return (error); VXLAN_SO_WLOCK(vso); for (i = 0; i < VXLAN_SO_MC_MAX_GROUPS; i++) { mc = &vso->vxlso_mc[i]; if (mc->vxlsomc_gaddr.sa.sa_family == AF_UNSPEC) { vxlan_sockaddr_copy(&mc->vxlsomc_gaddr, &group->sa); vxlan_sockaddr_copy(&mc->vxlsomc_saddr, &source.sa); mc->vxlsomc_ifidx = ifidx; goto out; } } VXLAN_SO_WUNLOCK(vso); error = vxlan_socket_mc_leave_group(vso, group, &source, ifidx); MPASS(error == 0); return (ENOSPC); out: mc->vxlsomc_users++; VXLAN_SO_WUNLOCK(vso); *idx = i; return (0); } static void vxlan_socket_mc_release_group_by_idx(struct vxlan_socket *vso, int idx) { union vxlan_sockaddr group, source; struct vxlan_socket_mc_info *mc; int ifidx, leave; KASSERT(idx >= 0 && idx < VXLAN_SO_MC_MAX_GROUPS, ("%s: vso %p idx %d out of bounds", __func__, vso, idx)); leave = 0; mc = &vso->vxlso_mc[idx]; VXLAN_SO_WLOCK(vso); mc->vxlsomc_users--; if (mc->vxlsomc_users == 0) { group = mc->vxlsomc_gaddr; source = mc->vxlsomc_saddr; ifidx = mc->vxlsomc_ifidx; bzero(mc, sizeof(*mc)); leave = 1; } VXLAN_SO_WUNLOCK(vso); if (leave != 0) { /* * Our socket's membership in this group may have already * been removed if we joined through an interface that's * been detached. */ vxlan_socket_mc_leave_group(vso, &group, &source, ifidx); } } static struct vxlan_softc * vxlan_socket_lookup_softc_locked(struct vxlan_socket *vso, uint32_t vni) { struct vxlan_softc *sc; uint32_t hash; VXLAN_SO_LOCK_ASSERT(vso); hash = VXLAN_SO_VNI_HASH(vni); LIST_FOREACH(sc, &vso->vxlso_vni_hash[hash], vxl_entry) { if (sc->vxl_vni == vni) { VXLAN_ACQUIRE(sc); break; } } return (sc); } static struct vxlan_softc * vxlan_socket_lookup_softc(struct vxlan_socket *vso, uint32_t vni) { struct rm_priotracker tracker; struct vxlan_softc *sc; VXLAN_SO_RLOCK(vso, &tracker); sc = vxlan_socket_lookup_softc_locked(vso, vni); VXLAN_SO_RUNLOCK(vso, &tracker); return (sc); } static int vxlan_socket_insert_softc(struct vxlan_socket *vso, struct vxlan_softc *sc) { struct vxlan_softc *tsc; uint32_t vni, hash; vni = sc->vxl_vni; hash = VXLAN_SO_VNI_HASH(vni); VXLAN_SO_WLOCK(vso); tsc = vxlan_socket_lookup_softc_locked(vso, vni); if (tsc != NULL) { VXLAN_SO_WUNLOCK(vso); vxlan_release(tsc); return (EEXIST); } VXLAN_ACQUIRE(sc); LIST_INSERT_HEAD(&vso->vxlso_vni_hash[hash], sc, vxl_entry); VXLAN_SO_WUNLOCK(vso); return (0); } static void vxlan_socket_remove_softc(struct vxlan_socket *vso, struct vxlan_softc *sc) { VXLAN_SO_WLOCK(vso); LIST_REMOVE(sc, vxl_entry); VXLAN_SO_WUNLOCK(vso); vxlan_release(sc); } static struct ifnet * vxlan_multicast_if_ref(struct vxlan_softc *sc, int ipv4) { struct ifnet *ifp; VXLAN_LOCK_ASSERT(sc); if (ipv4 && sc->vxl_im4o != NULL) ifp = sc->vxl_im4o->imo_multicast_ifp; else if (!ipv4 && sc->vxl_im6o != NULL) ifp = sc->vxl_im6o->im6o_multicast_ifp; else ifp = NULL; if (ifp != NULL) if_ref(ifp); return (ifp); } static void vxlan_free_multicast(struct vxlan_softc *sc) { if (sc->vxl_mc_ifp != NULL) { if_rele(sc->vxl_mc_ifp); sc->vxl_mc_ifp = NULL; sc->vxl_mc_ifindex = 0; } if (sc->vxl_im4o != NULL) { free(sc->vxl_im4o, M_VXLAN); sc->vxl_im4o = NULL; } if (sc->vxl_im6o != NULL) { free(sc->vxl_im6o, M_VXLAN); sc->vxl_im6o = NULL; } } static int vxlan_setup_multicast_interface(struct vxlan_softc *sc) { struct ifnet *ifp; ifp = ifunit_ref(sc->vxl_mc_ifname); if (ifp == NULL) { if_printf(sc->vxl_ifp, "multicast interfaces %s does " "not exist\n", sc->vxl_mc_ifname); return (ENOENT); } if ((ifp->if_flags & IFF_MULTICAST) == 0) { if_printf(sc->vxl_ifp, "interface %s does not support " "multicast\n", sc->vxl_mc_ifname); if_rele(ifp); return (ENOTSUP); } sc->vxl_mc_ifp = ifp; sc->vxl_mc_ifindex = ifp->if_index; return (0); } static int vxlan_setup_multicast(struct vxlan_softc *sc) { const union vxlan_sockaddr *group; int error; group = &sc->vxl_dst_addr; error = 0; if (sc->vxl_mc_ifname[0] != '\0') { error = vxlan_setup_multicast_interface(sc); if (error) return (error); } /* * Initialize an multicast options structure that is sufficiently * populated for use in the respective IP output routine. This * structure is typically stored in the socket, but our sockets * may be shared among multiple interfaces. */ if (VXLAN_SOCKADDR_IS_IPV4(group)) { sc->vxl_im4o = malloc(sizeof(struct ip_moptions), M_VXLAN, M_ZERO | M_WAITOK); sc->vxl_im4o->imo_multicast_ifp = sc->vxl_mc_ifp; sc->vxl_im4o->imo_multicast_ttl = sc->vxl_ttl; sc->vxl_im4o->imo_multicast_vif = -1; } else if (VXLAN_SOCKADDR_IS_IPV6(group)) { sc->vxl_im6o = malloc(sizeof(struct ip6_moptions), M_VXLAN, M_ZERO | M_WAITOK); sc->vxl_im6o->im6o_multicast_ifp = sc->vxl_mc_ifp; sc->vxl_im6o->im6o_multicast_hlim = sc->vxl_ttl; } return (error); } static int vxlan_setup_socket(struct vxlan_softc *sc) { struct vxlan_socket *vso; struct ifnet *ifp; union vxlan_sockaddr *saddr, *daddr; int multicast, error; vso = NULL; ifp = sc->vxl_ifp; saddr = &sc->vxl_src_addr; daddr = &sc->vxl_dst_addr; multicast = vxlan_sockaddr_in_multicast(daddr); MPASS(multicast != -1); sc->vxl_vso_mc_index = -1; /* * Try to create the socket. If that fails, attempt to use an * existing socket. */ error = vxlan_socket_create(ifp, multicast, saddr, &vso); if (error) { if (multicast != 0) vso = vxlan_socket_mc_lookup(saddr); else vso = vxlan_socket_lookup(saddr); if (vso == NULL) { if_printf(ifp, "cannot create socket (error: %d), " "and no existing socket found\n", error); goto out; } } if (multicast != 0) { error = vxlan_setup_multicast(sc); if (error) goto out; error = vxlan_socket_mc_add_group(vso, daddr, saddr, sc->vxl_mc_ifindex, &sc->vxl_vso_mc_index); if (error) goto out; } sc->vxl_sock = vso; error = vxlan_socket_insert_softc(vso, sc); if (error) { sc->vxl_sock = NULL; if_printf(ifp, "network identifier %d already exists in " "this socket\n", sc->vxl_vni); goto out; } return (0); out: if (vso != NULL) { if (sc->vxl_vso_mc_index != -1) { vxlan_socket_mc_release_group_by_idx(vso, sc->vxl_vso_mc_index); sc->vxl_vso_mc_index = -1; } if (multicast != 0) vxlan_free_multicast(sc); vxlan_socket_release(vso); } return (error); } static void vxlan_setup_interface(struct vxlan_softc *sc) { struct ifnet *ifp; ifp = sc->vxl_ifp; ifp->if_hdrlen = ETHER_HDR_LEN + sizeof(struct vxlanudphdr); if (VXLAN_SOCKADDR_IS_IPV4(&sc->vxl_dst_addr) != 0) ifp->if_hdrlen += sizeof(struct ip); else if (VXLAN_SOCKADDR_IS_IPV6(&sc->vxl_dst_addr) != 0) ifp->if_hdrlen += sizeof(struct ip6_hdr); } static int vxlan_valid_init_config(struct vxlan_softc *sc) { const char *reason; if (vxlan_check_vni(sc->vxl_vni) != 0) { reason = "invalid virtual network identifier specified"; goto fail; } if (vxlan_sockaddr_supported(&sc->vxl_src_addr, 1) == 0) { reason = "source address type is not supported"; goto fail; } if (vxlan_sockaddr_supported(&sc->vxl_dst_addr, 0) == 0) { reason = "destination address type is not supported"; goto fail; } if (vxlan_sockaddr_in_any(&sc->vxl_dst_addr) != 0) { reason = "no valid destination address specified"; goto fail; } if (vxlan_sockaddr_in_multicast(&sc->vxl_dst_addr) == 0 && sc->vxl_mc_ifname[0] != '\0') { reason = "can only specify interface with a group address"; goto fail; } if (vxlan_sockaddr_in_any(&sc->vxl_src_addr) == 0) { if (VXLAN_SOCKADDR_IS_IPV4(&sc->vxl_src_addr) ^ VXLAN_SOCKADDR_IS_IPV4(&sc->vxl_dst_addr)) { reason = "source and destination address must both " "be either IPv4 or IPv6"; goto fail; } } if (sc->vxl_src_addr.in4.sin_port == 0) { reason = "local port not specified"; goto fail; } if (sc->vxl_dst_addr.in4.sin_port == 0) { reason = "remote port not specified"; goto fail; } return (0); fail: if_printf(sc->vxl_ifp, "cannot initialize interface: %s\n", reason); return (EINVAL); } static void vxlan_init_wait(struct vxlan_softc *sc) { VXLAN_LOCK_WASSERT(sc); while (sc->vxl_flags & VXLAN_FLAG_INIT) rm_sleep(sc, &sc->vxl_lock, 0, "vxlint", hz); } static void vxlan_init_complete(struct vxlan_softc *sc) { VXLAN_WLOCK(sc); sc->vxl_flags &= ~VXLAN_FLAG_INIT; wakeup(sc); VXLAN_WUNLOCK(sc); } static void vxlan_init(void *xsc) { static const uint8_t empty_mac[ETHER_ADDR_LEN]; struct vxlan_softc *sc; struct ifnet *ifp; sc = xsc; ifp = sc->vxl_ifp; VXLAN_WLOCK(sc); if (ifp->if_drv_flags & IFF_DRV_RUNNING) { VXLAN_WUNLOCK(sc); return; } sc->vxl_flags |= VXLAN_FLAG_INIT; VXLAN_WUNLOCK(sc); if (vxlan_valid_init_config(sc) != 0) goto out; vxlan_setup_interface(sc); if (vxlan_setup_socket(sc) != 0) goto out; /* Initialize the default forwarding entry. */ vxlan_ftable_entry_init(sc, &sc->vxl_default_fe, empty_mac, &sc->vxl_dst_addr.sa, VXLAN_FE_FLAG_STATIC); VXLAN_WLOCK(sc); ifp->if_drv_flags |= IFF_DRV_RUNNING; callout_reset(&sc->vxl_callout, vxlan_ftable_prune_period * hz, vxlan_timer, sc); VXLAN_WUNLOCK(sc); out: vxlan_init_complete(sc); } static void vxlan_release(struct vxlan_softc *sc) { /* * The softc may be destroyed as soon as we release our reference, * so we cannot serialize the wakeup with the softc lock. We use a * timeout in our sleeps so a missed wakeup is unfortunate but not * fatal. */ if (VXLAN_RELEASE(sc) != 0) wakeup(sc); } static void vxlan_teardown_wait(struct vxlan_softc *sc) { VXLAN_LOCK_WASSERT(sc); while (sc->vxl_flags & VXLAN_FLAG_TEARDOWN) rm_sleep(sc, &sc->vxl_lock, 0, "vxltrn", hz); } static void vxlan_teardown_complete(struct vxlan_softc *sc) { VXLAN_WLOCK(sc); sc->vxl_flags &= ~VXLAN_FLAG_TEARDOWN; wakeup(sc); VXLAN_WUNLOCK(sc); } static void vxlan_teardown_locked(struct vxlan_softc *sc) { struct ifnet *ifp; struct vxlan_socket *vso; ifp = sc->vxl_ifp; VXLAN_LOCK_WASSERT(sc); MPASS(sc->vxl_flags & VXLAN_FLAG_TEARDOWN); ifp->if_flags &= ~IFF_UP; ifp->if_drv_flags &= ~IFF_DRV_RUNNING; callout_stop(&sc->vxl_callout); vso = sc->vxl_sock; sc->vxl_sock = NULL; VXLAN_WUNLOCK(sc); if (vso != NULL) { vxlan_socket_remove_softc(vso, sc); if (sc->vxl_vso_mc_index != -1) { vxlan_socket_mc_release_group_by_idx(vso, sc->vxl_vso_mc_index); sc->vxl_vso_mc_index = -1; } } VXLAN_WLOCK(sc); while (sc->vxl_refcnt != 0) rm_sleep(sc, &sc->vxl_lock, 0, "vxldrn", hz); VXLAN_WUNLOCK(sc); callout_drain(&sc->vxl_callout); vxlan_free_multicast(sc); if (vso != NULL) vxlan_socket_release(vso); vxlan_teardown_complete(sc); } static void vxlan_teardown(struct vxlan_softc *sc) { VXLAN_WLOCK(sc); if (sc->vxl_flags & VXLAN_FLAG_TEARDOWN) { vxlan_teardown_wait(sc); VXLAN_WUNLOCK(sc); return; } sc->vxl_flags |= VXLAN_FLAG_TEARDOWN; vxlan_teardown_locked(sc); } static void vxlan_ifdetach(struct vxlan_softc *sc, struct ifnet *ifp, struct vxlan_softc_head *list) { VXLAN_WLOCK(sc); if (sc->vxl_mc_ifp != ifp) goto out; if (sc->vxl_flags & VXLAN_FLAG_TEARDOWN) goto out; sc->vxl_flags |= VXLAN_FLAG_TEARDOWN; LIST_INSERT_HEAD(list, sc, vxl_ifdetach_list); out: VXLAN_WUNLOCK(sc); } static void vxlan_timer(void *xsc) { struct vxlan_softc *sc; sc = xsc; VXLAN_LOCK_WASSERT(sc); vxlan_ftable_expire(sc); callout_schedule(&sc->vxl_callout, vxlan_ftable_prune_period * hz); } static int vxlan_ioctl_ifflags(struct vxlan_softc *sc) { struct ifnet *ifp; ifp = sc->vxl_ifp; if (ifp->if_flags & IFF_UP) { if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) vxlan_init(sc); } else { if (ifp->if_drv_flags & IFF_DRV_RUNNING) vxlan_teardown(sc); } return (0); } static int vxlan_ctrl_get_config(struct vxlan_softc *sc, void *arg) { struct rm_priotracker tracker; struct ifvxlancfg *cfg; cfg = arg; bzero(cfg, sizeof(*cfg)); VXLAN_RLOCK(sc, &tracker); cfg->vxlc_vni = sc->vxl_vni; memcpy(&cfg->vxlc_local_sa, &sc->vxl_src_addr, sizeof(union vxlan_sockaddr)); memcpy(&cfg->vxlc_remote_sa, &sc->vxl_dst_addr, sizeof(union vxlan_sockaddr)); cfg->vxlc_mc_ifindex = sc->vxl_mc_ifindex; cfg->vxlc_ftable_cnt = sc->vxl_ftable_cnt; cfg->vxlc_ftable_max = sc->vxl_ftable_max; cfg->vxlc_ftable_timeout = sc->vxl_ftable_timeout; cfg->vxlc_port_min = sc->vxl_min_port; cfg->vxlc_port_max = sc->vxl_max_port; cfg->vxlc_learn = (sc->vxl_flags & VXLAN_FLAG_LEARN) != 0; cfg->vxlc_ttl = sc->vxl_ttl; VXLAN_RUNLOCK(sc, &tracker); return (0); } static int vxlan_ctrl_set_vni(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; int error; cmd = arg; if (vxlan_check_vni(cmd->vxlcmd_vni) != 0) return (EINVAL); VXLAN_WLOCK(sc); if (vxlan_can_change_config(sc)) { sc->vxl_vni = cmd->vxlcmd_vni; error = 0; } else error = EBUSY; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_local_addr(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; union vxlan_sockaddr *vxlsa; int error; cmd = arg; vxlsa = &cmd->vxlcmd_sa; if (!VXLAN_SOCKADDR_IS_IPV46(vxlsa)) return (EINVAL); if (vxlan_sockaddr_in_multicast(vxlsa) != 0) return (EINVAL); VXLAN_WLOCK(sc); if (vxlan_can_change_config(sc)) { vxlan_sockaddr_in_copy(&sc->vxl_src_addr, &vxlsa->sa); error = 0; } else error = EBUSY; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_remote_addr(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; union vxlan_sockaddr *vxlsa; int error; cmd = arg; vxlsa = &cmd->vxlcmd_sa; if (!VXLAN_SOCKADDR_IS_IPV46(vxlsa)) return (EINVAL); VXLAN_WLOCK(sc); if (vxlan_can_change_config(sc)) { vxlan_sockaddr_in_copy(&sc->vxl_dst_addr, &vxlsa->sa); error = 0; } else error = EBUSY; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_local_port(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; int error; cmd = arg; if (cmd->vxlcmd_port == 0) return (EINVAL); VXLAN_WLOCK(sc); if (vxlan_can_change_config(sc)) { sc->vxl_src_addr.in4.sin_port = htons(cmd->vxlcmd_port); error = 0; } else error = EBUSY; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_remote_port(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; int error; cmd = arg; if (cmd->vxlcmd_port == 0) return (EINVAL); VXLAN_WLOCK(sc); if (vxlan_can_change_config(sc)) { sc->vxl_dst_addr.in4.sin_port = htons(cmd->vxlcmd_port); error = 0; } else error = EBUSY; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_port_range(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; uint16_t min, max; int error; cmd = arg; min = cmd->vxlcmd_port_min; max = cmd->vxlcmd_port_max; if (max < min) return (EINVAL); VXLAN_WLOCK(sc); if (vxlan_can_change_config(sc)) { sc->vxl_min_port = min; sc->vxl_max_port = max; error = 0; } else error = EBUSY; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_ftable_timeout(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; int error; cmd = arg; VXLAN_WLOCK(sc); if (vxlan_check_ftable_timeout(cmd->vxlcmd_ftable_timeout) == 0) { sc->vxl_ftable_timeout = cmd->vxlcmd_ftable_timeout; error = 0; } else error = EINVAL; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_ftable_max(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; int error; cmd = arg; VXLAN_WLOCK(sc); if (vxlan_check_ftable_max(cmd->vxlcmd_ftable_max) == 0) { sc->vxl_ftable_max = cmd->vxlcmd_ftable_max; error = 0; } else error = EINVAL; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_multicast_if(struct vxlan_softc * sc, void *arg) { struct ifvxlancmd *cmd; int error; cmd = arg; VXLAN_WLOCK(sc); if (vxlan_can_change_config(sc)) { strlcpy(sc->vxl_mc_ifname, cmd->vxlcmd_ifname, IFNAMSIZ); error = 0; } else error = EBUSY; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_ttl(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; int error; cmd = arg; VXLAN_WLOCK(sc); if (vxlan_check_ttl(cmd->vxlcmd_ttl) == 0) { sc->vxl_ttl = cmd->vxlcmd_ttl; if (sc->vxl_im4o != NULL) sc->vxl_im4o->imo_multicast_ttl = sc->vxl_ttl; if (sc->vxl_im6o != NULL) sc->vxl_im6o->im6o_multicast_hlim = sc->vxl_ttl; error = 0; } else error = EINVAL; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_set_learn(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; cmd = arg; VXLAN_WLOCK(sc); if (cmd->vxlcmd_flags & VXLAN_CMD_FLAG_LEARN) sc->vxl_flags |= VXLAN_FLAG_LEARN; else sc->vxl_flags &= ~VXLAN_FLAG_LEARN; VXLAN_WUNLOCK(sc); return (0); } static int vxlan_ctrl_ftable_entry_add(struct vxlan_softc *sc, void *arg) { union vxlan_sockaddr vxlsa; struct ifvxlancmd *cmd; struct vxlan_ftable_entry *fe; int error; cmd = arg; vxlsa = cmd->vxlcmd_sa; if (!VXLAN_SOCKADDR_IS_IPV46(&vxlsa)) return (EINVAL); if (vxlan_sockaddr_in_any(&vxlsa) != 0) return (EINVAL); if (vxlan_sockaddr_in_multicast(&vxlsa) != 0) return (EINVAL); /* BMV: We could support both IPv4 and IPv6 later. */ if (vxlsa.sa.sa_family != sc->vxl_dst_addr.sa.sa_family) return (EAFNOSUPPORT); fe = vxlan_ftable_entry_alloc(); if (fe == NULL) return (ENOMEM); if (vxlsa.in4.sin_port == 0) vxlsa.in4.sin_port = sc->vxl_dst_addr.in4.sin_port; vxlan_ftable_entry_init(sc, fe, cmd->vxlcmd_mac, &vxlsa.sa, VXLAN_FE_FLAG_STATIC); VXLAN_WLOCK(sc); error = vxlan_ftable_entry_insert(sc, fe); VXLAN_WUNLOCK(sc); if (error) vxlan_ftable_entry_free(fe); return (error); } static int vxlan_ctrl_ftable_entry_rem(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; struct vxlan_ftable_entry *fe; int error; cmd = arg; VXLAN_WLOCK(sc); fe = vxlan_ftable_entry_lookup(sc, cmd->vxlcmd_mac); if (fe != NULL) { vxlan_ftable_entry_destroy(sc, fe); error = 0; } else error = ENOENT; VXLAN_WUNLOCK(sc); return (error); } static int vxlan_ctrl_flush(struct vxlan_softc *sc, void *arg) { struct ifvxlancmd *cmd; int all; cmd = arg; all = cmd->vxlcmd_flags & VXLAN_CMD_FLAG_FLUSH_ALL; VXLAN_WLOCK(sc); vxlan_ftable_flush(sc, all); VXLAN_WUNLOCK(sc); return (0); } static int vxlan_ioctl_drvspec(struct vxlan_softc *sc, struct ifdrv *ifd, int get) { const struct vxlan_control *vc; union { struct ifvxlancfg cfg; struct ifvxlancmd cmd; } args; int out, error; if (ifd->ifd_cmd >= vxlan_control_table_size) return (EINVAL); bzero(&args, sizeof(args)); vc = &vxlan_control_table[ifd->ifd_cmd]; out = (vc->vxlc_flags & VXLAN_CTRL_FLAG_COPYOUT) != 0; if ((get != 0 && out == 0) || (get == 0 && out != 0)) return (EINVAL); if (vc->vxlc_flags & VXLAN_CTRL_FLAG_SUSER) { error = priv_check(curthread, PRIV_NET_VXLAN); if (error) return (error); } if (ifd->ifd_len != vc->vxlc_argsize || ifd->ifd_len > sizeof(args)) return (EINVAL); if (vc->vxlc_flags & VXLAN_CTRL_FLAG_COPYIN) { error = copyin(ifd->ifd_data, &args, ifd->ifd_len); if (error) return (error); } error = vc->vxlc_func(sc, &args); if (error) return (error); if (vc->vxlc_flags & VXLAN_CTRL_FLAG_COPYOUT) { error = copyout(&args, ifd->ifd_data, ifd->ifd_len); if (error) return (error); } return (0); } static int vxlan_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data) { struct vxlan_softc *sc; struct ifreq *ifr; struct ifdrv *ifd; int error; sc = ifp->if_softc; ifr = (struct ifreq *) data; ifd = (struct ifdrv *) data; switch (cmd) { case SIOCADDMULTI: case SIOCDELMULTI: error = 0; break; case SIOCGDRVSPEC: case SIOCSDRVSPEC: error = vxlan_ioctl_drvspec(sc, ifd, cmd == SIOCGDRVSPEC); break; case SIOCSIFFLAGS: error = vxlan_ioctl_ifflags(sc); break; default: error = ether_ioctl(ifp, cmd, data); break; } return (error); } #if defined(INET) || defined(INET6) static uint16_t vxlan_pick_source_port(struct vxlan_softc *sc, struct mbuf *m) { int range; uint32_t hash; range = sc->vxl_max_port - sc->vxl_min_port + 1; /* check if flowid is set and not opaque */ - if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE && - M_HASHTYPE_GET(m) != M_HASHTYPE_OPAQUE) + if (M_HASHTYPE_ISHASH(m)) hash = m->m_pkthdr.flowid; else hash = jenkins_hash(m->m_data, ETHER_HDR_LEN, sc->vxl_port_hash_key); return (sc->vxl_min_port + (hash % range)); } static void vxlan_encap_header(struct vxlan_softc *sc, struct mbuf *m, int ipoff, uint16_t srcport, uint16_t dstport) { struct vxlanudphdr *hdr; struct udphdr *udph; struct vxlan_header *vxh; int len; len = m->m_pkthdr.len - ipoff; MPASS(len >= sizeof(struct vxlanudphdr)); hdr = mtodo(m, ipoff); udph = &hdr->vxlh_udp; udph->uh_sport = srcport; udph->uh_dport = dstport; udph->uh_ulen = htons(len); udph->uh_sum = 0; vxh = &hdr->vxlh_hdr; vxh->vxlh_flags = htonl(VXLAN_HDR_FLAGS_VALID_VNI); vxh->vxlh_vni = htonl(sc->vxl_vni << VXLAN_HDR_VNI_SHIFT); } #endif static int vxlan_encap4(struct vxlan_softc *sc, const union vxlan_sockaddr *fvxlsa, struct mbuf *m) { #ifdef INET struct ifnet *ifp; struct ip *ip; struct in_addr srcaddr, dstaddr; uint16_t srcport, dstport; int len, mcast, error; ifp = sc->vxl_ifp; srcaddr = sc->vxl_src_addr.in4.sin_addr; srcport = vxlan_pick_source_port(sc, m); dstaddr = fvxlsa->in4.sin_addr; dstport = fvxlsa->in4.sin_port; M_PREPEND(m, sizeof(struct ip) + sizeof(struct vxlanudphdr), M_NOWAIT); if (m == NULL) { if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); return (ENOBUFS); } len = m->m_pkthdr.len; ip = mtod(m, struct ip *); ip->ip_tos = 0; ip->ip_len = htons(len); ip->ip_off = 0; ip->ip_ttl = sc->vxl_ttl; ip->ip_p = IPPROTO_UDP; ip->ip_sum = 0; ip->ip_src = srcaddr; ip->ip_dst = dstaddr; vxlan_encap_header(sc, m, sizeof(struct ip), srcport, dstport); mcast = (m->m_flags & (M_MCAST | M_BCAST)) ? 1 : 0; m->m_flags &= ~(M_MCAST | M_BCAST); error = ip_output(m, NULL, NULL, 0, sc->vxl_im4o, NULL); if (error == 0) { if_inc_counter(ifp, IFCOUNTER_OPACKETS, 1); if_inc_counter(ifp, IFCOUNTER_OBYTES, len); if (mcast != 0) if_inc_counter(ifp, IFCOUNTER_OMCASTS, 1); } else if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); return (error); #else m_freem(m); return (ENOTSUP); #endif } static int vxlan_encap6(struct vxlan_softc *sc, const union vxlan_sockaddr *fvxlsa, struct mbuf *m) { #ifdef INET6 struct ifnet *ifp; struct ip6_hdr *ip6; const struct in6_addr *srcaddr, *dstaddr; uint16_t srcport, dstport; int len, mcast, error; ifp = sc->vxl_ifp; srcaddr = &sc->vxl_src_addr.in6.sin6_addr; srcport = vxlan_pick_source_port(sc, m); dstaddr = &fvxlsa->in6.sin6_addr; dstport = fvxlsa->in6.sin6_port; M_PREPEND(m, sizeof(struct ip6_hdr) + sizeof(struct vxlanudphdr), M_NOWAIT); if (m == NULL) { if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); return (ENOBUFS); } len = m->m_pkthdr.len; ip6 = mtod(m, struct ip6_hdr *); ip6->ip6_flow = 0; /* BMV: Keep in forwarding entry? */ ip6->ip6_vfc = IPV6_VERSION; ip6->ip6_plen = 0; ip6->ip6_nxt = IPPROTO_UDP; ip6->ip6_hlim = sc->vxl_ttl; ip6->ip6_src = *srcaddr; ip6->ip6_dst = *dstaddr; vxlan_encap_header(sc, m, sizeof(struct ip6_hdr), srcport, dstport); /* * XXX BMV We need support for RFC6935 before we can send and * receive IPv6 UDP packets with a zero checksum. */ { struct udphdr *hdr = mtodo(m, sizeof(struct ip6_hdr)); hdr->uh_sum = in6_cksum_pseudo(ip6, m->m_pkthdr.len - sizeof(struct ip6_hdr), IPPROTO_UDP, 0); m->m_pkthdr.csum_flags = CSUM_UDP_IPV6; m->m_pkthdr.csum_data = offsetof(struct udphdr, uh_sum); } mcast = (m->m_flags & (M_MCAST | M_BCAST)) ? 1 : 0; m->m_flags &= ~(M_MCAST | M_BCAST); error = ip6_output(m, NULL, NULL, 0, sc->vxl_im6o, NULL, NULL); if (error == 0) { if_inc_counter(ifp, IFCOUNTER_OPACKETS, 1); if_inc_counter(ifp, IFCOUNTER_OBYTES, len); if (mcast != 0) if_inc_counter(ifp, IFCOUNTER_OMCASTS, 1); } else if_inc_counter(ifp, IFCOUNTER_OERRORS, 1); return (error); #else m_freem(m); return (ENOTSUP); #endif } static int vxlan_transmit(struct ifnet *ifp, struct mbuf *m) { struct rm_priotracker tracker; union vxlan_sockaddr vxlsa; struct vxlan_softc *sc; struct vxlan_ftable_entry *fe; struct ifnet *mcifp; struct ether_header *eh; int ipv4, error; sc = ifp->if_softc; eh = mtod(m, struct ether_header *); fe = NULL; mcifp = NULL; ETHER_BPF_MTAP(ifp, m); VXLAN_RLOCK(sc, &tracker); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) { VXLAN_RUNLOCK(sc, &tracker); m_freem(m); return (ENETDOWN); } if ((m->m_flags & (M_BCAST | M_MCAST)) == 0) fe = vxlan_ftable_entry_lookup(sc, eh->ether_dhost); if (fe == NULL) fe = &sc->vxl_default_fe; vxlan_sockaddr_copy(&vxlsa, &fe->vxlfe_raddr.sa); ipv4 = VXLAN_SOCKADDR_IS_IPV4(&vxlsa) != 0; if (vxlan_sockaddr_in_multicast(&vxlsa) != 0) mcifp = vxlan_multicast_if_ref(sc, ipv4); VXLAN_ACQUIRE(sc); VXLAN_RUNLOCK(sc, &tracker); if (ipv4 != 0) error = vxlan_encap4(sc, &vxlsa, m); else error = vxlan_encap6(sc, &vxlsa, m); vxlan_release(sc); if (mcifp != NULL) if_rele(mcifp); return (error); } static void vxlan_qflush(struct ifnet *ifp __unused) { } static void vxlan_rcv_udp_packet(struct mbuf *m, int offset, struct inpcb *inpcb, const struct sockaddr *srcsa, void *xvso) { struct vxlan_socket *vso; struct vxlan_header *vxh, vxlanhdr; uint32_t vni; int error; M_ASSERTPKTHDR(m); vso = xvso; offset += sizeof(struct udphdr); if (m->m_pkthdr.len < offset + sizeof(struct vxlan_header)) goto out; if (__predict_false(m->m_len < offset + sizeof(struct vxlan_header))) { m_copydata(m, offset, sizeof(struct vxlan_header), (caddr_t) &vxlanhdr); vxh = &vxlanhdr; } else vxh = mtodo(m, offset); /* * Drop if there is a reserved bit set in either the flags or VNI * fields of the header. This goes against the specification, but * a bit set may indicate an unsupported new feature. This matches * the behavior of the Linux implementation. */ if (vxh->vxlh_flags != htonl(VXLAN_HDR_FLAGS_VALID_VNI) || vxh->vxlh_vni & ~htonl(VXLAN_VNI_MASK)) goto out; vni = ntohl(vxh->vxlh_vni) >> VXLAN_HDR_VNI_SHIFT; /* Adjust to the start of the inner Ethernet frame. */ m_adj(m, offset + sizeof(struct vxlan_header)); error = vxlan_input(vso, vni, &m, srcsa); MPASS(error != 0 || m == NULL); out: if (m != NULL) m_freem(m); } static int vxlan_input(struct vxlan_socket *vso, uint32_t vni, struct mbuf **m0, const struct sockaddr *sa) { struct vxlan_softc *sc; struct ifnet *ifp; struct mbuf *m; struct ether_header *eh; int error; sc = vxlan_socket_lookup_softc(vso, vni); if (sc == NULL) return (ENOENT); ifp = sc->vxl_ifp; m = *m0; eh = mtod(m, struct ether_header *); if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) { error = ENETDOWN; goto out; } else if (ifp == m->m_pkthdr.rcvif) { /* XXX Does not catch more complex loops. */ error = EDEADLK; goto out; } if (sc->vxl_flags & VXLAN_FLAG_LEARN) vxlan_ftable_update(sc, sa, eh->ether_shost); m_clrprotoflags(m); m->m_pkthdr.rcvif = ifp; M_SETFIB(m, ifp->if_fib); error = netisr_queue_src(NETISR_ETHER, 0, m); *m0 = NULL; out: vxlan_release(sc); return (error); } static void vxlan_set_default_config(struct vxlan_softc *sc) { sc->vxl_flags |= VXLAN_FLAG_LEARN; sc->vxl_vni = VXLAN_VNI_MAX; sc->vxl_ttl = IPDEFTTL; if (!vxlan_tunable_int(sc, "legacy_port", vxlan_legacy_port)) { sc->vxl_src_addr.in4.sin_port = htons(VXLAN_PORT); sc->vxl_dst_addr.in4.sin_port = htons(VXLAN_PORT); } else { sc->vxl_src_addr.in4.sin_port = htons(VXLAN_LEGACY_PORT); sc->vxl_dst_addr.in4.sin_port = htons(VXLAN_LEGACY_PORT); } sc->vxl_min_port = V_ipport_firstauto; sc->vxl_max_port = V_ipport_lastauto; sc->vxl_ftable_max = VXLAN_FTABLE_MAX; sc->vxl_ftable_timeout = VXLAN_FTABLE_TIMEOUT; } static int vxlan_set_user_config(struct vxlan_softc *sc, struct ifvxlanparam *vxlp) { #ifndef INET if (vxlp->vxlp_with & (VXLAN_PARAM_WITH_LOCAL_ADDR4 | VXLAN_PARAM_WITH_REMOTE_ADDR4)) return (EAFNOSUPPORT); #endif #ifndef INET6 if (vxlp->vxlp_with & (VXLAN_PARAM_WITH_LOCAL_ADDR6 | VXLAN_PARAM_WITH_REMOTE_ADDR6)) return (EAFNOSUPPORT); #endif if (vxlp->vxlp_with & VXLAN_PARAM_WITH_VNI) { if (vxlan_check_vni(vxlp->vxlp_vni) == 0) sc->vxl_vni = vxlp->vxlp_vni; } if (vxlp->vxlp_with & VXLAN_PARAM_WITH_LOCAL_ADDR4) { sc->vxl_src_addr.in4.sin_len = sizeof(struct sockaddr_in); sc->vxl_src_addr.in4.sin_family = AF_INET; sc->vxl_src_addr.in4.sin_addr = vxlp->vxlp_local_in4; } else if (vxlp->vxlp_with & VXLAN_PARAM_WITH_LOCAL_ADDR6) { sc->vxl_src_addr.in6.sin6_len = sizeof(struct sockaddr_in6); sc->vxl_src_addr.in6.sin6_family = AF_INET6; sc->vxl_src_addr.in6.sin6_addr = vxlp->vxlp_local_in6; } if (vxlp->vxlp_with & VXLAN_PARAM_WITH_REMOTE_ADDR4) { sc->vxl_dst_addr.in4.sin_len = sizeof(struct sockaddr_in); sc->vxl_dst_addr.in4.sin_family = AF_INET; sc->vxl_dst_addr.in4.sin_addr = vxlp->vxlp_remote_in4; } else if (vxlp->vxlp_with & VXLAN_PARAM_WITH_REMOTE_ADDR6) { sc->vxl_dst_addr.in6.sin6_len = sizeof(struct sockaddr_in6); sc->vxl_dst_addr.in6.sin6_family = AF_INET6; sc->vxl_dst_addr.in6.sin6_addr = vxlp->vxlp_remote_in6; } if (vxlp->vxlp_with & VXLAN_PARAM_WITH_LOCAL_PORT) sc->vxl_src_addr.in4.sin_port = htons(vxlp->vxlp_local_port); if (vxlp->vxlp_with & VXLAN_PARAM_WITH_REMOTE_PORT) sc->vxl_dst_addr.in4.sin_port = htons(vxlp->vxlp_remote_port); if (vxlp->vxlp_with & VXLAN_PARAM_WITH_PORT_RANGE) { if (vxlp->vxlp_min_port <= vxlp->vxlp_max_port) { sc->vxl_min_port = vxlp->vxlp_min_port; sc->vxl_max_port = vxlp->vxlp_max_port; } } if (vxlp->vxlp_with & VXLAN_PARAM_WITH_MULTICAST_IF) strlcpy(sc->vxl_mc_ifname, vxlp->vxlp_mc_ifname, IFNAMSIZ); if (vxlp->vxlp_with & VXLAN_PARAM_WITH_FTABLE_TIMEOUT) { if (vxlan_check_ftable_timeout(vxlp->vxlp_ftable_timeout) == 0) sc->vxl_ftable_timeout = vxlp->vxlp_ftable_timeout; } if (vxlp->vxlp_with & VXLAN_PARAM_WITH_FTABLE_MAX) { if (vxlan_check_ftable_max(vxlp->vxlp_ftable_max) == 0) sc->vxl_ftable_max = vxlp->vxlp_ftable_max; } if (vxlp->vxlp_with & VXLAN_PARAM_WITH_TTL) { if (vxlan_check_ttl(vxlp->vxlp_ttl) == 0) sc->vxl_ttl = vxlp->vxlp_ttl; } if (vxlp->vxlp_with & VXLAN_PARAM_WITH_LEARN) { if (vxlp->vxlp_learn == 0) sc->vxl_flags &= ~VXLAN_FLAG_LEARN; } return (0); } static int vxlan_clone_create(struct if_clone *ifc, int unit, caddr_t params) { struct vxlan_softc *sc; struct ifnet *ifp; struct ifvxlanparam vxlp; int error; sc = malloc(sizeof(struct vxlan_softc), M_VXLAN, M_WAITOK | M_ZERO); sc->vxl_unit = unit; vxlan_set_default_config(sc); if (params != 0) { error = copyin(params, &vxlp, sizeof(vxlp)); if (error) goto fail; error = vxlan_set_user_config(sc, &vxlp); if (error) goto fail; } ifp = if_alloc(IFT_ETHER); if (ifp == NULL) { error = ENOSPC; goto fail; } sc->vxl_ifp = ifp; rm_init(&sc->vxl_lock, "vxlanrm"); callout_init_rw(&sc->vxl_callout, &sc->vxl_lock, 0); sc->vxl_port_hash_key = arc4random(); vxlan_ftable_init(sc); vxlan_sysctl_setup(sc); ifp->if_softc = sc; if_initname(ifp, vxlan_name, unit); ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST; ifp->if_init = vxlan_init; ifp->if_ioctl = vxlan_ioctl; ifp->if_transmit = vxlan_transmit; ifp->if_qflush = vxlan_qflush; vxlan_fakeaddr(sc); ether_ifattach(ifp, sc->vxl_hwaddr); ifp->if_baudrate = 0; ifp->if_hdrlen = 0; return (0); fail: free(sc, M_VXLAN); return (error); } static void vxlan_clone_destroy(struct ifnet *ifp) { struct vxlan_softc *sc; sc = ifp->if_softc; vxlan_teardown(sc); vxlan_ftable_flush(sc, 1); ether_ifdetach(ifp); if_free(ifp); vxlan_ftable_fini(sc); vxlan_sysctl_destroy(sc); rm_destroy(&sc->vxl_lock); free(sc, M_VXLAN); } /* BMV: Taken from if_bridge. */ static uint32_t vxlan_mac_hash(struct vxlan_softc *sc, const uint8_t *addr) { uint32_t a = 0x9e3779b9, b = 0x9e3779b9, c = sc->vxl_ftable_hash_key; b += addr[5] << 8; b += addr[4]; a += addr[3] << 24; a += addr[2] << 16; a += addr[1] << 8; a += addr[0]; /* * The following hash function is adapted from "Hash Functions" by Bob Jenkins * ("Algorithm Alley", Dr. Dobbs Journal, September 1997). */ #define mix(a, b, c) \ do { \ a -= b; a -= c; a ^= (c >> 13); \ b -= c; b -= a; b ^= (a << 8); \ c -= a; c -= b; c ^= (b >> 13); \ a -= b; a -= c; a ^= (c >> 12); \ b -= c; b -= a; b ^= (a << 16); \ c -= a; c -= b; c ^= (b >> 5); \ a -= b; a -= c; a ^= (c >> 3); \ b -= c; b -= a; b ^= (a << 10); \ c -= a; c -= b; c ^= (b >> 15); \ } while (0) mix(a, b, c); #undef mix return (c); } static void vxlan_fakeaddr(struct vxlan_softc *sc) { /* * Generate a non-multicast, locally administered address. * * BMV: Should we use the FreeBSD OUI range instead? */ arc4rand(sc->vxl_hwaddr, ETHER_ADDR_LEN, 1); sc->vxl_hwaddr[0] &= ~1; sc->vxl_hwaddr[0] |= 2; } static int vxlan_sockaddr_cmp(const union vxlan_sockaddr *vxladdr, const struct sockaddr *sa) { return (bcmp(&vxladdr->sa, sa, vxladdr->sa.sa_len)); } static void vxlan_sockaddr_copy(union vxlan_sockaddr *vxladdr, const struct sockaddr *sa) { MPASS(sa->sa_family == AF_INET || sa->sa_family == AF_INET6); bzero(vxladdr, sizeof(*vxladdr)); if (sa->sa_family == AF_INET) { vxladdr->in4 = *satoconstsin(sa); vxladdr->in4.sin_len = sizeof(struct sockaddr_in); } else if (sa->sa_family == AF_INET6) { vxladdr->in6 = *satoconstsin6(sa); vxladdr->in6.sin6_len = sizeof(struct sockaddr_in6); } } static int vxlan_sockaddr_in_equal(const union vxlan_sockaddr *vxladdr, const struct sockaddr *sa) { int equal; if (sa->sa_family == AF_INET) { const struct in_addr *in4 = &satoconstsin(sa)->sin_addr; equal = in4->s_addr == vxladdr->in4.sin_addr.s_addr; } else if (sa->sa_family == AF_INET6) { const struct in6_addr *in6 = &satoconstsin6(sa)->sin6_addr; equal = IN6_ARE_ADDR_EQUAL(in6, &vxladdr->in6.sin6_addr); } else equal = 0; return (equal); } static void vxlan_sockaddr_in_copy(union vxlan_sockaddr *vxladdr, const struct sockaddr *sa) { MPASS(sa->sa_family == AF_INET || sa->sa_family == AF_INET6); if (sa->sa_family == AF_INET) { const struct in_addr *in4 = &satoconstsin(sa)->sin_addr; vxladdr->in4.sin_family = AF_INET; vxladdr->in4.sin_len = sizeof(struct sockaddr_in); vxladdr->in4.sin_addr = *in4; } else if (sa->sa_family == AF_INET6) { const struct in6_addr *in6 = &satoconstsin6(sa)->sin6_addr; vxladdr->in6.sin6_family = AF_INET6; vxladdr->in6.sin6_len = sizeof(struct sockaddr_in6); vxladdr->in6.sin6_addr = *in6; } } static int vxlan_sockaddr_supported(const union vxlan_sockaddr *vxladdr, int unspec) { const struct sockaddr *sa; int supported; sa = &vxladdr->sa; supported = 0; if (sa->sa_family == AF_UNSPEC && unspec != 0) { supported = 1; } else if (sa->sa_family == AF_INET) { #ifdef INET supported = 1; #endif } else if (sa->sa_family == AF_INET6) { #ifdef INET6 supported = 1; #endif } return (supported); } static int vxlan_sockaddr_in_any(const union vxlan_sockaddr *vxladdr) { const struct sockaddr *sa; int any; sa = &vxladdr->sa; if (sa->sa_family == AF_INET) { const struct in_addr *in4 = &satoconstsin(sa)->sin_addr; any = in4->s_addr == INADDR_ANY; } else if (sa->sa_family == AF_INET6) { const struct in6_addr *in6 = &satoconstsin6(sa)->sin6_addr; any = IN6_IS_ADDR_UNSPECIFIED(in6); } else any = -1; return (any); } static int vxlan_sockaddr_in_multicast(const union vxlan_sockaddr *vxladdr) { const struct sockaddr *sa; int mc; sa = &vxladdr->sa; if (sa->sa_family == AF_INET) { const struct in_addr *in4 = &satoconstsin(sa)->sin_addr; mc = IN_MULTICAST(ntohl(in4->s_addr)); } else if (sa->sa_family == AF_INET6) { const struct in6_addr *in6 = &satoconstsin6(sa)->sin6_addr; mc = IN6_IS_ADDR_MULTICAST(in6); } else mc = -1; return (mc); } static int vxlan_can_change_config(struct vxlan_softc *sc) { struct ifnet *ifp; ifp = sc->vxl_ifp; VXLAN_LOCK_ASSERT(sc); if (ifp->if_drv_flags & IFF_DRV_RUNNING) return (0); if (sc->vxl_flags & (VXLAN_FLAG_INIT | VXLAN_FLAG_TEARDOWN)) return (0); return (1); } static int vxlan_check_vni(uint32_t vni) { return (vni >= VXLAN_VNI_MAX); } static int vxlan_check_ttl(int ttl) { return (ttl > MAXTTL); } static int vxlan_check_ftable_timeout(uint32_t timeout) { return (timeout > VXLAN_FTABLE_MAX_TIMEOUT); } static int vxlan_check_ftable_max(uint32_t max) { return (max > VXLAN_FTABLE_MAX); } static void vxlan_sysctl_setup(struct vxlan_softc *sc) { struct sysctl_ctx_list *ctx; struct sysctl_oid *node; struct vxlan_statistics *stats; char namebuf[8]; ctx = &sc->vxl_sysctl_ctx; stats = &sc->vxl_stats; snprintf(namebuf, sizeof(namebuf), "%d", sc->vxl_unit); sysctl_ctx_init(ctx); sc->vxl_sysctl_node = SYSCTL_ADD_NODE(ctx, SYSCTL_STATIC_CHILDREN(_net_link_vxlan), OID_AUTO, namebuf, CTLFLAG_RD, NULL, ""); node = SYSCTL_ADD_NODE(ctx, SYSCTL_CHILDREN(sc->vxl_sysctl_node), OID_AUTO, "ftable", CTLFLAG_RD, NULL, ""); SYSCTL_ADD_UINT(ctx, SYSCTL_CHILDREN(node), OID_AUTO, "count", CTLFLAG_RD, &sc->vxl_ftable_cnt, 0, "Number of entries in fowarding table"); SYSCTL_ADD_UINT(ctx, SYSCTL_CHILDREN(node), OID_AUTO, "max", CTLFLAG_RD, &sc->vxl_ftable_max, 0, "Maximum number of entries allowed in fowarding table"); SYSCTL_ADD_UINT(ctx, SYSCTL_CHILDREN(node), OID_AUTO, "timeout", CTLFLAG_RD, &sc->vxl_ftable_timeout, 0, "Number of seconds between prunes of the forwarding table"); SYSCTL_ADD_PROC(ctx, SYSCTL_CHILDREN(node), OID_AUTO, "dump", CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE | CTLFLAG_SKIP, sc, 0, vxlan_ftable_sysctl_dump, "A", "Dump the forwarding table entries"); node = SYSCTL_ADD_NODE(ctx, SYSCTL_CHILDREN(sc->vxl_sysctl_node), OID_AUTO, "stats", CTLFLAG_RD, NULL, ""); SYSCTL_ADD_UINT(ctx, SYSCTL_CHILDREN(node), OID_AUTO, "ftable_nospace", CTLFLAG_RD, &stats->ftable_nospace, 0, "Fowarding table reached maximum entries"); SYSCTL_ADD_UINT(ctx, SYSCTL_CHILDREN(node), OID_AUTO, "ftable_lock_upgrade_failed", CTLFLAG_RD, &stats->ftable_lock_upgrade_failed, 0, "Forwarding table update required lock upgrade"); } static void vxlan_sysctl_destroy(struct vxlan_softc *sc) { sysctl_ctx_free(&sc->vxl_sysctl_ctx); sc->vxl_sysctl_node = NULL; } static int vxlan_tunable_int(struct vxlan_softc *sc, const char *knob, int def) { char path[64]; snprintf(path, sizeof(path), "net.link.vxlan.%d.%s", sc->vxl_unit, knob); TUNABLE_INT_FETCH(path, &def); return (def); } static void vxlan_ifdetach_event(void *arg __unused, struct ifnet *ifp) { struct vxlan_softc_head list; struct vxlan_socket *vso; struct vxlan_softc *sc, *tsc; LIST_INIT(&list); if (ifp->if_flags & IFF_RENAMING) return; if ((ifp->if_flags & IFF_MULTICAST) == 0) return; mtx_lock(&vxlan_list_mtx); LIST_FOREACH(vso, &vxlan_socket_list, vxlso_entry) vxlan_socket_ifdetach(vso, ifp, &list); mtx_unlock(&vxlan_list_mtx); LIST_FOREACH_SAFE(sc, &list, vxl_ifdetach_list, tsc) { LIST_REMOVE(sc, vxl_ifdetach_list); VXLAN_WLOCK(sc); if (sc->vxl_flags & VXLAN_FLAG_INIT) vxlan_init_wait(sc); vxlan_teardown_locked(sc); } } static void vxlan_load(void) { mtx_init(&vxlan_list_mtx, "vxlan list", NULL, MTX_DEF); LIST_INIT(&vxlan_socket_list); vxlan_ifdetach_event_tag = EVENTHANDLER_REGISTER(ifnet_departure_event, vxlan_ifdetach_event, NULL, EVENTHANDLER_PRI_ANY); vxlan_cloner = if_clone_simple(vxlan_name, vxlan_clone_create, vxlan_clone_destroy, 0); } static void vxlan_unload(void) { EVENTHANDLER_DEREGISTER(ifnet_departure_event, vxlan_ifdetach_event_tag); if_clone_detach(vxlan_cloner); mtx_destroy(&vxlan_list_mtx); MPASS(LIST_EMPTY(&vxlan_socket_list)); } static int vxlan_modevent(module_t mod, int type, void *unused) { int error; error = 0; switch (type) { case MOD_LOAD: vxlan_load(); break; case MOD_UNLOAD: vxlan_unload(); break; default: error = ENOTSUP; break; } return (error); } static moduledata_t vxlan_mod = { "if_vxlan", vxlan_modevent, 0 }; DECLARE_MODULE(if_vxlan, vxlan_mod, SI_SUB_PSEUDO, SI_ORDER_ANY); MODULE_VERSION(if_vxlan, 1); Index: projects/vnet/sys/netinet/sctp_input.c =================================================================== --- projects/vnet/sys/netinet/sctp_input.c (revision 301546) +++ projects/vnet/sys/netinet/sctp_input.c (revision 301547) @@ -1,6250 +1,6250 @@ /*- * Copyright (c) 2001-2008, by Cisco Systems, Inc. All rights reserved. * Copyright (c) 2008-2012, by Randall Stewart. All rights reserved. * Copyright (c) 2008-2012, by Michael Tuexen. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * a) Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * b) Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in * the documentation and/or other materials provided with the distribution. * * c) Neither the name of Cisco Systems, Inc. nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF * THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(INET) || defined(INET6) #include #endif #include static void sctp_stop_all_cookie_timers(struct sctp_tcb *stcb) { struct sctp_nets *net; /* * This now not only stops all cookie timers it also stops any INIT * timers as well. This will make sure that the timers are stopped * in all collision cases. */ SCTP_TCB_LOCK_ASSERT(stcb); TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { if (net->rxt_timer.type == SCTP_TIMER_TYPE_COOKIE) { sctp_timer_stop(SCTP_TIMER_TYPE_COOKIE, stcb->sctp_ep, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_1); } else if (net->rxt_timer.type == SCTP_TIMER_TYPE_INIT) { sctp_timer_stop(SCTP_TIMER_TYPE_INIT, stcb->sctp_ep, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_2); } } } /* INIT handler */ static void sctp_handle_init(struct mbuf *m, int iphlen, int offset, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_init_chunk *cp, struct sctp_inpcb *inp, struct sctp_tcb *stcb, struct sctp_nets *net, int *abort_no_unlock, uint8_t mflowtype, uint32_t mflowid, uint32_t vrf_id, uint16_t port) { struct sctp_init *init; struct mbuf *op_err; SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_init: handling INIT tcb:%p\n", (void *)stcb); if (stcb == NULL) { SCTP_INP_RLOCK(inp); } /* validate length */ if (ntohs(cp->ch.chunk_length) < sizeof(struct sctp_init_chunk)) { op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(inp, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); if (stcb) *abort_no_unlock = 1; goto outnow; } /* validate parameters */ init = &cp->init; if (init->initiate_tag == 0) { /* protocol error... send abort */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(inp, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); if (stcb) *abort_no_unlock = 1; goto outnow; } if (ntohl(init->a_rwnd) < SCTP_MIN_RWND) { /* invalid parameter... send abort */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(inp, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); if (stcb) *abort_no_unlock = 1; goto outnow; } if (init->num_inbound_streams == 0) { /* protocol error... send abort */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(inp, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); if (stcb) *abort_no_unlock = 1; goto outnow; } if (init->num_outbound_streams == 0) { /* protocol error... send abort */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(inp, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); if (stcb) *abort_no_unlock = 1; goto outnow; } if (sctp_validate_init_auth_params(m, offset + sizeof(*cp), offset + ntohs(cp->ch.chunk_length))) { /* auth parameter(s) error... send abort */ op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), "Problem with AUTH parameters"); sctp_abort_association(inp, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); if (stcb) *abort_no_unlock = 1; goto outnow; } /* * We are only accepting if we have a socket with positive * so_qlimit. */ if ((stcb == NULL) && ((inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) || (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) || (inp->sctp_socket == NULL) || (inp->sctp_socket->so_qlimit == 0))) { /* * FIX ME ?? What about TCP model and we have a * match/restart case? Actually no fix is needed. the lookup * will always find the existing assoc so stcb would not be * NULL. It may be questionable to do this since we COULD * just send back the INIT-ACK and hope that the app did * accept()'s by the time the COOKIE was sent. But there is * a price to pay for COOKIE generation and I don't want to * pay it on the chance that the app will actually do some * accepts(). The App just looses and should NOT be in this * state :-) */ if (SCTP_BASE_SYSCTL(sctp_blackhole) == 0) { op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), "No listener"); sctp_send_abort(m, iphlen, src, dst, sh, 0, op_err, mflowtype, mflowid, inp->fibnum, vrf_id, port); } goto outnow; } if ((stcb != NULL) && (SCTP_GET_STATE(&stcb->asoc) == SCTP_STATE_SHUTDOWN_ACK_SENT)) { SCTPDBG(SCTP_DEBUG_INPUT3, "sctp_handle_init: sending SHUTDOWN-ACK\n"); sctp_send_shutdown_ack(stcb, NULL); sctp_chunk_output(inp, stcb, SCTP_OUTPUT_FROM_CONTROL_PROC, SCTP_SO_NOT_LOCKED); } else { SCTPDBG(SCTP_DEBUG_INPUT3, "sctp_handle_init: sending INIT-ACK\n"); sctp_send_initiate_ack(inp, stcb, net, m, iphlen, offset, src, dst, sh, cp, mflowtype, mflowid, vrf_id, port, ((stcb == NULL) ? SCTP_HOLDS_LOCK : SCTP_NOT_LOCKED)); } outnow: if (stcb == NULL) { SCTP_INP_RUNLOCK(inp); } } /* * process peer "INIT/INIT-ACK" chunk returns value < 0 on error */ int sctp_is_there_unsent_data(struct sctp_tcb *stcb, int so_locked #if !defined(__APPLE__) && !defined(SCTP_SO_LOCK_TESTING) SCTP_UNUSED #endif ) { int unsent_data = 0; unsigned int i; struct sctp_stream_queue_pending *sp; struct sctp_association *asoc; /* * This function returns the number of streams that have true unsent * data on them. Note that as it looks through it will clean up any * places that have old data that has been sent but left at top of * stream queue. */ asoc = &stcb->asoc; SCTP_TCB_SEND_LOCK(stcb); if (!stcb->asoc.ss_functions.sctp_ss_is_empty(stcb, asoc)) { /* Check to see if some data queued */ for (i = 0; i < stcb->asoc.streamoutcnt; i++) { /* sa_ignore FREED_MEMORY */ sp = TAILQ_FIRST(&stcb->asoc.strmout[i].outqueue); if (sp == NULL) { continue; } if ((sp->msg_is_complete) && (sp->length == 0) && (sp->sender_all_done)) { /* * We are doing differed cleanup. Last time * through when we took all the data the * sender_all_done was not set. */ if (sp->put_last_out == 0) { SCTP_PRINTF("Gak, put out entire msg with NO end!-1\n"); SCTP_PRINTF("sender_done:%d len:%d msg_comp:%d put_last_out:%d\n", sp->sender_all_done, sp->length, sp->msg_is_complete, sp->put_last_out); } atomic_subtract_int(&stcb->asoc.stream_queue_cnt, 1); TAILQ_REMOVE(&stcb->asoc.strmout[i].outqueue, sp, next); if (sp->net) { sctp_free_remote_addr(sp->net); sp->net = NULL; } if (sp->data) { sctp_m_freem(sp->data); sp->data = NULL; } sctp_free_a_strmoq(stcb, sp, so_locked); } else { unsent_data++; break; } } } SCTP_TCB_SEND_UNLOCK(stcb); return (unsent_data); } static int sctp_process_init(struct sctp_init_chunk *cp, struct sctp_tcb *stcb) { struct sctp_init *init; struct sctp_association *asoc; struct sctp_nets *lnet; unsigned int i; init = &cp->init; asoc = &stcb->asoc; /* save off parameters */ asoc->peer_vtag = ntohl(init->initiate_tag); asoc->peers_rwnd = ntohl(init->a_rwnd); /* init tsn's */ asoc->highest_tsn_inside_map = asoc->asconf_seq_in = ntohl(init->initial_tsn) - 1; if (!TAILQ_EMPTY(&asoc->nets)) { /* update any ssthresh's that may have a default */ TAILQ_FOREACH(lnet, &asoc->nets, sctp_next) { lnet->ssthresh = asoc->peers_rwnd; if (SCTP_BASE_SYSCTL(sctp_logging_level) & (SCTP_CWND_MONITOR_ENABLE | SCTP_CWND_LOGGING_ENABLE)) { sctp_log_cwnd(stcb, lnet, 0, SCTP_CWND_INITIALIZATION); } } } SCTP_TCB_SEND_LOCK(stcb); if (asoc->pre_open_streams > ntohs(init->num_inbound_streams)) { unsigned int newcnt; struct sctp_stream_out *outs; struct sctp_stream_queue_pending *sp, *nsp; struct sctp_tmit_chunk *chk, *nchk; /* abandon the upper streams */ newcnt = ntohs(init->num_inbound_streams); TAILQ_FOREACH_SAFE(chk, &asoc->send_queue, sctp_next, nchk) { if (chk->rec.data.stream_number >= newcnt) { TAILQ_REMOVE(&asoc->send_queue, chk, sctp_next); asoc->send_queue_cnt--; if (asoc->strmout[chk->rec.data.stream_number].chunks_on_queues > 0) { asoc->strmout[chk->rec.data.stream_number].chunks_on_queues--; #ifdef INVARIANTS } else { panic("No chunks on the queues for sid %u.", chk->rec.data.stream_number); #endif } if (chk->data != NULL) { sctp_free_bufspace(stcb, asoc, chk, 1); sctp_ulp_notify(SCTP_NOTIFY_UNSENT_DG_FAIL, stcb, 0, chk, SCTP_SO_NOT_LOCKED); if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } } sctp_free_a_chunk(stcb, chk, SCTP_SO_NOT_LOCKED); /* sa_ignore FREED_MEMORY */ } } if (asoc->strmout) { for (i = newcnt; i < asoc->pre_open_streams; i++) { outs = &asoc->strmout[i]; TAILQ_FOREACH_SAFE(sp, &outs->outqueue, next, nsp) { TAILQ_REMOVE(&outs->outqueue, sp, next); asoc->stream_queue_cnt--; sctp_ulp_notify(SCTP_NOTIFY_SPECIAL_SP_FAIL, stcb, 0, sp, SCTP_SO_NOT_LOCKED); if (sp->data) { sctp_m_freem(sp->data); sp->data = NULL; } if (sp->net) { sctp_free_remote_addr(sp->net); sp->net = NULL; } /* Free the chunk */ sctp_free_a_strmoq(stcb, sp, SCTP_SO_NOT_LOCKED); /* sa_ignore FREED_MEMORY */ } outs->state = SCTP_STREAM_CLOSED; } } /* cut back the count */ asoc->pre_open_streams = newcnt; } SCTP_TCB_SEND_UNLOCK(stcb); asoc->streamoutcnt = asoc->pre_open_streams; if (asoc->strmout) { for (i = 0; i < asoc->streamoutcnt; i++) { asoc->strmout[i].state = SCTP_STREAM_OPEN; } } /* EY - nr_sack: initialize highest tsn in nr_mapping_array */ asoc->highest_tsn_inside_nr_map = asoc->highest_tsn_inside_map; if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_MAP_LOGGING_ENABLE) { sctp_log_map(0, 5, asoc->highest_tsn_inside_map, SCTP_MAP_SLIDE_RESULT); } /* This is the next one we expect */ asoc->str_reset_seq_in = asoc->asconf_seq_in + 1; asoc->mapping_array_base_tsn = ntohl(init->initial_tsn); asoc->tsn_last_delivered = asoc->cumulative_tsn = asoc->asconf_seq_in; asoc->advanced_peer_ack_point = asoc->last_acked_seq; /* open the requested streams */ if (asoc->strmin != NULL) { /* Free the old ones */ for (i = 0; i < asoc->streamincnt; i++) { sctp_clean_up_stream(stcb, &asoc->strmin[i].inqueue); sctp_clean_up_stream(stcb, &asoc->strmin[i].uno_inqueue); } SCTP_FREE(asoc->strmin, SCTP_M_STRMI); } if (asoc->max_inbound_streams > ntohs(init->num_outbound_streams)) { asoc->streamincnt = ntohs(init->num_outbound_streams); } else { asoc->streamincnt = asoc->max_inbound_streams; } SCTP_MALLOC(asoc->strmin, struct sctp_stream_in *, asoc->streamincnt * sizeof(struct sctp_stream_in), SCTP_M_STRMI); if (asoc->strmin == NULL) { /* we didn't get memory for the streams! */ SCTPDBG(SCTP_DEBUG_INPUT2, "process_init: couldn't get memory for the streams!\n"); return (-1); } for (i = 0; i < asoc->streamincnt; i++) { asoc->strmin[i].stream_no = i; asoc->strmin[i].last_sequence_delivered = 0xffffffff; TAILQ_INIT(&asoc->strmin[i].inqueue); TAILQ_INIT(&asoc->strmin[i].uno_inqueue); asoc->strmin[i].pd_api_started = 0; asoc->strmin[i].delivery_started = 0; } /* * load_address_from_init will put the addresses into the * association when the COOKIE is processed or the INIT-ACK is * processed. Both types of COOKIE's existing and new call this * routine. It will remove addresses that are no longer in the * association (for the restarting case where addresses are * removed). Up front when the INIT arrives we will discard it if it * is a restart and new addresses have been added. */ /* sa_ignore MEMLEAK */ return (0); } /* * INIT-ACK message processing/consumption returns value < 0 on error */ static int sctp_process_init_ack(struct mbuf *m, int iphlen, int offset, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_init_ack_chunk *cp, struct sctp_tcb *stcb, struct sctp_nets *net, int *abort_no_unlock, uint8_t mflowtype, uint32_t mflowid, uint32_t vrf_id) { struct sctp_association *asoc; struct mbuf *op_err; int retval, abort_flag; uint32_t initack_limit; int nat_friendly = 0; /* First verify that we have no illegal param's */ abort_flag = 0; op_err = sctp_arethere_unrecognized_parameters(m, (offset + sizeof(struct sctp_init_chunk)), &abort_flag, (struct sctp_chunkhdr *)cp, &nat_friendly); if (abort_flag) { /* Send an abort and notify peer */ sctp_abort_an_association(stcb->sctp_ep, stcb, op_err, SCTP_SO_NOT_LOCKED); *abort_no_unlock = 1; return (-1); } asoc = &stcb->asoc; asoc->peer_supports_nat = (uint8_t) nat_friendly; /* process the peer's parameters in the INIT-ACK */ retval = sctp_process_init((struct sctp_init_chunk *)cp, stcb); if (retval < 0) { return (retval); } initack_limit = offset + ntohs(cp->ch.chunk_length); /* load all addresses */ if ((retval = sctp_load_addresses_from_init(stcb, m, (offset + sizeof(struct sctp_init_chunk)), initack_limit, src, dst, NULL, stcb->asoc.port))) { op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), "Problem with address parameters"); SCTPDBG(SCTP_DEBUG_INPUT1, "Load addresses from INIT causes an abort %d\n", retval); sctp_abort_association(stcb->sctp_ep, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, net->port); *abort_no_unlock = 1; return (-1); } /* if the peer doesn't support asconf, flush the asconf queue */ if (asoc->asconf_supported == 0) { struct sctp_asconf_addr *param, *nparam; TAILQ_FOREACH_SAFE(param, &asoc->asconf_queue, next, nparam) { TAILQ_REMOVE(&asoc->asconf_queue, param, next); SCTP_FREE(param, SCTP_M_ASC_ADDR); } } stcb->asoc.peer_hmac_id = sctp_negotiate_hmacid(stcb->asoc.peer_hmacs, stcb->asoc.local_hmacs); if (op_err) { sctp_queue_op_err(stcb, op_err); /* queuing will steal away the mbuf chain to the out queue */ op_err = NULL; } /* extract the cookie and queue it to "echo" it back... */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; net->error_count = 0; /* * Cancel the INIT timer, We do this first before queueing the * cookie. We always cancel at the primary to assue that we are * canceling the timer started by the INIT which always goes to the * primary. */ sctp_timer_stop(SCTP_TIMER_TYPE_INIT, stcb->sctp_ep, stcb, asoc->primary_destination, SCTP_FROM_SCTP_INPUT + SCTP_LOC_3); /* calculate the RTO */ net->RTO = sctp_calculate_rto(stcb, asoc, net, &asoc->time_entered, sctp_align_safe_nocopy, SCTP_RTT_FROM_NON_DATA); retval = sctp_send_cookie_echo(m, offset, stcb, net); if (retval < 0) { /* * No cookie, we probably should send a op error. But in any * case if there is no cookie in the INIT-ACK, we can * abandon the peer, its broke. */ if (retval == -3) { uint16_t len; len = (uint16_t) (sizeof(struct sctp_error_missing_param) + sizeof(uint16_t)); /* We abort with an error of missing mandatory param */ op_err = sctp_get_mbuf_for_msg(len, 0, M_NOWAIT, 1, MT_DATA); if (op_err != NULL) { struct sctp_error_missing_param *cause; SCTP_BUF_LEN(op_err) = len; cause = mtod(op_err, struct sctp_error_missing_param *); /* Subtract the reserved param */ cause->cause.code = htons(SCTP_CAUSE_MISSING_PARAM); cause->cause.length = htons(len); cause->num_missing_params = htonl(1); cause->type[0] = htons(SCTP_STATE_COOKIE); } sctp_abort_association(stcb->sctp_ep, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, net->port); *abort_no_unlock = 1; } return (retval); } return (0); } static void sctp_handle_heartbeat_ack(struct sctp_heartbeat_chunk *cp, struct sctp_tcb *stcb, struct sctp_nets *net) { union sctp_sockstore store; struct sctp_nets *r_net, *f_net; struct timeval tv; int req_prim = 0; uint16_t old_error_counter; if (ntohs(cp->ch.chunk_length) != sizeof(struct sctp_heartbeat_chunk)) { /* Invalid length */ return; } memset(&store, 0, sizeof(store)); switch (cp->heartbeat.hb_info.addr_family) { #ifdef INET case AF_INET: if (cp->heartbeat.hb_info.addr_len == sizeof(struct sockaddr_in)) { store.sin.sin_family = cp->heartbeat.hb_info.addr_family; store.sin.sin_len = cp->heartbeat.hb_info.addr_len; store.sin.sin_port = stcb->rport; memcpy(&store.sin.sin_addr, cp->heartbeat.hb_info.address, sizeof(store.sin.sin_addr)); } else { return; } break; #endif #ifdef INET6 case AF_INET6: if (cp->heartbeat.hb_info.addr_len == sizeof(struct sockaddr_in6)) { store.sin6.sin6_family = cp->heartbeat.hb_info.addr_family; store.sin6.sin6_len = cp->heartbeat.hb_info.addr_len; store.sin6.sin6_port = stcb->rport; memcpy(&store.sin6.sin6_addr, cp->heartbeat.hb_info.address, sizeof(struct in6_addr)); } else { return; } break; #endif default: return; } r_net = sctp_findnet(stcb, &store.sa); if (r_net == NULL) { SCTPDBG(SCTP_DEBUG_INPUT1, "Huh? I can't find the address I sent it to, discard\n"); return; } if ((r_net && (r_net->dest_state & SCTP_ADDR_UNCONFIRMED)) && (r_net->heartbeat_random1 == cp->heartbeat.hb_info.random_value1) && (r_net->heartbeat_random2 == cp->heartbeat.hb_info.random_value2)) { /* * If the its a HB and it's random value is correct when can * confirm the destination. */ r_net->dest_state &= ~SCTP_ADDR_UNCONFIRMED; if (r_net->dest_state & SCTP_ADDR_REQ_PRIMARY) { stcb->asoc.primary_destination = r_net; r_net->dest_state &= ~SCTP_ADDR_REQ_PRIMARY; f_net = TAILQ_FIRST(&stcb->asoc.nets); if (f_net != r_net) { /* * first one on the list is NOT the primary * sctp_cmpaddr() is much more efficient if * the primary is the first on the list, * make it so. */ TAILQ_REMOVE(&stcb->asoc.nets, r_net, sctp_next); TAILQ_INSERT_HEAD(&stcb->asoc.nets, r_net, sctp_next); } req_prim = 1; } sctp_ulp_notify(SCTP_NOTIFY_INTERFACE_CONFIRMED, stcb, 0, (void *)r_net, SCTP_SO_NOT_LOCKED); sctp_timer_stop(SCTP_TIMER_TYPE_HEARTBEAT, stcb->sctp_ep, stcb, r_net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_4); sctp_timer_start(SCTP_TIMER_TYPE_HEARTBEAT, stcb->sctp_ep, stcb, r_net); } old_error_counter = r_net->error_count; r_net->error_count = 0; r_net->hb_responded = 1; tv.tv_sec = cp->heartbeat.hb_info.time_value_1; tv.tv_usec = cp->heartbeat.hb_info.time_value_2; /* Now lets do a RTO with this */ r_net->RTO = sctp_calculate_rto(stcb, &stcb->asoc, r_net, &tv, sctp_align_safe_nocopy, SCTP_RTT_FROM_NON_DATA); if (!(r_net->dest_state & SCTP_ADDR_REACHABLE)) { r_net->dest_state |= SCTP_ADDR_REACHABLE; sctp_ulp_notify(SCTP_NOTIFY_INTERFACE_UP, stcb, 0, (void *)r_net, SCTP_SO_NOT_LOCKED); } if (r_net->dest_state & SCTP_ADDR_PF) { r_net->dest_state &= ~SCTP_ADDR_PF; stcb->asoc.cc_functions.sctp_cwnd_update_exit_pf(stcb, net); } if (old_error_counter > 0) { sctp_timer_stop(SCTP_TIMER_TYPE_HEARTBEAT, stcb->sctp_ep, stcb, r_net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_5); sctp_timer_start(SCTP_TIMER_TYPE_HEARTBEAT, stcb->sctp_ep, stcb, r_net); } if (r_net == stcb->asoc.primary_destination) { if (stcb->asoc.alternate) { /* release the alternate, primary is good */ sctp_free_remote_addr(stcb->asoc.alternate); stcb->asoc.alternate = NULL; } } /* Mobility adaptation */ if (req_prim) { if ((sctp_is_mobility_feature_on(stcb->sctp_ep, SCTP_MOBILITY_BASE) || sctp_is_mobility_feature_on(stcb->sctp_ep, SCTP_MOBILITY_FASTHANDOFF)) && sctp_is_mobility_feature_on(stcb->sctp_ep, SCTP_MOBILITY_PRIM_DELETED)) { sctp_timer_stop(SCTP_TIMER_TYPE_PRIM_DELETED, stcb->sctp_ep, stcb, NULL, SCTP_FROM_SCTP_INPUT + SCTP_LOC_6); if (sctp_is_mobility_feature_on(stcb->sctp_ep, SCTP_MOBILITY_FASTHANDOFF)) { sctp_assoc_immediate_retrans(stcb, stcb->asoc.primary_destination); } if (sctp_is_mobility_feature_on(stcb->sctp_ep, SCTP_MOBILITY_BASE)) { sctp_move_chunks_from_net(stcb, stcb->asoc.deleted_primary); } sctp_delete_prim_timer(stcb->sctp_ep, stcb, stcb->asoc.deleted_primary); } } } static int sctp_handle_nat_colliding_state(struct sctp_tcb *stcb) { /* * return 0 means we want you to proceed with the abort non-zero * means no abort processing */ struct sctpasochead *head; if (SCTP_GET_STATE(&stcb->asoc) == SCTP_STATE_COOKIE_WAIT) { /* generate a new vtag and send init */ LIST_REMOVE(stcb, sctp_asocs); stcb->asoc.my_vtag = sctp_select_a_tag(stcb->sctp_ep, stcb->sctp_ep->sctp_lport, stcb->rport, 1); head = &SCTP_BASE_INFO(sctp_asochash)[SCTP_PCBHASH_ASOC(stcb->asoc.my_vtag, SCTP_BASE_INFO(hashasocmark))]; /* * put it in the bucket in the vtag hash of assoc's for the * system */ LIST_INSERT_HEAD(head, stcb, sctp_asocs); sctp_send_initiate(stcb->sctp_ep, stcb, SCTP_SO_NOT_LOCKED); return (1); } if (SCTP_GET_STATE(&stcb->asoc) == SCTP_STATE_COOKIE_ECHOED) { /* * treat like a case where the cookie expired i.e.: - dump * current cookie. - generate a new vtag. - resend init. */ /* generate a new vtag and send init */ LIST_REMOVE(stcb, sctp_asocs); stcb->asoc.state &= ~SCTP_STATE_COOKIE_ECHOED; stcb->asoc.state |= SCTP_STATE_COOKIE_WAIT; sctp_stop_all_cookie_timers(stcb); sctp_toss_old_cookies(stcb, &stcb->asoc); stcb->asoc.my_vtag = sctp_select_a_tag(stcb->sctp_ep, stcb->sctp_ep->sctp_lport, stcb->rport, 1); head = &SCTP_BASE_INFO(sctp_asochash)[SCTP_PCBHASH_ASOC(stcb->asoc.my_vtag, SCTP_BASE_INFO(hashasocmark))]; /* * put it in the bucket in the vtag hash of assoc's for the * system */ LIST_INSERT_HEAD(head, stcb, sctp_asocs); sctp_send_initiate(stcb->sctp_ep, stcb, SCTP_SO_NOT_LOCKED); return (1); } return (0); } static int sctp_handle_nat_missing_state(struct sctp_tcb *stcb, struct sctp_nets *net) { /* * return 0 means we want you to proceed with the abort non-zero * means no abort processing */ if (stcb->asoc.auth_supported == 0) { SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_nat_missing_state: Peer does not support AUTH, cannot send an asconf\n"); return (0); } sctp_asconf_send_nat_state_update(stcb, net); return (1); } static void sctp_handle_abort(struct sctp_abort_chunk *abort, struct sctp_tcb *stcb, struct sctp_nets *net) { #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif uint16_t len; uint16_t error; SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_abort: handling ABORT\n"); if (stcb == NULL) return; len = ntohs(abort->ch.chunk_length); if (len > sizeof(struct sctp_chunkhdr)) { /* * Need to check the cause codes for our two magic nat * aborts which don't kill the assoc necessarily. */ struct sctp_gen_error_cause *cause; cause = (struct sctp_gen_error_cause *)(abort + 1); error = ntohs(cause->code); if (error == SCTP_CAUSE_NAT_COLLIDING_STATE) { SCTPDBG(SCTP_DEBUG_INPUT2, "Received Colliding state abort flags:%x\n", abort->ch.chunk_flags); if (sctp_handle_nat_colliding_state(stcb)) { return; } } else if (error == SCTP_CAUSE_NAT_MISSING_STATE) { SCTPDBG(SCTP_DEBUG_INPUT2, "Received missing state abort flags:%x\n", abort->ch.chunk_flags); if (sctp_handle_nat_missing_state(stcb, net)) { return; } } } else { error = 0; } /* stop any receive timers */ sctp_timer_stop(SCTP_TIMER_TYPE_RECV, stcb->sctp_ep, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_7); /* notify user of the abort and clean up... */ sctp_abort_notification(stcb, 1, error, abort, SCTP_SO_NOT_LOCKED); /* free the tcb */ SCTP_STAT_INCR_COUNTER32(sctps_aborted); if ((SCTP_GET_STATE(&stcb->asoc) == SCTP_STATE_OPEN) || (SCTP_GET_STATE(&stcb->asoc) == SCTP_STATE_SHUTDOWN_RECEIVED)) { SCTP_STAT_DECR_GAUGE32(sctps_currestab); } #ifdef SCTP_ASOCLOG_OF_TSNS sctp_print_out_track_log(stcb); #endif #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(stcb->sctp_ep); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); #endif stcb->asoc.state |= SCTP_STATE_WAS_ABORTED; (void)sctp_free_assoc(stcb->sctp_ep, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_8); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_abort: finished\n"); } static void sctp_start_net_timers(struct sctp_tcb *stcb) { uint32_t cnt_hb_sent; struct sctp_nets *net; cnt_hb_sent = 0; TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { /* * For each network start: 1) A pmtu timer. 2) A HB timer 3) * If the dest in unconfirmed send a hb as well if under * max_hb_burst have been sent. */ sctp_timer_start(SCTP_TIMER_TYPE_PATHMTURAISE, stcb->sctp_ep, stcb, net); sctp_timer_start(SCTP_TIMER_TYPE_HEARTBEAT, stcb->sctp_ep, stcb, net); if ((net->dest_state & SCTP_ADDR_UNCONFIRMED) && (cnt_hb_sent < SCTP_BASE_SYSCTL(sctp_hb_maxburst))) { sctp_send_hb(stcb, net, SCTP_SO_NOT_LOCKED); cnt_hb_sent++; } } if (cnt_hb_sent) { sctp_chunk_output(stcb->sctp_ep, stcb, SCTP_OUTPUT_FROM_COOKIE_ACK, SCTP_SO_NOT_LOCKED); } } static void sctp_handle_shutdown(struct sctp_shutdown_chunk *cp, struct sctp_tcb *stcb, struct sctp_nets *net, int *abort_flag) { struct sctp_association *asoc; int some_on_streamwheel; int old_state; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_shutdown: handling SHUTDOWN\n"); if (stcb == NULL) return; asoc = &stcb->asoc; if ((SCTP_GET_STATE(asoc) == SCTP_STATE_COOKIE_WAIT) || (SCTP_GET_STATE(asoc) == SCTP_STATE_COOKIE_ECHOED)) { return; } if (ntohs(cp->ch.chunk_length) != sizeof(struct sctp_shutdown_chunk)) { /* Shutdown NOT the expected size */ return; } old_state = SCTP_GET_STATE(asoc); sctp_update_acked(stcb, cp, abort_flag); if (*abort_flag) { return; } if (asoc->control_pdapi) { /* * With a normal shutdown we assume the end of last record. */ SCTP_INP_READ_LOCK(stcb->sctp_ep); if (asoc->control_pdapi->on_strm_q) { struct sctp_stream_in *strm; strm = &asoc->strmin[asoc->control_pdapi->sinfo_stream]; if (asoc->control_pdapi->on_strm_q == SCTP_ON_UNORDERED) { /* Unordered */ TAILQ_REMOVE(&strm->uno_inqueue, asoc->control_pdapi, next_instrm); asoc->control_pdapi->on_strm_q = 0; } else if (asoc->control_pdapi->on_strm_q == SCTP_ON_ORDERED) { /* Ordered */ TAILQ_REMOVE(&strm->inqueue, asoc->control_pdapi, next_instrm); asoc->control_pdapi->on_strm_q = 0; #ifdef INVARIANTS } else { panic("Unknown state on ctrl:%p on_strm_q:%d", asoc->control_pdapi, asoc->control_pdapi->on_strm_q); #endif } } asoc->control_pdapi->end_added = 1; asoc->control_pdapi->pdapi_aborted = 1; asoc->control_pdapi = NULL; SCTP_INP_READ_UNLOCK(stcb->sctp_ep); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(stcb->sctp_ep); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); if (stcb->asoc.state & SCTP_STATE_CLOSED_SOCKET) { /* assoc was freed while we were unlocked */ SCTP_SOCKET_UNLOCK(so, 1); return; } #endif if (stcb->sctp_socket) { sctp_sorwakeup(stcb->sctp_ep, stcb->sctp_socket); } #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif } /* goto SHUTDOWN_RECEIVED state to block new requests */ if (stcb->sctp_socket) { if ((SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_RECEIVED) && (SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_ACK_SENT) && (SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_SENT)) { SCTP_SET_STATE(asoc, SCTP_STATE_SHUTDOWN_RECEIVED); SCTP_CLEAR_SUBSTATE(asoc, SCTP_STATE_SHUTDOWN_PENDING); /* * notify upper layer that peer has initiated a * shutdown */ sctp_ulp_notify(SCTP_NOTIFY_PEER_SHUTDOWN, stcb, 0, NULL, SCTP_SO_NOT_LOCKED); /* reset time */ (void)SCTP_GETTIME_TIMEVAL(&asoc->time_entered); } } if (SCTP_GET_STATE(asoc) == SCTP_STATE_SHUTDOWN_SENT) { /* * stop the shutdown timer, since we WILL move to * SHUTDOWN-ACK-SENT. */ sctp_timer_stop(SCTP_TIMER_TYPE_SHUTDOWN, stcb->sctp_ep, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_9); } /* Now is there unsent data on a stream somewhere? */ some_on_streamwheel = sctp_is_there_unsent_data(stcb, SCTP_SO_NOT_LOCKED); if (!TAILQ_EMPTY(&asoc->send_queue) || !TAILQ_EMPTY(&asoc->sent_queue) || some_on_streamwheel) { /* By returning we will push more data out */ return; } else { /* no outstanding data to send, so move on... */ /* send SHUTDOWN-ACK */ /* move to SHUTDOWN-ACK-SENT state */ if ((SCTP_GET_STATE(asoc) == SCTP_STATE_OPEN) || (SCTP_GET_STATE(asoc) == SCTP_STATE_SHUTDOWN_RECEIVED)) { SCTP_STAT_DECR_GAUGE32(sctps_currestab); } SCTP_CLEAR_SUBSTATE(asoc, SCTP_STATE_SHUTDOWN_PENDING); if (SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_ACK_SENT) { SCTP_SET_STATE(asoc, SCTP_STATE_SHUTDOWN_ACK_SENT); sctp_stop_timers_for_shutdown(stcb); sctp_send_shutdown_ack(stcb, net); sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWNACK, stcb->sctp_ep, stcb, net); } else if (old_state == SCTP_STATE_SHUTDOWN_ACK_SENT) { sctp_send_shutdown_ack(stcb, net); } } } static void sctp_handle_shutdown_ack(struct sctp_shutdown_ack_chunk *cp SCTP_UNUSED, struct sctp_tcb *stcb, struct sctp_nets *net) { struct sctp_association *asoc; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; so = SCTP_INP_SO(stcb->sctp_ep); #endif SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_shutdown_ack: handling SHUTDOWN ACK\n"); if (stcb == NULL) return; asoc = &stcb->asoc; /* process according to association state */ if ((SCTP_GET_STATE(asoc) == SCTP_STATE_COOKIE_WAIT) || (SCTP_GET_STATE(asoc) == SCTP_STATE_COOKIE_ECHOED)) { /* unexpected SHUTDOWN-ACK... do OOTB handling... */ sctp_send_shutdown_complete(stcb, net, 1); SCTP_TCB_UNLOCK(stcb); return; } if ((SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_SENT) && (SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_ACK_SENT)) { /* unexpected SHUTDOWN-ACK... so ignore... */ SCTP_TCB_UNLOCK(stcb); return; } if (asoc->control_pdapi) { /* * With a normal shutdown we assume the end of last record. */ SCTP_INP_READ_LOCK(stcb->sctp_ep); asoc->control_pdapi->end_added = 1; asoc->control_pdapi->pdapi_aborted = 1; asoc->control_pdapi = NULL; SCTP_INP_READ_UNLOCK(stcb->sctp_ep); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); if (stcb->asoc.state & SCTP_STATE_CLOSED_SOCKET) { /* assoc was freed while we were unlocked */ SCTP_SOCKET_UNLOCK(so, 1); return; } #endif sctp_sorwakeup(stcb->sctp_ep, stcb->sctp_socket); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif } #ifdef INVARIANTS if (!TAILQ_EMPTY(&asoc->send_queue) || !TAILQ_EMPTY(&asoc->sent_queue) || !stcb->asoc.ss_functions.sctp_ss_is_empty(stcb, asoc)) { panic("Queues are not empty when handling SHUTDOWN-ACK"); } #endif /* stop the timer */ sctp_timer_stop(SCTP_TIMER_TYPE_SHUTDOWN, stcb->sctp_ep, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_10); /* send SHUTDOWN-COMPLETE */ sctp_send_shutdown_complete(stcb, net, 0); /* notify upper layer protocol */ if (stcb->sctp_socket) { if ((stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) || (stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL)) { stcb->sctp_socket->so_snd.sb_cc = 0; } sctp_ulp_notify(SCTP_NOTIFY_ASSOC_DOWN, stcb, 0, NULL, SCTP_SO_NOT_LOCKED); } SCTP_STAT_INCR_COUNTER32(sctps_shutdown); /* free the TCB but first save off the ep */ #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); #endif (void)sctp_free_assoc(stcb->sctp_ep, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_11); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif } /* * Skip past the param header and then we will find the chunk that caused the * problem. There are two possibilities ASCONF or FWD-TSN other than that and * our peer must be broken. */ static void sctp_process_unrecog_chunk(struct sctp_tcb *stcb, struct sctp_paramhdr *phdr, struct sctp_nets *net) { struct sctp_chunkhdr *chk; chk = (struct sctp_chunkhdr *)((caddr_t)phdr + sizeof(*phdr)); switch (chk->chunk_type) { case SCTP_ASCONF_ACK: case SCTP_ASCONF: sctp_asconf_cleanup(stcb, net); break; case SCTP_IFORWARD_CUM_TSN: case SCTP_FORWARD_CUM_TSN: stcb->asoc.prsctp_supported = 0; break; default: SCTPDBG(SCTP_DEBUG_INPUT2, "Peer does not support chunk type %d(%x)??\n", chk->chunk_type, (uint32_t) chk->chunk_type); break; } } /* * Skip past the param header and then we will find the param that caused the * problem. There are a number of param's in a ASCONF OR the prsctp param * these will turn of specific features. * XXX: Is this the right thing to do? */ static void sctp_process_unrecog_param(struct sctp_tcb *stcb, struct sctp_paramhdr *phdr) { struct sctp_paramhdr *pbad; pbad = phdr + 1; switch (ntohs(pbad->param_type)) { /* pr-sctp draft */ case SCTP_PRSCTP_SUPPORTED: stcb->asoc.prsctp_supported = 0; break; case SCTP_SUPPORTED_CHUNK_EXT: break; /* draft-ietf-tsvwg-addip-sctp */ case SCTP_HAS_NAT_SUPPORT: stcb->asoc.peer_supports_nat = 0; break; case SCTP_ADD_IP_ADDRESS: case SCTP_DEL_IP_ADDRESS: case SCTP_SET_PRIM_ADDR: stcb->asoc.asconf_supported = 0; break; case SCTP_SUCCESS_REPORT: case SCTP_ERROR_CAUSE_IND: SCTPDBG(SCTP_DEBUG_INPUT2, "Huh, the peer does not support success? or error cause?\n"); SCTPDBG(SCTP_DEBUG_INPUT2, "Turning off ASCONF to this strange peer\n"); stcb->asoc.asconf_supported = 0; break; default: SCTPDBG(SCTP_DEBUG_INPUT2, "Peer does not support param type %d(%x)??\n", pbad->param_type, (uint32_t) pbad->param_type); break; } } static int sctp_handle_error(struct sctp_chunkhdr *ch, struct sctp_tcb *stcb, struct sctp_nets *net) { int chklen; struct sctp_paramhdr *phdr; uint16_t error, error_type; uint16_t error_len; struct sctp_association *asoc; int adjust; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif /* parse through all of the errors and process */ asoc = &stcb->asoc; phdr = (struct sctp_paramhdr *)((caddr_t)ch + sizeof(struct sctp_chunkhdr)); chklen = ntohs(ch->chunk_length) - sizeof(struct sctp_chunkhdr); error = 0; while ((size_t)chklen >= sizeof(struct sctp_paramhdr)) { /* Process an Error Cause */ error_type = ntohs(phdr->param_type); error_len = ntohs(phdr->param_length); if ((error_len > chklen) || (error_len == 0)) { /* invalid param length for this param */ SCTPDBG(SCTP_DEBUG_INPUT1, "Bogus length in error param- chunk left:%d errorlen:%d\n", chklen, error_len); return (0); } if (error == 0) { /* report the first error cause */ error = error_type; } switch (error_type) { case SCTP_CAUSE_INVALID_STREAM: case SCTP_CAUSE_MISSING_PARAM: case SCTP_CAUSE_INVALID_PARAM: case SCTP_CAUSE_NO_USER_DATA: SCTPDBG(SCTP_DEBUG_INPUT1, "Software error we got a %d back? We have a bug :/ (or do they?)\n", error_type); break; case SCTP_CAUSE_NAT_COLLIDING_STATE: SCTPDBG(SCTP_DEBUG_INPUT2, "Received Colliding state abort flags:%x\n", ch->chunk_flags); if (sctp_handle_nat_colliding_state(stcb)) { return (0); } break; case SCTP_CAUSE_NAT_MISSING_STATE: SCTPDBG(SCTP_DEBUG_INPUT2, "Received missing state abort flags:%x\n", ch->chunk_flags); if (sctp_handle_nat_missing_state(stcb, net)) { return (0); } break; case SCTP_CAUSE_STALE_COOKIE: /* * We only act if we have echoed a cookie and are * waiting. */ if (SCTP_GET_STATE(asoc) == SCTP_STATE_COOKIE_ECHOED) { int *p; p = (int *)((caddr_t)phdr + sizeof(*phdr)); /* Save the time doubled */ asoc->cookie_preserve_req = ntohl(*p) << 1; asoc->stale_cookie_count++; if (asoc->stale_cookie_count > asoc->max_init_times) { sctp_abort_notification(stcb, 0, 0, NULL, SCTP_SO_NOT_LOCKED); /* now free the asoc */ #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(stcb->sctp_ep); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); #endif (void)sctp_free_assoc(stcb->sctp_ep, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_12); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif return (-1); } /* blast back to INIT state */ sctp_toss_old_cookies(stcb, &stcb->asoc); asoc->state &= ~SCTP_STATE_COOKIE_ECHOED; asoc->state |= SCTP_STATE_COOKIE_WAIT; sctp_stop_all_cookie_timers(stcb); sctp_send_initiate(stcb->sctp_ep, stcb, SCTP_SO_NOT_LOCKED); } break; case SCTP_CAUSE_UNRESOLVABLE_ADDR: /* * Nothing we can do here, we don't do hostname * addresses so if the peer does not like my IPv6 * (or IPv4 for that matter) it does not matter. If * they don't support that type of address, they can * NOT possibly get that packet type... i.e. with no * IPv6 you can't receive a IPv6 packet. so we can * safely ignore this one. If we ever added support * for HOSTNAME Addresses, then we would need to do * something here. */ break; case SCTP_CAUSE_UNRECOG_CHUNK: sctp_process_unrecog_chunk(stcb, phdr, net); break; case SCTP_CAUSE_UNRECOG_PARAM: sctp_process_unrecog_param(stcb, phdr); break; case SCTP_CAUSE_COOKIE_IN_SHUTDOWN: /* * We ignore this since the timer will drive out a * new cookie anyway and there timer will drive us * to send a SHUTDOWN_COMPLETE. We can't send one * here since we don't have their tag. */ break; case SCTP_CAUSE_DELETING_LAST_ADDR: case SCTP_CAUSE_RESOURCE_SHORTAGE: case SCTP_CAUSE_DELETING_SRC_ADDR: /* * We should NOT get these here, but in a * ASCONF-ACK. */ SCTPDBG(SCTP_DEBUG_INPUT2, "Peer sends ASCONF errors in a Operational Error?<%d>?\n", error_type); break; case SCTP_CAUSE_OUT_OF_RESC: /* * And what, pray tell do we do with the fact that * the peer is out of resources? Not really sure we * could do anything but abort. I suspect this * should have came WITH an abort instead of in a * OP-ERROR. */ break; default: SCTPDBG(SCTP_DEBUG_INPUT1, "sctp_handle_error: unknown error type = 0x%xh\n", error_type); break; } adjust = SCTP_SIZE32(error_len); chklen -= adjust; phdr = (struct sctp_paramhdr *)((caddr_t)phdr + adjust); } sctp_ulp_notify(SCTP_NOTIFY_REMOTE_ERROR, stcb, error, ch, SCTP_SO_NOT_LOCKED); return (0); } static int sctp_handle_init_ack(struct mbuf *m, int iphlen, int offset, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_init_ack_chunk *cp, struct sctp_tcb *stcb, struct sctp_nets *net, int *abort_no_unlock, uint8_t mflowtype, uint32_t mflowid, uint32_t vrf_id) { struct sctp_init_ack *init_ack; struct mbuf *op_err; SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_init_ack: handling INIT-ACK\n"); if (stcb == NULL) { SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_init_ack: TCB is null\n"); return (-1); } if (ntohs(cp->ch.chunk_length) < sizeof(struct sctp_init_ack_chunk)) { /* Invalid length */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(stcb->sctp_ep, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, net->port); *abort_no_unlock = 1; return (-1); } init_ack = &cp->init; /* validate parameters */ if (init_ack->initiate_tag == 0) { /* protocol error... send an abort */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(stcb->sctp_ep, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, net->port); *abort_no_unlock = 1; return (-1); } if (ntohl(init_ack->a_rwnd) < SCTP_MIN_RWND) { /* protocol error... send an abort */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(stcb->sctp_ep, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, net->port); *abort_no_unlock = 1; return (-1); } if (init_ack->num_inbound_streams == 0) { /* protocol error... send an abort */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(stcb->sctp_ep, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, net->port); *abort_no_unlock = 1; return (-1); } if (init_ack->num_outbound_streams == 0) { /* protocol error... send an abort */ op_err = sctp_generate_cause(SCTP_CAUSE_INVALID_PARAM, ""); sctp_abort_association(stcb->sctp_ep, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, net->port); *abort_no_unlock = 1; return (-1); } /* process according to association state... */ switch (stcb->asoc.state & SCTP_STATE_MASK) { case SCTP_STATE_COOKIE_WAIT: /* this is the expected state for this chunk */ /* process the INIT-ACK parameters */ if (stcb->asoc.primary_destination->dest_state & SCTP_ADDR_UNCONFIRMED) { /* * The primary is where we sent the INIT, we can * always consider it confirmed when the INIT-ACK is * returned. Do this before we load addresses * though. */ stcb->asoc.primary_destination->dest_state &= ~SCTP_ADDR_UNCONFIRMED; sctp_ulp_notify(SCTP_NOTIFY_INTERFACE_CONFIRMED, stcb, 0, (void *)stcb->asoc.primary_destination, SCTP_SO_NOT_LOCKED); } if (sctp_process_init_ack(m, iphlen, offset, src, dst, sh, cp, stcb, net, abort_no_unlock, mflowtype, mflowid, vrf_id) < 0) { /* error in parsing parameters */ return (-1); } /* update our state */ SCTPDBG(SCTP_DEBUG_INPUT2, "moving to COOKIE-ECHOED state\n"); SCTP_SET_STATE(&stcb->asoc, SCTP_STATE_COOKIE_ECHOED); /* reset the RTO calc */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; (void)SCTP_GETTIME_TIMEVAL(&stcb->asoc.time_entered); /* * collapse the init timer back in case of a exponential * backoff */ sctp_timer_start(SCTP_TIMER_TYPE_COOKIE, stcb->sctp_ep, stcb, net); /* * the send at the end of the inbound data processing will * cause the cookie to be sent */ break; case SCTP_STATE_SHUTDOWN_SENT: /* incorrect state... discard */ break; case SCTP_STATE_COOKIE_ECHOED: /* incorrect state... discard */ break; case SCTP_STATE_OPEN: /* incorrect state... discard */ break; case SCTP_STATE_EMPTY: case SCTP_STATE_INUSE: default: /* incorrect state... discard */ return (-1); break; } SCTPDBG(SCTP_DEBUG_INPUT1, "Leaving handle-init-ack end\n"); return (0); } static struct sctp_tcb * sctp_process_cookie_new(struct mbuf *m, int iphlen, int offset, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_state_cookie *cookie, int cookie_len, struct sctp_inpcb *inp, struct sctp_nets **netp, struct sockaddr *init_src, int *notification, int auth_skipped, uint32_t auth_offset, uint32_t auth_len, uint8_t mflowtype, uint32_t mflowid, uint32_t vrf_id, uint16_t port); /* * handle a state cookie for an existing association m: input packet mbuf * chain-- assumes a pullup on IP/SCTP/COOKIE-ECHO chunk note: this is a * "split" mbuf and the cookie signature does not exist offset: offset into * mbuf to the cookie-echo chunk */ static struct sctp_tcb * sctp_process_cookie_existing(struct mbuf *m, int iphlen, int offset, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_state_cookie *cookie, int cookie_len, struct sctp_inpcb *inp, struct sctp_tcb *stcb, struct sctp_nets **netp, struct sockaddr *init_src, int *notification, int auth_skipped, uint32_t auth_offset, uint32_t auth_len, uint8_t mflowtype, uint32_t mflowid, uint32_t vrf_id, uint16_t port) { struct sctp_association *asoc; struct sctp_init_chunk *init_cp, init_buf; struct sctp_init_ack_chunk *initack_cp, initack_buf; struct sctp_nets *net; struct mbuf *op_err; int init_offset, initack_offset, i; int retval; int spec_flag = 0; uint32_t how_indx; #if defined(SCTP_DETAILED_STR_STATS) int j; #endif net = *netp; /* I know that the TCB is non-NULL from the caller */ asoc = &stcb->asoc; for (how_indx = 0; how_indx < sizeof(asoc->cookie_how); how_indx++) { if (asoc->cookie_how[how_indx] == 0) break; } if (how_indx < sizeof(asoc->cookie_how)) { asoc->cookie_how[how_indx] = 1; } if (SCTP_GET_STATE(asoc) == SCTP_STATE_SHUTDOWN_ACK_SENT) { /* SHUTDOWN came in after sending INIT-ACK */ sctp_send_shutdown_ack(stcb, stcb->asoc.primary_destination); op_err = sctp_generate_cause(SCTP_CAUSE_COOKIE_IN_SHUTDOWN, ""); sctp_send_operr_to(src, dst, sh, cookie->peers_vtag, op_err, mflowtype, mflowid, inp->fibnum, vrf_id, net->port); if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 2; return (NULL); } /* * find and validate the INIT chunk in the cookie (peer's info) the * INIT should start after the cookie-echo header struct (chunk * header, state cookie header struct) */ init_offset = offset += sizeof(struct sctp_cookie_echo_chunk); init_cp = (struct sctp_init_chunk *) sctp_m_getptr(m, init_offset, sizeof(struct sctp_init_chunk), (uint8_t *) & init_buf); if (init_cp == NULL) { /* could not pull a INIT chunk in cookie */ return (NULL); } if (init_cp->ch.chunk_type != SCTP_INITIATION) { return (NULL); } /* * find and validate the INIT-ACK chunk in the cookie (my info) the * INIT-ACK follows the INIT chunk */ initack_offset = init_offset + SCTP_SIZE32(ntohs(init_cp->ch.chunk_length)); initack_cp = (struct sctp_init_ack_chunk *) sctp_m_getptr(m, initack_offset, sizeof(struct sctp_init_ack_chunk), (uint8_t *) & initack_buf); if (initack_cp == NULL) { /* could not pull INIT-ACK chunk in cookie */ return (NULL); } if (initack_cp->ch.chunk_type != SCTP_INITIATION_ACK) { return (NULL); } if ((ntohl(initack_cp->init.initiate_tag) == asoc->my_vtag) && (ntohl(init_cp->init.initiate_tag) == asoc->peer_vtag)) { /* * case D in Section 5.2.4 Table 2: MMAA process accordingly * to get into the OPEN state */ if (ntohl(initack_cp->init.initial_tsn) != asoc->init_seq_number) { /*- * Opps, this means that we somehow generated two vtag's * the same. I.e. we did: * Us Peer * <---INIT(tag=a)------ * ----INIT-ACK(tag=t)--> * ----INIT(tag=t)------> *1 * <---INIT-ACK(tag=a)--- * <----CE(tag=t)------------- *2 * * At point *1 we should be generating a different * tag t'. Which means we would throw away the CE and send * ours instead. Basically this is case C (throw away side). */ if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 17; return (NULL); } switch (SCTP_GET_STATE(asoc)) { case SCTP_STATE_COOKIE_WAIT: case SCTP_STATE_COOKIE_ECHOED: /* * INIT was sent but got a COOKIE_ECHO with the * correct tags... just accept it...but we must * process the init so that we can make sure we have * the right seq no's. */ /* First we must process the INIT !! */ retval = sctp_process_init(init_cp, stcb); if (retval < 0) { if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 3; return (NULL); } /* we have already processed the INIT so no problem */ sctp_timer_stop(SCTP_TIMER_TYPE_HEARTBEAT, inp, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_13); sctp_timer_stop(SCTP_TIMER_TYPE_INIT, inp, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_14); /* update current state */ if (SCTP_GET_STATE(asoc) == SCTP_STATE_COOKIE_ECHOED) SCTP_STAT_INCR_COUNTER32(sctps_activeestab); else SCTP_STAT_INCR_COUNTER32(sctps_collisionestab); SCTP_SET_STATE(asoc, SCTP_STATE_OPEN); if (asoc->state & SCTP_STATE_SHUTDOWN_PENDING) { sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWNGUARD, stcb->sctp_ep, stcb, asoc->primary_destination); } SCTP_STAT_INCR_GAUGE32(sctps_currestab); sctp_stop_all_cookie_timers(stcb); if (((stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) || (stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL)) && (inp->sctp_socket->so_qlimit == 0) ) { #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif /* * Here is where collision would go if we * did a connect() and instead got a * init/init-ack/cookie done before the * init-ack came back.. */ stcb->sctp_ep->sctp_flags |= SCTP_PCB_FLAGS_CONNECTED; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(stcb->sctp_ep); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_add_int(&stcb->asoc.refcnt, -1); if (stcb->asoc.state & SCTP_STATE_CLOSED_SOCKET) { SCTP_SOCKET_UNLOCK(so, 1); return (NULL); } #endif soisconnected(stcb->sctp_socket); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif } /* notify upper layer */ *notification = SCTP_NOTIFY_ASSOC_UP; /* * since we did not send a HB make sure we don't * double things */ net->hb_responded = 1; net->RTO = sctp_calculate_rto(stcb, asoc, net, &cookie->time_entered, sctp_align_unsafe_makecopy, SCTP_RTT_FROM_NON_DATA); if (stcb->asoc.sctp_autoclose_ticks && (sctp_is_feature_on(inp, SCTP_PCB_FLAGS_AUTOCLOSE))) { sctp_timer_start(SCTP_TIMER_TYPE_AUTOCLOSE, inp, stcb, NULL); } break; default: /* * we're in the OPEN state (or beyond), so peer must * have simply lost the COOKIE-ACK */ break; } /* end switch */ sctp_stop_all_cookie_timers(stcb); /* * We ignore the return code here.. not sure if we should * somehow abort.. but we do have an existing asoc. This * really should not fail. */ if (sctp_load_addresses_from_init(stcb, m, init_offset + sizeof(struct sctp_init_chunk), initack_offset, src, dst, init_src, stcb->asoc.port)) { if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 4; return (NULL); } /* respond with a COOKIE-ACK */ sctp_toss_old_cookies(stcb, asoc); sctp_send_cookie_ack(stcb); if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 5; return (stcb); } if (ntohl(initack_cp->init.initiate_tag) != asoc->my_vtag && ntohl(init_cp->init.initiate_tag) == asoc->peer_vtag && cookie->tie_tag_my_vtag == 0 && cookie->tie_tag_peer_vtag == 0) { /* * case C in Section 5.2.4 Table 2: XMOO silently discard */ if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 6; return (NULL); } /* * If nat support, and the below and stcb is established, send back * a ABORT(colliding state) if we are established. */ if ((SCTP_GET_STATE(asoc) == SCTP_STATE_OPEN) && (asoc->peer_supports_nat) && ((ntohl(initack_cp->init.initiate_tag) == asoc->my_vtag) && ((ntohl(init_cp->init.initiate_tag) != asoc->peer_vtag) || (asoc->peer_vtag == 0)))) { /* * Special case - Peer's support nat. We may have two init's * that we gave out the same tag on since one was not * established.. i.e. we get INIT from host-1 behind the nat * and we respond tag-a, we get a INIT from host-2 behind * the nat and we get tag-a again. Then we bring up host-1 * (or 2's) assoc, Then comes the cookie from hsot-2 (or 1). * Now we have colliding state. We must send an abort here * with colliding state indication. */ op_err = sctp_generate_cause(SCTP_CAUSE_NAT_COLLIDING_STATE, ""); sctp_send_abort(m, iphlen, src, dst, sh, 0, op_err, mflowtype, mflowid, inp->fibnum, vrf_id, port); return (NULL); } if ((ntohl(initack_cp->init.initiate_tag) == asoc->my_vtag) && ((ntohl(init_cp->init.initiate_tag) != asoc->peer_vtag) || (asoc->peer_vtag == 0))) { /* * case B in Section 5.2.4 Table 2: MXAA or MOAA my info * should be ok, re-accept peer info */ if (ntohl(initack_cp->init.initial_tsn) != asoc->init_seq_number) { /* * Extension of case C. If we hit this, then the * random number generator returned the same vtag * when we first sent our INIT-ACK and when we later * sent our INIT. The side with the seq numbers that * are different will be the one that normnally * would have hit case C. This in effect "extends" * our vtags in this collision case to be 64 bits. * The same collision could occur aka you get both * vtag and seq number the same twice in a row.. but * is much less likely. If it did happen then we * would proceed through and bring up the assoc.. we * may end up with the wrong stream setup however.. * which would be bad.. but there is no way to * tell.. until we send on a stream that does not * exist :-) */ if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 7; return (NULL); } if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 8; sctp_timer_stop(SCTP_TIMER_TYPE_HEARTBEAT, inp, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_15); sctp_stop_all_cookie_timers(stcb); /* * since we did not send a HB make sure we don't double * things */ net->hb_responded = 1; if (stcb->asoc.sctp_autoclose_ticks && sctp_is_feature_on(inp, SCTP_PCB_FLAGS_AUTOCLOSE)) { sctp_timer_start(SCTP_TIMER_TYPE_AUTOCLOSE, inp, stcb, NULL); } asoc->my_rwnd = ntohl(initack_cp->init.a_rwnd); asoc->pre_open_streams = ntohs(initack_cp->init.num_outbound_streams); if (ntohl(init_cp->init.initiate_tag) != asoc->peer_vtag) { /* * Ok the peer probably discarded our data (if we * echoed a cookie+data). So anything on the * sent_queue should be marked for retransmit, we * may not get something to kick us so it COULD * still take a timeout to move these.. but it can't * hurt to mark them. */ struct sctp_tmit_chunk *chk; TAILQ_FOREACH(chk, &stcb->asoc.sent_queue, sctp_next) { if (chk->sent < SCTP_DATAGRAM_RESEND) { chk->sent = SCTP_DATAGRAM_RESEND; sctp_flight_size_decrease(chk); sctp_total_flight_decrease(stcb, chk); sctp_ucount_incr(stcb->asoc.sent_queue_retran_cnt); spec_flag++; } } } /* process the INIT info (peer's info) */ retval = sctp_process_init(init_cp, stcb); if (retval < 0) { if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 9; return (NULL); } if (sctp_load_addresses_from_init(stcb, m, init_offset + sizeof(struct sctp_init_chunk), initack_offset, src, dst, init_src, stcb->asoc.port)) { if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 10; return (NULL); } if ((asoc->state & SCTP_STATE_COOKIE_WAIT) || (asoc->state & SCTP_STATE_COOKIE_ECHOED)) { *notification = SCTP_NOTIFY_ASSOC_UP; if (((stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) || (stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL)) && (inp->sctp_socket->so_qlimit == 0)) { #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif stcb->sctp_ep->sctp_flags |= SCTP_PCB_FLAGS_CONNECTED; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(stcb->sctp_ep); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_add_int(&stcb->asoc.refcnt, -1); if (stcb->asoc.state & SCTP_STATE_CLOSED_SOCKET) { SCTP_SOCKET_UNLOCK(so, 1); return (NULL); } #endif soisconnected(stcb->sctp_socket); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif } if (SCTP_GET_STATE(asoc) == SCTP_STATE_COOKIE_ECHOED) SCTP_STAT_INCR_COUNTER32(sctps_activeestab); else SCTP_STAT_INCR_COUNTER32(sctps_collisionestab); SCTP_STAT_INCR_GAUGE32(sctps_currestab); } else if (SCTP_GET_STATE(asoc) == SCTP_STATE_OPEN) { SCTP_STAT_INCR_COUNTER32(sctps_restartestab); } else { SCTP_STAT_INCR_COUNTER32(sctps_collisionestab); } SCTP_SET_STATE(asoc, SCTP_STATE_OPEN); if (asoc->state & SCTP_STATE_SHUTDOWN_PENDING) { sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWNGUARD, stcb->sctp_ep, stcb, asoc->primary_destination); } sctp_stop_all_cookie_timers(stcb); sctp_toss_old_cookies(stcb, asoc); sctp_send_cookie_ack(stcb); if (spec_flag) { /* * only if we have retrans set do we do this. What * this call does is get only the COOKIE-ACK out and * then when we return the normal call to * sctp_chunk_output will get the retrans out behind * this. */ sctp_chunk_output(inp, stcb, SCTP_OUTPUT_FROM_COOKIE_ACK, SCTP_SO_NOT_LOCKED); } if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 11; return (stcb); } if ((ntohl(initack_cp->init.initiate_tag) != asoc->my_vtag && ntohl(init_cp->init.initiate_tag) != asoc->peer_vtag) && cookie->tie_tag_my_vtag == asoc->my_vtag_nonce && cookie->tie_tag_peer_vtag == asoc->peer_vtag_nonce && cookie->tie_tag_peer_vtag != 0) { struct sctpasochead *head; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif if (asoc->peer_supports_nat) { /* * This is a gross gross hack. Just call the * cookie_new code since we are allowing a duplicate * association. I hope this works... */ return (sctp_process_cookie_new(m, iphlen, offset, src, dst, sh, cookie, cookie_len, inp, netp, init_src, notification, auth_skipped, auth_offset, auth_len, mflowtype, mflowid, vrf_id, port)); } /* * case A in Section 5.2.4 Table 2: XXMM (peer restarted) */ /* temp code */ if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 12; sctp_timer_stop(SCTP_TIMER_TYPE_INIT, inp, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_16); sctp_timer_stop(SCTP_TIMER_TYPE_HEARTBEAT, inp, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_17); /* notify upper layer */ *notification = SCTP_NOTIFY_ASSOC_RESTART; atomic_add_int(&stcb->asoc.refcnt, 1); if ((SCTP_GET_STATE(asoc) != SCTP_STATE_OPEN) && (SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_RECEIVED) && (SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_SENT)) { SCTP_STAT_INCR_GAUGE32(sctps_currestab); } if (SCTP_GET_STATE(asoc) == SCTP_STATE_OPEN) { SCTP_STAT_INCR_GAUGE32(sctps_restartestab); } else if (SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_SENT) { SCTP_STAT_INCR_GAUGE32(sctps_collisionestab); } if (asoc->state & SCTP_STATE_SHUTDOWN_PENDING) { SCTP_SET_STATE(asoc, SCTP_STATE_OPEN); sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWNGUARD, stcb->sctp_ep, stcb, asoc->primary_destination); } else if (!(asoc->state & SCTP_STATE_SHUTDOWN_SENT)) { /* move to OPEN state, if not in SHUTDOWN_SENT */ SCTP_SET_STATE(asoc, SCTP_STATE_OPEN); } asoc->pre_open_streams = ntohs(initack_cp->init.num_outbound_streams); asoc->init_seq_number = ntohl(initack_cp->init.initial_tsn); asoc->sending_seq = asoc->asconf_seq_out = asoc->str_reset_seq_out = asoc->init_seq_number; asoc->asconf_seq_out_acked = asoc->asconf_seq_out - 1; asoc->asconf_seq_in = asoc->last_acked_seq = asoc->init_seq_number - 1; asoc->str_reset_seq_in = asoc->init_seq_number; asoc->advanced_peer_ack_point = asoc->last_acked_seq; if (asoc->mapping_array) { memset(asoc->mapping_array, 0, asoc->mapping_array_size); } if (asoc->nr_mapping_array) { memset(asoc->nr_mapping_array, 0, asoc->mapping_array_size); } SCTP_TCB_UNLOCK(stcb); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(stcb->sctp_ep); SCTP_SOCKET_LOCK(so, 1); #endif SCTP_INP_INFO_WLOCK(); SCTP_INP_WLOCK(stcb->sctp_ep); SCTP_TCB_LOCK(stcb); atomic_add_int(&stcb->asoc.refcnt, -1); /* send up all the data */ SCTP_TCB_SEND_LOCK(stcb); sctp_report_all_outbound(stcb, 0, 1, SCTP_SO_LOCKED); for (i = 0; i < stcb->asoc.streamoutcnt; i++) { stcb->asoc.strmout[i].chunks_on_queues = 0; #if defined(SCTP_DETAILED_STR_STATS) for (j = 0; j < SCTP_PR_SCTP_MAX + 1; j++) { asoc->strmout[i].abandoned_sent[j] = 0; asoc->strmout[i].abandoned_unsent[j] = 0; } #else asoc->strmout[i].abandoned_sent[0] = 0; asoc->strmout[i].abandoned_unsent[0] = 0; #endif stcb->asoc.strmout[i].stream_no = i; stcb->asoc.strmout[i].next_sequence_send = 0; stcb->asoc.strmout[i].last_msg_incomplete = 0; } /* process the INIT-ACK info (my info) */ asoc->my_vtag = ntohl(initack_cp->init.initiate_tag); asoc->my_rwnd = ntohl(initack_cp->init.a_rwnd); /* pull from vtag hash */ LIST_REMOVE(stcb, sctp_asocs); /* re-insert to new vtag position */ head = &SCTP_BASE_INFO(sctp_asochash)[SCTP_PCBHASH_ASOC(stcb->asoc.my_vtag, SCTP_BASE_INFO(hashasocmark))]; /* * put it in the bucket in the vtag hash of assoc's for the * system */ LIST_INSERT_HEAD(head, stcb, sctp_asocs); SCTP_TCB_SEND_UNLOCK(stcb); SCTP_INP_WUNLOCK(stcb->sctp_ep); SCTP_INP_INFO_WUNLOCK(); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif asoc->total_flight = 0; asoc->total_flight_count = 0; /* process the INIT info (peer's info) */ retval = sctp_process_init(init_cp, stcb); if (retval < 0) { if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 13; return (NULL); } /* * since we did not send a HB make sure we don't double * things */ net->hb_responded = 1; if (sctp_load_addresses_from_init(stcb, m, init_offset + sizeof(struct sctp_init_chunk), initack_offset, src, dst, init_src, stcb->asoc.port)) { if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 14; return (NULL); } /* respond with a COOKIE-ACK */ sctp_stop_all_cookie_timers(stcb); sctp_toss_old_cookies(stcb, asoc); sctp_send_cookie_ack(stcb); if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 15; return (stcb); } if (how_indx < sizeof(asoc->cookie_how)) asoc->cookie_how[how_indx] = 16; /* all other cases... */ return (NULL); } /* * handle a state cookie for a new association m: input packet mbuf chain-- * assumes a pullup on IP/SCTP/COOKIE-ECHO chunk note: this is a "split" mbuf * and the cookie signature does not exist offset: offset into mbuf to the * cookie-echo chunk length: length of the cookie chunk to: where the init * was from returns a new TCB */ static struct sctp_tcb * sctp_process_cookie_new(struct mbuf *m, int iphlen, int offset, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_state_cookie *cookie, int cookie_len, struct sctp_inpcb *inp, struct sctp_nets **netp, struct sockaddr *init_src, int *notification, int auth_skipped, uint32_t auth_offset, uint32_t auth_len, uint8_t mflowtype, uint32_t mflowid, uint32_t vrf_id, uint16_t port) { struct sctp_tcb *stcb; struct sctp_init_chunk *init_cp, init_buf; struct sctp_init_ack_chunk *initack_cp, initack_buf; union sctp_sockstore store; struct sctp_association *asoc; int init_offset, initack_offset, initack_limit; int retval; int error = 0; uint8_t auth_chunk_buf[SCTP_PARAM_BUFFER_SIZE]; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; so = SCTP_INP_SO(inp); #endif /* * find and validate the INIT chunk in the cookie (peer's info) the * INIT should start after the cookie-echo header struct (chunk * header, state cookie header struct) */ init_offset = offset + sizeof(struct sctp_cookie_echo_chunk); init_cp = (struct sctp_init_chunk *) sctp_m_getptr(m, init_offset, sizeof(struct sctp_init_chunk), (uint8_t *) & init_buf); if (init_cp == NULL) { /* could not pull a INIT chunk in cookie */ SCTPDBG(SCTP_DEBUG_INPUT1, "process_cookie_new: could not pull INIT chunk hdr\n"); return (NULL); } if (init_cp->ch.chunk_type != SCTP_INITIATION) { SCTPDBG(SCTP_DEBUG_INPUT1, "HUH? process_cookie_new: could not find INIT chunk!\n"); return (NULL); } initack_offset = init_offset + SCTP_SIZE32(ntohs(init_cp->ch.chunk_length)); /* * find and validate the INIT-ACK chunk in the cookie (my info) the * INIT-ACK follows the INIT chunk */ initack_cp = (struct sctp_init_ack_chunk *) sctp_m_getptr(m, initack_offset, sizeof(struct sctp_init_ack_chunk), (uint8_t *) & initack_buf); if (initack_cp == NULL) { /* could not pull INIT-ACK chunk in cookie */ SCTPDBG(SCTP_DEBUG_INPUT1, "process_cookie_new: could not pull INIT-ACK chunk hdr\n"); return (NULL); } if (initack_cp->ch.chunk_type != SCTP_INITIATION_ACK) { return (NULL); } /* * NOTE: We can't use the INIT_ACK's chk_length to determine the * "initack_limit" value. This is because the chk_length field * includes the length of the cookie, but the cookie is omitted when * the INIT and INIT_ACK are tacked onto the cookie... */ initack_limit = offset + cookie_len; /* * now that we know the INIT/INIT-ACK are in place, create a new TCB * and popluate */ /* * Here we do a trick, we set in NULL for the proc/thread argument. * We do this since in effect we only use the p argument when the * socket is unbound and we must do an implicit bind. Since we are * getting a cookie, we cannot be unbound. */ stcb = sctp_aloc_assoc(inp, init_src, &error, ntohl(initack_cp->init.initiate_tag), vrf_id, ntohs(initack_cp->init.num_outbound_streams), port, (struct thread *)NULL ); if (stcb == NULL) { struct mbuf *op_err; /* memory problem? */ SCTPDBG(SCTP_DEBUG_INPUT1, "process_cookie_new: no room for another TCB!\n"); op_err = sctp_generate_cause(SCTP_CAUSE_OUT_OF_RESC, ""); sctp_abort_association(inp, (struct sctp_tcb *)NULL, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); return (NULL); } /* get the correct sctp_nets */ if (netp) *netp = sctp_findnet(stcb, init_src); asoc = &stcb->asoc; /* get scope variables out of cookie */ asoc->scope.ipv4_local_scope = cookie->ipv4_scope; asoc->scope.site_scope = cookie->site_scope; asoc->scope.local_scope = cookie->local_scope; asoc->scope.loopback_scope = cookie->loopback_scope; if ((asoc->scope.ipv4_addr_legal != cookie->ipv4_addr_legal) || (asoc->scope.ipv6_addr_legal != cookie->ipv6_addr_legal)) { struct mbuf *op_err; /* * Houston we have a problem. The EP changed while the * cookie was in flight. Only recourse is to abort the * association. */ atomic_add_int(&stcb->asoc.refcnt, 1); op_err = sctp_generate_cause(SCTP_CAUSE_OUT_OF_RESC, ""); sctp_abort_association(inp, (struct sctp_tcb *)NULL, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); #endif (void)sctp_free_assoc(inp, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_18); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif atomic_subtract_int(&stcb->asoc.refcnt, 1); return (NULL); } /* process the INIT-ACK info (my info) */ asoc->my_vtag = ntohl(initack_cp->init.initiate_tag); asoc->my_rwnd = ntohl(initack_cp->init.a_rwnd); asoc->pre_open_streams = ntohs(initack_cp->init.num_outbound_streams); asoc->init_seq_number = ntohl(initack_cp->init.initial_tsn); asoc->sending_seq = asoc->asconf_seq_out = asoc->str_reset_seq_out = asoc->init_seq_number; asoc->asconf_seq_out_acked = asoc->asconf_seq_out - 1; asoc->asconf_seq_in = asoc->last_acked_seq = asoc->init_seq_number - 1; asoc->str_reset_seq_in = asoc->init_seq_number; asoc->advanced_peer_ack_point = asoc->last_acked_seq; /* process the INIT info (peer's info) */ if (netp) retval = sctp_process_init(init_cp, stcb); else retval = 0; if (retval < 0) { atomic_add_int(&stcb->asoc.refcnt, 1); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); #endif (void)sctp_free_assoc(inp, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_19); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif atomic_subtract_int(&stcb->asoc.refcnt, 1); return (NULL); } /* load all addresses */ if (sctp_load_addresses_from_init(stcb, m, init_offset + sizeof(struct sctp_init_chunk), initack_offset, src, dst, init_src, port)) { atomic_add_int(&stcb->asoc.refcnt, 1); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); #endif (void)sctp_free_assoc(inp, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_20); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif atomic_subtract_int(&stcb->asoc.refcnt, 1); return (NULL); } /* * verify any preceding AUTH chunk that was skipped */ /* pull the local authentication parameters from the cookie/init-ack */ sctp_auth_get_cookie_params(stcb, m, initack_offset + sizeof(struct sctp_init_ack_chunk), initack_limit - (initack_offset + sizeof(struct sctp_init_ack_chunk))); if (auth_skipped) { struct sctp_auth_chunk *auth; auth = (struct sctp_auth_chunk *) sctp_m_getptr(m, auth_offset, auth_len, auth_chunk_buf); if ((auth == NULL) || sctp_handle_auth(stcb, auth, m, auth_offset)) { /* auth HMAC failed, dump the assoc and packet */ SCTPDBG(SCTP_DEBUG_AUTH1, "COOKIE-ECHO: AUTH failed\n"); atomic_add_int(&stcb->asoc.refcnt, 1); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); #endif (void)sctp_free_assoc(inp, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_21); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif atomic_subtract_int(&stcb->asoc.refcnt, 1); return (NULL); } else { /* remaining chunks checked... good to go */ stcb->asoc.authenticated = 1; } } /* update current state */ SCTPDBG(SCTP_DEBUG_INPUT2, "moving to OPEN state\n"); SCTP_SET_STATE(asoc, SCTP_STATE_OPEN); if (asoc->state & SCTP_STATE_SHUTDOWN_PENDING) { sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWNGUARD, stcb->sctp_ep, stcb, asoc->primary_destination); } sctp_stop_all_cookie_timers(stcb); SCTP_STAT_INCR_COUNTER32(sctps_passiveestab); SCTP_STAT_INCR_GAUGE32(sctps_currestab); /* * if we're doing ASCONFs, check to see if we have any new local * addresses that need to get added to the peer (eg. addresses * changed while cookie echo in flight). This needs to be done * after we go to the OPEN state to do the correct asconf * processing. else, make sure we have the correct addresses in our * lists */ /* warning, we re-use sin, sin6, sa_store here! */ /* pull in local_address (our "from" address) */ switch (cookie->laddr_type) { #ifdef INET case SCTP_IPV4_ADDRESS: /* source addr is IPv4 */ memset(&store.sin, 0, sizeof(struct sockaddr_in)); store.sin.sin_family = AF_INET; store.sin.sin_len = sizeof(struct sockaddr_in); store.sin.sin_addr.s_addr = cookie->laddress[0]; break; #endif #ifdef INET6 case SCTP_IPV6_ADDRESS: /* source addr is IPv6 */ memset(&store.sin6, 0, sizeof(struct sockaddr_in6)); store.sin6.sin6_family = AF_INET6; store.sin6.sin6_len = sizeof(struct sockaddr_in6); store.sin6.sin6_scope_id = cookie->scope_id; memcpy(&store.sin6.sin6_addr, cookie->laddress, sizeof(struct in6_addr)); break; #endif default: atomic_add_int(&stcb->asoc.refcnt, 1); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); #endif (void)sctp_free_assoc(inp, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_22); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif atomic_subtract_int(&stcb->asoc.refcnt, 1); return (NULL); } /* set up to notify upper layer */ *notification = SCTP_NOTIFY_ASSOC_UP; if (((stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) || (stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL)) && (inp->sctp_socket->so_qlimit == 0)) { /* * This is an endpoint that called connect() how it got a * cookie that is NEW is a bit of a mystery. It must be that * the INIT was sent, but before it got there.. a complete * INIT/INIT-ACK/COOKIE arrived. But of course then it * should have went to the other code.. not here.. oh well.. * a bit of protection is worth having.. */ stcb->sctp_ep->sctp_flags |= SCTP_PCB_FLAGS_CONNECTED; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); if (stcb->asoc.state & SCTP_STATE_CLOSED_SOCKET) { SCTP_SOCKET_UNLOCK(so, 1); return (NULL); } #endif soisconnected(stcb->sctp_socket); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif } else if ((stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) && (inp->sctp_socket->so_qlimit)) { /* * We don't want to do anything with this one. Since it is * the listening guy. The timer will get started for * accepted connections in the caller. */ ; } /* since we did not send a HB make sure we don't double things */ if ((netp) && (*netp)) (*netp)->hb_responded = 1; if (stcb->asoc.sctp_autoclose_ticks && sctp_is_feature_on(inp, SCTP_PCB_FLAGS_AUTOCLOSE)) { sctp_timer_start(SCTP_TIMER_TYPE_AUTOCLOSE, inp, stcb, NULL); } (void)SCTP_GETTIME_TIMEVAL(&stcb->asoc.time_entered); if ((netp != NULL) && (*netp != NULL)) { /* calculate the RTT and set the encaps port */ (*netp)->RTO = sctp_calculate_rto(stcb, asoc, *netp, &cookie->time_entered, sctp_align_unsafe_makecopy, SCTP_RTT_FROM_NON_DATA); } /* respond with a COOKIE-ACK */ sctp_send_cookie_ack(stcb); /* * check the address lists for any ASCONFs that need to be sent * AFTER the cookie-ack is sent */ sctp_check_address_list(stcb, m, initack_offset + sizeof(struct sctp_init_ack_chunk), initack_limit - (initack_offset + sizeof(struct sctp_init_ack_chunk)), &store.sa, cookie->local_scope, cookie->site_scope, cookie->ipv4_scope, cookie->loopback_scope); return (stcb); } /* * CODE LIKE THIS NEEDS TO RUN IF the peer supports the NAT extension, i.e * we NEED to make sure we are not already using the vtag. If so we * need to send back an ABORT-TRY-AGAIN-WITH-NEW-TAG No middle box bit! head = &SCTP_BASE_INFO(sctp_asochash)[SCTP_PCBHASH_ASOC(tag, SCTP_BASE_INFO(hashasocmark))]; LIST_FOREACH(stcb, head, sctp_asocs) { if ((stcb->asoc.my_vtag == tag) && (stcb->rport == rport) && (inp == stcb->sctp_ep)) { -- SEND ABORT - TRY AGAIN -- } } */ /* * handles a COOKIE-ECHO message stcb: modified to either a new or left as * existing (non-NULL) TCB */ static struct mbuf * sctp_handle_cookie_echo(struct mbuf *m, int iphlen, int offset, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_cookie_echo_chunk *cp, struct sctp_inpcb **inp_p, struct sctp_tcb **stcb, struct sctp_nets **netp, int auth_skipped, uint32_t auth_offset, uint32_t auth_len, struct sctp_tcb **locked_tcb, uint8_t mflowtype, uint32_t mflowid, uint32_t vrf_id, uint16_t port) { struct sctp_state_cookie *cookie; struct sctp_tcb *l_stcb = *stcb; struct sctp_inpcb *l_inp; struct sockaddr *to; struct sctp_pcb *ep; struct mbuf *m_sig; uint8_t calc_sig[SCTP_SIGNATURE_SIZE], tmp_sig[SCTP_SIGNATURE_SIZE]; uint8_t *sig; uint8_t cookie_ok = 0; unsigned int sig_offset, cookie_offset; unsigned int cookie_len; struct timeval now; struct timeval time_expires; int notification = 0; struct sctp_nets *netl; int had_a_existing_tcb = 0; int send_int_conf = 0; #ifdef INET struct sockaddr_in sin; #endif #ifdef INET6 struct sockaddr_in6 sin6; #endif SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_cookie: handling COOKIE-ECHO\n"); if (inp_p == NULL) { return (NULL); } cookie = &cp->cookie; cookie_offset = offset + sizeof(struct sctp_chunkhdr); cookie_len = ntohs(cp->ch.chunk_length); if ((cookie->peerport != sh->src_port) || (cookie->myport != sh->dest_port) || (cookie->my_vtag != sh->v_tag)) { /* * invalid ports or bad tag. Note that we always leave the * v_tag in the header in network order and when we stored * it in the my_vtag slot we also left it in network order. * This maintains the match even though it may be in the * opposite byte order of the machine :-> */ return (NULL); } if (cookie_len < sizeof(struct sctp_cookie_echo_chunk) + sizeof(struct sctp_init_chunk) + sizeof(struct sctp_init_ack_chunk) + SCTP_SIGNATURE_SIZE) { /* cookie too small */ return (NULL); } /* * split off the signature into its own mbuf (since it should not be * calculated in the sctp_hmac_m() call). */ sig_offset = offset + cookie_len - SCTP_SIGNATURE_SIZE; m_sig = m_split(m, sig_offset, M_NOWAIT); if (m_sig == NULL) { /* out of memory or ?? */ return (NULL); } #ifdef SCTP_MBUF_LOGGING if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_MBUF_LOGGING_ENABLE) { sctp_log_mbc(m_sig, SCTP_MBUF_SPLIT); } #endif /* * compute the signature/digest for the cookie */ ep = &(*inp_p)->sctp_ep; l_inp = *inp_p; if (l_stcb) { SCTP_TCB_UNLOCK(l_stcb); } SCTP_INP_RLOCK(l_inp); if (l_stcb) { SCTP_TCB_LOCK(l_stcb); } /* which cookie is it? */ if ((cookie->time_entered.tv_sec < (long)ep->time_of_secret_change) && (ep->current_secret_number != ep->last_secret_number)) { /* it's the old cookie */ (void)sctp_hmac_m(SCTP_HMAC, (uint8_t *) ep->secret_key[(int)ep->last_secret_number], SCTP_SECRET_SIZE, m, cookie_offset, calc_sig, 0); } else { /* it's the current cookie */ (void)sctp_hmac_m(SCTP_HMAC, (uint8_t *) ep->secret_key[(int)ep->current_secret_number], SCTP_SECRET_SIZE, m, cookie_offset, calc_sig, 0); } /* get the signature */ SCTP_INP_RUNLOCK(l_inp); sig = (uint8_t *) sctp_m_getptr(m_sig, 0, SCTP_SIGNATURE_SIZE, (uint8_t *) & tmp_sig); if (sig == NULL) { /* couldn't find signature */ sctp_m_freem(m_sig); return (NULL); } /* compare the received digest with the computed digest */ if (memcmp(calc_sig, sig, SCTP_SIGNATURE_SIZE) != 0) { /* try the old cookie? */ if ((cookie->time_entered.tv_sec == (long)ep->time_of_secret_change) && (ep->current_secret_number != ep->last_secret_number)) { /* compute digest with old */ (void)sctp_hmac_m(SCTP_HMAC, (uint8_t *) ep->secret_key[(int)ep->last_secret_number], SCTP_SECRET_SIZE, m, cookie_offset, calc_sig, 0); /* compare */ if (memcmp(calc_sig, sig, SCTP_SIGNATURE_SIZE) == 0) cookie_ok = 1; } } else { cookie_ok = 1; } /* * Now before we continue we must reconstruct our mbuf so that * normal processing of any other chunks will work. */ { struct mbuf *m_at; m_at = m; while (SCTP_BUF_NEXT(m_at) != NULL) { m_at = SCTP_BUF_NEXT(m_at); } SCTP_BUF_NEXT(m_at) = m_sig; } if (cookie_ok == 0) { SCTPDBG(SCTP_DEBUG_INPUT2, "handle_cookie_echo: cookie signature validation failed!\n"); SCTPDBG(SCTP_DEBUG_INPUT2, "offset = %u, cookie_offset = %u, sig_offset = %u\n", (uint32_t) offset, cookie_offset, sig_offset); return (NULL); } /* * check the cookie timestamps to be sure it's not stale */ (void)SCTP_GETTIME_TIMEVAL(&now); /* Expire time is in Ticks, so we convert to seconds */ time_expires.tv_sec = cookie->time_entered.tv_sec + TICKS_TO_SEC(cookie->cookie_life); time_expires.tv_usec = cookie->time_entered.tv_usec; /* * TODO sctp_constants.h needs alternative time macros when _KERNEL * is undefined. */ if (timevalcmp(&now, &time_expires, >)) { /* cookie is stale! */ struct mbuf *op_err; struct sctp_error_stale_cookie *cause; uint32_t tim; op_err = sctp_get_mbuf_for_msg(sizeof(struct sctp_error_stale_cookie), 0, M_NOWAIT, 1, MT_DATA); if (op_err == NULL) { /* FOOBAR */ return (NULL); } /* Set the len */ SCTP_BUF_LEN(op_err) = sizeof(struct sctp_error_stale_cookie); cause = mtod(op_err, struct sctp_error_stale_cookie *); cause->cause.code = htons(SCTP_CAUSE_STALE_COOKIE); cause->cause.length = htons((sizeof(struct sctp_paramhdr) + (sizeof(uint32_t)))); /* seconds to usec */ tim = (now.tv_sec - time_expires.tv_sec) * 1000000; /* add in usec */ if (tim == 0) tim = now.tv_usec - cookie->time_entered.tv_usec; cause->stale_time = htonl(tim); sctp_send_operr_to(src, dst, sh, cookie->peers_vtag, op_err, mflowtype, mflowid, l_inp->fibnum, vrf_id, port); return (NULL); } /* * Now we must see with the lookup address if we have an existing * asoc. This will only happen if we were in the COOKIE-WAIT state * and a INIT collided with us and somewhere the peer sent the * cookie on another address besides the single address our assoc * had for him. In this case we will have one of the tie-tags set at * least AND the address field in the cookie can be used to look it * up. */ to = NULL; switch (cookie->addr_type) { #ifdef INET6 case SCTP_IPV6_ADDRESS: memset(&sin6, 0, sizeof(sin6)); sin6.sin6_family = AF_INET6; sin6.sin6_len = sizeof(sin6); sin6.sin6_port = sh->src_port; sin6.sin6_scope_id = cookie->scope_id; memcpy(&sin6.sin6_addr.s6_addr, cookie->address, sizeof(sin6.sin6_addr.s6_addr)); to = (struct sockaddr *)&sin6; break; #endif #ifdef INET case SCTP_IPV4_ADDRESS: memset(&sin, 0, sizeof(sin)); sin.sin_family = AF_INET; sin.sin_len = sizeof(sin); sin.sin_port = sh->src_port; sin.sin_addr.s_addr = cookie->address[0]; to = (struct sockaddr *)&sin; break; #endif default: /* This should not happen */ return (NULL); } if (*stcb == NULL) { /* Yep, lets check */ *stcb = sctp_findassociation_ep_addr(inp_p, to, netp, dst, NULL); if (*stcb == NULL) { /* * We should have only got back the same inp. If we * got back a different ep we have a problem. The * original findep got back l_inp and now */ if (l_inp != *inp_p) { SCTP_PRINTF("Bad problem find_ep got a diff inp then special_locate?\n"); } } else { if (*locked_tcb == NULL) { /* * In this case we found the assoc only * after we locked the create lock. This * means we are in a colliding case and we * must make sure that we unlock the tcb if * its one of the cases where we throw away * the incoming packets. */ *locked_tcb = *stcb; /* * We must also increment the inp ref count * since the ref_count flags was set when we * did not find the TCB, now we found it * which reduces the refcount.. we must * raise it back out to balance it all :-) */ SCTP_INP_INCR_REF((*stcb)->sctp_ep); if ((*stcb)->sctp_ep != l_inp) { SCTP_PRINTF("Huh? ep:%p diff then l_inp:%p?\n", (void *)(*stcb)->sctp_ep, (void *)l_inp); } } } } cookie_len -= SCTP_SIGNATURE_SIZE; if (*stcb == NULL) { /* this is the "normal" case... get a new TCB */ *stcb = sctp_process_cookie_new(m, iphlen, offset, src, dst, sh, cookie, cookie_len, *inp_p, netp, to, ¬ification, auth_skipped, auth_offset, auth_len, mflowtype, mflowid, vrf_id, port); } else { /* this is abnormal... cookie-echo on existing TCB */ had_a_existing_tcb = 1; *stcb = sctp_process_cookie_existing(m, iphlen, offset, src, dst, sh, cookie, cookie_len, *inp_p, *stcb, netp, to, ¬ification, auth_skipped, auth_offset, auth_len, mflowtype, mflowid, vrf_id, port); } if (*stcb == NULL) { /* still no TCB... must be bad cookie-echo */ return (NULL); } if (*netp != NULL) { (*netp)->flowtype = mflowtype; (*netp)->flowid = mflowid; } /* * Ok, we built an association so confirm the address we sent the * INIT-ACK to. */ netl = sctp_findnet(*stcb, to); /* * This code should in theory NOT run but */ if (netl == NULL) { /* TSNH! Huh, why do I need to add this address here? */ if (sctp_add_remote_addr(*stcb, to, NULL, port, SCTP_DONOT_SETSCOPE, SCTP_IN_COOKIE_PROC)) { return (NULL); } netl = sctp_findnet(*stcb, to); } if (netl) { if (netl->dest_state & SCTP_ADDR_UNCONFIRMED) { netl->dest_state &= ~SCTP_ADDR_UNCONFIRMED; (void)sctp_set_primary_addr((*stcb), (struct sockaddr *)NULL, netl); send_int_conf = 1; } } sctp_start_net_timers(*stcb); if ((*inp_p)->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) { if (!had_a_existing_tcb || (((*inp_p)->sctp_flags & SCTP_PCB_FLAGS_CONNECTED) == 0)) { /* * If we have a NEW cookie or the connect never * reached the connected state during collision we * must do the TCP accept thing. */ struct socket *so, *oso; struct sctp_inpcb *inp; if (notification == SCTP_NOTIFY_ASSOC_RESTART) { /* * For a restart we will keep the same * socket, no need to do anything. I THINK!! */ sctp_ulp_notify(notification, *stcb, 0, NULL, SCTP_SO_NOT_LOCKED); if (send_int_conf) { sctp_ulp_notify(SCTP_NOTIFY_INTERFACE_CONFIRMED, (*stcb), 0, (void *)netl, SCTP_SO_NOT_LOCKED); } return (m); } oso = (*inp_p)->sctp_socket; atomic_add_int(&(*stcb)->asoc.refcnt, 1); SCTP_TCB_UNLOCK((*stcb)); CURVNET_SET(oso->so_vnet); so = sonewconn(oso, 0 ); CURVNET_RESTORE(); SCTP_TCB_LOCK((*stcb)); atomic_subtract_int(&(*stcb)->asoc.refcnt, 1); if (so == NULL) { struct mbuf *op_err; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *pcb_so; #endif /* Too many sockets */ SCTPDBG(SCTP_DEBUG_INPUT1, "process_cookie_new: no room for another socket!\n"); op_err = sctp_generate_cause(SCTP_CAUSE_OUT_OF_RESC, ""); sctp_abort_association(*inp_p, NULL, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) pcb_so = SCTP_INP_SO(*inp_p); atomic_add_int(&(*stcb)->asoc.refcnt, 1); SCTP_TCB_UNLOCK((*stcb)); SCTP_SOCKET_LOCK(pcb_so, 1); SCTP_TCB_LOCK((*stcb)); atomic_subtract_int(&(*stcb)->asoc.refcnt, 1); #endif (void)sctp_free_assoc(*inp_p, *stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_23); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(pcb_so, 1); #endif return (NULL); } inp = (struct sctp_inpcb *)so->so_pcb; SCTP_INP_INCR_REF(inp); /* * We add the unbound flag here so that if we get an * soabort() before we get the move_pcb done, we * will properly cleanup. */ inp->sctp_flags = (SCTP_PCB_FLAGS_TCPTYPE | SCTP_PCB_FLAGS_CONNECTED | SCTP_PCB_FLAGS_IN_TCPPOOL | SCTP_PCB_FLAGS_UNBOUND | (SCTP_PCB_COPY_FLAGS & (*inp_p)->sctp_flags) | SCTP_PCB_FLAGS_DONT_WAKE); inp->sctp_features = (*inp_p)->sctp_features; inp->sctp_mobility_features = (*inp_p)->sctp_mobility_features; inp->sctp_socket = so; inp->sctp_frag_point = (*inp_p)->sctp_frag_point; inp->max_cwnd = (*inp_p)->max_cwnd; inp->sctp_cmt_on_off = (*inp_p)->sctp_cmt_on_off; inp->ecn_supported = (*inp_p)->ecn_supported; inp->prsctp_supported = (*inp_p)->prsctp_supported; inp->auth_supported = (*inp_p)->auth_supported; inp->asconf_supported = (*inp_p)->asconf_supported; inp->reconfig_supported = (*inp_p)->reconfig_supported; inp->nrsack_supported = (*inp_p)->nrsack_supported; inp->pktdrop_supported = (*inp_p)->pktdrop_supported; inp->partial_delivery_point = (*inp_p)->partial_delivery_point; inp->sctp_context = (*inp_p)->sctp_context; inp->local_strreset_support = (*inp_p)->local_strreset_support; inp->fibnum = (*inp_p)->fibnum; inp->inp_starting_point_for_iterator = NULL; /* * copy in the authentication parameters from the * original endpoint */ if (inp->sctp_ep.local_hmacs) sctp_free_hmaclist(inp->sctp_ep.local_hmacs); inp->sctp_ep.local_hmacs = sctp_copy_hmaclist((*inp_p)->sctp_ep.local_hmacs); if (inp->sctp_ep.local_auth_chunks) sctp_free_chunklist(inp->sctp_ep.local_auth_chunks); inp->sctp_ep.local_auth_chunks = sctp_copy_chunklist((*inp_p)->sctp_ep.local_auth_chunks); /* * Now we must move it from one hash table to * another and get the tcb in the right place. */ /* * This is where the one-2-one socket is put into * the accept state waiting for the accept! */ if (*stcb) { (*stcb)->asoc.state |= SCTP_STATE_IN_ACCEPT_QUEUE; } sctp_move_pcb_and_assoc(*inp_p, inp, *stcb); atomic_add_int(&(*stcb)->asoc.refcnt, 1); SCTP_TCB_UNLOCK((*stcb)); sctp_pull_off_control_to_new_inp((*inp_p), inp, *stcb, 0); SCTP_TCB_LOCK((*stcb)); atomic_subtract_int(&(*stcb)->asoc.refcnt, 1); /* * now we must check to see if we were aborted while * the move was going on and the lock/unlock * happened. */ if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) { /* * yep it was, we leave the assoc attached * to the socket since the sctp_inpcb_free() * call will send an abort for us. */ SCTP_INP_DECR_REF(inp); return (NULL); } SCTP_INP_DECR_REF(inp); /* Switch over to the new guy */ *inp_p = inp; sctp_ulp_notify(notification, *stcb, 0, NULL, SCTP_SO_NOT_LOCKED); if (send_int_conf) { sctp_ulp_notify(SCTP_NOTIFY_INTERFACE_CONFIRMED, (*stcb), 0, (void *)netl, SCTP_SO_NOT_LOCKED); } /* * Pull it from the incomplete queue and wake the * guy */ #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) atomic_add_int(&(*stcb)->asoc.refcnt, 1); SCTP_TCB_UNLOCK((*stcb)); SCTP_SOCKET_LOCK(so, 1); #endif soisconnected(so); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_TCB_LOCK((*stcb)); atomic_subtract_int(&(*stcb)->asoc.refcnt, 1); SCTP_SOCKET_UNLOCK(so, 1); #endif return (m); } } if (notification) { sctp_ulp_notify(notification, *stcb, 0, NULL, SCTP_SO_NOT_LOCKED); } if (send_int_conf) { sctp_ulp_notify(SCTP_NOTIFY_INTERFACE_CONFIRMED, (*stcb), 0, (void *)netl, SCTP_SO_NOT_LOCKED); } return (m); } static void sctp_handle_cookie_ack(struct sctp_cookie_ack_chunk *cp SCTP_UNUSED, struct sctp_tcb *stcb, struct sctp_nets *net) { /* cp must not be used, others call this without a c-ack :-) */ struct sctp_association *asoc; SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_cookie_ack: handling COOKIE-ACK\n"); if ((stcb == NULL) || (net == NULL)) { return; } asoc = &stcb->asoc; sctp_stop_all_cookie_timers(stcb); /* process according to association state */ if (SCTP_GET_STATE(asoc) == SCTP_STATE_COOKIE_ECHOED) { /* state change only needed when I am in right state */ SCTPDBG(SCTP_DEBUG_INPUT2, "moving to OPEN state\n"); SCTP_SET_STATE(asoc, SCTP_STATE_OPEN); sctp_start_net_timers(stcb); if (asoc->state & SCTP_STATE_SHUTDOWN_PENDING) { sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWNGUARD, stcb->sctp_ep, stcb, asoc->primary_destination); } /* update RTO */ SCTP_STAT_INCR_COUNTER32(sctps_activeestab); SCTP_STAT_INCR_GAUGE32(sctps_currestab); if (asoc->overall_error_count == 0) { net->RTO = sctp_calculate_rto(stcb, asoc, net, &asoc->time_entered, sctp_align_safe_nocopy, SCTP_RTT_FROM_NON_DATA); } (void)SCTP_GETTIME_TIMEVAL(&asoc->time_entered); sctp_ulp_notify(SCTP_NOTIFY_ASSOC_UP, stcb, 0, NULL, SCTP_SO_NOT_LOCKED); if ((stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) || (stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL)) { #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif stcb->sctp_ep->sctp_flags |= SCTP_PCB_FLAGS_CONNECTED; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(stcb->sctp_ep); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); #endif if ((stcb->asoc.state & SCTP_STATE_CLOSED_SOCKET) == 0) { soisconnected(stcb->sctp_socket); } #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif } /* * since we did not send a HB make sure we don't double * things */ net->hb_responded = 1; if (stcb->asoc.state & SCTP_STATE_CLOSED_SOCKET) { /* * We don't need to do the asconf thing, nor hb or * autoclose if the socket is closed. */ goto closed_socket; } sctp_timer_start(SCTP_TIMER_TYPE_HEARTBEAT, stcb->sctp_ep, stcb, net); if (stcb->asoc.sctp_autoclose_ticks && sctp_is_feature_on(stcb->sctp_ep, SCTP_PCB_FLAGS_AUTOCLOSE)) { sctp_timer_start(SCTP_TIMER_TYPE_AUTOCLOSE, stcb->sctp_ep, stcb, NULL); } /* * send ASCONF if parameters are pending and ASCONFs are * allowed (eg. addresses changed when init/cookie echo were * in flight) */ if ((sctp_is_feature_on(stcb->sctp_ep, SCTP_PCB_FLAGS_DO_ASCONF)) && (stcb->asoc.asconf_supported == 1) && (!TAILQ_EMPTY(&stcb->asoc.asconf_queue))) { #ifdef SCTP_TIMER_BASED_ASCONF sctp_timer_start(SCTP_TIMER_TYPE_ASCONF, stcb->sctp_ep, stcb, stcb->asoc.primary_destination); #else sctp_send_asconf(stcb, stcb->asoc.primary_destination, SCTP_ADDR_NOT_LOCKED); #endif } } closed_socket: /* Toss the cookie if I can */ sctp_toss_old_cookies(stcb, asoc); if (!TAILQ_EMPTY(&asoc->sent_queue)) { /* Restart the timer if we have pending data */ struct sctp_tmit_chunk *chk; chk = TAILQ_FIRST(&asoc->sent_queue); sctp_timer_start(SCTP_TIMER_TYPE_SEND, stcb->sctp_ep, stcb, chk->whoTo); } } static void sctp_handle_ecn_echo(struct sctp_ecne_chunk *cp, struct sctp_tcb *stcb) { struct sctp_nets *net; struct sctp_tmit_chunk *lchk; struct sctp_ecne_chunk bkup; uint8_t override_bit; uint32_t tsn, window_data_tsn; int len; unsigned int pkt_cnt; len = ntohs(cp->ch.chunk_length); if ((len != sizeof(struct sctp_ecne_chunk)) && (len != sizeof(struct old_sctp_ecne_chunk))) { return; } if (len == sizeof(struct old_sctp_ecne_chunk)) { /* Its the old format */ memcpy(&bkup, cp, sizeof(struct old_sctp_ecne_chunk)); bkup.num_pkts_since_cwr = htonl(1); cp = &bkup; } SCTP_STAT_INCR(sctps_recvecne); tsn = ntohl(cp->tsn); pkt_cnt = ntohl(cp->num_pkts_since_cwr); lchk = TAILQ_LAST(&stcb->asoc.send_queue, sctpchunk_listhead); if (lchk == NULL) { window_data_tsn = stcb->asoc.sending_seq - 1; } else { window_data_tsn = lchk->rec.data.TSN_seq; } /* Find where it was sent to if possible. */ net = NULL; TAILQ_FOREACH(lchk, &stcb->asoc.sent_queue, sctp_next) { if (lchk->rec.data.TSN_seq == tsn) { net = lchk->whoTo; net->ecn_prev_cwnd = lchk->rec.data.cwnd_at_send; break; } if (SCTP_TSN_GT(lchk->rec.data.TSN_seq, tsn)) { break; } } if (net == NULL) { /* * What to do. A previous send of a CWR was possibly lost. * See how old it is, we may have it marked on the actual * net. */ TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { if (tsn == net->last_cwr_tsn) { /* Found him, send it off */ break; } } if (net == NULL) { /* * If we reach here, we need to send a special CWR * that says hey, we did this a long time ago and * you lost the response. */ net = TAILQ_FIRST(&stcb->asoc.nets); if (net == NULL) { /* TSNH */ return; } override_bit = SCTP_CWR_REDUCE_OVERRIDE; } else { override_bit = 0; } } else { override_bit = 0; } if (SCTP_TSN_GT(tsn, net->cwr_window_tsn) && ((override_bit & SCTP_CWR_REDUCE_OVERRIDE) == 0)) { /* * JRS - Use the congestion control given in the pluggable * CC module */ stcb->asoc.cc_functions.sctp_cwnd_update_after_ecn_echo(stcb, net, 0, pkt_cnt); /* * We reduce once every RTT. So we will only lower cwnd at * the next sending seq i.e. the window_data_tsn */ net->cwr_window_tsn = window_data_tsn; net->ecn_ce_pkt_cnt += pkt_cnt; net->lost_cnt = pkt_cnt; net->last_cwr_tsn = tsn; } else { override_bit |= SCTP_CWR_IN_SAME_WINDOW; if (SCTP_TSN_GT(tsn, net->last_cwr_tsn) && ((override_bit & SCTP_CWR_REDUCE_OVERRIDE) == 0)) { /* * Another loss in the same window update how many * marks/packets lost we have had. */ int cnt = 1; if (pkt_cnt > net->lost_cnt) { /* Should be the case */ cnt = (pkt_cnt - net->lost_cnt); net->ecn_ce_pkt_cnt += cnt; } net->lost_cnt = pkt_cnt; net->last_cwr_tsn = tsn; /* * Most CC functions will ignore this call, since we * are in-window yet of the initial CE the peer saw. */ stcb->asoc.cc_functions.sctp_cwnd_update_after_ecn_echo(stcb, net, 1, cnt); } } /* * We always send a CWR this way if our previous one was lost our * peer will get an update, or if it is not time again to reduce we * still get the cwr to the peer. Note we set the override when we * could not find the TSN on the chunk or the destination network. */ sctp_send_cwr(stcb, net, net->last_cwr_tsn, override_bit); } static void sctp_handle_ecn_cwr(struct sctp_cwr_chunk *cp, struct sctp_tcb *stcb, struct sctp_nets *net) { /* * Here we get a CWR from the peer. We must look in the outqueue and * make sure that we have a covered ECNE in the control chunk part. * If so remove it. */ struct sctp_tmit_chunk *chk; struct sctp_ecne_chunk *ecne; int override; uint32_t cwr_tsn; cwr_tsn = ntohl(cp->tsn); override = cp->ch.chunk_flags & SCTP_CWR_REDUCE_OVERRIDE; TAILQ_FOREACH(chk, &stcb->asoc.control_send_queue, sctp_next) { if (chk->rec.chunk_id.id != SCTP_ECN_ECHO) { continue; } if ((override == 0) && (chk->whoTo != net)) { /* Must be from the right src unless override is set */ continue; } ecne = mtod(chk->data, struct sctp_ecne_chunk *); if (SCTP_TSN_GE(cwr_tsn, ntohl(ecne->tsn))) { /* this covers this ECNE, we can remove it */ stcb->asoc.ecn_echo_cnt_onq--; TAILQ_REMOVE(&stcb->asoc.control_send_queue, chk, sctp_next); sctp_m_freem(chk->data); chk->data = NULL; stcb->asoc.ctrl_queue_cnt--; sctp_free_a_chunk(stcb, chk, SCTP_SO_NOT_LOCKED); if (override == 0) { break; } } } } static void sctp_handle_shutdown_complete(struct sctp_shutdown_complete_chunk *cp SCTP_UNUSED, struct sctp_tcb *stcb, struct sctp_nets *net) { struct sctp_association *asoc; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_shutdown_complete: handling SHUTDOWN-COMPLETE\n"); if (stcb == NULL) return; asoc = &stcb->asoc; /* process according to association state */ if (SCTP_GET_STATE(asoc) != SCTP_STATE_SHUTDOWN_ACK_SENT) { /* unexpected SHUTDOWN-COMPLETE... so ignore... */ SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_shutdown_complete: not in SCTP_STATE_SHUTDOWN_ACK_SENT --- ignore\n"); SCTP_TCB_UNLOCK(stcb); return; } /* notify upper layer protocol */ if (stcb->sctp_socket) { sctp_ulp_notify(SCTP_NOTIFY_ASSOC_DOWN, stcb, 0, NULL, SCTP_SO_NOT_LOCKED); } #ifdef INVARIANTS if (!TAILQ_EMPTY(&asoc->send_queue) || !TAILQ_EMPTY(&asoc->sent_queue) || !stcb->asoc.ss_functions.sctp_ss_is_empty(stcb, asoc)) { panic("Queues are not empty when handling SHUTDOWN-COMPLETE"); } #endif /* stop the timer */ sctp_timer_stop(SCTP_TIMER_TYPE_SHUTDOWNACK, stcb->sctp_ep, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_24); SCTP_STAT_INCR_COUNTER32(sctps_shutdown); /* free the TCB */ SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_handle_shutdown_complete: calls free-asoc\n"); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(stcb->sctp_ep); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); #endif (void)sctp_free_assoc(stcb->sctp_ep, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_25); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif return; } static int process_chunk_drop(struct sctp_tcb *stcb, struct sctp_chunk_desc *desc, struct sctp_nets *net, uint8_t flg) { switch (desc->chunk_type) { case SCTP_DATA: /* find the tsn to resend (possibly */ { uint32_t tsn; struct sctp_tmit_chunk *tp1; tsn = ntohl(desc->tsn_ifany); TAILQ_FOREACH(tp1, &stcb->asoc.sent_queue, sctp_next) { if (tp1->rec.data.TSN_seq == tsn) { /* found it */ break; } if (SCTP_TSN_GT(tp1->rec.data.TSN_seq, tsn)) { /* not found */ tp1 = NULL; break; } } if (tp1 == NULL) { /* * Do it the other way , aka without paying * attention to queue seq order. */ SCTP_STAT_INCR(sctps_pdrpdnfnd); TAILQ_FOREACH(tp1, &stcb->asoc.sent_queue, sctp_next) { if (tp1->rec.data.TSN_seq == tsn) { /* found it */ break; } } } if (tp1 == NULL) { SCTP_STAT_INCR(sctps_pdrptsnnf); } if ((tp1) && (tp1->sent < SCTP_DATAGRAM_ACKED)) { uint8_t *ddp; if (((flg & SCTP_BADCRC) == 0) && ((flg & SCTP_FROM_MIDDLE_BOX) == 0)) { return (0); } if ((stcb->asoc.peers_rwnd == 0) && ((flg & SCTP_FROM_MIDDLE_BOX) == 0)) { SCTP_STAT_INCR(sctps_pdrpdiwnp); return (0); } if (stcb->asoc.peers_rwnd == 0 && (flg & SCTP_FROM_MIDDLE_BOX)) { SCTP_STAT_INCR(sctps_pdrpdizrw); return (0); } ddp = (uint8_t *) (mtod(tp1->data, caddr_t)+ sizeof(struct sctp_data_chunk)); { unsigned int iii; for (iii = 0; iii < sizeof(desc->data_bytes); iii++) { if (ddp[iii] != desc->data_bytes[iii]) { SCTP_STAT_INCR(sctps_pdrpbadd); return (-1); } } } if (tp1->do_rtt) { /* * this guy had a RTO calculation * pending on it, cancel it */ if (tp1->whoTo->rto_needed == 0) { tp1->whoTo->rto_needed = 1; } tp1->do_rtt = 0; } SCTP_STAT_INCR(sctps_pdrpmark); if (tp1->sent != SCTP_DATAGRAM_RESEND) sctp_ucount_incr(stcb->asoc.sent_queue_retran_cnt); /* * mark it as if we were doing a FR, since * we will be getting gap ack reports behind * the info from the router. */ tp1->rec.data.doing_fast_retransmit = 1; /* * mark the tsn with what sequences can * cause a new FR. */ if (TAILQ_EMPTY(&stcb->asoc.send_queue)) { tp1->rec.data.fast_retran_tsn = stcb->asoc.sending_seq; } else { tp1->rec.data.fast_retran_tsn = (TAILQ_FIRST(&stcb->asoc.send_queue))->rec.data.TSN_seq; } /* restart the timer */ sctp_timer_stop(SCTP_TIMER_TYPE_SEND, stcb->sctp_ep, stcb, tp1->whoTo, SCTP_FROM_SCTP_INPUT + SCTP_LOC_26); sctp_timer_start(SCTP_TIMER_TYPE_SEND, stcb->sctp_ep, stcb, tp1->whoTo); /* fix counts and things */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_FLIGHT_LOGGING_ENABLE) { sctp_misc_ints(SCTP_FLIGHT_LOG_DOWN_PDRP, tp1->whoTo->flight_size, tp1->book_size, (uint32_t) (uintptr_t) stcb, tp1->rec.data.TSN_seq); } if (tp1->sent < SCTP_DATAGRAM_RESEND) { sctp_flight_size_decrease(tp1); sctp_total_flight_decrease(stcb, tp1); } tp1->sent = SCTP_DATAGRAM_RESEND; } { /* audit code */ unsigned int audit; audit = 0; TAILQ_FOREACH(tp1, &stcb->asoc.sent_queue, sctp_next) { if (tp1->sent == SCTP_DATAGRAM_RESEND) audit++; } TAILQ_FOREACH(tp1, &stcb->asoc.control_send_queue, sctp_next) { if (tp1->sent == SCTP_DATAGRAM_RESEND) audit++; } if (audit != stcb->asoc.sent_queue_retran_cnt) { SCTP_PRINTF("**Local Audit finds cnt:%d asoc cnt:%d\n", audit, stcb->asoc.sent_queue_retran_cnt); #ifndef SCTP_AUDITING_ENABLED stcb->asoc.sent_queue_retran_cnt = audit; #endif } } } break; case SCTP_ASCONF: { struct sctp_tmit_chunk *asconf; TAILQ_FOREACH(asconf, &stcb->asoc.control_send_queue, sctp_next) { if (asconf->rec.chunk_id.id == SCTP_ASCONF) { break; } } if (asconf) { if (asconf->sent != SCTP_DATAGRAM_RESEND) sctp_ucount_incr(stcb->asoc.sent_queue_retran_cnt); asconf->sent = SCTP_DATAGRAM_RESEND; asconf->snd_count--; } } break; case SCTP_INITIATION: /* resend the INIT */ stcb->asoc.dropped_special_cnt++; if (stcb->asoc.dropped_special_cnt < SCTP_RETRY_DROPPED_THRESH) { /* * If we can get it in, in a few attempts we do * this, otherwise we let the timer fire. */ sctp_timer_stop(SCTP_TIMER_TYPE_INIT, stcb->sctp_ep, stcb, net, SCTP_FROM_SCTP_INPUT + SCTP_LOC_27); sctp_send_initiate(stcb->sctp_ep, stcb, SCTP_SO_NOT_LOCKED); } break; case SCTP_SELECTIVE_ACK: case SCTP_NR_SELECTIVE_ACK: /* resend the sack */ sctp_send_sack(stcb, SCTP_SO_NOT_LOCKED); break; case SCTP_HEARTBEAT_REQUEST: /* resend a demand HB */ if ((stcb->asoc.overall_error_count + 3) < stcb->asoc.max_send_times) { /* * Only retransmit if we KNOW we wont destroy the * tcb */ sctp_send_hb(stcb, net, SCTP_SO_NOT_LOCKED); } break; case SCTP_SHUTDOWN: sctp_send_shutdown(stcb, net); break; case SCTP_SHUTDOWN_ACK: sctp_send_shutdown_ack(stcb, net); break; case SCTP_COOKIE_ECHO: { struct sctp_tmit_chunk *cookie; cookie = NULL; TAILQ_FOREACH(cookie, &stcb->asoc.control_send_queue, sctp_next) { if (cookie->rec.chunk_id.id == SCTP_COOKIE_ECHO) { break; } } if (cookie) { if (cookie->sent != SCTP_DATAGRAM_RESEND) sctp_ucount_incr(stcb->asoc.sent_queue_retran_cnt); cookie->sent = SCTP_DATAGRAM_RESEND; sctp_stop_all_cookie_timers(stcb); } } break; case SCTP_COOKIE_ACK: sctp_send_cookie_ack(stcb); break; case SCTP_ASCONF_ACK: /* resend last asconf ack */ sctp_send_asconf_ack(stcb); break; case SCTP_IFORWARD_CUM_TSN: case SCTP_FORWARD_CUM_TSN: send_forward_tsn(stcb, &stcb->asoc); break; /* can't do anything with these */ case SCTP_PACKET_DROPPED: case SCTP_INITIATION_ACK: /* this should not happen */ case SCTP_HEARTBEAT_ACK: case SCTP_ABORT_ASSOCIATION: case SCTP_OPERATION_ERROR: case SCTP_SHUTDOWN_COMPLETE: case SCTP_ECN_ECHO: case SCTP_ECN_CWR: default: break; } return (0); } void sctp_reset_in_stream(struct sctp_tcb *stcb, uint32_t number_entries, uint16_t * list) { uint32_t i; uint16_t temp; /* * We set things to 0xffffffff since this is the last delivered * sequence and we will be sending in 0 after the reset. */ if (number_entries) { for (i = 0; i < number_entries; i++) { temp = ntohs(list[i]); if (temp >= stcb->asoc.streamincnt) { continue; } stcb->asoc.strmin[temp].last_sequence_delivered = 0xffffffff; } } else { list = NULL; for (i = 0; i < stcb->asoc.streamincnt; i++) { stcb->asoc.strmin[i].last_sequence_delivered = 0xffffffff; } } sctp_ulp_notify(SCTP_NOTIFY_STR_RESET_RECV, stcb, number_entries, (void *)list, SCTP_SO_NOT_LOCKED); } static void sctp_reset_out_streams(struct sctp_tcb *stcb, uint32_t number_entries, uint16_t * list) { uint32_t i; uint16_t temp; if (number_entries > 0) { for (i = 0; i < number_entries; i++) { temp = ntohs(list[i]); if (temp >= stcb->asoc.streamoutcnt) { /* no such stream */ continue; } stcb->asoc.strmout[temp].next_sequence_send = 0; } } else { for (i = 0; i < stcb->asoc.streamoutcnt; i++) { stcb->asoc.strmout[i].next_sequence_send = 0; } } sctp_ulp_notify(SCTP_NOTIFY_STR_RESET_SEND, stcb, number_entries, (void *)list, SCTP_SO_NOT_LOCKED); } static void sctp_reset_clear_pending(struct sctp_tcb *stcb, uint32_t number_entries, uint16_t * list) { uint32_t i; uint16_t temp; if (number_entries > 0) { for (i = 0; i < number_entries; i++) { temp = ntohs(list[i]); if (temp >= stcb->asoc.streamoutcnt) { /* no such stream */ continue; } stcb->asoc.strmout[temp].state = SCTP_STREAM_OPEN; } } else { for (i = 0; i < stcb->asoc.streamoutcnt; i++) { stcb->asoc.strmout[i].state = SCTP_STREAM_OPEN; } } } struct sctp_stream_reset_request * sctp_find_stream_reset(struct sctp_tcb *stcb, uint32_t seq, struct sctp_tmit_chunk **bchk) { struct sctp_association *asoc; struct sctp_chunkhdr *ch; struct sctp_stream_reset_request *r; struct sctp_tmit_chunk *chk; int len, clen; asoc = &stcb->asoc; if (TAILQ_EMPTY(&stcb->asoc.control_send_queue)) { asoc->stream_reset_outstanding = 0; return (NULL); } if (stcb->asoc.str_reset == NULL) { asoc->stream_reset_outstanding = 0; return (NULL); } chk = stcb->asoc.str_reset; if (chk->data == NULL) { return (NULL); } if (bchk) { /* he wants a copy of the chk pointer */ *bchk = chk; } clen = chk->send_size; ch = mtod(chk->data, struct sctp_chunkhdr *); r = (struct sctp_stream_reset_request *)(ch + 1); if (ntohl(r->request_seq) == seq) { /* found it */ return (r); } len = SCTP_SIZE32(ntohs(r->ph.param_length)); if (clen > (len + (int)sizeof(struct sctp_chunkhdr))) { /* move to the next one, there can only be a max of two */ r = (struct sctp_stream_reset_request *)((caddr_t)r + len); if (ntohl(r->request_seq) == seq) { return (r); } } /* that seq is not here */ return (NULL); } static void sctp_clean_up_stream_reset(struct sctp_tcb *stcb) { struct sctp_association *asoc; struct sctp_tmit_chunk *chk = stcb->asoc.str_reset; if (stcb->asoc.str_reset == NULL) { return; } asoc = &stcb->asoc; sctp_timer_stop(SCTP_TIMER_TYPE_STRRESET, stcb->sctp_ep, stcb, chk->whoTo, SCTP_FROM_SCTP_INPUT + SCTP_LOC_28); TAILQ_REMOVE(&asoc->control_send_queue, chk, sctp_next); if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } asoc->ctrl_queue_cnt--; sctp_free_a_chunk(stcb, chk, SCTP_SO_NOT_LOCKED); /* sa_ignore NO_NULL_CHK */ stcb->asoc.str_reset = NULL; } static int sctp_handle_stream_reset_response(struct sctp_tcb *stcb, uint32_t seq, uint32_t action, struct sctp_stream_reset_response *respin) { uint16_t type; int lparm_len; struct sctp_association *asoc = &stcb->asoc; struct sctp_tmit_chunk *chk; struct sctp_stream_reset_request *req_param; struct sctp_stream_reset_out_request *req_out_param; struct sctp_stream_reset_in_request *req_in_param; uint32_t number_entries; if (asoc->stream_reset_outstanding == 0) { /* duplicate */ return (0); } if (seq == stcb->asoc.str_reset_seq_out) { req_param = sctp_find_stream_reset(stcb, seq, &chk); if (req_param != NULL) { stcb->asoc.str_reset_seq_out++; type = ntohs(req_param->ph.param_type); lparm_len = ntohs(req_param->ph.param_length); if (type == SCTP_STR_RESET_OUT_REQUEST) { int no_clear = 0; req_out_param = (struct sctp_stream_reset_out_request *)req_param; number_entries = (lparm_len - sizeof(struct sctp_stream_reset_out_request)) / sizeof(uint16_t); asoc->stream_reset_out_is_outstanding = 0; if (asoc->stream_reset_outstanding) asoc->stream_reset_outstanding--; if (action == SCTP_STREAM_RESET_RESULT_PERFORMED) { /* do it */ sctp_reset_out_streams(stcb, number_entries, req_out_param->list_of_streams); } else if (action == SCTP_STREAM_RESET_RESULT_DENIED) { sctp_ulp_notify(SCTP_NOTIFY_STR_RESET_DENIED_OUT, stcb, number_entries, req_out_param->list_of_streams, SCTP_SO_NOT_LOCKED); } else if (action == SCTP_STREAM_RESET_RESULT_IN_PROGRESS) { /* * Set it up so we don't stop * retransmitting */ asoc->stream_reset_outstanding++; stcb->asoc.str_reset_seq_out--; asoc->stream_reset_out_is_outstanding = 1; no_clear = 1; } else { sctp_ulp_notify(SCTP_NOTIFY_STR_RESET_FAILED_OUT, stcb, number_entries, req_out_param->list_of_streams, SCTP_SO_NOT_LOCKED); } if (no_clear == 0) { sctp_reset_clear_pending(stcb, number_entries, req_out_param->list_of_streams); } } else if (type == SCTP_STR_RESET_IN_REQUEST) { req_in_param = (struct sctp_stream_reset_in_request *)req_param; number_entries = (lparm_len - sizeof(struct sctp_stream_reset_in_request)) / sizeof(uint16_t); if (asoc->stream_reset_outstanding) asoc->stream_reset_outstanding--; if (action == SCTP_STREAM_RESET_RESULT_DENIED) { sctp_ulp_notify(SCTP_NOTIFY_STR_RESET_DENIED_IN, stcb, number_entries, req_in_param->list_of_streams, SCTP_SO_NOT_LOCKED); } else if (action != SCTP_STREAM_RESET_RESULT_PERFORMED) { sctp_ulp_notify(SCTP_NOTIFY_STR_RESET_FAILED_IN, stcb, number_entries, req_in_param->list_of_streams, SCTP_SO_NOT_LOCKED); } } else if (type == SCTP_STR_RESET_ADD_OUT_STREAMS) { /* Ok we now may have more streams */ int num_stream; num_stream = stcb->asoc.strm_pending_add_size; if (num_stream > (stcb->asoc.strm_realoutsize - stcb->asoc.streamoutcnt)) { /* TSNH */ num_stream = stcb->asoc.strm_realoutsize - stcb->asoc.streamoutcnt; } stcb->asoc.strm_pending_add_size = 0; if (asoc->stream_reset_outstanding) asoc->stream_reset_outstanding--; if (action == SCTP_STREAM_RESET_RESULT_PERFORMED) { /* Put the new streams into effect */ int i; for (i = asoc->streamoutcnt; i < (asoc->streamoutcnt + num_stream); i++) { asoc->strmout[i].state = SCTP_STREAM_OPEN; } asoc->streamoutcnt += num_stream; sctp_notify_stream_reset_add(stcb, stcb->asoc.streamincnt, stcb->asoc.streamoutcnt, 0); } else if (action == SCTP_STREAM_RESET_RESULT_DENIED) { sctp_notify_stream_reset_add(stcb, stcb->asoc.streamincnt, stcb->asoc.streamoutcnt, SCTP_STREAM_CHANGE_DENIED); } else { sctp_notify_stream_reset_add(stcb, stcb->asoc.streamincnt, stcb->asoc.streamoutcnt, SCTP_STREAM_CHANGE_FAILED); } } else if (type == SCTP_STR_RESET_ADD_IN_STREAMS) { if (asoc->stream_reset_outstanding) asoc->stream_reset_outstanding--; if (action == SCTP_STREAM_RESET_RESULT_DENIED) { sctp_notify_stream_reset_add(stcb, stcb->asoc.streamincnt, stcb->asoc.streamoutcnt, SCTP_STREAM_CHANGE_DENIED); } else if (action != SCTP_STREAM_RESET_RESULT_PERFORMED) { sctp_notify_stream_reset_add(stcb, stcb->asoc.streamincnt, stcb->asoc.streamoutcnt, SCTP_STREAM_CHANGE_FAILED); } } else if (type == SCTP_STR_RESET_TSN_REQUEST) { /** * a) Adopt the new in tsn. * b) reset the map * c) Adopt the new out-tsn */ struct sctp_stream_reset_response_tsn *resp; struct sctp_forward_tsn_chunk fwdtsn; int abort_flag = 0; if (respin == NULL) { /* huh ? */ return (0); } if (ntohs(respin->ph.param_length) < sizeof(struct sctp_stream_reset_response_tsn)) { return (0); } if (action == SCTP_STREAM_RESET_RESULT_PERFORMED) { resp = (struct sctp_stream_reset_response_tsn *)respin; asoc->stream_reset_outstanding--; fwdtsn.ch.chunk_length = htons(sizeof(struct sctp_forward_tsn_chunk)); fwdtsn.ch.chunk_type = SCTP_FORWARD_CUM_TSN; fwdtsn.new_cumulative_tsn = htonl(ntohl(resp->senders_next_tsn) - 1); sctp_handle_forward_tsn(stcb, &fwdtsn, &abort_flag, NULL, 0); if (abort_flag) { return (1); } stcb->asoc.highest_tsn_inside_map = (ntohl(resp->senders_next_tsn) - 1); if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_MAP_LOGGING_ENABLE) { sctp_log_map(0, 7, asoc->highest_tsn_inside_map, SCTP_MAP_SLIDE_RESULT); } stcb->asoc.tsn_last_delivered = stcb->asoc.cumulative_tsn = stcb->asoc.highest_tsn_inside_map; stcb->asoc.mapping_array_base_tsn = ntohl(resp->senders_next_tsn); memset(stcb->asoc.mapping_array, 0, stcb->asoc.mapping_array_size); stcb->asoc.highest_tsn_inside_nr_map = stcb->asoc.highest_tsn_inside_map; memset(stcb->asoc.nr_mapping_array, 0, stcb->asoc.mapping_array_size); stcb->asoc.sending_seq = ntohl(resp->receivers_next_tsn); stcb->asoc.last_acked_seq = stcb->asoc.cumulative_tsn; sctp_reset_out_streams(stcb, 0, (uint16_t *) NULL); sctp_reset_in_stream(stcb, 0, (uint16_t *) NULL); sctp_notify_stream_reset_tsn(stcb, stcb->asoc.sending_seq, (stcb->asoc.mapping_array_base_tsn + 1), 0); } else if (action == SCTP_STREAM_RESET_RESULT_DENIED) { sctp_notify_stream_reset_tsn(stcb, stcb->asoc.sending_seq, (stcb->asoc.mapping_array_base_tsn + 1), SCTP_ASSOC_RESET_DENIED); } else { sctp_notify_stream_reset_tsn(stcb, stcb->asoc.sending_seq, (stcb->asoc.mapping_array_base_tsn + 1), SCTP_ASSOC_RESET_FAILED); } } /* get rid of the request and get the request flags */ if (asoc->stream_reset_outstanding == 0) { sctp_clean_up_stream_reset(stcb); } } } if (asoc->stream_reset_outstanding == 0) { sctp_send_stream_reset_out_if_possible(stcb, SCTP_SO_NOT_LOCKED); } return (0); } static void sctp_handle_str_reset_request_in(struct sctp_tcb *stcb, struct sctp_tmit_chunk *chk, struct sctp_stream_reset_in_request *req, int trunc) { uint32_t seq; int len, i; int number_entries; uint16_t temp; /* * peer wants me to send a str-reset to him for my outgoing seq's if * seq_in is right. */ struct sctp_association *asoc = &stcb->asoc; seq = ntohl(req->request_seq); if (asoc->str_reset_seq_in == seq) { asoc->last_reset_action[1] = asoc->last_reset_action[0]; if (!(asoc->local_strreset_support & SCTP_ENABLE_RESET_STREAM_REQ)) { asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } else if (trunc) { /* Can't do it, since they exceeded our buffer size */ asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } else if (stcb->asoc.stream_reset_out_is_outstanding == 0) { len = ntohs(req->ph.param_length); number_entries = ((len - sizeof(struct sctp_stream_reset_in_request)) / sizeof(uint16_t)); if (number_entries) { for (i = 0; i < number_entries; i++) { temp = ntohs(req->list_of_streams[i]); if (temp >= stcb->asoc.streamoutcnt) { asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; goto bad_boy; } req->list_of_streams[i] = temp; } for (i = 0; i < number_entries; i++) { if (stcb->asoc.strmout[req->list_of_streams[i]].state == SCTP_STREAM_OPEN) { stcb->asoc.strmout[req->list_of_streams[i]].state = SCTP_STREAM_RESET_PENDING; } } } else { /* Its all */ for (i = 0; i < stcb->asoc.streamoutcnt; i++) { if (stcb->asoc.strmout[i].state == SCTP_STREAM_OPEN) stcb->asoc.strmout[i].state = SCTP_STREAM_RESET_PENDING; } } asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_PERFORMED; } else { /* Can't do it, since we have sent one out */ asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_ERR_IN_PROGRESS; } bad_boy: sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]); asoc->str_reset_seq_in++; } else if (asoc->str_reset_seq_in - 1 == seq) { sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]); } else if (asoc->str_reset_seq_in - 2 == seq) { sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[1]); } else { sctp_add_stream_reset_result(chk, seq, SCTP_STREAM_RESET_RESULT_ERR_BAD_SEQNO); } sctp_send_stream_reset_out_if_possible(stcb, SCTP_SO_NOT_LOCKED); } static int sctp_handle_str_reset_request_tsn(struct sctp_tcb *stcb, struct sctp_tmit_chunk *chk, struct sctp_stream_reset_tsn_request *req) { /* reset all in and out and update the tsn */ /* * A) reset my str-seq's on in and out. B) Select a receive next, * and set cum-ack to it. Also process this selected number as a * fwd-tsn as well. C) set in the response my next sending seq. */ struct sctp_forward_tsn_chunk fwdtsn; struct sctp_association *asoc = &stcb->asoc; int abort_flag = 0; uint32_t seq; seq = ntohl(req->request_seq); if (asoc->str_reset_seq_in == seq) { asoc->last_reset_action[1] = stcb->asoc.last_reset_action[0]; if (!(asoc->local_strreset_support & SCTP_ENABLE_CHANGE_ASSOC_REQ)) { asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } else { fwdtsn.ch.chunk_length = htons(sizeof(struct sctp_forward_tsn_chunk)); fwdtsn.ch.chunk_type = SCTP_FORWARD_CUM_TSN; fwdtsn.ch.chunk_flags = 0; fwdtsn.new_cumulative_tsn = htonl(stcb->asoc.highest_tsn_inside_map + 1); sctp_handle_forward_tsn(stcb, &fwdtsn, &abort_flag, NULL, 0); if (abort_flag) { return (1); } asoc->highest_tsn_inside_map += SCTP_STREAM_RESET_TSN_DELTA; if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_MAP_LOGGING_ENABLE) { sctp_log_map(0, 10, asoc->highest_tsn_inside_map, SCTP_MAP_SLIDE_RESULT); } asoc->tsn_last_delivered = asoc->cumulative_tsn = asoc->highest_tsn_inside_map; asoc->mapping_array_base_tsn = asoc->highest_tsn_inside_map + 1; memset(asoc->mapping_array, 0, asoc->mapping_array_size); asoc->highest_tsn_inside_nr_map = asoc->highest_tsn_inside_map; memset(asoc->nr_mapping_array, 0, asoc->mapping_array_size); atomic_add_int(&asoc->sending_seq, 1); /* save off historical data for retrans */ asoc->last_sending_seq[1] = asoc->last_sending_seq[0]; asoc->last_sending_seq[0] = asoc->sending_seq; asoc->last_base_tsnsent[1] = asoc->last_base_tsnsent[0]; asoc->last_base_tsnsent[0] = asoc->mapping_array_base_tsn; sctp_reset_out_streams(stcb, 0, (uint16_t *) NULL); sctp_reset_in_stream(stcb, 0, (uint16_t *) NULL); asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_PERFORMED; sctp_notify_stream_reset_tsn(stcb, asoc->sending_seq, (asoc->mapping_array_base_tsn + 1), 0); } sctp_add_stream_reset_result_tsn(chk, seq, asoc->last_reset_action[0], asoc->last_sending_seq[0], asoc->last_base_tsnsent[0]); asoc->str_reset_seq_in++; } else if (asoc->str_reset_seq_in - 1 == seq) { sctp_add_stream_reset_result_tsn(chk, seq, asoc->last_reset_action[0], asoc->last_sending_seq[0], asoc->last_base_tsnsent[0]); } else if (asoc->str_reset_seq_in - 2 == seq) { sctp_add_stream_reset_result_tsn(chk, seq, asoc->last_reset_action[1], asoc->last_sending_seq[1], asoc->last_base_tsnsent[1]); } else { sctp_add_stream_reset_result(chk, seq, SCTP_STREAM_RESET_RESULT_ERR_BAD_SEQNO); } return (0); } static void sctp_handle_str_reset_request_out(struct sctp_tcb *stcb, struct sctp_tmit_chunk *chk, struct sctp_stream_reset_out_request *req, int trunc) { uint32_t seq, tsn; int number_entries, len; struct sctp_association *asoc = &stcb->asoc; seq = ntohl(req->request_seq); /* now if its not a duplicate we process it */ if (asoc->str_reset_seq_in == seq) { len = ntohs(req->ph.param_length); number_entries = ((len - sizeof(struct sctp_stream_reset_out_request)) / sizeof(uint16_t)); /* * the sender is resetting, handle the list issue.. we must * a) verify if we can do the reset, if so no problem b) If * we can't do the reset we must copy the request. c) queue * it, and setup the data in processor to trigger it off * when needed and dequeue all the queued data. */ tsn = ntohl(req->send_reset_at_tsn); /* move the reset action back one */ asoc->last_reset_action[1] = asoc->last_reset_action[0]; if (!(asoc->local_strreset_support & SCTP_ENABLE_RESET_STREAM_REQ)) { asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } else if (trunc) { asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } else if (SCTP_TSN_GE(asoc->cumulative_tsn, tsn)) { /* we can do it now */ sctp_reset_in_stream(stcb, number_entries, req->list_of_streams); asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_PERFORMED; } else { /* * we must queue it up and thus wait for the TSN's * to arrive that are at or before tsn */ struct sctp_stream_reset_list *liste; int siz; siz = sizeof(struct sctp_stream_reset_list) + (number_entries * sizeof(uint16_t)); SCTP_MALLOC(liste, struct sctp_stream_reset_list *, siz, SCTP_M_STRESET); if (liste == NULL) { /* gak out of memory */ asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]); return; } liste->seq = seq; liste->tsn = tsn; liste->number_entries = number_entries; memcpy(&liste->list_of_streams, req->list_of_streams, number_entries * sizeof(uint16_t)); TAILQ_INSERT_TAIL(&asoc->resetHead, liste, next_resp); asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_IN_PROGRESS; } sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]); asoc->str_reset_seq_in++; } else if ((asoc->str_reset_seq_in - 1) == seq) { /* * one seq back, just echo back last action since my * response was lost. */ sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]); } else if ((asoc->str_reset_seq_in - 2) == seq) { /* * two seq back, just echo back last action since my * response was lost. */ sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[1]); } else { sctp_add_stream_reset_result(chk, seq, SCTP_STREAM_RESET_RESULT_ERR_BAD_SEQNO); } } static void sctp_handle_str_reset_add_strm(struct sctp_tcb *stcb, struct sctp_tmit_chunk *chk, struct sctp_stream_reset_add_strm *str_add) { /* * Peer is requesting to add more streams. If its within our * max-streams we will allow it. */ uint32_t num_stream, i; uint32_t seq; struct sctp_association *asoc = &stcb->asoc; struct sctp_queued_to_read *ctl, *nctl; /* Get the number. */ seq = ntohl(str_add->request_seq); num_stream = ntohs(str_add->number_of_streams); /* Now what would be the new total? */ if (asoc->str_reset_seq_in == seq) { num_stream += stcb->asoc.streamincnt; stcb->asoc.last_reset_action[1] = stcb->asoc.last_reset_action[0]; if (!(asoc->local_strreset_support & SCTP_ENABLE_CHANGE_ASSOC_REQ)) { asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } else if ((num_stream > stcb->asoc.max_inbound_streams) || (num_stream > 0xffff)) { /* We must reject it they ask for to many */ denied: stcb->asoc.last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } else { /* Ok, we can do that :-) */ struct sctp_stream_in *oldstrm; /* save off the old */ oldstrm = stcb->asoc.strmin; SCTP_MALLOC(stcb->asoc.strmin, struct sctp_stream_in *, (num_stream * sizeof(struct sctp_stream_in)), SCTP_M_STRMI); if (stcb->asoc.strmin == NULL) { stcb->asoc.strmin = oldstrm; goto denied; } /* copy off the old data */ for (i = 0; i < stcb->asoc.streamincnt; i++) { TAILQ_INIT(&stcb->asoc.strmin[i].inqueue); TAILQ_INIT(&stcb->asoc.strmin[i].uno_inqueue); stcb->asoc.strmin[i].stream_no = i; stcb->asoc.strmin[i].last_sequence_delivered = oldstrm[i].last_sequence_delivered; stcb->asoc.strmin[i].delivery_started = oldstrm[i].delivery_started; stcb->asoc.strmin[i].pd_api_started = oldstrm[i].pd_api_started; /* now anything on those queues? */ TAILQ_FOREACH_SAFE(ctl, &oldstrm[i].inqueue, next_instrm, nctl) { TAILQ_REMOVE(&oldstrm[i].inqueue, ctl, next_instrm); TAILQ_INSERT_TAIL(&stcb->asoc.strmin[i].inqueue, ctl, next_instrm); } TAILQ_FOREACH_SAFE(ctl, &oldstrm[i].uno_inqueue, next_instrm, nctl) { TAILQ_REMOVE(&oldstrm[i].uno_inqueue, ctl, next_instrm); TAILQ_INSERT_TAIL(&stcb->asoc.strmin[i].uno_inqueue, ctl, next_instrm); } } /* Init the new streams */ for (i = stcb->asoc.streamincnt; i < num_stream; i++) { TAILQ_INIT(&stcb->asoc.strmin[i].inqueue); TAILQ_INIT(&stcb->asoc.strmin[i].uno_inqueue); stcb->asoc.strmin[i].stream_no = i; stcb->asoc.strmin[i].last_sequence_delivered = 0xffffffff; stcb->asoc.strmin[i].pd_api_started = 0; stcb->asoc.strmin[i].delivery_started = 0; } SCTP_FREE(oldstrm, SCTP_M_STRMI); /* update the size */ stcb->asoc.streamincnt = num_stream; stcb->asoc.last_reset_action[0] = SCTP_STREAM_RESET_RESULT_PERFORMED; sctp_notify_stream_reset_add(stcb, stcb->asoc.streamincnt, stcb->asoc.streamoutcnt, 0); } sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]); asoc->str_reset_seq_in++; } else if ((asoc->str_reset_seq_in - 1) == seq) { /* * one seq back, just echo back last action since my * response was lost. */ sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]); } else if ((asoc->str_reset_seq_in - 2) == seq) { /* * two seq back, just echo back last action since my * response was lost. */ sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[1]); } else { sctp_add_stream_reset_result(chk, seq, SCTP_STREAM_RESET_RESULT_ERR_BAD_SEQNO); } } static void sctp_handle_str_reset_add_out_strm(struct sctp_tcb *stcb, struct sctp_tmit_chunk *chk, struct sctp_stream_reset_add_strm *str_add) { /* * Peer is requesting to add more streams. If its within our * max-streams we will allow it. */ uint16_t num_stream; uint32_t seq; struct sctp_association *asoc = &stcb->asoc; /* Get the number. */ seq = ntohl(str_add->request_seq); num_stream = ntohs(str_add->number_of_streams); /* Now what would be the new total? */ if (asoc->str_reset_seq_in == seq) { stcb->asoc.last_reset_action[1] = stcb->asoc.last_reset_action[0]; if (!(asoc->local_strreset_support & SCTP_ENABLE_CHANGE_ASSOC_REQ)) { asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } else if (stcb->asoc.stream_reset_outstanding) { /* We must reject it we have something pending */ stcb->asoc.last_reset_action[0] = SCTP_STREAM_RESET_RESULT_ERR_IN_PROGRESS; } else { /* Ok, we can do that :-) */ int mychk; mychk = stcb->asoc.streamoutcnt; mychk += num_stream; if (mychk < 0x10000) { stcb->asoc.last_reset_action[0] = SCTP_STREAM_RESET_RESULT_PERFORMED; if (sctp_send_str_reset_req(stcb, 0, NULL, 0, 0, 1, num_stream, 0, 1)) { stcb->asoc.last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } } else { stcb->asoc.last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED; } } sctp_add_stream_reset_result(chk, seq, stcb->asoc.last_reset_action[0]); asoc->str_reset_seq_in++; } else if ((asoc->str_reset_seq_in - 1) == seq) { /* * one seq back, just echo back last action since my * response was lost. */ sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]); } else if ((asoc->str_reset_seq_in - 2) == seq) { /* * two seq back, just echo back last action since my * response was lost. */ sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[1]); } else { sctp_add_stream_reset_result(chk, seq, SCTP_STREAM_RESET_RESULT_ERR_BAD_SEQNO); } } #ifdef __GNUC__ __attribute__((noinline)) #endif static int sctp_handle_stream_reset(struct sctp_tcb *stcb, struct mbuf *m, int offset, struct sctp_chunkhdr *ch_req) { uint16_t remaining_length, param_len, ptype; struct sctp_paramhdr pstore; uint8_t cstore[SCTP_CHUNK_BUFFER_SIZE]; uint32_t seq = 0; int num_req = 0; int trunc = 0; struct sctp_tmit_chunk *chk; struct sctp_chunkhdr *ch; struct sctp_paramhdr *ph; int ret_code = 0; int num_param = 0; /* now it may be a reset or a reset-response */ remaining_length = ntohs(ch_req->chunk_length) - sizeof(struct sctp_chunkhdr); /* setup for adding the response */ sctp_alloc_a_chunk(stcb, chk); if (chk == NULL) { return (ret_code); } chk->copy_by_ref = 0; chk->rec.chunk_id.id = SCTP_STREAM_RESET; chk->rec.chunk_id.can_take_data = 0; chk->flags = 0; chk->asoc = &stcb->asoc; chk->no_fr_allowed = 0; chk->book_size = chk->send_size = sizeof(struct sctp_chunkhdr); chk->book_size_scale = 0; chk->data = sctp_get_mbuf_for_msg(MCLBYTES, 0, M_NOWAIT, 1, MT_DATA); if (chk->data == NULL) { strres_nochunk: if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } sctp_free_a_chunk(stcb, chk, SCTP_SO_NOT_LOCKED); return (ret_code); } SCTP_BUF_RESV_UF(chk->data, SCTP_MIN_OVERHEAD); /* setup chunk parameters */ chk->sent = SCTP_DATAGRAM_UNSENT; chk->snd_count = 0; chk->whoTo = NULL; ch = mtod(chk->data, struct sctp_chunkhdr *); ch->chunk_type = SCTP_STREAM_RESET; ch->chunk_flags = 0; ch->chunk_length = htons(chk->send_size); SCTP_BUF_LEN(chk->data) = SCTP_SIZE32(chk->send_size); offset += sizeof(struct sctp_chunkhdr); while (remaining_length >= sizeof(struct sctp_paramhdr)) { ph = (struct sctp_paramhdr *)sctp_m_getptr(m, offset, sizeof(pstore), (uint8_t *) & pstore); if (ph == NULL) { /* TSNH */ break; } param_len = ntohs(ph->param_length); if ((param_len > remaining_length) || (param_len < (sizeof(struct sctp_paramhdr) + sizeof(uint32_t)))) { /* bad parameter length */ break; } ph = (struct sctp_paramhdr *)sctp_m_getptr(m, offset, min(param_len, sizeof(cstore)), (uint8_t *) & cstore); if (ph == NULL) { /* TSNH */ break; } ptype = ntohs(ph->param_type); num_param++; if (param_len > sizeof(cstore)) { trunc = 1; } else { trunc = 0; } if (num_param > SCTP_MAX_RESET_PARAMS) { /* hit the max of parameters already sorry.. */ break; } if (ptype == SCTP_STR_RESET_OUT_REQUEST) { struct sctp_stream_reset_out_request *req_out; if (param_len < sizeof(struct sctp_stream_reset_out_request)) { break; } req_out = (struct sctp_stream_reset_out_request *)ph; num_req++; if (stcb->asoc.stream_reset_outstanding) { seq = ntohl(req_out->response_seq); if (seq == stcb->asoc.str_reset_seq_out) { /* implicit ack */ (void)sctp_handle_stream_reset_response(stcb, seq, SCTP_STREAM_RESET_RESULT_PERFORMED, NULL); } } sctp_handle_str_reset_request_out(stcb, chk, req_out, trunc); } else if (ptype == SCTP_STR_RESET_ADD_OUT_STREAMS) { struct sctp_stream_reset_add_strm *str_add; if (param_len < sizeof(struct sctp_stream_reset_add_strm)) { break; } str_add = (struct sctp_stream_reset_add_strm *)ph; num_req++; sctp_handle_str_reset_add_strm(stcb, chk, str_add); } else if (ptype == SCTP_STR_RESET_ADD_IN_STREAMS) { struct sctp_stream_reset_add_strm *str_add; if (param_len < sizeof(struct sctp_stream_reset_add_strm)) { break; } str_add = (struct sctp_stream_reset_add_strm *)ph; num_req++; sctp_handle_str_reset_add_out_strm(stcb, chk, str_add); } else if (ptype == SCTP_STR_RESET_IN_REQUEST) { struct sctp_stream_reset_in_request *req_in; num_req++; req_in = (struct sctp_stream_reset_in_request *)ph; sctp_handle_str_reset_request_in(stcb, chk, req_in, trunc); } else if (ptype == SCTP_STR_RESET_TSN_REQUEST) { struct sctp_stream_reset_tsn_request *req_tsn; num_req++; req_tsn = (struct sctp_stream_reset_tsn_request *)ph; if (sctp_handle_str_reset_request_tsn(stcb, chk, req_tsn)) { ret_code = 1; goto strres_nochunk; } /* no more */ break; } else if (ptype == SCTP_STR_RESET_RESPONSE) { struct sctp_stream_reset_response *resp; uint32_t result; if (param_len < sizeof(struct sctp_stream_reset_response)) { break; } resp = (struct sctp_stream_reset_response *)ph; seq = ntohl(resp->response_seq); result = ntohl(resp->result); if (sctp_handle_stream_reset_response(stcb, seq, result, resp)) { ret_code = 1; goto strres_nochunk; } } else { break; } offset += SCTP_SIZE32(param_len); if (remaining_length >= SCTP_SIZE32(param_len)) { remaining_length -= SCTP_SIZE32(param_len); } else { remaining_length = 0; } } if (num_req == 0) { /* we have no response free the stuff */ goto strres_nochunk; } /* ok we have a chunk to link in */ TAILQ_INSERT_TAIL(&stcb->asoc.control_send_queue, chk, sctp_next); stcb->asoc.ctrl_queue_cnt++; return (ret_code); } /* * Handle a router or endpoints report of a packet loss, there are two ways * to handle this, either we get the whole packet and must disect it * ourselves (possibly with truncation and or corruption) or it is a summary * from a middle box that did the disectting for us. */ static void sctp_handle_packet_dropped(struct sctp_pktdrop_chunk *cp, struct sctp_tcb *stcb, struct sctp_nets *net, uint32_t limit) { uint32_t bottle_bw, on_queue; uint16_t trunc_len; unsigned int chlen; unsigned int at; struct sctp_chunk_desc desc; struct sctp_chunkhdr *ch; chlen = ntohs(cp->ch.chunk_length); chlen -= sizeof(struct sctp_pktdrop_chunk); /* XXX possible chlen underflow */ if (chlen == 0) { ch = NULL; if (cp->ch.chunk_flags & SCTP_FROM_MIDDLE_BOX) SCTP_STAT_INCR(sctps_pdrpbwrpt); } else { ch = (struct sctp_chunkhdr *)(cp->data + sizeof(struct sctphdr)); chlen -= sizeof(struct sctphdr); /* XXX possible chlen underflow */ memset(&desc, 0, sizeof(desc)); } trunc_len = (uint16_t) ntohs(cp->trunc_len); if (trunc_len > limit) { trunc_len = limit; } /* now the chunks themselves */ while ((ch != NULL) && (chlen >= sizeof(struct sctp_chunkhdr))) { desc.chunk_type = ch->chunk_type; /* get amount we need to move */ at = ntohs(ch->chunk_length); if (at < sizeof(struct sctp_chunkhdr)) { /* corrupt chunk, maybe at the end? */ SCTP_STAT_INCR(sctps_pdrpcrupt); break; } if (trunc_len == 0) { /* we are supposed to have all of it */ if (at > chlen) { /* corrupt skip it */ SCTP_STAT_INCR(sctps_pdrpcrupt); break; } } else { /* is there enough of it left ? */ if (desc.chunk_type == SCTP_DATA) { if (chlen < (sizeof(struct sctp_data_chunk) + sizeof(desc.data_bytes))) { break; } } else { if (chlen < sizeof(struct sctp_chunkhdr)) { break; } } } if (desc.chunk_type == SCTP_DATA) { /* can we get out the tsn? */ if ((cp->ch.chunk_flags & SCTP_FROM_MIDDLE_BOX)) SCTP_STAT_INCR(sctps_pdrpmbda); if (chlen >= (sizeof(struct sctp_data_chunk) + sizeof(uint32_t))) { /* yep */ struct sctp_data_chunk *dcp; uint8_t *ddp; unsigned int iii; dcp = (struct sctp_data_chunk *)ch; ddp = (uint8_t *) (dcp + 1); for (iii = 0; iii < sizeof(desc.data_bytes); iii++) { desc.data_bytes[iii] = ddp[iii]; } desc.tsn_ifany = dcp->dp.tsn; } else { /* nope we are done. */ SCTP_STAT_INCR(sctps_pdrpnedat); break; } } else { if ((cp->ch.chunk_flags & SCTP_FROM_MIDDLE_BOX)) SCTP_STAT_INCR(sctps_pdrpmbct); } if (process_chunk_drop(stcb, &desc, net, cp->ch.chunk_flags)) { SCTP_STAT_INCR(sctps_pdrppdbrk); break; } if (SCTP_SIZE32(at) > chlen) { break; } chlen -= SCTP_SIZE32(at); if (chlen < sizeof(struct sctp_chunkhdr)) { /* done, none left */ break; } ch = (struct sctp_chunkhdr *)((caddr_t)ch + SCTP_SIZE32(at)); } /* Now update any rwnd --- possibly */ if ((cp->ch.chunk_flags & SCTP_FROM_MIDDLE_BOX) == 0) { /* From a peer, we get a rwnd report */ uint32_t a_rwnd; SCTP_STAT_INCR(sctps_pdrpfehos); bottle_bw = ntohl(cp->bottle_bw); on_queue = ntohl(cp->current_onq); if (bottle_bw && on_queue) { /* a rwnd report is in here */ if (bottle_bw > on_queue) a_rwnd = bottle_bw - on_queue; else a_rwnd = 0; if (a_rwnd == 0) stcb->asoc.peers_rwnd = 0; else { if (a_rwnd > stcb->asoc.total_flight) { stcb->asoc.peers_rwnd = a_rwnd - stcb->asoc.total_flight; } else { stcb->asoc.peers_rwnd = 0; } if (stcb->asoc.peers_rwnd < stcb->sctp_ep->sctp_ep.sctp_sws_sender) { /* SWS sender side engages */ stcb->asoc.peers_rwnd = 0; } } } } else { SCTP_STAT_INCR(sctps_pdrpfmbox); } /* now middle boxes in sat networks get a cwnd bump */ if ((cp->ch.chunk_flags & SCTP_FROM_MIDDLE_BOX) && (stcb->asoc.sat_t3_loss_recovery == 0) && (stcb->asoc.sat_network)) { /* * This is debatable but for sat networks it makes sense * Note if a T3 timer has went off, we will prohibit any * changes to cwnd until we exit the t3 loss recovery. */ stcb->asoc.cc_functions.sctp_cwnd_update_after_packet_dropped(stcb, net, cp, &bottle_bw, &on_queue); } } /* * handles all control chunks in a packet inputs: - m: mbuf chain, assumed to * still contain IP/SCTP header - stcb: is the tcb found for this packet - * offset: offset into the mbuf chain to first chunkhdr - length: is the * length of the complete packet outputs: - length: modified to remaining * length after control processing - netp: modified to new sctp_nets after * cookie-echo processing - return NULL to discard the packet (ie. no asoc, * bad packet,...) otherwise return the tcb for this packet */ #ifdef __GNUC__ __attribute__((noinline)) #endif static struct sctp_tcb * sctp_process_control(struct mbuf *m, int iphlen, int *offset, int length, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_chunkhdr *ch, struct sctp_inpcb *inp, struct sctp_tcb *stcb, struct sctp_nets **netp, int *fwd_tsn_seen, uint8_t mflowtype, uint32_t mflowid, uint16_t fibnum, uint32_t vrf_id, uint16_t port) { struct sctp_association *asoc; struct mbuf *op_err; char msg[SCTP_DIAG_INFO_LEN]; uint32_t vtag_in; int num_chunks = 0; /* number of control chunks processed */ uint32_t chk_length; int ret; int abort_no_unlock = 0; int ecne_seen = 0; /* * How big should this be, and should it be alloc'd? Lets try the * d-mtu-ceiling for now (2k) and that should hopefully work ... * until we get into jumbo grams and such.. */ uint8_t chunk_buf[SCTP_CHUNK_BUFFER_SIZE]; struct sctp_tcb *locked_tcb = stcb; int got_auth = 0; uint32_t auth_offset = 0, auth_len = 0; int auth_skipped = 0; int asconf_cnt = 0; #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) struct socket *so; #endif SCTPDBG(SCTP_DEBUG_INPUT1, "sctp_process_control: iphlen=%u, offset=%u, length=%u stcb:%p\n", iphlen, *offset, length, (void *)stcb); /* validate chunk header length... */ if (ntohs(ch->chunk_length) < sizeof(*ch)) { SCTPDBG(SCTP_DEBUG_INPUT1, "Invalid header length %d\n", ntohs(ch->chunk_length)); if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } /* * validate the verification tag */ vtag_in = ntohl(sh->v_tag); if (locked_tcb) { SCTP_TCB_LOCK_ASSERT(locked_tcb); } if (ch->chunk_type == SCTP_INITIATION) { SCTPDBG(SCTP_DEBUG_INPUT1, "Its an INIT of len:%d vtag:%x\n", ntohs(ch->chunk_length), vtag_in); if (vtag_in != 0) { /* protocol error- silently discard... */ SCTP_STAT_INCR(sctps_badvtag); if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } } else if (ch->chunk_type != SCTP_COOKIE_ECHO) { /* * If there is no stcb, skip the AUTH chunk and process * later after a stcb is found (to validate the lookup was * valid. */ if ((ch->chunk_type == SCTP_AUTHENTICATION) && (stcb == NULL) && (inp->auth_supported == 1)) { /* save this chunk for later processing */ auth_skipped = 1; auth_offset = *offset; auth_len = ntohs(ch->chunk_length); /* (temporarily) move past this chunk */ *offset += SCTP_SIZE32(auth_len); if (*offset >= length) { /* no more data left in the mbuf chain */ *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } ch = (struct sctp_chunkhdr *)sctp_m_getptr(m, *offset, sizeof(struct sctp_chunkhdr), chunk_buf); } if (ch == NULL) { /* Help */ *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } if (ch->chunk_type == SCTP_COOKIE_ECHO) { goto process_control_chunks; } /* * first check if it's an ASCONF with an unknown src addr we * need to look inside to find the association */ if (ch->chunk_type == SCTP_ASCONF && stcb == NULL) { struct sctp_chunkhdr *asconf_ch = ch; uint32_t asconf_offset = 0, asconf_len = 0; /* inp's refcount may be reduced */ SCTP_INP_INCR_REF(inp); asconf_offset = *offset; do { asconf_len = ntohs(asconf_ch->chunk_length); if (asconf_len < sizeof(struct sctp_asconf_paramhdr)) break; stcb = sctp_findassociation_ep_asconf(m, *offset, dst, sh, &inp, netp, vrf_id); if (stcb != NULL) break; asconf_offset += SCTP_SIZE32(asconf_len); asconf_ch = (struct sctp_chunkhdr *)sctp_m_getptr(m, asconf_offset, sizeof(struct sctp_chunkhdr), chunk_buf); } while (asconf_ch != NULL && asconf_ch->chunk_type == SCTP_ASCONF); if (stcb == NULL) { /* * reduce inp's refcount if not reduced in * sctp_findassociation_ep_asconf(). */ SCTP_INP_DECR_REF(inp); } else { locked_tcb = stcb; } /* now go back and verify any auth chunk to be sure */ if (auth_skipped && (stcb != NULL)) { struct sctp_auth_chunk *auth; auth = (struct sctp_auth_chunk *) sctp_m_getptr(m, auth_offset, auth_len, chunk_buf); got_auth = 1; auth_skipped = 0; if ((auth == NULL) || sctp_handle_auth(stcb, auth, m, auth_offset)) { /* auth HMAC failed so dump it */ *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } else { /* remaining chunks are HMAC checked */ stcb->asoc.authenticated = 1; } } } if (stcb == NULL) { snprintf(msg, sizeof(msg), "OOTB, %s:%d at %s", __FILE__, __LINE__, __func__); op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), msg); /* no association, so it's out of the blue... */ sctp_handle_ootb(m, iphlen, *offset, src, dst, sh, inp, op_err, mflowtype, mflowid, inp->fibnum, vrf_id, port); *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } asoc = &stcb->asoc; /* ABORT and SHUTDOWN can use either v_tag... */ if ((ch->chunk_type == SCTP_ABORT_ASSOCIATION) || (ch->chunk_type == SCTP_SHUTDOWN_COMPLETE) || (ch->chunk_type == SCTP_PACKET_DROPPED)) { /* Take the T-bit always into account. */ if ((((ch->chunk_flags & SCTP_HAD_NO_TCB) == 0) && (vtag_in == asoc->my_vtag)) || (((ch->chunk_flags & SCTP_HAD_NO_TCB) == SCTP_HAD_NO_TCB) && (vtag_in == asoc->peer_vtag))) { /* this is valid */ } else { /* drop this packet... */ SCTP_STAT_INCR(sctps_badvtag); if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } } else if (ch->chunk_type == SCTP_SHUTDOWN_ACK) { if (vtag_in != asoc->my_vtag) { /* * this could be a stale SHUTDOWN-ACK or the * peer never got the SHUTDOWN-COMPLETE and * is still hung; we have started a new asoc * but it won't complete until the shutdown * is completed */ if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } snprintf(msg, sizeof(msg), "OOTB, %s:%d at %s", __FILE__, __LINE__, __func__); op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), msg); sctp_handle_ootb(m, iphlen, *offset, src, dst, sh, inp, op_err, mflowtype, mflowid, fibnum, vrf_id, port); return (NULL); } } else { /* for all other chunks, vtag must match */ if (vtag_in != asoc->my_vtag) { /* invalid vtag... */ SCTPDBG(SCTP_DEBUG_INPUT3, "invalid vtag: %xh, expect %xh\n", vtag_in, asoc->my_vtag); SCTP_STAT_INCR(sctps_badvtag); if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } } } /* end if !SCTP_COOKIE_ECHO */ /* * process all control chunks... */ if (((ch->chunk_type == SCTP_SELECTIVE_ACK) || (ch->chunk_type == SCTP_NR_SELECTIVE_ACK) || (ch->chunk_type == SCTP_HEARTBEAT_REQUEST)) && (SCTP_GET_STATE(&stcb->asoc) == SCTP_STATE_COOKIE_ECHOED)) { /* implied cookie-ack.. we must have lost the ack */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; sctp_handle_cookie_ack((struct sctp_cookie_ack_chunk *)ch, stcb, *netp); } process_control_chunks: while (IS_SCTP_CONTROL(ch)) { /* validate chunk length */ chk_length = ntohs(ch->chunk_length); SCTPDBG(SCTP_DEBUG_INPUT2, "sctp_process_control: processing a chunk type=%u, len=%u\n", ch->chunk_type, chk_length); SCTP_LTRACE_CHK(inp, stcb, ch->chunk_type, chk_length); if (chk_length < sizeof(*ch) || (*offset + (int)chk_length) > length) { *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } SCTP_STAT_INCR_COUNTER64(sctps_incontrolchunks); /* * INIT-ACK only gets the init ack "header" portion only * because we don't have to process the peer's COOKIE. All * others get a complete chunk. */ if ((ch->chunk_type == SCTP_INITIATION_ACK) || (ch->chunk_type == SCTP_INITIATION)) { /* get an init-ack chunk */ ch = (struct sctp_chunkhdr *)sctp_m_getptr(m, *offset, sizeof(struct sctp_init_ack_chunk), chunk_buf); if (ch == NULL) { *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } } else { /* For cookies and all other chunks. */ if (chk_length > sizeof(chunk_buf)) { /* * use just the size of the chunk buffer so * the front part of our chunks fit in * contiguous space up to the chunk buffer * size (508 bytes). For chunks that need to * get more than that they must use the * sctp_m_getptr() function or other means * (e.g. know how to parse mbuf chains). * Cookies do this already. */ ch = (struct sctp_chunkhdr *)sctp_m_getptr(m, *offset, (sizeof(chunk_buf) - 4), chunk_buf); if (ch == NULL) { *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } } else { /* We can fit it all */ ch = (struct sctp_chunkhdr *)sctp_m_getptr(m, *offset, chk_length, chunk_buf); if (ch == NULL) { SCTP_PRINTF("sctp_process_control: Can't get the all data....\n"); *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } } } num_chunks++; /* Save off the last place we got a control from */ if (stcb != NULL) { if (((netp != NULL) && (*netp != NULL)) || (ch->chunk_type == SCTP_ASCONF)) { /* * allow last_control to be NULL if * ASCONF... ASCONF processing will find the * right net later */ if ((netp != NULL) && (*netp != NULL)) stcb->asoc.last_control_chunk_from = *netp; } } #ifdef SCTP_AUDITING_ENABLED sctp_audit_log(0xB0, ch->chunk_type); #endif /* check to see if this chunk required auth, but isn't */ if ((stcb != NULL) && (stcb->asoc.auth_supported == 1) && sctp_auth_is_required_chunk(ch->chunk_type, stcb->asoc.local_auth_chunks) && !stcb->asoc.authenticated) { /* "silently" ignore */ SCTP_STAT_INCR(sctps_recvauthmissing); goto next_chunk; } switch (ch->chunk_type) { case SCTP_INITIATION: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_INIT\n"); /* The INIT chunk must be the only chunk. */ if ((num_chunks > 1) || (length - *offset > (int)SCTP_SIZE32(chk_length))) { /* RFC 4960 requires that no ABORT is sent */ *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } /* Honor our resource limit. */ if (chk_length > SCTP_LARGEST_INIT_ACCEPTED) { op_err = sctp_generate_cause(SCTP_CAUSE_OUT_OF_RESC, ""); sctp_abort_association(inp, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); *offset = length; return (NULL); } sctp_handle_init(m, iphlen, *offset, src, dst, sh, (struct sctp_init_chunk *)ch, inp, stcb, *netp, &abort_no_unlock, mflowtype, mflowid, vrf_id, port); *offset = length; if ((!abort_no_unlock) && (locked_tcb)) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); break; case SCTP_PAD_CHUNK: break; case SCTP_INITIATION_ACK: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_INIT-ACK\n"); if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) { /* We are not interested anymore */ if ((stcb) && (stcb->asoc.total_output_queue_size)) { ; } else { if ((locked_tcb != NULL) && (locked_tcb != stcb)) { /* Very unlikely */ SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; if (stcb) { #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(inp); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); #endif (void)sctp_free_assoc(inp, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_29); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif } return (NULL); } } /* The INIT-ACK chunk must be the only chunk. */ if ((num_chunks > 1) || (length - *offset > (int)SCTP_SIZE32(chk_length))) { *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } if ((netp) && (*netp)) { ret = sctp_handle_init_ack(m, iphlen, *offset, src, dst, sh, (struct sctp_init_ack_chunk *)ch, stcb, *netp, &abort_no_unlock, mflowtype, mflowid, vrf_id); } else { ret = -1; } *offset = length; if (abort_no_unlock) { return (NULL); } /* * Special case, I must call the output routine to * get the cookie echoed */ if ((stcb != NULL) && (ret == 0)) { sctp_chunk_output(stcb->sctp_ep, stcb, SCTP_OUTPUT_FROM_CONTROL_PROC, SCTP_SO_NOT_LOCKED); } if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); break; case SCTP_SELECTIVE_ACK: { struct sctp_sack_chunk *sack; int abort_now = 0; uint32_t a_rwnd, cum_ack; uint16_t num_seg, num_dup; uint8_t flags; int offset_seg, offset_dup; SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_SACK\n"); SCTP_STAT_INCR(sctps_recvsacks); if (stcb == NULL) { SCTPDBG(SCTP_DEBUG_INDATA1, "No stcb when processing SACK chunk\n"); break; } if (chk_length < sizeof(struct sctp_sack_chunk)) { SCTPDBG(SCTP_DEBUG_INDATA1, "Bad size on SACK chunk, too small\n"); break; } if (SCTP_GET_STATE(&stcb->asoc) == SCTP_STATE_SHUTDOWN_ACK_SENT) { /*- * If we have sent a shutdown-ack, we will pay no * attention to a sack sent in to us since * we don't care anymore. */ break; } sack = (struct sctp_sack_chunk *)ch; flags = ch->chunk_flags; cum_ack = ntohl(sack->sack.cum_tsn_ack); num_seg = ntohs(sack->sack.num_gap_ack_blks); num_dup = ntohs(sack->sack.num_dup_tsns); a_rwnd = (uint32_t) ntohl(sack->sack.a_rwnd); if (sizeof(struct sctp_sack_chunk) + num_seg * sizeof(struct sctp_gap_ack_block) + num_dup * sizeof(uint32_t) != chk_length) { SCTPDBG(SCTP_DEBUG_INDATA1, "Bad size of SACK chunk\n"); break; } offset_seg = *offset + sizeof(struct sctp_sack_chunk); offset_dup = offset_seg + num_seg * sizeof(struct sctp_gap_ack_block); SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_SACK process cum_ack:%x num_seg:%d a_rwnd:%d\n", cum_ack, num_seg, a_rwnd); stcb->asoc.seen_a_sack_this_pkt = 1; if ((stcb->asoc.pr_sctp_cnt == 0) && (num_seg == 0) && SCTP_TSN_GE(cum_ack, stcb->asoc.last_acked_seq) && (stcb->asoc.saw_sack_with_frags == 0) && (stcb->asoc.saw_sack_with_nr_frags == 0) && (!TAILQ_EMPTY(&stcb->asoc.sent_queue)) ) { /* * We have a SIMPLE sack having no * prior segments and data on sent * queue to be acked.. Use the * faster path sack processing. We * also allow window update sacks * with no missing segments to go * this way too. */ sctp_express_handle_sack(stcb, cum_ack, a_rwnd, &abort_now, ecne_seen); } else { if (netp && *netp) sctp_handle_sack(m, offset_seg, offset_dup, stcb, num_seg, 0, num_dup, &abort_now, flags, cum_ack, a_rwnd, ecne_seen); } if (abort_now) { /* ABORT signal from sack processing */ *offset = length; return (NULL); } if (TAILQ_EMPTY(&stcb->asoc.send_queue) && TAILQ_EMPTY(&stcb->asoc.sent_queue) && (stcb->asoc.stream_queue_cnt == 0)) { sctp_ulp_notify(SCTP_NOTIFY_SENDER_DRY, stcb, 0, NULL, SCTP_SO_NOT_LOCKED); } } break; /* * EY - nr_sack: If the received chunk is an * nr_sack chunk */ case SCTP_NR_SELECTIVE_ACK: { struct sctp_nr_sack_chunk *nr_sack; int abort_now = 0; uint32_t a_rwnd, cum_ack; uint16_t num_seg, num_nr_seg, num_dup; uint8_t flags; int offset_seg, offset_dup; SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_NR_SACK\n"); SCTP_STAT_INCR(sctps_recvsacks); if (stcb == NULL) { SCTPDBG(SCTP_DEBUG_INDATA1, "No stcb when processing NR-SACK chunk\n"); break; } if (stcb->asoc.nrsack_supported == 0) { goto unknown_chunk; } if (chk_length < sizeof(struct sctp_nr_sack_chunk)) { SCTPDBG(SCTP_DEBUG_INDATA1, "Bad size on NR-SACK chunk, too small\n"); break; } if (SCTP_GET_STATE(&stcb->asoc) == SCTP_STATE_SHUTDOWN_ACK_SENT) { /*- * If we have sent a shutdown-ack, we will pay no * attention to a sack sent in to us since * we don't care anymore. */ break; } nr_sack = (struct sctp_nr_sack_chunk *)ch; flags = ch->chunk_flags; cum_ack = ntohl(nr_sack->nr_sack.cum_tsn_ack); num_seg = ntohs(nr_sack->nr_sack.num_gap_ack_blks); num_nr_seg = ntohs(nr_sack->nr_sack.num_nr_gap_ack_blks); num_dup = ntohs(nr_sack->nr_sack.num_dup_tsns); a_rwnd = (uint32_t) ntohl(nr_sack->nr_sack.a_rwnd); if (sizeof(struct sctp_nr_sack_chunk) + (num_seg + num_nr_seg) * sizeof(struct sctp_gap_ack_block) + num_dup * sizeof(uint32_t) != chk_length) { SCTPDBG(SCTP_DEBUG_INDATA1, "Bad size of NR_SACK chunk\n"); break; } offset_seg = *offset + sizeof(struct sctp_nr_sack_chunk); offset_dup = offset_seg + num_seg * sizeof(struct sctp_gap_ack_block); SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_NR_SACK process cum_ack:%x num_seg:%d a_rwnd:%d\n", cum_ack, num_seg, a_rwnd); stcb->asoc.seen_a_sack_this_pkt = 1; if ((stcb->asoc.pr_sctp_cnt == 0) && (num_seg == 0) && (num_nr_seg == 0) && SCTP_TSN_GE(cum_ack, stcb->asoc.last_acked_seq) && (stcb->asoc.saw_sack_with_frags == 0) && (stcb->asoc.saw_sack_with_nr_frags == 0) && (!TAILQ_EMPTY(&stcb->asoc.sent_queue))) { /* * We have a SIMPLE sack having no * prior segments and data on sent * queue to be acked. Use the faster * path sack processing. We also * allow window update sacks with no * missing segments to go this way * too. */ sctp_express_handle_sack(stcb, cum_ack, a_rwnd, &abort_now, ecne_seen); } else { if (netp && *netp) sctp_handle_sack(m, offset_seg, offset_dup, stcb, num_seg, num_nr_seg, num_dup, &abort_now, flags, cum_ack, a_rwnd, ecne_seen); } if (abort_now) { /* ABORT signal from sack processing */ *offset = length; return (NULL); } if (TAILQ_EMPTY(&stcb->asoc.send_queue) && TAILQ_EMPTY(&stcb->asoc.sent_queue) && (stcb->asoc.stream_queue_cnt == 0)) { sctp_ulp_notify(SCTP_NOTIFY_SENDER_DRY, stcb, 0, NULL, SCTP_SO_NOT_LOCKED); } } break; case SCTP_HEARTBEAT_REQUEST: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_HEARTBEAT\n"); if ((stcb) && netp && *netp) { SCTP_STAT_INCR(sctps_recvheartbeat); sctp_send_heartbeat_ack(stcb, m, *offset, chk_length, *netp); /* He's alive so give him credit */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; } break; case SCTP_HEARTBEAT_ACK: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_HEARTBEAT-ACK\n"); if ((stcb == NULL) || (chk_length != sizeof(struct sctp_heartbeat_chunk))) { /* Its not ours */ *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } /* He's alive so give him credit */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; SCTP_STAT_INCR(sctps_recvheartbeatack); if (netp && *netp) sctp_handle_heartbeat_ack((struct sctp_heartbeat_chunk *)ch, stcb, *netp); break; case SCTP_ABORT_ASSOCIATION: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_ABORT, stcb %p\n", (void *)stcb); if ((stcb) && netp && *netp) sctp_handle_abort((struct sctp_abort_chunk *)ch, stcb, *netp); *offset = length; return (NULL); break; case SCTP_SHUTDOWN: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_SHUTDOWN, stcb %p\n", (void *)stcb); if ((stcb == NULL) || (chk_length != sizeof(struct sctp_shutdown_chunk))) { *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } if (netp && *netp) { int abort_flag = 0; sctp_handle_shutdown((struct sctp_shutdown_chunk *)ch, stcb, *netp, &abort_flag); if (abort_flag) { *offset = length; return (NULL); } } break; case SCTP_SHUTDOWN_ACK: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_SHUTDOWN-ACK, stcb %p\n", (void *)stcb); if ((stcb) && (netp) && (*netp)) sctp_handle_shutdown_ack((struct sctp_shutdown_ack_chunk *)ch, stcb, *netp); *offset = length; return (NULL); break; case SCTP_OPERATION_ERROR: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_OP-ERR\n"); if ((stcb) && netp && *netp && sctp_handle_error(ch, stcb, *netp) < 0) { *offset = length; return (NULL); } break; case SCTP_COOKIE_ECHO: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_COOKIE-ECHO, stcb %p\n", (void *)stcb); if ((stcb) && (stcb->asoc.total_output_queue_size)) { ; } else { if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) { /* We are not interested anymore */ abend: if (stcb) { SCTP_TCB_UNLOCK(stcb); } *offset = length; return (NULL); } } /* * First are we accepting? We do this again here * since it is possible that a previous endpoint WAS * listening responded to a INIT-ACK and then * closed. We opened and bound.. and are now no * longer listening. */ if ((stcb == NULL) && (inp->sctp_socket->so_qlen >= inp->sctp_socket->so_qlimit)) { if ((inp->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) && (SCTP_BASE_SYSCTL(sctp_abort_if_one_2_one_hits_limit))) { op_err = sctp_generate_cause(SCTP_CAUSE_OUT_OF_RESC, ""); sctp_abort_association(inp, stcb, m, iphlen, src, dst, sh, op_err, mflowtype, mflowid, vrf_id, port); } *offset = length; return (NULL); } else { struct mbuf *ret_buf; struct sctp_inpcb *linp; if (stcb) { linp = NULL; } else { linp = inp; } if (linp) { SCTP_ASOC_CREATE_LOCK(linp); if ((inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) || (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE)) { SCTP_ASOC_CREATE_UNLOCK(linp); goto abend; } } if (netp) { ret_buf = sctp_handle_cookie_echo(m, iphlen, *offset, src, dst, sh, (struct sctp_cookie_echo_chunk *)ch, &inp, &stcb, netp, auth_skipped, auth_offset, auth_len, &locked_tcb, mflowtype, mflowid, vrf_id, port); } else { ret_buf = NULL; } if (linp) { SCTP_ASOC_CREATE_UNLOCK(linp); } if (ret_buf == NULL) { if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } SCTPDBG(SCTP_DEBUG_INPUT3, "GAK, null buffer\n"); *offset = length; return (NULL); } /* if AUTH skipped, see if it verified... */ if (auth_skipped) { got_auth = 1; auth_skipped = 0; } if (!TAILQ_EMPTY(&stcb->asoc.sent_queue)) { /* * Restart the timer if we have * pending data */ struct sctp_tmit_chunk *chk; chk = TAILQ_FIRST(&stcb->asoc.sent_queue); sctp_timer_start(SCTP_TIMER_TYPE_SEND, stcb->sctp_ep, stcb, chk->whoTo); } } break; case SCTP_COOKIE_ACK: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_COOKIE-ACK, stcb %p\n", (void *)stcb); if ((stcb == NULL) || chk_length != sizeof(struct sctp_cookie_ack_chunk)) { if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) { /* We are not interested anymore */ if ((stcb) && (stcb->asoc.total_output_queue_size)) { ; } else if (stcb) { #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(inp); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); #endif (void)sctp_free_assoc(inp, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_30); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif *offset = length; return (NULL); } } /* He's alive so give him credit */ if ((stcb) && netp && *netp) { if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; sctp_handle_cookie_ack((struct sctp_cookie_ack_chunk *)ch, stcb, *netp); } break; case SCTP_ECN_ECHO: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_ECN-ECHO\n"); /* He's alive so give him credit */ if ((stcb == NULL) || (chk_length != sizeof(struct sctp_ecne_chunk))) { /* Its not ours */ if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } if (stcb) { if (stcb->asoc.ecn_supported == 0) { goto unknown_chunk; } if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; sctp_handle_ecn_echo((struct sctp_ecne_chunk *)ch, stcb); ecne_seen = 1; } break; case SCTP_ECN_CWR: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_ECN-CWR\n"); /* He's alive so give him credit */ if ((stcb == NULL) || (chk_length != sizeof(struct sctp_cwr_chunk))) { /* Its not ours */ if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } if (stcb) { if (stcb->asoc.ecn_supported == 0) { goto unknown_chunk; } if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; sctp_handle_ecn_cwr((struct sctp_cwr_chunk *)ch, stcb, *netp); } break; case SCTP_SHUTDOWN_COMPLETE: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_SHUTDOWN-COMPLETE, stcb %p\n", (void *)stcb); /* must be first and only chunk */ if ((num_chunks > 1) || (length - *offset > (int)SCTP_SIZE32(chk_length))) { *offset = length; if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } return (NULL); } if ((stcb) && netp && *netp) { sctp_handle_shutdown_complete((struct sctp_shutdown_complete_chunk *)ch, stcb, *netp); } *offset = length; return (NULL); break; case SCTP_ASCONF: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_ASCONF\n"); /* He's alive so give him credit */ if (stcb) { if (stcb->asoc.asconf_supported == 0) { goto unknown_chunk; } if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; sctp_handle_asconf(m, *offset, src, (struct sctp_asconf_chunk *)ch, stcb, asconf_cnt == 0); asconf_cnt++; } break; case SCTP_ASCONF_ACK: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_ASCONF-ACK\n"); if (chk_length < sizeof(struct sctp_asconf_ack_chunk)) { /* Its not ours */ if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } if ((stcb) && netp && *netp) { if (stcb->asoc.asconf_supported == 0) { goto unknown_chunk; } /* He's alive so give him credit */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; sctp_handle_asconf_ack(m, *offset, (struct sctp_asconf_ack_chunk *)ch, stcb, *netp, &abort_no_unlock); if (abort_no_unlock) return (NULL); } break; case SCTP_FORWARD_CUM_TSN: case SCTP_IFORWARD_CUM_TSN: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_FWD-TSN\n"); if (chk_length < sizeof(struct sctp_forward_tsn_chunk)) { /* Its not ours */ if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } /* He's alive so give him credit */ if (stcb) { int abort_flag = 0; if (stcb->asoc.prsctp_supported == 0) { goto unknown_chunk; } stcb->asoc.overall_error_count = 0; if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } *fwd_tsn_seen = 1; if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) { /* We are not interested anymore */ #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) so = SCTP_INP_SO(inp); atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_SOCKET_LOCK(so, 1); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); #endif (void)sctp_free_assoc(inp, stcb, SCTP_NORMAL_PROC, SCTP_FROM_SCTP_INPUT + SCTP_LOC_31); #if defined(__APPLE__) || defined(SCTP_SO_LOCK_TESTING) SCTP_SOCKET_UNLOCK(so, 1); #endif *offset = length; return (NULL); } sctp_handle_forward_tsn(stcb, (struct sctp_forward_tsn_chunk *)ch, &abort_flag, m, *offset); if (abort_flag) { *offset = length; return (NULL); } else { if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; } } break; case SCTP_STREAM_RESET: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_STREAM_RESET\n"); if (((stcb == NULL) || (ch == NULL) || (chk_length < sizeof(struct sctp_stream_reset_tsn_req)))) { /* Its not ours */ if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } if (stcb->asoc.reconfig_supported == 0) { goto unknown_chunk; } if (sctp_handle_stream_reset(stcb, m, *offset, ch)) { /* stop processing */ *offset = length; return (NULL); } break; case SCTP_PACKET_DROPPED: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_PACKET_DROPPED\n"); /* re-get it all please */ if (chk_length < sizeof(struct sctp_pktdrop_chunk)) { /* Its not ours */ if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } if (ch && (stcb) && netp && (*netp)) { if (stcb->asoc.pktdrop_supported == 0) { goto unknown_chunk; } sctp_handle_packet_dropped((struct sctp_pktdrop_chunk *)ch, stcb, *netp, min(chk_length, (sizeof(chunk_buf) - 4))); } break; case SCTP_AUTHENTICATION: SCTPDBG(SCTP_DEBUG_INPUT3, "SCTP_AUTHENTICATION\n"); if (stcb == NULL) { /* save the first AUTH for later processing */ if (auth_skipped == 0) { auth_offset = *offset; auth_len = chk_length; auth_skipped = 1; } /* skip this chunk (temporarily) */ goto next_chunk; } if (stcb->asoc.auth_supported == 0) { goto unknown_chunk; } if ((chk_length < (sizeof(struct sctp_auth_chunk))) || (chk_length > (sizeof(struct sctp_auth_chunk) + SCTP_AUTH_DIGEST_LEN_MAX))) { /* Its not ours */ if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } if (got_auth == 1) { /* skip this chunk... it's already auth'd */ goto next_chunk; } got_auth = 1; if ((ch == NULL) || sctp_handle_auth(stcb, (struct sctp_auth_chunk *)ch, m, *offset)) { /* auth HMAC failed so dump the packet */ *offset = length; return (stcb); } else { /* remaining chunks are HMAC checked */ stcb->asoc.authenticated = 1; } break; default: unknown_chunk: /* it's an unknown chunk! */ if ((ch->chunk_type & 0x40) && (stcb != NULL)) { struct sctp_gen_error_cause *cause; int len; op_err = sctp_get_mbuf_for_msg(sizeof(struct sctp_gen_error_cause), 0, M_NOWAIT, 1, MT_DATA); if (op_err != NULL) { len = min(SCTP_SIZE32(chk_length), (uint32_t) (length - *offset)); cause = mtod(op_err, struct sctp_gen_error_cause *); cause->code = htons(SCTP_CAUSE_UNRECOG_CHUNK); cause->length = htons((uint16_t) (len + sizeof(struct sctp_gen_error_cause))); SCTP_BUF_LEN(op_err) = sizeof(struct sctp_gen_error_cause); SCTP_BUF_NEXT(op_err) = SCTP_M_COPYM(m, *offset, len, M_NOWAIT); if (SCTP_BUF_NEXT(op_err) != NULL) { #ifdef SCTP_MBUF_LOGGING if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_MBUF_LOGGING_ENABLE) { sctp_log_mbc(SCTP_BUF_NEXT(op_err), SCTP_MBUF_ICOPY); } #endif sctp_queue_op_err(stcb, op_err); } else { sctp_m_freem(op_err); } } } if ((ch->chunk_type & 0x80) == 0) { /* discard this packet */ *offset = length; return (stcb); } /* else skip this bad chunk and continue... */ break; } /* switch (ch->chunk_type) */ next_chunk: /* get the next chunk */ *offset += SCTP_SIZE32(chk_length); if (*offset >= length) { /* no more data left in the mbuf chain */ break; } ch = (struct sctp_chunkhdr *)sctp_m_getptr(m, *offset, sizeof(struct sctp_chunkhdr), chunk_buf); if (ch == NULL) { if (locked_tcb) { SCTP_TCB_UNLOCK(locked_tcb); } *offset = length; return (NULL); } } /* while */ if (asconf_cnt > 0 && stcb != NULL) { sctp_send_asconf_ack(stcb); } return (stcb); } /* * common input chunk processing (v4 and v6) */ void sctp_common_input_processing(struct mbuf **mm, int iphlen, int offset, int length, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_chunkhdr *ch, #if !defined(SCTP_WITH_NO_CSUM) uint8_t compute_crc, #endif uint8_t ecn_bits, uint8_t mflowtype, uint32_t mflowid, uint16_t fibnum, uint32_t vrf_id, uint16_t port) { uint32_t high_tsn; int fwd_tsn_seen = 0, data_processed = 0; struct mbuf *m = *mm, *op_err; char msg[SCTP_DIAG_INFO_LEN]; int un_sent; int cnt_ctrl_ready = 0; struct sctp_inpcb *inp = NULL, *inp_decr = NULL; struct sctp_tcb *stcb = NULL; struct sctp_nets *net = NULL; SCTP_STAT_INCR(sctps_recvdatagrams); #ifdef SCTP_AUDITING_ENABLED sctp_audit_log(0xE0, 1); sctp_auditing(0, inp, stcb, net); #endif #if !defined(SCTP_WITH_NO_CSUM) if (compute_crc != 0) { uint32_t check, calc_check; check = sh->checksum; sh->checksum = 0; calc_check = sctp_calculate_cksum(m, iphlen); sh->checksum = check; if (calc_check != check) { SCTPDBG(SCTP_DEBUG_INPUT1, "Bad CSUM on SCTP packet calc_check:%x check:%x m:%p mlen:%d iphlen:%d\n", calc_check, check, (void *)m, length, iphlen); stcb = sctp_findassociation_addr(m, offset, src, dst, sh, ch, &inp, &net, vrf_id); #if defined(INET) || defined(INET6) if ((ch->chunk_type != SCTP_INITIATION) && (net != NULL) && (net->port != port)) { if (net->port == 0) { /* UDP encapsulation turned on. */ net->mtu -= sizeof(struct udphdr); if (stcb->asoc.smallest_mtu > net->mtu) { sctp_pathmtu_adjustment(stcb, net->mtu); } } else if (port == 0) { /* UDP encapsulation turned off. */ net->mtu += sizeof(struct udphdr); /* XXX Update smallest_mtu */ } net->port = port; } #endif if (net != NULL) { net->flowtype = mflowtype; net->flowid = mflowid; } if ((inp != NULL) && (stcb != NULL)) { sctp_send_packet_dropped(stcb, net, m, length, iphlen, 1); sctp_chunk_output(inp, stcb, SCTP_OUTPUT_FROM_INPUT_ERROR, SCTP_SO_NOT_LOCKED); } else if ((inp != NULL) && (stcb == NULL)) { inp_decr = inp; } SCTP_STAT_INCR(sctps_badsum); SCTP_STAT_INCR_COUNTER32(sctps_checksumerrors); goto out; } } #endif /* Destination port of 0 is illegal, based on RFC4960. */ if (sh->dest_port == 0) { SCTP_STAT_INCR(sctps_hdrops); goto out; } stcb = sctp_findassociation_addr(m, offset, src, dst, sh, ch, &inp, &net, vrf_id); #if defined(INET) || defined(INET6) if ((ch->chunk_type != SCTP_INITIATION) && (net != NULL) && (net->port != port)) { if (net->port == 0) { /* UDP encapsulation turned on. */ net->mtu -= sizeof(struct udphdr); if (stcb->asoc.smallest_mtu > net->mtu) { sctp_pathmtu_adjustment(stcb, net->mtu); } } else if (port == 0) { /* UDP encapsulation turned off. */ net->mtu += sizeof(struct udphdr); /* XXX Update smallest_mtu */ } net->port = port; } #endif if (net != NULL) { net->flowtype = mflowtype; net->flowid = mflowid; } if (inp == NULL) { SCTP_STAT_INCR(sctps_noport); if (badport_bandlim(BANDLIM_SCTP_OOTB) < 0) { goto out; } if (ch->chunk_type == SCTP_SHUTDOWN_ACK) { sctp_send_shutdown_complete2(src, dst, sh, mflowtype, mflowid, fibnum, vrf_id, port); goto out; } if (ch->chunk_type == SCTP_SHUTDOWN_COMPLETE) { goto out; } if (ch->chunk_type != SCTP_ABORT_ASSOCIATION) { if ((SCTP_BASE_SYSCTL(sctp_blackhole) == 0) || ((SCTP_BASE_SYSCTL(sctp_blackhole) == 1) && (ch->chunk_type != SCTP_INIT))) { op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), "Out of the blue"); sctp_send_abort(m, iphlen, src, dst, sh, 0, op_err, mflowtype, mflowid, fibnum, vrf_id, port); } } goto out; } else if (stcb == NULL) { inp_decr = inp; } #ifdef IPSEC /*- * I very much doubt any of the IPSEC stuff will work but I have no * idea, so I will leave it in place. */ if (inp != NULL) { switch (dst->sa_family) { #ifdef INET case AF_INET: if (ipsec4_in_reject(m, &inp->ip_inp.inp)) { SCTP_STAT_INCR(sctps_hdrops); goto out; } break; #endif #ifdef INET6 case AF_INET6: if (ipsec6_in_reject(m, &inp->ip_inp.inp)) { SCTP_STAT_INCR(sctps_hdrops); goto out; } break; #endif default: break; } } #endif SCTPDBG(SCTP_DEBUG_INPUT1, "Ok, Common input processing called, m:%p iphlen:%d offset:%d length:%d stcb:%p\n", (void *)m, iphlen, offset, length, (void *)stcb); if (stcb) { /* always clear this before beginning a packet */ stcb->asoc.authenticated = 0; stcb->asoc.seen_a_sack_this_pkt = 0; SCTPDBG(SCTP_DEBUG_INPUT1, "stcb:%p state:%x\n", (void *)stcb, stcb->asoc.state); if ((stcb->asoc.state & SCTP_STATE_WAS_ABORTED) || (stcb->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED)) { /*- * If we hit here, we had a ref count * up when the assoc was aborted and the * timer is clearing out the assoc, we should * NOT respond to any packet.. its OOTB. */ SCTP_TCB_UNLOCK(stcb); stcb = NULL; snprintf(msg, sizeof(msg), "OOTB, %s:%d at %s", __FILE__, __LINE__, __func__); op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), msg); sctp_handle_ootb(m, iphlen, offset, src, dst, sh, inp, op_err, mflowtype, mflowid, inp->fibnum, vrf_id, port); goto out; } } if (IS_SCTP_CONTROL(ch)) { /* process the control portion of the SCTP packet */ /* sa_ignore NO_NULL_CHK */ stcb = sctp_process_control(m, iphlen, &offset, length, src, dst, sh, ch, inp, stcb, &net, &fwd_tsn_seen, mflowtype, mflowid, fibnum, vrf_id, port); if (stcb) { /* * This covers us if the cookie-echo was there and * it changes our INP. */ inp = stcb->sctp_ep; #if defined(INET) || defined(INET6) if ((ch->chunk_type != SCTP_INITIATION) && (net != NULL) && (net->port != port)) { if (net->port == 0) { /* UDP encapsulation turned on. */ net->mtu -= sizeof(struct udphdr); if (stcb->asoc.smallest_mtu > net->mtu) { sctp_pathmtu_adjustment(stcb, net->mtu); } } else if (port == 0) { /* UDP encapsulation turned off. */ net->mtu += sizeof(struct udphdr); /* XXX Update smallest_mtu */ } net->port = port; } #endif } } else { /* * no control chunks, so pre-process DATA chunks (these * checks are taken care of by control processing) */ /* * if DATA only packet, and auth is required, then punt... * can't have authenticated without any AUTH (control) * chunks */ if ((stcb != NULL) && (stcb->asoc.auth_supported == 1) && sctp_auth_is_required_chunk(SCTP_DATA, stcb->asoc.local_auth_chunks)) { /* "silently" ignore */ SCTP_STAT_INCR(sctps_recvauthmissing); goto out; } if (stcb == NULL) { /* out of the blue DATA chunk */ snprintf(msg, sizeof(msg), "OOTB, %s:%d at %s", __FILE__, __LINE__, __func__); op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), msg); sctp_handle_ootb(m, iphlen, offset, src, dst, sh, inp, op_err, mflowtype, mflowid, fibnum, vrf_id, port); goto out; } if (stcb->asoc.my_vtag != ntohl(sh->v_tag)) { /* v_tag mismatch! */ SCTP_STAT_INCR(sctps_badvtag); goto out; } } if (stcb == NULL) { /* * no valid TCB for this packet, or we found it's a bad * packet while processing control, or we're done with this * packet (done or skip rest of data), so we drop it... */ goto out; } /* * DATA chunk processing */ /* plow through the data chunks while length > offset */ /* * Rest should be DATA only. Check authentication state if AUTH for * DATA is required. */ if ((length > offset) && (stcb != NULL) && (stcb->asoc.auth_supported == 1) && sctp_auth_is_required_chunk(SCTP_DATA, stcb->asoc.local_auth_chunks) && !stcb->asoc.authenticated) { /* "silently" ignore */ SCTP_STAT_INCR(sctps_recvauthmissing); SCTPDBG(SCTP_DEBUG_AUTH1, "Data chunk requires AUTH, skipped\n"); goto trigger_send; } if (length > offset) { int retval; /* * First check to make sure our state is correct. We would * not get here unless we really did have a tag, so we don't * abort if this happens, just dump the chunk silently. */ switch (SCTP_GET_STATE(&stcb->asoc)) { case SCTP_STATE_COOKIE_ECHOED: /* * we consider data with valid tags in this state * shows us the cookie-ack was lost. Imply it was * there. */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_THRESHOLD_LOGGING) { sctp_misc_ints(SCTP_THRESHOLD_CLEAR, stcb->asoc.overall_error_count, 0, SCTP_FROM_SCTP_INPUT, __LINE__); } stcb->asoc.overall_error_count = 0; sctp_handle_cookie_ack((struct sctp_cookie_ack_chunk *)ch, stcb, net); break; case SCTP_STATE_COOKIE_WAIT: /* * We consider OOTB any data sent during asoc setup. */ snprintf(msg, sizeof(msg), "OOTB, %s:%d at %s", __FILE__, __LINE__, __func__); op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), msg); sctp_handle_ootb(m, iphlen, offset, src, dst, sh, inp, op_err, mflowtype, mflowid, inp->fibnum, vrf_id, port); goto out; /* sa_ignore NOTREACHED */ break; case SCTP_STATE_EMPTY: /* should not happen */ case SCTP_STATE_INUSE: /* should not happen */ case SCTP_STATE_SHUTDOWN_RECEIVED: /* This is a peer error */ case SCTP_STATE_SHUTDOWN_ACK_SENT: default: goto out; /* sa_ignore NOTREACHED */ break; case SCTP_STATE_OPEN: case SCTP_STATE_SHUTDOWN_SENT: break; } /* plow through the data chunks while length > offset */ retval = sctp_process_data(mm, iphlen, &offset, length, inp, stcb, net, &high_tsn); if (retval == 2) { /* * The association aborted, NO UNLOCK needed since * the association is destroyed. */ stcb = NULL; goto out; } data_processed = 1; /* * Anything important needs to have been m_copy'ed in * process_data */ } /* take care of ecn */ if ((data_processed == 1) && (stcb->asoc.ecn_supported == 1) && ((ecn_bits & SCTP_CE_BITS) == SCTP_CE_BITS)) { /* Yep, we need to add a ECNE */ sctp_send_ecn_echo(stcb, net, high_tsn); } if ((data_processed == 0) && (fwd_tsn_seen)) { int was_a_gap; uint32_t highest_tsn; if (SCTP_TSN_GT(stcb->asoc.highest_tsn_inside_nr_map, stcb->asoc.highest_tsn_inside_map)) { highest_tsn = stcb->asoc.highest_tsn_inside_nr_map; } else { highest_tsn = stcb->asoc.highest_tsn_inside_map; } was_a_gap = SCTP_TSN_GT(highest_tsn, stcb->asoc.cumulative_tsn); stcb->asoc.send_sack = 1; sctp_sack_check(stcb, was_a_gap); } else if (fwd_tsn_seen) { stcb->asoc.send_sack = 1; } /* trigger send of any chunks in queue... */ trigger_send: #ifdef SCTP_AUDITING_ENABLED sctp_audit_log(0xE0, 2); sctp_auditing(1, inp, stcb, net); #endif SCTPDBG(SCTP_DEBUG_INPUT1, "Check for chunk output prw:%d tqe:%d tf=%d\n", stcb->asoc.peers_rwnd, TAILQ_EMPTY(&stcb->asoc.control_send_queue), stcb->asoc.total_flight); un_sent = (stcb->asoc.total_output_queue_size - stcb->asoc.total_flight); if (!TAILQ_EMPTY(&stcb->asoc.control_send_queue)) { cnt_ctrl_ready = stcb->asoc.ctrl_queue_cnt - stcb->asoc.ecn_echo_cnt_onq; } if (!TAILQ_EMPTY(&stcb->asoc.asconf_send_queue) || cnt_ctrl_ready || stcb->asoc.trigger_reset || ((un_sent) && (stcb->asoc.peers_rwnd > 0 || (stcb->asoc.peers_rwnd <= 0 && stcb->asoc.total_flight == 0)))) { SCTPDBG(SCTP_DEBUG_INPUT3, "Calling chunk OUTPUT\n"); sctp_chunk_output(inp, stcb, SCTP_OUTPUT_FROM_CONTROL_PROC, SCTP_SO_NOT_LOCKED); SCTPDBG(SCTP_DEBUG_INPUT3, "chunk OUTPUT returns\n"); } #ifdef SCTP_AUDITING_ENABLED sctp_audit_log(0xE0, 3); sctp_auditing(2, inp, stcb, net); #endif out: if (stcb != NULL) { SCTP_TCB_UNLOCK(stcb); } if (inp_decr != NULL) { /* reduce ref-count */ SCTP_INP_WLOCK(inp_decr); SCTP_INP_DECR_REF(inp_decr); SCTP_INP_WUNLOCK(inp_decr); } return; } #ifdef INET void sctp_input_with_port(struct mbuf *i_pak, int off, uint16_t port) { struct mbuf *m; int iphlen; uint32_t vrf_id = 0; uint8_t ecn_bits; struct sockaddr_in src, dst; struct ip *ip; struct sctphdr *sh; struct sctp_chunkhdr *ch; int length, offset; #if !defined(SCTP_WITH_NO_CSUM) uint8_t compute_crc; #endif uint32_t mflowid; uint8_t mflowtype; uint16_t fibnum; iphlen = off; if (SCTP_GET_PKT_VRFID(i_pak, vrf_id)) { SCTP_RELEASE_PKT(i_pak); return; } m = SCTP_HEADER_TO_CHAIN(i_pak); #ifdef SCTP_MBUF_LOGGING /* Log in any input mbufs */ if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_MBUF_LOGGING_ENABLE) { sctp_log_mbc(m, SCTP_MBUF_INPUT); } #endif #ifdef SCTP_PACKET_LOGGING if (SCTP_BASE_SYSCTL(sctp_logging_level) & SCTP_LAST_PACKET_TRACING) { sctp_packet_log(m); } #endif SCTPDBG(SCTP_DEBUG_CRCOFFLOAD, "sctp_input(): Packet of length %d received on %s with csum_flags 0x%b.\n", m->m_pkthdr.len, if_name(m->m_pkthdr.rcvif), (int)m->m_pkthdr.csum_flags, CSUM_BITS); mflowid = m->m_pkthdr.flowid; mflowtype = M_HASHTYPE_GET(m); fibnum = M_GETFIB(m); SCTP_STAT_INCR(sctps_recvpackets); SCTP_STAT_INCR_COUNTER64(sctps_inpackets); /* Get IP, SCTP, and first chunk header together in the first mbuf. */ offset = iphlen + sizeof(struct sctphdr) + sizeof(struct sctp_chunkhdr); if (SCTP_BUF_LEN(m) < offset) { if ((m = m_pullup(m, offset)) == NULL) { SCTP_STAT_INCR(sctps_hdrops); return; } } ip = mtod(m, struct ip *); sh = (struct sctphdr *)((caddr_t)ip + iphlen); ch = (struct sctp_chunkhdr *)((caddr_t)sh + sizeof(struct sctphdr)); offset -= sizeof(struct sctp_chunkhdr); memset(&src, 0, sizeof(struct sockaddr_in)); src.sin_family = AF_INET; src.sin_len = sizeof(struct sockaddr_in); src.sin_port = sh->src_port; src.sin_addr = ip->ip_src; memset(&dst, 0, sizeof(struct sockaddr_in)); dst.sin_family = AF_INET; dst.sin_len = sizeof(struct sockaddr_in); dst.sin_port = sh->dest_port; dst.sin_addr = ip->ip_dst; length = ntohs(ip->ip_len); /* Validate mbuf chain length with IP payload length. */ if (SCTP_HEADER_LEN(m) != length) { SCTPDBG(SCTP_DEBUG_INPUT1, "sctp_input() length:%d reported length:%d\n", length, SCTP_HEADER_LEN(m)); SCTP_STAT_INCR(sctps_hdrops); goto out; } /* SCTP does not allow broadcasts or multicasts */ if (IN_MULTICAST(ntohl(dst.sin_addr.s_addr))) { goto out; } if (SCTP_IS_IT_BROADCAST(dst.sin_addr, m)) { goto out; } ecn_bits = ip->ip_tos; #if defined(SCTP_WITH_NO_CSUM) SCTP_STAT_INCR(sctps_recvnocrc); #else if (m->m_pkthdr.csum_flags & CSUM_SCTP_VALID) { SCTP_STAT_INCR(sctps_recvhwcrc); compute_crc = 0; } else { SCTP_STAT_INCR(sctps_recvswcrc); compute_crc = 1; } #endif sctp_common_input_processing(&m, iphlen, offset, length, (struct sockaddr *)&src, (struct sockaddr *)&dst, sh, ch, #if !defined(SCTP_WITH_NO_CSUM) compute_crc, #endif ecn_bits, mflowtype, mflowid, fibnum, vrf_id, port); out: if (m) { sctp_m_freem(m); } return; } #if defined(__FreeBSD__) && defined(SCTP_MCORE_INPUT) && defined(SMP) extern int *sctp_cpuarry; #endif int sctp_input(struct mbuf **mp, int *offp, int proto SCTP_UNUSED) { struct mbuf *m; int off; m = *mp; off = *offp; #if defined(__FreeBSD__) && defined(SCTP_MCORE_INPUT) && defined(SMP) if (mp_ncpus > 1) { struct ip *ip; struct sctphdr *sh; int offset; int cpu_to_use; uint32_t flowid, tag; if (M_HASHTYPE_GET(m) != M_HASHTYPE_NONE) { flowid = m->m_pkthdr.flowid; } else { /* * No flow id built by lower layers fix it so we * create one. */ offset = off + sizeof(struct sctphdr); if (SCTP_BUF_LEN(m) < offset) { if ((m = m_pullup(m, offset)) == NULL) { SCTP_STAT_INCR(sctps_hdrops); return (IPPROTO_DONE); } } ip = mtod(m, struct ip *); sh = (struct sctphdr *)((caddr_t)ip + off); tag = htonl(sh->v_tag); flowid = tag ^ ntohs(sh->dest_port) ^ ntohs(sh->src_port); m->m_pkthdr.flowid = flowid; - M_HASHTYPE_SET(m, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(m, M_HASHTYPE_OPAQUE_HASH); } cpu_to_use = sctp_cpuarry[flowid % mp_ncpus]; sctp_queue_to_mcore(m, off, cpu_to_use); return (IPPROTO_DONE); } #endif sctp_input_with_port(m, off, 0); return (IPPROTO_DONE); } #endif Index: projects/vnet/sys/netinet/sctp_pcb.c =================================================================== --- projects/vnet/sys/netinet/sctp_pcb.c (revision 301546) +++ projects/vnet/sys/netinet/sctp_pcb.c (revision 301547) @@ -1,7117 +1,7117 @@ /*- * Copyright (c) 2001-2008, by Cisco Systems, Inc. All rights reserved. * Copyright (c) 2008-2012, by Randall Stewart. All rights reserved. * Copyright (c) 2008-2012, by Michael Tuexen. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * a) Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * b) Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in * the documentation and/or other materials provided with the distribution. * * c) Neither the name of Cisco Systems, Inc. nor the names of its * contributors may be used to endorse or promote products derived * from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF * THE POSSIBILITY OF SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(INET) || defined(INET6) #include #endif #ifdef INET6 #include #endif #include #include #include VNET_DEFINE(struct sctp_base_info, system_base_info); /* FIX: we don't handle multiple link local scopes */ /* "scopeless" replacement IN6_ARE_ADDR_EQUAL */ #ifdef INET6 int SCTP6_ARE_ADDR_EQUAL(struct sockaddr_in6 *a, struct sockaddr_in6 *b) { struct sockaddr_in6 tmp_a, tmp_b; memcpy(&tmp_a, a, sizeof(struct sockaddr_in6)); if (sa6_embedscope(&tmp_a, MODULE_GLOBAL(ip6_use_defzone)) != 0) { return (0); } memcpy(&tmp_b, b, sizeof(struct sockaddr_in6)); if (sa6_embedscope(&tmp_b, MODULE_GLOBAL(ip6_use_defzone)) != 0) { return (0); } return (IN6_ARE_ADDR_EQUAL(&tmp_a.sin6_addr, &tmp_b.sin6_addr)); } #endif void sctp_fill_pcbinfo(struct sctp_pcbinfo *spcb) { /* * We really don't need to lock this, but I will just because it * does not hurt. */ SCTP_INP_INFO_RLOCK(); spcb->ep_count = SCTP_BASE_INFO(ipi_count_ep); spcb->asoc_count = SCTP_BASE_INFO(ipi_count_asoc); spcb->laddr_count = SCTP_BASE_INFO(ipi_count_laddr); spcb->raddr_count = SCTP_BASE_INFO(ipi_count_raddr); spcb->chk_count = SCTP_BASE_INFO(ipi_count_chunk); spcb->readq_count = SCTP_BASE_INFO(ipi_count_readq); spcb->stream_oque = SCTP_BASE_INFO(ipi_count_strmoq); spcb->free_chunks = SCTP_BASE_INFO(ipi_free_chunks); SCTP_INP_INFO_RUNLOCK(); } /*- * Addresses are added to VRF's (Virtual Router's). For BSD we * have only the default VRF 0. We maintain a hash list of * VRF's. Each VRF has its own list of sctp_ifn's. Each of * these has a list of addresses. When we add a new address * to a VRF we lookup the ifn/ifn_index, if the ifn does * not exist we create it and add it to the list of IFN's * within the VRF. Once we have the sctp_ifn, we add the * address to the list. So we look something like: * * hash-vrf-table * vrf-> ifn-> ifn -> ifn * vrf | * ... +--ifa-> ifa -> ifa * vrf * * We keep these separate lists since the SCTP subsystem will * point to these from its source address selection nets structure. * When an address is deleted it does not happen right away on * the SCTP side, it gets scheduled. What we do when a * delete happens is immediately remove the address from * the master list and decrement the refcount. As our * addip iterator works through and frees the src address * selection pointing to the sctp_ifa, eventually the refcount * will reach 0 and we will delete it. Note that it is assumed * that any locking on system level ifn/ifa is done at the * caller of these functions and these routines will only * lock the SCTP structures as they add or delete things. * * Other notes on VRF concepts. * - An endpoint can be in multiple VRF's * - An association lives within a VRF and only one VRF. * - Any incoming packet we can deduce the VRF for by * looking at the mbuf/pak inbound (for BSD its VRF=0 :D) * - Any downward send call or connect call must supply the * VRF via ancillary data or via some sort of set default * VRF socket option call (again for BSD no brainer since * the VRF is always 0). * - An endpoint may add multiple VRF's to it. * - Listening sockets can accept associations in any * of the VRF's they are in but the assoc will end up * in only one VRF (gotten from the packet or connect/send). * */ struct sctp_vrf * sctp_allocate_vrf(int vrf_id) { struct sctp_vrf *vrf = NULL; struct sctp_vrflist *bucket; /* First allocate the VRF structure */ vrf = sctp_find_vrf(vrf_id); if (vrf) { /* Already allocated */ return (vrf); } SCTP_MALLOC(vrf, struct sctp_vrf *, sizeof(struct sctp_vrf), SCTP_M_VRF); if (vrf == NULL) { /* No memory */ #ifdef INVARIANTS panic("No memory for VRF:%d", vrf_id); #endif return (NULL); } /* setup the VRF */ memset(vrf, 0, sizeof(struct sctp_vrf)); vrf->vrf_id = vrf_id; LIST_INIT(&vrf->ifnlist); vrf->total_ifa_count = 0; vrf->refcount = 0; /* now also setup table ids */ SCTP_INIT_VRF_TABLEID(vrf); /* Init the HASH of addresses */ vrf->vrf_addr_hash = SCTP_HASH_INIT(SCTP_VRF_ADDR_HASH_SIZE, &vrf->vrf_addr_hashmark); if (vrf->vrf_addr_hash == NULL) { /* No memory */ #ifdef INVARIANTS panic("No memory for VRF:%d", vrf_id); #endif SCTP_FREE(vrf, SCTP_M_VRF); return (NULL); } /* Add it to the hash table */ bucket = &SCTP_BASE_INFO(sctp_vrfhash)[(vrf_id & SCTP_BASE_INFO(hashvrfmark))]; LIST_INSERT_HEAD(bucket, vrf, next_vrf); atomic_add_int(&SCTP_BASE_INFO(ipi_count_vrfs), 1); return (vrf); } struct sctp_ifn * sctp_find_ifn(void *ifn, uint32_t ifn_index) { struct sctp_ifn *sctp_ifnp; struct sctp_ifnlist *hash_ifn_head; /* * We assume the lock is held for the addresses if that's wrong * problems could occur :-) */ hash_ifn_head = &SCTP_BASE_INFO(vrf_ifn_hash)[(ifn_index & SCTP_BASE_INFO(vrf_ifn_hashmark))]; LIST_FOREACH(sctp_ifnp, hash_ifn_head, next_bucket) { if (sctp_ifnp->ifn_index == ifn_index) { return (sctp_ifnp); } if (sctp_ifnp->ifn_p && ifn && (sctp_ifnp->ifn_p == ifn)) { return (sctp_ifnp); } } return (NULL); } struct sctp_vrf * sctp_find_vrf(uint32_t vrf_id) { struct sctp_vrflist *bucket; struct sctp_vrf *liste; bucket = &SCTP_BASE_INFO(sctp_vrfhash)[(vrf_id & SCTP_BASE_INFO(hashvrfmark))]; LIST_FOREACH(liste, bucket, next_vrf) { if (vrf_id == liste->vrf_id) { return (liste); } } return (NULL); } void sctp_free_vrf(struct sctp_vrf *vrf) { if (SCTP_DECREMENT_AND_CHECK_REFCOUNT(&vrf->refcount)) { if (vrf->vrf_addr_hash) { SCTP_HASH_FREE(vrf->vrf_addr_hash, vrf->vrf_addr_hashmark); vrf->vrf_addr_hash = NULL; } /* We zero'd the count */ LIST_REMOVE(vrf, next_vrf); SCTP_FREE(vrf, SCTP_M_VRF); atomic_subtract_int(&SCTP_BASE_INFO(ipi_count_vrfs), 1); } } void sctp_free_ifn(struct sctp_ifn *sctp_ifnp) { if (SCTP_DECREMENT_AND_CHECK_REFCOUNT(&sctp_ifnp->refcount)) { /* We zero'd the count */ if (sctp_ifnp->vrf) { sctp_free_vrf(sctp_ifnp->vrf); } SCTP_FREE(sctp_ifnp, SCTP_M_IFN); atomic_subtract_int(&SCTP_BASE_INFO(ipi_count_ifns), 1); } } void sctp_update_ifn_mtu(uint32_t ifn_index, uint32_t mtu) { struct sctp_ifn *sctp_ifnp; sctp_ifnp = sctp_find_ifn((void *)NULL, ifn_index); if (sctp_ifnp != NULL) { sctp_ifnp->ifn_mtu = mtu; } } void sctp_free_ifa(struct sctp_ifa *sctp_ifap) { if (SCTP_DECREMENT_AND_CHECK_REFCOUNT(&sctp_ifap->refcount)) { /* We zero'd the count */ if (sctp_ifap->ifn_p) { sctp_free_ifn(sctp_ifap->ifn_p); } SCTP_FREE(sctp_ifap, SCTP_M_IFA); atomic_subtract_int(&SCTP_BASE_INFO(ipi_count_ifas), 1); } } static void sctp_delete_ifn(struct sctp_ifn *sctp_ifnp, int hold_addr_lock) { struct sctp_ifn *found; found = sctp_find_ifn(sctp_ifnp->ifn_p, sctp_ifnp->ifn_index); if (found == NULL) { /* Not in the list.. sorry */ return; } if (hold_addr_lock == 0) SCTP_IPI_ADDR_WLOCK(); LIST_REMOVE(sctp_ifnp, next_bucket); LIST_REMOVE(sctp_ifnp, next_ifn); SCTP_DEREGISTER_INTERFACE(sctp_ifnp->ifn_index, sctp_ifnp->registered_af); if (hold_addr_lock == 0) SCTP_IPI_ADDR_WUNLOCK(); /* Take away the reference, and possibly free it */ sctp_free_ifn(sctp_ifnp); } void sctp_mark_ifa_addr_down(uint32_t vrf_id, struct sockaddr *addr, const char *if_name, uint32_t ifn_index) { struct sctp_vrf *vrf; struct sctp_ifa *sctp_ifap; SCTP_IPI_ADDR_RLOCK(); vrf = sctp_find_vrf(vrf_id); if (vrf == NULL) { SCTPDBG(SCTP_DEBUG_PCB4, "Can't find vrf_id 0x%x\n", vrf_id); goto out; } sctp_ifap = sctp_find_ifa_by_addr(addr, vrf->vrf_id, SCTP_ADDR_LOCKED); if (sctp_ifap == NULL) { SCTPDBG(SCTP_DEBUG_PCB4, "Can't find sctp_ifap for address\n"); goto out; } if (sctp_ifap->ifn_p == NULL) { SCTPDBG(SCTP_DEBUG_PCB4, "IFA has no IFN - can't mark unusable\n"); goto out; } if (if_name) { if (strncmp(if_name, sctp_ifap->ifn_p->ifn_name, SCTP_IFNAMSIZ) != 0) { SCTPDBG(SCTP_DEBUG_PCB4, "IFN %s of IFA not the same as %s\n", sctp_ifap->ifn_p->ifn_name, if_name); goto out; } } else { if (sctp_ifap->ifn_p->ifn_index != ifn_index) { SCTPDBG(SCTP_DEBUG_PCB4, "IFA owned by ifn_index:%d down command for ifn_index:%d - ignored\n", sctp_ifap->ifn_p->ifn_index, ifn_index); goto out; } } sctp_ifap->localifa_flags &= (~SCTP_ADDR_VALID); sctp_ifap->localifa_flags |= SCTP_ADDR_IFA_UNUSEABLE; out: SCTP_IPI_ADDR_RUNLOCK(); } void sctp_mark_ifa_addr_up(uint32_t vrf_id, struct sockaddr *addr, const char *if_name, uint32_t ifn_index) { struct sctp_vrf *vrf; struct sctp_ifa *sctp_ifap; SCTP_IPI_ADDR_RLOCK(); vrf = sctp_find_vrf(vrf_id); if (vrf == NULL) { SCTPDBG(SCTP_DEBUG_PCB4, "Can't find vrf_id 0x%x\n", vrf_id); goto out; } sctp_ifap = sctp_find_ifa_by_addr(addr, vrf->vrf_id, SCTP_ADDR_LOCKED); if (sctp_ifap == NULL) { SCTPDBG(SCTP_DEBUG_PCB4, "Can't find sctp_ifap for address\n"); goto out; } if (sctp_ifap->ifn_p == NULL) { SCTPDBG(SCTP_DEBUG_PCB4, "IFA has no IFN - can't mark unusable\n"); goto out; } if (if_name) { if (strncmp(if_name, sctp_ifap->ifn_p->ifn_name, SCTP_IFNAMSIZ) != 0) { SCTPDBG(SCTP_DEBUG_PCB4, "IFN %s of IFA not the same as %s\n", sctp_ifap->ifn_p->ifn_name, if_name); goto out; } } else { if (sctp_ifap->ifn_p->ifn_index != ifn_index) { SCTPDBG(SCTP_DEBUG_PCB4, "IFA owned by ifn_index:%d down command for ifn_index:%d - ignored\n", sctp_ifap->ifn_p->ifn_index, ifn_index); goto out; } } sctp_ifap->localifa_flags &= (~SCTP_ADDR_IFA_UNUSEABLE); sctp_ifap->localifa_flags |= SCTP_ADDR_VALID; out: SCTP_IPI_ADDR_RUNLOCK(); } /*- * Add an ifa to an ifn. * Register the interface as necessary. * NOTE: ADDR write lock MUST be held. */ static void sctp_add_ifa_to_ifn(struct sctp_ifn *sctp_ifnp, struct sctp_ifa *sctp_ifap) { int ifa_af; LIST_INSERT_HEAD(&sctp_ifnp->ifalist, sctp_ifap, next_ifa); sctp_ifap->ifn_p = sctp_ifnp; atomic_add_int(&sctp_ifap->ifn_p->refcount, 1); /* update address counts */ sctp_ifnp->ifa_count++; ifa_af = sctp_ifap->address.sa.sa_family; switch (ifa_af) { #ifdef INET case AF_INET: sctp_ifnp->num_v4++; break; #endif #ifdef INET6 case AF_INET6: sctp_ifnp->num_v6++; break; #endif default: break; } if (sctp_ifnp->ifa_count == 1) { /* register the new interface */ SCTP_REGISTER_INTERFACE(sctp_ifnp->ifn_index, ifa_af); sctp_ifnp->registered_af = ifa_af; } } /*- * Remove an ifa from its ifn. * If no more addresses exist, remove the ifn too. Otherwise, re-register * the interface based on the remaining address families left. * NOTE: ADDR write lock MUST be held. */ static void sctp_remove_ifa_from_ifn(struct sctp_ifa *sctp_ifap) { LIST_REMOVE(sctp_ifap, next_ifa); if (sctp_ifap->ifn_p) { /* update address counts */ sctp_ifap->ifn_p->ifa_count--; switch (sctp_ifap->address.sa.sa_family) { #ifdef INET case AF_INET: sctp_ifap->ifn_p->num_v4--; break; #endif #ifdef INET6 case AF_INET6: sctp_ifap->ifn_p->num_v6--; break; #endif default: break; } if (LIST_EMPTY(&sctp_ifap->ifn_p->ifalist)) { /* remove the ifn, possibly freeing it */ sctp_delete_ifn(sctp_ifap->ifn_p, SCTP_ADDR_LOCKED); } else { /* re-register address family type, if needed */ if ((sctp_ifap->ifn_p->num_v6 == 0) && (sctp_ifap->ifn_p->registered_af == AF_INET6)) { SCTP_DEREGISTER_INTERFACE(sctp_ifap->ifn_p->ifn_index, AF_INET6); SCTP_REGISTER_INTERFACE(sctp_ifap->ifn_p->ifn_index, AF_INET); sctp_ifap->ifn_p->registered_af = AF_INET; } else if ((sctp_ifap->ifn_p->num_v4 == 0) && (sctp_ifap->ifn_p->registered_af == AF_INET)) { SCTP_DEREGISTER_INTERFACE(sctp_ifap->ifn_p->ifn_index, AF_INET); SCTP_REGISTER_INTERFACE(sctp_ifap->ifn_p->ifn_index, AF_INET6); sctp_ifap->ifn_p->registered_af = AF_INET6; } /* free the ifn refcount */ sctp_free_ifn(sctp_ifap->ifn_p); } sctp_ifap->ifn_p = NULL; } } struct sctp_ifa * sctp_add_addr_to_vrf(uint32_t vrf_id, void *ifn, uint32_t ifn_index, uint32_t ifn_type, const char *if_name, void *ifa, struct sockaddr *addr, uint32_t ifa_flags, int dynamic_add) { struct sctp_vrf *vrf; struct sctp_ifn *sctp_ifnp = NULL; struct sctp_ifa *sctp_ifap = NULL; struct sctp_ifalist *hash_addr_head; struct sctp_ifnlist *hash_ifn_head; uint32_t hash_of_addr; int new_ifn_af = 0; #ifdef SCTP_DEBUG SCTPDBG(SCTP_DEBUG_PCB4, "vrf_id 0x%x: adding address: ", vrf_id); SCTPDBG_ADDR(SCTP_DEBUG_PCB4, addr); #endif SCTP_IPI_ADDR_WLOCK(); sctp_ifnp = sctp_find_ifn(ifn, ifn_index); if (sctp_ifnp) { vrf = sctp_ifnp->vrf; } else { vrf = sctp_find_vrf(vrf_id); if (vrf == NULL) { vrf = sctp_allocate_vrf(vrf_id); if (vrf == NULL) { SCTP_IPI_ADDR_WUNLOCK(); return (NULL); } } } if (sctp_ifnp == NULL) { /* * build one and add it, can't hold lock until after malloc * done though. */ SCTP_IPI_ADDR_WUNLOCK(); SCTP_MALLOC(sctp_ifnp, struct sctp_ifn *, sizeof(struct sctp_ifn), SCTP_M_IFN); if (sctp_ifnp == NULL) { #ifdef INVARIANTS panic("No memory for IFN"); #endif return (NULL); } memset(sctp_ifnp, 0, sizeof(struct sctp_ifn)); sctp_ifnp->ifn_index = ifn_index; sctp_ifnp->ifn_p = ifn; sctp_ifnp->ifn_type = ifn_type; sctp_ifnp->refcount = 0; sctp_ifnp->vrf = vrf; atomic_add_int(&vrf->refcount, 1); sctp_ifnp->ifn_mtu = SCTP_GATHER_MTU_FROM_IFN_INFO(ifn, ifn_index, addr->sa_family); if (if_name != NULL) { snprintf(sctp_ifnp->ifn_name, SCTP_IFNAMSIZ, "%s", if_name); } else { snprintf(sctp_ifnp->ifn_name, SCTP_IFNAMSIZ, "%s", "unknown"); } hash_ifn_head = &SCTP_BASE_INFO(vrf_ifn_hash)[(ifn_index & SCTP_BASE_INFO(vrf_ifn_hashmark))]; LIST_INIT(&sctp_ifnp->ifalist); SCTP_IPI_ADDR_WLOCK(); LIST_INSERT_HEAD(hash_ifn_head, sctp_ifnp, next_bucket); LIST_INSERT_HEAD(&vrf->ifnlist, sctp_ifnp, next_ifn); atomic_add_int(&SCTP_BASE_INFO(ipi_count_ifns), 1); new_ifn_af = 1; } sctp_ifap = sctp_find_ifa_by_addr(addr, vrf->vrf_id, SCTP_ADDR_LOCKED); if (sctp_ifap) { /* Hmm, it already exists? */ if ((sctp_ifap->ifn_p) && (sctp_ifap->ifn_p->ifn_index == ifn_index)) { SCTPDBG(SCTP_DEBUG_PCB4, "Using existing ifn %s (0x%x) for ifa %p\n", sctp_ifap->ifn_p->ifn_name, ifn_index, (void *)sctp_ifap); if (new_ifn_af) { /* Remove the created one that we don't want */ sctp_delete_ifn(sctp_ifnp, SCTP_ADDR_LOCKED); } if (sctp_ifap->localifa_flags & SCTP_BEING_DELETED) { /* easy to solve, just switch back to active */ SCTPDBG(SCTP_DEBUG_PCB4, "Clearing deleted ifa flag\n"); sctp_ifap->localifa_flags = SCTP_ADDR_VALID; sctp_ifap->ifn_p = sctp_ifnp; atomic_add_int(&sctp_ifap->ifn_p->refcount, 1); } exit_stage_left: SCTP_IPI_ADDR_WUNLOCK(); return (sctp_ifap); } else { if (sctp_ifap->ifn_p) { /* * The last IFN gets the address, remove the * old one */ SCTPDBG(SCTP_DEBUG_PCB4, "Moving ifa %p from %s (0x%x) to %s (0x%x)\n", (void *)sctp_ifap, sctp_ifap->ifn_p->ifn_name, sctp_ifap->ifn_p->ifn_index, if_name, ifn_index); /* remove the address from the old ifn */ sctp_remove_ifa_from_ifn(sctp_ifap); /* move the address over to the new ifn */ sctp_add_ifa_to_ifn(sctp_ifnp, sctp_ifap); goto exit_stage_left; } else { /* repair ifnp which was NULL ? */ sctp_ifap->localifa_flags = SCTP_ADDR_VALID; SCTPDBG(SCTP_DEBUG_PCB4, "Repairing ifn %p for ifa %p\n", (void *)sctp_ifnp, (void *)sctp_ifap); sctp_add_ifa_to_ifn(sctp_ifnp, sctp_ifap); } goto exit_stage_left; } } SCTP_IPI_ADDR_WUNLOCK(); SCTP_MALLOC(sctp_ifap, struct sctp_ifa *, sizeof(struct sctp_ifa), SCTP_M_IFA); if (sctp_ifap == NULL) { #ifdef INVARIANTS panic("No memory for IFA"); #endif return (NULL); } memset(sctp_ifap, 0, sizeof(struct sctp_ifa)); sctp_ifap->ifn_p = sctp_ifnp; atomic_add_int(&sctp_ifnp->refcount, 1); sctp_ifap->vrf_id = vrf_id; sctp_ifap->ifa = ifa; memcpy(&sctp_ifap->address, addr, addr->sa_len); sctp_ifap->localifa_flags = SCTP_ADDR_VALID | SCTP_ADDR_DEFER_USE; sctp_ifap->flags = ifa_flags; /* Set scope */ switch (sctp_ifap->address.sa.sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin; sin = &sctp_ifap->address.sin; if (SCTP_IFN_IS_IFT_LOOP(sctp_ifap->ifn_p) || (IN4_ISLOOPBACK_ADDRESS(&sin->sin_addr))) { sctp_ifap->src_is_loop = 1; } if ((IN4_ISPRIVATE_ADDRESS(&sin->sin_addr))) { sctp_ifap->src_is_priv = 1; } sctp_ifnp->num_v4++; if (new_ifn_af) new_ifn_af = AF_INET; break; } #endif #ifdef INET6 case AF_INET6: { /* ok to use deprecated addresses? */ struct sockaddr_in6 *sin6; sin6 = &sctp_ifap->address.sin6; if (SCTP_IFN_IS_IFT_LOOP(sctp_ifap->ifn_p) || (IN6_IS_ADDR_LOOPBACK(&sin6->sin6_addr))) { sctp_ifap->src_is_loop = 1; } if (IN6_IS_ADDR_LINKLOCAL(&sin6->sin6_addr)) { sctp_ifap->src_is_priv = 1; } sctp_ifnp->num_v6++; if (new_ifn_af) new_ifn_af = AF_INET6; break; } #endif default: new_ifn_af = 0; break; } hash_of_addr = sctp_get_ifa_hash_val(&sctp_ifap->address.sa); if ((sctp_ifap->src_is_priv == 0) && (sctp_ifap->src_is_loop == 0)) { sctp_ifap->src_is_glob = 1; } SCTP_IPI_ADDR_WLOCK(); hash_addr_head = &vrf->vrf_addr_hash[(hash_of_addr & vrf->vrf_addr_hashmark)]; LIST_INSERT_HEAD(hash_addr_head, sctp_ifap, next_bucket); sctp_ifap->refcount = 1; LIST_INSERT_HEAD(&sctp_ifnp->ifalist, sctp_ifap, next_ifa); sctp_ifnp->ifa_count++; vrf->total_ifa_count++; atomic_add_int(&SCTP_BASE_INFO(ipi_count_ifas), 1); if (new_ifn_af) { SCTP_REGISTER_INTERFACE(ifn_index, new_ifn_af); sctp_ifnp->registered_af = new_ifn_af; } SCTP_IPI_ADDR_WUNLOCK(); if (dynamic_add) { /* * Bump up the refcount so that when the timer completes it * will drop back down. */ struct sctp_laddr *wi; atomic_add_int(&sctp_ifap->refcount, 1); wi = SCTP_ZONE_GET(SCTP_BASE_INFO(ipi_zone_laddr), struct sctp_laddr); if (wi == NULL) { /* * Gak, what can we do? We have lost an address * change can you say HOSED? */ SCTPDBG(SCTP_DEBUG_PCB4, "Lost an address change?\n"); /* Opps, must decrement the count */ sctp_del_addr_from_vrf(vrf_id, addr, ifn_index, if_name); return (NULL); } SCTP_INCR_LADDR_COUNT(); bzero(wi, sizeof(*wi)); (void)SCTP_GETTIME_TIMEVAL(&wi->start_time); wi->ifa = sctp_ifap; wi->action = SCTP_ADD_IP_ADDRESS; SCTP_WQ_ADDR_LOCK(); LIST_INSERT_HEAD(&SCTP_BASE_INFO(addr_wq), wi, sctp_nxt_addr); SCTP_WQ_ADDR_UNLOCK(); sctp_timer_start(SCTP_TIMER_TYPE_ADDR_WQ, (struct sctp_inpcb *)NULL, (struct sctp_tcb *)NULL, (struct sctp_nets *)NULL); } else { /* it's ready for use */ sctp_ifap->localifa_flags &= ~SCTP_ADDR_DEFER_USE; } return (sctp_ifap); } void sctp_del_addr_from_vrf(uint32_t vrf_id, struct sockaddr *addr, uint32_t ifn_index, const char *if_name) { struct sctp_vrf *vrf; struct sctp_ifa *sctp_ifap = NULL; SCTP_IPI_ADDR_WLOCK(); vrf = sctp_find_vrf(vrf_id); if (vrf == NULL) { SCTPDBG(SCTP_DEBUG_PCB4, "Can't find vrf_id 0x%x\n", vrf_id); goto out_now; } #ifdef SCTP_DEBUG SCTPDBG(SCTP_DEBUG_PCB4, "vrf_id 0x%x: deleting address:", vrf_id); SCTPDBG_ADDR(SCTP_DEBUG_PCB4, addr); #endif sctp_ifap = sctp_find_ifa_by_addr(addr, vrf->vrf_id, SCTP_ADDR_LOCKED); if (sctp_ifap) { /* Validate the delete */ if (sctp_ifap->ifn_p) { int valid = 0; /*- * The name has priority over the ifn_index * if its given. We do this especially for * panda who might recycle indexes fast. */ if (if_name) { if (strncmp(if_name, sctp_ifap->ifn_p->ifn_name, SCTP_IFNAMSIZ) == 0) { /* They match its a correct delete */ valid = 1; } } if (!valid) { /* last ditch check ifn_index */ if (ifn_index == sctp_ifap->ifn_p->ifn_index) { valid = 1; } } if (!valid) { SCTPDBG(SCTP_DEBUG_PCB4, "ifn:%d ifname:%s does not match addresses\n", ifn_index, ((if_name == NULL) ? "NULL" : if_name)); SCTPDBG(SCTP_DEBUG_PCB4, "ifn:%d ifname:%s - ignoring delete\n", sctp_ifap->ifn_p->ifn_index, sctp_ifap->ifn_p->ifn_name); SCTP_IPI_ADDR_WUNLOCK(); return; } } SCTPDBG(SCTP_DEBUG_PCB4, "Deleting ifa %p\n", (void *)sctp_ifap); sctp_ifap->localifa_flags &= SCTP_ADDR_VALID; /* * We don't set the flag. This means that the structure will * hang around in EP's that have bound specific to it until * they close. This gives us TCP like behavior if someone * removes an address (or for that matter adds it right * back). */ /* sctp_ifap->localifa_flags |= SCTP_BEING_DELETED; */ vrf->total_ifa_count--; LIST_REMOVE(sctp_ifap, next_bucket); sctp_remove_ifa_from_ifn(sctp_ifap); } #ifdef SCTP_DEBUG else { SCTPDBG(SCTP_DEBUG_PCB4, "Del Addr-ifn:%d Could not find address:", ifn_index); SCTPDBG_ADDR(SCTP_DEBUG_PCB1, addr); } #endif out_now: SCTP_IPI_ADDR_WUNLOCK(); if (sctp_ifap) { struct sctp_laddr *wi; wi = SCTP_ZONE_GET(SCTP_BASE_INFO(ipi_zone_laddr), struct sctp_laddr); if (wi == NULL) { /* * Gak, what can we do? We have lost an address * change can you say HOSED? */ SCTPDBG(SCTP_DEBUG_PCB4, "Lost an address change?\n"); /* Oops, must decrement the count */ sctp_free_ifa(sctp_ifap); return; } SCTP_INCR_LADDR_COUNT(); bzero(wi, sizeof(*wi)); (void)SCTP_GETTIME_TIMEVAL(&wi->start_time); wi->ifa = sctp_ifap; wi->action = SCTP_DEL_IP_ADDRESS; SCTP_WQ_ADDR_LOCK(); /* * Should this really be a tailq? As it is we will process * the newest first :-0 */ LIST_INSERT_HEAD(&SCTP_BASE_INFO(addr_wq), wi, sctp_nxt_addr); SCTP_WQ_ADDR_UNLOCK(); sctp_timer_start(SCTP_TIMER_TYPE_ADDR_WQ, (struct sctp_inpcb *)NULL, (struct sctp_tcb *)NULL, (struct sctp_nets *)NULL); } return; } static int sctp_does_stcb_own_this_addr(struct sctp_tcb *stcb, struct sockaddr *to) { int loopback_scope; #if defined(INET) int ipv4_local_scope, ipv4_addr_legal; #endif #if defined(INET6) int local_scope, site_scope, ipv6_addr_legal; #endif struct sctp_vrf *vrf; struct sctp_ifn *sctp_ifn; struct sctp_ifa *sctp_ifa; loopback_scope = stcb->asoc.scope.loopback_scope; #if defined(INET) ipv4_local_scope = stcb->asoc.scope.ipv4_local_scope; ipv4_addr_legal = stcb->asoc.scope.ipv4_addr_legal; #endif #if defined(INET6) local_scope = stcb->asoc.scope.local_scope; site_scope = stcb->asoc.scope.site_scope; ipv6_addr_legal = stcb->asoc.scope.ipv6_addr_legal; #endif SCTP_IPI_ADDR_RLOCK(); vrf = sctp_find_vrf(stcb->asoc.vrf_id); if (vrf == NULL) { /* no vrf, no addresses */ SCTP_IPI_ADDR_RUNLOCK(); return (0); } if (stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL) { LIST_FOREACH(sctp_ifn, &vrf->ifnlist, next_ifn) { if ((loopback_scope == 0) && SCTP_IFN_IS_IFT_LOOP(sctp_ifn)) { continue; } LIST_FOREACH(sctp_ifa, &sctp_ifn->ifalist, next_ifa) { if (sctp_is_addr_restricted(stcb, sctp_ifa) && (!sctp_is_addr_pending(stcb, sctp_ifa))) { /* * We allow pending addresses, where * we have sent an asconf-add to be * considered valid. */ continue; } if (sctp_ifa->address.sa.sa_family != to->sa_family) { continue; } switch (sctp_ifa->address.sa.sa_family) { #ifdef INET case AF_INET: if (ipv4_addr_legal) { struct sockaddr_in *sin, *rsin; sin = &sctp_ifa->address.sin; rsin = (struct sockaddr_in *)to; if ((ipv4_local_scope == 0) && IN4_ISPRIVATE_ADDRESS(&sin->sin_addr)) { continue; } if (prison_check_ip4(stcb->sctp_ep->ip_inp.inp.inp_cred, &sin->sin_addr) != 0) { continue; } if (sin->sin_addr.s_addr == rsin->sin_addr.s_addr) { SCTP_IPI_ADDR_RUNLOCK(); return (1); } } break; #endif #ifdef INET6 case AF_INET6: if (ipv6_addr_legal) { struct sockaddr_in6 *sin6, *rsin6; sin6 = &sctp_ifa->address.sin6; rsin6 = (struct sockaddr_in6 *)to; if (prison_check_ip6(stcb->sctp_ep->ip_inp.inp.inp_cred, &sin6->sin6_addr) != 0) { continue; } if (IN6_IS_ADDR_LINKLOCAL(&sin6->sin6_addr)) { if (local_scope == 0) continue; if (sin6->sin6_scope_id == 0) { if (sa6_recoverscope(sin6) != 0) continue; } } if ((site_scope == 0) && (IN6_IS_ADDR_SITELOCAL(&sin6->sin6_addr))) { continue; } if (SCTP6_ARE_ADDR_EQUAL(sin6, rsin6)) { SCTP_IPI_ADDR_RUNLOCK(); return (1); } } break; #endif default: /* TSNH */ break; } } } } else { struct sctp_laddr *laddr; LIST_FOREACH(laddr, &stcb->sctp_ep->sctp_addr_list, sctp_nxt_addr) { if (laddr->ifa->localifa_flags & SCTP_BEING_DELETED) { SCTPDBG(SCTP_DEBUG_PCB1, "ifa being deleted\n"); continue; } if (sctp_is_addr_restricted(stcb, laddr->ifa) && (!sctp_is_addr_pending(stcb, laddr->ifa))) { /* * We allow pending addresses, where we have * sent an asconf-add to be considered * valid. */ continue; } if (laddr->ifa->address.sa.sa_family != to->sa_family) { continue; } switch (to->sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin, *rsin; sin = &laddr->ifa->address.sin; rsin = (struct sockaddr_in *)to; if (sin->sin_addr.s_addr == rsin->sin_addr.s_addr) { SCTP_IPI_ADDR_RUNLOCK(); return (1); } break; } #endif #ifdef INET6 case AF_INET6: { struct sockaddr_in6 *sin6, *rsin6; sin6 = &laddr->ifa->address.sin6; rsin6 = (struct sockaddr_in6 *)to; if (SCTP6_ARE_ADDR_EQUAL(sin6, rsin6)) { SCTP_IPI_ADDR_RUNLOCK(); return (1); } break; } #endif default: /* TSNH */ break; } } } SCTP_IPI_ADDR_RUNLOCK(); return (0); } static struct sctp_tcb * sctp_tcb_special_locate(struct sctp_inpcb **inp_p, struct sockaddr *from, struct sockaddr *to, struct sctp_nets **netp, uint32_t vrf_id) { /**** ASSUMES THE CALLER holds the INP_INFO_RLOCK */ /* * If we support the TCP model, then we must now dig through to see * if we can find our endpoint in the list of tcp ep's. */ uint16_t lport, rport; struct sctppcbhead *ephead; struct sctp_inpcb *inp; struct sctp_laddr *laddr; struct sctp_tcb *stcb; struct sctp_nets *net; if ((to == NULL) || (from == NULL)) { return (NULL); } switch (to->sa_family) { #ifdef INET case AF_INET: if (from->sa_family == AF_INET) { lport = ((struct sockaddr_in *)to)->sin_port; rport = ((struct sockaddr_in *)from)->sin_port; } else { return (NULL); } break; #endif #ifdef INET6 case AF_INET6: if (from->sa_family == AF_INET6) { lport = ((struct sockaddr_in6 *)to)->sin6_port; rport = ((struct sockaddr_in6 *)from)->sin6_port; } else { return (NULL); } break; #endif default: return (NULL); } ephead = &SCTP_BASE_INFO(sctp_tcpephash)[SCTP_PCBHASH_ALLADDR((lport | rport), SCTP_BASE_INFO(hashtcpmark))]; /* * Ok now for each of the guys in this bucket we must look and see: * - Does the remote port match. - Does there single association's * addresses match this address (to). If so we update p_ep to point * to this ep and return the tcb from it. */ LIST_FOREACH(inp, ephead, sctp_hash) { SCTP_INP_RLOCK(inp); if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { SCTP_INP_RUNLOCK(inp); continue; } if (lport != inp->sctp_lport) { SCTP_INP_RUNLOCK(inp); continue; } switch (to->sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin; sin = (struct sockaddr_in *)to; if (prison_check_ip4(inp->ip_inp.inp.inp_cred, &sin->sin_addr) != 0) { SCTP_INP_RUNLOCK(inp); continue; } break; } #endif #ifdef INET6 case AF_INET6: { struct sockaddr_in6 *sin6; sin6 = (struct sockaddr_in6 *)to; if (prison_check_ip6(inp->ip_inp.inp.inp_cred, &sin6->sin6_addr) != 0) { SCTP_INP_RUNLOCK(inp); continue; } break; } #endif default: SCTP_INP_RUNLOCK(inp); continue; } if (inp->def_vrf_id != vrf_id) { SCTP_INP_RUNLOCK(inp); continue; } /* check to see if the ep has one of the addresses */ if ((inp->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL) == 0) { /* We are NOT bound all, so look further */ int match = 0; LIST_FOREACH(laddr, &inp->sctp_addr_list, sctp_nxt_addr) { if (laddr->ifa == NULL) { SCTPDBG(SCTP_DEBUG_PCB1, "%s: NULL ifa\n", __func__); continue; } if (laddr->ifa->localifa_flags & SCTP_BEING_DELETED) { SCTPDBG(SCTP_DEBUG_PCB1, "ifa being deleted\n"); continue; } if (laddr->ifa->address.sa.sa_family == to->sa_family) { /* see if it matches */ #ifdef INET if (from->sa_family == AF_INET) { struct sockaddr_in *intf_addr, *sin; intf_addr = &laddr->ifa->address.sin; sin = (struct sockaddr_in *)to; if (sin->sin_addr.s_addr == intf_addr->sin_addr.s_addr) { match = 1; break; } } #endif #ifdef INET6 if (from->sa_family == AF_INET6) { struct sockaddr_in6 *intf_addr6; struct sockaddr_in6 *sin6; sin6 = (struct sockaddr_in6 *) to; intf_addr6 = &laddr->ifa->address.sin6; if (SCTP6_ARE_ADDR_EQUAL(sin6, intf_addr6)) { match = 1; break; } } #endif } } if (match == 0) { /* This endpoint does not have this address */ SCTP_INP_RUNLOCK(inp); continue; } } /* * Ok if we hit here the ep has the address, does it hold * the tcb? */ /* XXX: Why don't we TAILQ_FOREACH through sctp_asoc_list? */ stcb = LIST_FIRST(&inp->sctp_asoc_list); if (stcb == NULL) { SCTP_INP_RUNLOCK(inp); continue; } SCTP_TCB_LOCK(stcb); if (!sctp_does_stcb_own_this_addr(stcb, to)) { SCTP_TCB_UNLOCK(stcb); SCTP_INP_RUNLOCK(inp); continue; } if (stcb->rport != rport) { /* remote port does not match. */ SCTP_TCB_UNLOCK(stcb); SCTP_INP_RUNLOCK(inp); continue; } if (stcb->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) { SCTP_TCB_UNLOCK(stcb); SCTP_INP_RUNLOCK(inp); continue; } if (!sctp_does_stcb_own_this_addr(stcb, to)) { SCTP_TCB_UNLOCK(stcb); SCTP_INP_RUNLOCK(inp); continue; } /* Does this TCB have a matching address? */ TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { if (net->ro._l_addr.sa.sa_family != from->sa_family) { /* not the same family, can't be a match */ continue; } switch (from->sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin, *rsin; sin = (struct sockaddr_in *)&net->ro._l_addr; rsin = (struct sockaddr_in *)from; if (sin->sin_addr.s_addr == rsin->sin_addr.s_addr) { /* found it */ if (netp != NULL) { *netp = net; } /* * Update the endpoint * pointer */ *inp_p = inp; SCTP_INP_RUNLOCK(inp); return (stcb); } break; } #endif #ifdef INET6 case AF_INET6: { struct sockaddr_in6 *sin6, *rsin6; sin6 = (struct sockaddr_in6 *)&net->ro._l_addr; rsin6 = (struct sockaddr_in6 *)from; if (SCTP6_ARE_ADDR_EQUAL(sin6, rsin6)) { /* found it */ if (netp != NULL) { *netp = net; } /* * Update the endpoint * pointer */ *inp_p = inp; SCTP_INP_RUNLOCK(inp); return (stcb); } break; } #endif default: /* TSNH */ break; } } SCTP_TCB_UNLOCK(stcb); SCTP_INP_RUNLOCK(inp); } return (NULL); } /* * rules for use * * 1) If I return a NULL you must decrement any INP ref cnt. 2) If I find an * stcb, both will be locked (locked_tcb and stcb) but decrement will be done * (if locked == NULL). 3) Decrement happens on return ONLY if locked == * NULL. */ struct sctp_tcb * sctp_findassociation_ep_addr(struct sctp_inpcb **inp_p, struct sockaddr *remote, struct sctp_nets **netp, struct sockaddr *local, struct sctp_tcb *locked_tcb) { struct sctpasochead *head; struct sctp_inpcb *inp; struct sctp_tcb *stcb = NULL; struct sctp_nets *net; uint16_t rport; inp = *inp_p; switch (remote->sa_family) { #ifdef INET case AF_INET: rport = (((struct sockaddr_in *)remote)->sin_port); break; #endif #ifdef INET6 case AF_INET6: rport = (((struct sockaddr_in6 *)remote)->sin6_port); break; #endif default: return (NULL); } if (locked_tcb) { /* * UN-lock so we can do proper locking here this occurs when * called from load_addresses_from_init. */ atomic_add_int(&locked_tcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(locked_tcb); } SCTP_INP_INFO_RLOCK(); if ((inp->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) || (inp->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL)) { /*- * Now either this guy is our listener or it's the * connector. If it is the one that issued the connect, then * it's only chance is to be the first TCB in the list. If * it is the acceptor, then do the special_lookup to hash * and find the real inp. */ if ((inp->sctp_socket) && (inp->sctp_socket->so_qlimit)) { /* to is peer addr, from is my addr */ stcb = sctp_tcb_special_locate(inp_p, remote, local, netp, inp->def_vrf_id); if ((stcb != NULL) && (locked_tcb == NULL)) { /* we have a locked tcb, lower refcount */ SCTP_INP_DECR_REF(inp); } if ((locked_tcb != NULL) && (locked_tcb != stcb)) { SCTP_INP_RLOCK(locked_tcb->sctp_ep); SCTP_TCB_LOCK(locked_tcb); atomic_subtract_int(&locked_tcb->asoc.refcnt, 1); SCTP_INP_RUNLOCK(locked_tcb->sctp_ep); } SCTP_INP_INFO_RUNLOCK(); return (stcb); } else { SCTP_INP_WLOCK(inp); if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { goto null_return; } stcb = LIST_FIRST(&inp->sctp_asoc_list); if (stcb == NULL) { goto null_return; } SCTP_TCB_LOCK(stcb); if (stcb->rport != rport) { /* remote port does not match. */ SCTP_TCB_UNLOCK(stcb); goto null_return; } if (stcb->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) { SCTP_TCB_UNLOCK(stcb); goto null_return; } if (local && !sctp_does_stcb_own_this_addr(stcb, local)) { SCTP_TCB_UNLOCK(stcb); goto null_return; } /* now look at the list of remote addresses */ TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { #ifdef INVARIANTS if (net == (TAILQ_NEXT(net, sctp_next))) { panic("Corrupt net list"); } #endif if (net->ro._l_addr.sa.sa_family != remote->sa_family) { /* not the same family */ continue; } switch (remote->sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin, *rsin; sin = (struct sockaddr_in *) &net->ro._l_addr; rsin = (struct sockaddr_in *)remote; if (sin->sin_addr.s_addr == rsin->sin_addr.s_addr) { /* found it */ if (netp != NULL) { *netp = net; } if (locked_tcb == NULL) { SCTP_INP_DECR_REF(inp); } else if (locked_tcb != stcb) { SCTP_TCB_LOCK(locked_tcb); } if (locked_tcb) { atomic_subtract_int(&locked_tcb->asoc.refcnt, 1); } SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_RUNLOCK(); return (stcb); } break; } #endif #ifdef INET6 case AF_INET6: { struct sockaddr_in6 *sin6, *rsin6; sin6 = (struct sockaddr_in6 *)&net->ro._l_addr; rsin6 = (struct sockaddr_in6 *)remote; if (SCTP6_ARE_ADDR_EQUAL(sin6, rsin6)) { /* found it */ if (netp != NULL) { *netp = net; } if (locked_tcb == NULL) { SCTP_INP_DECR_REF(inp); } else if (locked_tcb != stcb) { SCTP_TCB_LOCK(locked_tcb); } if (locked_tcb) { atomic_subtract_int(&locked_tcb->asoc.refcnt, 1); } SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_RUNLOCK(); return (stcb); } break; } #endif default: /* TSNH */ break; } } SCTP_TCB_UNLOCK(stcb); } } else { SCTP_INP_WLOCK(inp); if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { goto null_return; } head = &inp->sctp_tcbhash[SCTP_PCBHASH_ALLADDR(rport, inp->sctp_hashmark)]; LIST_FOREACH(stcb, head, sctp_tcbhash) { if (stcb->rport != rport) { /* remote port does not match */ continue; } SCTP_TCB_LOCK(stcb); if (stcb->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) { SCTP_TCB_UNLOCK(stcb); continue; } if (local && !sctp_does_stcb_own_this_addr(stcb, local)) { SCTP_TCB_UNLOCK(stcb); continue; } /* now look at the list of remote addresses */ TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { #ifdef INVARIANTS if (net == (TAILQ_NEXT(net, sctp_next))) { panic("Corrupt net list"); } #endif if (net->ro._l_addr.sa.sa_family != remote->sa_family) { /* not the same family */ continue; } switch (remote->sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin, *rsin; sin = (struct sockaddr_in *) &net->ro._l_addr; rsin = (struct sockaddr_in *)remote; if (sin->sin_addr.s_addr == rsin->sin_addr.s_addr) { /* found it */ if (netp != NULL) { *netp = net; } if (locked_tcb == NULL) { SCTP_INP_DECR_REF(inp); } else if (locked_tcb != stcb) { SCTP_TCB_LOCK(locked_tcb); } if (locked_tcb) { atomic_subtract_int(&locked_tcb->asoc.refcnt, 1); } SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_RUNLOCK(); return (stcb); } break; } #endif #ifdef INET6 case AF_INET6: { struct sockaddr_in6 *sin6, *rsin6; sin6 = (struct sockaddr_in6 *) &net->ro._l_addr; rsin6 = (struct sockaddr_in6 *)remote; if (SCTP6_ARE_ADDR_EQUAL(sin6, rsin6)) { /* found it */ if (netp != NULL) { *netp = net; } if (locked_tcb == NULL) { SCTP_INP_DECR_REF(inp); } else if (locked_tcb != stcb) { SCTP_TCB_LOCK(locked_tcb); } if (locked_tcb) { atomic_subtract_int(&locked_tcb->asoc.refcnt, 1); } SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_RUNLOCK(); return (stcb); } break; } #endif default: /* TSNH */ break; } } SCTP_TCB_UNLOCK(stcb); } } null_return: /* clean up for returning null */ if (locked_tcb) { SCTP_TCB_LOCK(locked_tcb); atomic_subtract_int(&locked_tcb->asoc.refcnt, 1); } SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_RUNLOCK(); /* not found */ return (NULL); } /* * Find an association for a specific endpoint using the association id given * out in the COMM_UP notification */ struct sctp_tcb * sctp_findasoc_ep_asocid_locked(struct sctp_inpcb *inp, sctp_assoc_t asoc_id, int want_lock) { /* * Use my the assoc_id to find a endpoint */ struct sctpasochead *head; struct sctp_tcb *stcb; uint32_t id; if (inp == NULL) { SCTP_PRINTF("TSNH ep_associd\n"); return (NULL); } if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { SCTP_PRINTF("TSNH ep_associd0\n"); return (NULL); } id = (uint32_t) asoc_id; head = &inp->sctp_asocidhash[SCTP_PCBHASH_ASOC(id, inp->hashasocidmark)]; if (head == NULL) { /* invalid id TSNH */ SCTP_PRINTF("TSNH ep_associd1\n"); return (NULL); } LIST_FOREACH(stcb, head, sctp_tcbasocidhash) { if (stcb->asoc.assoc_id == id) { if (inp != stcb->sctp_ep) { /* * some other guy has the same id active (id * collision ??). */ SCTP_PRINTF("TSNH ep_associd2\n"); continue; } if (stcb->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) { continue; } if (want_lock) { SCTP_TCB_LOCK(stcb); } return (stcb); } } return (NULL); } struct sctp_tcb * sctp_findassociation_ep_asocid(struct sctp_inpcb *inp, sctp_assoc_t asoc_id, int want_lock) { struct sctp_tcb *stcb; SCTP_INP_RLOCK(inp); stcb = sctp_findasoc_ep_asocid_locked(inp, asoc_id, want_lock); SCTP_INP_RUNLOCK(inp); return (stcb); } /* * Endpoint probe expects that the INP_INFO is locked. */ static struct sctp_inpcb * sctp_endpoint_probe(struct sockaddr *nam, struct sctppcbhead *head, uint16_t lport, uint32_t vrf_id) { struct sctp_inpcb *inp; struct sctp_laddr *laddr; #ifdef INET struct sockaddr_in *sin; #endif #ifdef INET6 struct sockaddr_in6 *sin6; struct sockaddr_in6 *intf_addr6; #endif int fnd; #ifdef INET sin = NULL; #endif #ifdef INET6 sin6 = NULL; #endif switch (nam->sa_family) { #ifdef INET case AF_INET: sin = (struct sockaddr_in *)nam; break; #endif #ifdef INET6 case AF_INET6: sin6 = (struct sockaddr_in6 *)nam; break; #endif default: /* unsupported family */ return (NULL); } if (head == NULL) return (NULL); LIST_FOREACH(inp, head, sctp_hash) { SCTP_INP_RLOCK(inp); if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { SCTP_INP_RUNLOCK(inp); continue; } if ((inp->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL) && (inp->sctp_lport == lport)) { /* got it */ switch (nam->sa_family) { #ifdef INET case AF_INET: if ((inp->sctp_flags & SCTP_PCB_FLAGS_BOUND_V6) && SCTP_IPV6_V6ONLY(inp)) { /* * IPv4 on a IPv6 socket with ONLY * IPv6 set */ SCTP_INP_RUNLOCK(inp); continue; } if (prison_check_ip4(inp->ip_inp.inp.inp_cred, &sin->sin_addr) != 0) { SCTP_INP_RUNLOCK(inp); continue; } break; #endif #ifdef INET6 case AF_INET6: /* * A V6 address and the endpoint is NOT * bound V6 */ if ((inp->sctp_flags & SCTP_PCB_FLAGS_BOUND_V6) == 0) { SCTP_INP_RUNLOCK(inp); continue; } if (prison_check_ip6(inp->ip_inp.inp.inp_cred, &sin6->sin6_addr) != 0) { SCTP_INP_RUNLOCK(inp); continue; } break; #endif default: break; } /* does a VRF id match? */ fnd = 0; if (inp->def_vrf_id == vrf_id) fnd = 1; SCTP_INP_RUNLOCK(inp); if (!fnd) continue; return (inp); } SCTP_INP_RUNLOCK(inp); } switch (nam->sa_family) { #ifdef INET case AF_INET: if (sin->sin_addr.s_addr == INADDR_ANY) { /* Can't hunt for one that has no address specified */ return (NULL); } break; #endif #ifdef INET6 case AF_INET6: if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) { /* Can't hunt for one that has no address specified */ return (NULL); } break; #endif default: break; } /* * ok, not bound to all so see if we can find a EP bound to this * address. */ LIST_FOREACH(inp, head, sctp_hash) { SCTP_INP_RLOCK(inp); if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { SCTP_INP_RUNLOCK(inp); continue; } if ((inp->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL)) { SCTP_INP_RUNLOCK(inp); continue; } /* * Ok this could be a likely candidate, look at all of its * addresses */ if (inp->sctp_lport != lport) { SCTP_INP_RUNLOCK(inp); continue; } /* does a VRF id match? */ fnd = 0; if (inp->def_vrf_id == vrf_id) fnd = 1; if (!fnd) { SCTP_INP_RUNLOCK(inp); continue; } LIST_FOREACH(laddr, &inp->sctp_addr_list, sctp_nxt_addr) { if (laddr->ifa == NULL) { SCTPDBG(SCTP_DEBUG_PCB1, "%s: NULL ifa\n", __func__); continue; } SCTPDBG(SCTP_DEBUG_PCB1, "Ok laddr->ifa:%p is possible, ", (void *)laddr->ifa); if (laddr->ifa->localifa_flags & SCTP_BEING_DELETED) { SCTPDBG(SCTP_DEBUG_PCB1, "Huh IFA being deleted\n"); continue; } if (laddr->ifa->address.sa.sa_family == nam->sa_family) { /* possible, see if it matches */ switch (nam->sa_family) { #ifdef INET case AF_INET: if (sin->sin_addr.s_addr == laddr->ifa->address.sin.sin_addr.s_addr) { SCTP_INP_RUNLOCK(inp); return (inp); } break; #endif #ifdef INET6 case AF_INET6: intf_addr6 = &laddr->ifa->address.sin6; if (SCTP6_ARE_ADDR_EQUAL(sin6, intf_addr6)) { SCTP_INP_RUNLOCK(inp); return (inp); } break; #endif } } } SCTP_INP_RUNLOCK(inp); } return (NULL); } static struct sctp_inpcb * sctp_isport_inuse(struct sctp_inpcb *inp, uint16_t lport, uint32_t vrf_id) { struct sctppcbhead *head; struct sctp_inpcb *t_inp; int fnd; head = &SCTP_BASE_INFO(sctp_ephash)[SCTP_PCBHASH_ALLADDR(lport, SCTP_BASE_INFO(hashmark))]; LIST_FOREACH(t_inp, head, sctp_hash) { if (t_inp->sctp_lport != lport) { continue; } /* is it in the VRF in question */ fnd = 0; if (t_inp->def_vrf_id == vrf_id) fnd = 1; if (!fnd) continue; /* This one is in use. */ /* check the v6/v4 binding issue */ if ((t_inp->sctp_flags & SCTP_PCB_FLAGS_BOUND_V6) && SCTP_IPV6_V6ONLY(t_inp)) { if (inp->sctp_flags & SCTP_PCB_FLAGS_BOUND_V6) { /* collision in V6 space */ return (t_inp); } else { /* inp is BOUND_V4 no conflict */ continue; } } else if (t_inp->sctp_flags & SCTP_PCB_FLAGS_BOUND_V6) { /* t_inp is bound v4 and v6, conflict always */ return (t_inp); } else { /* t_inp is bound only V4 */ if ((inp->sctp_flags & SCTP_PCB_FLAGS_BOUND_V6) && SCTP_IPV6_V6ONLY(inp)) { /* no conflict */ continue; } /* else fall through to conflict */ } return (t_inp); } return (NULL); } int sctp_swap_inpcb_for_listen(struct sctp_inpcb *inp) { /* For 1-2-1 with port reuse */ struct sctppcbhead *head; struct sctp_inpcb *tinp, *ninp; if (sctp_is_feature_off(inp, SCTP_PCB_FLAGS_PORTREUSE)) { /* only works with port reuse on */ return (-1); } if ((inp->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL) == 0) { return (0); } SCTP_INP_RUNLOCK(inp); SCTP_INP_INFO_WLOCK(); head = &SCTP_BASE_INFO(sctp_ephash)[SCTP_PCBHASH_ALLADDR(inp->sctp_lport, SCTP_BASE_INFO(hashmark))]; /* Kick out all non-listeners to the TCP hash */ LIST_FOREACH_SAFE(tinp, head, sctp_hash, ninp) { if (tinp->sctp_lport != inp->sctp_lport) { continue; } if (tinp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { continue; } if (tinp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) { continue; } if (tinp->sctp_socket->so_qlimit) { continue; } SCTP_INP_WLOCK(tinp); LIST_REMOVE(tinp, sctp_hash); head = &SCTP_BASE_INFO(sctp_tcpephash)[SCTP_PCBHASH_ALLADDR(tinp->sctp_lport, SCTP_BASE_INFO(hashtcpmark))]; tinp->sctp_flags |= SCTP_PCB_FLAGS_IN_TCPPOOL; LIST_INSERT_HEAD(head, tinp, sctp_hash); SCTP_INP_WUNLOCK(tinp); } SCTP_INP_WLOCK(inp); /* Pull from where he was */ LIST_REMOVE(inp, sctp_hash); inp->sctp_flags &= ~SCTP_PCB_FLAGS_IN_TCPPOOL; head = &SCTP_BASE_INFO(sctp_ephash)[SCTP_PCBHASH_ALLADDR(inp->sctp_lport, SCTP_BASE_INFO(hashmark))]; LIST_INSERT_HEAD(head, inp, sctp_hash); SCTP_INP_WUNLOCK(inp); SCTP_INP_RLOCK(inp); SCTP_INP_INFO_WUNLOCK(); return (0); } struct sctp_inpcb * sctp_pcb_findep(struct sockaddr *nam, int find_tcp_pool, int have_lock, uint32_t vrf_id) { /* * First we check the hash table to see if someone has this port * bound with just the port. */ struct sctp_inpcb *inp; struct sctppcbhead *head; int lport; unsigned int i; #ifdef INET struct sockaddr_in *sin; #endif #ifdef INET6 struct sockaddr_in6 *sin6; #endif switch (nam->sa_family) { #ifdef INET case AF_INET: sin = (struct sockaddr_in *)nam; lport = sin->sin_port; break; #endif #ifdef INET6 case AF_INET6: sin6 = (struct sockaddr_in6 *)nam; lport = sin6->sin6_port; break; #endif default: return (NULL); } /* * I could cheat here and just cast to one of the types but we will * do it right. It also provides the check against an Unsupported * type too. */ /* Find the head of the ALLADDR chain */ if (have_lock == 0) { SCTP_INP_INFO_RLOCK(); } head = &SCTP_BASE_INFO(sctp_ephash)[SCTP_PCBHASH_ALLADDR(lport, SCTP_BASE_INFO(hashmark))]; inp = sctp_endpoint_probe(nam, head, lport, vrf_id); /* * If the TCP model exists it could be that the main listening * endpoint is gone but there still exists a connected socket for * this guy. If so we can return the first one that we find. This * may NOT be the correct one so the caller should be wary on the * returned INP. Currently the only caller that sets find_tcp_pool * is in bindx where we are verifying that a user CAN bind the * address. He either has bound it already, or someone else has, or * its open to bind, so this is good enough. */ if (inp == NULL && find_tcp_pool) { for (i = 0; i < SCTP_BASE_INFO(hashtcpmark) + 1; i++) { head = &SCTP_BASE_INFO(sctp_tcpephash)[i]; inp = sctp_endpoint_probe(nam, head, lport, vrf_id); if (inp) { break; } } } if (inp) { SCTP_INP_INCR_REF(inp); } if (have_lock == 0) { SCTP_INP_INFO_RUNLOCK(); } return (inp); } /* * Find an association for an endpoint with the pointer to whom you want to * send to and the endpoint pointer. The address can be IPv4 or IPv6. We may * need to change the *to to some other struct like a mbuf... */ struct sctp_tcb * sctp_findassociation_addr_sa(struct sockaddr *from, struct sockaddr *to, struct sctp_inpcb **inp_p, struct sctp_nets **netp, int find_tcp_pool, uint32_t vrf_id) { struct sctp_inpcb *inp = NULL; struct sctp_tcb *stcb; SCTP_INP_INFO_RLOCK(); if (find_tcp_pool) { if (inp_p != NULL) { stcb = sctp_tcb_special_locate(inp_p, from, to, netp, vrf_id); } else { stcb = sctp_tcb_special_locate(&inp, from, to, netp, vrf_id); } if (stcb != NULL) { SCTP_INP_INFO_RUNLOCK(); return (stcb); } } inp = sctp_pcb_findep(to, 0, 1, vrf_id); if (inp_p != NULL) { *inp_p = inp; } SCTP_INP_INFO_RUNLOCK(); if (inp == NULL) { return (NULL); } /* * ok, we have an endpoint, now lets find the assoc for it (if any) * we now place the source address or from in the to of the find * endpoint call. Since in reality this chain is used from the * inbound packet side. */ if (inp_p != NULL) { stcb = sctp_findassociation_ep_addr(inp_p, from, netp, to, NULL); } else { stcb = sctp_findassociation_ep_addr(&inp, from, netp, to, NULL); } return (stcb); } /* * This routine will grub through the mbuf that is a INIT or INIT-ACK and * find all addresses that the sender has specified in any address list. Each * address will be used to lookup the TCB and see if one exits. */ static struct sctp_tcb * sctp_findassociation_special_addr(struct mbuf *m, int offset, struct sctphdr *sh, struct sctp_inpcb **inp_p, struct sctp_nets **netp, struct sockaddr *dst) { struct sctp_paramhdr *phdr, parm_buf; #if defined(INET) || defined(INET6) struct sctp_tcb *stcb; uint16_t ptype; #endif uint16_t plen; #ifdef INET struct sockaddr_in sin4; #endif #ifdef INET6 struct sockaddr_in6 sin6; #endif #ifdef INET memset(&sin4, 0, sizeof(sin4)); sin4.sin_len = sizeof(sin4); sin4.sin_family = AF_INET; sin4.sin_port = sh->src_port; #endif #ifdef INET6 memset(&sin6, 0, sizeof(sin6)); sin6.sin6_len = sizeof(sin6); sin6.sin6_family = AF_INET6; sin6.sin6_port = sh->src_port; #endif offset += sizeof(struct sctp_init_chunk); phdr = sctp_get_next_param(m, offset, &parm_buf, sizeof(parm_buf)); while (phdr != NULL) { /* now we must see if we want the parameter */ #if defined(INET) || defined(INET6) ptype = ntohs(phdr->param_type); #endif plen = ntohs(phdr->param_length); if (plen == 0) { break; } #ifdef INET if (ptype == SCTP_IPV4_ADDRESS && plen == sizeof(struct sctp_ipv4addr_param)) { /* Get the rest of the address */ struct sctp_ipv4addr_param ip4_parm, *p4; phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)&ip4_parm, min(plen, sizeof(ip4_parm))); if (phdr == NULL) { return (NULL); } p4 = (struct sctp_ipv4addr_param *)phdr; memcpy(&sin4.sin_addr, &p4->addr, sizeof(p4->addr)); /* look it up */ stcb = sctp_findassociation_ep_addr(inp_p, (struct sockaddr *)&sin4, netp, dst, NULL); if (stcb != NULL) { return (stcb); } } #endif #ifdef INET6 if (ptype == SCTP_IPV6_ADDRESS && plen == sizeof(struct sctp_ipv6addr_param)) { /* Get the rest of the address */ struct sctp_ipv6addr_param ip6_parm, *p6; phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)&ip6_parm, min(plen, sizeof(ip6_parm))); if (phdr == NULL) { return (NULL); } p6 = (struct sctp_ipv6addr_param *)phdr; memcpy(&sin6.sin6_addr, &p6->addr, sizeof(p6->addr)); /* look it up */ stcb = sctp_findassociation_ep_addr(inp_p, (struct sockaddr *)&sin6, netp, dst, NULL); if (stcb != NULL) { return (stcb); } } #endif offset += SCTP_SIZE32(plen); phdr = sctp_get_next_param(m, offset, &parm_buf, sizeof(parm_buf)); } return (NULL); } static struct sctp_tcb * sctp_findassoc_by_vtag(struct sockaddr *from, struct sockaddr *to, uint32_t vtag, struct sctp_inpcb **inp_p, struct sctp_nets **netp, uint16_t rport, uint16_t lport, int skip_src_check, uint32_t vrf_id, uint32_t remote_tag) { /* * Use my vtag to hash. If we find it we then verify the source addr * is in the assoc. If all goes well we save a bit on rec of a * packet. */ struct sctpasochead *head; struct sctp_nets *net; struct sctp_tcb *stcb; SCTP_INP_INFO_RLOCK(); head = &SCTP_BASE_INFO(sctp_asochash)[SCTP_PCBHASH_ASOC(vtag, SCTP_BASE_INFO(hashasocmark))]; LIST_FOREACH(stcb, head, sctp_asocs) { SCTP_INP_RLOCK(stcb->sctp_ep); if (stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { SCTP_INP_RUNLOCK(stcb->sctp_ep); continue; } if (stcb->sctp_ep->def_vrf_id != vrf_id) { SCTP_INP_RUNLOCK(stcb->sctp_ep); continue; } SCTP_TCB_LOCK(stcb); SCTP_INP_RUNLOCK(stcb->sctp_ep); if (stcb->asoc.my_vtag == vtag) { /* candidate */ if (stcb->rport != rport) { SCTP_TCB_UNLOCK(stcb); continue; } if (stcb->sctp_ep->sctp_lport != lport) { SCTP_TCB_UNLOCK(stcb); continue; } if (stcb->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) { SCTP_TCB_UNLOCK(stcb); continue; } /* RRS:Need toaddr check here */ if (sctp_does_stcb_own_this_addr(stcb, to) == 0) { /* Endpoint does not own this address */ SCTP_TCB_UNLOCK(stcb); continue; } if (remote_tag) { /* * If we have both vtags that's all we match * on */ if (stcb->asoc.peer_vtag == remote_tag) { /* * If both tags match we consider it * conclusive and check NO * source/destination addresses */ goto conclusive; } } if (skip_src_check) { conclusive: if (from) { *netp = sctp_findnet(stcb, from); } else { *netp = NULL; /* unknown */ } if (inp_p) *inp_p = stcb->sctp_ep; SCTP_INP_INFO_RUNLOCK(); return (stcb); } net = sctp_findnet(stcb, from); if (net) { /* yep its him. */ *netp = net; SCTP_STAT_INCR(sctps_vtagexpress); *inp_p = stcb->sctp_ep; SCTP_INP_INFO_RUNLOCK(); return (stcb); } else { /* * not him, this should only happen in rare * cases so I peg it. */ SCTP_STAT_INCR(sctps_vtagbogus); } } SCTP_TCB_UNLOCK(stcb); } SCTP_INP_INFO_RUNLOCK(); return (NULL); } /* * Find an association with the pointer to the inbound IP packet. This can be * a IPv4 or IPv6 packet. */ struct sctp_tcb * sctp_findassociation_addr(struct mbuf *m, int offset, struct sockaddr *src, struct sockaddr *dst, struct sctphdr *sh, struct sctp_chunkhdr *ch, struct sctp_inpcb **inp_p, struct sctp_nets **netp, uint32_t vrf_id) { struct sctp_tcb *stcb; struct sctp_inpcb *inp; if (sh->v_tag) { /* we only go down this path if vtag is non-zero */ stcb = sctp_findassoc_by_vtag(src, dst, ntohl(sh->v_tag), inp_p, netp, sh->src_port, sh->dest_port, 0, vrf_id, 0); if (stcb) { return (stcb); } } if (inp_p) { stcb = sctp_findassociation_addr_sa(src, dst, inp_p, netp, 1, vrf_id); inp = *inp_p; } else { stcb = sctp_findassociation_addr_sa(src, dst, &inp, netp, 1, vrf_id); } SCTPDBG(SCTP_DEBUG_PCB1, "stcb:%p inp:%p\n", (void *)stcb, (void *)inp); if (stcb == NULL && inp) { /* Found a EP but not this address */ if ((ch->chunk_type == SCTP_INITIATION) || (ch->chunk_type == SCTP_INITIATION_ACK)) { /*- * special hook, we do NOT return linp or an * association that is linked to an existing * association that is under the TCP pool (i.e. no * listener exists). The endpoint finding routine * will always find a listener before examining the * TCP pool. */ if (inp->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL) { if (inp_p) { *inp_p = NULL; } return (NULL); } stcb = sctp_findassociation_special_addr(m, offset, sh, &inp, netp, dst); if (inp_p != NULL) { *inp_p = inp; } } } SCTPDBG(SCTP_DEBUG_PCB1, "stcb is %p\n", (void *)stcb); return (stcb); } /* * lookup an association by an ASCONF lookup address. * if the lookup address is 0.0.0.0 or ::0, use the vtag to do the lookup */ struct sctp_tcb * sctp_findassociation_ep_asconf(struct mbuf *m, int offset, struct sockaddr *dst, struct sctphdr *sh, struct sctp_inpcb **inp_p, struct sctp_nets **netp, uint32_t vrf_id) { struct sctp_tcb *stcb; union sctp_sockstore remote_store; struct sctp_paramhdr parm_buf, *phdr; int ptype; int zero_address = 0; #ifdef INET struct sockaddr_in *sin; #endif #ifdef INET6 struct sockaddr_in6 *sin6; #endif memset(&remote_store, 0, sizeof(remote_store)); phdr = sctp_get_next_param(m, offset + sizeof(struct sctp_asconf_chunk), &parm_buf, sizeof(struct sctp_paramhdr)); if (phdr == NULL) { SCTPDBG(SCTP_DEBUG_INPUT3, "%s: failed to get asconf lookup addr\n", __func__); return NULL; } ptype = (int)((uint32_t) ntohs(phdr->param_type)); /* get the correlation address */ switch (ptype) { #ifdef INET6 case SCTP_IPV6_ADDRESS: { /* ipv6 address param */ struct sctp_ipv6addr_param *p6, p6_buf; if (ntohs(phdr->param_length) != sizeof(struct sctp_ipv6addr_param)) { return NULL; } p6 = (struct sctp_ipv6addr_param *)sctp_get_next_param(m, offset + sizeof(struct sctp_asconf_chunk), &p6_buf.ph, sizeof(*p6)); if (p6 == NULL) { SCTPDBG(SCTP_DEBUG_INPUT3, "%s: failed to get asconf v6 lookup addr\n", __func__); return (NULL); } sin6 = &remote_store.sin6; sin6->sin6_family = AF_INET6; sin6->sin6_len = sizeof(*sin6); sin6->sin6_port = sh->src_port; memcpy(&sin6->sin6_addr, &p6->addr, sizeof(struct in6_addr)); if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) zero_address = 1; break; } #endif #ifdef INET case SCTP_IPV4_ADDRESS: { /* ipv4 address param */ struct sctp_ipv4addr_param *p4, p4_buf; if (ntohs(phdr->param_length) != sizeof(struct sctp_ipv4addr_param)) { return NULL; } p4 = (struct sctp_ipv4addr_param *)sctp_get_next_param(m, offset + sizeof(struct sctp_asconf_chunk), &p4_buf.ph, sizeof(*p4)); if (p4 == NULL) { SCTPDBG(SCTP_DEBUG_INPUT3, "%s: failed to get asconf v4 lookup addr\n", __func__); return (NULL); } sin = &remote_store.sin; sin->sin_family = AF_INET; sin->sin_len = sizeof(*sin); sin->sin_port = sh->src_port; memcpy(&sin->sin_addr, &p4->addr, sizeof(struct in_addr)); if (sin->sin_addr.s_addr == INADDR_ANY) zero_address = 1; break; } #endif default: /* invalid address param type */ return NULL; } if (zero_address) { stcb = sctp_findassoc_by_vtag(NULL, dst, ntohl(sh->v_tag), inp_p, netp, sh->src_port, sh->dest_port, 1, vrf_id, 0); if (stcb != NULL) { SCTP_INP_DECR_REF(*inp_p); } } else { stcb = sctp_findassociation_ep_addr(inp_p, &remote_store.sa, netp, dst, NULL); } return (stcb); } /* * allocate a sctp_inpcb and setup a temporary binding to a port/all * addresses. This way if we don't get a bind we by default pick a ephemeral * port with all addresses bound. */ int sctp_inpcb_alloc(struct socket *so, uint32_t vrf_id) { /* * we get called when a new endpoint starts up. We need to allocate * the sctp_inpcb structure from the zone and init it. Mark it as * unbound and find a port that we can use as an ephemeral with * INADDR_ANY. If the user binds later no problem we can then add in * the specific addresses. And setup the default parameters for the * EP. */ int i, error; struct sctp_inpcb *inp; struct sctp_pcb *m; struct timeval time; sctp_sharedkey_t *null_key; error = 0; SCTP_INP_INFO_WLOCK(); inp = SCTP_ZONE_GET(SCTP_BASE_INFO(ipi_zone_ep), struct sctp_inpcb); if (inp == NULL) { SCTP_PRINTF("Out of SCTP-INPCB structures - no resources\n"); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, ENOBUFS); return (ENOBUFS); } /* zap it */ bzero(inp, sizeof(*inp)); /* bump generations */ /* setup socket pointers */ inp->sctp_socket = so; inp->ip_inp.inp.inp_socket = so; inp->ip_inp.inp.inp_cred = crhold(so->so_cred); #ifdef INET6 if (INP_SOCKAF(so) == AF_INET6) { if (MODULE_GLOBAL(ip6_auto_flowlabel)) { inp->ip_inp.inp.inp_flags |= IN6P_AUTOFLOWLABEL; } if (MODULE_GLOBAL(ip6_v6only)) { inp->ip_inp.inp.inp_flags |= IN6P_IPV6_V6ONLY; } } #endif inp->sctp_associd_counter = 1; inp->partial_delivery_point = SCTP_SB_LIMIT_RCV(so) >> SCTP_PARTIAL_DELIVERY_SHIFT; inp->sctp_frag_point = SCTP_DEFAULT_MAXSEGMENT; inp->max_cwnd = 0; inp->sctp_cmt_on_off = SCTP_BASE_SYSCTL(sctp_cmt_on_off); inp->ecn_supported = (uint8_t) SCTP_BASE_SYSCTL(sctp_ecn_enable); inp->prsctp_supported = (uint8_t) SCTP_BASE_SYSCTL(sctp_pr_enable); inp->auth_supported = (uint8_t) SCTP_BASE_SYSCTL(sctp_auth_enable); inp->asconf_supported = (uint8_t) SCTP_BASE_SYSCTL(sctp_asconf_enable); inp->reconfig_supported = (uint8_t) SCTP_BASE_SYSCTL(sctp_reconfig_enable); inp->nrsack_supported = (uint8_t) SCTP_BASE_SYSCTL(sctp_nrsack_enable); inp->pktdrop_supported = (uint8_t) SCTP_BASE_SYSCTL(sctp_pktdrop_enable); inp->idata_supported = 0; inp->fibnum = so->so_fibnum; /* init the small hash table we use to track asocid <-> tcb */ inp->sctp_asocidhash = SCTP_HASH_INIT(SCTP_STACK_VTAG_HASH_SIZE, &inp->hashasocidmark); if (inp->sctp_asocidhash == NULL) { crfree(inp->ip_inp.inp.inp_cred); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_ep), inp); SCTP_INP_INFO_WUNLOCK(); return (ENOBUFS); } #ifdef IPSEC error = ipsec_init_policy(so, &inp->ip_inp.inp.inp_sp); if (error != 0) { crfree(inp->ip_inp.inp.inp_cred); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_ep), inp); SCTP_INP_INFO_WUNLOCK(); return error; } #endif /* IPSEC */ SCTP_INCR_EP_COUNT(); inp->ip_inp.inp.inp_ip_ttl = MODULE_GLOBAL(ip_defttl); SCTP_INP_INFO_WUNLOCK(); so->so_pcb = (caddr_t)inp; if (SCTP_SO_TYPE(so) == SOCK_SEQPACKET) { /* UDP style socket */ inp->sctp_flags = (SCTP_PCB_FLAGS_UDPTYPE | SCTP_PCB_FLAGS_UNBOUND); /* Be sure it is NON-BLOCKING IO for UDP */ /* SCTP_SET_SO_NBIO(so); */ } else if (SCTP_SO_TYPE(so) == SOCK_STREAM) { /* TCP style socket */ inp->sctp_flags = (SCTP_PCB_FLAGS_TCPTYPE | SCTP_PCB_FLAGS_UNBOUND); /* Be sure we have blocking IO by default */ SCTP_CLEAR_SO_NBIO(so); } else { /* * unsupported socket type (RAW, etc)- in case we missed it * in protosw */ SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EOPNOTSUPP); so->so_pcb = NULL; crfree(inp->ip_inp.inp.inp_cred); #ifdef IPSEC ipsec_delete_pcbpolicy(&inp->ip_inp.inp); #endif SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_ep), inp); return (EOPNOTSUPP); } if (SCTP_BASE_SYSCTL(sctp_default_frag_interleave) == SCTP_FRAG_LEVEL_1) { sctp_feature_on(inp, SCTP_PCB_FLAGS_FRAG_INTERLEAVE); sctp_feature_off(inp, SCTP_PCB_FLAGS_INTERLEAVE_STRMS); } else if (SCTP_BASE_SYSCTL(sctp_default_frag_interleave) == SCTP_FRAG_LEVEL_2) { sctp_feature_on(inp, SCTP_PCB_FLAGS_FRAG_INTERLEAVE); sctp_feature_on(inp, SCTP_PCB_FLAGS_INTERLEAVE_STRMS); } else if (SCTP_BASE_SYSCTL(sctp_default_frag_interleave) == SCTP_FRAG_LEVEL_0) { sctp_feature_off(inp, SCTP_PCB_FLAGS_FRAG_INTERLEAVE); sctp_feature_off(inp, SCTP_PCB_FLAGS_INTERLEAVE_STRMS); } inp->sctp_tcbhash = SCTP_HASH_INIT(SCTP_BASE_SYSCTL(sctp_pcbtblsize), &inp->sctp_hashmark); if (inp->sctp_tcbhash == NULL) { SCTP_PRINTF("Out of SCTP-INPCB->hashinit - no resources\n"); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, ENOBUFS); so->so_pcb = NULL; crfree(inp->ip_inp.inp.inp_cred); #ifdef IPSEC ipsec_delete_pcbpolicy(&inp->ip_inp.inp); #endif SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_ep), inp); return (ENOBUFS); } inp->def_vrf_id = vrf_id; SCTP_INP_INFO_WLOCK(); SCTP_INP_LOCK_INIT(inp); INP_LOCK_INIT(&inp->ip_inp.inp, "inp", "sctpinp"); SCTP_INP_READ_INIT(inp); SCTP_ASOC_CREATE_LOCK_INIT(inp); /* lock the new ep */ SCTP_INP_WLOCK(inp); /* add it to the info area */ LIST_INSERT_HEAD(&SCTP_BASE_INFO(listhead), inp, sctp_list); SCTP_INP_INFO_WUNLOCK(); TAILQ_INIT(&inp->read_queue); LIST_INIT(&inp->sctp_addr_list); LIST_INIT(&inp->sctp_asoc_list); #ifdef SCTP_TRACK_FREED_ASOCS /* TEMP CODE */ LIST_INIT(&inp->sctp_asoc_free_list); #endif /* Init the timer structure for signature change */ SCTP_OS_TIMER_INIT(&inp->sctp_ep.signature_change.timer); inp->sctp_ep.signature_change.type = SCTP_TIMER_TYPE_NEWCOOKIE; /* now init the actual endpoint default data */ m = &inp->sctp_ep; /* setup the base timeout information */ m->sctp_timeoutticks[SCTP_TIMER_SEND] = SEC_TO_TICKS(SCTP_SEND_SEC); /* needed ? */ m->sctp_timeoutticks[SCTP_TIMER_INIT] = SEC_TO_TICKS(SCTP_INIT_SEC); /* needed ? */ m->sctp_timeoutticks[SCTP_TIMER_RECV] = MSEC_TO_TICKS(SCTP_BASE_SYSCTL(sctp_delayed_sack_time_default)); m->sctp_timeoutticks[SCTP_TIMER_HEARTBEAT] = MSEC_TO_TICKS(SCTP_BASE_SYSCTL(sctp_heartbeat_interval_default)); m->sctp_timeoutticks[SCTP_TIMER_PMTU] = SEC_TO_TICKS(SCTP_BASE_SYSCTL(sctp_pmtu_raise_time_default)); m->sctp_timeoutticks[SCTP_TIMER_MAXSHUTDOWN] = SEC_TO_TICKS(SCTP_BASE_SYSCTL(sctp_shutdown_guard_time_default)); m->sctp_timeoutticks[SCTP_TIMER_SIGNATURE] = SEC_TO_TICKS(SCTP_BASE_SYSCTL(sctp_secret_lifetime_default)); /* all max/min max are in ms */ m->sctp_maxrto = SCTP_BASE_SYSCTL(sctp_rto_max_default); m->sctp_minrto = SCTP_BASE_SYSCTL(sctp_rto_min_default); m->initial_rto = SCTP_BASE_SYSCTL(sctp_rto_initial_default); m->initial_init_rto_max = SCTP_BASE_SYSCTL(sctp_init_rto_max_default); m->sctp_sack_freq = SCTP_BASE_SYSCTL(sctp_sack_freq_default); m->max_init_times = SCTP_BASE_SYSCTL(sctp_init_rtx_max_default); m->max_send_times = SCTP_BASE_SYSCTL(sctp_assoc_rtx_max_default); m->def_net_failure = SCTP_BASE_SYSCTL(sctp_path_rtx_max_default); m->def_net_pf_threshold = SCTP_BASE_SYSCTL(sctp_path_pf_threshold); m->sctp_sws_sender = SCTP_SWS_SENDER_DEF; m->sctp_sws_receiver = SCTP_SWS_RECEIVER_DEF; m->max_burst = SCTP_BASE_SYSCTL(sctp_max_burst_default); m->fr_max_burst = SCTP_BASE_SYSCTL(sctp_fr_max_burst_default); m->sctp_default_cc_module = SCTP_BASE_SYSCTL(sctp_default_cc_module); m->sctp_default_ss_module = SCTP_BASE_SYSCTL(sctp_default_ss_module); m->max_open_streams_intome = SCTP_BASE_SYSCTL(sctp_nr_incoming_streams_default); /* number of streams to pre-open on a association */ m->pre_open_stream_count = SCTP_BASE_SYSCTL(sctp_nr_outgoing_streams_default); /* Add adaptation cookie */ m->adaptation_layer_indicator = 0; m->adaptation_layer_indicator_provided = 0; /* seed random number generator */ m->random_counter = 1; m->store_at = SCTP_SIGNATURE_SIZE; SCTP_READ_RANDOM(m->random_numbers, sizeof(m->random_numbers)); sctp_fill_random_store(m); /* Minimum cookie size */ m->size_of_a_cookie = (sizeof(struct sctp_init_msg) * 2) + sizeof(struct sctp_state_cookie); m->size_of_a_cookie += SCTP_SIGNATURE_SIZE; /* Setup the initial secret */ (void)SCTP_GETTIME_TIMEVAL(&time); m->time_of_secret_change = time.tv_sec; for (i = 0; i < SCTP_NUMBER_OF_SECRETS; i++) { m->secret_key[0][i] = sctp_select_initial_TSN(m); } sctp_timer_start(SCTP_TIMER_TYPE_NEWCOOKIE, inp, NULL, NULL); /* How long is a cookie good for ? */ m->def_cookie_life = MSEC_TO_TICKS(SCTP_BASE_SYSCTL(sctp_valid_cookie_life_default)); /* * Initialize authentication parameters */ m->local_hmacs = sctp_default_supported_hmaclist(); m->local_auth_chunks = sctp_alloc_chunklist(); if (inp->asconf_supported) { sctp_auth_add_chunk(SCTP_ASCONF, m->local_auth_chunks); sctp_auth_add_chunk(SCTP_ASCONF_ACK, m->local_auth_chunks); } m->default_dscp = 0; #ifdef INET6 m->default_flowlabel = 0; #endif m->port = 0; /* encapsulation disabled by default */ LIST_INIT(&m->shared_keys); /* add default NULL key as key id 0 */ null_key = sctp_alloc_sharedkey(); sctp_insert_sharedkey(&m->shared_keys, null_key); SCTP_INP_WUNLOCK(inp); #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 12); #endif return (error); } void sctp_move_pcb_and_assoc(struct sctp_inpcb *old_inp, struct sctp_inpcb *new_inp, struct sctp_tcb *stcb) { struct sctp_nets *net; uint16_t lport, rport; struct sctppcbhead *head; struct sctp_laddr *laddr, *oladdr; atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_INP_INFO_WLOCK(); SCTP_INP_WLOCK(old_inp); SCTP_INP_WLOCK(new_inp); SCTP_TCB_LOCK(stcb); atomic_subtract_int(&stcb->asoc.refcnt, 1); new_inp->sctp_ep.time_of_secret_change = old_inp->sctp_ep.time_of_secret_change; memcpy(new_inp->sctp_ep.secret_key, old_inp->sctp_ep.secret_key, sizeof(old_inp->sctp_ep.secret_key)); new_inp->sctp_ep.current_secret_number = old_inp->sctp_ep.current_secret_number; new_inp->sctp_ep.last_secret_number = old_inp->sctp_ep.last_secret_number; new_inp->sctp_ep.size_of_a_cookie = old_inp->sctp_ep.size_of_a_cookie; /* make it so new data pours into the new socket */ stcb->sctp_socket = new_inp->sctp_socket; stcb->sctp_ep = new_inp; /* Copy the port across */ lport = new_inp->sctp_lport = old_inp->sctp_lport; rport = stcb->rport; /* Pull the tcb from the old association */ LIST_REMOVE(stcb, sctp_tcbhash); LIST_REMOVE(stcb, sctp_tcblist); if (stcb->asoc.in_asocid_hash) { LIST_REMOVE(stcb, sctp_tcbasocidhash); } /* Now insert the new_inp into the TCP connected hash */ head = &SCTP_BASE_INFO(sctp_tcpephash)[SCTP_PCBHASH_ALLADDR((lport | rport), SCTP_BASE_INFO(hashtcpmark))]; LIST_INSERT_HEAD(head, new_inp, sctp_hash); /* Its safe to access */ new_inp->sctp_flags &= ~SCTP_PCB_FLAGS_UNBOUND; /* Now move the tcb into the endpoint list */ LIST_INSERT_HEAD(&new_inp->sctp_asoc_list, stcb, sctp_tcblist); /* * Question, do we even need to worry about the ep-hash since we * only have one connection? Probably not :> so lets get rid of it * and not suck up any kernel memory in that. */ if (stcb->asoc.in_asocid_hash) { struct sctpasochead *lhd; lhd = &new_inp->sctp_asocidhash[SCTP_PCBHASH_ASOC(stcb->asoc.assoc_id, new_inp->hashasocidmark)]; LIST_INSERT_HEAD(lhd, stcb, sctp_tcbasocidhash); } /* Ok. Let's restart timer. */ TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { sctp_timer_start(SCTP_TIMER_TYPE_PATHMTURAISE, new_inp, stcb, net); } SCTP_INP_INFO_WUNLOCK(); if (new_inp->sctp_tcbhash != NULL) { SCTP_HASH_FREE(new_inp->sctp_tcbhash, new_inp->sctp_hashmark); new_inp->sctp_tcbhash = NULL; } if ((new_inp->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL) == 0) { /* Subset bound, so copy in the laddr list from the old_inp */ LIST_FOREACH(oladdr, &old_inp->sctp_addr_list, sctp_nxt_addr) { laddr = SCTP_ZONE_GET(SCTP_BASE_INFO(ipi_zone_laddr), struct sctp_laddr); if (laddr == NULL) { /* * Gak, what can we do? This assoc is really * HOSED. We probably should send an abort * here. */ SCTPDBG(SCTP_DEBUG_PCB1, "Association hosed in TCP model, out of laddr memory\n"); continue; } SCTP_INCR_LADDR_COUNT(); bzero(laddr, sizeof(*laddr)); (void)SCTP_GETTIME_TIMEVAL(&laddr->start_time); laddr->ifa = oladdr->ifa; atomic_add_int(&laddr->ifa->refcount, 1); LIST_INSERT_HEAD(&new_inp->sctp_addr_list, laddr, sctp_nxt_addr); new_inp->laddr_count++; if (oladdr == stcb->asoc.last_used_address) { stcb->asoc.last_used_address = laddr; } } } /* * Now any running timers need to be adjusted since we really don't * care if they are running or not just blast in the new_inp into * all of them. */ stcb->asoc.dack_timer.ep = (void *)new_inp; stcb->asoc.asconf_timer.ep = (void *)new_inp; stcb->asoc.strreset_timer.ep = (void *)new_inp; stcb->asoc.shut_guard_timer.ep = (void *)new_inp; stcb->asoc.autoclose_timer.ep = (void *)new_inp; stcb->asoc.delayed_event_timer.ep = (void *)new_inp; stcb->asoc.delete_prim_timer.ep = (void *)new_inp; /* now what about the nets? */ TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { net->pmtu_timer.ep = (void *)new_inp; net->hb_timer.ep = (void *)new_inp; net->rxt_timer.ep = (void *)new_inp; } SCTP_INP_WUNLOCK(new_inp); SCTP_INP_WUNLOCK(old_inp); } /* * insert an laddr entry with the given ifa for the desired list */ static int sctp_insert_laddr(struct sctpladdr *list, struct sctp_ifa *ifa, uint32_t act) { struct sctp_laddr *laddr; laddr = SCTP_ZONE_GET(SCTP_BASE_INFO(ipi_zone_laddr), struct sctp_laddr); if (laddr == NULL) { /* out of memory? */ SCTP_LTRACE_ERR_RET(NULL, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); return (EINVAL); } SCTP_INCR_LADDR_COUNT(); bzero(laddr, sizeof(*laddr)); (void)SCTP_GETTIME_TIMEVAL(&laddr->start_time); laddr->ifa = ifa; laddr->action = act; atomic_add_int(&ifa->refcount, 1); /* insert it */ LIST_INSERT_HEAD(list, laddr, sctp_nxt_addr); return (0); } /* * Remove an laddr entry from the local address list (on an assoc) */ static void sctp_remove_laddr(struct sctp_laddr *laddr) { /* remove from the list */ LIST_REMOVE(laddr, sctp_nxt_addr); sctp_free_ifa(laddr->ifa); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_laddr), laddr); SCTP_DECR_LADDR_COUNT(); } /* sctp_ifap is used to bypass normal local address validation checks */ int sctp_inpcb_bind(struct socket *so, struct sockaddr *addr, struct sctp_ifa *sctp_ifap, struct thread *p) { /* bind a ep to a socket address */ struct sctppcbhead *head; struct sctp_inpcb *inp, *inp_tmp; struct inpcb *ip_inp; int port_reuse_active = 0; int bindall; uint16_t lport; int error; uint32_t vrf_id; lport = 0; bindall = 1; inp = (struct sctp_inpcb *)so->so_pcb; ip_inp = (struct inpcb *)so->so_pcb; #ifdef SCTP_DEBUG if (addr) { SCTPDBG(SCTP_DEBUG_PCB1, "Bind called port: %d\n", ntohs(((struct sockaddr_in *)addr)->sin_port)); SCTPDBG(SCTP_DEBUG_PCB1, "Addr: "); SCTPDBG_ADDR(SCTP_DEBUG_PCB1, addr); } #endif if ((inp->sctp_flags & SCTP_PCB_FLAGS_UNBOUND) == 0) { /* already did a bind, subsequent binds NOT allowed ! */ SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); return (EINVAL); } #ifdef INVARIANTS if (p == NULL) panic("null proc/thread"); #endif if (addr != NULL) { switch (addr->sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin; /* IPV6_V6ONLY socket? */ if (SCTP_IPV6_V6ONLY(ip_inp)) { SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); return (EINVAL); } if (addr->sa_len != sizeof(*sin)) { SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); return (EINVAL); } sin = (struct sockaddr_in *)addr; lport = sin->sin_port; /* * For LOOPBACK the prison_local_ip4() call * will transmute the ip address to the * proper value. */ if (p && (error = prison_local_ip4(p->td_ucred, &sin->sin_addr)) != 0) { SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, error); return (error); } if (sin->sin_addr.s_addr != INADDR_ANY) { bindall = 0; } break; } #endif #ifdef INET6 case AF_INET6: { /* * Only for pure IPv6 Address. (No IPv4 * Mapped!) */ struct sockaddr_in6 *sin6; sin6 = (struct sockaddr_in6 *)addr; if (addr->sa_len != sizeof(*sin6)) { SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); return (EINVAL); } lport = sin6->sin6_port; /* * For LOOPBACK the prison_local_ip6() call * will transmute the ipv6 address to the * proper value. */ if (p && (error = prison_local_ip6(p->td_ucred, &sin6->sin6_addr, (SCTP_IPV6_V6ONLY(inp) != 0))) != 0) { SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, error); return (error); } if (!IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) { bindall = 0; /* KAME hack: embed scopeid */ if (sa6_embedscope(sin6, MODULE_GLOBAL(ip6_use_defzone)) != 0) { SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); return (EINVAL); } } /* this must be cleared for ifa_ifwithaddr() */ sin6->sin6_scope_id = 0; break; } #endif default: SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EAFNOSUPPORT); return (EAFNOSUPPORT); } } SCTP_INP_INFO_WLOCK(); SCTP_INP_WLOCK(inp); /* Setup a vrf_id to be the default for the non-bind-all case. */ vrf_id = inp->def_vrf_id; /* increase our count due to the unlock we do */ SCTP_INP_INCR_REF(inp); if (lport) { /* * Did the caller specify a port? if so we must see if an ep * already has this one bound. */ /* got to be root to get at low ports */ if (ntohs(lport) < IPPORT_RESERVED) { if (p && (error = priv_check(p, PRIV_NETINET_RESERVEDPORT) )) { SCTP_INP_DECR_REF(inp); SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); return (error); } } SCTP_INP_WUNLOCK(inp); if (bindall) { vrf_id = inp->def_vrf_id; inp_tmp = sctp_pcb_findep(addr, 0, 1, vrf_id); if (inp_tmp != NULL) { /* * lock guy returned and lower count note * that we are not bound so inp_tmp should * NEVER be inp. And it is this inp * (inp_tmp) that gets the reference bump, * so we must lower it. */ SCTP_INP_DECR_REF(inp_tmp); /* unlock info */ if ((sctp_is_feature_on(inp, SCTP_PCB_FLAGS_PORTREUSE)) && (sctp_is_feature_on(inp_tmp, SCTP_PCB_FLAGS_PORTREUSE))) { /* * Ok, must be one-2-one and * allowing port re-use */ port_reuse_active = 1; goto continue_anyway; } SCTP_INP_DECR_REF(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EADDRINUSE); return (EADDRINUSE); } } else { inp_tmp = sctp_pcb_findep(addr, 0, 1, vrf_id); if (inp_tmp != NULL) { /* * lock guy returned and lower count note * that we are not bound so inp_tmp should * NEVER be inp. And it is this inp * (inp_tmp) that gets the reference bump, * so we must lower it. */ SCTP_INP_DECR_REF(inp_tmp); /* unlock info */ if ((sctp_is_feature_on(inp, SCTP_PCB_FLAGS_PORTREUSE)) && (sctp_is_feature_on(inp_tmp, SCTP_PCB_FLAGS_PORTREUSE))) { /* * Ok, must be one-2-one and * allowing port re-use */ port_reuse_active = 1; goto continue_anyway; } SCTP_INP_DECR_REF(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EADDRINUSE); return (EADDRINUSE); } } continue_anyway: SCTP_INP_WLOCK(inp); if (bindall) { /* verify that no lport is not used by a singleton */ if ((port_reuse_active == 0) && (inp_tmp = sctp_isport_inuse(inp, lport, vrf_id))) { /* Sorry someone already has this one bound */ if ((sctp_is_feature_on(inp, SCTP_PCB_FLAGS_PORTREUSE)) && (sctp_is_feature_on(inp_tmp, SCTP_PCB_FLAGS_PORTREUSE))) { port_reuse_active = 1; } else { SCTP_INP_DECR_REF(inp); SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EADDRINUSE); return (EADDRINUSE); } } } } else { uint16_t first, last, candidate; uint16_t count; int done; if (ip_inp->inp_flags & INP_HIGHPORT) { first = MODULE_GLOBAL(ipport_hifirstauto); last = MODULE_GLOBAL(ipport_hilastauto); } else if (ip_inp->inp_flags & INP_LOWPORT) { if (p && (error = priv_check(p, PRIV_NETINET_RESERVEDPORT) )) { SCTP_INP_DECR_REF(inp); SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, error); return (error); } first = MODULE_GLOBAL(ipport_lowfirstauto); last = MODULE_GLOBAL(ipport_lowlastauto); } else { first = MODULE_GLOBAL(ipport_firstauto); last = MODULE_GLOBAL(ipport_lastauto); } if (first > last) { uint16_t temp; temp = first; first = last; last = temp; } count = last - first + 1; /* number of candidates */ candidate = first + sctp_select_initial_TSN(&inp->sctp_ep) % (count); done = 0; while (!done) { if (sctp_isport_inuse(inp, htons(candidate), inp->def_vrf_id) == NULL) { done = 1; } if (!done) { if (--count == 0) { SCTP_INP_DECR_REF(inp); SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EADDRINUSE); return (EADDRINUSE); } if (candidate == last) candidate = first; else candidate = candidate + 1; } } lport = htons(candidate); } SCTP_INP_DECR_REF(inp); if (inp->sctp_flags & (SCTP_PCB_FLAGS_SOCKET_GONE | SCTP_PCB_FLAGS_SOCKET_ALLGONE)) { /* * this really should not happen. The guy did a non-blocking * bind and then did a close at the same time. */ SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); return (EINVAL); } /* ok we look clear to give out this port, so lets setup the binding */ if (bindall) { /* binding to all addresses, so just set in the proper flags */ inp->sctp_flags |= SCTP_PCB_FLAGS_BOUNDALL; /* set the automatic addr changes from kernel flag */ if (SCTP_BASE_SYSCTL(sctp_auto_asconf) == 0) { sctp_feature_off(inp, SCTP_PCB_FLAGS_DO_ASCONF); sctp_feature_off(inp, SCTP_PCB_FLAGS_AUTO_ASCONF); } else { sctp_feature_on(inp, SCTP_PCB_FLAGS_DO_ASCONF); sctp_feature_on(inp, SCTP_PCB_FLAGS_AUTO_ASCONF); } if (SCTP_BASE_SYSCTL(sctp_multiple_asconfs) == 0) { sctp_feature_off(inp, SCTP_PCB_FLAGS_MULTIPLE_ASCONFS); } else { sctp_feature_on(inp, SCTP_PCB_FLAGS_MULTIPLE_ASCONFS); } /* * set the automatic mobility_base from kernel flag (by * micchie) */ if (SCTP_BASE_SYSCTL(sctp_mobility_base) == 0) { sctp_mobility_feature_off(inp, SCTP_MOBILITY_BASE); sctp_mobility_feature_off(inp, SCTP_MOBILITY_PRIM_DELETED); } else { sctp_mobility_feature_on(inp, SCTP_MOBILITY_BASE); sctp_mobility_feature_off(inp, SCTP_MOBILITY_PRIM_DELETED); } /* * set the automatic mobility_fasthandoff from kernel flag * (by micchie) */ if (SCTP_BASE_SYSCTL(sctp_mobility_fasthandoff) == 0) { sctp_mobility_feature_off(inp, SCTP_MOBILITY_FASTHANDOFF); sctp_mobility_feature_off(inp, SCTP_MOBILITY_PRIM_DELETED); } else { sctp_mobility_feature_on(inp, SCTP_MOBILITY_FASTHANDOFF); sctp_mobility_feature_off(inp, SCTP_MOBILITY_PRIM_DELETED); } } else { /* * bind specific, make sure flags is off and add a new * address structure to the sctp_addr_list inside the ep * structure. * * We will need to allocate one and insert it at the head. The * socketopt call can just insert new addresses in there as * well. It will also have to do the embed scope kame hack * too (before adding). */ struct sctp_ifa *ifa; union sctp_sockstore store; memset(&store, 0, sizeof(store)); switch (addr->sa_family) { #ifdef INET case AF_INET: memcpy(&store.sin, addr, sizeof(struct sockaddr_in)); store.sin.sin_port = 0; break; #endif #ifdef INET6 case AF_INET6: memcpy(&store.sin6, addr, sizeof(struct sockaddr_in6)); store.sin6.sin6_port = 0; break; #endif default: break; } /* * first find the interface with the bound address need to * zero out the port to find the address! yuck! can't do * this earlier since need port for sctp_pcb_findep() */ if (sctp_ifap != NULL) { ifa = sctp_ifap; } else { /* * Note for BSD we hit here always other O/S's will * pass things in via the sctp_ifap argument * (Panda). */ ifa = sctp_find_ifa_by_addr(&store.sa, vrf_id, SCTP_ADDR_NOT_LOCKED); } if (ifa == NULL) { /* Can't find an interface with that address */ SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EADDRNOTAVAIL); return (EADDRNOTAVAIL); } #ifdef INET6 if (addr->sa_family == AF_INET6) { /* GAK, more FIXME IFA lock? */ if (ifa->localifa_flags & SCTP_ADDR_IFA_UNUSEABLE) { /* Can't bind a non-existent addr. */ SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); return (EINVAL); } } #endif /* we're not bound all */ inp->sctp_flags &= ~SCTP_PCB_FLAGS_BOUNDALL; /* allow bindx() to send ASCONF's for binding changes */ sctp_feature_on(inp, SCTP_PCB_FLAGS_DO_ASCONF); /* clear automatic addr changes from kernel flag */ sctp_feature_off(inp, SCTP_PCB_FLAGS_AUTO_ASCONF); /* add this address to the endpoint list */ error = sctp_insert_laddr(&inp->sctp_addr_list, ifa, 0); if (error != 0) { SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); return (error); } inp->laddr_count++; } /* find the bucket */ if (port_reuse_active) { /* Put it into tcp 1-2-1 hash */ head = &SCTP_BASE_INFO(sctp_tcpephash)[SCTP_PCBHASH_ALLADDR(lport, SCTP_BASE_INFO(hashtcpmark))]; inp->sctp_flags |= SCTP_PCB_FLAGS_IN_TCPPOOL; } else { head = &SCTP_BASE_INFO(sctp_ephash)[SCTP_PCBHASH_ALLADDR(lport, SCTP_BASE_INFO(hashmark))]; } /* put it in the bucket */ LIST_INSERT_HEAD(head, inp, sctp_hash); SCTPDBG(SCTP_DEBUG_PCB1, "Main hash to bind at head:%p, bound port:%d - in tcp_pool=%d\n", (void *)head, ntohs(lport), port_reuse_active); /* set in the port */ inp->sctp_lport = lport; /* turn off just the unbound flag */ inp->sctp_flags &= ~SCTP_PCB_FLAGS_UNBOUND; SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); return (0); } static void sctp_iterator_inp_being_freed(struct sctp_inpcb *inp) { struct sctp_iterator *it, *nit; /* * We enter with the only the ITERATOR_LOCK in place and a write * lock on the inp_info stuff. */ it = sctp_it_ctl.cur_it; if (it && (it->vn != curvnet)) { /* Its not looking at our VNET */ return; } if (it && (it->inp == inp)) { /* * This is tricky and we hold the iterator lock, but when it * returns and gets the lock (when we release it) the * iterator will try to operate on inp. We need to stop that * from happening. But of course the iterator has a * reference on the stcb and inp. We can mark it and it will * stop. * * If its a single iterator situation, we set the end iterator * flag. Otherwise we set the iterator to go to the next * inp. * */ if (it->iterator_flags & SCTP_ITERATOR_DO_SINGLE_INP) { sctp_it_ctl.iterator_flags |= SCTP_ITERATOR_STOP_CUR_IT; } else { sctp_it_ctl.iterator_flags |= SCTP_ITERATOR_STOP_CUR_INP; } } /* * Now go through and remove any single reference to our inp that * may be still pending on the list */ SCTP_IPI_ITERATOR_WQ_LOCK(); TAILQ_FOREACH_SAFE(it, &sctp_it_ctl.iteratorhead, sctp_nxt_itr, nit) { if (it->vn != curvnet) { continue; } if (it->inp == inp) { /* This one points to me is it inp specific? */ if (it->iterator_flags & SCTP_ITERATOR_DO_SINGLE_INP) { /* Remove and free this one */ TAILQ_REMOVE(&sctp_it_ctl.iteratorhead, it, sctp_nxt_itr); if (it->function_atend != NULL) { (*it->function_atend) (it->pointer, it->val); } SCTP_FREE(it, SCTP_M_ITER); } else { it->inp = LIST_NEXT(it->inp, sctp_list); if (it->inp) { SCTP_INP_INCR_REF(it->inp); } } /* * When its put in the refcnt is incremented so decr * it */ SCTP_INP_DECR_REF(inp); } } SCTP_IPI_ITERATOR_WQ_UNLOCK(); } /* release sctp_inpcb unbind the port */ void sctp_inpcb_free(struct sctp_inpcb *inp, int immediate, int from) { /* * Here we free a endpoint. We must find it (if it is in the Hash * table) and remove it from there. Then we must also find it in the * overall list and remove it from there. After all removals are * complete then any timer has to be stopped. Then start the actual * freeing. a) Any local lists. b) Any associations. c) The hash of * all associations. d) finally the ep itself. */ struct sctp_tcb *asoc, *nasoc; struct sctp_laddr *laddr, *nladdr; struct inpcb *ip_pcb; struct socket *so; int being_refed = 0; struct sctp_queued_to_read *sq, *nsq; int cnt; sctp_sharedkey_t *shared_key, *nshared_key; #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 0); #endif SCTP_ITERATOR_LOCK(); /* mark any iterators on the list or being processed */ sctp_iterator_inp_being_freed(inp); SCTP_ITERATOR_UNLOCK(); so = inp->sctp_socket; if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { /* been here before.. eeks.. get out of here */ SCTP_PRINTF("This conflict in free SHOULD not be happening! from %d, imm %d\n", from, immediate); #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 1); #endif return; } SCTP_ASOC_CREATE_LOCK(inp); SCTP_INP_INFO_WLOCK(); SCTP_INP_WLOCK(inp); if (from == SCTP_CALLED_AFTER_CMPSET_OFCLOSE) { inp->sctp_flags &= ~SCTP_PCB_FLAGS_CLOSE_IP; /* socket is gone, so no more wakeups allowed */ inp->sctp_flags |= SCTP_PCB_FLAGS_DONT_WAKE; inp->sctp_flags &= ~SCTP_PCB_FLAGS_WAKEINPUT; inp->sctp_flags &= ~SCTP_PCB_FLAGS_WAKEOUTPUT; } /* First time through we have the socket lock, after that no more. */ sctp_timer_stop(SCTP_TIMER_TYPE_NEWCOOKIE, inp, NULL, NULL, SCTP_FROM_SCTP_PCB + SCTP_LOC_1); if (inp->control) { sctp_m_freem(inp->control); inp->control = NULL; } if (inp->pkt) { sctp_m_freem(inp->pkt); inp->pkt = NULL; } ip_pcb = &inp->ip_inp.inp; /* we could just cast the main pointer * here but I will be nice :> (i.e. * ip_pcb = ep;) */ if (immediate == SCTP_FREE_SHOULD_USE_GRACEFUL_CLOSE) { int cnt_in_sd; cnt_in_sd = 0; LIST_FOREACH_SAFE(asoc, &inp->sctp_asoc_list, sctp_tcblist, nasoc) { SCTP_TCB_LOCK(asoc); if (asoc->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) { /* Skip guys being freed */ cnt_in_sd++; if (asoc->asoc.state & SCTP_STATE_IN_ACCEPT_QUEUE) { /* * Special case - we did not start a * kill timer on the asoc due to it * was not closed. So go ahead and * start it now. */ asoc->asoc.state &= ~SCTP_STATE_IN_ACCEPT_QUEUE; sctp_timer_start(SCTP_TIMER_TYPE_ASOCKILL, inp, asoc, NULL); } SCTP_TCB_UNLOCK(asoc); continue; } if (((SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_COOKIE_WAIT) || (SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_COOKIE_ECHOED)) && (asoc->asoc.total_output_queue_size == 0)) { /* * If we have data in queue, we don't want * to just free since the app may have done, * send()/close or connect/send/close. And * it wants the data to get across first. */ /* Just abandon things in the front states */ if (sctp_free_assoc(inp, asoc, SCTP_PCBFREE_NOFORCE, SCTP_FROM_SCTP_PCB + SCTP_LOC_2) == 0) { cnt_in_sd++; } continue; } /* Disconnect the socket please */ asoc->sctp_socket = NULL; asoc->asoc.state |= SCTP_STATE_CLOSED_SOCKET; if ((asoc->asoc.size_on_reasm_queue > 0) || (asoc->asoc.control_pdapi) || (asoc->asoc.size_on_all_streams > 0) || (so && (so->so_rcv.sb_cc > 0))) { /* Left with Data unread */ struct mbuf *op_err; op_err = sctp_generate_cause(SCTP_CAUSE_USER_INITIATED_ABT, ""); asoc->sctp_ep->last_abort_code = SCTP_FROM_SCTP_PCB + SCTP_LOC_3; sctp_send_abort_tcb(asoc, op_err, SCTP_SO_LOCKED); SCTP_STAT_INCR_COUNTER32(sctps_aborted); if ((SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_OPEN) || (SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_SHUTDOWN_RECEIVED)) { SCTP_STAT_DECR_GAUGE32(sctps_currestab); } if (sctp_free_assoc(inp, asoc, SCTP_PCBFREE_NOFORCE, SCTP_FROM_SCTP_PCB + SCTP_LOC_4) == 0) { cnt_in_sd++; } continue; } else if (TAILQ_EMPTY(&asoc->asoc.send_queue) && TAILQ_EMPTY(&asoc->asoc.sent_queue) && (asoc->asoc.stream_queue_cnt == 0)) { if (asoc->asoc.locked_on_sending) { goto abort_anyway; } if ((SCTP_GET_STATE(&asoc->asoc) != SCTP_STATE_SHUTDOWN_SENT) && (SCTP_GET_STATE(&asoc->asoc) != SCTP_STATE_SHUTDOWN_ACK_SENT)) { struct sctp_nets *netp; /* * there is nothing queued to send, * so I send shutdown */ if ((SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_OPEN) || (SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_SHUTDOWN_RECEIVED)) { SCTP_STAT_DECR_GAUGE32(sctps_currestab); } SCTP_SET_STATE(&asoc->asoc, SCTP_STATE_SHUTDOWN_SENT); SCTP_CLEAR_SUBSTATE(&asoc->asoc, SCTP_STATE_SHUTDOWN_PENDING); sctp_stop_timers_for_shutdown(asoc); if (asoc->asoc.alternate) { netp = asoc->asoc.alternate; } else { netp = asoc->asoc.primary_destination; } sctp_send_shutdown(asoc, netp); sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWN, asoc->sctp_ep, asoc, netp); sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWNGUARD, asoc->sctp_ep, asoc, asoc->asoc.primary_destination); sctp_chunk_output(inp, asoc, SCTP_OUTPUT_FROM_SHUT_TMR, SCTP_SO_LOCKED); } } else { /* mark into shutdown pending */ struct sctp_stream_queue_pending *sp; asoc->asoc.state |= SCTP_STATE_SHUTDOWN_PENDING; sctp_timer_start(SCTP_TIMER_TYPE_SHUTDOWNGUARD, asoc->sctp_ep, asoc, asoc->asoc.primary_destination); if (asoc->asoc.locked_on_sending) { sp = TAILQ_LAST(&((asoc->asoc.locked_on_sending)->outqueue), sctp_streamhead); if (sp == NULL) { SCTP_PRINTF("Error, sp is NULL, locked on sending is %p strm:%d\n", (void *)asoc->asoc.locked_on_sending, asoc->asoc.locked_on_sending->stream_no); } else { if ((sp->length == 0) && (sp->msg_is_complete == 0)) asoc->asoc.state |= SCTP_STATE_PARTIAL_MSG_LEFT; } } if (TAILQ_EMPTY(&asoc->asoc.send_queue) && TAILQ_EMPTY(&asoc->asoc.sent_queue) && (asoc->asoc.state & SCTP_STATE_PARTIAL_MSG_LEFT)) { struct mbuf *op_err; abort_anyway: op_err = sctp_generate_cause(SCTP_CAUSE_USER_INITIATED_ABT, ""); asoc->sctp_ep->last_abort_code = SCTP_FROM_SCTP_PCB + SCTP_LOC_5; sctp_send_abort_tcb(asoc, op_err, SCTP_SO_LOCKED); SCTP_STAT_INCR_COUNTER32(sctps_aborted); if ((SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_OPEN) || (SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_SHUTDOWN_RECEIVED)) { SCTP_STAT_DECR_GAUGE32(sctps_currestab); } if (sctp_free_assoc(inp, asoc, SCTP_PCBFREE_NOFORCE, SCTP_FROM_SCTP_PCB + SCTP_LOC_6) == 0) { cnt_in_sd++; } continue; } else { sctp_chunk_output(inp, asoc, SCTP_OUTPUT_FROM_CLOSING, SCTP_SO_LOCKED); } } cnt_in_sd++; SCTP_TCB_UNLOCK(asoc); } /* now is there some left in our SHUTDOWN state? */ if (cnt_in_sd) { #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 2); #endif inp->sctp_socket = NULL; SCTP_INP_WUNLOCK(inp); SCTP_ASOC_CREATE_UNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); return; } } inp->sctp_socket = NULL; if ((inp->sctp_flags & SCTP_PCB_FLAGS_UNBOUND) != SCTP_PCB_FLAGS_UNBOUND) { /* * ok, this guy has been bound. It's port is somewhere in * the SCTP_BASE_INFO(hash table). Remove it! */ LIST_REMOVE(inp, sctp_hash); inp->sctp_flags |= SCTP_PCB_FLAGS_UNBOUND; } /* * If there is a timer running to kill us, forget it, since it may * have a contest on the INP lock.. which would cause us to die ... */ cnt = 0; LIST_FOREACH_SAFE(asoc, &inp->sctp_asoc_list, sctp_tcblist, nasoc) { SCTP_TCB_LOCK(asoc); if (asoc->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) { if (asoc->asoc.state & SCTP_STATE_IN_ACCEPT_QUEUE) { asoc->asoc.state &= ~SCTP_STATE_IN_ACCEPT_QUEUE; sctp_timer_start(SCTP_TIMER_TYPE_ASOCKILL, inp, asoc, NULL); } cnt++; SCTP_TCB_UNLOCK(asoc); continue; } /* Free associations that are NOT killing us */ if ((SCTP_GET_STATE(&asoc->asoc) != SCTP_STATE_COOKIE_WAIT) && ((asoc->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) == 0)) { struct mbuf *op_err; op_err = sctp_generate_cause(SCTP_CAUSE_USER_INITIATED_ABT, ""); asoc->sctp_ep->last_abort_code = SCTP_FROM_SCTP_PCB + SCTP_LOC_7; sctp_send_abort_tcb(asoc, op_err, SCTP_SO_LOCKED); SCTP_STAT_INCR_COUNTER32(sctps_aborted); } else if (asoc->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) { cnt++; SCTP_TCB_UNLOCK(asoc); continue; } if ((SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_OPEN) || (SCTP_GET_STATE(&asoc->asoc) == SCTP_STATE_SHUTDOWN_RECEIVED)) { SCTP_STAT_DECR_GAUGE32(sctps_currestab); } if (sctp_free_assoc(inp, asoc, SCTP_PCBFREE_FORCE, SCTP_FROM_SCTP_PCB + SCTP_LOC_8) == 0) { cnt++; } } if (cnt) { /* Ok we have someone out there that will kill us */ (void)SCTP_OS_TIMER_STOP(&inp->sctp_ep.signature_change.timer); #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 3); #endif SCTP_INP_WUNLOCK(inp); SCTP_ASOC_CREATE_UNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); return; } if (SCTP_INP_LOCK_CONTENDED(inp)) being_refed++; if (SCTP_INP_READ_CONTENDED(inp)) being_refed++; if (SCTP_ASOC_CREATE_LOCK_CONTENDED(inp)) being_refed++; if ((inp->refcount) || (being_refed) || (inp->sctp_flags & SCTP_PCB_FLAGS_CLOSE_IP)) { (void)SCTP_OS_TIMER_STOP(&inp->sctp_ep.signature_change.timer); #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 4); #endif sctp_timer_start(SCTP_TIMER_TYPE_INPKILL, inp, NULL, NULL); SCTP_INP_WUNLOCK(inp); SCTP_ASOC_CREATE_UNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); return; } inp->sctp_ep.signature_change.type = 0; inp->sctp_flags |= SCTP_PCB_FLAGS_SOCKET_ALLGONE; /* * Remove it from the list .. last thing we need a lock for. */ LIST_REMOVE(inp, sctp_list); SCTP_INP_WUNLOCK(inp); SCTP_ASOC_CREATE_UNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); /* * Now we release all locks. Since this INP cannot be found anymore * except possibly by the kill timer that might be running. We call * the drain function here. It should hit the case were it sees the * ACTIVE flag cleared and exit out freeing us to proceed and * destroy everything. */ if (from != SCTP_CALLED_FROM_INPKILL_TIMER) { (void)SCTP_OS_TIMER_STOP_DRAIN(&inp->sctp_ep.signature_change.timer); } else { /* Probably un-needed */ (void)SCTP_OS_TIMER_STOP(&inp->sctp_ep.signature_change.timer); } #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 5); #endif if ((inp->sctp_asocidhash) != NULL) { SCTP_HASH_FREE(inp->sctp_asocidhash, inp->hashasocidmark); inp->sctp_asocidhash = NULL; } /* sa_ignore FREED_MEMORY */ TAILQ_FOREACH_SAFE(sq, &inp->read_queue, next, nsq) { /* Its only abandoned if it had data left */ if (sq->length) SCTP_STAT_INCR(sctps_left_abandon); TAILQ_REMOVE(&inp->read_queue, sq, next); sctp_free_remote_addr(sq->whoFrom); if (so) so->so_rcv.sb_cc -= sq->length; if (sq->data) { sctp_m_freem(sq->data); sq->data = NULL; } /* * no need to free the net count, since at this point all * assoc's are gone. */ sctp_free_a_readq(NULL, sq); } /* Now the sctp_pcb things */ /* * free each asoc if it is not already closed/free. we can't use the * macro here since le_next will get freed as part of the * sctp_free_assoc() call. */ #ifdef IPSEC ipsec_delete_pcbpolicy(ip_pcb); #endif if (ip_pcb->inp_options) { (void)sctp_m_free(ip_pcb->inp_options); ip_pcb->inp_options = 0; } #ifdef INET6 if (ip_pcb->inp_vflag & INP_IPV6) { struct in6pcb *in6p; in6p = (struct in6pcb *)inp; ip6_freepcbopts(in6p->in6p_outputopts); } #endif /* INET6 */ ip_pcb->inp_vflag = 0; /* free up authentication fields */ if (inp->sctp_ep.local_auth_chunks != NULL) sctp_free_chunklist(inp->sctp_ep.local_auth_chunks); if (inp->sctp_ep.local_hmacs != NULL) sctp_free_hmaclist(inp->sctp_ep.local_hmacs); LIST_FOREACH_SAFE(shared_key, &inp->sctp_ep.shared_keys, next, nshared_key) { LIST_REMOVE(shared_key, next); sctp_free_sharedkey(shared_key); /* sa_ignore FREED_MEMORY */ } /* * if we have an address list the following will free the list of * ifaddr's that are set into this ep. Again macro limitations here, * since the LIST_FOREACH could be a bad idea. */ LIST_FOREACH_SAFE(laddr, &inp->sctp_addr_list, sctp_nxt_addr, nladdr) { sctp_remove_laddr(laddr); } #ifdef SCTP_TRACK_FREED_ASOCS /* TEMP CODE */ LIST_FOREACH_SAFE(asoc, &inp->sctp_asoc_free_list, sctp_tcblist, nasoc) { LIST_REMOVE(asoc, sctp_tcblist); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_asoc), asoc); SCTP_DECR_ASOC_COUNT(); } /* *** END TEMP CODE *** */ #endif /* Now lets see about freeing the EP hash table. */ if (inp->sctp_tcbhash != NULL) { SCTP_HASH_FREE(inp->sctp_tcbhash, inp->sctp_hashmark); inp->sctp_tcbhash = NULL; } /* Now we must put the ep memory back into the zone pool */ crfree(inp->ip_inp.inp.inp_cred); INP_LOCK_DESTROY(&inp->ip_inp.inp); SCTP_INP_LOCK_DESTROY(inp); SCTP_INP_READ_DESTROY(inp); SCTP_ASOC_CREATE_LOCK_DESTROY(inp); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_ep), inp); SCTP_DECR_EP_COUNT(); } struct sctp_nets * sctp_findnet(struct sctp_tcb *stcb, struct sockaddr *addr) { struct sctp_nets *net; /* locate the address */ TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { if (sctp_cmpaddr(addr, (struct sockaddr *)&net->ro._l_addr)) return (net); } return (NULL); } int sctp_is_address_on_local_host(struct sockaddr *addr, uint32_t vrf_id) { struct sctp_ifa *sctp_ifa; sctp_ifa = sctp_find_ifa_by_addr(addr, vrf_id, SCTP_ADDR_NOT_LOCKED); if (sctp_ifa) { return (1); } else { return (0); } } /* * add's a remote endpoint address, done with the INIT/INIT-ACK as well as * when a ASCONF arrives that adds it. It will also initialize all the cwnd * stats of stuff. */ int sctp_add_remote_addr(struct sctp_tcb *stcb, struct sockaddr *newaddr, struct sctp_nets **netp, uint16_t port, int set_scope, int from) { /* * The following is redundant to the same lines in the * sctp_aloc_assoc() but is needed since others call the add address * function */ struct sctp_nets *net, *netfirst; int addr_inscope; SCTPDBG(SCTP_DEBUG_PCB1, "Adding an address (from:%d) to the peer: ", from); SCTPDBG_ADDR(SCTP_DEBUG_PCB1, newaddr); netfirst = sctp_findnet(stcb, newaddr); if (netfirst) { /* * Lie and return ok, we don't want to make the association * go away for this behavior. It will happen in the TCP * model in a connected socket. It does not reach the hash * table until after the association is built so it can't be * found. Mark as reachable, since the initial creation will * have been cleared and the NOT_IN_ASSOC flag will have * been added... and we don't want to end up removing it * back out. */ if (netfirst->dest_state & SCTP_ADDR_UNCONFIRMED) { netfirst->dest_state = (SCTP_ADDR_REACHABLE | SCTP_ADDR_UNCONFIRMED); } else { netfirst->dest_state = SCTP_ADDR_REACHABLE; } return (0); } addr_inscope = 1; switch (newaddr->sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin; sin = (struct sockaddr_in *)newaddr; if (sin->sin_addr.s_addr == 0) { /* Invalid address */ return (-1); } /* zero out the bzero area */ memset(&sin->sin_zero, 0, sizeof(sin->sin_zero)); /* assure len is set */ sin->sin_len = sizeof(struct sockaddr_in); if (set_scope) { if (IN4_ISPRIVATE_ADDRESS(&sin->sin_addr)) { stcb->asoc.scope.ipv4_local_scope = 1; } } else { /* Validate the address is in scope */ if ((IN4_ISPRIVATE_ADDRESS(&sin->sin_addr)) && (stcb->asoc.scope.ipv4_local_scope == 0)) { addr_inscope = 0; } } break; } #endif #ifdef INET6 case AF_INET6: { struct sockaddr_in6 *sin6; sin6 = (struct sockaddr_in6 *)newaddr; if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) { /* Invalid address */ return (-1); } /* assure len is set */ sin6->sin6_len = sizeof(struct sockaddr_in6); if (set_scope) { if (sctp_is_address_on_local_host(newaddr, stcb->asoc.vrf_id)) { stcb->asoc.scope.loopback_scope = 1; stcb->asoc.scope.local_scope = 0; stcb->asoc.scope.ipv4_local_scope = 1; stcb->asoc.scope.site_scope = 1; } else if (IN6_IS_ADDR_LINKLOCAL(&sin6->sin6_addr)) { /* * If the new destination is a * LINK_LOCAL we must have common * site scope. Don't set the local * scope since we may not share all * links, only loopback can do this. * Links on the local network would * also be on our private network * for v4 too. */ stcb->asoc.scope.ipv4_local_scope = 1; stcb->asoc.scope.site_scope = 1; } else if (IN6_IS_ADDR_SITELOCAL(&sin6->sin6_addr)) { /* * If the new destination is * SITE_LOCAL then we must have site * scope in common. */ stcb->asoc.scope.site_scope = 1; } } else { /* Validate the address is in scope */ if (IN6_IS_ADDR_LOOPBACK(&sin6->sin6_addr) && (stcb->asoc.scope.loopback_scope == 0)) { addr_inscope = 0; } else if (IN6_IS_ADDR_LINKLOCAL(&sin6->sin6_addr) && (stcb->asoc.scope.local_scope == 0)) { addr_inscope = 0; } else if (IN6_IS_ADDR_SITELOCAL(&sin6->sin6_addr) && (stcb->asoc.scope.site_scope == 0)) { addr_inscope = 0; } } break; } #endif default: /* not supported family type */ return (-1); } net = SCTP_ZONE_GET(SCTP_BASE_INFO(ipi_zone_net), struct sctp_nets); if (net == NULL) { return (-1); } SCTP_INCR_RADDR_COUNT(); bzero(net, sizeof(struct sctp_nets)); (void)SCTP_GETTIME_TIMEVAL(&net->start_time); memcpy(&net->ro._l_addr, newaddr, newaddr->sa_len); switch (newaddr->sa_family) { #ifdef INET case AF_INET: ((struct sockaddr_in *)&net->ro._l_addr)->sin_port = stcb->rport; break; #endif #ifdef INET6 case AF_INET6: ((struct sockaddr_in6 *)&net->ro._l_addr)->sin6_port = stcb->rport; break; #endif default: break; } net->addr_is_local = sctp_is_address_on_local_host(newaddr, stcb->asoc.vrf_id); if (net->addr_is_local && ((set_scope || (from == SCTP_ADDR_IS_CONFIRMED)))) { stcb->asoc.scope.loopback_scope = 1; stcb->asoc.scope.ipv4_local_scope = 1; stcb->asoc.scope.local_scope = 0; stcb->asoc.scope.site_scope = 1; addr_inscope = 1; } net->failure_threshold = stcb->asoc.def_net_failure; net->pf_threshold = stcb->asoc.def_net_pf_threshold; if (addr_inscope == 0) { net->dest_state = (SCTP_ADDR_REACHABLE | SCTP_ADDR_OUT_OF_SCOPE); } else { if (from == SCTP_ADDR_IS_CONFIRMED) /* SCTP_ADDR_IS_CONFIRMED is passed by connect_x */ net->dest_state = SCTP_ADDR_REACHABLE; else net->dest_state = SCTP_ADDR_REACHABLE | SCTP_ADDR_UNCONFIRMED; } /* * We set this to 0, the timer code knows that this means its an * initial value */ net->rto_needed = 1; net->RTO = 0; net->RTO_measured = 0; stcb->asoc.numnets++; net->ref_count = 1; net->cwr_window_tsn = net->last_cwr_tsn = stcb->asoc.sending_seq - 1; net->port = port; net->dscp = stcb->asoc.default_dscp; #ifdef INET6 net->flowlabel = stcb->asoc.default_flowlabel; #endif if (sctp_stcb_is_feature_on(stcb->sctp_ep, stcb, SCTP_PCB_FLAGS_DONOT_HEARTBEAT)) { net->dest_state |= SCTP_ADDR_NOHB; } else { net->dest_state &= ~SCTP_ADDR_NOHB; } if (sctp_stcb_is_feature_on(stcb->sctp_ep, stcb, SCTP_PCB_FLAGS_DO_NOT_PMTUD)) { net->dest_state |= SCTP_ADDR_NO_PMTUD; } else { net->dest_state &= ~SCTP_ADDR_NO_PMTUD; } net->heart_beat_delay = stcb->asoc.heart_beat_delay; /* Init the timer structure */ SCTP_OS_TIMER_INIT(&net->rxt_timer.timer); SCTP_OS_TIMER_INIT(&net->pmtu_timer.timer); SCTP_OS_TIMER_INIT(&net->hb_timer.timer); /* Now generate a route for this guy */ #ifdef INET6 /* KAME hack: embed scopeid */ if (newaddr->sa_family == AF_INET6) { struct sockaddr_in6 *sin6; sin6 = (struct sockaddr_in6 *)&net->ro._l_addr; (void)sa6_embedscope(sin6, MODULE_GLOBAL(ip6_use_defzone)); sin6->sin6_scope_id = 0; } #endif SCTP_RTALLOC((sctp_route_t *) & net->ro, stcb->asoc.vrf_id, stcb->sctp_ep->fibnum); if (SCTP_ROUTE_HAS_VALID_IFN(&net->ro)) { /* Get source address */ net->ro._s_addr = sctp_source_address_selection(stcb->sctp_ep, stcb, (sctp_route_t *) & net->ro, net, 0, stcb->asoc.vrf_id); if (net->ro._s_addr != NULL) { net->src_addr_selected = 1; /* Now get the interface MTU */ if (net->ro._s_addr->ifn_p != NULL) { net->mtu = SCTP_GATHER_MTU_FROM_INTFC(net->ro._s_addr->ifn_p); } } else { net->src_addr_selected = 0; } if (net->mtu > 0) { uint32_t rmtu; rmtu = SCTP_GATHER_MTU_FROM_ROUTE(net->ro._s_addr, &net->ro._l_addr.sa, net->ro.ro_rt); if (rmtu == 0) { /* * Start things off to match mtu of * interface please. */ SCTP_SET_MTU_OF_ROUTE(&net->ro._l_addr.sa, net->ro.ro_rt, net->mtu); } else { /* * we take the route mtu over the interface, * since the route may be leading out the * loopback, or a different interface. */ net->mtu = rmtu; } } } else { net->src_addr_selected = 0; } if (net->mtu == 0) { switch (newaddr->sa_family) { #ifdef INET case AF_INET: net->mtu = SCTP_DEFAULT_MTU; break; #endif #ifdef INET6 case AF_INET6: net->mtu = 1280; break; #endif default: break; } } #if defined(INET) || defined(INET6) if (net->port) { net->mtu -= (uint32_t) sizeof(struct udphdr); } #endif if (from == SCTP_ALLOC_ASOC) { stcb->asoc.smallest_mtu = net->mtu; } if (stcb->asoc.smallest_mtu > net->mtu) { sctp_pathmtu_adjustment(stcb, net->mtu); } #ifdef INET6 if (newaddr->sa_family == AF_INET6) { struct sockaddr_in6 *sin6; sin6 = (struct sockaddr_in6 *)&net->ro._l_addr; (void)sa6_recoverscope(sin6); } #endif /* JRS - Use the congestion control given in the CC module */ if (stcb->asoc.cc_functions.sctp_set_initial_cc_param != NULL) (*stcb->asoc.cc_functions.sctp_set_initial_cc_param) (stcb, net); /* * CMT: CUC algo - set find_pseudo_cumack to TRUE (1) at beginning * of assoc (2005/06/27, iyengar@cis.udel.edu) */ net->find_pseudo_cumack = 1; net->find_rtx_pseudo_cumack = 1; /* Choose an initial flowid. */ net->flowid = stcb->asoc.my_vtag ^ ntohs(stcb->rport) ^ ntohs(stcb->sctp_ep->sctp_lport); - net->flowtype = M_HASHTYPE_OPAQUE; + net->flowtype = M_HASHTYPE_OPAQUE_HASH; if (netp) { *netp = net; } netfirst = TAILQ_FIRST(&stcb->asoc.nets); if (net->ro.ro_rt == NULL) { /* Since we have no route put it at the back */ TAILQ_INSERT_TAIL(&stcb->asoc.nets, net, sctp_next); } else if (netfirst == NULL) { /* We are the first one in the pool. */ TAILQ_INSERT_HEAD(&stcb->asoc.nets, net, sctp_next); } else if (netfirst->ro.ro_rt == NULL) { /* * First one has NO route. Place this one ahead of the first * one. */ TAILQ_INSERT_HEAD(&stcb->asoc.nets, net, sctp_next); } else if (net->ro.ro_rt->rt_ifp != netfirst->ro.ro_rt->rt_ifp) { /* * This one has a different interface than the one at the * top of the list. Place it ahead. */ TAILQ_INSERT_HEAD(&stcb->asoc.nets, net, sctp_next); } else { /* * Ok we have the same interface as the first one. Move * forward until we find either a) one with a NULL route... * insert ahead of that b) one with a different ifp.. insert * after that. c) end of the list.. insert at the tail. */ struct sctp_nets *netlook; do { netlook = TAILQ_NEXT(netfirst, sctp_next); if (netlook == NULL) { /* End of the list */ TAILQ_INSERT_TAIL(&stcb->asoc.nets, net, sctp_next); break; } else if (netlook->ro.ro_rt == NULL) { /* next one has NO route */ TAILQ_INSERT_BEFORE(netfirst, net, sctp_next); break; } else if (netlook->ro.ro_rt->rt_ifp != net->ro.ro_rt->rt_ifp) { TAILQ_INSERT_AFTER(&stcb->asoc.nets, netlook, net, sctp_next); break; } /* Shift forward */ netfirst = netlook; } while (netlook != NULL); } /* got to have a primary set */ if (stcb->asoc.primary_destination == 0) { stcb->asoc.primary_destination = net; } else if ((stcb->asoc.primary_destination->ro.ro_rt == NULL) && (net->ro.ro_rt) && ((net->dest_state & SCTP_ADDR_UNCONFIRMED) == 0)) { /* No route to current primary adopt new primary */ stcb->asoc.primary_destination = net; } /* Validate primary is first */ net = TAILQ_FIRST(&stcb->asoc.nets); if ((net != stcb->asoc.primary_destination) && (stcb->asoc.primary_destination)) { /* * first one on the list is NOT the primary sctp_cmpaddr() * is much more efficient if the primary is the first on the * list, make it so. */ TAILQ_REMOVE(&stcb->asoc.nets, stcb->asoc.primary_destination, sctp_next); TAILQ_INSERT_HEAD(&stcb->asoc.nets, stcb->asoc.primary_destination, sctp_next); } return (0); } static uint32_t sctp_aloc_a_assoc_id(struct sctp_inpcb *inp, struct sctp_tcb *stcb) { uint32_t id; struct sctpasochead *head; struct sctp_tcb *lstcb; SCTP_INP_WLOCK(inp); try_again: if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { /* TSNH */ SCTP_INP_WUNLOCK(inp); return (0); } /* * We don't allow assoc id to be one of SCTP_FUTURE_ASSOC, * SCTP_CURRENT_ASSOC and SCTP_ALL_ASSOC. */ if (inp->sctp_associd_counter <= SCTP_ALL_ASSOC) { inp->sctp_associd_counter = SCTP_ALL_ASSOC + 1; } id = inp->sctp_associd_counter; inp->sctp_associd_counter++; lstcb = sctp_findasoc_ep_asocid_locked(inp, (sctp_assoc_t) id, 0); if (lstcb) { goto try_again; } head = &inp->sctp_asocidhash[SCTP_PCBHASH_ASOC(id, inp->hashasocidmark)]; LIST_INSERT_HEAD(head, stcb, sctp_tcbasocidhash); stcb->asoc.in_asocid_hash = 1; SCTP_INP_WUNLOCK(inp); return id; } /* * allocate an association and add it to the endpoint. The caller must be * careful to add all additional addresses once they are know right away or * else the assoc will be may experience a blackout scenario. */ struct sctp_tcb * sctp_aloc_assoc(struct sctp_inpcb *inp, struct sockaddr *firstaddr, int *error, uint32_t override_tag, uint32_t vrf_id, uint16_t o_streams, uint16_t port, struct thread *p ) { /* note the p argument is only valid in unbound sockets */ struct sctp_tcb *stcb; struct sctp_association *asoc; struct sctpasochead *head; uint16_t rport; int err; /* * Assumption made here: Caller has done a * sctp_findassociation_ep_addr(ep, addr's); to make sure the * address does not exist already. */ if (SCTP_BASE_INFO(ipi_count_asoc) >= SCTP_MAX_NUM_OF_ASOC) { /* Hit max assoc, sorry no more */ SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, ENOBUFS); *error = ENOBUFS; return (NULL); } if (firstaddr == NULL) { SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); *error = EINVAL; return (NULL); } SCTP_INP_RLOCK(inp); if ((inp->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL) && ((sctp_is_feature_off(inp, SCTP_PCB_FLAGS_PORTREUSE)) || (inp->sctp_flags & SCTP_PCB_FLAGS_CONNECTED))) { /* * If its in the TCP pool, its NOT allowed to create an * association. The parent listener needs to call * sctp_aloc_assoc.. or the one-2-many socket. If a peeled * off, or connected one does this.. its an error. */ SCTP_INP_RUNLOCK(inp); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); *error = EINVAL; return (NULL); } if ((inp->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL) || (inp->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE)) { if ((inp->sctp_flags & SCTP_PCB_FLAGS_WAS_CONNECTED) || (inp->sctp_flags & SCTP_PCB_FLAGS_WAS_ABORTED)) { SCTP_INP_RUNLOCK(inp); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); *error = EINVAL; return (NULL); } } SCTPDBG(SCTP_DEBUG_PCB3, "Allocate an association for peer:"); #ifdef SCTP_DEBUG if (firstaddr) { SCTPDBG_ADDR(SCTP_DEBUG_PCB3, firstaddr); switch (firstaddr->sa_family) { #ifdef INET case AF_INET: SCTPDBG(SCTP_DEBUG_PCB3, "Port:%d\n", ntohs(((struct sockaddr_in *)firstaddr)->sin_port)); break; #endif #ifdef INET6 case AF_INET6: SCTPDBG(SCTP_DEBUG_PCB3, "Port:%d\n", ntohs(((struct sockaddr_in6 *)firstaddr)->sin6_port)); break; #endif default: break; } } else { SCTPDBG(SCTP_DEBUG_PCB3, "None\n"); } #endif /* SCTP_DEBUG */ switch (firstaddr->sa_family) { #ifdef INET case AF_INET: { struct sockaddr_in *sin; sin = (struct sockaddr_in *)firstaddr; if ((ntohs(sin->sin_port) == 0) || (sin->sin_addr.s_addr == INADDR_ANY) || (sin->sin_addr.s_addr == INADDR_BROADCAST) || IN_MULTICAST(ntohl(sin->sin_addr.s_addr))) { /* Invalid address */ SCTP_INP_RUNLOCK(inp); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); *error = EINVAL; return (NULL); } rport = sin->sin_port; break; } #endif #ifdef INET6 case AF_INET6: { struct sockaddr_in6 *sin6; sin6 = (struct sockaddr_in6 *)firstaddr; if ((ntohs(sin6->sin6_port) == 0) || IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr) || IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr)) { /* Invalid address */ SCTP_INP_RUNLOCK(inp); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); *error = EINVAL; return (NULL); } rport = sin6->sin6_port; break; } #endif default: /* not supported family type */ SCTP_INP_RUNLOCK(inp); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); *error = EINVAL; return (NULL); } SCTP_INP_RUNLOCK(inp); if (inp->sctp_flags & SCTP_PCB_FLAGS_UNBOUND) { /* * If you have not performed a bind, then we need to do the * ephemeral bind for you. */ if ((err = sctp_inpcb_bind(inp->sctp_socket, (struct sockaddr *)NULL, (struct sctp_ifa *)NULL, p ))) { /* bind error, probably perm */ *error = err; return (NULL); } } stcb = SCTP_ZONE_GET(SCTP_BASE_INFO(ipi_zone_asoc), struct sctp_tcb); if (stcb == NULL) { /* out of memory? */ SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, ENOMEM); *error = ENOMEM; return (NULL); } SCTP_INCR_ASOC_COUNT(); bzero(stcb, sizeof(*stcb)); asoc = &stcb->asoc; asoc->assoc_id = sctp_aloc_a_assoc_id(inp, stcb); SCTP_TCB_LOCK_INIT(stcb); SCTP_TCB_SEND_LOCK_INIT(stcb); stcb->rport = rport; /* setup back pointer's */ stcb->sctp_ep = inp; stcb->sctp_socket = inp->sctp_socket; if ((err = sctp_init_asoc(inp, stcb, override_tag, vrf_id, o_streams))) { /* failed */ SCTP_TCB_LOCK_DESTROY(stcb); SCTP_TCB_SEND_LOCK_DESTROY(stcb); LIST_REMOVE(stcb, sctp_tcbasocidhash); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_asoc), stcb); SCTP_DECR_ASOC_COUNT(); *error = err; return (NULL); } /* and the port */ SCTP_INP_INFO_WLOCK(); SCTP_INP_WLOCK(inp); if (inp->sctp_flags & (SCTP_PCB_FLAGS_SOCKET_GONE | SCTP_PCB_FLAGS_SOCKET_ALLGONE)) { /* inpcb freed while alloc going on */ SCTP_TCB_LOCK_DESTROY(stcb); SCTP_TCB_SEND_LOCK_DESTROY(stcb); LIST_REMOVE(stcb, sctp_tcbasocidhash); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_asoc), stcb); SCTP_INP_WUNLOCK(inp); SCTP_INP_INFO_WUNLOCK(); SCTP_DECR_ASOC_COUNT(); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, EINVAL); *error = EINVAL; return (NULL); } SCTP_TCB_LOCK(stcb); /* now that my_vtag is set, add it to the hash */ head = &SCTP_BASE_INFO(sctp_asochash)[SCTP_PCBHASH_ASOC(stcb->asoc.my_vtag, SCTP_BASE_INFO(hashasocmark))]; /* put it in the bucket in the vtag hash of assoc's for the system */ LIST_INSERT_HEAD(head, stcb, sctp_asocs); SCTP_INP_INFO_WUNLOCK(); if ((err = sctp_add_remote_addr(stcb, firstaddr, NULL, port, SCTP_DO_SETSCOPE, SCTP_ALLOC_ASOC))) { /* failure.. memory error? */ if (asoc->strmout) { SCTP_FREE(asoc->strmout, SCTP_M_STRMO); asoc->strmout = NULL; } if (asoc->mapping_array) { SCTP_FREE(asoc->mapping_array, SCTP_M_MAP); asoc->mapping_array = NULL; } if (asoc->nr_mapping_array) { SCTP_FREE(asoc->nr_mapping_array, SCTP_M_MAP); asoc->nr_mapping_array = NULL; } SCTP_DECR_ASOC_COUNT(); SCTP_TCB_UNLOCK(stcb); SCTP_TCB_LOCK_DESTROY(stcb); SCTP_TCB_SEND_LOCK_DESTROY(stcb); LIST_REMOVE(stcb, sctp_tcbasocidhash); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_asoc), stcb); SCTP_INP_WUNLOCK(inp); SCTP_LTRACE_ERR_RET(inp, NULL, NULL, SCTP_FROM_SCTP_PCB, ENOBUFS); *error = ENOBUFS; return (NULL); } /* Init all the timers */ SCTP_OS_TIMER_INIT(&asoc->dack_timer.timer); SCTP_OS_TIMER_INIT(&asoc->strreset_timer.timer); SCTP_OS_TIMER_INIT(&asoc->asconf_timer.timer); SCTP_OS_TIMER_INIT(&asoc->shut_guard_timer.timer); SCTP_OS_TIMER_INIT(&asoc->autoclose_timer.timer); SCTP_OS_TIMER_INIT(&asoc->delayed_event_timer.timer); SCTP_OS_TIMER_INIT(&asoc->delete_prim_timer.timer); LIST_INSERT_HEAD(&inp->sctp_asoc_list, stcb, sctp_tcblist); /* now file the port under the hash as well */ if (inp->sctp_tcbhash != NULL) { head = &inp->sctp_tcbhash[SCTP_PCBHASH_ALLADDR(stcb->rport, inp->sctp_hashmark)]; LIST_INSERT_HEAD(head, stcb, sctp_tcbhash); } SCTP_INP_WUNLOCK(inp); SCTPDBG(SCTP_DEBUG_PCB1, "Association %p now allocated\n", (void *)stcb); return (stcb); } void sctp_remove_net(struct sctp_tcb *stcb, struct sctp_nets *net) { struct sctp_association *asoc; asoc = &stcb->asoc; asoc->numnets--; TAILQ_REMOVE(&asoc->nets, net, sctp_next); if (net == asoc->primary_destination) { /* Reset primary */ struct sctp_nets *lnet; lnet = TAILQ_FIRST(&asoc->nets); /* * Mobility adaptation Ideally, if deleted destination is * the primary, it becomes a fast retransmission trigger by * the subsequent SET PRIMARY. (by micchie) */ if (sctp_is_mobility_feature_on(stcb->sctp_ep, SCTP_MOBILITY_BASE) || sctp_is_mobility_feature_on(stcb->sctp_ep, SCTP_MOBILITY_FASTHANDOFF)) { SCTPDBG(SCTP_DEBUG_ASCONF1, "remove_net: primary dst is deleting\n"); if (asoc->deleted_primary != NULL) { SCTPDBG(SCTP_DEBUG_ASCONF1, "remove_net: deleted primary may be already stored\n"); goto out; } asoc->deleted_primary = net; atomic_add_int(&net->ref_count, 1); memset(&net->lastsa, 0, sizeof(net->lastsa)); memset(&net->lastsv, 0, sizeof(net->lastsv)); sctp_mobility_feature_on(stcb->sctp_ep, SCTP_MOBILITY_PRIM_DELETED); sctp_timer_start(SCTP_TIMER_TYPE_PRIM_DELETED, stcb->sctp_ep, stcb, NULL); } out: /* Try to find a confirmed primary */ asoc->primary_destination = sctp_find_alternate_net(stcb, lnet, 0); } if (net == asoc->last_data_chunk_from) { /* Reset primary */ asoc->last_data_chunk_from = TAILQ_FIRST(&asoc->nets); } if (net == asoc->last_control_chunk_from) { /* Clear net */ asoc->last_control_chunk_from = NULL; } if (net == stcb->asoc.alternate) { sctp_free_remote_addr(stcb->asoc.alternate); stcb->asoc.alternate = NULL; } sctp_free_remote_addr(net); } /* * remove a remote endpoint address from an association, it will fail if the * address does not exist. */ int sctp_del_remote_addr(struct sctp_tcb *stcb, struct sockaddr *remaddr) { /* * Here we need to remove a remote address. This is quite simple, we * first find it in the list of address for the association * (tasoc->asoc.nets) and then if it is there, we do a LIST_REMOVE * on that item. Note we do not allow it to be removed if there are * no other addresses. */ struct sctp_association *asoc; struct sctp_nets *net, *nnet; asoc = &stcb->asoc; /* locate the address */ TAILQ_FOREACH_SAFE(net, &asoc->nets, sctp_next, nnet) { if (net->ro._l_addr.sa.sa_family != remaddr->sa_family) { continue; } if (sctp_cmpaddr((struct sockaddr *)&net->ro._l_addr, remaddr)) { /* we found the guy */ if (asoc->numnets < 2) { /* Must have at LEAST two remote addresses */ return (-1); } else { sctp_remove_net(stcb, net); return (0); } } } /* not found. */ return (-2); } void sctp_delete_from_timewait(uint32_t tag, uint16_t lport, uint16_t rport) { struct sctpvtaghead *chain; struct sctp_tagblock *twait_block; int found = 0; int i; chain = &SCTP_BASE_INFO(vtag_timewait)[(tag % SCTP_STACK_VTAG_HASH_SIZE)]; LIST_FOREACH(twait_block, chain, sctp_nxt_tagblock) { for (i = 0; i < SCTP_NUMBER_IN_VTAG_BLOCK; i++) { if ((twait_block->vtag_block[i].v_tag == tag) && (twait_block->vtag_block[i].lport == lport) && (twait_block->vtag_block[i].rport == rport)) { twait_block->vtag_block[i].tv_sec_at_expire = 0; twait_block->vtag_block[i].v_tag = 0; twait_block->vtag_block[i].lport = 0; twait_block->vtag_block[i].rport = 0; found = 1; break; } } if (found) break; } } int sctp_is_in_timewait(uint32_t tag, uint16_t lport, uint16_t rport) { struct sctpvtaghead *chain; struct sctp_tagblock *twait_block; int found = 0; int i; SCTP_INP_INFO_WLOCK(); chain = &SCTP_BASE_INFO(vtag_timewait)[(tag % SCTP_STACK_VTAG_HASH_SIZE)]; LIST_FOREACH(twait_block, chain, sctp_nxt_tagblock) { for (i = 0; i < SCTP_NUMBER_IN_VTAG_BLOCK; i++) { if ((twait_block->vtag_block[i].v_tag == tag) && (twait_block->vtag_block[i].lport == lport) && (twait_block->vtag_block[i].rport == rport)) { found = 1; break; } } if (found) break; } SCTP_INP_INFO_WUNLOCK(); return (found); } void sctp_add_vtag_to_timewait(uint32_t tag, uint32_t time, uint16_t lport, uint16_t rport) { struct sctpvtaghead *chain; struct sctp_tagblock *twait_block; struct timeval now; int set, i; if (time == 0) { /* Its disabled */ return; } (void)SCTP_GETTIME_TIMEVAL(&now); chain = &SCTP_BASE_INFO(vtag_timewait)[(tag % SCTP_STACK_VTAG_HASH_SIZE)]; set = 0; LIST_FOREACH(twait_block, chain, sctp_nxt_tagblock) { /* Block(s) present, lets find space, and expire on the fly */ for (i = 0; i < SCTP_NUMBER_IN_VTAG_BLOCK; i++) { if ((twait_block->vtag_block[i].v_tag == 0) && !set) { twait_block->vtag_block[i].tv_sec_at_expire = now.tv_sec + time; twait_block->vtag_block[i].v_tag = tag; twait_block->vtag_block[i].lport = lport; twait_block->vtag_block[i].rport = rport; set = 1; } else if ((twait_block->vtag_block[i].v_tag) && ((long)twait_block->vtag_block[i].tv_sec_at_expire < now.tv_sec)) { /* Audit expires this guy */ twait_block->vtag_block[i].tv_sec_at_expire = 0; twait_block->vtag_block[i].v_tag = 0; twait_block->vtag_block[i].lport = 0; twait_block->vtag_block[i].rport = 0; if (set == 0) { /* Reuse it for my new tag */ twait_block->vtag_block[i].tv_sec_at_expire = now.tv_sec + time; twait_block->vtag_block[i].v_tag = tag; twait_block->vtag_block[i].lport = lport; twait_block->vtag_block[i].rport = rport; set = 1; } } } if (set) { /* * We only do up to the block where we can place our * tag for audits */ break; } } /* Need to add a new block to chain */ if (!set) { SCTP_MALLOC(twait_block, struct sctp_tagblock *, sizeof(struct sctp_tagblock), SCTP_M_TIMW); if (twait_block == NULL) { #ifdef INVARIANTS panic("Can not alloc tagblock"); #endif return; } memset(twait_block, 0, sizeof(struct sctp_tagblock)); LIST_INSERT_HEAD(chain, twait_block, sctp_nxt_tagblock); twait_block->vtag_block[0].tv_sec_at_expire = now.tv_sec + time; twait_block->vtag_block[0].v_tag = tag; twait_block->vtag_block[0].lport = lport; twait_block->vtag_block[0].rport = rport; } } void sctp_clean_up_stream(struct sctp_tcb *stcb, struct sctp_readhead *rh) { struct sctp_tmit_chunk *chk, *nchk; struct sctp_queued_to_read *ctl, *nctl; TAILQ_FOREACH_SAFE(ctl, rh, next_instrm, nctl) { TAILQ_REMOVE(rh, ctl, next_instrm); ctl->on_strm_q = 0; if (ctl->on_read_q == 0) { sctp_free_remote_addr(ctl->whoFrom); if (ctl->data) { sctp_m_freem(ctl->data); ctl->data = NULL; } } /* Reassembly free? */ TAILQ_FOREACH_SAFE(chk, &ctl->reasm, sctp_next, nchk) { TAILQ_REMOVE(&ctl->reasm, chk, sctp_next); if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } if (chk->holds_key_ref) sctp_auth_key_release(stcb, chk->auth_keyid, SCTP_SO_LOCKED); sctp_free_remote_addr(chk->whoTo); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_chunk), chk); SCTP_DECR_CHK_COUNT(); /* sa_ignore FREED_MEMORY */ } /* * We don't free the address here since all the net's were * freed above. */ if (ctl->on_read_q == 0) { sctp_free_a_readq(stcb, ctl); } } } /*- * Free the association after un-hashing the remote port. This * function ALWAYS returns holding NO LOCK on the stcb. It DOES * expect that the input to this function IS a locked TCB. * It will return 0, if it did NOT destroy the association (instead * it unlocks it. It will return NON-zero if it either destroyed the * association OR the association is already destroyed. */ int sctp_free_assoc(struct sctp_inpcb *inp, struct sctp_tcb *stcb, int from_inpcbfree, int from_location) { int i; struct sctp_association *asoc; struct sctp_nets *net, *nnet; struct sctp_laddr *laddr, *naddr; struct sctp_tmit_chunk *chk, *nchk; struct sctp_asconf_addr *aparam, *naparam; struct sctp_asconf_ack *aack, *naack; struct sctp_stream_reset_list *strrst, *nstrrst; struct sctp_queued_to_read *sq, *nsq; struct sctp_stream_queue_pending *sp, *nsp; sctp_sharedkey_t *shared_key, *nshared_key; struct socket *so; /* first, lets purge the entry from the hash table. */ #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, stcb, 6); #endif if (stcb->asoc.state == 0) { #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 7); #endif /* there is no asoc, really TSNH :-0 */ return (1); } if (stcb->asoc.alternate) { sctp_free_remote_addr(stcb->asoc.alternate); stcb->asoc.alternate = NULL; } /* TEMP CODE */ if (stcb->freed_from_where == 0) { /* Only record the first place free happened from */ stcb->freed_from_where = from_location; } /* TEMP CODE */ asoc = &stcb->asoc; if ((inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) || (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE)) /* nothing around */ so = NULL; else so = inp->sctp_socket; /* * We used timer based freeing if a reader or writer is in the way. * So we first check if we are actually being called from a timer, * if so we abort early if a reader or writer is still in the way. */ if ((stcb->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) && (from_inpcbfree == SCTP_NORMAL_PROC)) { /* * is it the timer driving us? if so are the reader/writers * gone? */ if (stcb->asoc.refcnt) { /* nope, reader or writer in the way */ sctp_timer_start(SCTP_TIMER_TYPE_ASOCKILL, inp, stcb, NULL); /* no asoc destroyed */ SCTP_TCB_UNLOCK(stcb); #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, stcb, 8); #endif return (0); } } /* now clean up any other timers */ (void)SCTP_OS_TIMER_STOP(&asoc->dack_timer.timer); asoc->dack_timer.self = NULL; (void)SCTP_OS_TIMER_STOP(&asoc->strreset_timer.timer); /*- * For stream reset we don't blast this unless * it is a str-reset timer, it might be the * free-asoc timer which we DON'T want to * disturb. */ if (asoc->strreset_timer.type == SCTP_TIMER_TYPE_STRRESET) asoc->strreset_timer.self = NULL; (void)SCTP_OS_TIMER_STOP(&asoc->asconf_timer.timer); asoc->asconf_timer.self = NULL; (void)SCTP_OS_TIMER_STOP(&asoc->autoclose_timer.timer); asoc->autoclose_timer.self = NULL; (void)SCTP_OS_TIMER_STOP(&asoc->shut_guard_timer.timer); asoc->shut_guard_timer.self = NULL; (void)SCTP_OS_TIMER_STOP(&asoc->delayed_event_timer.timer); asoc->delayed_event_timer.self = NULL; /* Mobility adaptation */ (void)SCTP_OS_TIMER_STOP(&asoc->delete_prim_timer.timer); asoc->delete_prim_timer.self = NULL; TAILQ_FOREACH(net, &asoc->nets, sctp_next) { (void)SCTP_OS_TIMER_STOP(&net->rxt_timer.timer); net->rxt_timer.self = NULL; (void)SCTP_OS_TIMER_STOP(&net->pmtu_timer.timer); net->pmtu_timer.self = NULL; (void)SCTP_OS_TIMER_STOP(&net->hb_timer.timer); net->hb_timer.self = NULL; } /* Now the read queue needs to be cleaned up (only once) */ if ((stcb->asoc.state & SCTP_STATE_ABOUT_TO_BE_FREED) == 0) { stcb->asoc.state |= SCTP_STATE_ABOUT_TO_BE_FREED; SCTP_INP_READ_LOCK(inp); TAILQ_FOREACH(sq, &inp->read_queue, next) { if (sq->stcb == stcb) { sq->do_not_ref_stcb = 1; sq->sinfo_cumtsn = stcb->asoc.cumulative_tsn; /* * If there is no end, there never will be * now. */ if (sq->end_added == 0) { /* Held for PD-API clear that. */ sq->pdapi_aborted = 1; sq->held_length = 0; if (sctp_stcb_is_feature_on(inp, stcb, SCTP_PCB_FLAGS_PDAPIEVNT) && (so != NULL)) { /* * Need to add a PD-API * aborted indication. * Setting the control_pdapi * assures that it will be * added right after this * msg. */ uint32_t strseq; stcb->asoc.control_pdapi = sq; strseq = (sq->sinfo_stream << 16) | sq->sinfo_ssn; sctp_ulp_notify(SCTP_NOTIFY_PARTIAL_DELVIERY_INDICATION, stcb, SCTP_PARTIAL_DELIVERY_ABORTED, (void *)&strseq, SCTP_SO_LOCKED); stcb->asoc.control_pdapi = NULL; } } /* Add an end to wake them */ sq->end_added = 1; } } SCTP_INP_READ_UNLOCK(inp); if (stcb->block_entry) { SCTP_LTRACE_ERR_RET(inp, stcb, NULL, SCTP_FROM_SCTP_PCB, ECONNRESET); stcb->block_entry->error = ECONNRESET; stcb->block_entry = NULL; } } if ((stcb->asoc.refcnt) || (stcb->asoc.state & SCTP_STATE_IN_ACCEPT_QUEUE)) { /* * Someone holds a reference OR the socket is unaccepted * yet. */ if ((stcb->asoc.refcnt) || (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) || (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE)) { stcb->asoc.state &= ~SCTP_STATE_IN_ACCEPT_QUEUE; sctp_timer_start(SCTP_TIMER_TYPE_ASOCKILL, inp, stcb, NULL); } SCTP_TCB_UNLOCK(stcb); if ((inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) || (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE)) /* nothing around */ so = NULL; if (so) { /* Wake any reader/writers */ sctp_sorwakeup(inp, so); sctp_sowwakeup(inp, so); } #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, stcb, 9); #endif /* no asoc destroyed */ return (0); } #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, stcb, 10); #endif /* * When I reach here, no others want to kill the assoc yet.. and I * own the lock. Now its possible an abort comes in when I do the * lock exchange below to grab all the locks to do the final take * out. to prevent this we increment the count, which will start a * timer and blow out above thus assuring us that we hold exclusive * killing of the asoc. Note that after getting back the TCB lock we * will go ahead and increment the counter back up and stop any * timer a passing stranger may have started :-S */ if (from_inpcbfree == SCTP_NORMAL_PROC) { atomic_add_int(&stcb->asoc.refcnt, 1); SCTP_TCB_UNLOCK(stcb); SCTP_INP_INFO_WLOCK(); SCTP_INP_WLOCK(inp); SCTP_TCB_LOCK(stcb); } /* Double check the GONE flag */ if ((inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) || (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE)) /* nothing around */ so = NULL; if ((inp->sctp_flags & SCTP_PCB_FLAGS_TCPTYPE) || (inp->sctp_flags & SCTP_PCB_FLAGS_IN_TCPPOOL)) { /* * For TCP type we need special handling when we are * connected. We also include the peel'ed off ones to. */ if (inp->sctp_flags & SCTP_PCB_FLAGS_CONNECTED) { inp->sctp_flags &= ~SCTP_PCB_FLAGS_CONNECTED; inp->sctp_flags |= SCTP_PCB_FLAGS_WAS_CONNECTED; if (so) { SOCK_LOCK(so); if (so->so_rcv.sb_cc == 0) { so->so_state &= ~(SS_ISCONNECTING | SS_ISDISCONNECTING | SS_ISCONFIRMING | SS_ISCONNECTED); } socantrcvmore_locked(so); sctp_sowwakeup(inp, so); sctp_sorwakeup(inp, so); SCTP_SOWAKEUP(so); } } } /* * Make it invalid too, that way if its about to run it will abort * and return. */ /* re-increment the lock */ if (from_inpcbfree == SCTP_NORMAL_PROC) { atomic_add_int(&stcb->asoc.refcnt, -1); } if (stcb->asoc.refcnt) { stcb->asoc.state &= ~SCTP_STATE_IN_ACCEPT_QUEUE; sctp_timer_start(SCTP_TIMER_TYPE_ASOCKILL, inp, stcb, NULL); if (from_inpcbfree == SCTP_NORMAL_PROC) { SCTP_INP_INFO_WUNLOCK(); SCTP_INP_WUNLOCK(inp); } SCTP_TCB_UNLOCK(stcb); return (0); } asoc->state = 0; if (inp->sctp_tcbhash) { LIST_REMOVE(stcb, sctp_tcbhash); } if (stcb->asoc.in_asocid_hash) { LIST_REMOVE(stcb, sctp_tcbasocidhash); } /* Now lets remove it from the list of ALL associations in the EP */ LIST_REMOVE(stcb, sctp_tcblist); if (from_inpcbfree == SCTP_NORMAL_PROC) { SCTP_INP_INCR_REF(inp); SCTP_INP_WUNLOCK(inp); } /* pull from vtag hash */ LIST_REMOVE(stcb, sctp_asocs); sctp_add_vtag_to_timewait(asoc->my_vtag, SCTP_BASE_SYSCTL(sctp_vtag_time_wait), inp->sctp_lport, stcb->rport); /* * Now restop the timers to be sure this is paranoia at is finest! */ (void)SCTP_OS_TIMER_STOP(&asoc->strreset_timer.timer); (void)SCTP_OS_TIMER_STOP(&asoc->dack_timer.timer); (void)SCTP_OS_TIMER_STOP(&asoc->strreset_timer.timer); (void)SCTP_OS_TIMER_STOP(&asoc->asconf_timer.timer); (void)SCTP_OS_TIMER_STOP(&asoc->shut_guard_timer.timer); (void)SCTP_OS_TIMER_STOP(&asoc->autoclose_timer.timer); (void)SCTP_OS_TIMER_STOP(&asoc->delayed_event_timer.timer); TAILQ_FOREACH(net, &asoc->nets, sctp_next) { (void)SCTP_OS_TIMER_STOP(&net->rxt_timer.timer); (void)SCTP_OS_TIMER_STOP(&net->pmtu_timer.timer); (void)SCTP_OS_TIMER_STOP(&net->hb_timer.timer); } asoc->strreset_timer.type = SCTP_TIMER_TYPE_NONE; /* * The chunk lists and such SHOULD be empty but we check them just * in case. */ /* anything on the wheel needs to be removed */ for (i = 0; i < asoc->streamoutcnt; i++) { struct sctp_stream_out *outs; outs = &asoc->strmout[i]; /* now clean up any chunks here */ TAILQ_FOREACH_SAFE(sp, &outs->outqueue, next, nsp) { TAILQ_REMOVE(&outs->outqueue, sp, next); sctp_free_spbufspace(stcb, asoc, sp); if (sp->data) { if (so) { /* Still an open socket - report */ sctp_ulp_notify(SCTP_NOTIFY_SPECIAL_SP_FAIL, stcb, 0, (void *)sp, SCTP_SO_LOCKED); } if (sp->data) { sctp_m_freem(sp->data); sp->data = NULL; sp->tail_mbuf = NULL; sp->length = 0; } } if (sp->net) { sctp_free_remote_addr(sp->net); sp->net = NULL; } sctp_free_a_strmoq(stcb, sp, SCTP_SO_LOCKED); } } /* sa_ignore FREED_MEMORY */ TAILQ_FOREACH_SAFE(strrst, &asoc->resetHead, next_resp, nstrrst) { TAILQ_REMOVE(&asoc->resetHead, strrst, next_resp); SCTP_FREE(strrst, SCTP_M_STRESET); } TAILQ_FOREACH_SAFE(sq, &asoc->pending_reply_queue, next, nsq) { TAILQ_REMOVE(&asoc->pending_reply_queue, sq, next); if (sq->data) { sctp_m_freem(sq->data); sq->data = NULL; } sctp_free_remote_addr(sq->whoFrom); sq->whoFrom = NULL; sq->stcb = NULL; /* Free the ctl entry */ sctp_free_a_readq(stcb, sq); /* sa_ignore FREED_MEMORY */ } TAILQ_FOREACH_SAFE(chk, &asoc->free_chunks, sctp_next, nchk) { TAILQ_REMOVE(&asoc->free_chunks, chk, sctp_next); if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } if (chk->holds_key_ref) sctp_auth_key_release(stcb, chk->auth_keyid, SCTP_SO_LOCKED); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_chunk), chk); SCTP_DECR_CHK_COUNT(); atomic_subtract_int(&SCTP_BASE_INFO(ipi_free_chunks), 1); asoc->free_chunk_cnt--; /* sa_ignore FREED_MEMORY */ } /* pending send queue SHOULD be empty */ TAILQ_FOREACH_SAFE(chk, &asoc->send_queue, sctp_next, nchk) { if (asoc->strmout[chk->rec.data.stream_number].chunks_on_queues > 0) { asoc->strmout[chk->rec.data.stream_number].chunks_on_queues--; #ifdef INVARIANTS } else { panic("No chunks on the queues for sid %u.", chk->rec.data.stream_number); #endif } TAILQ_REMOVE(&asoc->send_queue, chk, sctp_next); if (chk->data) { if (so) { /* Still a socket? */ sctp_ulp_notify(SCTP_NOTIFY_UNSENT_DG_FAIL, stcb, 0, chk, SCTP_SO_LOCKED); } if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } } if (chk->holds_key_ref) sctp_auth_key_release(stcb, chk->auth_keyid, SCTP_SO_LOCKED); if (chk->whoTo) { sctp_free_remote_addr(chk->whoTo); chk->whoTo = NULL; } SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_chunk), chk); SCTP_DECR_CHK_COUNT(); /* sa_ignore FREED_MEMORY */ } /* sent queue SHOULD be empty */ TAILQ_FOREACH_SAFE(chk, &asoc->sent_queue, sctp_next, nchk) { if (chk->sent != SCTP_DATAGRAM_NR_ACKED) { if (asoc->strmout[chk->rec.data.stream_number].chunks_on_queues > 0) { asoc->strmout[chk->rec.data.stream_number].chunks_on_queues--; #ifdef INVARIANTS } else { panic("No chunks on the queues for sid %u.", chk->rec.data.stream_number); #endif } } TAILQ_REMOVE(&asoc->sent_queue, chk, sctp_next); if (chk->data) { if (so) { /* Still a socket? */ sctp_ulp_notify(SCTP_NOTIFY_SENT_DG_FAIL, stcb, 0, chk, SCTP_SO_LOCKED); } if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } } if (chk->holds_key_ref) sctp_auth_key_release(stcb, chk->auth_keyid, SCTP_SO_LOCKED); sctp_free_remote_addr(chk->whoTo); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_chunk), chk); SCTP_DECR_CHK_COUNT(); /* sa_ignore FREED_MEMORY */ } #ifdef INVARIANTS for (i = 0; i < stcb->asoc.streamoutcnt; i++) { if (stcb->asoc.strmout[i].chunks_on_queues > 0) { panic("%u chunks left for stream %u.", stcb->asoc.strmout[i].chunks_on_queues, i); } } #endif /* control queue MAY not be empty */ TAILQ_FOREACH_SAFE(chk, &asoc->control_send_queue, sctp_next, nchk) { TAILQ_REMOVE(&asoc->control_send_queue, chk, sctp_next); if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } if (chk->holds_key_ref) sctp_auth_key_release(stcb, chk->auth_keyid, SCTP_SO_LOCKED); sctp_free_remote_addr(chk->whoTo); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_chunk), chk); SCTP_DECR_CHK_COUNT(); /* sa_ignore FREED_MEMORY */ } /* ASCONF queue MAY not be empty */ TAILQ_FOREACH_SAFE(chk, &asoc->asconf_send_queue, sctp_next, nchk) { TAILQ_REMOVE(&asoc->asconf_send_queue, chk, sctp_next); if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } if (chk->holds_key_ref) sctp_auth_key_release(stcb, chk->auth_keyid, SCTP_SO_LOCKED); sctp_free_remote_addr(chk->whoTo); SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_chunk), chk); SCTP_DECR_CHK_COUNT(); /* sa_ignore FREED_MEMORY */ } if (asoc->mapping_array) { SCTP_FREE(asoc->mapping_array, SCTP_M_MAP); asoc->mapping_array = NULL; } if (asoc->nr_mapping_array) { SCTP_FREE(asoc->nr_mapping_array, SCTP_M_MAP); asoc->nr_mapping_array = NULL; } /* the stream outs */ if (asoc->strmout) { SCTP_FREE(asoc->strmout, SCTP_M_STRMO); asoc->strmout = NULL; } asoc->strm_realoutsize = asoc->streamoutcnt = 0; if (asoc->strmin) { for (i = 0; i < asoc->streamincnt; i++) { sctp_clean_up_stream(stcb, &asoc->strmin[i].inqueue); sctp_clean_up_stream(stcb, &asoc->strmin[i].uno_inqueue); } SCTP_FREE(asoc->strmin, SCTP_M_STRMI); asoc->strmin = NULL; } asoc->streamincnt = 0; TAILQ_FOREACH_SAFE(net, &asoc->nets, sctp_next, nnet) { #ifdef INVARIANTS if (SCTP_BASE_INFO(ipi_count_raddr) == 0) { panic("no net's left alloc'ed, or list points to itself"); } #endif TAILQ_REMOVE(&asoc->nets, net, sctp_next); sctp_free_remote_addr(net); } LIST_FOREACH_SAFE(laddr, &asoc->sctp_restricted_addrs, sctp_nxt_addr, naddr) { /* sa_ignore FREED_MEMORY */ sctp_remove_laddr(laddr); } /* pending asconf (address) parameters */ TAILQ_FOREACH_SAFE(aparam, &asoc->asconf_queue, next, naparam) { /* sa_ignore FREED_MEMORY */ TAILQ_REMOVE(&asoc->asconf_queue, aparam, next); SCTP_FREE(aparam, SCTP_M_ASC_ADDR); } TAILQ_FOREACH_SAFE(aack, &asoc->asconf_ack_sent, next, naack) { /* sa_ignore FREED_MEMORY */ TAILQ_REMOVE(&asoc->asconf_ack_sent, aack, next); if (aack->data != NULL) { sctp_m_freem(aack->data); } SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_asconf_ack), aack); } /* clean up auth stuff */ if (asoc->local_hmacs) sctp_free_hmaclist(asoc->local_hmacs); if (asoc->peer_hmacs) sctp_free_hmaclist(asoc->peer_hmacs); if (asoc->local_auth_chunks) sctp_free_chunklist(asoc->local_auth_chunks); if (asoc->peer_auth_chunks) sctp_free_chunklist(asoc->peer_auth_chunks); sctp_free_authinfo(&asoc->authinfo); LIST_FOREACH_SAFE(shared_key, &asoc->shared_keys, next, nshared_key) { LIST_REMOVE(shared_key, next); sctp_free_sharedkey(shared_key); /* sa_ignore FREED_MEMORY */ } /* Insert new items here :> */ /* Get rid of LOCK */ SCTP_TCB_UNLOCK(stcb); SCTP_TCB_LOCK_DESTROY(stcb); SCTP_TCB_SEND_LOCK_DESTROY(stcb); if (from_inpcbfree == SCTP_NORMAL_PROC) { SCTP_INP_INFO_WUNLOCK(); SCTP_INP_RLOCK(inp); } #ifdef SCTP_TRACK_FREED_ASOCS if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) { /* now clean up the tasoc itself */ SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_asoc), stcb); SCTP_DECR_ASOC_COUNT(); } else { LIST_INSERT_HEAD(&inp->sctp_asoc_free_list, stcb, sctp_tcblist); } #else SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_asoc), stcb); SCTP_DECR_ASOC_COUNT(); #endif if (from_inpcbfree == SCTP_NORMAL_PROC) { if (inp->sctp_flags & SCTP_PCB_FLAGS_SOCKET_GONE) { /* * If its NOT the inp_free calling us AND sctp_close * as been called, we call back... */ SCTP_INP_RUNLOCK(inp); /* * This will start the kill timer (if we are the * last one) since we hold an increment yet. But * this is the only safe way to do this since * otherwise if the socket closes at the same time * we are here we might collide in the cleanup. */ sctp_inpcb_free(inp, SCTP_FREE_SHOULD_USE_GRACEFUL_CLOSE, SCTP_CALLED_DIRECTLY_NOCMPSET); SCTP_INP_DECR_REF(inp); goto out_of; } else { /* The socket is still open. */ SCTP_INP_DECR_REF(inp); } } if (from_inpcbfree == SCTP_NORMAL_PROC) { SCTP_INP_RUNLOCK(inp); } out_of: /* destroyed the asoc */ #ifdef SCTP_LOG_CLOSING sctp_log_closing(inp, NULL, 11); #endif return (1); } /* * determine if a destination is "reachable" based upon the addresses bound * to the current endpoint (e.g. only v4 or v6 currently bound) */ /* * FIX: if we allow assoc-level bindx(), then this needs to be fixed to use * assoc level v4/v6 flags, as the assoc *may* not have the same address * types bound as its endpoint */ int sctp_destination_is_reachable(struct sctp_tcb *stcb, struct sockaddr *destaddr) { struct sctp_inpcb *inp; int answer; /* * No locks here, the TCB, in all cases is already locked and an * assoc is up. There is either a INP lock by the caller applied (in * asconf case when deleting an address) or NOT in the HB case, * however if HB then the INP increment is up and the INP will not * be removed (on top of the fact that we have a TCB lock). So we * only want to read the sctp_flags, which is either bound-all or * not.. no protection needed since once an assoc is up you can't be * changing your binding. */ inp = stcb->sctp_ep; if (inp->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL) { /* if bound all, destination is not restricted */ /* * RRS: Question during lock work: Is this correct? If you * are bound-all you still might need to obey the V4--V6 * flags??? IMO this bound-all stuff needs to be removed! */ return (1); } /* NOTE: all "scope" checks are done when local addresses are added */ switch (destaddr->sa_family) { #ifdef INET6 case AF_INET6: answer = inp->ip_inp.inp.inp_vflag & INP_IPV6; break; #endif #ifdef INET case AF_INET: answer = inp->ip_inp.inp.inp_vflag & INP_IPV4; break; #endif default: /* invalid family, so it's unreachable */ answer = 0; break; } return (answer); } /* * update the inp_vflags on an endpoint */ static void sctp_update_ep_vflag(struct sctp_inpcb *inp) { struct sctp_laddr *laddr; /* first clear the flag */ inp->ip_inp.inp.inp_vflag = 0; /* set the flag based on addresses on the ep list */ LIST_FOREACH(laddr, &inp->sctp_addr_list, sctp_nxt_addr) { if (laddr->ifa == NULL) { SCTPDBG(SCTP_DEBUG_PCB1, "%s: NULL ifa\n", __func__); continue; } if (laddr->ifa->localifa_flags & SCTP_BEING_DELETED) { continue; } switch (laddr->ifa->address.sa.sa_family) { #ifdef INET6 case AF_INET6: inp->ip_inp.inp.inp_vflag |= INP_IPV6; break; #endif #ifdef INET case AF_INET: inp->ip_inp.inp.inp_vflag |= INP_IPV4; break; #endif default: break; } } } /* * Add the address to the endpoint local address list There is nothing to be * done if we are bound to all addresses */ void sctp_add_local_addr_ep(struct sctp_inpcb *inp, struct sctp_ifa *ifa, uint32_t action) { struct sctp_laddr *laddr; struct sctp_tcb *stcb; int fnd, error = 0; fnd = 0; if (inp->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL) { /* You are already bound to all. You have it already */ return; } #ifdef INET6 if (ifa->address.sa.sa_family == AF_INET6) { if (ifa->localifa_flags & SCTP_ADDR_IFA_UNUSEABLE) { /* Can't bind a non-useable addr. */ return; } } #endif /* first, is it already present? */ LIST_FOREACH(laddr, &inp->sctp_addr_list, sctp_nxt_addr) { if (laddr->ifa == ifa) { fnd = 1; break; } } if (fnd == 0) { /* Not in the ep list */ error = sctp_insert_laddr(&inp->sctp_addr_list, ifa, action); if (error != 0) return; inp->laddr_count++; /* update inp_vflag flags */ switch (ifa->address.sa.sa_family) { #ifdef INET6 case AF_INET6: inp->ip_inp.inp.inp_vflag |= INP_IPV6; break; #endif #ifdef INET case AF_INET: inp->ip_inp.inp.inp_vflag |= INP_IPV4; break; #endif default: break; } LIST_FOREACH(stcb, &inp->sctp_asoc_list, sctp_tcblist) { sctp_add_local_addr_restricted(stcb, ifa); } } return; } /* * select a new (hopefully reachable) destination net (should only be used * when we deleted an ep addr that is the only usable source address to reach * the destination net) */ static void sctp_select_primary_destination(struct sctp_tcb *stcb) { struct sctp_nets *net; TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { /* for now, we'll just pick the first reachable one we find */ if (net->dest_state & SCTP_ADDR_UNCONFIRMED) continue; if (sctp_destination_is_reachable(stcb, (struct sockaddr *)&net->ro._l_addr)) { /* found a reachable destination */ stcb->asoc.primary_destination = net; } } /* I can't there from here! ...we're gonna die shortly... */ } /* * Delete the address from the endpoint local address list. There is nothing * to be done if we are bound to all addresses */ void sctp_del_local_addr_ep(struct sctp_inpcb *inp, struct sctp_ifa *ifa) { struct sctp_laddr *laddr; int fnd; fnd = 0; if (inp->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL) { /* You are already bound to all. You have it already */ return; } LIST_FOREACH(laddr, &inp->sctp_addr_list, sctp_nxt_addr) { if (laddr->ifa == ifa) { fnd = 1; break; } } if (fnd && (inp->laddr_count < 2)) { /* can't delete unless there are at LEAST 2 addresses */ return; } if (fnd) { /* * clean up any use of this address go through our * associations and clear any last_used_address that match * this one for each assoc, see if a new primary_destination * is needed */ struct sctp_tcb *stcb; /* clean up "next_addr_touse" */ if (inp->next_addr_touse == laddr) /* delete this address */ inp->next_addr_touse = NULL; /* clean up "last_used_address" */ LIST_FOREACH(stcb, &inp->sctp_asoc_list, sctp_tcblist) { struct sctp_nets *net; SCTP_TCB_LOCK(stcb); if (stcb->asoc.last_used_address == laddr) /* delete this address */ stcb->asoc.last_used_address = NULL; /* * Now spin through all the nets and purge any ref * to laddr */ TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { if (net->ro._s_addr == laddr->ifa) { /* Yep, purge src address selected */ sctp_rtentry_t *rt; /* delete this address if cached */ rt = net->ro.ro_rt; if (rt != NULL) { RTFREE(rt); net->ro.ro_rt = NULL; } sctp_free_ifa(net->ro._s_addr); net->ro._s_addr = NULL; net->src_addr_selected = 0; } } SCTP_TCB_UNLOCK(stcb); } /* for each tcb */ /* remove it from the ep list */ sctp_remove_laddr(laddr); inp->laddr_count--; /* update inp_vflag flags */ sctp_update_ep_vflag(inp); } return; } /* * Add the address to the TCB local address restricted list. * This is a "pending" address list (eg. addresses waiting for an * ASCONF-ACK response) and cannot be used as a valid source address. */ void sctp_add_local_addr_restricted(struct sctp_tcb *stcb, struct sctp_ifa *ifa) { struct sctp_laddr *laddr; struct sctpladdr *list; /* * Assumes TCB is locked.. and possibly the INP. May need to * confirm/fix that if we need it and is not the case. */ list = &stcb->asoc.sctp_restricted_addrs; #ifdef INET6 if (ifa->address.sa.sa_family == AF_INET6) { if (ifa->localifa_flags & SCTP_ADDR_IFA_UNUSEABLE) { /* Can't bind a non-existent addr. */ return; } } #endif /* does the address already exist? */ LIST_FOREACH(laddr, list, sctp_nxt_addr) { if (laddr->ifa == ifa) { return; } } /* add to the list */ (void)sctp_insert_laddr(list, ifa, 0); return; } /* * Remove a local address from the TCB local address restricted list */ void sctp_del_local_addr_restricted(struct sctp_tcb *stcb, struct sctp_ifa *ifa) { struct sctp_inpcb *inp; struct sctp_laddr *laddr; /* * This is called by asconf work. It is assumed that a) The TCB is * locked and b) The INP is locked. This is true in as much as I can * trace through the entry asconf code where I did these locks. * Again, the ASCONF code is a bit different in that it does lock * the INP during its work often times. This must be since we don't * want other proc's looking up things while what they are looking * up is changing :-D */ inp = stcb->sctp_ep; /* if subset bound and don't allow ASCONF's, can't delete last */ if (((inp->sctp_flags & SCTP_PCB_FLAGS_BOUNDALL) == 0) && sctp_is_feature_off(inp, SCTP_PCB_FLAGS_DO_ASCONF)) { if (stcb->sctp_ep->laddr_count < 2) { /* can't delete last address */ return; } } LIST_FOREACH(laddr, &stcb->asoc.sctp_restricted_addrs, sctp_nxt_addr) { /* remove the address if it exists */ if (laddr->ifa == NULL) continue; if (laddr->ifa == ifa) { sctp_remove_laddr(laddr); return; } } /* address not found! */ return; } /* * Temporarily remove for __APPLE__ until we use the Tiger equivalents */ /* sysctl */ static int sctp_max_number_of_assoc = SCTP_MAX_NUM_OF_ASOC; static int sctp_scale_up_for_address = SCTP_SCALE_FOR_ADDR; #if defined(__FreeBSD__) && defined(SCTP_MCORE_INPUT) && defined(SMP) struct sctp_mcore_ctrl *sctp_mcore_workers = NULL; int *sctp_cpuarry = NULL; void sctp_queue_to_mcore(struct mbuf *m, int off, int cpu_to_use) { /* Queue a packet to a processor for the specified core */ struct sctp_mcore_queue *qent; struct sctp_mcore_ctrl *wkq; int need_wake = 0; if (sctp_mcore_workers == NULL) { /* Something went way bad during setup */ sctp_input_with_port(m, off, 0); return; } SCTP_MALLOC(qent, struct sctp_mcore_queue *, (sizeof(struct sctp_mcore_queue)), SCTP_M_MCORE); if (qent == NULL) { /* This is trouble */ sctp_input_with_port(m, off, 0); return; } qent->vn = curvnet; qent->m = m; qent->off = off; qent->v6 = 0; wkq = &sctp_mcore_workers[cpu_to_use]; SCTP_MCORE_QLOCK(wkq); TAILQ_INSERT_TAIL(&wkq->que, qent, next); if (wkq->running == 0) { need_wake = 1; } SCTP_MCORE_QUNLOCK(wkq); if (need_wake) { wakeup(&wkq->running); } } static void sctp_mcore_thread(void *arg) { struct sctp_mcore_ctrl *wkq; struct sctp_mcore_queue *qent; wkq = (struct sctp_mcore_ctrl *)arg; struct mbuf *m; int off, v6; /* Wait for first tickle */ SCTP_MCORE_LOCK(wkq); wkq->running = 0; msleep(&wkq->running, &wkq->core_mtx, 0, "wait for pkt", 0); SCTP_MCORE_UNLOCK(wkq); /* Bind to our cpu */ thread_lock(curthread); sched_bind(curthread, wkq->cpuid); thread_unlock(curthread); /* Now lets start working */ SCTP_MCORE_LOCK(wkq); /* Now grab lock and go */ for (;;) { SCTP_MCORE_QLOCK(wkq); skip_sleep: wkq->running = 1; qent = TAILQ_FIRST(&wkq->que); if (qent) { TAILQ_REMOVE(&wkq->que, qent, next); SCTP_MCORE_QUNLOCK(wkq); CURVNET_SET(qent->vn); m = qent->m; off = qent->off; v6 = qent->v6; SCTP_FREE(qent, SCTP_M_MCORE); if (v6 == 0) { sctp_input_with_port(m, off, 0); } else { SCTP_PRINTF("V6 not yet supported\n"); sctp_m_freem(m); } CURVNET_RESTORE(); SCTP_MCORE_QLOCK(wkq); } wkq->running = 0; if (!TAILQ_EMPTY(&wkq->que)) { goto skip_sleep; } SCTP_MCORE_QUNLOCK(wkq); msleep(&wkq->running, &wkq->core_mtx, 0, "wait for pkt", 0); } } static void sctp_startup_mcore_threads(void) { int i, cpu; if (mp_ncpus == 1) return; if (sctp_mcore_workers != NULL) { /* * Already been here in some previous vnet? */ return; } SCTP_MALLOC(sctp_mcore_workers, struct sctp_mcore_ctrl *, ((mp_maxid + 1) * sizeof(struct sctp_mcore_ctrl)), SCTP_M_MCORE); if (sctp_mcore_workers == NULL) { /* TSNH I hope */ return; } memset(sctp_mcore_workers, 0, ((mp_maxid + 1) * sizeof(struct sctp_mcore_ctrl))); /* Init the structures */ for (i = 0; i <= mp_maxid; i++) { TAILQ_INIT(&sctp_mcore_workers[i].que); SCTP_MCORE_LOCK_INIT(&sctp_mcore_workers[i]); SCTP_MCORE_QLOCK_INIT(&sctp_mcore_workers[i]); sctp_mcore_workers[i].cpuid = i; } if (sctp_cpuarry == NULL) { SCTP_MALLOC(sctp_cpuarry, int *, (mp_ncpus * sizeof(int)), SCTP_M_MCORE); i = 0; CPU_FOREACH(cpu) { sctp_cpuarry[i] = cpu; i++; } } /* Now start them all */ CPU_FOREACH(cpu) { (void)kproc_create(sctp_mcore_thread, (void *)&sctp_mcore_workers[cpu], &sctp_mcore_workers[cpu].thread_proc, RFPROC, SCTP_KTHREAD_PAGES, SCTP_MCORE_NAME); } } #endif void sctp_pcb_init() { /* * SCTP initialization for the PCB structures should be called by * the sctp_init() function. */ int i; struct timeval tv; if (SCTP_BASE_VAR(sctp_pcb_initialized) != 0) { /* error I was called twice */ return; } SCTP_BASE_VAR(sctp_pcb_initialized) = 1; #if defined(SCTP_LOCAL_TRACE_BUF) bzero(&SCTP_BASE_SYSCTL(sctp_log), sizeof(struct sctp_log)); #endif #if defined(__FreeBSD__) && defined(SMP) && defined(SCTP_USE_PERCPU_STAT) SCTP_MALLOC(SCTP_BASE_STATS, struct sctpstat *, ((mp_maxid + 1) * sizeof(struct sctpstat)), SCTP_M_MCORE); #endif (void)SCTP_GETTIME_TIMEVAL(&tv); #if defined(__FreeBSD__) && defined(SMP) && defined(SCTP_USE_PERCPU_STAT) bzero(SCTP_BASE_STATS, (sizeof(struct sctpstat) * (mp_maxid + 1))); SCTP_BASE_STATS[PCPU_GET(cpuid)].sctps_discontinuitytime.tv_sec = (uint32_t) tv.tv_sec; SCTP_BASE_STATS[PCPU_GET(cpuid)].sctps_discontinuitytime.tv_usec = (uint32_t) tv.tv_usec; #else bzero(&SCTP_BASE_STATS, sizeof(struct sctpstat)); SCTP_BASE_STAT(sctps_discontinuitytime).tv_sec = (uint32_t) tv.tv_sec; SCTP_BASE_STAT(sctps_discontinuitytime).tv_usec = (uint32_t) tv.tv_usec; #endif /* init the empty list of (All) Endpoints */ LIST_INIT(&SCTP_BASE_INFO(listhead)); /* init the hash table of endpoints */ TUNABLE_INT_FETCH("net.inet.sctp.tcbhashsize", &SCTP_BASE_SYSCTL(sctp_hashtblsize)); TUNABLE_INT_FETCH("net.inet.sctp.pcbhashsize", &SCTP_BASE_SYSCTL(sctp_pcbtblsize)); TUNABLE_INT_FETCH("net.inet.sctp.chunkscale", &SCTP_BASE_SYSCTL(sctp_chunkscale)); SCTP_BASE_INFO(sctp_asochash) = SCTP_HASH_INIT((SCTP_BASE_SYSCTL(sctp_hashtblsize) * 31), &SCTP_BASE_INFO(hashasocmark)); SCTP_BASE_INFO(sctp_ephash) = SCTP_HASH_INIT(SCTP_BASE_SYSCTL(sctp_hashtblsize), &SCTP_BASE_INFO(hashmark)); SCTP_BASE_INFO(sctp_tcpephash) = SCTP_HASH_INIT(SCTP_BASE_SYSCTL(sctp_hashtblsize), &SCTP_BASE_INFO(hashtcpmark)); SCTP_BASE_INFO(hashtblsize) = SCTP_BASE_SYSCTL(sctp_hashtblsize); SCTP_BASE_INFO(sctp_vrfhash) = SCTP_HASH_INIT(SCTP_SIZE_OF_VRF_HASH, &SCTP_BASE_INFO(hashvrfmark)); SCTP_BASE_INFO(vrf_ifn_hash) = SCTP_HASH_INIT(SCTP_VRF_IFN_HASH_SIZE, &SCTP_BASE_INFO(vrf_ifn_hashmark)); /* init the zones */ /* * FIX ME: Should check for NULL returns, but if it does fail we are * doomed to panic anyways... add later maybe. */ SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_ep), "sctp_ep", sizeof(struct sctp_inpcb), maxsockets); SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_asoc), "sctp_asoc", sizeof(struct sctp_tcb), sctp_max_number_of_assoc); SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_laddr), "sctp_laddr", sizeof(struct sctp_laddr), (sctp_max_number_of_assoc * sctp_scale_up_for_address)); SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_net), "sctp_raddr", sizeof(struct sctp_nets), (sctp_max_number_of_assoc * sctp_scale_up_for_address)); SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_chunk), "sctp_chunk", sizeof(struct sctp_tmit_chunk), (sctp_max_number_of_assoc * SCTP_BASE_SYSCTL(sctp_chunkscale))); SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_readq), "sctp_readq", sizeof(struct sctp_queued_to_read), (sctp_max_number_of_assoc * SCTP_BASE_SYSCTL(sctp_chunkscale))); SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_strmoq), "sctp_stream_msg_out", sizeof(struct sctp_stream_queue_pending), (sctp_max_number_of_assoc * SCTP_BASE_SYSCTL(sctp_chunkscale))); SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_asconf), "sctp_asconf", sizeof(struct sctp_asconf), (sctp_max_number_of_assoc * SCTP_BASE_SYSCTL(sctp_chunkscale))); SCTP_ZONE_INIT(SCTP_BASE_INFO(ipi_zone_asconf_ack), "sctp_asconf_ack", sizeof(struct sctp_asconf_ack), (sctp_max_number_of_assoc * SCTP_BASE_SYSCTL(sctp_chunkscale))); /* Master Lock INIT for info structure */ SCTP_INP_INFO_LOCK_INIT(); SCTP_STATLOG_INIT_LOCK(); SCTP_IPI_COUNT_INIT(); SCTP_IPI_ADDR_INIT(); #ifdef SCTP_PACKET_LOGGING SCTP_IP_PKTLOG_INIT(); #endif LIST_INIT(&SCTP_BASE_INFO(addr_wq)); SCTP_WQ_ADDR_INIT(); /* not sure if we need all the counts */ SCTP_BASE_INFO(ipi_count_ep) = 0; /* assoc/tcb zone info */ SCTP_BASE_INFO(ipi_count_asoc) = 0; /* local addrlist zone info */ SCTP_BASE_INFO(ipi_count_laddr) = 0; /* remote addrlist zone info */ SCTP_BASE_INFO(ipi_count_raddr) = 0; /* chunk info */ SCTP_BASE_INFO(ipi_count_chunk) = 0; /* socket queue zone info */ SCTP_BASE_INFO(ipi_count_readq) = 0; /* stream out queue cont */ SCTP_BASE_INFO(ipi_count_strmoq) = 0; SCTP_BASE_INFO(ipi_free_strmoq) = 0; SCTP_BASE_INFO(ipi_free_chunks) = 0; SCTP_OS_TIMER_INIT(&SCTP_BASE_INFO(addr_wq_timer.timer)); /* Init the TIMEWAIT list */ for (i = 0; i < SCTP_STACK_VTAG_HASH_SIZE; i++) { LIST_INIT(&SCTP_BASE_INFO(vtag_timewait)[i]); } sctp_startup_iterator(); #if defined(__FreeBSD__) && defined(SCTP_MCORE_INPUT) && defined(SMP) sctp_startup_mcore_threads(); #endif /* * INIT the default VRF which for BSD is the only one, other O/S's * may have more. But initially they must start with one and then * add the VRF's as addresses are added. */ sctp_init_vrf_list(SCTP_DEFAULT_VRF); } /* * Assumes that the SCTP_BASE_INFO() lock is NOT held. */ void sctp_pcb_finish(void) { struct sctp_vrflist *vrf_bucket; struct sctp_vrf *vrf, *nvrf; struct sctp_ifn *ifn, *nifn; struct sctp_ifa *ifa, *nifa; struct sctpvtaghead *chain; struct sctp_tagblock *twait_block, *prev_twait_block; struct sctp_laddr *wi, *nwi; int i; struct sctp_iterator *it, *nit; if (SCTP_BASE_VAR(sctp_pcb_initialized) == 0) { SCTP_PRINTF("%s: race condition on teardown.\n", __func__); return; } SCTP_BASE_VAR(sctp_pcb_initialized) = 0; /* * In FreeBSD the iterator thread never exits but we do clean up. * The only way FreeBSD reaches here is if we have VRF's but we * still add the ifdef to make it compile on old versions. */ retry: SCTP_IPI_ITERATOR_WQ_LOCK(); /* * sctp_iterator_worker() might be working on an it entry without * holding the lock. We won't find it on the list either and * continue and free/destroy it. While holding the lock, spin, to * avoid the race condition as sctp_iterator_worker() will have to * wait to re-aquire the lock. */ if (sctp_it_ctl.iterator_running != 0 || sctp_it_ctl.cur_it != NULL) { SCTP_IPI_ITERATOR_WQ_UNLOCK(); SCTP_PRINTF("%s: Iterator running while we held the lock. Retry. " "cur_it=%p\n", __func__, sctp_it_ctl.cur_it); DELAY(10); goto retry; } TAILQ_FOREACH_SAFE(it, &sctp_it_ctl.iteratorhead, sctp_nxt_itr, nit) { if (it->vn != curvnet) { continue; } TAILQ_REMOVE(&sctp_it_ctl.iteratorhead, it, sctp_nxt_itr); if (it->function_atend != NULL) { (*it->function_atend) (it->pointer, it->val); } SCTP_FREE(it, SCTP_M_ITER); } SCTP_IPI_ITERATOR_WQ_UNLOCK(); SCTP_ITERATOR_LOCK(); if ((sctp_it_ctl.cur_it) && (sctp_it_ctl.cur_it->vn == curvnet)) { sctp_it_ctl.iterator_flags |= SCTP_ITERATOR_STOP_CUR_IT; } SCTP_ITERATOR_UNLOCK(); SCTP_OS_TIMER_STOP_DRAIN(&SCTP_BASE_INFO(addr_wq_timer.timer)); SCTP_WQ_ADDR_LOCK(); LIST_FOREACH_SAFE(wi, &SCTP_BASE_INFO(addr_wq), sctp_nxt_addr, nwi) { LIST_REMOVE(wi, sctp_nxt_addr); SCTP_DECR_LADDR_COUNT(); if (wi->action == SCTP_DEL_IP_ADDRESS) { SCTP_FREE(wi->ifa, SCTP_M_IFA); } SCTP_ZONE_FREE(SCTP_BASE_INFO(ipi_zone_laddr), wi); } SCTP_WQ_ADDR_UNLOCK(); /* * free the vrf/ifn/ifa lists and hashes (be sure address monitor is * destroyed first). */ vrf_bucket = &SCTP_BASE_INFO(sctp_vrfhash)[(SCTP_DEFAULT_VRFID & SCTP_BASE_INFO(hashvrfmark))]; LIST_FOREACH_SAFE(vrf, vrf_bucket, next_vrf, nvrf) { LIST_FOREACH_SAFE(ifn, &vrf->ifnlist, next_ifn, nifn) { LIST_FOREACH_SAFE(ifa, &ifn->ifalist, next_ifa, nifa) { /* free the ifa */ LIST_REMOVE(ifa, next_bucket); LIST_REMOVE(ifa, next_ifa); SCTP_FREE(ifa, SCTP_M_IFA); } /* free the ifn */ LIST_REMOVE(ifn, next_bucket); LIST_REMOVE(ifn, next_ifn); SCTP_FREE(ifn, SCTP_M_IFN); } SCTP_HASH_FREE(vrf->vrf_addr_hash, vrf->vrf_addr_hashmark); /* free the vrf */ LIST_REMOVE(vrf, next_vrf); SCTP_FREE(vrf, SCTP_M_VRF); } /* free the vrf hashes */ SCTP_HASH_FREE(SCTP_BASE_INFO(sctp_vrfhash), SCTP_BASE_INFO(hashvrfmark)); SCTP_HASH_FREE(SCTP_BASE_INFO(vrf_ifn_hash), SCTP_BASE_INFO(vrf_ifn_hashmark)); /* * free the TIMEWAIT list elements malloc'd in the function * sctp_add_vtag_to_timewait()... */ for (i = 0; i < SCTP_STACK_VTAG_HASH_SIZE; i++) { chain = &SCTP_BASE_INFO(vtag_timewait)[i]; if (!LIST_EMPTY(chain)) { prev_twait_block = NULL; LIST_FOREACH(twait_block, chain, sctp_nxt_tagblock) { if (prev_twait_block) { SCTP_FREE(prev_twait_block, SCTP_M_TIMW); } prev_twait_block = twait_block; } SCTP_FREE(prev_twait_block, SCTP_M_TIMW); } } /* free the locks and mutexes */ #ifdef SCTP_PACKET_LOGGING SCTP_IP_PKTLOG_DESTROY(); #endif SCTP_IPI_ADDR_DESTROY(); SCTP_STATLOG_DESTROY(); SCTP_INP_INFO_LOCK_DESTROY(); SCTP_WQ_ADDR_DESTROY(); /* Get rid of other stuff too. */ if (SCTP_BASE_INFO(sctp_asochash) != NULL) SCTP_HASH_FREE(SCTP_BASE_INFO(sctp_asochash), SCTP_BASE_INFO(hashasocmark)); if (SCTP_BASE_INFO(sctp_ephash) != NULL) SCTP_HASH_FREE(SCTP_BASE_INFO(sctp_ephash), SCTP_BASE_INFO(hashmark)); if (SCTP_BASE_INFO(sctp_tcpephash) != NULL) SCTP_HASH_FREE(SCTP_BASE_INFO(sctp_tcpephash), SCTP_BASE_INFO(hashtcpmark)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_ep)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_asoc)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_laddr)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_net)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_chunk)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_readq)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_strmoq)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_asconf)); SCTP_ZONE_DESTROY(SCTP_BASE_INFO(ipi_zone_asconf_ack)); #if defined(__FreeBSD__) && defined(SMP) && defined(SCTP_USE_PERCPU_STAT) SCTP_FREE(SCTP_BASE_STATS, SCTP_M_MCORE); #endif } int sctp_load_addresses_from_init(struct sctp_tcb *stcb, struct mbuf *m, int offset, int limit, struct sockaddr *src, struct sockaddr *dst, struct sockaddr *altsa, uint16_t port) { /* * grub through the INIT pulling addresses and loading them to the * nets structure in the asoc. The from address in the mbuf should * also be loaded (if it is not already). This routine can be called * with either INIT or INIT-ACK's as long as the m points to the IP * packet and the offset points to the beginning of the parameters. */ struct sctp_inpcb *inp; struct sctp_nets *net, *nnet, *net_tmp; struct sctp_paramhdr *phdr, parm_buf; struct sctp_tcb *stcb_tmp; uint16_t ptype, plen; struct sockaddr *sa; uint8_t random_store[SCTP_PARAM_BUFFER_SIZE]; struct sctp_auth_random *p_random = NULL; uint16_t random_len = 0; uint8_t hmacs_store[SCTP_PARAM_BUFFER_SIZE]; struct sctp_auth_hmac_algo *hmacs = NULL; uint16_t hmacs_len = 0; uint8_t saw_asconf = 0; uint8_t saw_asconf_ack = 0; uint8_t chunks_store[SCTP_PARAM_BUFFER_SIZE]; struct sctp_auth_chunk_list *chunks = NULL; uint16_t num_chunks = 0; sctp_key_t *new_key; uint32_t keylen; int got_random = 0, got_hmacs = 0, got_chklist = 0; uint8_t peer_supports_ecn; uint8_t peer_supports_prsctp; uint8_t peer_supports_auth; uint8_t peer_supports_asconf; uint8_t peer_supports_asconf_ack; uint8_t peer_supports_reconfig; uint8_t peer_supports_nrsack; uint8_t peer_supports_pktdrop; uint8_t peer_supports_idata; #ifdef INET struct sockaddr_in sin; #endif #ifdef INET6 struct sockaddr_in6 sin6; #endif /* First get the destination address setup too. */ #ifdef INET memset(&sin, 0, sizeof(sin)); sin.sin_family = AF_INET; sin.sin_len = sizeof(sin); sin.sin_port = stcb->rport; #endif #ifdef INET6 memset(&sin6, 0, sizeof(sin6)); sin6.sin6_family = AF_INET6; sin6.sin6_len = sizeof(struct sockaddr_in6); sin6.sin6_port = stcb->rport; #endif if (altsa) { sa = altsa; } else { sa = src; } peer_supports_idata = 0; peer_supports_ecn = 0; peer_supports_prsctp = 0; peer_supports_auth = 0; peer_supports_asconf = 0; peer_supports_reconfig = 0; peer_supports_nrsack = 0; peer_supports_pktdrop = 0; TAILQ_FOREACH(net, &stcb->asoc.nets, sctp_next) { /* mark all addresses that we have currently on the list */ net->dest_state |= SCTP_ADDR_NOT_IN_ASSOC; } /* does the source address already exist? if so skip it */ inp = stcb->sctp_ep; atomic_add_int(&stcb->asoc.refcnt, 1); stcb_tmp = sctp_findassociation_ep_addr(&inp, sa, &net_tmp, dst, stcb); atomic_add_int(&stcb->asoc.refcnt, -1); if ((stcb_tmp == NULL && inp == stcb->sctp_ep) || inp == NULL) { /* we must add the source address */ /* no scope set here since we have a tcb already. */ switch (sa->sa_family) { #ifdef INET case AF_INET: if (stcb->asoc.scope.ipv4_addr_legal) { if (sctp_add_remote_addr(stcb, sa, NULL, port, SCTP_DONOT_SETSCOPE, SCTP_LOAD_ADDR_2)) { return (-1); } } break; #endif #ifdef INET6 case AF_INET6: if (stcb->asoc.scope.ipv6_addr_legal) { if (sctp_add_remote_addr(stcb, sa, NULL, port, SCTP_DONOT_SETSCOPE, SCTP_LOAD_ADDR_3)) { return (-2); } } break; #endif default: break; } } else { if (net_tmp != NULL && stcb_tmp == stcb) { net_tmp->dest_state &= ~SCTP_ADDR_NOT_IN_ASSOC; } else if (stcb_tmp != stcb) { /* It belongs to another association? */ if (stcb_tmp) SCTP_TCB_UNLOCK(stcb_tmp); return (-3); } } if (stcb->asoc.state == 0) { /* the assoc was freed? */ return (-4); } /* now we must go through each of the params. */ phdr = sctp_get_next_param(m, offset, &parm_buf, sizeof(parm_buf)); while (phdr) { ptype = ntohs(phdr->param_type); plen = ntohs(phdr->param_length); /* * SCTP_PRINTF("ptype => %0x, plen => %d\n", * (uint32_t)ptype, (int)plen); */ if (offset + plen > limit) { break; } if (plen == 0) { break; } #ifdef INET if (ptype == SCTP_IPV4_ADDRESS) { if (stcb->asoc.scope.ipv4_addr_legal) { struct sctp_ipv4addr_param *p4, p4_buf; /* ok get the v4 address and check/add */ phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)&p4_buf, sizeof(p4_buf)); if (plen != sizeof(struct sctp_ipv4addr_param) || phdr == NULL) { return (-5); } p4 = (struct sctp_ipv4addr_param *)phdr; sin.sin_addr.s_addr = p4->addr; if (IN_MULTICAST(ntohl(sin.sin_addr.s_addr))) { /* Skip multi-cast addresses */ goto next_param; } if ((sin.sin_addr.s_addr == INADDR_BROADCAST) || (sin.sin_addr.s_addr == INADDR_ANY)) { goto next_param; } sa = (struct sockaddr *)&sin; inp = stcb->sctp_ep; atomic_add_int(&stcb->asoc.refcnt, 1); stcb_tmp = sctp_findassociation_ep_addr(&inp, sa, &net, dst, stcb); atomic_add_int(&stcb->asoc.refcnt, -1); if ((stcb_tmp == NULL && inp == stcb->sctp_ep) || inp == NULL) { /* we must add the source address */ /* * no scope set since we have a tcb * already */ /* * we must validate the state again * here */ add_it_now: if (stcb->asoc.state == 0) { /* the assoc was freed? */ return (-7); } if (sctp_add_remote_addr(stcb, sa, NULL, port, SCTP_DONOT_SETSCOPE, SCTP_LOAD_ADDR_4)) { return (-8); } } else if (stcb_tmp == stcb) { if (stcb->asoc.state == 0) { /* the assoc was freed? */ return (-10); } if (net != NULL) { /* clear flag */ net->dest_state &= ~SCTP_ADDR_NOT_IN_ASSOC; } } else { /* * strange, address is in another * assoc? straighten out locks. */ if (stcb_tmp) { if (SCTP_GET_STATE(&stcb_tmp->asoc) & SCTP_STATE_COOKIE_WAIT) { struct mbuf *op_err; char msg[SCTP_DIAG_INFO_LEN]; /* * in setup state we * abort this guy */ snprintf(msg, sizeof(msg), "%s:%d at %s", __FILE__, __LINE__, __func__); op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), msg); sctp_abort_an_association(stcb_tmp->sctp_ep, stcb_tmp, op_err, SCTP_SO_NOT_LOCKED); goto add_it_now; } SCTP_TCB_UNLOCK(stcb_tmp); } if (stcb->asoc.state == 0) { /* the assoc was freed? */ return (-12); } return (-13); } } } else #endif #ifdef INET6 if (ptype == SCTP_IPV6_ADDRESS) { if (stcb->asoc.scope.ipv6_addr_legal) { /* ok get the v6 address and check/add */ struct sctp_ipv6addr_param *p6, p6_buf; phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)&p6_buf, sizeof(p6_buf)); if (plen != sizeof(struct sctp_ipv6addr_param) || phdr == NULL) { return (-14); } p6 = (struct sctp_ipv6addr_param *)phdr; memcpy((caddr_t)&sin6.sin6_addr, p6->addr, sizeof(p6->addr)); if (IN6_IS_ADDR_MULTICAST(&sin6.sin6_addr)) { /* Skip multi-cast addresses */ goto next_param; } if (IN6_IS_ADDR_LINKLOCAL(&sin6.sin6_addr)) { /* * Link local make no sense without * scope */ goto next_param; } sa = (struct sockaddr *)&sin6; inp = stcb->sctp_ep; atomic_add_int(&stcb->asoc.refcnt, 1); stcb_tmp = sctp_findassociation_ep_addr(&inp, sa, &net, dst, stcb); atomic_add_int(&stcb->asoc.refcnt, -1); if (stcb_tmp == NULL && (inp == stcb->sctp_ep || inp == NULL)) { /* * we must validate the state again * here */ add_it_now6: if (stcb->asoc.state == 0) { /* the assoc was freed? */ return (-16); } /* * we must add the address, no scope * set */ if (sctp_add_remote_addr(stcb, sa, NULL, port, SCTP_DONOT_SETSCOPE, SCTP_LOAD_ADDR_5)) { return (-17); } } else if (stcb_tmp == stcb) { /* * we must validate the state again * here */ if (stcb->asoc.state == 0) { /* the assoc was freed? */ return (-19); } if (net != NULL) { /* clear flag */ net->dest_state &= ~SCTP_ADDR_NOT_IN_ASSOC; } } else { /* * strange, address is in another * assoc? straighten out locks. */ if (stcb_tmp) { if (SCTP_GET_STATE(&stcb_tmp->asoc) & SCTP_STATE_COOKIE_WAIT) { struct mbuf *op_err; char msg[SCTP_DIAG_INFO_LEN]; /* * in setup state we * abort this guy */ snprintf(msg, sizeof(msg), "%s:%d at %s", __FILE__, __LINE__, __func__); op_err = sctp_generate_cause(SCTP_BASE_SYSCTL(sctp_diag_info_code), msg); sctp_abort_an_association(stcb_tmp->sctp_ep, stcb_tmp, op_err, SCTP_SO_NOT_LOCKED); goto add_it_now6; } SCTP_TCB_UNLOCK(stcb_tmp); } if (stcb->asoc.state == 0) { /* the assoc was freed? */ return (-21); } return (-22); } } } else #endif if (ptype == SCTP_ECN_CAPABLE) { peer_supports_ecn = 1; } else if (ptype == SCTP_ULP_ADAPTATION) { if (stcb->asoc.state != SCTP_STATE_OPEN) { struct sctp_adaptation_layer_indication ai, *aip; phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)&ai, sizeof(ai)); aip = (struct sctp_adaptation_layer_indication *)phdr; if (aip) { stcb->asoc.peers_adaptation = ntohl(aip->indication); stcb->asoc.adaptation_needed = 1; } } } else if (ptype == SCTP_SET_PRIM_ADDR) { struct sctp_asconf_addr_param lstore, *fee; int lptype; struct sockaddr *lsa = NULL; #ifdef INET struct sctp_asconf_addrv4_param *fii; #endif if (stcb->asoc.asconf_supported == 0) { return (-100); } if (plen > sizeof(lstore)) { return (-23); } phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)&lstore, min(plen, sizeof(lstore))); if (phdr == NULL) { return (-24); } fee = (struct sctp_asconf_addr_param *)phdr; lptype = ntohs(fee->addrp.ph.param_type); switch (lptype) { #ifdef INET case SCTP_IPV4_ADDRESS: if (plen != sizeof(struct sctp_asconf_addrv4_param)) { SCTP_PRINTF("Sizeof setprim in init/init ack not %d but %d - ignored\n", (int)sizeof(struct sctp_asconf_addrv4_param), plen); } else { fii = (struct sctp_asconf_addrv4_param *)fee; sin.sin_addr.s_addr = fii->addrp.addr; lsa = (struct sockaddr *)&sin; } break; #endif #ifdef INET6 case SCTP_IPV6_ADDRESS: if (plen != sizeof(struct sctp_asconf_addr_param)) { SCTP_PRINTF("Sizeof setprim (v6) in init/init ack not %d but %d - ignored\n", (int)sizeof(struct sctp_asconf_addr_param), plen); } else { memcpy(sin6.sin6_addr.s6_addr, fee->addrp.addr, sizeof(fee->addrp.addr)); lsa = (struct sockaddr *)&sin6; } break; #endif default: break; } if (lsa) { (void)sctp_set_primary_addr(stcb, sa, NULL); } } else if (ptype == SCTP_HAS_NAT_SUPPORT) { stcb->asoc.peer_supports_nat = 1; } else if (ptype == SCTP_PRSCTP_SUPPORTED) { /* Peer supports pr-sctp */ peer_supports_prsctp = 1; } else if (ptype == SCTP_SUPPORTED_CHUNK_EXT) { /* A supported extension chunk */ struct sctp_supported_chunk_types_param *pr_supported; uint8_t local_store[SCTP_PARAM_BUFFER_SIZE]; int num_ent, i; phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)&local_store, min(sizeof(local_store), plen)); if (phdr == NULL) { return (-25); } pr_supported = (struct sctp_supported_chunk_types_param *)phdr; num_ent = plen - sizeof(struct sctp_paramhdr); for (i = 0; i < num_ent; i++) { switch (pr_supported->chunk_types[i]) { case SCTP_ASCONF: peer_supports_asconf = 1; break; case SCTP_ASCONF_ACK: peer_supports_asconf_ack = 1; break; case SCTP_FORWARD_CUM_TSN: peer_supports_prsctp = 1; break; case SCTP_PACKET_DROPPED: peer_supports_pktdrop = 1; break; case SCTP_NR_SELECTIVE_ACK: peer_supports_nrsack = 1; break; case SCTP_STREAM_RESET: peer_supports_reconfig = 1; break; case SCTP_AUTHENTICATION: peer_supports_auth = 1; break; case SCTP_IDATA: peer_supports_idata = 1; break; default: /* one I have not learned yet */ break; } } } else if (ptype == SCTP_RANDOM) { if (plen > sizeof(random_store)) break; if (got_random) { /* already processed a RANDOM */ goto next_param; } phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)random_store, min(sizeof(random_store), plen)); if (phdr == NULL) return (-26); p_random = (struct sctp_auth_random *)phdr; random_len = plen - sizeof(*p_random); /* enforce the random length */ if (random_len != SCTP_AUTH_RANDOM_SIZE_REQUIRED) { SCTPDBG(SCTP_DEBUG_AUTH1, "SCTP: invalid RANDOM len\n"); return (-27); } got_random = 1; } else if (ptype == SCTP_HMAC_LIST) { uint16_t num_hmacs; uint16_t i; if (plen > sizeof(hmacs_store)) break; if (got_hmacs) { /* already processed a HMAC list */ goto next_param; } phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)hmacs_store, min(plen, sizeof(hmacs_store))); if (phdr == NULL) return (-28); hmacs = (struct sctp_auth_hmac_algo *)phdr; hmacs_len = plen - sizeof(*hmacs); num_hmacs = hmacs_len / sizeof(hmacs->hmac_ids[0]); /* validate the hmac list */ if (sctp_verify_hmac_param(hmacs, num_hmacs)) { return (-29); } if (stcb->asoc.peer_hmacs != NULL) sctp_free_hmaclist(stcb->asoc.peer_hmacs); stcb->asoc.peer_hmacs = sctp_alloc_hmaclist(num_hmacs); if (stcb->asoc.peer_hmacs != NULL) { for (i = 0; i < num_hmacs; i++) { (void)sctp_auth_add_hmacid(stcb->asoc.peer_hmacs, ntohs(hmacs->hmac_ids[i])); } } got_hmacs = 1; } else if (ptype == SCTP_CHUNK_LIST) { int i; if (plen > sizeof(chunks_store)) break; if (got_chklist) { /* already processed a Chunks list */ goto next_param; } phdr = sctp_get_next_param(m, offset, (struct sctp_paramhdr *)chunks_store, min(plen, sizeof(chunks_store))); if (phdr == NULL) return (-30); chunks = (struct sctp_auth_chunk_list *)phdr; num_chunks = plen - sizeof(*chunks); if (stcb->asoc.peer_auth_chunks != NULL) sctp_clear_chunklist(stcb->asoc.peer_auth_chunks); else stcb->asoc.peer_auth_chunks = sctp_alloc_chunklist(); for (i = 0; i < num_chunks; i++) { (void)sctp_auth_add_chunk(chunks->chunk_types[i], stcb->asoc.peer_auth_chunks); /* record asconf/asconf-ack if listed */ if (chunks->chunk_types[i] == SCTP_ASCONF) saw_asconf = 1; if (chunks->chunk_types[i] == SCTP_ASCONF_ACK) saw_asconf_ack = 1; } got_chklist = 1; } else if ((ptype == SCTP_HEARTBEAT_INFO) || (ptype == SCTP_STATE_COOKIE) || (ptype == SCTP_UNRECOG_PARAM) || (ptype == SCTP_COOKIE_PRESERVE) || (ptype == SCTP_SUPPORTED_ADDRTYPE) || (ptype == SCTP_ADD_IP_ADDRESS) || (ptype == SCTP_DEL_IP_ADDRESS) || (ptype == SCTP_ERROR_CAUSE_IND) || (ptype == SCTP_SUCCESS_REPORT)) { /* don't care */ ; } else { if ((ptype & 0x8000) == 0x0000) { /* * must stop processing the rest of the * param's. Any report bits were handled * with the call to * sctp_arethere_unrecognized_parameters() * when the INIT or INIT-ACK was first seen. */ break; } } next_param: offset += SCTP_SIZE32(plen); if (offset >= limit) { break; } phdr = sctp_get_next_param(m, offset, &parm_buf, sizeof(parm_buf)); } /* Now check to see if we need to purge any addresses */ TAILQ_FOREACH_SAFE(net, &stcb->asoc.nets, sctp_next, nnet) { if ((net->dest_state & SCTP_ADDR_NOT_IN_ASSOC) == SCTP_ADDR_NOT_IN_ASSOC) { /* This address has been removed from the asoc */ /* remove and free it */ stcb->asoc.numnets--; TAILQ_REMOVE(&stcb->asoc.nets, net, sctp_next); sctp_free_remote_addr(net); if (net == stcb->asoc.primary_destination) { stcb->asoc.primary_destination = NULL; sctp_select_primary_destination(stcb); } } } if ((stcb->asoc.ecn_supported == 1) && (peer_supports_ecn == 0)) { stcb->asoc.ecn_supported = 0; } if ((stcb->asoc.prsctp_supported == 1) && (peer_supports_prsctp == 0)) { stcb->asoc.prsctp_supported = 0; } if ((stcb->asoc.auth_supported == 1) && ((peer_supports_auth == 0) || (got_random == 0) || (got_hmacs == 0))) { stcb->asoc.auth_supported = 0; } if ((stcb->asoc.asconf_supported == 1) && ((peer_supports_asconf == 0) || (peer_supports_asconf_ack == 0) || (stcb->asoc.auth_supported == 0) || (saw_asconf == 0) || (saw_asconf_ack == 0))) { stcb->asoc.asconf_supported = 0; } if ((stcb->asoc.reconfig_supported == 1) && (peer_supports_reconfig == 0)) { stcb->asoc.reconfig_supported = 0; } if ((stcb->asoc.idata_supported == 1) && (peer_supports_idata == 0)) { stcb->asoc.idata_supported = 0; } if ((stcb->asoc.nrsack_supported == 1) && (peer_supports_nrsack == 0)) { stcb->asoc.nrsack_supported = 0; } if ((stcb->asoc.pktdrop_supported == 1) && (peer_supports_pktdrop == 0)) { stcb->asoc.pktdrop_supported = 0; } /* validate authentication required parameters */ if ((peer_supports_auth == 0) && (got_chklist == 1)) { /* peer does not support auth but sent a chunks list? */ return (-31); } if ((peer_supports_asconf == 1) && (peer_supports_auth == 0)) { /* peer supports asconf but not auth? */ return (-32); } else if ((peer_supports_asconf == 1) && (peer_supports_auth == 1) && ((saw_asconf == 0) || (saw_asconf_ack == 0))) { return (-33); } /* concatenate the full random key */ keylen = sizeof(*p_random) + random_len + sizeof(*hmacs) + hmacs_len; if (chunks != NULL) { keylen += sizeof(*chunks) + num_chunks; } new_key = sctp_alloc_key(keylen); if (new_key != NULL) { /* copy in the RANDOM */ if (p_random != NULL) { keylen = sizeof(*p_random) + random_len; bcopy(p_random, new_key->key, keylen); } /* append in the AUTH chunks */ if (chunks != NULL) { bcopy(chunks, new_key->key + keylen, sizeof(*chunks) + num_chunks); keylen += sizeof(*chunks) + num_chunks; } /* append in the HMACs */ if (hmacs != NULL) { bcopy(hmacs, new_key->key + keylen, sizeof(*hmacs) + hmacs_len); } } else { /* failed to get memory for the key */ return (-34); } if (stcb->asoc.authinfo.peer_random != NULL) sctp_free_key(stcb->asoc.authinfo.peer_random); stcb->asoc.authinfo.peer_random = new_key; sctp_clear_cachedkeys(stcb, stcb->asoc.authinfo.assoc_keyid); sctp_clear_cachedkeys(stcb, stcb->asoc.authinfo.recv_keyid); return (0); } int sctp_set_primary_addr(struct sctp_tcb *stcb, struct sockaddr *sa, struct sctp_nets *net) { /* make sure the requested primary address exists in the assoc */ if (net == NULL && sa) net = sctp_findnet(stcb, sa); if (net == NULL) { /* didn't find the requested primary address! */ return (-1); } else { /* set the primary address */ if (net->dest_state & SCTP_ADDR_UNCONFIRMED) { /* Must be confirmed, so queue to set */ net->dest_state |= SCTP_ADDR_REQ_PRIMARY; return (0); } stcb->asoc.primary_destination = net; if (!(net->dest_state & SCTP_ADDR_PF) && (stcb->asoc.alternate)) { sctp_free_remote_addr(stcb->asoc.alternate); stcb->asoc.alternate = NULL; } net = TAILQ_FIRST(&stcb->asoc.nets); if (net != stcb->asoc.primary_destination) { /* * first one on the list is NOT the primary * sctp_cmpaddr() is much more efficient if the * primary is the first on the list, make it so. */ TAILQ_REMOVE(&stcb->asoc.nets, stcb->asoc.primary_destination, sctp_next); TAILQ_INSERT_HEAD(&stcb->asoc.nets, stcb->asoc.primary_destination, sctp_next); } return (0); } } int sctp_is_vtag_good(uint32_t tag, uint16_t lport, uint16_t rport, struct timeval *now) { /* * This function serves two purposes. It will see if a TAG can be * re-used and return 1 for yes it is ok and 0 for don't use that * tag. A secondary function it will do is purge out old tags that * can be removed. */ struct sctpvtaghead *chain; struct sctp_tagblock *twait_block; struct sctpasochead *head; struct sctp_tcb *stcb; int i; SCTP_INP_INFO_RLOCK(); head = &SCTP_BASE_INFO(sctp_asochash)[SCTP_PCBHASH_ASOC(tag, SCTP_BASE_INFO(hashasocmark))]; LIST_FOREACH(stcb, head, sctp_asocs) { /* * We choose not to lock anything here. TCB's can't be * removed since we have the read lock, so they can't be * freed on us, same thing for the INP. I may be wrong with * this assumption, but we will go with it for now :-) */ if (stcb->sctp_ep->sctp_flags & SCTP_PCB_FLAGS_SOCKET_ALLGONE) { continue; } if (stcb->asoc.my_vtag == tag) { /* candidate */ if (stcb->rport != rport) { continue; } if (stcb->sctp_ep->sctp_lport != lport) { continue; } /* Its a used tag set */ SCTP_INP_INFO_RUNLOCK(); return (0); } } chain = &SCTP_BASE_INFO(vtag_timewait)[(tag % SCTP_STACK_VTAG_HASH_SIZE)]; /* Now what about timed wait ? */ LIST_FOREACH(twait_block, chain, sctp_nxt_tagblock) { /* * Block(s) are present, lets see if we have this tag in the * list */ for (i = 0; i < SCTP_NUMBER_IN_VTAG_BLOCK; i++) { if (twait_block->vtag_block[i].v_tag == 0) { /* not used */ continue; } else if ((long)twait_block->vtag_block[i].tv_sec_at_expire < now->tv_sec) { /* Audit expires this guy */ twait_block->vtag_block[i].tv_sec_at_expire = 0; twait_block->vtag_block[i].v_tag = 0; twait_block->vtag_block[i].lport = 0; twait_block->vtag_block[i].rport = 0; } else if ((twait_block->vtag_block[i].v_tag == tag) && (twait_block->vtag_block[i].lport == lport) && (twait_block->vtag_block[i].rport == rport)) { /* Bad tag, sorry :< */ SCTP_INP_INFO_RUNLOCK(); return (0); } } } SCTP_INP_INFO_RUNLOCK(); return (1); } static void sctp_drain_mbufs(struct sctp_tcb *stcb) { /* * We must hunt this association for MBUF's past the cumack (i.e. * out of order data that we can renege on). */ struct sctp_association *asoc; struct sctp_tmit_chunk *chk, *nchk; uint32_t cumulative_tsn_p1; struct sctp_queued_to_read *ctl, *nctl; int cnt, strmat; uint32_t gap, i; int fnd = 0; /* We look for anything larger than the cum-ack + 1 */ asoc = &stcb->asoc; if (asoc->cumulative_tsn == asoc->highest_tsn_inside_map) { /* none we can reneg on. */ return; } SCTP_STAT_INCR(sctps_protocol_drains_done); cumulative_tsn_p1 = asoc->cumulative_tsn + 1; cnt = 0; /* Ok that was fun, now we will drain all the inbound streams? */ for (strmat = 0; strmat < asoc->streamincnt; strmat++) { TAILQ_FOREACH_SAFE(ctl, &asoc->strmin[strmat].inqueue, next_instrm, nctl) { if (SCTP_TSN_GT(ctl->sinfo_tsn, cumulative_tsn_p1)) { /* Yep it is above cum-ack */ cnt++; SCTP_CALC_TSN_TO_GAP(gap, ctl->sinfo_tsn, asoc->mapping_array_base_tsn); asoc->size_on_all_streams = sctp_sbspace_sub(asoc->size_on_all_streams, ctl->length); sctp_ucount_decr(asoc->cnt_on_all_streams); SCTP_UNSET_TSN_PRESENT(asoc->mapping_array, gap); TAILQ_REMOVE(&asoc->strmin[strmat].inqueue, ctl, next_instrm); if (ctl->data) { sctp_m_freem(ctl->data); ctl->data = NULL; } sctp_free_remote_addr(ctl->whoFrom); /* Now its reasm? */ TAILQ_FOREACH_SAFE(chk, &ctl->reasm, sctp_next, nchk) { cnt++; SCTP_CALC_TSN_TO_GAP(gap, chk->rec.data.TSN_seq, asoc->mapping_array_base_tsn); asoc->size_on_reasm_queue = sctp_sbspace_sub(asoc->size_on_reasm_queue, chk->send_size); sctp_ucount_decr(asoc->cnt_on_reasm_queue); SCTP_UNSET_TSN_PRESENT(asoc->mapping_array, gap); TAILQ_REMOVE(&ctl->reasm, chk, sctp_next); if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } sctp_free_a_chunk(stcb, chk, SCTP_SO_NOT_LOCKED); } sctp_free_a_readq(stcb, ctl); } } TAILQ_FOREACH_SAFE(ctl, &asoc->strmin[strmat].uno_inqueue, next_instrm, nctl) { if (SCTP_TSN_GT(ctl->sinfo_tsn, cumulative_tsn_p1)) { /* Yep it is above cum-ack */ cnt++; SCTP_CALC_TSN_TO_GAP(gap, ctl->sinfo_tsn, asoc->mapping_array_base_tsn); asoc->size_on_all_streams = sctp_sbspace_sub(asoc->size_on_all_streams, ctl->length); sctp_ucount_decr(asoc->cnt_on_all_streams); SCTP_UNSET_TSN_PRESENT(asoc->mapping_array, gap); TAILQ_REMOVE(&asoc->strmin[strmat].uno_inqueue, ctl, next_instrm); if (ctl->data) { sctp_m_freem(ctl->data); ctl->data = NULL; } sctp_free_remote_addr(ctl->whoFrom); /* Now its reasm? */ TAILQ_FOREACH_SAFE(chk, &ctl->reasm, sctp_next, nchk) { cnt++; SCTP_CALC_TSN_TO_GAP(gap, chk->rec.data.TSN_seq, asoc->mapping_array_base_tsn); asoc->size_on_reasm_queue = sctp_sbspace_sub(asoc->size_on_reasm_queue, chk->send_size); sctp_ucount_decr(asoc->cnt_on_reasm_queue); SCTP_UNSET_TSN_PRESENT(asoc->mapping_array, gap); TAILQ_REMOVE(&ctl->reasm, chk, sctp_next); if (chk->data) { sctp_m_freem(chk->data); chk->data = NULL; } sctp_free_a_chunk(stcb, chk, SCTP_SO_NOT_LOCKED); } sctp_free_a_readq(stcb, ctl); } } } if (cnt) { /* We must back down to see what the new highest is */ for (i = asoc->highest_tsn_inside_map; SCTP_TSN_GE(i, asoc->mapping_array_base_tsn); i--) { SCTP_CALC_TSN_TO_GAP(gap, i, asoc->mapping_array_base_tsn); if (SCTP_IS_TSN_PRESENT(asoc->mapping_array, gap)) { asoc->highest_tsn_inside_map = i; fnd = 1; break; } } if (!fnd) { asoc->highest_tsn_inside_map = asoc->mapping_array_base_tsn - 1; } /* * Question, should we go through the delivery queue? The * only reason things are on here is the app not reading OR * a p-d-api up. An attacker COULD send enough in to * initiate the PD-API and then send a bunch of stuff to * other streams... these would wind up on the delivery * queue.. and then we would not get to them. But in order * to do this I then have to back-track and un-deliver * sequence numbers in streams.. el-yucko. I think for now * we will NOT look at the delivery queue and leave it to be * something to consider later. An alternative would be to * abort the P-D-API with a notification and then deliver * the data.... Or another method might be to keep track of * how many times the situation occurs and if we see a * possible attack underway just abort the association. */ #ifdef SCTP_DEBUG SCTPDBG(SCTP_DEBUG_PCB1, "Freed %d chunks from reneg harvest\n", cnt); #endif /* * Now do we need to find a new * asoc->highest_tsn_inside_map? */ asoc->last_revoke_count = cnt; (void)SCTP_OS_TIMER_STOP(&stcb->asoc.dack_timer.timer); /* sa_ignore NO_NULL_CHK */ sctp_send_sack(stcb, SCTP_SO_NOT_LOCKED); sctp_chunk_output(stcb->sctp_ep, stcb, SCTP_OUTPUT_FROM_DRAIN, SCTP_SO_NOT_LOCKED); } /* * Another issue, in un-setting the TSN's in the mapping array we * DID NOT adjust the highest_tsn marker. This will cause one of * two things to occur. It may cause us to do extra work in checking * for our mapping array movement. More importantly it may cause us * to SACK every datagram. This may not be a bad thing though since * we will recover once we get our cum-ack above and all this stuff * we dumped recovered. */ } void sctp_drain() { /* * We must walk the PCB lists for ALL associations here. The system * is LOW on MBUF's and needs help. This is where reneging will * occur. We really hope this does NOT happen! */ VNET_ITERATOR_DECL(vnet_iter); VNET_LIST_RLOCK_NOSLEEP(); VNET_FOREACH(vnet_iter) { CURVNET_SET(vnet_iter); struct sctp_inpcb *inp; struct sctp_tcb *stcb; SCTP_STAT_INCR(sctps_protocol_drain_calls); if (SCTP_BASE_SYSCTL(sctp_do_drain) == 0) { #ifdef VIMAGE continue; #else return; #endif } SCTP_INP_INFO_RLOCK(); LIST_FOREACH(inp, &SCTP_BASE_INFO(listhead), sctp_list) { /* For each endpoint */ SCTP_INP_RLOCK(inp); LIST_FOREACH(stcb, &inp->sctp_asoc_list, sctp_tcblist) { /* For each association */ SCTP_TCB_LOCK(stcb); sctp_drain_mbufs(stcb); SCTP_TCB_UNLOCK(stcb); } SCTP_INP_RUNLOCK(inp); } SCTP_INP_INFO_RUNLOCK(); CURVNET_RESTORE(); } VNET_LIST_RUNLOCK_NOSLEEP(); } /* * start a new iterator * iterates through all endpoints and associations based on the pcb_state * flags and asoc_state. "af" (mandatory) is executed for all matching * assocs and "ef" (optional) is executed when the iterator completes. * "inpf" (optional) is executed for each new endpoint as it is being * iterated through. inpe (optional) is called when the inp completes * its way through all the stcbs. */ int sctp_initiate_iterator(inp_func inpf, asoc_func af, inp_func inpe, uint32_t pcb_state, uint32_t pcb_features, uint32_t asoc_state, void *argp, uint32_t argi, end_func ef, struct sctp_inpcb *s_inp, uint8_t chunk_output_off) { struct sctp_iterator *it = NULL; if (af == NULL) { return (-1); } if (SCTP_BASE_VAR(sctp_pcb_initialized) == 0) { SCTP_PRINTF("%s: abort on initialize being %d\n", __func__, SCTP_BASE_VAR(sctp_pcb_initialized)); return (-1); } SCTP_MALLOC(it, struct sctp_iterator *, sizeof(struct sctp_iterator), SCTP_M_ITER); if (it == NULL) { SCTP_LTRACE_ERR_RET(NULL, NULL, NULL, SCTP_FROM_SCTP_PCB, ENOMEM); return (ENOMEM); } memset(it, 0, sizeof(*it)); it->function_assoc = af; it->function_inp = inpf; if (inpf) it->done_current_ep = 0; else it->done_current_ep = 1; it->function_atend = ef; it->pointer = argp; it->val = argi; it->pcb_flags = pcb_state; it->pcb_features = pcb_features; it->asoc_state = asoc_state; it->function_inp_end = inpe; it->no_chunk_output = chunk_output_off; it->vn = curvnet; if (s_inp) { /* Assume lock is held here */ it->inp = s_inp; SCTP_INP_INCR_REF(it->inp); it->iterator_flags = SCTP_ITERATOR_DO_SINGLE_INP; } else { SCTP_INP_INFO_RLOCK(); it->inp = LIST_FIRST(&SCTP_BASE_INFO(listhead)); if (it->inp) { SCTP_INP_INCR_REF(it->inp); } SCTP_INP_INFO_RUNLOCK(); it->iterator_flags = SCTP_ITERATOR_DO_ALL_INP; } SCTP_IPI_ITERATOR_WQ_LOCK(); if (SCTP_BASE_VAR(sctp_pcb_initialized) == 0) { SCTP_IPI_ITERATOR_WQ_UNLOCK(); SCTP_PRINTF("%s: rollback on initialize being %d it=%p\n", __func__, SCTP_BASE_VAR(sctp_pcb_initialized), it); SCTP_FREE(it, SCTP_M_ITER); return (-1); } TAILQ_INSERT_TAIL(&sctp_it_ctl.iteratorhead, it, sctp_nxt_itr); if (sctp_it_ctl.iterator_running == 0) { sctp_wakeup_iterator(); } SCTP_IPI_ITERATOR_WQ_UNLOCK(); /* sa_ignore MEMLEAK {memory is put on the tailq for the iterator} */ return (0); } Index: projects/vnet/sys/ofed/drivers/net/mlx4/en_rx.c =================================================================== --- projects/vnet/sys/ofed/drivers/net/mlx4/en_rx.c (revision 301546) +++ projects/vnet/sys/ofed/drivers/net/mlx4/en_rx.c (revision 301547) @@ -1,922 +1,922 @@ /* * Copyright (c) 2007, 2014 Mellanox Technologies. All rights reserved. * * This software is available to you under a choice of one of two * licenses. You may choose to be licensed under the terms of the GNU * General Public License (GPL) Version 2, available from the file * COPYING in the main directory of this source tree, or the * OpenIB.org BSD license below: * * Redistribution and use in source and binary forms, with or * without modification, are permitted provided that the following * conditions are met: * * - Redistributions of source code must retain the above * copyright notice, this list of conditions and the following * disclaimer. * * - Redistributions in binary form must reproduce the above * copyright notice, this list of conditions and the following * disclaimer in the documentation and/or other materials * provided with the distribution. * * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE * SOFTWARE. * */ #include "opt_inet.h" #include #include #include #include #include #include #include #ifdef CONFIG_NET_RX_BUSY_POLL #include #endif #include "mlx4_en.h" static void mlx4_en_init_rx_desc(struct mlx4_en_priv *priv, struct mlx4_en_rx_ring *ring, int index) { struct mlx4_en_rx_desc *rx_desc = (struct mlx4_en_rx_desc *) (ring->buf + (ring->stride * index)); int possible_frags; int i; /* Set size and memtype fields */ rx_desc->data[0].byte_count = cpu_to_be32(priv->rx_mb_size - MLX4_NET_IP_ALIGN); rx_desc->data[0].lkey = cpu_to_be32(priv->mdev->mr.key); /* * If the number of used fragments does not fill up the ring * stride, remaining (unused) fragments must be padded with * null address/size and a special memory key: */ possible_frags = (ring->stride - sizeof(struct mlx4_en_rx_desc)) / DS_SIZE; for (i = 1; i < possible_frags; i++) { rx_desc->data[i].byte_count = 0; rx_desc->data[i].lkey = cpu_to_be32(MLX4_EN_MEMTYPE_PAD); rx_desc->data[i].addr = 0; } } static int mlx4_en_alloc_buf(struct mlx4_en_rx_ring *ring, __be64 *pdma, struct mlx4_en_rx_mbuf *mb_list) { bus_dma_segment_t segs[1]; bus_dmamap_t map; struct mbuf *mb; int nsegs; int err; /* try to allocate a new spare mbuf */ if (unlikely(ring->spare.mbuf == NULL)) { mb = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, ring->rx_mb_size); if (unlikely(mb == NULL)) return (-ENOMEM); /* setup correct length */ mb->m_pkthdr.len = mb->m_len = ring->rx_mb_size; /* make sure IP header gets aligned */ m_adj(mb, MLX4_NET_IP_ALIGN); /* load spare mbuf into BUSDMA */ err = -bus_dmamap_load_mbuf_sg(ring->dma_tag, ring->spare.dma_map, mb, segs, &nsegs, BUS_DMA_NOWAIT); if (unlikely(err != 0)) { m_freem(mb); return (err); } /* store spare info */ ring->spare.mbuf = mb; ring->spare.paddr_be = cpu_to_be64(segs[0].ds_addr); bus_dmamap_sync(ring->dma_tag, ring->spare.dma_map, BUS_DMASYNC_PREREAD); } /* synchronize and unload the current mbuf, if any */ if (likely(mb_list->mbuf != NULL)) { bus_dmamap_sync(ring->dma_tag, mb_list->dma_map, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(ring->dma_tag, mb_list->dma_map); } mb = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, ring->rx_mb_size); if (unlikely(mb == NULL)) goto use_spare; /* setup correct length */ mb->m_pkthdr.len = mb->m_len = ring->rx_mb_size; /* make sure IP header gets aligned */ m_adj(mb, MLX4_NET_IP_ALIGN); err = -bus_dmamap_load_mbuf_sg(ring->dma_tag, mb_list->dma_map, mb, segs, &nsegs, BUS_DMA_NOWAIT); if (unlikely(err != 0)) { m_freem(mb); goto use_spare; } *pdma = cpu_to_be64(segs[0].ds_addr); mb_list->mbuf = mb; bus_dmamap_sync(ring->dma_tag, mb_list->dma_map, BUS_DMASYNC_PREREAD); return (0); use_spare: /* swap DMA maps */ map = mb_list->dma_map; mb_list->dma_map = ring->spare.dma_map; ring->spare.dma_map = map; /* swap MBUFs */ mb_list->mbuf = ring->spare.mbuf; ring->spare.mbuf = NULL; /* store physical address */ *pdma = ring->spare.paddr_be; return (0); } static void mlx4_en_free_buf(struct mlx4_en_rx_ring *ring, struct mlx4_en_rx_mbuf *mb_list) { bus_dmamap_t map = mb_list->dma_map; bus_dmamap_sync(ring->dma_tag, map, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(ring->dma_tag, map); m_freem(mb_list->mbuf); mb_list->mbuf = NULL; /* safety clearing */ } static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv, struct mlx4_en_rx_ring *ring, int index) { struct mlx4_en_rx_desc *rx_desc = (struct mlx4_en_rx_desc *) (ring->buf + (index * ring->stride)); struct mlx4_en_rx_mbuf *mb_list = ring->mbuf + index; mb_list->mbuf = NULL; if (mlx4_en_alloc_buf(ring, &rx_desc->data[0].addr, mb_list)) { priv->port_stats.rx_alloc_failed++; return (-ENOMEM); } return (0); } static inline void mlx4_en_update_rx_prod_db(struct mlx4_en_rx_ring *ring) { *ring->wqres.db.db = cpu_to_be32(ring->prod & 0xffff); } static int mlx4_en_fill_rx_buffers(struct mlx4_en_priv *priv) { struct mlx4_en_rx_ring *ring; int ring_ind; int buf_ind; int new_size; int err; for (buf_ind = 0; buf_ind < priv->prof->rx_ring_size; buf_ind++) { for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) { ring = priv->rx_ring[ring_ind]; err = mlx4_en_prepare_rx_desc(priv, ring, ring->actual_size); if (err) { if (ring->actual_size == 0) { en_err(priv, "Failed to allocate " "enough rx buffers\n"); return -ENOMEM; } else { new_size = rounddown_pow_of_two(ring->actual_size); en_warn(priv, "Only %d buffers allocated " "reducing ring size to %d\n", ring->actual_size, new_size); goto reduce_rings; } } ring->actual_size++; ring->prod++; } } return 0; reduce_rings: for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) { ring = priv->rx_ring[ring_ind]; while (ring->actual_size > new_size) { ring->actual_size--; ring->prod--; mlx4_en_free_buf(ring, ring->mbuf + ring->actual_size); } } return 0; } static void mlx4_en_free_rx_buf(struct mlx4_en_priv *priv, struct mlx4_en_rx_ring *ring) { int index; en_dbg(DRV, priv, "Freeing Rx buf - cons:%d prod:%d\n", ring->cons, ring->prod); /* Unmap and free Rx buffers */ BUG_ON((u32) (ring->prod - ring->cons) > ring->actual_size); while (ring->cons != ring->prod) { index = ring->cons & ring->size_mask; en_dbg(DRV, priv, "Processing descriptor:%d\n", index); mlx4_en_free_buf(ring, ring->mbuf + index); ++ring->cons; } } void mlx4_en_calc_rx_buf(struct net_device *dev) { struct mlx4_en_priv *priv = netdev_priv(dev); int eff_mtu = dev->if_mtu + ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN + MLX4_NET_IP_ALIGN; if (eff_mtu > MJUM16BYTES) { en_err(priv, "MTU(%d) is too big\n", dev->if_mtu); eff_mtu = MJUM16BYTES; } else if (eff_mtu > MJUM9BYTES) { eff_mtu = MJUM16BYTES; } else if (eff_mtu > MJUMPAGESIZE) { eff_mtu = MJUM9BYTES; } else if (eff_mtu > MCLBYTES) { eff_mtu = MJUMPAGESIZE; } else { eff_mtu = MCLBYTES; } priv->rx_mb_size = eff_mtu; en_dbg(DRV, priv, "Effective RX MTU: %d bytes\n", eff_mtu); } int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv, struct mlx4_en_rx_ring **pring, u32 size, int node) { struct mlx4_en_dev *mdev = priv->mdev; struct mlx4_en_rx_ring *ring; int err; int tmp; uint32_t x; ring = kzalloc(sizeof(struct mlx4_en_rx_ring), GFP_KERNEL); if (!ring) { en_err(priv, "Failed to allocate RX ring structure\n"); return -ENOMEM; } /* Create DMA descriptor TAG */ if ((err = -bus_dma_tag_create( bus_get_dma_tag(mdev->pdev->dev.bsddev), 1, /* any alignment */ 0, /* no boundary */ BUS_SPACE_MAXADDR, /* lowaddr */ BUS_SPACE_MAXADDR, /* highaddr */ NULL, NULL, /* filter, filterarg */ MJUM16BYTES, /* maxsize */ 1, /* nsegments */ MJUM16BYTES, /* maxsegsize */ 0, /* flags */ NULL, NULL, /* lockfunc, lockfuncarg */ &ring->dma_tag))) { en_err(priv, "Failed to create DMA tag\n"); goto err_ring; } ring->prod = 0; ring->cons = 0; ring->size = size; ring->size_mask = size - 1; ring->stride = roundup_pow_of_two( sizeof(struct mlx4_en_rx_desc) + DS_SIZE); ring->log_stride = ffs(ring->stride) - 1; ring->buf_size = ring->size * ring->stride + TXBB_SIZE; tmp = size * sizeof(struct mlx4_en_rx_mbuf); ring->mbuf = kzalloc(tmp, GFP_KERNEL); if (ring->mbuf == NULL) { err = -ENOMEM; goto err_dma_tag; } err = -bus_dmamap_create(ring->dma_tag, 0, &ring->spare.dma_map); if (err != 0) goto err_info; for (x = 0; x != size; x++) { err = -bus_dmamap_create(ring->dma_tag, 0, &ring->mbuf[x].dma_map); if (err != 0) { while (x--) bus_dmamap_destroy(ring->dma_tag, ring->mbuf[x].dma_map); goto err_info; } } en_dbg(DRV, priv, "Allocated MBUF ring at addr:%p size:%d\n", ring->mbuf, tmp); err = mlx4_alloc_hwq_res(mdev->dev, &ring->wqres, ring->buf_size, 2 * PAGE_SIZE); if (err) goto err_dma_map; err = mlx4_en_map_buffer(&ring->wqres.buf); if (err) { en_err(priv, "Failed to map RX buffer\n"); goto err_hwq; } ring->buf = ring->wqres.buf.direct.buf; *pring = ring; return 0; err_hwq: mlx4_free_hwq_res(mdev->dev, &ring->wqres, ring->buf_size); err_dma_map: for (x = 0; x != size; x++) { bus_dmamap_destroy(ring->dma_tag, ring->mbuf[x].dma_map); } bus_dmamap_destroy(ring->dma_tag, ring->spare.dma_map); err_info: vfree(ring->mbuf); err_dma_tag: bus_dma_tag_destroy(ring->dma_tag); err_ring: kfree(ring); return (err); } int mlx4_en_activate_rx_rings(struct mlx4_en_priv *priv) { struct mlx4_en_rx_ring *ring; int i; int ring_ind; int err; int stride = roundup_pow_of_two( sizeof(struct mlx4_en_rx_desc) + DS_SIZE); for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) { ring = priv->rx_ring[ring_ind]; ring->prod = 0; ring->cons = 0; ring->actual_size = 0; ring->cqn = priv->rx_cq[ring_ind]->mcq.cqn; ring->rx_mb_size = priv->rx_mb_size; ring->stride = stride; if (ring->stride <= TXBB_SIZE) ring->buf += TXBB_SIZE; ring->log_stride = ffs(ring->stride) - 1; ring->buf_size = ring->size * ring->stride; memset(ring->buf, 0, ring->buf_size); mlx4_en_update_rx_prod_db(ring); /* Initialize all descriptors */ for (i = 0; i < ring->size; i++) mlx4_en_init_rx_desc(priv, ring, i); #ifdef INET /* Configure lro mngr */ if (priv->dev->if_capenable & IFCAP_LRO) { if (tcp_lro_init(&ring->lro)) priv->dev->if_capenable &= ~IFCAP_LRO; else ring->lro.ifp = priv->dev; } #endif } err = mlx4_en_fill_rx_buffers(priv); if (err) goto err_buffers; for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) { ring = priv->rx_ring[ring_ind]; ring->size_mask = ring->actual_size - 1; mlx4_en_update_rx_prod_db(ring); } return 0; err_buffers: for (ring_ind = 0; ring_ind < priv->rx_ring_num; ring_ind++) mlx4_en_free_rx_buf(priv, priv->rx_ring[ring_ind]); ring_ind = priv->rx_ring_num - 1; while (ring_ind >= 0) { ring = priv->rx_ring[ring_ind]; if (ring->stride <= TXBB_SIZE) ring->buf -= TXBB_SIZE; ring_ind--; } return err; } void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv, struct mlx4_en_rx_ring **pring, u32 size, u16 stride) { struct mlx4_en_dev *mdev = priv->mdev; struct mlx4_en_rx_ring *ring = *pring; uint32_t x; mlx4_en_unmap_buffer(&ring->wqres.buf); mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE); for (x = 0; x != size; x++) bus_dmamap_destroy(ring->dma_tag, ring->mbuf[x].dma_map); /* free spare mbuf, if any */ if (ring->spare.mbuf != NULL) { bus_dmamap_sync(ring->dma_tag, ring->spare.dma_map, BUS_DMASYNC_POSTREAD); bus_dmamap_unload(ring->dma_tag, ring->spare.dma_map); m_freem(ring->spare.mbuf); } bus_dmamap_destroy(ring->dma_tag, ring->spare.dma_map); vfree(ring->mbuf); bus_dma_tag_destroy(ring->dma_tag); kfree(ring); *pring = NULL; #ifdef CONFIG_RFS_ACCEL mlx4_en_cleanup_filters(priv, ring); #endif } void mlx4_en_deactivate_rx_ring(struct mlx4_en_priv *priv, struct mlx4_en_rx_ring *ring) { #ifdef INET tcp_lro_free(&ring->lro); #endif mlx4_en_free_rx_buf(priv, ring); if (ring->stride <= TXBB_SIZE) ring->buf -= TXBB_SIZE; } static void validate_loopback(struct mlx4_en_priv *priv, struct mbuf *mb) { int i; int offset = ETHER_HDR_LEN; for (i = 0; i < MLX4_LOOPBACK_TEST_PAYLOAD; i++, offset++) { if (*(mb->m_data + offset) != (unsigned char) (i & 0xff)) goto out_loopback; } /* Loopback found */ priv->loopback_ok = 1; out_loopback: m_freem(mb); } static inline int invalid_cqe(struct mlx4_en_priv *priv, struct mlx4_cqe *cqe) { /* Drop packet on bad receive or bad checksum */ if (unlikely((cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) == MLX4_CQE_OPCODE_ERROR)) { en_err(priv, "CQE completed in error - vendor syndrom:%d syndrom:%d\n", ((struct mlx4_err_cqe *)cqe)->vendor_err_syndrome, ((struct mlx4_err_cqe *)cqe)->syndrome); return 1; } if (unlikely(cqe->badfcs_enc & MLX4_CQE_BAD_FCS)) { en_dbg(RX_ERR, priv, "Accepted frame with bad FCS\n"); return 1; } return 0; } static struct mbuf * mlx4_en_rx_mb(struct mlx4_en_priv *priv, struct mlx4_en_rx_ring *ring, struct mlx4_en_rx_desc *rx_desc, struct mlx4_en_rx_mbuf *mb_list, int length) { struct mbuf *mb; /* get mbuf */ mb = mb_list->mbuf; /* collect used fragment while atomically replacing it */ if (mlx4_en_alloc_buf(ring, &rx_desc->data[0].addr, mb_list)) return (NULL); /* range check hardware computed value */ if (unlikely(length > mb->m_len)) length = mb->m_len; /* update total packet length in packet header */ mb->m_len = mb->m_pkthdr.len = length; return (mb); } /* For cpu arch with cache line of 64B the performance is better when cqe size==64B * To enlarge cqe size from 32B to 64B --> 32B of garbage (i.e. 0xccccccc) * was added in the beginning of each cqe (the real data is in the corresponding 32B). * The following calc ensures that when factor==1, it means we are aligned to 64B * and we get the real cqe data*/ #define CQE_FACTOR_INDEX(index, factor) ((index << factor) + factor) int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget) { struct mlx4_en_priv *priv = netdev_priv(dev); struct mlx4_cqe *cqe; struct mlx4_en_rx_ring *ring = priv->rx_ring[cq->ring]; struct mlx4_en_rx_mbuf *mb_list; struct mlx4_en_rx_desc *rx_desc; struct mbuf *mb; struct mlx4_cq *mcq = &cq->mcq; struct mlx4_cqe *buf = cq->buf; int index; unsigned int length; int polled = 0; u32 cons_index = mcq->cons_index; u32 size_mask = ring->size_mask; int size = cq->size; int factor = priv->cqe_factor; if (!priv->port_up) return 0; /* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx * descriptor offset can be deducted from the CQE index instead of * reading 'cqe->index' */ index = cons_index & size_mask; cqe = &buf[CQE_FACTOR_INDEX(index, factor)]; /* Process all completed CQEs */ while (XNOR(cqe->owner_sr_opcode & MLX4_CQE_OWNER_MASK, cons_index & size)) { mb_list = ring->mbuf + index; rx_desc = (struct mlx4_en_rx_desc *) (ring->buf + (index << ring->log_stride)); /* * make sure we read the CQE after we read the ownership bit */ rmb(); if (invalid_cqe(priv, cqe)) { goto next; } /* * Packet is OK - process it. */ length = be32_to_cpu(cqe->byte_cnt); length -= ring->fcs_del; mb = mlx4_en_rx_mb(priv, ring, rx_desc, mb_list, length); if (unlikely(!mb)) { ring->errors++; goto next; } ring->bytes += length; ring->packets++; if (unlikely(priv->validate_loopback)) { validate_loopback(priv, mb); goto next; } /* forward Toeplitz compatible hash value */ mb->m_pkthdr.flowid = be32_to_cpu(cqe->immed_rss_invalid); - M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE); + M_HASHTYPE_SET(mb, M_HASHTYPE_OPAQUE_HASH); mb->m_pkthdr.rcvif = dev; if (be32_to_cpu(cqe->vlan_my_qpn) & MLX4_CQE_VLAN_PRESENT_MASK) { mb->m_pkthdr.ether_vtag = be16_to_cpu(cqe->sl_vid); mb->m_flags |= M_VLANTAG; } if (likely(dev->if_capenable & (IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6)) && (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && (cqe->checksum == cpu_to_be16(0xffff))) { priv->port_stats.rx_chksum_good++; mb->m_pkthdr.csum_flags = CSUM_IP_CHECKED | CSUM_IP_VALID | CSUM_DATA_VALID | CSUM_PSEUDO_HDR; mb->m_pkthdr.csum_data = htons(0xffff); /* This packet is eligible for LRO if it is: * - DIX Ethernet (type interpretation) * - TCP/IP (v4) * - without IP options * - not an IP fragment */ #ifdef INET if (mlx4_en_can_lro(cqe->status) && (dev->if_capenable & IFCAP_LRO)) { if (ring->lro.lro_cnt != 0 && tcp_lro_rx(&ring->lro, mb, 0) == 0) goto next; } #endif /* LRO not possible, complete processing here */ INC_PERF_COUNTER(priv->pstats.lro_misses); } else { mb->m_pkthdr.csum_flags = 0; priv->port_stats.rx_chksum_none++; } /* Push it up the stack */ dev->if_input(dev, mb); next: ++cons_index; index = cons_index & size_mask; cqe = &buf[CQE_FACTOR_INDEX(index, factor)]; if (++polled == budget) goto out; } /* Flush all pending IP reassembly sessions */ out: #ifdef INET tcp_lro_flush_all(&ring->lro); #endif AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled); mcq->cons_index = cons_index; mlx4_cq_set_ci(mcq); wmb(); /* ensure HW sees CQ consumer before we post new buffers */ ring->cons = mcq->cons_index; ring->prod += polled; /* Polled descriptors were realocated in place */ mlx4_en_update_rx_prod_db(ring); return polled; } /* Rx CQ polling - called by NAPI */ static int mlx4_en_poll_rx_cq(struct mlx4_en_cq *cq, int budget) { struct net_device *dev = cq->dev; int done; done = mlx4_en_process_rx_cq(dev, cq, budget); cq->tot_rx += done; return done; } void mlx4_en_rx_irq(struct mlx4_cq *mcq) { struct mlx4_en_cq *cq = container_of(mcq, struct mlx4_en_cq, mcq); struct mlx4_en_priv *priv = netdev_priv(cq->dev); int done; // Shoot one within the irq context // Because there is no NAPI in freeBSD done = mlx4_en_poll_rx_cq(cq, MLX4_EN_RX_BUDGET); if (priv->port_up && (done == MLX4_EN_RX_BUDGET) ) { cq->curr_poll_rx_cpu_id = curcpu; taskqueue_enqueue(cq->tq, &cq->cq_task); } else { mlx4_en_arm_cq(priv, cq); } } void mlx4_en_rx_que(void *context, int pending) { struct mlx4_en_cq *cq; struct thread *td; cq = context; td = curthread; thread_lock(td); sched_bind(td, cq->curr_poll_rx_cpu_id); thread_unlock(td); while (mlx4_en_poll_rx_cq(cq, MLX4_EN_RX_BUDGET) == MLX4_EN_RX_BUDGET); mlx4_en_arm_cq(cq->dev->if_softc, cq); } /* RSS related functions */ static int mlx4_en_config_rss_qp(struct mlx4_en_priv *priv, int qpn, struct mlx4_en_rx_ring *ring, enum mlx4_qp_state *state, struct mlx4_qp *qp) { struct mlx4_en_dev *mdev = priv->mdev; struct mlx4_qp_context *context; int err = 0; context = kmalloc(sizeof *context , GFP_KERNEL); if (!context) { en_err(priv, "Failed to allocate qp context\n"); return -ENOMEM; } err = mlx4_qp_alloc(mdev->dev, qpn, qp); if (err) { en_err(priv, "Failed to allocate qp #%x\n", qpn); goto out; } qp->event = mlx4_en_sqp_event; memset(context, 0, sizeof *context); mlx4_en_fill_qp_context(priv, ring->actual_size, ring->stride, 0, 0, qpn, ring->cqn, -1, context); context->db_rec_addr = cpu_to_be64(ring->wqres.db.dma); /* Cancel FCS removal if FW allows */ if (mdev->dev->caps.flags & MLX4_DEV_CAP_FLAG_FCS_KEEP) { context->param3 |= cpu_to_be32(1 << 29); ring->fcs_del = ETH_FCS_LEN; } else ring->fcs_del = 0; err = mlx4_qp_to_ready(mdev->dev, &ring->wqres.mtt, context, qp, state); if (err) { mlx4_qp_remove(mdev->dev, qp); mlx4_qp_free(mdev->dev, qp); } mlx4_en_update_rx_prod_db(ring); out: kfree(context); return err; } int mlx4_en_create_drop_qp(struct mlx4_en_priv *priv) { int err; u32 qpn; err = mlx4_qp_reserve_range(priv->mdev->dev, 1, 1, &qpn, 0); if (err) { en_err(priv, "Failed reserving drop qpn\n"); return err; } err = mlx4_qp_alloc(priv->mdev->dev, qpn, &priv->drop_qp); if (err) { en_err(priv, "Failed allocating drop qp\n"); mlx4_qp_release_range(priv->mdev->dev, qpn, 1); return err; } return 0; } void mlx4_en_destroy_drop_qp(struct mlx4_en_priv *priv) { u32 qpn; qpn = priv->drop_qp.qpn; mlx4_qp_remove(priv->mdev->dev, &priv->drop_qp); mlx4_qp_free(priv->mdev->dev, &priv->drop_qp); mlx4_qp_release_range(priv->mdev->dev, qpn, 1); } /* Allocate rx qp's and configure them according to rss map */ int mlx4_en_config_rss_steer(struct mlx4_en_priv *priv) { struct mlx4_en_dev *mdev = priv->mdev; struct mlx4_en_rss_map *rss_map = &priv->rss_map; struct mlx4_qp_context context; struct mlx4_rss_context *rss_context; int rss_rings; void *ptr; u8 rss_mask = (MLX4_RSS_IPV4 | MLX4_RSS_TCP_IPV4 | MLX4_RSS_IPV6 | MLX4_RSS_TCP_IPV6); int i; int err = 0; int good_qps = 0; static const u32 rsskey[10] = { 0xD181C62C, 0xF7F4DB5B, 0x1983A2FC, 0x943E1ADB, 0xD9389E6B, 0xD1039C2C, 0xA74499AD, 0x593D56D9, 0xF3253C06, 0x2ADC1FFC}; en_dbg(DRV, priv, "Configuring rss steering\n"); err = mlx4_qp_reserve_range(mdev->dev, priv->rx_ring_num, priv->rx_ring_num, &rss_map->base_qpn, 0); if (err) { en_err(priv, "Failed reserving %d qps\n", priv->rx_ring_num); return err; } for (i = 0; i < priv->rx_ring_num; i++) { priv->rx_ring[i]->qpn = rss_map->base_qpn + i; err = mlx4_en_config_rss_qp(priv, priv->rx_ring[i]->qpn, priv->rx_ring[i], &rss_map->state[i], &rss_map->qps[i]); if (err) goto rss_err; ++good_qps; } /* Configure RSS indirection qp */ err = mlx4_qp_alloc(mdev->dev, priv->base_qpn, &rss_map->indir_qp); if (err) { en_err(priv, "Failed to allocate RSS indirection QP\n"); goto rss_err; } rss_map->indir_qp.event = mlx4_en_sqp_event; mlx4_en_fill_qp_context(priv, 0, 0, 0, 1, priv->base_qpn, priv->rx_ring[0]->cqn, -1, &context); if (!priv->prof->rss_rings || priv->prof->rss_rings > priv->rx_ring_num) rss_rings = priv->rx_ring_num; else rss_rings = priv->prof->rss_rings; ptr = ((u8 *)&context) + offsetof(struct mlx4_qp_context, pri_path) + MLX4_RSS_OFFSET_IN_QPC_PRI_PATH; rss_context = ptr; rss_context->base_qpn = cpu_to_be32(ilog2(rss_rings) << 24 | (rss_map->base_qpn)); rss_context->default_qpn = cpu_to_be32(rss_map->base_qpn); if (priv->mdev->profile.udp_rss) { rss_mask |= MLX4_RSS_UDP_IPV4 | MLX4_RSS_UDP_IPV6; rss_context->base_qpn_udp = rss_context->default_qpn; } rss_context->flags = rss_mask; rss_context->hash_fn = MLX4_RSS_HASH_TOP; for (i = 0; i < 10; i++) rss_context->rss_key[i] = cpu_to_be32(rsskey[i]); err = mlx4_qp_to_ready(mdev->dev, &priv->res.mtt, &context, &rss_map->indir_qp, &rss_map->indir_state); if (err) goto indir_err; return 0; indir_err: mlx4_qp_modify(mdev->dev, NULL, rss_map->indir_state, MLX4_QP_STATE_RST, NULL, 0, 0, &rss_map->indir_qp); mlx4_qp_remove(mdev->dev, &rss_map->indir_qp); mlx4_qp_free(mdev->dev, &rss_map->indir_qp); rss_err: for (i = 0; i < good_qps; i++) { mlx4_qp_modify(mdev->dev, NULL, rss_map->state[i], MLX4_QP_STATE_RST, NULL, 0, 0, &rss_map->qps[i]); mlx4_qp_remove(mdev->dev, &rss_map->qps[i]); mlx4_qp_free(mdev->dev, &rss_map->qps[i]); } mlx4_qp_release_range(mdev->dev, rss_map->base_qpn, priv->rx_ring_num); return err; } void mlx4_en_release_rss_steer(struct mlx4_en_priv *priv) { struct mlx4_en_dev *mdev = priv->mdev; struct mlx4_en_rss_map *rss_map = &priv->rss_map; int i; mlx4_qp_modify(mdev->dev, NULL, rss_map->indir_state, MLX4_QP_STATE_RST, NULL, 0, 0, &rss_map->indir_qp); mlx4_qp_remove(mdev->dev, &rss_map->indir_qp); mlx4_qp_free(mdev->dev, &rss_map->indir_qp); for (i = 0; i < priv->rx_ring_num; i++) { mlx4_qp_modify(mdev->dev, NULL, rss_map->state[i], MLX4_QP_STATE_RST, NULL, 0, 0, &rss_map->qps[i]); mlx4_qp_remove(mdev->dev, &rss_map->qps[i]); mlx4_qp_free(mdev->dev, &rss_map->qps[i]); } mlx4_qp_release_range(mdev->dev, rss_map->base_qpn, priv->rx_ring_num); } Index: projects/vnet/sys/sys/intr.h =================================================================== --- projects/vnet/sys/sys/intr.h (revision 301546) +++ projects/vnet/sys/sys/intr.h (revision 301547) @@ -1,153 +1,129 @@ /*- * Copyright (c) 2015-2016 Svatopluk Kraus * Copyright (c) 2015-2016 Michal Meloun * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _SYS_INTR_H_ #define _SYS_INTR_H_ #include #define INTR_IRQ_INVALID 0xFFFFFFFF -#ifdef DEV_ACPI -struct intr_map_data_acpi { - struct intr_map_data hdr; - u_int irq; - enum intr_polarity pol; - enum intr_trigger trig; -}; -#endif - -struct intr_map_data_gpio { - struct intr_map_data hdr; - u_int gpio_pin_num; - u_int gpio_pin_flags; - u_int gpio_intr_mode; -}; - #ifdef notyet #define INTR_SOLO INTR_MD1 typedef int intr_irq_filter_t(void *arg, struct trapframe *tf); #else typedef int intr_irq_filter_t(void *arg); #endif typedef int intr_child_irq_filter_t(void *arg, uintptr_t irq); #define INTR_ISRC_NAMELEN (MAXCOMLEN + 1) #define INTR_ISRCF_IPI 0x01 /* IPI interrupt */ #define INTR_ISRCF_PPI 0x02 /* PPI interrupt */ #define INTR_ISRCF_BOUND 0x04 /* bound to a CPU */ struct intr_pic; /* Interrupt source definition. */ struct intr_irqsrc { device_t isrc_dev; /* where isrc is mapped */ u_int isrc_irq; /* unique identificator */ u_int isrc_flags; char isrc_name[INTR_ISRC_NAMELEN]; cpuset_t isrc_cpu; /* on which CPUs is enabled */ u_int isrc_index; u_long * isrc_count; u_int isrc_handlers; struct intr_event * isrc_event; #ifdef INTR_SOLO intr_irq_filter_t * isrc_filter; void * isrc_arg; #endif }; /* Intr interface for PIC. */ int intr_isrc_deregister(struct intr_irqsrc *); int intr_isrc_register(struct intr_irqsrc *, device_t, u_int, const char *, ...) __printflike(4, 5); #ifdef SMP bool intr_isrc_init_on_cpu(struct intr_irqsrc *isrc, u_int cpu); #endif int intr_isrc_dispatch(struct intr_irqsrc *, struct trapframe *); u_int intr_irq_next_cpu(u_int current_cpu, cpuset_t *cpumask); struct intr_pic *intr_pic_register(device_t, intptr_t); int intr_pic_deregister(device_t, intptr_t); int intr_pic_claim_root(device_t, intptr_t, intr_irq_filter_t *, void *, u_int); struct intr_pic *intr_pic_add_handler(device_t, struct intr_pic *, intr_child_irq_filter_t *, void *, uintptr_t, uintptr_t); extern device_t intr_irq_root_dev; /* Intr interface for BUS. */ int intr_map_irq(device_t, intptr_t, struct intr_map_data *, u_int *); int intr_alloc_irq(device_t, struct resource *); int intr_release_irq(device_t, struct resource *); int intr_setup_irq(device_t, struct resource *, driver_filter_t, driver_intr_t, void *, int, void **); int intr_teardown_irq(device_t, struct resource *, void *); int intr_describe_irq(device_t, struct resource *, void *, const char *); int intr_child_irq_handler(struct intr_pic *, uintptr_t); /* MSI/MSI-X handling */ int intr_msi_register(device_t, intptr_t); int intr_alloc_msi(device_t, device_t, intptr_t, int, int, int *); int intr_release_msi(device_t, device_t, intptr_t, int, int *); int intr_map_msi(device_t, device_t, intptr_t, int, uint64_t *, uint32_t *); int intr_alloc_msix(device_t, device_t, intptr_t, int *); int intr_release_msix(device_t, device_t, intptr_t, int); - -#ifdef DEV_ACPI -u_int intr_acpi_map_irq(device_t, u_int, enum intr_polarity, - enum intr_trigger); -#endif - -u_int intr_gpio_map_irq(device_t dev, u_int pin_num, u_int pin_flags, - u_int intr_mode); #ifdef SMP int intr_bind_irq(device_t, struct resource *, int); void intr_pic_init_secondary(void); /* Virtualization for interrupt source IPI counter increment. */ static inline void intr_ipi_increment_count(u_long *counter, u_int cpu) { KASSERT(cpu < MAXCPU, ("%s: too big cpu %u", __func__, cpu)); counter[cpu]++; } /* Virtualization for interrupt source IPI counters setup. */ u_long * intr_ipi_setup_counters(const char *name); #endif #endif /* _SYS_INTR_H */ Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3kfw.c =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3kfw.c (revision 301546) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3kfw.c (nonexistent) @@ -1,297 +0,0 @@ -/* - * ath3kfw.c - */ - -/*- - * Copyright (c) 2010 Maksim Yevmenkin - * All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * 1. Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE - * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF - * SUCH DAMAGE. - * - * $FreeBSD$ - */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#define ATH3KFW "ath3kfw" -#define ATH3KFW_VENDOR_ID 0x0cf3 -#define ATH3KFW_PRODUCT_ID 0x3000 -#define ATH3KFW_FW "/usr/local/etc/ath3k-1.fw" -#define ATH3KFW_BULK_EP 0x02 -#define ATH3KFW_REQ_DFU_DNLOAD 1 -#define ATH3KFW_MAX_BSIZE 4096 - -static int parse_ugen_name (char const *ugen, uint8_t *bus, - uint8_t *addr); -static int find_device (struct libusb20_backend *be, - uint8_t bus, uint8_t addr, - struct libusb20_device **dev); -static int download_firmware (struct libusb20_device *dev, - char const *firmware); -static void usage (void); - -static int vendor_id = ATH3KFW_VENDOR_ID; -static int product_id = ATH3KFW_PRODUCT_ID; - -/* - * Firmware downloader for Atheros AR3011 based USB Bluetooth devices - */ - -int -main(int argc, char **argv) -{ - uint8_t bus, addr; - char const *firmware; - struct libusb20_backend *be; - struct libusb20_device *dev; - int n; - - openlog(ATH3KFW, LOG_NDELAY|LOG_PERROR|LOG_PID, LOG_USER); - - bus = 0; - addr = 0; - firmware = ATH3KFW_FW; - - while ((n = getopt(argc, argv, "d:f:hp:v:")) != -1) { - switch (n) { - case 'd': /* ugen device name */ - if (parse_ugen_name(optarg, &bus, &addr) < 0) - usage(); - break; - - case 'f': /* firmware file */ - firmware = optarg; - break; - case 'p': /* product id */ - product_id = strtol(optarg, NULL, 0); - break; - case 'v': /* vendor id */ - vendor_id = strtol(optarg, NULL, 0); - break; - case 'h': - default: - usage(); - break; - /* NOT REACHED */ - } - } - - be = libusb20_be_alloc_default(); - if (be == NULL) { - syslog(LOG_ERR, "libusb20_be_alloc_default() failed"); - return (-1); - } - - if (find_device(be, bus, addr, &dev) < 0) { - syslog(LOG_ERR, "ugen%d.%d is not recognized as " \ - "Atheros AR3011 based device " \ - "(possibly caused by lack of permissions)", bus, addr); - return (-1); - } - - if (download_firmware(dev, firmware) < 0) { - syslog(LOG_ERR, "could not download %s firmare to ugen%d.%d", - firmware, bus, addr); - return (-1); - } - - libusb20_be_free(be); - closelog(); - - return (0); -} - -/* - * Parse ugen name and extract device's bus and address - */ - -static int -parse_ugen_name(char const *ugen, uint8_t *bus, uint8_t *addr) -{ - char *ep; - - if (strncmp(ugen, "ugen", 4) != 0) - return (-1); - - *bus = (uint8_t) strtoul(ugen + 4, &ep, 10); - if (*ep != '.') - return (-1); - - *addr = (uint8_t) strtoul(ep + 1, &ep, 10); - if (*ep != '\0') - return (-1); - - return (0); -} - -/* - * Find USB device - */ - -static int -find_device(struct libusb20_backend *be, uint8_t bus, uint8_t addr, - struct libusb20_device **dev) -{ - struct LIBUSB20_DEVICE_DESC_DECODED *desc; - - *dev = NULL; - - while ((*dev = libusb20_be_device_foreach(be, *dev)) != NULL) { - if (libusb20_dev_get_bus_number(*dev) != bus || - libusb20_dev_get_address(*dev) != addr) - continue; - - desc = libusb20_dev_get_device_desc(*dev); - if (desc == NULL) - continue; - - if (desc->idVendor != vendor_id || - desc->idProduct != product_id) - continue; - - break; - } - - return ((*dev == NULL)? -1 : 0); -} - -/* - * Download firmware - */ - -static int -download_firmware(struct libusb20_device *dev, char const *firmware) -{ - struct libusb20_transfer *bulk; - struct LIBUSB20_CONTROL_SETUP_DECODED req; - int fd, n, error; - uint8_t buf[ATH3KFW_MAX_BSIZE]; - - error = -1; - - if (libusb20_dev_open(dev, 1) != 0) { - syslog(LOG_ERR, "libusb20_dev_open() failed"); - return (error); - } - - if ((bulk = libusb20_tr_get_pointer(dev, 0)) == NULL) { - syslog(LOG_ERR, "libusb20_tr_get_pointer() failed"); - goto out; - } - - if (libusb20_tr_open(bulk, ATH3KFW_MAX_BSIZE, 1, ATH3KFW_BULK_EP) != 0) { - syslog(LOG_ERR, "libusb20_tr_open(%d, 1, %d) failed", - ATH3KFW_MAX_BSIZE, ATH3KFW_BULK_EP); - goto out; - } - - if ((fd = open(firmware, O_RDONLY)) < 0) { - syslog(LOG_ERR, "open(%s) failed. %s", - firmware, strerror(errno)); - goto out1; - } - - n = read(fd, buf, 20); - if (n != 20) { - syslog(LOG_ERR, "read(%s, 20) failed. %s", - firmware, strerror(errno)); - goto out2; - } - - LIBUSB20_INIT(LIBUSB20_CONTROL_SETUP, &req); - req.bmRequestType = LIBUSB20_REQUEST_TYPE_VENDOR; - req.bRequest = ATH3KFW_REQ_DFU_DNLOAD; - req.wLength = 20; - - if (libusb20_dev_request_sync(dev, &req, buf, NULL, 5000, 0) != 0) { - syslog(LOG_ERR, "libusb20_dev_request_sync() failed"); - goto out2; - } - - for (;;) { - n = read(fd, buf, sizeof(buf)); - if (n < 0) { - syslog(LOG_ERR, "read(%s, %d) failed. %s", - firmware, (int) sizeof(buf), strerror(errno)); - goto out2; - } - if (n == 0) - break; - - libusb20_tr_setup_bulk(bulk, buf, n, 3000); - libusb20_tr_start(bulk); - - while (libusb20_dev_process(dev) == 0) { - if (libusb20_tr_pending(bulk) == 0) - break; - - libusb20_dev_wait_process(dev, -1); - } - - if (libusb20_tr_get_status(bulk) != 0) { - syslog(LOG_ERR, "bulk transfer failed with status %d", - libusb20_tr_get_status(bulk)); - goto out2; - } - } - - error = 0; -out2: - close(fd); -out1: - libusb20_tr_close(bulk); -out: - libusb20_dev_close(dev); - - return (error); -} - -/* - * Display usage and exit - */ - -static void -usage(void) -{ - printf( -"Usage: %s -d ugenX.Y -f firmware_file\n" -"Usage: %s -h\n" \ -"Where:\n" \ -"\t-d ugenX.Y ugen device name\n" \ -"\t-f firmware image firmware image file name for download\n" \ -"\t-v vendor_id vendor id\n" \ -"\t-p vendor_id product id\n" \ -"\t-h display this message\n", ATH3KFW, ATH3KFW); - - exit(255); -} - Property changes on: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3kfw.c ___________________________________________________________________ Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/Makefile =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/Makefile (revision 301546) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/Makefile (revision 301547) @@ -1,7 +1,8 @@ # $FreeBSD$ PROG= ath3kfw MAN= ath3kfw.8 LIBADD+= usb +SRCS= main.c ath3k_fw.c ath3k_hw.c .include Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_dbg.h =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_dbg.h (nonexistent) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_dbg.h (revision 301547) @@ -0,0 +1,41 @@ +/*- + * Copyright (c) 2013 Adrian Chadd + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer, + * without modification. + * 2. Redistributions in binary form must reproduce at minimum a disclaimer + * similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any + * redistribution must be conditioned upon including a substantially + * similar Disclaimer requirement for further binary redistribution. + * + * NO WARRANTY + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY + * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL + * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, + * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF + * THE POSSIBILITY OF SUCH DAMAGES. + * + * $FreeBSD$ + */ +#ifndef __ATH3K_DEBUG_H__ +#define __ATH3K_DEBUG_H__ + +extern int ath3k_do_debug; +extern int ath3k_do_info; + +#define ath3k_debug(...) if (ath3k_do_debug) fprintf(stderr, __VA_ARGS__) +#define ath3k_err(...) fprintf(stderr, __VA_ARGS__) +#define ath3k_info(...) if (ath3k_do_info) fprintf(stdout, __VA_ARGS__) + +#endif Property changes on: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_dbg.h ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_fw.c =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_fw.c (nonexistent) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_fw.c (revision 301547) @@ -0,0 +1,114 @@ +/*- + * Copyright (c) 2013 Adrian Chadd + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer, + * without modification. + * 2. Redistributions in binary form must reproduce at minimum a disclaimer + * similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any + * redistribution must be conditioned upon including a substantially + * similar Disclaimer requirement for further binary redistribution. + * + * NO WARRANTY + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY + * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL + * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, + * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF + * THE POSSIBILITY OF SUCH DAMAGES. + * + * $FreeBSD$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "ath3k_fw.h" +#include "ath3k_dbg.h" + +int +ath3k_fw_read(struct ath3k_firmware *fw, const char *fwname) +{ + int fd; + struct stat sb; + unsigned char *buf; + ssize_t r; + int i; + + fd = open(fwname, O_RDONLY); + if (fd < 0) { + warn("%s: open: %s", __func__, fwname); + return (0); + } + + if (fstat(fd, &sb) != 0) { + warn("%s: stat: %s", __func__, fwname); + close(fd); + return (0); + } + + buf = calloc(1, sb.st_size); + if (buf == NULL) { + warn("%s: calloc", __func__); + close(fd); + return (0); + } + + i = 0; + /* XXX handle partial reads */ + r = read(fd, buf, sb.st_size); + if (r < 0) { + warn("%s: read", __func__); + free(buf); + close(fd); + return (0); + } + + if (r != sb.st_size) { + fprintf(stderr, "%s: read len %d != file size %d\n", + __func__, + (int) r, + (int) sb.st_size); + free(buf); + close(fd); + return (0); + } + + /* We have everything, so! */ + + bzero(fw, sizeof(*fw)); + + fw->fwname = strdup(fwname); + fw->len = sb.st_size; + fw->size = sb.st_size; + fw->buf = buf; + + close(fd); + return (1); +} + +void +ath3k_fw_free(struct ath3k_firmware *fw) +{ + if (fw->fwname) + free(fw->fwname); + if (fw->buf) + free(fw->buf); + bzero(fw, sizeof(*fw)); +} Property changes on: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_fw.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_fw.h =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_fw.h (nonexistent) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_fw.h (revision 301547) @@ -0,0 +1,56 @@ +/*- + * Copyright (c) 2013 Adrian Chadd + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer, + * without modification. + * 2. Redistributions in binary form must reproduce at minimum a disclaimer + * similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any + * redistribution must be conditioned upon including a substantially + * similar Disclaimer requirement for further binary redistribution. + * + * NO WARRANTY + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY + * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL + * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, + * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF + * THE POSSIBILITY OF SUCH DAMAGES. + * + * $FreeBSD$ + */ +#ifndef __ATH3K_FW_H__ +#define __ATH3K_FW_H__ + +/* + * XXX TODO: ensure that the endian-ness of this stuff is + * correct! + */ +struct ath3k_version { + unsigned int rom_version; + unsigned int build_version; + unsigned int ram_version; + unsigned char ref_clock; + unsigned char reserved[0x07]; +}; + +struct ath3k_firmware { + char *fwname; + int len; /* firmware length */ + int size; /* buffer size */ + unsigned char *buf; +}; + +extern int ath3k_fw_read(struct ath3k_firmware *fw, const char *fwname); +extern void ath3k_fw_free(struct ath3k_firmware *fw); + +#endif Property changes on: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_fw.h ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_hw.c =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_hw.c (nonexistent) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_hw.c (revision 301547) @@ -0,0 +1,358 @@ +/*- + * Copyright (c) 2013 Adrian Chadd + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer, + * without modification. + * 2. Redistributions in binary form must reproduce at minimum a disclaimer + * similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any + * redistribution must be conditioned upon including a substantially + * similar Disclaimer requirement for further binary redistribution. + * + * NO WARRANTY + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY + * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL + * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, + * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF + * THE POSSIBILITY OF SUCH DAMAGES. + * + * $FreeBSD$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "ath3k_fw.h" +#include "ath3k_hw.h" +#include "ath3k_dbg.h" + +#define XMIN(x, y) ((x) < (y) ? (x) : (y)) + +int +ath3k_load_fwfile(struct libusb_device_handle *hdl, + const struct ath3k_firmware *fw) +{ + int size, count, sent = 0; + int ret, r; + + count = fw->len; + + size = XMIN(count, FW_HDR_SIZE); + + ath3k_debug("%s: file=%s, size=%d\n", + __func__, fw->fwname, count); + + /* + * Flip the device over to configuration mode. + */ + ret = libusb_control_transfer(hdl, + LIBUSB_REQUEST_TYPE_VENDOR | LIBUSB_ENDPOINT_OUT, + ATH3K_DNLOAD, + 0, + 0, + fw->buf + sent, + size, + 1000); /* XXX timeout */ + + if (ret != size) { + fprintf(stderr, "Can't switch to config mode; ret=%d\n", + ret); + return (-1); + } + + sent += size; + count -= size; + + /* Load in the rest of the data */ + while (count) { + size = XMIN(count, BULK_SIZE); + ath3k_debug("%s: transferring %d bytes, offset %d\n", + __func__, + sent, + size); + + ret = libusb_bulk_transfer(hdl, + 0x2, + fw->buf + sent, + size, + &r, + 1000); + + if (ret < 0 || r != size) { + fprintf(stderr, "Can't load firmware: err=%s, size=%d\n", + libusb_strerror(ret), + size); + return (-1); + } + sent += size; + count -= size; + } + return (0); +} + +int +ath3k_get_state(struct libusb_device_handle *hdl, unsigned char *state) +{ + int ret; + + ret = libusb_control_transfer(hdl, + LIBUSB_REQUEST_TYPE_VENDOR | LIBUSB_ENDPOINT_IN, + ATH3K_GETSTATE, + 0, + 0, + state, + 1, + 1000); /* XXX timeout */ + + if (ret < 0) { + fprintf(stderr, + "%s: libusb_control_transfer() failed: code=%d\n", + __func__, + ret); + return (0); + } + + return (ret == 1); +} + +int +ath3k_get_version(struct libusb_device_handle *hdl, + struct ath3k_version *version) +{ + int ret; + + ret = libusb_control_transfer(hdl, + LIBUSB_REQUEST_TYPE_VENDOR | LIBUSB_ENDPOINT_IN, + ATH3K_GETVERSION, + 0, + 0, + (unsigned char *) version, + sizeof(struct ath3k_version), + 1000); /* XXX timeout */ + + if (ret < 0) { + fprintf(stderr, + "%s: libusb_control_transfer() failed: code=%d\n", + __func__, + ret); + return (0); + } + + /* XXX endian fix! */ + + return (ret == sizeof(struct ath3k_version)); +} + +int +ath3k_load_patch(libusb_device_handle *hdl, const char *fw_path) +{ + int ret; + unsigned char fw_state; + struct ath3k_version fw_ver, pt_ver; + char fwname[FILENAME_MAX]; + struct ath3k_firmware fw; + uint32_t tmp; + + ret = ath3k_get_state(hdl, &fw_state); + if (ret < 0) { + ath3k_err("%s: Can't get state\n", __func__); + return (ret); + } + + if (fw_state & ATH3K_PATCH_UPDATE) { + ath3k_info("%s: Patch already downloaded\n", + __func__); + return (0); + } + + ret = ath3k_get_version(hdl, &fw_ver); + if (ret < 0) { + ath3k_debug("%s: Can't get version\n", __func__); + return (ret); + } + + /* XXX path info? */ + snprintf(fwname, FILENAME_MAX, "%s/ar3k/AthrBT_0x%08x.dfu", + fw_path, + fw_ver.rom_version); + + /* Read in the firmware */ + if (ath3k_fw_read(&fw, fwname) <= 0) { + ath3k_debug("%s: ath3k_fw_read() failed\n", + __func__); + return (-1); + } + + /* + * Extract the ROM/build version from the patch file. + */ + memcpy(&tmp, fw.buf + fw.len - 8, sizeof(tmp)); + pt_ver.rom_version = le32toh(tmp); + memcpy(&tmp, fw.buf + fw.len - 4, sizeof(tmp)); + pt_ver.build_version = le32toh(tmp); + + ath3k_info("%s: file %s: rom_ver=%d, build_ver=%d\n", + __func__, + fwname, + (int) pt_ver.rom_version, + (int) pt_ver.build_version); + + /* Check the ROM/build version against the firmware */ + if ((pt_ver.rom_version != fw_ver.rom_version) || + (pt_ver.build_version <= fw_ver.build_version)) { + ath3k_debug("Patch file version mismatch!\n"); + ath3k_fw_free(&fw); + return (-1); + } + + /* Load in the firmware */ + ret = ath3k_load_fwfile(hdl, &fw); + + /* free it */ + ath3k_fw_free(&fw); + + return (ret); +} + +int +ath3k_load_syscfg(libusb_device_handle *hdl, const char *fw_path) +{ + unsigned char fw_state; + char filename[FILENAME_MAX]; + struct ath3k_firmware fw; + struct ath3k_version fw_ver; + int clk_value, ret; + + ret = ath3k_get_state(hdl, &fw_state); + if (ret < 0) { + ath3k_err("Can't get state to change to load configuration err"); + return (-EBUSY); + } + + ret = ath3k_get_version(hdl, &fw_ver); + if (ret < 0) { + ath3k_err("Can't get version to change to load ram patch err"); + return (ret); + } + + switch (fw_ver.ref_clock) { + case ATH3K_XTAL_FREQ_26M: + clk_value = 26; + break; + case ATH3K_XTAL_FREQ_40M: + clk_value = 40; + break; + case ATH3K_XTAL_FREQ_19P2: + clk_value = 19; + break; + default: + clk_value = 0; + break; +} + + snprintf(filename, FILENAME_MAX, "%s/ar3k/ramps_0x%08x_%d%s", + fw_path, + fw_ver.rom_version, + clk_value, + ".dfu"); + + ath3k_info("%s: syscfg file = %s\n", + __func__, + filename); + + /* Read in the firmware */ + if (ath3k_fw_read(&fw, filename) <= 0) { + ath3k_err("%s: ath3k_fw_read() failed\n", + __func__); + return (-1); + } + + ret = ath3k_load_fwfile(hdl, &fw); + + ath3k_fw_free(&fw); + return (ret); +} + +int +ath3k_set_normal_mode(libusb_device_handle *hdl) +{ + int ret; + unsigned char fw_state; + + ret = ath3k_get_state(hdl, &fw_state); + if (ret < 0) { + ath3k_err("%s: can't get state\n", __func__); + return (ret); + } + + /* + * This isn't a fatal error - the device may have detached + * already. + */ + if ((fw_state & ATH3K_MODE_MASK) == ATH3K_NORMAL_MODE) { + ath3k_debug("%s: firmware is already in normal mode\n", + __func__); + return (0); + } + + ret = libusb_control_transfer(hdl, + LIBUSB_REQUEST_TYPE_VENDOR, /* XXX out direction? */ + ATH3K_SET_NORMAL_MODE, + 0, + 0, + NULL, + 0, + 1000); /* XXX timeout */ + + if (ret < 0) { + ath3k_err("%s: libusb_control_transfer() failed: code=%d\n", + __func__, + ret); + return (0); + } + + return (ret == 0); +} + +int +ath3k_switch_pid(libusb_device_handle *hdl) +{ + int ret; + ret = libusb_control_transfer(hdl, + LIBUSB_REQUEST_TYPE_VENDOR, /* XXX set an out flag? */ + USB_REG_SWITCH_VID_PID, + 0, + 0, + NULL, + 0, + 1000); /* XXX timeout */ + + if (ret < 0) { + ath3k_debug("%s: libusb_control_transfer() failed: code=%d\n", + __func__, + ret); + return (0); + } + + return (ret == 0); +} Property changes on: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_hw.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_hw.h =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_hw.h (nonexistent) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_hw.h (revision 301547) @@ -0,0 +1,66 @@ +/*- + * Copyright (c) 2013 Adrian Chadd + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer, + * without modification. + * 2. Redistributions in binary form must reproduce at minimum a disclaimer + * similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any + * redistribution must be conditioned upon including a substantially + * similar Disclaimer requirement for further binary redistribution. + * + * NO WARRANTY + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY + * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL + * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, + * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF + * THE POSSIBILITY OF SUCH DAMAGES. + * + * $FreeBSD$ + */ +#ifndef __ATH3K_HW_H__ +#define __ATH3K_HW_H__ + +#define ATH3K_DNLOAD 0x01 +#define ATH3K_GETSTATE 0x05 +#define ATH3K_SET_NORMAL_MODE 0x07 +#define ATH3K_GETVERSION 0x09 +#define USB_REG_SWITCH_VID_PID 0x0a + +#define ATH3K_MODE_MASK 0x3F +#define ATH3K_NORMAL_MODE 0x0E + +#define ATH3K_PATCH_UPDATE 0x80 +#define ATH3K_SYSCFG_UPDATE 0x40 + +#define ATH3K_XTAL_FREQ_26M 0x00 +#define ATH3K_XTAL_FREQ_40M 0x01 +#define ATH3K_XTAL_FREQ_19P2 0x02 +#define ATH3K_NAME_LEN 0xFF + +#define USB_REQ_DFU_DNLOAD 1 +#define BULK_SIZE 4096 +#define FW_HDR_SIZE 20 + +extern int ath3k_load_fwfile(struct libusb_device_handle *hdl, + const struct ath3k_firmware *fw); +extern int ath3k_get_state(struct libusb_device_handle *hdl, + unsigned char *state); +extern int ath3k_get_version(struct libusb_device_handle *hdl, + struct ath3k_version *version); +extern int ath3k_load_patch(libusb_device_handle *hdl, const char *fw_path); +extern int ath3k_load_syscfg(libusb_device_handle *hdl, const char *fw_path); +extern int ath3k_set_normal_mode(libusb_device_handle *hdl); +extern int ath3k_switch_pid(libusb_device_handle *hdl); + +#endif Property changes on: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3k_hw.h ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3kfw.8 =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3kfw.8 (revision 301546) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/ath3kfw.8 (revision 301547) @@ -1,78 +1,90 @@ .\" Copyright (c) 2010 Maksim Yevmenkin +.\" Copyright (c) 2013, 2016 Adrian Chadd .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd November 9, 2010 +.Dd June 4, 2016 .Dt ATH3KFW 8 .Os .Sh NAME .Nm ath3kfw -.Nd firmware download utility for Atheros AR3011 chip based Bluetooth USB devices +.Nd firmware download utility for Atheros AR3011/AR3012 chip based Bluetooth USB devices .Sh SYNOPSIS .Nm .Fl d Ar device_name -.Fl f Ar firmware_file_name +.Fl f Ar firmware_path .Nm .Fl h .Sh DESCRIPTION The .Nm utility downloads the specified firmware file to the specified .Xr ugen 4 device. .Pp This utility will .Em only -work with Atheros AR3011 chip based Bluetooth USB devices. +work with Atheros AR3011 and AR3012 chip based Bluetooth USB devices. The identification is currently based on USB vendor ID/product ID pair. The vendor ID should be 0x0cf3 .Pq Dv USB_VENDOR_ATHEROS2 -and the product ID should be 0x3000. +and the product ID should be one of the supported devices. .Pp -Firmware files ath3k-1.fw and ath3k-2.fw can be obtained from the -linux-firmware RPM. +Firmware files are available in the linux-firmware RPM. .Pp +The +.Nm +utility will query the device to determine which firmware image and board +configuration to load in at runtime. +.Pp The options are as follows: .Bl -tag -width indent +.It Fl D +Enable verbose debugging. .It Fl d Ar device_name Specify .Xr ugen 4 device name. -.It Fl f Ar firmware_file_name -Specify firmware file name for download. +.It I +Enable informational debugging. +.It Fl f Ar firmware_path +Specify the directory containing the firmware files to search and upload. .It Fl h Display usage message and exit. .El .Sh EXIT STATUS .Ex -std .Sh SEE ALSO .Xr libusb 3 , .Xr ugen 4 , .Xr devd 8 .Sh AUTHORS -.An Maksim Yevmenkin Aq Mt m_evmenkin@yahoo.com +The original utility was written by +.An Maksim Yevmenkin Aq Mt m_evmenkin@yahoo.com . +This was written based on Linux ath3k by +.An Adrian Chadd Aq Mt adrian@freebsd.org . .Sh BUGS Most likely. Please report if found. Index: projects/vnet/usr.sbin/bluetooth/ath3kfw/main.c =================================================================== --- projects/vnet/usr.sbin/bluetooth/ath3kfw/main.c (nonexistent) +++ projects/vnet/usr.sbin/bluetooth/ath3kfw/main.c (revision 301547) @@ -0,0 +1,394 @@ +/*- + * Copyright (c) 2013 Adrian Chadd + * All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer, + * without modification. + * 2. Redistributions in binary form must reproduce at minimum a disclaimer + * similar to the "NO WARRANTY" disclaimer below ("Disclaimer") and any + * redistribution must be conditioned upon including a substantially + * similar Disclaimer requirement for further binary redistribution. + * + * NO WARRANTY + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTIBILITY + * AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL + * THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR SPECIAL, EXEMPLARY, + * OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF + * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS + * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER + * IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) + * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF + * THE POSSIBILITY OF SUCH DAMAGES. + * + * $FreeBSD$ + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "ath3k_fw.h" +#include "ath3k_hw.h" +#include "ath3k_dbg.h" + +#define _DEFAULT_ATH3K_FIRMWARE_PATH "/usr/share/firmware/ath3k/" + +int ath3k_do_debug = 0; +int ath3k_do_info = 0; + +struct ath3k_devid { + uint16_t product_id; + uint16_t vendor_id; + int is_3012; +}; + +static struct ath3k_devid ath3k_list[] = { + + /* Atheros AR3012 with sflash firmware*/ + { .vendor_id = 0x0489, .product_id = 0xe04e, .is_3012 = 1 }, + { .vendor_id = 0x0489, .product_id = 0xe04d, .is_3012 = 1 }, + { .vendor_id = 0x0489, .product_id = 0xe056, .is_3012 = 1 }, + { .vendor_id = 0x0489, .product_id = 0xe057, .is_3012 = 1 }, + { .vendor_id = 0x0489, .product_id = 0xe05f, .is_3012 = 1 }, + { .vendor_id = 0x04c5, .product_id = 0x1330, .is_3012 = 1 }, + { .vendor_id = 0x04ca, .product_id = 0x3004, .is_3012 = 1 }, + { .vendor_id = 0x04ca, .product_id = 0x3005, .is_3012 = 1 }, + { .vendor_id = 0x04ca, .product_id = 0x3006, .is_3012 = 1 }, + { .vendor_id = 0x04ca, .product_id = 0x3008, .is_3012 = 1 }, + { .vendor_id = 0x04ca, .product_id = 0x300b, .is_3012 = 1 }, + { .vendor_id = 0x0930, .product_id = 0x0219, .is_3012 = 1 }, + { .vendor_id = 0x0930, .product_id = 0x0220, .is_3012 = 1 }, + { .vendor_id = 0x0b05, .product_id = 0x17d0, .is_3012 = 1 }, + { .vendor_id = 0x0CF3, .product_id = 0x0036, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0x3004, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0x3005, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0x3008, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0x311D, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0x311E, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0x311F, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0x3121, .is_3012 = 1 }, + { .vendor_id = 0x0CF3, .product_id = 0x817a, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0xe004, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0xe005, .is_3012 = 1 }, + { .vendor_id = 0x0cf3, .product_id = 0xe003, .is_3012 = 1 }, + { .vendor_id = 0x13d3, .product_id = 0x3362, .is_3012 = 1 }, + { .vendor_id = 0x13d3, .product_id = 0x3375, .is_3012 = 1 }, + { .vendor_id = 0x13d3, .product_id = 0x3393, .is_3012 = 1 }, + { .vendor_id = 0x13d3, .product_id = 0x3402, .is_3012 = 1 }, + + /* Atheros AR5BBU22 with sflash firmware */ + { .vendor_id = 0x0489, .product_id = 0xE036, .is_3012 = 1 }, + { .vendor_id = 0x0489, .product_id = 0xE03C, .is_3012 = 1 }, +}; + +static int +ath3k_is_3012(struct libusb_device_descriptor *d) +{ + int i; + + /* Search looking for whether it's an AR3012 */ + for (i = 0; i < (int) nitems(ath3k_list); i++) { + if ((ath3k_list[i].product_id == d->idProduct) && + (ath3k_list[i].vendor_id == d->idVendor)) { + fprintf(stderr, "%s: found AR3012\n", __func__); + return (ath3k_list[i].is_3012); + } + } + + /* Not found */ + return (0); +} + +static libusb_device * +ath3k_find_device(libusb_context *ctx, int bus_id, int dev_id) +{ + libusb_device **list, *dev = NULL, *found = NULL; + ssize_t cnt, i; + + cnt = libusb_get_device_list(ctx, &list); + if (cnt < 0) { + ath3k_err("%s: libusb_get_device_list() failed: code %lld\n", + __func__, + (long long int) cnt); + return (NULL); + } + + /* + * XXX TODO: match on the vendor/product id too! + */ + for (i = 0; i < cnt; i++) { + dev = list[i]; + if (bus_id == libusb_get_bus_number(dev) && + dev_id == libusb_get_device_address(dev)) { + /* + * Take a reference so it's not freed later on. + */ + found = libusb_ref_device(dev); + break; + } + } + + libusb_free_device_list(list, 1); + return (found); +} + +static int +ath3k_init_ar3012(libusb_device_handle *hdl, const char *fw_path) +{ + int ret; + + ret = ath3k_load_patch(hdl, fw_path); + if (ret < 0) { + ath3k_err("Loading patch file failed\n"); + return (ret); + } + + ret = ath3k_load_syscfg(hdl, fw_path); + if (ret < 0) { + ath3k_err("Loading sysconfig file failed\n"); + return (ret); + } + + ret = ath3k_set_normal_mode(hdl); + if (ret < 0) { + ath3k_err("Set normal mode failed\n"); + return (ret); + } + + ath3k_switch_pid(hdl); + return (0); +} + +static int +ath3k_init_firmware(libusb_device_handle *hdl, const char *file_prefix) +{ + struct ath3k_firmware fw; + char fwname[FILENAME_MAX]; + int ret; + + /* XXX path info? */ + snprintf(fwname, FILENAME_MAX, "%s/ath3k-1.fw", file_prefix); + + ath3k_debug("%s: loading ath3k-1.fw\n", __func__); + + /* Read in the firmware */ + if (ath3k_fw_read(&fw, fwname) <= 0) { + fprintf(stderr, "%s: ath3k_fw_read() failed\n", + __func__); + return (-1); + } + + /* Load in the firmware */ + ret = ath3k_load_fwfile(hdl, &fw); + + /* free it */ + ath3k_fw_free(&fw); + + return (0); +} + +/* + * Parse ugen name and extract device's bus and address + */ + +static int +parse_ugen_name(char const *ugen, uint8_t *bus, uint8_t *addr) +{ + char *ep; + + if (strncmp(ugen, "ugen", 4) != 0) + return (-1); + + *bus = (uint8_t) strtoul(ugen + 4, &ep, 10); + if (*ep != '.') + return (-1); + + *addr = (uint8_t) strtoul(ep + 1, &ep, 10); + if (*ep != '\0') + return (-1); + + return (0); +} + +static void +usage(void) +{ + fprintf(stderr, + "Usage: ath3kfw (-D) -d ugenX.Y (-f firmware path) (-I)\n"); + fprintf(stderr, " -D: enable debugging\n"); + fprintf(stderr, " -d: device to operate upon\n"); + fprintf(stderr, " -f: firmware path, if not default\n"); + fprintf(stderr, " -I: enable informational output\n"); + exit(127); +} + +int +main(int argc, char *argv[]) +{ + struct libusb_device_descriptor d; + libusb_context *ctx; + libusb_device *dev; + libusb_device_handle *hdl; + unsigned char state; + struct ath3k_version ver; + int r; + uint8_t bus_id = 0, dev_id = 0; + int devid_set = 0; + int n; + char *firmware_path = NULL; + int is_3012 = 0; + + /* libusb setup */ + r = libusb_init(&ctx); + if (r != 0) { + ath3k_err("%s: libusb_init failed: code %d\n", + argv[0], + r); + exit(127); + } + + /* Enable debugging, just because */ + libusb_set_debug(ctx, 3); + + /* Parse command line arguments */ + while ((n = getopt(argc, argv, "Dd:f:hIm:p:v:")) != -1) { + switch (n) { + case 'd': /* ugen device name */ + devid_set = 1; + if (parse_ugen_name(optarg, &bus_id, &dev_id) < 0) + usage(); + break; + case 'D': + ath3k_do_debug = 1; + break; + case 'f': /* firmware path */ + if (firmware_path) + free(firmware_path); + firmware_path = strdup(optarg); + break; + case 'I': + ath3k_do_info = 1; + break; + case 'h': + default: + usage(); + break; + /* NOT REACHED */ + } + } + + /* Ensure the devid was given! */ + if (devid_set == 0) { + usage(); + /* NOTREACHED */ + } + + ath3k_debug("%s: opening dev %d.%d\n", + basename(argv[0]), + (int) bus_id, + (int) dev_id); + + /* Find a device based on the bus/dev id */ + dev = ath3k_find_device(ctx, bus_id, dev_id); + if (dev == NULL) { + ath3k_err("%s: device not found\n", __func__); + /* XXX cleanup? */ + exit(1); + } + + /* Get the device descriptor for this device entry */ + r = libusb_get_device_descriptor(dev, &d); + if (r != 0) { + warn("%s: libusb_get_device_descriptor: %s\n", + __func__, + libusb_strerror(r)); + exit(1); + } + + /* See if its an AR3012 */ + if (ath3k_is_3012(&d)) { + is_3012 = 1; + + /* If it's bcdDevice > 1, don't attach */ + if (d.bcdDevice > 0x0001) { + ath3k_debug("%s: AR3012; bcdDevice=%d, exiting\n", + __func__, + d.bcdDevice); + exit(0); + } + } + + /* XXX enforce that bInterfaceNumber is 0 */ + + /* XXX enforce the device/product id if they're non-zero */ + + /* Grab device handle */ + r = libusb_open(dev, &hdl); + if (r != 0) { + ath3k_err("%s: libusb_open() failed: code %d\n", __func__, r); + /* XXX cleanup? */ + exit(1); + } + + /* + * Get the initial NIC state. + */ + r = ath3k_get_state(hdl, &state); + if (r == 0) { + ath3k_err("%s: ath3k_get_state() failed!\n", __func__); + /* XXX cleanup? */ + exit(1); + } + ath3k_debug("%s: state=0x%02x\n", + __func__, + (int) state); + + /* And the version */ + r = ath3k_get_version(hdl, &ver); + if (r == 0) { + ath3k_err("%s: ath3k_get_version() failed!\n", __func__); + /* XXX cleanup? */ + exit(1); + } + ath3k_info("ROM version: %d, build version: %d, ram version: %d, " + "ref clock=%d\n", + ver.rom_version, + ver.build_version, + ver.ram_version, + ver.ref_clock); + + /* Default the firmware path */ + if (firmware_path == NULL) + firmware_path = strdup(_DEFAULT_ATH3K_FIRMWARE_PATH); + + if (is_3012) { + (void) ath3k_init_ar3012(hdl, firmware_path); + } else { + (void) ath3k_init_firmware(hdl, firmware_path); + } + + /* Shutdown */ + libusb_close(hdl); + hdl = NULL; + + libusb_unref_device(dev); + dev = NULL; + + libusb_exit(ctx); + ctx = NULL; +} Property changes on: projects/vnet/usr.sbin/bluetooth/ath3kfw/main.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: projects/vnet/usr.sbin/newsyslog/newsyslog.c =================================================================== --- projects/vnet/usr.sbin/newsyslog/newsyslog.c (revision 301546) +++ projects/vnet/usr.sbin/newsyslog/newsyslog.c (revision 301547) @@ -1,2673 +1,2675 @@ /*- * ------+---------+---------+-------- + --------+---------+---------+---------* * This file includes significant modifications done by: * Copyright (c) 2003, 2004 - Garance Alistair Drosehn . * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * ------+---------+---------+-------- + --------+---------+---------+---------* */ /* * This file contains changes from the Open Software Foundation. */ /* * Copyright 1988, 1989 by the Massachusetts Institute of Technology * * Permission to use, copy, modify, and distribute this software and its * documentation for any purpose and without fee is hereby granted, provided * that the above copyright notice appear in all copies and that both that * copyright notice and this permission notice appear in supporting * documentation, and that the names of M.I.T. and the M.I.T. S.I.P.B. not be * used in advertising or publicity pertaining to distribution of the * software without specific, written prior permission. M.I.T. and the M.I.T. * S.I.P.B. make no representations about the suitability of this software * for any purpose. It is provided "as is" without express or implied * warranty. * */ /* * newsyslog - roll over selected logs at the appropriate time, keeping the a * specified number of backup files around. */ #include __FBSDID("$FreeBSD$"); #define OSF #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "pathnames.h" #include "extern.h" /* * Compression suffixes */ #ifndef COMPRESS_SUFFIX_GZ #define COMPRESS_SUFFIX_GZ ".gz" #endif #ifndef COMPRESS_SUFFIX_BZ2 #define COMPRESS_SUFFIX_BZ2 ".bz2" #endif #ifndef COMPRESS_SUFFIX_XZ #define COMPRESS_SUFFIX_XZ ".xz" #endif #define COMPRESS_SUFFIX_MAXLEN MAX(MAX(sizeof(COMPRESS_SUFFIX_GZ),sizeof(COMPRESS_SUFFIX_BZ2)),sizeof(COMPRESS_SUFFIX_XZ)) /* * Compression types */ #define COMPRESS_TYPES 4 /* Number of supported compression types */ #define COMPRESS_NONE 0 #define COMPRESS_GZIP 1 #define COMPRESS_BZIP2 2 #define COMPRESS_XZ 3 /* * Bit-values for the 'flags' parsed from a config-file entry. */ #define CE_BINARY 0x0008 /* Logfile is in binary, do not add status */ /* messages to logfile(s) when rotating. */ #define CE_NOSIGNAL 0x0010 /* There is no process to signal when */ /* trimming this file. */ #define CE_TRIMAT 0x0020 /* trim file at a specific time. */ #define CE_GLOB 0x0040 /* name of the log is file name pattern. */ #define CE_SIGNALGROUP 0x0080 /* Signal a process-group instead of a single */ /* process when trimming this file. */ #define CE_CREATE 0x0100 /* Create the log file if it does not exist. */ #define CE_NODUMP 0x0200 /* Set 'nodump' on newly created log file. */ #define CE_PID2CMD 0x0400 /* Replace PID file with a shell command.*/ #define MIN_PID 5 /* Don't touch pids lower than this */ #define MAX_PID 99999 /* was lower, see /usr/include/sys/proc.h */ #define kbytes(size) (((size) + 1023) >> 10) #define DEFAULT_MARKER "" #define DEBUG_MARKER "" #define INCLUDE_MARKER "" #define DEFAULT_TIMEFNAME_FMT "%Y%m%dT%H%M%S" #define MAX_OLDLOGS 65536 /* Default maximum number of old logfiles */ struct compress_types { const char *flag; /* Flag in configuration file */ const char *suffix; /* Compression suffix */ const char *path; /* Path to compression program */ }; static const struct compress_types compress_type[COMPRESS_TYPES] = { { "", "", "" }, /* no compression */ { "Z", COMPRESS_SUFFIX_GZ, _PATH_GZIP }, /* gzip compression */ { "J", COMPRESS_SUFFIX_BZ2, _PATH_BZIP2 }, /* bzip2 compression */ { "X", COMPRESS_SUFFIX_XZ, _PATH_XZ } /* xz compression */ }; struct conf_entry { STAILQ_ENTRY(conf_entry) cf_nextp; char *log; /* Name of the log */ char *pid_cmd_file; /* PID or command file */ char *r_reason; /* The reason this file is being rotated */ int firstcreate; /* Creating log for the first time (-C). */ int rotate; /* Non-zero if this file should be rotated */ int fsize; /* size found for the log file */ uid_t uid; /* Owner of log */ gid_t gid; /* Group of log */ int numlogs; /* Number of logs to keep */ int trsize; /* Size cutoff to trigger trimming the log */ int hours; /* Hours between log trimming */ struct ptime_data *trim_at; /* Specific time to do trimming */ unsigned int permissions; /* File permissions on the log */ int flags; /* CE_BINARY */ int compress; /* Compression */ int sig; /* Signal to send */ int def_cfg; /* Using the rule for this file */ }; struct sigwork_entry { SLIST_ENTRY(sigwork_entry) sw_nextp; int sw_signum; /* the signal to send */ int sw_pidok; /* true if pid value is valid */ pid_t sw_pid; /* the process id from the PID file */ const char *sw_pidtype; /* "daemon" or "process group" */ int sw_runcmd; /* run command or send PID to signal */ char sw_fname[1]; /* file the PID was read from or shell cmd */ }; struct zipwork_entry { SLIST_ENTRY(zipwork_entry) zw_nextp; const struct conf_entry *zw_conf; /* for chown/perm/flag info */ const struct sigwork_entry *zw_swork; /* to know success of signal */ int zw_fsize; /* size of the file to compress */ char zw_fname[1]; /* the file to compress */ }; struct include_entry { STAILQ_ENTRY(include_entry) inc_nextp; const char *file; /* Name of file to process */ }; struct oldlog_entry { char *fname; /* Filename of the log file */ time_t t; /* Parsed timestamp of the logfile */ }; typedef enum { FREE_ENT, KEEP_ENT } fk_entry; STAILQ_HEAD(cflist, conf_entry); static SLIST_HEAD(swlisthead, sigwork_entry) swhead = SLIST_HEAD_INITIALIZER(swhead); static SLIST_HEAD(zwlisthead, zipwork_entry) zwhead = SLIST_HEAD_INITIALIZER(zwhead); STAILQ_HEAD(ilist, include_entry); int dbg_at_times; /* -D Show details of 'trim_at' code */ static int archtodir = 0; /* Archive old logfiles to other directory */ static int createlogs; /* Create (non-GLOB) logfiles which do not */ /* already exist. 1=='for entries with */ /* C flag', 2=='for all entries'. */ int verbose = 0; /* Print out what's going on */ static int needroot = 1; /* Root privs are necessary */ int noaction = 0; /* Don't do anything, just show it */ static int norotate = 0; /* Don't rotate */ static int nosignal; /* Do not send any signals */ static int enforcepid = 0; /* If PID file does not exist or empty, do nothing */ static int force = 0; /* Force the trim no matter what */ static int rotatereq = 0; /* -R = Always rotate the file(s) as given */ /* on the command (this also requires */ /* that a list of files *are* given on */ /* the run command). */ static char *requestor; /* The name given on a -R request */ static char *timefnamefmt = NULL;/* Use time based filenames instead of .0 */ static char *archdirname; /* Directory path to old logfiles archive */ static char *destdir = NULL; /* Directory to treat at root for logs */ static const char *conf; /* Configuration file to use */ struct ptime_data *dbg_timenow; /* A "timenow" value set via -D option */ static struct ptime_data *timenow; /* The time to use for checking at-fields */ #define DAYTIME_LEN 16 static char daytime[DAYTIME_LEN];/* The current time in human readable form, * used for rotation-tracking messages. */ static char hostname[MAXHOSTNAMELEN]; /* hostname */ static const char *path_syslogpid = _PATH_SYSLOGPID; static struct cflist *get_worklist(char **files); static void parse_file(FILE *cf, struct cflist *work_p, struct cflist *glob_p, struct conf_entry *defconf_p, struct ilist *inclist); static void add_to_queue(const char *fname, struct ilist *inclist); static char *sob(char *p); static char *son(char *p); static int isnumberstr(const char *); static int isglobstr(const char *); static char *missing_field(char *p, char *errline); static void change_attrs(const char *, const struct conf_entry *); static const char *get_logfile_suffix(const char *logfile); static fk_entry do_entry(struct conf_entry *); static fk_entry do_rotate(const struct conf_entry *); static void do_sigwork(struct sigwork_entry *); static void do_zipwork(struct zipwork_entry *); static struct sigwork_entry * save_sigwork(const struct conf_entry *); static struct zipwork_entry * save_zipwork(const struct conf_entry *, const struct sigwork_entry *, int, const char *); static void set_swpid(struct sigwork_entry *, const struct conf_entry *); static int sizefile(const char *); static void expand_globs(struct cflist *work_p, struct cflist *glob_p); static void free_clist(struct cflist *list); static void free_entry(struct conf_entry *ent); static struct conf_entry *init_entry(const char *fname, struct conf_entry *src_entry); static void parse_args(int argc, char **argv); static int parse_doption(const char *doption); static void usage(void); static int log_trim(const char *logname, const struct conf_entry *log_ent); static int age_old_log(const char *file); static void savelog(char *from, char *to); static void createdir(const struct conf_entry *ent, char *dirpart); static void createlog(const struct conf_entry *ent); static int parse_signal(const char *str); /* * All the following take a parameter of 'int', but expect values in the * range of unsigned char. Define wrappers which take values of type 'char', * whether signed or unsigned, and ensure they end up in the right range. */ #define isdigitch(Anychar) isdigit((u_char)(Anychar)) #define isprintch(Anychar) isprint((u_char)(Anychar)) #define isspacech(Anychar) isspace((u_char)(Anychar)) #define tolowerch(Anychar) tolower((u_char)(Anychar)) int main(int argc, char **argv) { struct cflist *worklist; struct conf_entry *p; struct sigwork_entry *stmp; struct zipwork_entry *ztmp; SLIST_INIT(&swhead); SLIST_INIT(&zwhead); parse_args(argc, argv); argc -= optind; argv += optind; if (needroot && getuid() && geteuid()) errx(1, "must have root privs"); worklist = get_worklist(argv); /* * Rotate all the files which need to be rotated. Note that * some users have *hundreds* of entries in newsyslog.conf! */ while (!STAILQ_EMPTY(worklist)) { p = STAILQ_FIRST(worklist); STAILQ_REMOVE_HEAD(worklist, cf_nextp); if (do_entry(p) == FREE_ENT) free_entry(p); } /* * Send signals to any processes which need a signal to tell * them to close and re-open the log file(s) we have rotated. * Note that zipwork_entries include pointers to these * sigwork_entry's, so we can not free the entries here. */ if (!SLIST_EMPTY(&swhead)) { if (noaction || verbose) printf("Signal all daemon process(es)...\n"); SLIST_FOREACH(stmp, &swhead, sw_nextp) do_sigwork(stmp); - if (noaction) - printf("\tsleep 10\n"); - else { - if (verbose) - printf("Pause 10 seconds to allow daemon(s)" - " to close log file(s)\n"); - sleep(10); + if (!(rotatereq && nosignal)) { + if (noaction) + printf("\tsleep 10\n"); + else { + if (verbose) + printf("Pause 10 seconds to allow " + "daemon(s) to close log file(s)\n"); + sleep(10); + } } } /* * Compress all files that we're expected to compress, now * that all processes should have closed the files which * have been rotated. */ if (!SLIST_EMPTY(&zwhead)) { if (noaction || verbose) printf("Compress all rotated log file(s)...\n"); while (!SLIST_EMPTY(&zwhead)) { ztmp = SLIST_FIRST(&zwhead); do_zipwork(ztmp); SLIST_REMOVE_HEAD(&zwhead, zw_nextp); free(ztmp); } } /* Now free all the sigwork entries. */ while (!SLIST_EMPTY(&swhead)) { stmp = SLIST_FIRST(&swhead); SLIST_REMOVE_HEAD(&swhead, sw_nextp); free(stmp); } while (wait(NULL) > 0 || errno == EINTR) ; return (0); } static struct conf_entry * init_entry(const char *fname, struct conf_entry *src_entry) { struct conf_entry *tempwork; if (verbose > 4) printf("\t--> [creating entry for %s]\n", fname); tempwork = malloc(sizeof(struct conf_entry)); if (tempwork == NULL) err(1, "malloc of conf_entry for %s", fname); if (destdir == NULL || fname[0] != '/') tempwork->log = strdup(fname); else asprintf(&tempwork->log, "%s%s", destdir, fname); if (tempwork->log == NULL) err(1, "strdup for %s", fname); if (src_entry != NULL) { tempwork->pid_cmd_file = NULL; if (src_entry->pid_cmd_file) tempwork->pid_cmd_file = strdup(src_entry->pid_cmd_file); tempwork->r_reason = NULL; tempwork->firstcreate = 0; tempwork->rotate = 0; tempwork->fsize = -1; tempwork->uid = src_entry->uid; tempwork->gid = src_entry->gid; tempwork->numlogs = src_entry->numlogs; tempwork->trsize = src_entry->trsize; tempwork->hours = src_entry->hours; tempwork->trim_at = NULL; if (src_entry->trim_at != NULL) tempwork->trim_at = ptime_init(src_entry->trim_at); tempwork->permissions = src_entry->permissions; tempwork->flags = src_entry->flags; tempwork->compress = src_entry->compress; tempwork->sig = src_entry->sig; tempwork->def_cfg = src_entry->def_cfg; } else { /* Initialize as a "do-nothing" entry */ tempwork->pid_cmd_file = NULL; tempwork->r_reason = NULL; tempwork->firstcreate = 0; tempwork->rotate = 0; tempwork->fsize = -1; tempwork->uid = (uid_t)-1; tempwork->gid = (gid_t)-1; tempwork->numlogs = 1; tempwork->trsize = -1; tempwork->hours = -1; tempwork->trim_at = NULL; tempwork->permissions = 0; tempwork->flags = 0; tempwork->compress = COMPRESS_NONE; tempwork->sig = SIGHUP; tempwork->def_cfg = 0; } return (tempwork); } static void free_entry(struct conf_entry *ent) { if (ent == NULL) return; if (ent->log != NULL) { if (verbose > 4) printf("\t--> [freeing entry for %s]\n", ent->log); free(ent->log); ent->log = NULL; } if (ent->pid_cmd_file != NULL) { free(ent->pid_cmd_file); ent->pid_cmd_file = NULL; } if (ent->r_reason != NULL) { free(ent->r_reason); ent->r_reason = NULL; } if (ent->trim_at != NULL) { ptime_free(ent->trim_at); ent->trim_at = NULL; } free(ent); } static void free_clist(struct cflist *list) { struct conf_entry *ent; while (!STAILQ_EMPTY(list)) { ent = STAILQ_FIRST(list); STAILQ_REMOVE_HEAD(list, cf_nextp); free_entry(ent); } free(list); list = NULL; } static fk_entry do_entry(struct conf_entry * ent) { #define REASON_MAX 80 int modtime; fk_entry free_or_keep; double diffsecs; char temp_reason[REASON_MAX]; int oversized; free_or_keep = FREE_ENT; if (verbose) printf("%s <%d%s>: ", ent->log, ent->numlogs, compress_type[ent->compress].flag); ent->fsize = sizefile(ent->log); oversized = ((ent->trsize > 0) && (ent->fsize >= ent->trsize)); modtime = age_old_log(ent->log); ent->rotate = 0; ent->firstcreate = 0; if (ent->fsize < 0) { /* * If either the C flag or the -C option was specified, * and if we won't be creating the file, then have the * verbose message include a hint as to why the file * will not be created. */ temp_reason[0] = '\0'; if (createlogs > 1) ent->firstcreate = 1; else if ((ent->flags & CE_CREATE) && createlogs) ent->firstcreate = 1; else if (ent->flags & CE_CREATE) strlcpy(temp_reason, " (no -C option)", REASON_MAX); else if (createlogs) strlcpy(temp_reason, " (no C flag)", REASON_MAX); if (ent->firstcreate) { if (verbose) printf("does not exist -> will create.\n"); createlog(ent); } else if (verbose) { printf("does not exist, skipped%s.\n", temp_reason); } } else { if (ent->flags & CE_TRIMAT && !force && !rotatereq && !oversized) { diffsecs = ptimeget_diff(timenow, ent->trim_at); if (diffsecs < 0.0) { /* trim_at is some time in the future. */ if (verbose) { ptime_adjust4dst(ent->trim_at, timenow); printf("--> will trim at %s", ptimeget_ctime(ent->trim_at)); } return (free_or_keep); } else if (diffsecs >= 3600.0) { /* * trim_at is more than an hour in the past, * so find the next valid trim_at time, and * tell the user what that will be. */ if (verbose && dbg_at_times) printf("\n\t--> prev trim at %s\t", ptimeget_ctime(ent->trim_at)); if (verbose) { ptimeset_nxtime(ent->trim_at); printf("--> will trim at %s", ptimeget_ctime(ent->trim_at)); } return (free_or_keep); } else if (verbose && noaction && dbg_at_times) { /* * If we are just debugging at-times, then * a detailed message is helpful. Also * skip "doing" any commands, since they * would all be turned off by no-action. */ printf("\n\t--> timematch at %s", ptimeget_ctime(ent->trim_at)); return (free_or_keep); } else if (verbose && ent->hours <= 0) { printf("--> time is up\n"); } } if (verbose && (ent->trsize > 0)) printf("size (Kb): %d [%d] ", ent->fsize, ent->trsize); if (verbose && (ent->hours > 0)) printf(" age (hr): %d [%d] ", modtime, ent->hours); /* * Figure out if this logfile needs to be rotated. */ temp_reason[0] = '\0'; if (rotatereq) { ent->rotate = 1; snprintf(temp_reason, REASON_MAX, " due to -R from %s", requestor); } else if (force) { ent->rotate = 1; snprintf(temp_reason, REASON_MAX, " due to -F request"); } else if (oversized) { ent->rotate = 1; snprintf(temp_reason, REASON_MAX, " due to size>%dK", ent->trsize); } else if (ent->hours <= 0 && (ent->flags & CE_TRIMAT)) { ent->rotate = 1; } else if ((ent->hours > 0) && ((modtime >= ent->hours) || (modtime < 0))) { ent->rotate = 1; } /* * If the file needs to be rotated, then rotate it. */ if (ent->rotate && !norotate) { if (temp_reason[0] != '\0') ent->r_reason = strdup(temp_reason); if (verbose) printf("--> trimming log....\n"); if (noaction && !verbose) printf("%s <%d%s>: trimming\n", ent->log, ent->numlogs, compress_type[ent->compress].flag); free_or_keep = do_rotate(ent); } else { if (verbose) printf("--> skipping\n"); } } return (free_or_keep); #undef REASON_MAX } static void parse_args(int argc, char **argv) { int ch; char *p; timenow = ptime_init(NULL); ptimeset_time(timenow, time(NULL)); strlcpy(daytime, ptimeget_ctime(timenow) + 4, DAYTIME_LEN); /* Let's get our hostname */ (void)gethostname(hostname, sizeof(hostname)); /* Truncate domain */ if ((p = strchr(hostname, '.')) != NULL) *p = '\0'; /* Parse command line options. */ while ((ch = getopt(argc, argv, "a:d:f:nrst:vCD:FNPR:S:")) != -1) switch (ch) { case 'a': archtodir++; archdirname = optarg; break; case 'd': destdir = optarg; break; case 'f': conf = optarg; break; case 'n': noaction++; /* FALLTHROUGH */ case 'r': needroot = 0; break; case 's': nosignal = 1; break; case 't': if (optarg[0] == '\0' || strcmp(optarg, "DEFAULT") == 0) timefnamefmt = strdup(DEFAULT_TIMEFNAME_FMT); else timefnamefmt = strdup(optarg); break; case 'v': verbose++; break; case 'C': /* Useful for things like rc.diskless... */ createlogs++; break; case 'D': /* * Set some debugging option. The specific option * depends on the value of optarg. These options * may come and go without notice or documentation. */ if (parse_doption(optarg)) break; usage(); /* NOTREACHED */ case 'F': force++; break; case 'N': norotate++; break; case 'P': enforcepid++; break; case 'R': rotatereq++; requestor = strdup(optarg); break; case 'S': path_syslogpid = optarg; break; case 'm': /* Used by OpenBSD for "monitor mode" */ default: usage(); /* NOTREACHED */ } if (force && norotate) { warnx("Only one of -F and -N may be specified."); usage(); /* NOTREACHED */ } if (rotatereq) { if (optind == argc) { warnx("At least one filename must be given when -R is specified."); usage(); /* NOTREACHED */ } /* Make sure "requestor" value is safe for a syslog message. */ for (p = requestor; *p != '\0'; p++) { if (!isprintch(*p) && (*p != '\t')) *p = '.'; } } if (dbg_timenow) { /* * Note that the 'daytime' variable is not changed. * That is only used in messages that track when a * logfile is rotated, and if a file *is* rotated, * then it will still rotated at the "real now" time. */ ptime_free(timenow); timenow = dbg_timenow; fprintf(stderr, "Debug: Running as if TimeNow is %s", ptimeget_ctime(dbg_timenow)); } } /* * These debugging options are mainly meant for developer use, such * as writing regression-tests. They would not be needed by users * during normal operation of newsyslog... */ static int parse_doption(const char *doption) { const char TN[] = "TN="; int res; if (strncmp(doption, TN, sizeof(TN) - 1) == 0) { /* * The "TimeNow" debugging option. This might be off * by an hour when crossing a timezone change. */ dbg_timenow = ptime_init(NULL); res = ptime_relparse(dbg_timenow, PTM_PARSE_ISO8601, time(NULL), doption + sizeof(TN) - 1); if (res == -2) { warnx("Non-existent time specified on -D %s", doption); return (0); /* failure */ } else if (res < 0) { warnx("Malformed time given on -D %s", doption); return (0); /* failure */ } return (1); /* successfully parsed */ } if (strcmp(doption, "ats") == 0) { dbg_at_times++; return (1); /* successfully parsed */ } /* XXX - This check could probably be dropped. */ if ((strcmp(doption, "neworder") == 0) || (strcmp(doption, "oldorder") == 0)) { warnx("NOTE: newsyslog always uses 'neworder'."); return (1); /* successfully parsed */ } warnx("Unknown -D (debug) option: '%s'", doption); return (0); /* failure */ } static void usage(void) { fprintf(stderr, "usage: newsyslog [-CFNPnrsv] [-a directory] [-d directory] [-f config_file]\n" " [-S pidfile] [-t timefmt] [[-R tagname] file ...]\n"); exit(1); } /* * Parse a configuration file and return a linked list of all the logs * which should be processed. */ static struct cflist * get_worklist(char **files) { FILE *f; char **given; struct cflist *cmdlist, *filelist, *globlist; struct conf_entry *defconf, *dupent, *ent; struct ilist inclist; struct include_entry *inc; int gmatch, fnres; defconf = NULL; STAILQ_INIT(&inclist); filelist = malloc(sizeof(struct cflist)); if (filelist == NULL) err(1, "malloc of filelist"); STAILQ_INIT(filelist); globlist = malloc(sizeof(struct cflist)); if (globlist == NULL) err(1, "malloc of globlist"); STAILQ_INIT(globlist); inc = malloc(sizeof(struct include_entry)); if (inc == NULL) err(1, "malloc of inc"); inc->file = conf; if (inc->file == NULL) inc->file = _PATH_CONF; STAILQ_INSERT_TAIL(&inclist, inc, inc_nextp); STAILQ_FOREACH(inc, &inclist, inc_nextp) { if (strcmp(inc->file, "-") != 0) f = fopen(inc->file, "r"); else { f = stdin; inc->file = ""; } if (!f) err(1, "%s", inc->file); if (verbose) printf("Processing %s\n", inc->file); parse_file(f, filelist, globlist, defconf, &inclist); (void) fclose(f); } /* * All config-file information has been read in and turned into * a filelist and a globlist. If there were no specific files * given on the run command, then the only thing left to do is to * call a routine which finds all files matched by the globlist * and adds them to the filelist. Then return the worklist. */ if (*files == NULL) { expand_globs(filelist, globlist); free_clist(globlist); if (defconf != NULL) free_entry(defconf); return (filelist); /* NOTREACHED */ } /* * If newsyslog was given a specific list of files to process, * it may be that some of those files were not listed in any * config file. Those unlisted files should get the default * rotation action. First, create the default-rotation action * if none was found in a system config file. */ if (defconf == NULL) { defconf = init_entry(DEFAULT_MARKER, NULL); defconf->numlogs = 3; defconf->trsize = 50; defconf->permissions = S_IRUSR|S_IWUSR; } /* * If newsyslog was run with a list of specific filenames, * then create a new worklist which has only those files in * it, picking up the rotation-rules for those files from * the original filelist. * * XXX - Note that this will copy multiple rules for a single * logfile, if multiple entries are an exact match for * that file. That matches the historic behavior, but do * we want to continue to allow it? If so, it should * probably be handled more intelligently. */ cmdlist = malloc(sizeof(struct cflist)); if (cmdlist == NULL) err(1, "malloc of cmdlist"); STAILQ_INIT(cmdlist); for (given = files; *given; ++given) { /* * First try to find exact-matches for this given file. */ gmatch = 0; STAILQ_FOREACH(ent, filelist, cf_nextp) { if (strcmp(ent->log, *given) == 0) { gmatch++; dupent = init_entry(*given, ent); STAILQ_INSERT_TAIL(cmdlist, dupent, cf_nextp); } } if (gmatch) { if (verbose > 2) printf("\t+ Matched entry %s\n", *given); continue; } /* * There was no exact-match for this given file, so look * for a "glob" entry which does match. */ gmatch = 0; if (verbose > 2 && globlist != NULL) printf("\t+ Checking globs for %s\n", *given); STAILQ_FOREACH(ent, globlist, cf_nextp) { fnres = fnmatch(ent->log, *given, FNM_PATHNAME); if (verbose > 2) printf("\t+ = %d for pattern %s\n", fnres, ent->log); if (fnres == 0) { gmatch++; dupent = init_entry(*given, ent); /* This new entry is not a glob! */ dupent->flags &= ~CE_GLOB; STAILQ_INSERT_TAIL(cmdlist, dupent, cf_nextp); /* Only allow a match to one glob-entry */ break; } } if (gmatch) { if (verbose > 2) printf("\t+ Matched %s via %s\n", *given, ent->log); continue; } /* * This given file was not found in any config file, so * add a worklist item based on the default entry. */ if (verbose > 2) printf("\t+ No entry matched %s (will use %s)\n", *given, DEFAULT_MARKER); dupent = init_entry(*given, defconf); /* Mark that it was *not* found in a config file */ dupent->def_cfg = 1; STAILQ_INSERT_TAIL(cmdlist, dupent, cf_nextp); } /* * Free all the entries in the original work list, the list of * glob entries, and the default entry. */ free_clist(filelist); free_clist(globlist); free_entry(defconf); /* And finally, return a worklist which matches the given files. */ return (cmdlist); } /* * Expand the list of entries with filename patterns, and add all files * which match those glob-entries onto the worklist. */ static void expand_globs(struct cflist *work_p, struct cflist *glob_p) { int gmatch, gres; size_t i; char *mfname; struct conf_entry *dupent, *ent, *globent; glob_t pglob; struct stat st_fm; /* * The worklist contains all fully-specified (non-GLOB) names. * * Now expand the list of filename-pattern (GLOB) entries into * a second list, which (by definition) will only match files * that already exist. Do not add a glob-related entry for any * file which already exists in the fully-specified list. */ STAILQ_FOREACH(globent, glob_p, cf_nextp) { gres = glob(globent->log, GLOB_NOCHECK, NULL, &pglob); if (gres != 0) { warn("cannot expand pattern (%d): %s", gres, globent->log); continue; } if (verbose > 2) printf("\t+ Expanding pattern %s\n", globent->log); for (i = 0; i < pglob.gl_matchc; i++) { mfname = pglob.gl_pathv[i]; /* See if this file already has a specific entry. */ gmatch = 0; STAILQ_FOREACH(ent, work_p, cf_nextp) { if (strcmp(mfname, ent->log) == 0) { gmatch++; break; } } if (gmatch) continue; /* Make sure the named matched is a file. */ gres = lstat(mfname, &st_fm); if (gres != 0) { /* Error on a file that glob() matched?!? */ warn("Skipping %s - lstat() error", mfname); continue; } if (!S_ISREG(st_fm.st_mode)) { /* We only rotate files! */ if (verbose > 2) printf("\t+ . skipping %s (!file)\n", mfname); continue; } if (verbose > 2) printf("\t+ . add file %s\n", mfname); dupent = init_entry(mfname, globent); /* This new entry is not a glob! */ dupent->flags &= ~CE_GLOB; /* Add to the worklist. */ STAILQ_INSERT_TAIL(work_p, dupent, cf_nextp); } globfree(&pglob); if (verbose > 2) printf("\t+ Done with pattern %s\n", globent->log); } } /* * Parse a configuration file and update a linked list of all the logs to * process. */ static void parse_file(FILE *cf, struct cflist *work_p, struct cflist *glob_p, struct conf_entry *defconf_p, struct ilist *inclist) { char line[BUFSIZ], *parse, *q; char *cp, *errline, *group; struct conf_entry *working; struct passwd *pwd; struct group *grp; glob_t pglob; int eol, ptm_opts, res, special; size_t i; errline = NULL; while (fgets(line, BUFSIZ, cf)) { if ((line[0] == '\n') || (line[0] == '#') || (strlen(line) == 0)) continue; if (errline != NULL) free(errline); errline = strdup(line); for (cp = line + 1; *cp != '\0'; cp++) { if (*cp != '#') continue; if (*(cp - 1) == '\\') { strcpy(cp - 1, cp); cp--; continue; } *cp = '\0'; break; } q = parse = missing_field(sob(line), errline); parse = son(line); if (!*parse) errx(1, "malformed line (missing fields):\n%s", errline); *parse = '\0'; /* * Allow people to set debug options via the config file. * (NOTE: debug options are undocumented, and may disappear * at any time, etc). */ if (strcasecmp(DEBUG_MARKER, q) == 0) { q = parse = missing_field(sob(parse + 1), errline); parse = son(parse); if (!*parse) warnx("debug line specifies no option:\n%s", errline); else { *parse = '\0'; parse_doption(q); } continue; } else if (strcasecmp(INCLUDE_MARKER, q) == 0) { if (verbose) printf("Found: %s", errline); q = parse = missing_field(sob(parse + 1), errline); parse = son(parse); if (!*parse) { warnx("include line missing argument:\n%s", errline); continue; } *parse = '\0'; if (isglobstr(q)) { res = glob(q, GLOB_NOCHECK, NULL, &pglob); if (res != 0) { warn("cannot expand pattern (%d): %s", res, q); continue; } if (verbose > 2) printf("\t+ Expanding pattern %s\n", q); for (i = 0; i < pglob.gl_matchc; i++) add_to_queue(pglob.gl_pathv[i], inclist); globfree(&pglob); } else add_to_queue(q, inclist); continue; } special = 0; working = init_entry(q, NULL); if (strcasecmp(DEFAULT_MARKER, q) == 0) { special = 1; if (defconf_p != NULL) { warnx("Ignoring duplicate entry for %s!", q); free_entry(working); continue; } defconf_p = working; } q = parse = missing_field(sob(parse + 1), errline); parse = son(parse); if (!*parse) errx(1, "malformed line (missing fields):\n%s", errline); *parse = '\0'; if ((group = strchr(q, ':')) != NULL || (group = strrchr(q, '.')) != NULL) { *group++ = '\0'; if (*q) { if (!(isnumberstr(q))) { if ((pwd = getpwnam(q)) == NULL) errx(1, "error in config file; unknown user:\n%s", errline); working->uid = pwd->pw_uid; } else working->uid = atoi(q); } else working->uid = (uid_t)-1; q = group; if (*q) { if (!(isnumberstr(q))) { if ((grp = getgrnam(q)) == NULL) errx(1, "error in config file; unknown group:\n%s", errline); working->gid = grp->gr_gid; } else working->gid = atoi(q); } else working->gid = (gid_t)-1; q = parse = missing_field(sob(parse + 1), errline); parse = son(parse); if (!*parse) errx(1, "malformed line (missing fields):\n%s", errline); *parse = '\0'; } else { working->uid = (uid_t)-1; working->gid = (gid_t)-1; } if (!sscanf(q, "%o", &working->permissions)) errx(1, "error in config file; bad permissions:\n%s", errline); q = parse = missing_field(sob(parse + 1), errline); parse = son(parse); if (!*parse) errx(1, "malformed line (missing fields):\n%s", errline); *parse = '\0'; if (!sscanf(q, "%d", &working->numlogs) || working->numlogs < 0) errx(1, "error in config file; bad value for count of logs to save:\n%s", errline); q = parse = missing_field(sob(parse + 1), errline); parse = son(parse); if (!*parse) errx(1, "malformed line (missing fields):\n%s", errline); *parse = '\0'; if (isdigitch(*q)) working->trsize = atoi(q); else if (strcmp(q, "*") == 0) working->trsize = -1; else { warnx("Invalid value of '%s' for 'size' in line:\n%s", q, errline); working->trsize = -1; } working->flags = 0; working->compress = COMPRESS_NONE; q = parse = missing_field(sob(parse + 1), errline); parse = son(parse); eol = !*parse; *parse = '\0'; { char *ep; u_long ul; ul = strtoul(q, &ep, 10); if (ep == q) working->hours = 0; else if (*ep == '*') working->hours = -1; else if (ul > INT_MAX) errx(1, "interval is too large:\n%s", errline); else working->hours = ul; if (*ep == '\0' || strcmp(ep, "*") == 0) goto no_trimat; if (*ep != '@' && *ep != '$') errx(1, "malformed interval/at:\n%s", errline); working->flags |= CE_TRIMAT; working->trim_at = ptime_init(NULL); ptm_opts = PTM_PARSE_ISO8601; if (*ep == '$') ptm_opts = PTM_PARSE_DWM; ptm_opts |= PTM_PARSE_MATCHDOM; res = ptime_relparse(working->trim_at, ptm_opts, ptimeget_secs(timenow), ep + 1); if (res == -2) errx(1, "nonexistent time for 'at' value:\n%s", errline); else if (res < 0) errx(1, "malformed 'at' value:\n%s", errline); } no_trimat: if (eol) q = NULL; else { q = parse = sob(parse + 1); /* Optional field */ parse = son(parse); if (!*parse) eol = 1; *parse = '\0'; } for (; q && *q && !isspacech(*q); q++) { switch (tolowerch(*q)) { case 'b': working->flags |= CE_BINARY; break; case 'c': working->flags |= CE_CREATE; break; case 'd': working->flags |= CE_NODUMP; break; case 'g': working->flags |= CE_GLOB; break; case 'j': working->compress = COMPRESS_BZIP2; break; case 'n': working->flags |= CE_NOSIGNAL; break; case 'r': working->flags |= CE_PID2CMD; break; case 'u': working->flags |= CE_SIGNALGROUP; break; case 'w': /* Depreciated flag - keep for compatibility purposes */ break; case 'x': working->compress = COMPRESS_XZ; break; case 'z': working->compress = COMPRESS_GZIP; break; case '-': break; case 'f': /* Used by OpenBSD for "CE_FOLLOW" */ case 'm': /* Used by OpenBSD for "CE_MONITOR" */ case 'p': /* Used by NetBSD for "CE_PLAIN0" */ default: errx(1, "illegal flag in config file -- %c", *q); } } if (eol) q = NULL; else { q = parse = sob(parse + 1); /* Optional field */ parse = son(parse); if (!*parse) eol = 1; *parse = '\0'; } working->pid_cmd_file = NULL; if (q && *q) { if (*q == '/') working->pid_cmd_file = strdup(q); else if (isalnum(*q)) goto got_sig; else { errx(1, "illegal pid file or signal in config file:\n%s", errline); } } if (eol) q = NULL; else { q = parse = sob(parse + 1); /* Optional field */ *(parse = son(parse)) = '\0'; } working->sig = SIGHUP; if (q && *q) { got_sig: working->sig = parse_signal(q); if (working->sig < 1 || working->sig >= sys_nsig) { errx(1, "illegal signal in config file:\n%s", errline); } } /* * Finish figuring out what pid-file to use (if any) in * later processing if this logfile needs to be rotated. */ if ((working->flags & CE_NOSIGNAL) == CE_NOSIGNAL) { /* * This config-entry specified 'n' for nosignal, * see if it also specified an explicit pid_cmd_file. * This would be a pretty pointless combination. */ if (working->pid_cmd_file != NULL) { warnx("Ignoring '%s' because flag 'n' was specified in line:\n%s", working->pid_cmd_file, errline); free(working->pid_cmd_file); working->pid_cmd_file = NULL; } } else if (working->pid_cmd_file == NULL) { /* * This entry did not specify the 'n' flag, which * means it should signal syslogd unless it had * specified some other pid-file (and obviously the * syslog pid-file will not be for a process-group). * Also, we should only try to notify syslog if we * are root. */ if (working->flags & CE_SIGNALGROUP) { warnx("Ignoring flag 'U' in line:\n%s", errline); working->flags &= ~CE_SIGNALGROUP; } if (needroot) working->pid_cmd_file = strdup(path_syslogpid); } /* * Add this entry to the appropriate list of entries, unless * it was some kind of special entry (eg: ). */ if (special) { ; /* Do not add to any list */ } else if (working->flags & CE_GLOB) { STAILQ_INSERT_TAIL(glob_p, working, cf_nextp); } else { STAILQ_INSERT_TAIL(work_p, working, cf_nextp); } } if (errline != NULL) free(errline); } static char * missing_field(char *p, char *errline) { if (!p || !*p) errx(1, "missing field in config file:\n%s", errline); return (p); } /* * In our sort we return it in the reverse of what qsort normally * would do, as we want the newest files first. If we have two * entries with the same time we don't really care about order. * * Support function for qsort() in delete_oldest_timelog(). */ static int oldlog_entry_compare(const void *a, const void *b) { const struct oldlog_entry *ola = a, *olb = b; if (ola->t > olb->t) return (-1); else if (ola->t < olb->t) return (1); else return (0); } /* * Check whether the file corresponding to dp is an archive of the logfile * logfname, based on the timefnamefmt format string. Return true and fill out * tm if this is the case; otherwise return false. */ static int validate_old_timelog(int fd, const struct dirent *dp, const char *logfname, struct tm *tm) { struct stat sb; size_t logfname_len; char *s; int c; logfname_len = strlen(logfname); if (dp->d_type != DT_REG) { /* * Some filesystems (e.g. NFS) don't fill out the d_type field * and leave it set to DT_UNKNOWN; in this case we must obtain * the file type ourselves. */ if (dp->d_type != DT_UNKNOWN || fstatat(fd, dp->d_name, &sb, AT_SYMLINK_NOFOLLOW) != 0 || !S_ISREG(sb.st_mode)) return (0); } /* Ignore everything but files with our logfile prefix. */ if (strncmp(dp->d_name, logfname, logfname_len) != 0) return (0); /* Ignore the actual non-rotated logfile. */ if (dp->d_namlen == logfname_len) return (0); /* * Make sure we created have found a logfile, so the * postfix is valid, IE format is: '.