Index: head/UPDATING
===================================================================
--- head/UPDATING	(revision 195653)
+++ head/UPDATING	(revision 195654)
@@ -1,1651 +1,1658 @@

Updating Information for FreeBSD current users

This file is maintained and copyrighted by M. Warner Losh <imp@village.org>. See end of file for further details. For commonly done items, please see the COMMON ITEMS: section later in the file.

Items affecting the ports and packages system can be found in /usr/ports/UPDATING. Please read that file before running portupgrade.

NOTE TO PEOPLE WHO THINK THAT FreeBSD 8.x IS SLOW: FreeBSD 8.x has many debugging features turned on, in both the kernel and userland. These features attempt to detect incorrect use of system primitives, and encourage loud failure through extra sanity checking and fail-stop semantics. They also substantially impact system performance. If you want to do performance measurement, benchmarking, or optimization, you'll want to turn them off. This includes various WITNESS-related kernel options, INVARIANTS, malloc debugging flags in userland, and various verbose features in the kernel. Many developers choose to disable these features on build machines to maximize performance. (To disable malloc debugging, run ln -s aj /etc/malloc.conf.)

20090713:
    The TOE interface to the TCP syncache has been modified to remove struct tcpopt (<netinet/tcp_var.h>) from the ABI of the network stack. The cxgb driver is the only TOE consumer affected by this change, and needs to be recompiled along with the kernel. As this change breaks the ABI, bump __FreeBSD_version to 800103.

20090712:
    Padding has been added to struct tcpcb, sackhint and tcpstat in <netinet/tcp_var.h> to facilitate future MFCs and bug fixes whilst maintaining the ABI. However, this change breaks the ABI, so bump __FreeBSD_version to 800102. User space tools that rely on the size of any of these structs (e.g. sockstat) need to be recompiled.

20090630:
    The NFS_LEGACYRPC option has been removed along with the old kernel RPC implementation that this option selected. Kernel configurations may need to be adjusted.

20090629:
    The network interface device nodes at /dev/net/<interface> have been removed. All ioctl operations can be performed the normal way using routing sockets. The kqueue functionality can generally be replaced with routing sockets.

20090628:
    The documentation from the FreeBSD Documentation Project (Handbook, FAQ, etc.) is now installed via packages by sysinstall(8), under the /usr/local/share/doc/freebsd directory instead of /usr/share/doc.

20090624:
    The ABI of various structures related to the SYSV IPC API has been changed. As a result, the COMPAT_FREEBSD[456] kernel options now all require COMPAT_FREEBSD7. Bump __FreeBSD_version to 800100.

20090622:
    The layout of struct vnet has changed as routing related variables were moved to their own Vimage module. Modules need to be recompiled. Bump __FreeBSD_version to 800099.

20090619:
    NGROUPS_MAX and NGROUPS have been increased from 16 to 1023 and 1024 respectively. As long as no more than 16 groups per process are used, no changes should be visible. When more than 16 groups are used, old binaries may fail if they call getgroups() or getgrouplist() with statically sized storage. Recompiling will work around this, but applications should be modified to use dynamically allocated storage for group arrays, as POSIX.1-2008 does not cap an implementation's number of supported groups at NGROUPS_MAX+1 as previous versions did.
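    For example, a group array can be sized at run time instead of using a static NGROUPS_MAX-sized buffer. The following C sketch is only an illustration (the helper name and error handling are not part of this entry):

        #include <sys/types.h>
        #include <stdlib.h>
        #include <unistd.h>

        /* Fetch the caller's groups into a dynamically sized array. */
        int
        get_my_groups(gid_t **gidsp)
        {
                int n = getgroups(0, NULL);     /* query the count only */

                if (n < 0)
                        return (-1);
                *gidsp = malloc((size_t)n * sizeof(gid_t));
                if (*gidsp == NULL)
                        return (-1);
                return (getgroups(n, *gidsp));  /* fill the sized array */
        }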
    NFS and portalfs mounts may also be affected, as the list of groups is truncated to 16. Users of NFS who use more than 16 groups should take care that negative group permissions are not used on the exported file systems, as they will not be reliable unless a GSSAPI based authentication method is used.

20090616:
    The compile-time option ADAPTIVE_LOCKMGRS has been introduced. This option compiles in support for adaptive spinning for lockmgr locks which want to enable it. The lockinit() function now accepts the flag LK_ADAPTIVE, which makes the lock object subject to adaptive spinning when held in either write or read mode.

20090613:
    The layout of the structure returned by IEEE80211_IOC_STA_INFO has changed. User applications that use this ioctl need to be rebuilt.

20090611:
    The layout of struct thread has changed. Kernel and modules need to be rebuilt.

20090608:
    The layout of structs ifnet, domain, protosw and vnet_net has changed. Kernel modules need to be rebuilt. Bump __FreeBSD_version to 800097.

20090602:
    window(1) has been removed from the base system. It can now be installed from ports. The port is called misc/window.

20090601:
    The way we are storing and accessing `routing table' entries has changed. Programs reading the FIB, like netstat, need to be recompiled.

20090601:
    A new netisr implementation has been added for FreeBSD 8. Network protocol modules, such as igmp, ipdivert, and others, should be rebuilt. Bump __FreeBSD_version to 800096.

20090530:
    Remove the tunable/sysctl debug.mpsafevfs, as its initial purpose is no longer valid.

20090530:
    Add VOP_ACCESSX(9). File system modules need to be rebuilt. Bump __FreeBSD_version to 800094.

20090529:
    Add the mnt_xflag field to 'struct mount'. File system modules need to be rebuilt. Bump __FreeBSD_version to 800093.

20090528:
    The compile-time option ADAPTIVE_SX has been retired and the option NO_ADAPTIVE_SX, which handles the reversed logic, has been introduced. The KPI for sx_init_flags() changes accordingly: the SX_ADAPTIVESPIN flag has been retired and the SX_NOADAPTIVE flag has been introduced to handle the reversed logic. Bump __FreeBSD_version to 800092.

20090527:
    Add support for hierarchical jails. Remove the global securelevel. Bump __FreeBSD_version to 800091.

20090523:
    The layout of struct vnet_net has changed, therefore modules need to be rebuilt. Bump __FreeBSD_version to 800090.

20090523:
    The newly imported zic(8) produces output in a new format. Please run tzsetup(8) to install the newly created data to /etc/localtime.

20090520:
    The sysctl tree for the usb stack has been renamed from hw.usb2.* to hw.usb.* and is now consistent again with previous releases.

20090520:
    802.11 monitor mode support was revised and driver APIs were changed. Drivers dependent on net80211 now support DLT_IEEE802_11_RADIO instead of DLT_IEEE802_11. No user-visible data structures were changed, but applications that use DLT_IEEE802_11 may require changes. Bump __FreeBSD_version to 800088.

20090430:
    The layout of the following structs has changed: sysctl_oid, socket, ifnet, inpcbinfo, tcpcb, syncache_head, vnet_inet, vnet_inet6 and vnet_ipfw. Most modules need to be rebuilt or panics may be experienced. A world rebuild is required for correctly checking networking state from userland. Bump __FreeBSD_version to 800085.

20090429:
    MLDv2 and Source-Specific Multicast (SSM) have been merged to the IPv6 stack. VIMAGE hooks are in but not yet used. The implementation of SSM within FreeBSD's IPv6 stack closely follows the IPv4 implementation.
    For kernel developers:

    * The most important changes are that the ip6_output() and ip6_input() paths no longer take the IN6_MULTI_LOCK, and this lock has been downgraded to a non-recursive mutex.

    * As with the changes to the IPv4 stack to support SSM, filtering of inbound multicast traffic must now be performed by transport protocols within the IPv6 stack. This does not apply to TCP and SCTP; however, it does apply to UDP in IPv6 and raw IPv6.

    * The KPIs used by IPv6 multicast are similar to those used by the IPv4 stack, with the following differences:
      * im6o_mc_filter() is analogous to imo_multicast_filter().
      * The legacy KAME entry points in6_joingroup() and in6_leavegroup() are shimmed to in6_mc_join() and in6_mc_leave() respectively.
      * IN6_LOOKUP_MULTI() has been deprecated and removed.
      * IPv6 relies on MLD for the DAD mechanism. KAME's internal KPIs for MLDv1 have an additional 'timer' argument which is used to jitter the initial membership report for the solicited-node multicast membership on-link.
      * This is not strictly needed for MLDv2, which already jitters its report transmissions. However, the 'timer' argument is preserved in case MLDv1 is active on the interface.

    * The KAME linked-list based IPv6 membership implementation has been refactored to use a vector similar to that used by the IPv4 stack. Code which maintains a list of its own multicast memberships internally, e.g. carp, has been updated to reflect the new semantics.

    * There is a known Lock Order Reversal (LOR) due to in6_setscope() acquiring the IF_AFDATA_LOCK and being called within ip6_output(). Whilst MLDv2 tries to avoid this otherwise benign LOR, it is an implementation constraint which needs to be addressed in HEAD.

    For application developers:

    * The changes are broadly similar to those made for the IPv4 stack.

    * The use of IPv4 and IPv6 multicast socket options on the same socket, using mapped addresses, HAS NOT been tested or supported.

    * There are a number of issues with the implementation of various IPv6 multicast APIs which need to be resolved in the API surface before the implementation is fully compatible with KAME userland use, and these are mostly to do with interface index treatment.

    * The literature available discusses the use of either the delta / ASM API with setsockopt(2)/getsockopt(2), or the full-state / ASM API using setsourcefilter(3)/getsourcefilter(3). For more information please refer to RFC 3678, 'Socket Interface Extensions for Multicast Source Filters'.

    * Applications which use the published RFC 3678 APIs should be fine.

    For systems administrators:

    * The mtest(8) utility has been refactored to support IPv6, in addition to IPv4. Interface addresses are no longer accepted as arguments; their names must be used instead. The utility will map the interface name to its first IPv4 address as returned by getifaddrs(3).

    * The ifmcstat(8) utility has also been updated to print the MLDv2 endpoint state and source filter lists via sysctl(3).

    * The net.inet6.ip6.mcast.loop sysctl may be tuned to 0 to disable loopback of IPv6 multicast datagrams by default; it defaults to 1 to preserve the existing behaviour. Disabling multicast loopback is recommended for optimal system performance.

    * The IPv6 MROUTING code has been changed to examine this sysctl instead of attempting to perform a group lookup before looping back forwarded datagrams.

    Bump __FreeBSD_version to 800084.

20090422:
    Implement the low-level Bluetooth HCI API. Bump __FreeBSD_version to 800083.
20090419:
    The layout of struct malloc_type, used by modules to register new memory allocation types, has changed. Most modules will need to be rebuilt or panics may be experienced. Bump __FreeBSD_version to 800081.

20090415:
    Anticipate overflowing inp_flags - add inp_flags2. This changes most offsets in inpcb, so checking v4 connection state will require a world rebuild. Bump __FreeBSD_version to 800080.

20090415:
    Add an llentry to struct route and struct route_in6. Modules embedding a struct route will need to be recompiled. Bump __FreeBSD_version to 800079.

20090414:
    The size of rt_metrics_lite and by extension rtentry has changed. Networking administration apps will need to be recompiled. The route command now supports 'show' as an alias for 'get', weighting of routes, and sticky and nostick flags to alter the behavior of stateful load balancing. Bump __FreeBSD_version to 800078.

20090408:
    Do not use Giant for kbdmux(4) locking. This was wrong and apparently causing more problems than it solved. This will re-open the issue where interrupt handlers may race with kbdmux(4) in polling mode. Typical symptoms include (but are not limited to) duplicated and/or missing characters when low level console functions (such as gets) are used while interrupts are enabled (for example the geli password prompt, the mountroot prompt, etc.). Disabling kbdmux(4) may help.

20090407:
    The size of structs vnet_net, vnet_inet and vnet_ipfw has changed; kernel modules referencing any of the above need to be recompiled. Bump __FreeBSD_version to 800075.

20090320:
    GEOM_PART has become the default partition slicer for storage devices, replacing the GEOM_MBR, GEOM_BSD, GEOM_PC98 and GEOM_GPT slicers. It introduces some changes:

    MSDOS/EBR: the devices created from MSDOS extended partition entries (EBR) can be named differently than with GEOM_MBR and are now symlinks to devices with offset-based names. fstabs may need to be modified.

    BSD: the "geometry does not match label" warning is harmless in most cases but it points to problems in file system misalignment with disk geometry. The "c" partition is now implicit, covers the whole top-level drive and cannot be (mis)used by users.

    General: Kernel dumps are now not allowed to be written to devices whose partition types indicate they are meant to be used for file systems (or, in the case of MSDOS partitions, as something other than the "386BSD" type).

    Most of these changes date approximately from 200812.

20090319:
    The uscanner(4) driver has been removed from the kernel. This follows Linux removing theirs in 2.6 and making libusb the default interface (supported by sane).

20090319:
    The multicast forwarding code has been cleaned up. netstat(1) only relies on KVM now for printing bandwidth upcall meters. The IPv4 and IPv6 modules are split into ip_mroute_mod and ip6_mroute_mod respectively. The config(5) options for statically compiling this code remain the same, i.e. 'options MROUTING'.

20090315:
    Support for the IFF_NEEDSGIANT network interface flag has been removed, which means that non-MPSAFE network device drivers are no longer supported. In particular, if_ar, if_sr, and network device drivers from the old (legacy) USB stack can no longer be built or used.

20090313:
    POSIX.1 Native Language Support (NLS) has been enabled in libc and a bunch of new language catalog files have also been added. This means that some common libc messages are now localized and depend on the LC_MESSAGES environment variable.
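    For example, provided a catalog for the chosen language is installed (the locale below is only an illustration), the errno portion of a libc error message follows LC_MESSAGES:

        env LC_MESSAGES=es_ES.ISO8859-1 cat /nonexistent

    should print the "No such file or directory" part of the message in the selected language rather than in English.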
20090313:
    The k8temp(4) driver has been renamed to amdtemp(4) since support for K10 and K11 CPU families was added.

20090309:
    IGMPv3 and Source-Specific Multicast (SSM) have been merged to the IPv4 stack. VIMAGE hooks are in but not yet used.

    For kernel developers, the most important changes are that the ip_output() and ip_input() paths no longer take the IN_MULTI_LOCK(), and this lock has been downgraded to a non-recursive mutex.

    Transport protocols (UDP, Raw IP) are now responsible for filtering inbound multicast traffic according to group membership and source filters. The imo_multicast_filter() KPI exists for this purpose. Transports which do not use multicast (SCTP, TCP) already reject multicast by default. Forwarding and receive performance may improve as a mutex acquisition is no longer needed in the ip_input() low-level input path. in_addmulti() and in_delmulti() are shimmed to new KPIs which exist to support SSM in-kernel.

    For application developers, it is recommended that loopback of multicast datagrams be disabled for best performance, as this will still cause the lock to be taken for each looped-back datagram transmission. The net.inet.ip.mcast.loop sysctl may be tuned to 0 to disable loopback by default; it defaults to 1 to preserve the existing behaviour.

    For systems administrators, to obtain best performance with multicast reception and multiple groups, it is always recommended that a card with a suitably precise hash filter is used. Hash collisions will still result in the lock being taken within the transport protocol input path to check group membership. If deploying FreeBSD in an environment with IGMP snooping switches, it is recommended that the net.inet.igmp.sendlocal sysctl remain enabled; this forces 224.0.0.0/24 group membership to be announced via IGMP.

    The size of 'struct igmpstat' has changed; netstat needs to be recompiled to reflect this. Bump __FreeBSD_version to 800070.

20090309:
    libusb20.so.1 is now installed as libusb.so.1 and the ports system has been updated to use it. This requires a buildworld/installworld in order to update the library and dependencies (usbconfig, etc). It's advisable to rebuild all ports which use libusb. More specific directions are given in the ports collection UPDATING file. Any /etc/libmap.conf entries for libusb are no longer required and can be removed.

20090302:
    A workaround has been committed to allow the creation of System V shared memory segments of size > 2 GB on 64-bit architectures. Due to a limitation of the existing ABI, the shm_segsz member of struct shmid_ds, returned by the shmctl(IPC_STAT) call, is wrong for large segments. Note that limits must be explicitly raised to allow such segments to be created.

20090301:
    The layout of struct ifnet has changed, requiring a rebuild of all network device driver modules.

20090227:
    The /dev handling for the new USB stack has changed; a buildworld/installworld is required for libusb20.

20090223:
    The new USB2 stack has now been permanently moved in and all kernel and module names reverted to their previous values (e.g. usb, ehci, ohci, ums, ...). The old usb stack can be compiled in by prefixing the name with the letter 'o'; the old usb modules have been removed. Updating entry 20090216 for xorg and 20090215 for libmap may still apply.

20090217:
    The rc.conf(5) option if_up_delay has been renamed to defaultroute_delay to better reflect its purpose. If you have customized this setting in /etc/rc.conf you need to update it to use the new name.
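    For example, an /etc/rc.conf line such as (the value shown is only an illustration):

        if_up_delay="30"

    should now read:

        defaultroute_delay="30"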
20090216:
    xorg 7.4 wants to configure its input devices via hald, which does not yet work with USB2. If the keyboard/mouse does not work in xorg then add

        Option "AllowEmptyInput" "off"

    to your ServerLayout section. This will cause X to use the configured kbd and mouse sections from your xorg.conf.

20090215:
    The GENERIC kernels for all architectures now default to the new USB2 stack. No kernel config options or code have been removed, so if a problem arises please report it and optionally revert to the old USB stack. If you are loading USB kernel modules or have a custom kernel that includes GENERIC then ensure that usb names are also changed over, e.g. uftdi -> usb2_serial_ftdi.

    Older programs linked against the ports libusb 0.1 need to be redirected to the new stack's libusb20. /etc/libmap.conf can be used for this:

        # Map old usb library to new one for usb2 stack
        libusb-0.1.so.8 libusb20.so.1

20090203:
    The ichsmb(4) driver has been changed to require that SMBus slave addresses be left-justified (xxxxxxx0b) rather than right-justified. All of the other SMBus controller drivers require left-justified slave addresses, so this change makes all the drivers provide the same interface.

20090201:
    INET6 statistics (struct ip6stat) were updated. netstat(1) needs to be recompiled.

20090119:
    NTFS has been removed from the GENERIC kernel on amd64 to match GENERIC on i386. This should not cause any issues since mount_ntfs(8) will load the ntfs.ko module automatically when NTFS support is actually needed, unless ntfs.ko is not installed or the security level prohibits loading kernel modules. If either is the case, "options NTFS" has to be added to the kernel config.

20090115:
    TCP Appropriate Byte Counting (RFC 3465) support has been added to the kernel. The new field in struct tcpcb breaks the ABI, so bump __FreeBSD_version to 800061. User space tools that rely on the size of struct tcpcb in tcp_var.h (e.g. sockstat) need to be recompiled.

20081225:
    The ng_tty(4) module has been updated to match the new TTY subsystem. Due to the API change, user-level applications must be updated. New API support has been added to mpd5 CVS and is expected to be present in the next mpd5.3 release.

20081219:
    With __FreeBSD_version 800060 the makefs tool is part of the base system (it was a port).

20081216:
    The afdata and ifnet locks have been changed from mutexes to rwlocks; network modules will need to be re-compiled.

20081214:
    __FreeBSD_version 800059 incorporates the new arp-v2 rewrite. The RTF_CLONING, RTF_LLINFO and RTF_WASCLONED flags are eliminated. The new code reduces struct rtentry{} by 16 bytes on 32-bit architectures and 40 bytes on 64-bit architectures. The userland applications "arp" and "ndp" have been updated accordingly. The output from "netstat -r" shows only routing entries and none of the L2 information.

20081130:
    __FreeBSD_version 800057 marks the switchover from the binary ath hal to source code. Users must add the line:

        options AH_SUPPORT_AR5416

    to their kernel config files when specifying:

        device ath_hal

    The ath_hal module no longer exists; the code is now compiled together with the driver in the ath module. It is now possible to tailor chip support (i.e. reduce the set of chips and thereby the code size); consult ath_hal(4) for details.

20081121:
    __FreeBSD_version 800054 adds memory barriers to <machine/atomic.h>, new interfaces to ifnet to facilitate multiple hardware transmit queues for cards that support them, and a lock-less ring-buffer implementation to enable drivers to more efficiently manage queueing of packets.

20081117:
    A new version of ZFS (version 13) has been merged to -HEAD.
    This version has the zpool attribute "listsnapshots" off by default, which means "zfs list" does not show snapshots; this is the same as the Solaris behavior.

20081028:
    The dummynet(4) ABI has changed. ipfw(8) needs to be recompiled.

20081009:
    The uhci, ohci, ehci and slhci USB Host controller drivers have been put into separate modules. If you load the usb module separately through loader.conf you will need to load the appropriate *hci module as well. E.g. for a UHCI-based USB 2.0 controller add the following to loader.conf:

        uhci_load="YES"
        ehci_load="YES"

20081009:
    The ABI used by the PMC toolset has changed. Please keep userland (libpmc(3)) and the kernel module (hwpmc(4)) in sync.

20080820:
    The TTY subsystem of the kernel has been replaced by a new implementation, which provides better scalability and an improved driver model. Most common drivers have been migrated to the new TTY subsystem, while others have not. The following drivers have not yet been ported to the new TTY layer:

    PCI/ISA: cy, digi, rc, rp, sio
    USB: ubser, ucycom
    Line disciplines: ng_h4, ng_tty, ppp, sl, snp

    Adding these drivers to your kernel configuration file will cause compilation to fail.

20080818:
    ntpd has been upgraded to 4.2.4p5.

20080801:
    OpenSSH has been upgraded to 5.1p1.

    For many years, FreeBSD's version of OpenSSH preferred DSA over RSA for host and user authentication keys. With this upgrade, we've switched to the vendor's default of RSA over DSA. This may cause upgraded clients to warn about unknown host keys even for previously known hosts. Users should follow the usual procedure for verifying host keys before accepting the RSA key.

    This can be circumvented by setting the "HostKeyAlgorithms" option to "ssh-dss,ssh-rsa" in ~/.ssh/config or on the ssh command line.

    Please note that the sequence of keys offered for authentication has been changed as well. You may want to specify IdentityFile in a different order to revert this behavior.

20080713:
    The sio(4) driver has been removed from the i386 and amd64 kernel configuration files. This means uart(4) is now the default serial port driver on those platforms as well.

    To prevent collisions with the sio(4) driver, the uart(4) driver uses different names for its device nodes. This means the onboard serial port will now most likely be called "ttyu0" instead of "ttyd0". You may need to reconfigure applications to use the new device names.

    When using the serial port as a boot console, be sure to update /boot/device.hints and /etc/ttys before booting the new kernel. If you forget to do so, you can still manually specify the hints at the loader prompt:

        set hint.uart.0.at="isa"
        set hint.uart.0.port="0x3F8"
        set hint.uart.0.flags="0x10"
        set hint.uart.0.irq="4"
        boot -s

20080609:
    The gpt(8) utility has been removed. Use gpart(8) to partition disks instead.

20080603:
    The version that the Linuxulator emulates has been changed from 2.4.2 to 2.6.16. If you experience any problems with Linux binaries please try to set the sysctl compat.linux.osrelease to 2.4.2, and if that fixes the problem contact the emulation mailing list.

20080525:
    ISDN4BSD (I4B) was removed from the src tree. You may need to update your kernel configuration and remove relevant entries.

20080509:
    I have checked in code to support multiple routing tables. See the man pages setfib(1) and setfib(2). This is a hopefully backwards compatible version, but to make use of it you need to compile your kernel with options ROUTETABLES=2 (or more, up to 16).
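    For example (the FIB number below is only an illustration), a kernel built with the extra routing tables lets you run a command whose routing lookups use an alternate FIB:

        options ROUTETABLES=2   # in the kernel config file

        setfib 1 sh             # start a shell that uses FIB 1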
20080420:
    The 802.11 wireless support was redone to enable multi-bss operation on devices that are capable. The underlying device is no longer used directly; instead wlanX devices are cloned with ifconfig. This requires changes to rc.conf files. For example, change:

        ifconfig_ath0="WPA DHCP"

    to

        wlans_ath0=wlan0
        ifconfig_wlan0="WPA DHCP"

    see rc.conf(5) for more details. In addition, a mergemaster of /etc/rc.d is highly recommended. A simultaneous update of userland and kernel wouldn't hurt either.

    As part of the multi-bss changes the wlan_scan_ap and wlan_scan_sta modules were merged into the base wlan module. All references to these modules (e.g. in kernel config files) must be removed.

20080408:
    psm(4) has gained write(2) support in native operation level. Arbitrary commands can be written to /dev/psm%d and status can be read back from it. Therefore, an application is responsible for status validation and error recovery. It is a no-op in other operation levels.

20080312:
    Support for KSE threading has been removed from the kernel. To run legacy applications linked against KSE, libmap.conf may be used. The following libmap.conf may be used to ensure compatibility with any prior release:

        libpthread.so.1 libthr.so.1
        libpthread.so.2 libthr.so.2
        libkse.so.3 libthr.so.3

20080301:
    The layout of struct vmspace has changed. This affects libkvm and any executables that link against libkvm and use the kvm_getprocs() function. In particular, but not exclusively, it affects ps(1), fstat(1), pkill(1), systat(1), top(1) and w(1). The effects are minimal, but it's advisable to upgrade world nonetheless.

20080229:
    The latest em driver no longer supports the 82575 adapter; that support has moved to the igb driver. The split was done to make new features that are incompatible with older hardware easier to implement.

20080220:
    The new geom_lvm(4) geom class has been renamed to geom_linux_lvm(4); likewise the kernel option is now GEOM_LINUX_LVM.

20080211:
    The default NFS mount mode has changed from UDP to TCP for increased reliability. If you rely on (insecurely) NFS mounting across a firewall you may need to update your firewall rules.

20080208:
    Belatedly note the addition of m_collapse for compacting mbuf chains.

20080126:
    The fts(3) structures have been changed to use adequate integer types for their members and so to be able to cope with huge file trees. The old fts(3) ABI is preserved through symbol versioning in libc, so third-party binaries using fts(3) should still work, although they will not take advantage of the extended types. At the same time, some third-party software might fail to build after this change due to unportable assumptions made in its source code about fts(3) structure members. Such software should be fixed by its vendor or, in the worst case, in the ports tree. FreeBSD_version 800015 marks this change for the unlikely case that a portable fix is impossible.

20080123:
    To upgrade to -current after this date, you must be running FreeBSD not older than 6.0-RELEASE. Upgrading to -current from 5.x now requires a stopover at RELENG_6 or RELENG_7 systems.

20071128:
    The ADAPTIVE_GIANT kernel option has been retired because its functionality is the default now.

20071118:
    The AT keyboard emulation of sunkbd(4) has been turned on by default.
    In order to make the special symbols of the Sun keyboards driven by sunkbd(4) work under X, these now have to be configured the same way as Sun USB keyboards driven by ukbd(4) (which also does AT keyboard emulation), e.g.:

        Option "XkbLayout" "us"
        Option "XkbRules" "xorg"
        Option "XkbSymbols" "pc(pc105)+sun_vndr/usb(sun_usb)+us"

20071024:
    It has been decided that it is desirable to provide ABI backwards compatibility to the FreeBSD 4/5/6 versions of the PCIOCGETCONF, PCIOCREAD and PCIOCWRITE IOCTLs, which was broken with the introduction of PCI domain support (see the 20070930 entry). Unfortunately, this required the ABI of PCIOCGETCONF to be broken again in order to be able to provide backwards compatibility to the old version of that IOCTL. Thus consumers of PCIOCGETCONF have to be recompiled again. Among prominent ports, this time neither pciutils nor xorg-server is affected; the hal port, however, needs to be rebuilt.

20071020:
    The misnamed kthread_create() and friends have been renamed to kproc_create() etc. Many of the callers already used kproc_start(). I will return kthread_create() and friends in a while with implementations that actually create threads, not procs. The renaming corresponds with version 800002.

20071010:
    RELENG_7 branched.

20071009:
    Setting WITHOUT_LIBPTHREAD now means WITHOUT_LIBKSE and WITHOUT_LIBTHR are set.

20070930:
    The PCI code has been made aware of PCI domains. This means that the location strings as used by pciconf(8) etc. are now in the following format: pci<domain>:<bus>:<slot>[:<function>]. It also means that consumers of <sys/pciio.h> potentially need to be recompiled; this includes the hal and xorg-server ports.

20070928:
    The caching daemon (cached) was renamed to nscd. The nscd.conf configuration file should be used instead of cached.conf, and the nscd_enable, nscd_pidfile and nscd_flags options should be used instead of cached_enable, cached_pidfile and cached_flags in rc.conf.

20070921:
    The getfacl(1) utility now prints the owning user and group name instead of the owning uid and gid in the three line comment header. This is the same behavior as getfacl(1) on Solaris and Linux.

20070704:
    The new IPsec code is now compiled in using the IPSEC option. The IPSEC option now requires "device crypto" be defined in your kernel configuration. The FAST_IPSEC kernel option is now deprecated.

20070702:
    The packet filter (pf) code has been updated to OpenBSD 4.1. Please note the changed syntax - keep state is now on by default. Also note the fact that ftp-proxy(8) has been changed from the bottom up and has been moved from libexec to usr/sbin. Changes in the ALTQ handling also affect users of IPFW's ALTQ capabilities.

20070701:
    Remove KAME IPsec in favor of FAST_IPSEC, which is now the only IPsec supported by FreeBSD. The new IPsec stack supports both IPv4 and IPv6. The kernel option will change after the code changes have settled in. For now the kernel option IPSEC is deprecated and FAST_IPSEC is the only option; that will change after some settling time.

20070701:
    The wicontrol(8) utility has been removed from the base system. wi(4) cards should be configured using ifconfig(8); see the man page for more information.

20070612:
    The i386/amd64 GENERIC kernel now defaults to the nfe(4) driver instead of the nve(4) driver. Please update your configuration accordingly.

20070612:
    By default, /etc/rc.d/sendmail no longer rebuilds the aliases database if it is missing or older than the aliases file. If desired, set the new rc.conf option sendmail_rebuild_aliases to "YES" to restore that functionality.
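    For example, to restore the old behaviour add the following to /etc/rc.conf; alternatively, rebuild the database once by hand with newaliases(1):

        sendmail_rebuild_aliases="YES"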
20070612:
    The IPv4 multicast socket code has been considerably modified, and moved to the file sys/netinet/in_mcast.c. Initial support for the RFC 3678 Source-Specific Multicast Socket API has been added to the IPv4 network stack.

    Strict multicast and broadcast reception is now the default for UDP/IPv4 sockets; the net.inet.udp.strict_mcast_mship sysctl variable has now been removed.

    The RFC 1724 hack for interface selection has been removed; the use of the Linux-derived ip_mreqn structure with IP_MULTICAST_IF has been added to replace it. Consumers such as routed will soon be updated to reflect this. These changes affect users who are running routed(8) or rdisc(8) from the FreeBSD base system on point-to-point or unnumbered interfaces.

20070610:
    The net80211 layer has changed significantly and all wireless drivers that depend on it need to be recompiled. Further, these changes require that any program that interacts with the wireless support in the kernel be recompiled; this includes: ifconfig, wpa_supplicant, hostapd, and wlanstats.

    Users must also, for the moment, kldload the wlan_scan_sta and/or wlan_scan_ap modules if they use modules for wireless support. These modules implement scanning support for station and ap modes, respectively. Failure to load the appropriate module before marking a wireless interface up will result in a message to the console and the device not operating properly.

20070610:
    The pam_nologin(8) module ceases to provide an authentication function and starts providing an account management function. Consequent changes to /etc/pam.d should be brought in using mergemaster(8). Third-party files in /usr/local/etc/pam.d may need manual editing as follows. Locate this line (or similar):

        auth     required    pam_nologin.so    no_warn

    and change it according to this example:

        account  required    pam_nologin.so    no_warn

    That is, the first word needs to be changed from "auth" to "account". The new line can be moved to the account section within the file for clarity. Not updating pam.conf(5) files will result in nologin(5) being ignored by the respective services.

20070529:
    The ether_ioctl() function has been synchronized with ioctl(2) and ifnet.if_ioctl. Due to that, the size of one of its arguments has changed on 64-bit architectures. All kernel modules using ether_ioctl() need to be rebuilt on such architectures.

20070516:
    Improved INCLUDE_CONFIG_FILE support has been introduced to the config(8) utility. In order to take advantage of this new functionality, you are expected to recompile and install src/usr.sbin/config. If you don't rebuild config(8), and your kernel configuration depends on INCLUDE_CONFIG_FILE, the kernel build will be broken because of a missing "kernconfstring" symbol.

20070513:
    Symbol versioning is enabled by default. To disable it, use option WITHOUT_SYMVER. It is not advisable to attempt to disable symbol versioning once it is enabled; your installworld will break because a symbol-version-less libc will get installed before the install tools. As a result, the old install tools, which previously had symbol dependencies on FBSD_1.0, will fail because the freshly installed libc will not have them.

    The default threading library (providing "libpthread") has been changed to libthr. If you wish to have libkse as your default, use option DEFAULT_THREAD_LIB=libkse for the buildworld.

20070423:
    The ABI breakage in sendmail(8)'s libmilter has been repaired, so it is no longer necessary to recompile mail filters (aka milters).
    If you recompiled mail filters after the 20070408 note, it is not necessary to recompile them again.

20070417:
    The new trunk(4) driver has been renamed to lagg(4) as it better reflects its purpose. ifconfig will need to be recompiled.

20070408:
    sendmail(8) has been updated to version 8.14.1. Mail filters (aka milters) compiled against the libmilter included in the base operating system should be recompiled.

20070302:
    Firmwares for ipw(4) and iwi(4) are now included in the base tree. In order to use them one must agree to the respective LICENSE in share/doc/legal and define legal.intel_<driver>.license_ack=1 via loader.conf(5) or kenv(1). Make sure to deinstall the now deprecated modules from the respective firmware ports.

20070228:
    The name resolution/mapping functions addr2ascii(3) and ascii2addr(3) were removed from FreeBSD's libc. These originally came from INRIA IPv6. Nothing in FreeBSD ever used them. They may be regarded as deprecated in previous releases. The AF_LINK support for getnameinfo(3) was merged from NetBSD to replace it as a more portable (and re-entrant) API.

20070224:
    To support interrupt filtering a modification to the newbus API has occurred; the ABI was broken and __FreeBSD_version was bumped to 700031. Please make sure that your kernel and modules are in sync. For more info: http://docs.freebsd.org/cgi/mid.cgi?20070221233124.GA13941

20070224:
    The IPv6 multicast forwarding code may now be loaded into GENERIC kernels by loading the ip_mroute.ko module. This is built into the module unless the WITHOUT_INET6 or WITHOUT_INET6_SUPPORT options are set; see src.conf(5) for more information.

20070214:
    The output of netstat -r has changed. Without -n, we now only print a "network name" without the prefix length if the network address and mask exactly match a Class A/B/C network, and an entry exists in the nsswitch "networks" map. With -n, we print the full unabbreviated CIDR network prefix in the form "a.b.c.d/p". 0.0.0.0/0 is always printed as "default". This change is in preparation for changes such as equal-cost multipath, and to more generally assist operational deployment of FreeBSD as a modern IPv4 router.

20070210:
    PIM has been turned on by default in the IPv4 multicast routing code. The kernel option 'PIM' has now been removed. PIM is now built by default if option 'MROUTING' is specified. It may now be loaded into GENERIC kernels by loading the ip_mroute.ko module.

20070207:
    Support for IPIP tunnels (VIFF_TUNNEL) in IPv4 multicast routing has been removed. Its functionality may be achieved by explicitly configuring gif(4) interfaces and using the 'phyint' keyword in mrouted.conf. XORP does not support source-routed IPv4 multicast tunnels nor the integrated IPIP tunneling, therefore it is not affected by this change. The __FreeBSD_version macro has been bumped to 700030.

20061221:
    Support for PCI Message Signalled Interrupts has been re-enabled in the bge driver, only for those chips which are believed to support it properly. If there are any problems, MSI can be disabled completely by setting the 'hw.pci.enable_msi' and 'hw.pci.enable_msix' tunables to 0 in the loader.

20061214:
    Support for PCI Message Signalled Interrupts has been disabled again in the bge driver. Many revisions of the hardware fail to support it properly. Support can be re-enabled by removing the #define of BGE_DISABLE_MSI in "src/sys/dev/bge/if_bge.c".

20061214:
    Support for PCI Message Signalled Interrupts has been added to the bge driver.
    If there are any problems, MSI can be disabled completely by setting the 'hw.pci.enable_msi' and 'hw.pci.enable_msix' tunables to 0 in the loader.

20061205:
    The removal of several facets of the experimental Threading system from the kernel means that the proc and thread structures have changed quite a bit. I suggest all kernel modules that might reference these structures be recompiled, especially the linux module.

20061126:
    The sound infrastructure has been updated with various fixes and improvements. Most of the changes are pretty much transparent, with the exception of the following:

    1) All sound driver specific sysctls (hw.snd.pcm%d.*) have been moved to their own dev sysctl nodes, for example:

        hw.snd.pcm0.vchans -> dev.pcm.0.vchans

    2) /dev/dspr%d.%d has been deprecated. Each channel now has its own chardev in the form of "dsp%d.<p|r|v>%d", where <p|r|v> is p = playback, r = record and v = virtual, respectively. Users are encouraged to use these devs instead of the (old) "/dev/dsp%d.%d". This does not affect those who are using "/dev/dsp".

20061122:
    geom(4)'s gmirror(8) class metadata structure has been rev'd from v3 to v4. If you update across this point and your metadata is converted for you, you will not easily be able to downgrade since the /boot/kernel.old/geom_mirror.ko kernel module will be unable to read the v4 metadata. You can resolve this by doing, from the loader(8) prompt:

        set vfs.root.mountfrom="ufs:/dev/XXX"

    where XXX is the root slice of one of the disks that composed the mirror (i.e.: /dev/ad0s1a). You can then rebuild the array the same way you built it originally.

20061122:
    The following binaries have been disconnected from the build: mount_devfs, mount_ext2fs, mount_fdescfs, mount_procfs, mount_linprocfs, and mount_std. The functionality of these programs has been moved into the mount program. For example, to mount a devfs filesystem, instead of using mount_devfs, use: "mount -t devfs". This does not affect entries in /etc/fstab, since entries in /etc/fstab are always processed with "mount -t fstype".

20061113:
    Support for PCI Message Signalled Interrupts on i386 and amd64 has been added to the kernel, and various drivers will soon be updated to use MSI when it is available. If there are any problems, MSI can be disabled completely by setting the 'hw.pci.enable_msi' and 'hw.pci.enable_msix' tunables to 0 in the loader.

20061110:
    The MUTEX_PROFILING option has been renamed to LOCK_PROFILING. The lockmgr object layout has been changed as a result of having a lock_object embedded in it. As a consequence all file system kernel modules must be re-compiled. The mutex profiling man page has not yet been updated to reflect this change.

20061026:
    KSE in the kernel has now been made optional and is turned on by default. Use 'nooption KSE' in your kernel config to turn it off. All kernel modules *must* be recompiled after this change. Thereafter, modules from a KSE kernel should be compatible with modules from a NOKSE kernel due to the temporary padding fields added to 'struct proc'.

20060929:
    mrouted and its utilities have been removed from the base system.

20060927:
    Some ioctl(2) command codes have changed. Full backward ABI compatibility is provided if the "options COMPAT_FREEBSD6" is present in the kernel configuration file. Make sure to add this option to your kernel config file, or recompile X.Org and the rest of ports; otherwise they may refuse to work.

20060924:
    tcpslice has been removed from the base system.
20060913:
    The sizes of struct tcpcb (and struct xtcpcb) have changed due to the rewrite of TCP syncookies. Tools like netstat, sockstat, and systat need to be rebuilt.

20060903:
    libpcap updated to v0.9.4 and tcpdump to v3.9.4.

20060816:
    The IPFIREWALL_FORWARD_EXTENDED option is gone and the behaviour for IPFIREWALL_FORWARD is now as it was before when it was first committed and for years after. The behaviour is now ON.

20060725:
    The enigma(1)/crypt(1) utility has been changed on 64-bit architectures. Now it can decrypt files created from different architectures. Unfortunately, it is no longer able to decrypt a cipher text generated with an older version on 64-bit architectures. If you have such a file, you need the old utility to decrypt it.

20060709:
    The interface version of the i4b kernel part has changed. So after updating the kernel sources and compiling a new kernel, the i4b user space tools in "/usr/src/usr.sbin/i4b" must also be rebuilt, and vice versa.

20060627:
    The XBOX kernel now defaults to the nfe(4) driver instead of the nve(4) driver. Please update your configuration accordingly.

20060514:
    The i386-only lnc(4) driver for the AMD Am7900 LANCE and Am79C9xx PCnet family of NICs has been removed. The new le(4) driver serves as an equivalent but cross-platform replacement, with the pcn(4) driver still providing performance-optimized support for the subset of AMD Am79C971 PCnet-FAST and greater chips as before.

20060511:
    The machdep.* sysctls and the adjkerntz utility have been modified a bit. The new adjkerntz utility uses the new sysctl names and sysctlbyname() calls, so it may be impossible to run an old /sbin/adjkerntz utility in single-user mode with a new kernel. Replace the `adjkerntz -i' step before `make installworld' with:

        /usr/obj/usr/src/sbin/adjkerntz/adjkerntz -i

    and proceed as usual with the rest of the installworld-stage steps. Otherwise, you risk installing binaries with their timestamp set several hours in the future, especially if you are running with local time set to GMT+X hours.

20060428:
    The puc(4) driver has been overhauled. The ebus(4) and sbus(4) attachments have been removed. Make sure to configure scc(4) on sparc64. Note also that by default puc(4) will use uart(4) and not sio(4) for serial ports because interrupt handling has been optimized for multi-port serial cards and only uart(4) implements the interface to support it.

20060412:
    The ip6fw utility has been removed. The behavior provided by ip6fw has been in ipfw2 for a good while and the rc.d scripts have been updated to deal with it. There are some rules that might not migrate cleanly. Use rc.firewall6 as a template to rewrite rules.

20060330:
    The scc(4) driver replaces puc(4) for Serial Communications Controllers (SCCs) like the Siemens SAB82532 and the Zilog Z8530. On sparc64, it is advised to add scc(4) to the kernel configuration to make sure that the serial ports remain functional.

20060317:
    Most world/kernel related NO_* build options changed names. New knobs have the common prefixes WITHOUT_*/WITH_* (modelled after FreeBSD ports) and should be set in /etc/src.conf (the src.conf(5) manpage is provided). Full backwards compatibility is maintained for the time being, though it's highly recommended to start moving old options out of the system-wide /etc/make.conf file into the new /etc/src.conf while also properly renaming them. More conversions will likely follow.
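    For example, an old-style knob in /etc/make.conf such as (NO_PROFILE is only an illustration):

        NO_PROFILE=     true

    moves to /etc/src.conf under its new name:

        WITHOUT_PROFILE=yes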
    Posting to current@: http://lists.freebsd.org/pipermail/freebsd-current/2006-March/061725.html

20060305:
    The NETSMBCRYPTO kernel option has been retired because its functionality is always included in NETSMB and smbfs.ko now.

20060303:
    The TDFX_LINUX kernel option was retired and replaced by the tdfx_linux device. The latter can be loaded as the 3dfx_linux.ko kernel module. Loading it alone should suffice to get 3dfx support for Linux apps because it will pull in 3dfx.ko and linux.ko through its dependencies.

20060204:
    The 'audit' group was added to support the new auditing functionality in the base system. Be sure to follow the directions for updating, including the requirement to run mergemaster -p.

20060201:
    The kernel ABI to file system modules was changed on i386. Please make sure that your kernel and modules are in sync.

20060118:
    This actually occurred some time ago, but installing the kernel now also installs a bunch of symbol files for the kernel modules. This increases the size of /boot/kernel to about 67 Mbytes. You will need twice this if you will eventually back this up to kernel.old on your next install. If you have a shortage of room in your root partition, you should add -DINSTALL_NODEBUG to your make arguments or add INSTALL_NODEBUG="yes" to your /etc/make.conf.

20060113:
    libc's malloc implementation has been replaced. This change has the potential to uncover application bugs that previously went unnoticed. See the malloc(3) manual page for more details.

20060112:
    The generic netgraph(4) cookie has been changed. If you upgrade the kernel past this point, you also need to upgrade userland and netgraph(4) utilities like ports/net/mpd or ports/net/mpd4.

20060106:
    si(4)'s device files now contain the unit number. Uses of {cua,tty}A[0-9a-f] should be replaced by {cua,tty}A0[0-9a-f].

20060106:
    The kernel ABI was mostly destroyed due to a change in the size of struct lock_object, which is nested in other structures such as mutexes, which are nested in all sorts of other structures. Make sure your kernel and modules are in sync.

20051231:
    The page coloring algorithm in the VM subsystem was converted from tuning with kernel options to autotuning. Please remove any PQ_* option except PQ_NOOPT from your kernel config.

20051211:
    The net80211-related tools in the tools/tools/ath directory have been moved to tools/tools/net80211 and renamed with a "wlan" prefix. Scripts that use them should be adjusted accordingly.

20051202:
    Scripts in the local_startup directories (as defined in /etc/defaults/rc.conf) that have the new rc.d semantics will now be run as part of the base system rcorder. If there are errors or problems with one of these local scripts, it could cause boot problems. If you encounter such problems, boot in single user mode and remove that script from the */rc.d directory. Please report the problem to the port's maintainer, and the freebsd-ports@freebsd.org mailing list.

20051129:
    The nodev mount option was deprecated in RELENG_6 (where it was a no-op), and is now unsupported. If you have nodev or dev listed in /etc/fstab, remove it; otherwise it will result in a mount error.

20051129:
    The ABI between ipfw(4) and ipfw(8) has been changed. You need to rebuild ipfw(8) when rebuilding the kernel.

20051108:
    rp(4)'s device files now contain the unit number. Uses of {cua,tty}R[0-9a-f] should be replaced by {cua,tty}R0[0-9a-f].

20051029:
    /etc/rc.d/ppp-user has been renamed to /etc/rc.d/ppp.
    Its /etc/rc.conf.d configuration file has been `ppp' from the beginning, and hence there is no need to touch it.

20051014:
    Now most modules get their build-time options from the kernel configuration file. A few modules still have fixed options due to their non-conformant implementation, but they will be corrected eventually. You may need to review the options of the modules in use, explicitly specify the non-default options in the kernel configuration file, and rebuild the kernel and modules afterwards.

20051001:
    The kern.polling.enable sysctl MIB is now deprecated. Use ifconfig(8) to enable polling(4) on your interfaces.

20050927:
    The old bridge(4) implementation was retired. The new if_bridge(4) serves as a fully functional replacement.

20050722:
    The ai_addrlen of a struct addrinfo was changed to a socklen_t to conform to POSIX-2001. This change broke ABI compatibility on 64-bit architectures. You have to recompile userland programs that use getaddrinfo(3) on 64-bit architectures.

20050711:
    RELENG_6 branched here.

20050629:
    The pccard_ifconfig rc.conf variable has been removed and a new variable, ifconfig_DEFAULT, has been introduced. Unlike pccard_ifconfig, ifconfig_DEFAULT applies to ALL interfaces that do not have ifconfig_ifn entries, rather than just those in removable_interfaces.

20050616:
    Some previous versions of PAM have permitted the use of non-absolute paths in /etc/pam.conf or /etc/pam.d/* when referring to third party PAM modules in /usr/local/lib. A change has been made to require the use of absolute paths in order to avoid ambiguity and dependence on library path configuration, which may affect existing configurations.

20050610:
    Major changes to the network interface API. All drivers must be recompiled. Drivers not in the base system will need to be updated to the new APIs.

20050609:
    Changes were made to kinfo_proc in sys/user.h. Please recompile userland, or commands like `fstat', `pkill', `ps', `top' and `w' will not behave correctly.

    The API and ABI for hwpmc(4) have changed with the addition of sampling support. Please recompile lib/libpmc(3) and usr.sbin/{pmcstat,pmccontrol}.

20050606:
    The OpenBSD dhclient was imported in place of the ISC dhclient and the network interface configuration scripts were updated accordingly. If you use DHCP to configure your interfaces, you must now run devd. Also, DNS updating was lost so you will need to find a workaround if you use this feature.

    The '_dhcp' user was added to support the OpenBSD dhclient. Be sure to run mergemaster -p (like you are supposed to do every time anyway).

20050605:
    if_bridge was added to the tree. This has changed struct ifnet. Please recompile userland and all network related modules.

20050603:
    The n_net member of a struct netent was changed to a uint32_t, and the 1st argument of getnetbyaddr() was changed to a uint32_t, to conform to POSIX-2001. These changes broke ABI compatibility on 64-bit architectures. With these changes, the shlib major of libpcap was bumped. You have to recompile userland programs that use getnetbyaddr(3), getnetbyname(3), getnetent(3) and/or libpcap on 64-bit architectures.

20050528:
    Kernel parsing of extra options on '#!' first lines of shell scripts has changed. Lines with multiple options likely will fail after this date. For full details, please see http://people.freebsd.org/~gad/Updating-20050528.txt

20050503:
    The packet filter (pf) code has been updated to OpenBSD 3.7. Please note the changed anchor syntax and the fact that authpf(8) now needs a mounted fdescfs(5) to function.
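    For example, authpf's requirement can be satisfied by mounting fdescfs once by hand:

        mount -t fdescfs fdesc /dev/fd

    or at every boot via an /etc/fstab entry:

        fdesc   /dev/fd         fdescfs         rw      0       0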
20050415:
    The NO_MIXED_MODE kernel option has been removed from the i386 and amd64 platforms as its use has been superseded by the new local APIC timer code. Any kernel config files containing this option should be updated.

20050227:
    The on-disk format of LC_CTYPE files was changed to be machine independent. Please make sure NOT to use NO_CLEAN buildworld when crossing this point. Crossing this point also requires recompilation or reinstallation of all locale-dependent packages.

20050225:
    The ifi_epoch member of struct if_data has been changed to contain the uptime at which the interface was created or the statistics zeroed, rather than the wall clock time, because wall clock time may go backwards. This should have no impact unless an snmp implementation is using this value (I know of none at this point.)

20050224:
    The acpi_perf and acpi_throttle drivers are now part of the acpi(4) main module. They are no longer built separately.

20050223:
    The layout of struct image_params has changed. You have to recompile all compatibility modules (linux, svr4, etc) for use with the new kernel.

20050223:
    The p4tcc driver has been merged into cpufreq(4). This makes "options CPU_ENABLE_TCC" obsolete. Please load cpufreq.ko or compile in "device cpufreq" to restore this functionality.

20050220:
    The responsibility of recomputing the file system summary of a SoftUpdates-enabled dirty volume has been transferred to the background fsck. A rebuild of the fsck(8) utility is recommended if you have updated the kernel. To get the old behavior (recompute file system summary at mount time), you can set vfs.ffs.compute_summary_at_mount=1 before mounting the new volume.

20050206:
    The cpufreq import is complete. As part of this, the sysctls for acpi(4) throttling have been removed. The power_profile script has been updated, so you can use performance/economy_cpu_freq in rc.conf(5) to set AC on/offline cpu frequencies.

20050206:
    NG_VERSION has been increased. Recompiling the kernel (or ng_socket.ko) requires recompiling libnetgraph and the userland netgraph utilities.

20050114:
    Support for abbreviated forms of a number of ipfw options is now deprecated. Warnings are printed to stderr indicating the correct full form when a match occurs. Some abbreviations may be supported at a later date based on user feedback. To be considered for support, abbreviations must be in use prior to this commit and unlikely to be confused with current key words.

20041221:
    By popular demand, a lot of NOFOO options were renamed to NO_FOO (see bsd.compat.mk for a full list). The old spellings are still supported, but will cause annoying warnings on stderr. Make sure you upgrade properly (see the COMMON ITEMS: section later in this file).

20041219:
    Auto-loading of ancillary wlan modules such as wlan_wep has been temporarily disabled; you need to statically configure the modules you need into your kernel or explicitly load them prior to use. Specifically, if you intend to use WEP encryption with an 802.11 device load/configure wlan_wep; if you want to use WPA with the ath driver load/configure wlan_tkip, wlan_ccmp, and wlan_xauth as required.

20041213:
    The behaviour of ppp(8) has changed slightly. If lqr is enabled (``enable lqr''), older versions would revert to LCP ECHO mode on negotiation failure. Now, ``enable echo'' is required for this behaviour. The ppp version number has been bumped to 3.4.2 to reflect the change.

20041201:
    The wlan support has been updated to split the crypto support into separate modules.
    For static WEP you must configure the wlan_wep module in your system or build and install the module in a place where it can be loaded (the kernel will auto-load the module when a wep key is configured).

20041201:
    The ath driver has been updated to split the tx rate control algorithm into a separate module. You need to include either ath_rate_onoe or ath_rate_amrr when configuring the kernel.

20041116:
    Support for systems with an 80386 CPU has been removed. Please use FreeBSD 5.x or earlier on systems with an 80386.

20041110:
    We have had a hack which would mount the root filesystem R/W if the device were named 'md*'. As part of the vnode work I'm doing I have had to remove this hack. People building systems which use preloaded MD root filesystems may need to insert a "/sbin/mount -u -o rw /dev/md0 /" in their /etc/rc scripts.

20041104:
    FreeBSD 5.3 shipped here.

20041102:
    The size of struct tcpcb has changed again due to the removal of RFC1644 T/TCP. You have to recompile userland programs that read kmem for tcp sockets directly (netstat, sockstat, etc.)

20041022:
    The size of struct tcpcb has changed. You have to recompile userland programs that read kmem for tcp sockets directly (netstat, sockstat, etc.)

20041016:
    RELENG_5 branched here. For older entries, please see updating in the RELENG_5 branch.

COMMON ITEMS:

    General Notes
    -------------
    Avoid using make -j when upgrading. From time to time in the past there have been problems using -j with buildworld and/or installworld. This is especially true when upgrading between "distant" versions (eg one that crosses a major release boundary or several minor releases, or when several months have passed on the -current branch).

    Sometimes, obscure build problems are the result of environment poisoning. This can happen because the make utility reads its environment when searching for values for global variables. To run your build attempts in an "environmental clean room", prefix all make commands with 'env -i '. See the env(1) manual page for more details.

    When upgrading from one major version to another it is generally best to upgrade to the latest code in the currently installed branch first, then do an upgrade to the new branch. This is the best-tested upgrade path, and has the highest probability of being successful. Please try this approach before reporting problems with a major version upgrade.

    To build a kernel
    -----------------
    If you are updating from a prior version of FreeBSD (even one just a few days old), you should follow this procedure. It is the most failsafe as it uses a /usr/obj tree with a fresh mini-buildworld,

        make kernel-toolchain
        make -DALWAYS_CHECK_MAKE buildkernel KERNCONF=YOUR_KERNEL_HERE
        make -DALWAYS_CHECK_MAKE installkernel KERNCONF=YOUR_KERNEL_HERE

    To test a kernel once
    ---------------------
    If you just want to boot a kernel once (because you are not sure if it works, or if you want to boot a known bad kernel to provide debugging information) run

        make installkernel KERNCONF=YOUR_KERNEL_HERE KODIR=/boot/testkernel
        nextboot -k testkernel

    To just build a kernel when you know that it won't mess you up
    --------------------------------------------------------------
    This assumes you are already running a 5.X system. Replace ${arch} with the architecture of your machine (e.g. "i386", "alpha", "amd64", "ia64", "pc98", "sparc64", etc).

        cd src/sys/${arch}/conf
        config KERNEL_NAME_HERE
        cd ../compile/KERNEL_NAME_HERE
        make depend
        make
        make install

    If this fails, go to the "To build a kernel" section.
To rebuild everything and install it on the current system. ----------------------------------------------------------- # Note: sometimes if you are running current you gotta do more than # is listed here if you are upgrading from a really old current. make buildworld make kernel KERNCONF=YOUR_KERNEL_HERE [1] [3] mergemaster -p [5] make installworld make delete-old mergemaster [4] To cross-install current onto a separate partition -------------------------------------------------- # In this approach we use a separate partition to hold # current's root, 'usr', and 'var' directories. A partition # holding "/", "/usr" and "/var" should be about 2GB in # size. make buildworld make buildkernel KERNCONF=YOUR_KERNEL_HERE make installworld DESTDIR=${CURRENT_ROOT} make distribution DESTDIR=${CURRENT_ROOT} # if newfs'd make installkernel KERNCONF=YOUR_KERNEL_HERE DESTDIR=${CURRENT_ROOT} cp /etc/fstab ${CURRENT_ROOT}/etc/fstab # if newfs'd To upgrade in-place from 5.x-stable to current ---------------------------------------------- make buildworld [9] make kernel KERNCONF=YOUR_KERNEL_HERE [8] [1] [3] mergemaster -p [5] make installworld make delete-old mergemaster -i [4] Make sure that you've read the UPDATING file to understand the tweaks to various things you need. At this point in the life cycle of current, things change often and you are on your own to cope. The defaults can also change, so please read ALL of the UPDATING entries. Also, if you are tracking -current, you must be subscribed to freebsd-current@freebsd.org. Make sure that before you update your sources that you have read and understood all the recent messages there. If in doubt, please track -stable which has much fewer pitfalls. [1] If you have third party modules, such as vmware, you should disable them at this point so they don't crash your system on reboot. [3] From the bootblocks, boot -s, and then do fsck -p mount -u / mount -a cd src adjkerntz -i # if CMOS is wall time Also, when doing a major release upgrade, it is required that you boot into single user mode to do the installworld. [4] Note: This step is non-optional. Failure to do this step can result in a significant reduction in the functionality of the system. Attempting to do it by hand is not recommended and those that pursue this avenue should read this file carefully, as well as the archives of freebsd-current and freebsd-hackers mailing lists for potential gotchas. [5] Usually this step is a noop. However, from time to time you may need to do this if you get unknown user in the following step. It never hurts to do it all the time. You may need to install a new mergemaster (cd src/usr.sbin/mergemaster && make install) after the buildworld before this step if you last updated from current before 20020224 or from -stable before 20020408. [8] In order to have a kernel that can run the 4.x binaries needed to do an installworld, you must include the COMPAT_FREEBSD4 option in your kernel. Failure to do so may leave you with a system that is hard to boot to recover. A similar kernel option COMPAT_FREEBSD5 is required to run the 5.x binaries on more recent kernels. Make sure that you merge any new devices from GENERIC since the last time you updated your kernel config file. [9] When checking out sources, you must include the -P flag to have cvs prune empty directories. If CPUTYPE is defined in your /etc/make.conf, make sure to use the "?=" instead of the "=" assignment operator, so that buildworld can override the CPUTYPE if it needs to. 
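        For example, a minimal make.conf entry of this form (assuming
        a machine with a Pentium 4 class CPU; see
        /usr/share/examples/etc/make.conf for the values supported on
        your architecture):

        CPUTYPE?=pentium4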
        MAKEOBJDIRPREFIX must be defined in an environment variable,
        and not on the command line, or in /etc/make.conf. buildworld
        will warn if it is improperly defined.

FORMAT:

This file contains a list, in reverse chronological order, of major
breakages in tracking -current. Not all things will be listed here,
and it only starts on October 16, 2004. Updating files can be found in
previous releases if your system is older than this.

Copyright information:

Copyright 1998-2005 M. Warner Losh. All Rights Reserved.

Redistribution, publication, translation and use, with or without
modification, in full or in part, in any form or format of this
document are permitted without further permission from the author.

THIS DOCUMENT IS PROVIDED BY WARNER LOSH ``AS IS'' AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL WARNER LOSH BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

If you find this document useful, and you want to, you may buy the
author a beer.

Contact Warner Losh if you have any questions about your use of this
document.

$FreeBSD$

Index: head/sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c
===================================================================
--- head/sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c (revision 195653)
+++ head/sys/dev/cxgb/ulp/tom/cxgb_cpl_io.c (revision 195654)
@@ -1,4468 +1,4469 @@
/**************************************************************************

Copyright (c) 2007-2008, Chelsio Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

 1. Redistributions of source code must retain the above copyright notice,
    this list of conditions and the following disclaimer.

 2. Neither the name of the Chelsio Corporation nor the names of its
    contributors may be used to endorse or promote products derived from
    this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
***************************************************************************/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if __FreeBSD_version >= 800044 #include #else #define V_tcp_do_autosndbuf tcp_do_autosndbuf #define V_tcp_autosndbuf_max tcp_autosndbuf_max #define V_tcp_do_rfc1323 tcp_do_rfc1323 #define V_tcp_do_autorcvbuf tcp_do_autorcvbuf #define V_tcp_autorcvbuf_max tcp_autorcvbuf_max #define V_tcpstat tcpstat #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if __FreeBSD_version >= 800056 #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* * For ULP connections HW may add headers, e.g., for digests, that aren't part * of the messages sent by the host but that are part of the TCP payload and * therefore consume TCP sequence space. Tx connection parameters that * operate in TCP sequence space are affected by the HW additions and need to * compensate for them to accurately track TCP sequence numbers. This array * contains the compensating extra lengths for ULP packets. It is indexed by * a packet's ULP submode. */ const unsigned int t3_ulp_extra_len[] = {0, 4, 4, 8}; #ifdef notyet /* * This sk_buff holds a fake header-only TCP segment that we use whenever we * need to exploit SW TCP functionality that expects TCP headers, such as * tcp_create_openreq_child(). It's a RO buffer that may be used by multiple * CPUs without locking. */ static struct mbuf *tcphdr_mbuf __read_mostly; #endif /* * Size of WRs in bytes. Note that we assume all devices we are handling have * the same WR size. */ static unsigned int wrlen __read_mostly; /* * The number of WRs needed for an skb depends on the number of page fragments * in the skb and whether it has any payload in its main body. This maps the * length of the gather list represented by an skb into the # of necessary WRs. */ static unsigned int mbuf_wrs[TX_MAX_SEGS + 1] __read_mostly; /* * Max receive window supported by HW in bytes. Only a small part of it can * be set through option0, the rest needs to be set through RX_DATA_ACK. */ #define MAX_RCV_WND ((1U << 27) - 1) /* * Min receive window. We want it to be large enough to accommodate receive * coalescing, handle jumbo frames, and not trigger sender SWS avoidance. 
 */
#define MIN_RCV_WND (24 * 1024U)
#define INP_TOS(inp) ((inp_ip_tos_get(inp) >> 2) & M_TOS)

#define VALIDATE_SEQ 0
#define VALIDATE_SOCK(so)
#define DEBUG_WR 0

#define TCP_TIMEWAIT    1
#define TCP_CLOSE       2
#define TCP_DROP        3

static void t3_send_reset(struct toepcb *toep);
static void send_abort_rpl(struct mbuf *m, struct toedev *tdev, int rst_status);
static inline void free_atid(struct t3cdev *cdev, unsigned int tid);
static void handle_syncache_event(int event, void *arg);

static inline void
SBAPPEND(struct sockbuf *sb, struct mbuf *n)
{
    struct mbuf *m;

    m = sb->sb_mb;
    while (m) {
        KASSERT(((m->m_flags & M_EXT) && (m->m_ext.ext_type == EXT_EXTREF)) ||
            !(m->m_flags & M_EXT),
            ("unexpected type M_EXT=%d ext_type=%d m_len=%d\n",
            !!(m->m_flags & M_EXT), m->m_ext.ext_type, m->m_len));
        KASSERT(m->m_next != (struct mbuf *)0xffffffff,
            ("bad next value m_next=%p m_nextpkt=%p m_flags=0x%x",
            m->m_next, m->m_nextpkt, m->m_flags));
        m = m->m_next;
    }
    m = n;
    while (m) {
        KASSERT(((m->m_flags & M_EXT) && (m->m_ext.ext_type == EXT_EXTREF)) ||
            !(m->m_flags & M_EXT),
            ("unexpected type M_EXT=%d ext_type=%d m_len=%d\n",
            !!(m->m_flags & M_EXT), m->m_ext.ext_type, m->m_len));
        KASSERT(m->m_next != (struct mbuf *)0xffffffff,
            ("bad next value m_next=%p m_nextpkt=%p m_flags=0x%x",
            m->m_next, m->m_nextpkt, m->m_flags));
        m = m->m_next;
    }
    KASSERT(sb->sb_flags & SB_NOCOALESCE, ("NOCOALESCE not set"));
    sbappendstream_locked(sb, n);
    m = sb->sb_mb;
    while (m) {
        KASSERT(m->m_next != (struct mbuf *)0xffffffff,
            ("bad next value m_next=%p m_nextpkt=%p m_flags=0x%x",
            m->m_next, m->m_nextpkt, m->m_flags));
        m = m->m_next;
    }
}

static inline int
is_t3a(const struct toedev *dev)
{
    return (dev->tod_ttid == TOE_ID_CHELSIO_T3);
}

static void
dump_toepcb(struct toepcb *toep)
{
    DPRINTF("qset_idx=%d qset=%d ulp_mode=%d mtu_idx=%d tid=%d\n",
        toep->tp_qset_idx, toep->tp_qset, toep->tp_ulp_mode,
        toep->tp_mtu_idx, toep->tp_tid);

    DPRINTF("wr_max=%d wr_avail=%d wr_unacked=%d mss_clamp=%d flags=0x%x\n",
        toep->tp_wr_max, toep->tp_wr_avail, toep->tp_wr_unacked,
        toep->tp_mss_clamp, toep->tp_flags);
}

#ifndef RTALLOC2_DEFINED
static struct rtentry *
rtalloc2(struct sockaddr *dst, int report, u_long ignflags)
{
    struct rtentry *rt = NULL;

    if ((rt = rtalloc1(dst, report, ignflags)) != NULL)
        RT_UNLOCK(rt);

    return (rt);
}
#endif

/*
 * Determine whether to send a CPL message now or defer it. A message is
 * deferred if the connection is in SYN_SENT since we don't know the TID yet.
 * For connections in other states the message is sent immediately.
 * If through_l2t is set the message is subject to ARP processing, otherwise
 * it is sent directly.
 */
static inline void
send_or_defer(struct toepcb *toep, struct mbuf *m, int through_l2t)
{
    struct tcpcb *tp = toep->tp_tp;

    if (__predict_false(tp->t_state == TCPS_SYN_SENT)) {
        inp_wlock(tp->t_inpcb);
        mbufq_tail(&toep->out_of_order_queue, m);   // defer
        inp_wunlock(tp->t_inpcb);
    } else if (through_l2t)
        l2t_send(TOEP_T3C_DEV(toep), m, toep->tp_l2t);  // send through L2T
    else
        cxgb_ofld_send(TOEP_T3C_DEV(toep), m);  // send directly
}

static inline unsigned int
mkprio(unsigned int cntrl, const struct toepcb *toep)
{
    return (cntrl);
}

/*
 * Populate a TID_RELEASE WR. The mbuf must already be properly sized.
*/ static inline void mk_tid_release(struct mbuf *m, const struct toepcb *toep, unsigned int tid) { struct cpl_tid_release *req; m_set_priority(m, mkprio(CPL_PRIORITY_SETUP, toep)); m->m_pkthdr.len = m->m_len = sizeof(*req); req = mtod(m, struct cpl_tid_release *); req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); req->wr.wr_lo = 0; OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_TID_RELEASE, tid)); } static inline void make_tx_data_wr(struct socket *so, struct mbuf *m, int len, struct mbuf *tail) { INIT_VNET_INET(so->so_vnet); struct tcpcb *tp = so_sototcpcb(so); struct toepcb *toep = tp->t_toe; struct tx_data_wr *req; struct sockbuf *snd; inp_lock_assert(tp->t_inpcb); snd = so_sockbuf_snd(so); req = mtod(m, struct tx_data_wr *); m->m_len = sizeof(*req); req->wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_TX_DATA)); req->wr_lo = htonl(V_WR_TID(toep->tp_tid)); /* len includes the length of any HW ULP additions */ req->len = htonl(len); req->param = htonl(V_TX_PORT(toep->tp_l2t->smt_idx)); /* V_TX_ULP_SUBMODE sets both the mode and submode */ req->flags = htonl(V_TX_ULP_SUBMODE(/*skb_ulp_mode(skb)*/ 0) | V_TX_URG(/* skb_urgent(skb) */ 0 ) | V_TX_SHOVE((!(tp->t_flags & TF_MORETOCOME) && (tail ? 0 : 1)))); req->sndseq = htonl(tp->snd_nxt); if (__predict_false((toep->tp_flags & TP_DATASENT) == 0)) { req->flags |= htonl(V_TX_ACK_PAGES(2) | F_TX_INIT | V_TX_CPU_IDX(toep->tp_qset)); /* Sendbuffer is in units of 32KB. */ if (V_tcp_do_autosndbuf && snd->sb_flags & SB_AUTOSIZE) req->param |= htonl(V_TX_SNDBUF(V_tcp_autosndbuf_max >> 15)); else { req->param |= htonl(V_TX_SNDBUF(snd->sb_hiwat >> 15)); } toep->tp_flags |= TP_DATASENT; } } #define IMM_LEN 64 /* XXX - see WR_LEN in the cxgb driver */ int t3_push_frames(struct socket *so, int req_completion) { struct tcpcb *tp = so_sototcpcb(so); struct toepcb *toep = tp->t_toe; struct mbuf *tail, *m0, *last; struct t3cdev *cdev; struct tom_data *d; int state, bytes, count, total_bytes; bus_dma_segment_t segs[TX_MAX_SEGS], *segp; struct sockbuf *snd; if (tp->t_state == TCPS_SYN_SENT || tp->t_state == TCPS_CLOSED) { DPRINTF("tcp state=%d\n", tp->t_state); return (0); } state = so_state_get(so); if (state & (SS_ISDISCONNECTING|SS_ISDISCONNECTED)) { DPRINTF("disconnecting\n"); return (0); } inp_lock_assert(tp->t_inpcb); snd = so_sockbuf_snd(so); sockbuf_lock(snd); d = TOM_DATA(toep->tp_toedev); cdev = d->cdev; last = tail = snd->sb_sndptr ? snd->sb_sndptr : snd->sb_mb; total_bytes = 0; DPRINTF("wr_avail=%d tail=%p snd.cc=%d tp_last=%p\n", toep->tp_wr_avail, tail, snd->sb_cc, toep->tp_m_last); if (last && toep->tp_m_last == last && snd->sb_sndptroff != 0) { KASSERT(tail, ("sbdrop error")); last = tail = tail->m_next; } if ((toep->tp_wr_avail == 0 ) || (tail == NULL)) { DPRINTF("wr_avail=%d tail=%p\n", toep->tp_wr_avail, tail); sockbuf_unlock(snd); return (0); } toep->tp_m_last = NULL; while (toep->tp_wr_avail && (tail != NULL)) { count = bytes = 0; segp = segs; if ((m0 = m_gethdr(M_NOWAIT, MT_DATA)) == NULL) { sockbuf_unlock(snd); return (0); } /* * If the data in tail fits as in-line, then * make an immediate data wr. 
*/ if (tail->m_len <= IMM_LEN) { count = 1; bytes = tail->m_len; last = tail; tail = tail->m_next; m_set_sgl(m0, NULL); m_set_sgllen(m0, 0); make_tx_data_wr(so, m0, bytes, tail); m_append(m0, bytes, mtod(last, caddr_t)); KASSERT(!m0->m_next, ("bad append")); } else { while ((mbuf_wrs[count + 1] <= toep->tp_wr_avail) && (tail != NULL) && (count < TX_MAX_SEGS-1)) { bytes += tail->m_len; last = tail; count++; /* * technically an abuse to be using this for a VA * but less gross than defining my own structure * or calling pmap_kextract from here :-| */ segp->ds_addr = (bus_addr_t)tail->m_data; segp->ds_len = tail->m_len; DPRINTF("count=%d wr_needed=%d ds_addr=%p ds_len=%d\n", count, mbuf_wrs[count], tail->m_data, tail->m_len); segp++; tail = tail->m_next; } DPRINTF("wr_avail=%d mbuf_wrs[%d]=%d tail=%p\n", toep->tp_wr_avail, count, mbuf_wrs[count], tail); m_set_sgl(m0, segs); m_set_sgllen(m0, count); make_tx_data_wr(so, m0, bytes, tail); } m_set_priority(m0, mkprio(CPL_PRIORITY_DATA, toep)); if (tail) { snd->sb_sndptr = tail; toep->tp_m_last = NULL; } else toep->tp_m_last = snd->sb_sndptr = last; DPRINTF("toep->tp_m_last=%p\n", toep->tp_m_last); snd->sb_sndptroff += bytes; total_bytes += bytes; toep->tp_write_seq += bytes; CTR6(KTR_TOM, "t3_push_frames: wr_avail=%d mbuf_wrs[%d]=%d" " tail=%p sndptr=%p sndptroff=%d", toep->tp_wr_avail, count, mbuf_wrs[count], tail, snd->sb_sndptr, snd->sb_sndptroff); if (tail) CTR4(KTR_TOM, "t3_push_frames: total_bytes=%d" " tp_m_last=%p tailbuf=%p snd_una=0x%08x", total_bytes, toep->tp_m_last, tail->m_data, tp->snd_una); else CTR3(KTR_TOM, "t3_push_frames: total_bytes=%d" " tp_m_last=%p snd_una=0x%08x", total_bytes, toep->tp_m_last, tp->snd_una); #ifdef KTR { int i; i = 0; while (i < count && m_get_sgllen(m0)) { if ((count - i) >= 3) { CTR6(KTR_TOM, "t3_push_frames: pa=0x%zx len=%d pa=0x%zx" " len=%d pa=0x%zx len=%d", segs[i].ds_addr, segs[i].ds_len, segs[i + 1].ds_addr, segs[i + 1].ds_len, segs[i + 2].ds_addr, segs[i + 2].ds_len); i += 3; } else if ((count - i) == 2) { CTR4(KTR_TOM, "t3_push_frames: pa=0x%zx len=%d pa=0x%zx" " len=%d", segs[i].ds_addr, segs[i].ds_len, segs[i + 1].ds_addr, segs[i + 1].ds_len); i += 2; } else { CTR2(KTR_TOM, "t3_push_frames: pa=0x%zx len=%d", segs[i].ds_addr, segs[i].ds_len); i++; } } } #endif /* * remember credits used */ m0->m_pkthdr.csum_data = mbuf_wrs[count]; m0->m_pkthdr.len = bytes; toep->tp_wr_avail -= mbuf_wrs[count]; toep->tp_wr_unacked += mbuf_wrs[count]; if ((req_completion && toep->tp_wr_unacked == mbuf_wrs[count]) || toep->tp_wr_unacked >= toep->tp_wr_max / 2) { struct work_request_hdr *wr = cplhdr(m0); wr->wr_hi |= htonl(F_WR_COMPL); toep->tp_wr_unacked = 0; } KASSERT((m0->m_pkthdr.csum_data > 0) && (m0->m_pkthdr.csum_data <= 4), ("bad credit count %d", m0->m_pkthdr.csum_data)); m0->m_type = MT_DONTFREE; enqueue_wr(toep, m0); DPRINTF("sending offload tx with %d bytes in %d segments\n", bytes, count); l2t_send(cdev, m0, toep->tp_l2t); } sockbuf_unlock(snd); return (total_bytes); } /* * Close a connection by sending a CPL_CLOSE_CON_REQ message. Cannot fail * under any circumstances. We take the easy way out and always queue the * message to the write_queue. We can optimize the case where the queue is * already empty though the optimization is probably not worth it. 
*/ static void close_conn(struct socket *so) { struct mbuf *m; struct cpl_close_con_req *req; struct tom_data *d; struct inpcb *inp = so_sotoinpcb(so); struct tcpcb *tp; struct toepcb *toep; unsigned int tid; inp_wlock(inp); tp = so_sototcpcb(so); toep = tp->t_toe; if (tp->t_state != TCPS_SYN_SENT) t3_push_frames(so, 1); if (toep->tp_flags & TP_FIN_SENT) { inp_wunlock(inp); return; } tid = toep->tp_tid; d = TOM_DATA(toep->tp_toedev); m = m_gethdr_nofail(sizeof(*req)); m_set_priority(m, CPL_PRIORITY_DATA); m_set_sgl(m, NULL); m_set_sgllen(m, 0); toep->tp_flags |= TP_FIN_SENT; req = mtod(m, struct cpl_close_con_req *); req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_CLOSE_CON)); req->wr.wr_lo = htonl(V_WR_TID(tid)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, tid)); req->rsvd = 0; inp_wunlock(inp); /* * XXX - need to defer shutdown while there is still data in the queue * */ CTR4(KTR_TOM, "%s CLOSE_CON_REQ so %p tp %p tid=%u", __FUNCTION__, so, tp, tid); cxgb_ofld_send(d->cdev, m); } /* * Handle an ARP failure for a CPL_ABORT_REQ. Change it into a no RST variant * and send it along. */ static void abort_arp_failure(struct t3cdev *cdev, struct mbuf *m) { struct cpl_abort_req *req = cplhdr(m); req->cmd = CPL_ABORT_NO_RST; cxgb_ofld_send(cdev, m); } /* * Send RX credits through an RX_DATA_ACK CPL message. If nofail is 0 we are * permitted to return without sending the message in case we cannot allocate * an sk_buff. Returns the number of credits sent. */ uint32_t t3_send_rx_credits(struct tcpcb *tp, uint32_t credits, uint32_t dack, int nofail) { struct mbuf *m; struct cpl_rx_data_ack *req; struct toepcb *toep = tp->t_toe; struct toedev *tdev = toep->tp_toedev; m = m_gethdr_nofail(sizeof(*req)); DPRINTF("returning %u credits to HW\n", credits); req = mtod(m, struct cpl_rx_data_ack *); req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); req->wr.wr_lo = 0; OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, toep->tp_tid)); req->credit_dack = htonl(dack | V_RX_CREDITS(credits)); m_set_priority(m, mkprio(CPL_PRIORITY_ACK, toep)); cxgb_ofld_send(TOM_DATA(tdev)->cdev, m); return (credits); } /* * Send RX_DATA_ACK CPL message to request a modulation timer to be scheduled. * This is only used in DDP mode, so we take the opportunity to also set the * DACK mode and flush any Rx credits. */ void t3_send_rx_modulate(struct toepcb *toep) { struct mbuf *m; struct cpl_rx_data_ack *req; m = m_gethdr_nofail(sizeof(*req)); req = mtod(m, struct cpl_rx_data_ack *); req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); req->wr.wr_lo = 0; m->m_pkthdr.len = m->m_len = sizeof(*req); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, toep->tp_tid)); req->credit_dack = htonl(F_RX_MODULATE | F_RX_DACK_CHANGE | V_RX_DACK_MODE(1) | V_RX_CREDITS(toep->tp_copied_seq - toep->tp_rcv_wup)); m_set_priority(m, mkprio(CPL_PRIORITY_CONTROL, toep)); cxgb_ofld_send(TOEP_T3C_DEV(toep), m); toep->tp_rcv_wup = toep->tp_copied_seq; } /* * Handle receipt of an urgent pointer. 
*/ static void handle_urg_ptr(struct socket *so, uint32_t urg_seq) { #ifdef URGENT_DATA_SUPPORTED struct tcpcb *tp = so_sototcpcb(so); urg_seq--; /* initially points past the urgent data, per BSD */ if (tp->urg_data && !after(urg_seq, tp->urg_seq)) return; /* duplicate pointer */ sk_send_sigurg(sk); if (tp->urg_seq == tp->copied_seq && tp->urg_data && !sock_flag(sk, SOCK_URGINLINE) && tp->copied_seq != tp->rcv_nxt) { struct sk_buff *skb = skb_peek(&sk->sk_receive_queue); tp->copied_seq++; if (skb && tp->copied_seq - TCP_SKB_CB(skb)->seq >= skb->len) tom_eat_skb(sk, skb, 0); } tp->urg_data = TCP_URG_NOTYET; tp->urg_seq = urg_seq; #endif } /* * Returns true if a socket cannot accept new Rx data. */ static inline int so_no_receive(const struct socket *so) { return (so_state_get(so) & (SS_ISDISCONNECTED|SS_ISDISCONNECTING)); } /* * Process an urgent data notification. */ static void rx_urg_notify(struct toepcb *toep, struct mbuf *m) { struct cpl_rx_urg_notify *hdr = cplhdr(m); struct socket *so = inp_inpcbtosocket(toep->tp_tp->t_inpcb); VALIDATE_SOCK(so); if (!so_no_receive(so)) handle_urg_ptr(so, ntohl(hdr->seq)); m_freem(m); } /* * Handler for RX_URG_NOTIFY CPL messages. */ static int do_rx_urg_notify(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct toepcb *toep = (struct toepcb *)ctx; rx_urg_notify(toep, m); return (0); } static __inline int is_delack_mode_valid(struct toedev *dev, struct toepcb *toep) { return (toep->tp_ulp_mode || (toep->tp_ulp_mode == ULP_MODE_TCPDDP && dev->tod_ttid >= TOE_ID_CHELSIO_T3)); } /* * Set of states for which we should return RX credits. */ #define CREDIT_RETURN_STATE (TCPF_ESTABLISHED | TCPF_FIN_WAIT1 | TCPF_FIN_WAIT2) /* * Called after some received data has been read. It returns RX credits * to the HW for the amount of data processed. */ void t3_cleanup_rbuf(struct tcpcb *tp, int copied) { struct toepcb *toep = tp->t_toe; struct socket *so; struct toedev *dev; int dack_mode, must_send, read; u32 thres, credits, dack = 0; struct sockbuf *rcv; so = inp_inpcbtosocket(tp->t_inpcb); rcv = so_sockbuf_rcv(so); if (!((tp->t_state == TCPS_ESTABLISHED) || (tp->t_state == TCPS_FIN_WAIT_1) || (tp->t_state == TCPS_FIN_WAIT_2))) { if (copied) { sockbuf_lock(rcv); toep->tp_copied_seq += copied; sockbuf_unlock(rcv); } return; } inp_lock_assert(tp->t_inpcb); sockbuf_lock(rcv); if (copied) toep->tp_copied_seq += copied; else { read = toep->tp_enqueued_bytes - rcv->sb_cc; toep->tp_copied_seq += read; } credits = toep->tp_copied_seq - toep->tp_rcv_wup; toep->tp_enqueued_bytes = rcv->sb_cc; sockbuf_unlock(rcv); if (credits > rcv->sb_mbmax) { log(LOG_ERR, "copied_seq=%u rcv_wup=%u credits=%u\n", toep->tp_copied_seq, toep->tp_rcv_wup, credits); credits = rcv->sb_mbmax; } /* * XXX this won't accurately reflect credit return - we need * to look at the difference between the amount that has been * put in the recv sockbuf and what is there now */ if (__predict_false(!credits)) return; dev = toep->tp_toedev; thres = TOM_TUNABLE(dev, rx_credit_thres); if (__predict_false(thres == 0)) return; if (is_delack_mode_valid(dev, toep)) { dack_mode = TOM_TUNABLE(dev, delack); if (__predict_false(dack_mode != toep->tp_delack_mode)) { u32 r = tp->rcv_nxt - toep->tp_delack_seq; if (r >= tp->rcv_wnd || r >= 16 * toep->tp_mss_clamp) dack = F_RX_DACK_CHANGE | V_RX_DACK_MODE(dack_mode); } } else dack = F_RX_DACK_CHANGE | V_RX_DACK_MODE(1); /* * For coalescing to work effectively ensure the receive window has * at least 16KB left. 
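 *
 * For example (a hypothetical sizing, using the check below): with a
 * 64KB receive window and 48KB of credits not yet returned to the HW,
 * 48KB + 16KB >= 64KB, so the credits are returned immediately even
 * when the rx_credit_thres tunable has not been reached.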
 */
    must_send = credits + 16384 >= tp->rcv_wnd;

    if (must_send || credits >= thres)
        toep->tp_rcv_wup += t3_send_rx_credits(tp, credits, dack, must_send);
}

static int
cxgb_toe_disconnect(struct tcpcb *tp)
{
    struct socket *so;

    DPRINTF("cxgb_toe_disconnect\n");

    so = inp_inpcbtosocket(tp->t_inpcb);
    close_conn(so);
    return (0);
}

static int
cxgb_toe_reset(struct tcpcb *tp)
{
    struct toepcb *toep = tp->t_toe;

    t3_send_reset(toep);

    /*
     * unhook from socket
     */
    tp->t_flags &= ~TF_TOE;
    toep->tp_tp = NULL;
    tp->t_toe = NULL;
    return (0);
}

static int
cxgb_toe_send(struct tcpcb *tp)
{
    struct socket *so;

    DPRINTF("cxgb_toe_send\n");
    dump_toepcb(tp->t_toe);

    so = inp_inpcbtosocket(tp->t_inpcb);
    t3_push_frames(so, 1);
    return (0);
}

static int
cxgb_toe_rcvd(struct tcpcb *tp)
{
    inp_lock_assert(tp->t_inpcb);

    t3_cleanup_rbuf(tp, 0);

    return (0);
}

static void
cxgb_toe_detach(struct tcpcb *tp)
{
    struct toepcb *toep;

    /*
     * XXX how do we handle teardown in the SYN_SENT state?
     */
    inp_lock_assert(tp->t_inpcb);
    toep = tp->t_toe;
    toep->tp_tp = NULL;

    /*
     * unhook from socket
     */
    tp->t_flags &= ~TF_TOE;
    tp->t_toe = NULL;
}

static struct toe_usrreqs cxgb_toe_usrreqs = {
    .tu_disconnect = cxgb_toe_disconnect,
    .tu_reset = cxgb_toe_reset,
    .tu_send = cxgb_toe_send,
    .tu_rcvd = cxgb_toe_rcvd,
    .tu_detach = cxgb_toe_detach,
    .tu_syncache_event = handle_syncache_event,
};

static void
__set_tcb_field(struct toepcb *toep, struct mbuf *m, uint16_t word,
    uint64_t mask, uint64_t val, int no_reply)
{
    struct cpl_set_tcb_field *req;

    CTR4(KTR_TCB, "__set_tcb_field_ulp(tid=%u word=0x%x mask=%jx val=%jx",
        toep->tp_tid, word, mask, val);

    req = mtod(m, struct cpl_set_tcb_field *);
    m->m_pkthdr.len = m->m_len = sizeof(*req);
    req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
    req->wr.wr_lo = 0;
    OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, toep->tp_tid));
    req->reply = V_NO_REPLY(no_reply);
    req->cpu_idx = 0;
    req->word = htons(word);
    req->mask = htobe64(mask);
    req->val = htobe64(val);

    m_set_priority(m, mkprio(CPL_PRIORITY_CONTROL, toep));

    send_or_defer(toep, m, 0);
}

static void
t3_set_tcb_field(struct toepcb *toep, uint16_t word, uint64_t mask, uint64_t val)
{
    struct mbuf *m;
    struct tcpcb *tp;

    if (toep == NULL)
        return;

    tp = toep->tp_tp;
    if (tp->t_state == TCPS_CLOSED || (toep->tp_flags & TP_ABORT_SHUTDOWN)) {
        printf("not setting field\n");
        return;
    }

    m = m_gethdr_nofail(sizeof(struct cpl_set_tcb_field));

    __set_tcb_field(toep, m, word, mask, val, 1);
}

/*
 * Set one of the t_flags bits in the TCB.
 */
static void
set_tcb_tflag(struct toepcb *toep, unsigned int bit_pos, int val)
{
    t3_set_tcb_field(toep, W_TCB_T_FLAGS1, 1ULL << bit_pos, val << bit_pos);
}

/*
 * Send a SET_TCB_FIELD CPL message to change a connection's Nagle setting.
 */
static void
t3_set_nagle(struct toepcb *toep)
{
    struct tcpcb *tp = toep->tp_tp;

    set_tcb_tflag(toep, S_TF_NAGLE, !(tp->t_flags & TF_NODELAY));
}

/*
 * Send a SET_TCB_FIELD CPL message to change a connection's keepalive setting.
 */
void
t3_set_keepalive(struct toepcb *toep, int on_off)
{
    set_tcb_tflag(toep, S_TF_KEEPALIVE, on_off);
}

void
t3_set_rcv_coalesce_enable(struct toepcb *toep, int on_off)
{
    set_tcb_tflag(toep, S_TF_RCV_COALESCE_ENABLE, on_off);
}

void
t3_set_dack_mss(struct toepcb *toep, int on_off)
{
    set_tcb_tflag(toep, S_TF_DACK_MSS, on_off);
}

/*
 * Send a SET_TCB_FIELD CPL message to change a connection's TOS setting.
 */
static void
t3_set_tos(struct toepcb *toep)
{
    int tos = inp_ip_tos_get(toep->tp_tp->t_inpcb);

    t3_set_tcb_field(toep, W_TCB_TOS, V_TCB_TOS(M_TCB_TOS),
        V_TCB_TOS(tos));
}

/*
 * In DDP mode, TP fails to schedule a timer to push RX data to the host when
 * DDP is disabled (data is delivered to freelist). [Note that the peer should
 * set the PSH bit in the last segment, which would trigger delivery.]
 * We work around the issue by setting a DDP buffer in a partially placed
 * state, which guarantees that TP will schedule a timer.
 */
#define TP_DDP_TIMER_WORKAROUND_MASK \
    (V_TF_DDP_BUF0_VALID(1) | V_TF_DDP_ACTIVE_BUF(1) | \
     ((V_TCB_RX_DDP_BUF0_OFFSET(M_TCB_RX_DDP_BUF0_OFFSET) | \
       V_TCB_RX_DDP_BUF0_LEN(3)) << 32))

#define TP_DDP_TIMER_WORKAROUND_VAL \
    (V_TF_DDP_BUF0_VALID(1) | V_TF_DDP_ACTIVE_BUF(0) | \
     ((V_TCB_RX_DDP_BUF0_OFFSET((uint64_t)1) | \
       V_TCB_RX_DDP_BUF0_LEN((uint64_t)2)) << 32))

static void
t3_enable_ddp(struct toepcb *toep, int on)
{
    if (on) {
        t3_set_tcb_field(toep, W_TCB_RX_DDP_FLAGS, V_TF_DDP_OFF(1),
            V_TF_DDP_OFF(0));
    } else
        t3_set_tcb_field(toep, W_TCB_RX_DDP_FLAGS,
            V_TF_DDP_OFF(1) | TP_DDP_TIMER_WORKAROUND_MASK,
            V_TF_DDP_OFF(1) | TP_DDP_TIMER_WORKAROUND_VAL);
}

void
t3_set_ddp_tag(struct toepcb *toep, int buf_idx, unsigned int tag_color)
{
    t3_set_tcb_field(toep, W_TCB_RX_DDP_BUF0_TAG + buf_idx,
        V_TCB_RX_DDP_BUF0_TAG(M_TCB_RX_DDP_BUF0_TAG),
        tag_color);
}

void
t3_set_ddp_buf(struct toepcb *toep, int buf_idx, unsigned int offset,
    unsigned int len)
{
    if (buf_idx == 0)
        t3_set_tcb_field(toep, W_TCB_RX_DDP_BUF0_OFFSET,
            V_TCB_RX_DDP_BUF0_OFFSET(M_TCB_RX_DDP_BUF0_OFFSET) |
            V_TCB_RX_DDP_BUF0_LEN(M_TCB_RX_DDP_BUF0_LEN),
            V_TCB_RX_DDP_BUF0_OFFSET((uint64_t)offset) |
            V_TCB_RX_DDP_BUF0_LEN((uint64_t)len));
    else
        t3_set_tcb_field(toep, W_TCB_RX_DDP_BUF1_OFFSET,
            V_TCB_RX_DDP_BUF1_OFFSET(M_TCB_RX_DDP_BUF1_OFFSET) |
            V_TCB_RX_DDP_BUF1_LEN(M_TCB_RX_DDP_BUF1_LEN << 32),
            V_TCB_RX_DDP_BUF1_OFFSET((uint64_t)offset) |
            V_TCB_RX_DDP_BUF1_LEN(((uint64_t)len) << 32));
}

static int
t3_set_cong_control(struct socket *so, const char *name)
{
#ifdef CONGESTION_CONTROL_SUPPORTED
    int cong_algo;

    for (cong_algo = 0; cong_algo < ARRAY_SIZE(t3_cong_ops); cong_algo++)
        if (!strcmp(name, t3_cong_ops[cong_algo].name))
            break;

    if (cong_algo >= ARRAY_SIZE(t3_cong_ops))
        return -EINVAL;
#endif
    return 0;
}

int
t3_get_tcb(struct toepcb *toep)
{
    struct cpl_get_tcb *req;
    struct tcpcb *tp = toep->tp_tp;
    struct mbuf *m = m_gethdr(M_NOWAIT, MT_DATA);

    if (!m)
        return (ENOMEM);

    inp_lock_assert(tp->t_inpcb);
    m_set_priority(m, mkprio(CPL_PRIORITY_CONTROL, toep));
    req = mtod(m, struct cpl_get_tcb *);
    m->m_pkthdr.len = m->m_len = sizeof(*req);
    req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD));
    req->wr.wr_lo = 0;
    OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_GET_TCB, toep->tp_tid));
    req->cpuno = htons(toep->tp_qset);
    req->rsvd = 0;
    if (tp->t_state == TCPS_SYN_SENT)
        mbufq_tail(&toep->out_of_order_queue, m);   // defer
    else
        cxgb_ofld_send(TOEP_T3C_DEV(toep), m);
    return 0;
}

static inline void
so_insert_tid(struct tom_data *d, struct toepcb *toep, unsigned int tid)
{
    toepcb_hold(toep);

    cxgb_insert_tid(d->cdev, d->client, toep, tid);
}

/**
 * find_best_mtu - find the entry in the MTU table closest to an MTU
 * @d: TOM state
 * @mtu: the target MTU
 *
 * Returns the index of the value in the MTU table that is closest to but
 * does not exceed the target MTU.
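 *
 * For example, given a hypothetical MTU table {576, 1500, 9000} and a
 * target MTU of 4000, the scan below stops at index 1 (1500), the
 * largest entry that does not exceed the target.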
 */
static unsigned int
find_best_mtu(const struct t3c_data *d, unsigned short mtu)
{
    int i = 0;

    while (i < d->nmtus - 1 && d->mtus[i + 1] <= mtu)
        ++i;
    return (i);
}

static unsigned int
select_mss(struct t3c_data *td, struct tcpcb *tp, unsigned int pmtu)
{
    unsigned int idx;

#ifdef notyet
    struct rtentry *dst = so_sotoinpcb(so)->inp_route.ro_rt;
#endif
    if (tp) {
        tp->t_maxseg = pmtu - 40;
        if (tp->t_maxseg < td->mtus[0] - 40)
            tp->t_maxseg = td->mtus[0] - 40;
        idx = find_best_mtu(td, tp->t_maxseg + 40);

        tp->t_maxseg = td->mtus[idx] - 40;
    } else
        idx = find_best_mtu(td, pmtu);

    return (idx);
}

static inline void
free_atid(struct t3cdev *cdev, unsigned int tid)
{
    struct toepcb *toep = cxgb_free_atid(cdev, tid);

    if (toep)
        toepcb_release(toep);
}

/*
 * Release resources held by an offload connection (TID, L2T entry, etc.)
 */
static void
t3_release_offload_resources(struct toepcb *toep)
{
    struct tcpcb *tp = toep->tp_tp;
    struct toedev *tdev = toep->tp_toedev;
    struct t3cdev *cdev;
    struct socket *so;
    unsigned int tid = toep->tp_tid;
    struct sockbuf *rcv;

    CTR0(KTR_TOM, "t3_release_offload_resources");

    if (!tdev)
        return;

    cdev = TOEP_T3C_DEV(toep);
    if (!cdev)
        return;

    toep->tp_qset = 0;
    t3_release_ddp_resources(toep);

#ifdef CTRL_SKB_CACHE
    kfree_skb(CTRL_SKB_CACHE(tp));
    CTRL_SKB_CACHE(tp) = NULL;
#endif

    if (toep->tp_wr_avail != toep->tp_wr_max) {
        purge_wr_queue(toep);
        reset_wr_list(toep);
    }

    if (toep->tp_l2t) {
        l2t_release(L2DATA(cdev), toep->tp_l2t);
        toep->tp_l2t = NULL;
    }
    toep->tp_tp = NULL;
    if (tp) {
        inp_lock_assert(tp->t_inpcb);
        so = inp_inpcbtosocket(tp->t_inpcb);
        rcv = so_sockbuf_rcv(so);
        /*
         * cancel any offloaded reads
         */
        sockbuf_lock(rcv);
        tp->t_toe = NULL;
        tp->t_flags &= ~TF_TOE;
        if (toep->tp_ddp_state.user_ddp_pending) {
            t3_cancel_ubuf(toep, rcv);
            toep->tp_ddp_state.user_ddp_pending = 0;
        }
        so_sorwakeup_locked(so);
    }

    if (toep->tp_state == TCPS_SYN_SENT) {
        free_atid(cdev, tid);
#ifdef notyet
        __skb_queue_purge(&tp->out_of_order_queue);
#endif
    } else {    // we have TID
        cxgb_remove_tid(cdev, toep, tid);
        toepcb_release(toep);
    }
#if 0
    log(LOG_INFO, "closing TID %u, state %u\n", tid, tp->t_state);
#endif
}

static void
install_offload_ops(struct socket *so)
{
    struct tcpcb *tp = so_sototcpcb(so);

    KASSERT(tp->t_toe != NULL, ("toepcb not set"));

    t3_install_socket_ops(so);
    tp->t_flags |= TF_TOE;
    tp->t_tu = &cxgb_toe_usrreqs;
}

/*
 * Determine the receive window scaling factor given a target max
 * receive window.
 */
static __inline int
select_rcv_wscale(int space, struct vnet *vnet)
{
    INIT_VNET_INET(vnet);
    int wscale = 0;

    if (space > MAX_RCV_WND)
        space = MAX_RCV_WND;

    if (V_tcp_do_rfc1323)
        for (; space > 65535 && wscale < 14; space >>= 1, ++wscale)
            ;

    return (wscale);
}

/*
 * Determine the receive window size for a socket.
 */
static unsigned long
select_rcv_wnd(struct toedev *dev, struct socket *so)
{
    INIT_VNET_INET(so->so_vnet);
    struct tom_data *d = TOM_DATA(dev);
    unsigned int wnd;
    unsigned int max_rcv_wnd;
    struct sockbuf *rcv;

    rcv = so_sockbuf_rcv(so);

    if (V_tcp_do_autorcvbuf)
        wnd = V_tcp_autorcvbuf_max;
    else
        wnd = rcv->sb_hiwat;

    /* XXX
     * For receive coalescing to work effectively we need a receive window
     * that can accommodate a coalesced segment.
     */
    if (wnd < MIN_RCV_WND)
        wnd = MIN_RCV_WND;

    /* PR 5138 */
    max_rcv_wnd = (dev->tod_ttid < TOE_ID_CHELSIO_T3C ?
        (uint32_t)d->rx_page_size * 23 : MAX_RCV_WND);

    return min(wnd, max_rcv_wnd);
}

/*
 * Assign offload parameters to some socket fields. This code is used by
 * both active and passive opens.
*/ static inline void init_offload_socket(struct socket *so, struct toedev *dev, unsigned int tid, struct l2t_entry *e, struct rtentry *dst, struct toepcb *toep) { struct tcpcb *tp = so_sototcpcb(so); struct t3c_data *td = T3C_DATA(TOM_DATA(dev)->cdev); struct sockbuf *snd, *rcv; #ifdef notyet SOCK_LOCK_ASSERT(so); #endif snd = so_sockbuf_snd(so); rcv = so_sockbuf_rcv(so); log(LOG_INFO, "initializing offload socket\n"); /* * We either need to fix push frames to work with sbcompress * or we need to add this */ snd->sb_flags |= SB_NOCOALESCE; rcv->sb_flags |= SB_NOCOALESCE; tp->t_toe = toep; toep->tp_tp = tp; toep->tp_toedev = dev; toep->tp_tid = tid; toep->tp_l2t = e; toep->tp_wr_max = toep->tp_wr_avail = TOM_TUNABLE(dev, max_wrs); toep->tp_wr_unacked = 0; toep->tp_delack_mode = 0; toep->tp_mtu_idx = select_mss(td, tp, dst->rt_ifp->if_mtu); /* * XXX broken * */ tp->rcv_wnd = select_rcv_wnd(dev, so); toep->tp_ulp_mode = TOM_TUNABLE(dev, ddp) && !(so_options_get(so) & SO_NO_DDP) && tp->rcv_wnd >= MIN_DDP_RCV_WIN ? ULP_MODE_TCPDDP : 0; toep->tp_qset_idx = 0; reset_wr_list(toep); DPRINTF("initialization done\n"); } /* * The next two functions calculate the option 0 value for a socket. */ static inline unsigned int calc_opt0h(struct socket *so, int mtu_idx) { struct tcpcb *tp = so_sototcpcb(so); int wscale = select_rcv_wscale(tp->rcv_wnd, so->so_vnet); return V_NAGLE((tp->t_flags & TF_NODELAY) == 0) | V_KEEP_ALIVE((so_options_get(so) & SO_KEEPALIVE) != 0) | F_TCAM_BYPASS | V_WND_SCALE(wscale) | V_MSS_IDX(mtu_idx); } static inline unsigned int calc_opt0l(struct socket *so, int ulp_mode) { struct tcpcb *tp = so_sototcpcb(so); unsigned int val; val = V_TOS(INP_TOS(tp->t_inpcb)) | V_ULP_MODE(ulp_mode) | V_RCV_BUFSIZ(min(tp->rcv_wnd >> 10, (u32)M_RCV_BUFSIZ)); DPRINTF("opt0l tos=%08x rcv_wnd=%ld opt0l=%08x\n", INP_TOS(tp->t_inpcb), tp->rcv_wnd, val); return (val); } static inline unsigned int calc_opt2(const struct socket *so, struct toedev *dev) { int flv_valid; flv_valid = (TOM_TUNABLE(dev, cong_alg) != -1); return (V_FLAVORS_VALID(flv_valid) | V_CONG_CONTROL_FLAVOR(flv_valid ? TOM_TUNABLE(dev, cong_alg) : 0)); } #if DEBUG_WR > 1 static int count_pending_wrs(const struct toepcb *toep) { const struct mbuf *m; int n = 0; wr_queue_walk(toep, m) n += m->m_pkthdr.csum_data; return (n); } #endif #if 0 (((*(struct tom_data **)&(dev)->l4opt)->conf.cong_alg) != -1) #endif static void mk_act_open_req(struct socket *so, struct mbuf *m, unsigned int atid, const struct l2t_entry *e) { struct cpl_act_open_req *req; struct inpcb *inp = so_sotoinpcb(so); struct tcpcb *tp = inp_inpcbtotcpcb(inp); struct toepcb *toep = tp->t_toe; struct toedev *tdev = toep->tp_toedev; m_set_priority((struct mbuf *)m, mkprio(CPL_PRIORITY_SETUP, toep)); req = mtod(m, struct cpl_act_open_req *); m->m_pkthdr.len = m->m_len = sizeof(*req); req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); req->wr.wr_lo = 0; OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ACT_OPEN_REQ, atid)); inp_4tuple_get(inp, &req->local_ip, &req->local_port, &req->peer_ip, &req->peer_port); #if 0 req->local_port = inp->inp_lport; req->peer_port = inp->inp_fport; memcpy(&req->local_ip, &inp->inp_laddr, 4); memcpy(&req->peer_ip, &inp->inp_faddr, 4); #endif req->opt0h = htonl(calc_opt0h(so, toep->tp_mtu_idx) | V_L2T_IDX(e->idx) | V_TX_CHANNEL(e->smt_idx)); req->opt0l = htonl(calc_opt0l(so, toep->tp_ulp_mode)); req->params = 0; req->opt2 = htonl(calc_opt2(so, tdev)); } /* * Convert an ACT_OPEN_RPL status to an errno. 
*/ static int act_open_rpl_status_to_errno(int status) { switch (status) { case CPL_ERR_CONN_RESET: return (ECONNREFUSED); case CPL_ERR_ARP_MISS: return (EHOSTUNREACH); case CPL_ERR_CONN_TIMEDOUT: return (ETIMEDOUT); case CPL_ERR_TCAM_FULL: return (ENOMEM); case CPL_ERR_CONN_EXIST: log(LOG_ERR, "ACTIVE_OPEN_RPL: 4-tuple in use\n"); return (EADDRINUSE); default: return (EIO); } } static void fail_act_open(struct toepcb *toep, int errno) { struct tcpcb *tp = toep->tp_tp; t3_release_offload_resources(toep); if (tp) { inp_wunlock(tp->t_inpcb); tcp_offload_drop(tp, errno); } #ifdef notyet TCP_INC_STATS_BH(TCP_MIB_ATTEMPTFAILS); #endif } /* * Handle active open failures. */ static void active_open_failed(struct toepcb *toep, struct mbuf *m) { struct cpl_act_open_rpl *rpl = cplhdr(m); struct inpcb *inp; if (toep->tp_tp == NULL) goto done; inp = toep->tp_tp->t_inpcb; /* * Don't handle connection retry for now */ #ifdef notyet struct inet_connection_sock *icsk = inet_csk(sk); if (rpl->status == CPL_ERR_CONN_EXIST && icsk->icsk_retransmit_timer.function != act_open_retry_timer) { icsk->icsk_retransmit_timer.function = act_open_retry_timer; sk_reset_timer(so, &icsk->icsk_retransmit_timer, jiffies + HZ / 2); } else #endif { inp_wlock(inp); /* * drops the inpcb lock */ fail_act_open(toep, act_open_rpl_status_to_errno(rpl->status)); } done: m_free(m); } /* * Return whether a failed active open has allocated a TID */ static inline int act_open_has_tid(int status) { return status != CPL_ERR_TCAM_FULL && status != CPL_ERR_CONN_EXIST && status != CPL_ERR_ARP_MISS; } /* * Process an ACT_OPEN_RPL CPL message. */ static int do_act_open_rpl(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct toepcb *toep = (struct toepcb *)ctx; struct cpl_act_open_rpl *rpl = cplhdr(m); if (cdev->type != T3A && act_open_has_tid(rpl->status)) cxgb_queue_tid_release(cdev, GET_TID(rpl)); active_open_failed(toep, m); return (0); } /* * Handle an ARP failure for an active open. XXX purge ofo queue * * XXX badly broken for crossed SYNs as the ATID is no longer valid. * XXX crossed SYN errors should be generated by PASS_ACCEPT_RPL which should * check SOCK_DEAD or sk->sk_sock. Or maybe generate the error here but don't * free the atid. Hmm. */ #ifdef notyet static void act_open_req_arp_failure(struct t3cdev *dev, struct mbuf *m) { struct toepcb *toep = m_get_toep(m); struct tcpcb *tp = toep->tp_tp; struct inpcb *inp = tp->t_inpcb; struct socket *so; inp_wlock(inp); if (tp->t_state == TCPS_SYN_SENT || tp->t_state == TCPS_SYN_RECEIVED) { /* * drops the inpcb lock */ fail_act_open(so, EHOSTUNREACH); printf("freeing %p\n", m); m_free(m); } else inp_wunlock(inp); } #endif /* * Send an active open request. 
 */
int
t3_connect(struct toedev *tdev, struct socket *so,
    struct rtentry *rt, struct sockaddr *nam)
{
    struct mbuf *m;
    struct l2t_entry *e;
    struct tom_data *d = TOM_DATA(tdev);
    struct inpcb *inp = so_sotoinpcb(so);
    struct tcpcb *tp = intotcpcb(inp);
    struct toepcb *toep; /* allocated by init_offload_socket */
    int atid;

    toep = toepcb_alloc();
    if (toep == NULL)
        goto out_err;

    if ((atid = cxgb_alloc_atid(d->cdev, d->client, toep)) < 0)
        goto out_err;

    e = t3_l2t_get(d->cdev, rt, rt->rt_ifp, nam);
    if (!e)
        goto free_tid;

    inp_lock_assert(inp);
    m = m_gethdr(M_WAITOK, MT_DATA);

#if 0
    m->m_toe.mt_toepcb = tp->t_toe;
    set_arp_failure_handler((struct mbuf *)m, act_open_req_arp_failure);
#endif
    so_lock(so);

    init_offload_socket(so, tdev, atid, e, rt, toep);

    install_offload_ops(so);

    mk_act_open_req(so, m, atid, e);
    so_unlock(so);

    soisconnecting(so);
    toep = tp->t_toe;
    m_set_toep(m, tp->t_toe);

    toep->tp_state = TCPS_SYN_SENT;
    l2t_send(d->cdev, (struct mbuf *)m, e);

    if (toep->tp_ulp_mode)
        t3_enable_ddp(toep, 0);
    return (0);

free_tid:
    printf("failing connect - free atid\n");
    free_atid(d->cdev, atid);
out_err:
    printf("return ENOMEM\n");
    return (ENOMEM);
}

/*
 * Send an ABORT_REQ message. Cannot fail. This routine makes sure we do
 * not send multiple ABORT_REQs for the same connection and also that we do
 * not try to send a message after the connection has closed.
 */
static void
t3_send_reset(struct toepcb *toep)
{
    struct cpl_abort_req *req;
    unsigned int tid = toep->tp_tid;
    int mode = CPL_ABORT_SEND_RST;
    struct tcpcb *tp = toep->tp_tp;
    struct toedev *tdev = toep->tp_toedev;
    struct socket *so = NULL;
    struct mbuf *m;
    struct sockbuf *snd;

    if (tp) {
        inp_lock_assert(tp->t_inpcb);
        so = inp_inpcbtosocket(tp->t_inpcb);
    }

    if (__predict_false((toep->tp_flags & TP_ABORT_SHUTDOWN) ||
        tdev == NULL))
        return;
    toep->tp_flags |= (TP_ABORT_RPL_PENDING|TP_ABORT_SHUTDOWN);

    /* Purge the send queue so we don't send anything after an abort. */
    if (so) {
        snd = so_sockbuf_snd(so);
        sbflush(snd);
    }
    if ((toep->tp_flags & TP_CLOSE_CON_REQUESTED) && is_t3a(tdev))
        mode |= CPL_ABORT_POST_CLOSE_REQ;

    m = m_gethdr_nofail(sizeof(*req));
    m_set_priority(m, mkprio(CPL_PRIORITY_DATA, toep));
    set_arp_failure_handler(m, abort_arp_failure);

    req = mtod(m, struct cpl_abort_req *);
    req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_REQ));
    req->wr.wr_lo = htonl(V_WR_TID(tid));
    OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_ABORT_REQ, tid));
    req->rsvd0 = tp ?
        htonl(tp->snd_nxt) : 0;
    req->rsvd1 = !(toep->tp_flags & TP_DATASENT);
    req->cmd = mode;
    if (tp && (tp->t_state == TCPS_SYN_SENT))
        mbufq_tail(&toep->out_of_order_queue, m);   // defer
    else
        l2t_send(TOEP_T3C_DEV(toep), m, toep->tp_l2t);
}

static int
t3_ip_ctloutput(struct socket *so, struct sockopt *sopt)
{
    struct inpcb *inp;
    int error, optval;

    if (sopt->sopt_name == IP_OPTIONS)
        return (ENOPROTOOPT);

    if (sopt->sopt_name != IP_TOS)
        return (EOPNOTSUPP);

    error = sooptcopyin(sopt, &optval, sizeof optval, sizeof optval);
    if (error)
        return (error);

    if (optval > IPTOS_PREC_CRITIC_ECP)
        return (EINVAL);

    inp = so_sotoinpcb(so);
    inp_wlock(inp);
    inp_ip_tos_set(inp, optval);
#if 0
    inp->inp_ip_tos = optval;
#endif
    t3_set_tos(inp_inpcbtotcpcb(inp)->t_toe);
    inp_wunlock(inp);

    return (0);
}

static int
t3_tcp_ctloutput(struct socket *so, struct sockopt *sopt)
{
    int err = 0;
    size_t copied;

    if (sopt->sopt_name != TCP_CONGESTION &&
        sopt->sopt_name != TCP_NODELAY)
        return (EOPNOTSUPP);

    if (sopt->sopt_name == TCP_CONGESTION) {
        char name[TCP_CA_NAME_MAX];
        int optlen = sopt->sopt_valsize;
        struct tcpcb *tp;

        if (sopt->sopt_dir == SOPT_GET) {
            KASSERT(0, ("unimplemented"));
            return (EOPNOTSUPP);
        }

        if (optlen < 1)
            return (EINVAL);

        err = copyinstr(sopt->sopt_val, name,
            min(TCP_CA_NAME_MAX - 1, optlen), &copied);
        if (err)
            return (err);
        if (copied < 1)
            return (EINVAL);

        tp = so_sototcpcb(so);
        /*
         * XXX I need to revisit this
         */
        if ((err = t3_set_cong_control(so, name)) == 0) {
#ifdef CONGESTION_CONTROL_SUPPORTED
            tp->t_cong_control = strdup(name, M_CXGB);
#endif
        } else
            return (err);
    } else {
        int optval, oldval;
        struct inpcb *inp;
        struct tcpcb *tp;

        if (sopt->sopt_dir == SOPT_GET)
            return (EOPNOTSUPP);

        err = sooptcopyin(sopt, &optval, sizeof optval, sizeof optval);
        if (err)
            return (err);

        inp = so_sotoinpcb(so);
        inp_wlock(inp);
        tp = inp_inpcbtotcpcb(inp);

        oldval = tp->t_flags;
        if (optval)
            tp->t_flags |= TF_NODELAY;
        else
            tp->t_flags &= ~TF_NODELAY;
        inp_wunlock(inp);

        if (oldval != tp->t_flags && (tp->t_toe != NULL))
            t3_set_nagle(tp->t_toe);
    }

    return (0);
}

int
t3_ctloutput(struct socket *so, struct sockopt *sopt)
{
    int err;

    if (sopt->sopt_level != IPPROTO_TCP)
        err = t3_ip_ctloutput(so, sopt);
    else
        err = t3_tcp_ctloutput(so, sopt);

    if (err != EOPNOTSUPP)
        return (err);

    return (tcp_ctloutput(so, sopt));
}

/*
 * Returns true if we need to explicitly request RST when we receive new data
 * on an RX-closed connection.
 */
static inline int
need_rst_on_excess_rx(const struct toepcb *toep)
{
    return (1);
}

/*
 * Handles Rx data that arrives in a state where the socket isn't accepting
 * new data.
 */
static void
handle_excess_rx(struct toepcb *toep, struct mbuf *m)
{
    if (need_rst_on_excess_rx(toep) &&
        !(toep->tp_flags & TP_ABORT_SHUTDOWN))
        t3_send_reset(toep);
    m_freem(m);
}

/*
 * Process a get_tcb_rpl as a DDP completion (similar to RX_DDP_COMPLETE)
 * by getting the DDP offset from the TCB.
 */
static void
tcb_rpl_as_ddp_complete(struct toepcb *toep, struct mbuf *m)
{
    struct ddp_state *q = &toep->tp_ddp_state;
    struct ddp_buf_state *bsp;
    struct cpl_get_tcb_rpl *hdr;
    unsigned int ddp_offset;
    struct socket *so;
    struct tcpcb *tp;
    struct sockbuf *rcv;
    int state;

    uint64_t t;
    __be64 *tcb;

    tp = toep->tp_tp;
    so = inp_inpcbtosocket(tp->t_inpcb);

    inp_lock_assert(tp->t_inpcb);
    rcv = so_sockbuf_rcv(so);
    sockbuf_lock(rcv);

    /* Note that we only account for CPL_GET_TCB issued by the DDP code.
     * We really need a cookie in order to dispatch the RPLs.
     */
    q->get_tcb_count--;

    /* It is possible that a previous CPL already invalidated UBUF DDP
     * and moved the cur_buf idx and hence no further processing of this
     * skb is required. However, the app might be sleeping on
     * !q->get_tcb_count and we need to wake it up.
     */
    if (q->cancel_ubuf && !t3_ddp_ubuf_pending(toep)) {
        int state = so_state_get(so);

        m_freem(m);
        if (__predict_true((state & SS_NOFDREF) == 0))
            so_sorwakeup_locked(so);
        else
            sockbuf_unlock(rcv);

        return;
    }

    bsp = &q->buf_state[q->cur_buf];
    hdr = cplhdr(m);
    tcb = (__be64 *)(hdr + 1);
    if (q->cur_buf == 0) {
        t = be64toh(tcb[(31 - W_TCB_RX_DDP_BUF0_OFFSET) / 2]);
        ddp_offset = t >> (32 + S_TCB_RX_DDP_BUF0_OFFSET);
    } else {
        t = be64toh(tcb[(31 - W_TCB_RX_DDP_BUF1_OFFSET) / 2]);
        ddp_offset = t >> S_TCB_RX_DDP_BUF1_OFFSET;
    }
    ddp_offset &= M_TCB_RX_DDP_BUF0_OFFSET;
    m->m_cur_offset = bsp->cur_offset;
    bsp->cur_offset = ddp_offset;
    m->m_len = m->m_pkthdr.len = ddp_offset - m->m_cur_offset;

    CTR5(KTR_TOM,
        "tcb_rpl_as_ddp_complete: idx=%d seq=0x%x hwbuf=%u ddp_offset=%u cur_offset=%u",
        q->cur_buf, tp->rcv_nxt, q->cur_buf, ddp_offset, m->m_cur_offset);
    KASSERT(ddp_offset >= m->m_cur_offset,
        ("ddp_offset=%u less than cur_offset=%u",
        ddp_offset, m->m_cur_offset));

#if 0
    {
        unsigned int ddp_flags, rcv_nxt, rx_hdr_offset, buf_idx;

        t = be64toh(tcb[(31 - W_TCB_RX_DDP_FLAGS) / 2]);
        ddp_flags = (t >> S_TCB_RX_DDP_FLAGS) & M_TCB_RX_DDP_FLAGS;

        t = be64toh(tcb[(31 - W_TCB_RCV_NXT) / 2]);
        rcv_nxt = t >> S_TCB_RCV_NXT;
        rcv_nxt &= M_TCB_RCV_NXT;

        t = be64toh(tcb[(31 - W_TCB_RX_HDR_OFFSET) / 2]);
        rx_hdr_offset = t >> (32 + S_TCB_RX_HDR_OFFSET);
        rx_hdr_offset &= M_TCB_RX_HDR_OFFSET;

        T3_TRACE2(TIDTB(sk),
            "tcb_rpl_as_ddp_complete: DDP FLAGS 0x%x dma up to 0x%x",
            ddp_flags, rcv_nxt - rx_hdr_offset);
        T3_TRACE4(TB(q),
            "tcb_rpl_as_ddp_complete: rcvnxt 0x%x hwbuf %u cur_offset %u cancel %u",
            tp->rcv_nxt, q->cur_buf, bsp->cur_offset, q->cancel_ubuf);
        T3_TRACE3(TB(q),
            "tcb_rpl_as_ddp_complete: TCB rcvnxt 0x%x hwbuf 0x%x ddp_offset %u",
            rcv_nxt - rx_hdr_offset, ddp_flags, ddp_offset);
        T3_TRACE2(TB(q),
            "tcb_rpl_as_ddp_complete: flags0 0x%x flags1 0x%x",
            q->buf_state[0].flags, q->buf_state[1].flags);
    }
#endif
    if (__predict_false(so_no_receive(so) && m->m_pkthdr.len)) {
        handle_excess_rx(toep, m);
        return;
    }

#ifdef T3_TRACE
    if ((int)m->m_pkthdr.len < 0) {
        t3_ddp_error(so, "tcb_rpl_as_ddp_complete: neg len");
    }
#endif
    if (bsp->flags & DDP_BF_NOCOPY) {
#ifdef T3_TRACE
        T3_TRACE0(TB(q),
            "tcb_rpl_as_ddp_complete: CANCEL UBUF");

        if (!q->cancel_ubuf && !(sk->sk_shutdown & RCV_SHUTDOWN)) {
            printk("!cancel_ubuf");
            t3_ddp_error(sk, "tcb_rpl_as_ddp_complete: !cancel_ubuf");
        }
#endif
        m->m_ddp_flags = DDP_BF_PSH | DDP_BF_NOCOPY | 1;
        bsp->flags &= ~(DDP_BF_NOCOPY|DDP_BF_NODATA);
        q->cur_buf ^= 1;
    } else if (bsp->flags & DDP_BF_NOFLIP) {

        m->m_ddp_flags = 1;    /* always a kernel buffer */

        /* now HW buffer carries a user buffer */
        bsp->flags &= ~DDP_BF_NOFLIP;
        bsp->flags |= DDP_BF_NOCOPY;

        /* It is possible that the CPL_GET_TCB_RPL doesn't indicate
         * any new data in which case we're done. If in addition the
         * offset is 0, then there wasn't a completion for the kbuf
         * and we need to decrement the posted count.
         */
        if (m->m_pkthdr.len == 0) {
            if (ddp_offset == 0) {
                q->kbuf_posted--;
                bsp->flags |= DDP_BF_NODATA;
            }
            sockbuf_unlock(rcv);
            m_free(m);
            return;
        }
    } else {
        sockbuf_unlock(rcv);

        /* This reply is for a CPL_GET_TCB_RPL to cancel the UBUF DDP,
         * but it got here way late and nobody cares anymore.
         */
        m_free(m);
        return;
    }

    m->m_ddp_gl = (unsigned char *)bsp->gl;
    m->m_flags |= M_DDP;
    m->m_seq = tp->rcv_nxt;
    tp->rcv_nxt += m->m_pkthdr.len;
    tp->t_rcvtime = ticks;
    CTR3(KTR_TOM, "tcb_rpl_as_ddp_complete: seq 0x%x hwbuf %u m->m_pktlen %u",
        m->m_seq, q->cur_buf, m->m_pkthdr.len);
    if (m->m_pkthdr.len == 0) {
        q->user_ddp_pending = 0;
        m_free(m);
    } else
        SBAPPEND(rcv, m);

    state = so_state_get(so);
    if (__predict_true((state & SS_NOFDREF) == 0))
        so_sorwakeup_locked(so);
    else
        sockbuf_unlock(rcv);
}

/*
 * Process a CPL_GET_TCB_RPL. These can also be generated by the DDP code,
 * in that case they are similar to DDP completions.
 */
static int
do_get_tcb_rpl(struct t3cdev *cdev, struct mbuf *m, void *ctx)
{
    struct toepcb *toep = (struct toepcb *)ctx;

    /* OK if socket doesn't exist */
    if (toep == NULL) {
        printf("null toep in do_get_tcb_rpl\n");
        return (CPL_RET_BUF_DONE);
    }

    inp_wlock(toep->tp_tp->t_inpcb);
    tcb_rpl_as_ddp_complete(toep, m);
    inp_wunlock(toep->tp_tp->t_inpcb);

    return (0);
}

static void
handle_ddp_data(struct toepcb *toep, struct mbuf *m)
{
    struct tcpcb *tp = toep->tp_tp;
    struct socket *so;
    struct ddp_state *q;
    struct ddp_buf_state *bsp;
    struct cpl_rx_data *hdr = cplhdr(m);
    unsigned int rcv_nxt = ntohl(hdr->seq);
    struct sockbuf *rcv;

    if (tp->rcv_nxt == rcv_nxt)
        return;

    inp_lock_assert(tp->t_inpcb);
    so = inp_inpcbtosocket(tp->t_inpcb);
    rcv = so_sockbuf_rcv(so);
    sockbuf_lock(rcv);

    q = &toep->tp_ddp_state;
    bsp = &q->buf_state[q->cur_buf];
    KASSERT(SEQ_GT(rcv_nxt, tp->rcv_nxt),
        ("tp->rcv_nxt=0x%08x decreased rcv_nxt=0x%08x",
        tp->rcv_nxt, rcv_nxt));
    m->m_len = m->m_pkthdr.len = rcv_nxt - tp->rcv_nxt;
    KASSERT(m->m_len > 0, ("%s m_len=%d", __FUNCTION__, m->m_len));
    CTR3(KTR_TOM, "rcv_nxt=0x%x tp->rcv_nxt=0x%x len=%d",
        rcv_nxt, tp->rcv_nxt, m->m_pkthdr.len);

#ifdef T3_TRACE
    if ((int)m->m_pkthdr.len < 0) {
        t3_ddp_error(so, "handle_ddp_data: neg len");
    }
#endif
    m->m_ddp_gl = (unsigned char *)bsp->gl;
    m->m_flags |= M_DDP;
    m->m_cur_offset = bsp->cur_offset;
    m->m_ddp_flags = DDP_BF_PSH | (bsp->flags & DDP_BF_NOCOPY) | 1;
    if (bsp->flags & DDP_BF_NOCOPY)
        bsp->flags &= ~DDP_BF_NOCOPY;

    m->m_seq = tp->rcv_nxt;
    tp->rcv_nxt = rcv_nxt;
    bsp->cur_offset += m->m_pkthdr.len;
    if (!(bsp->flags & DDP_BF_NOFLIP))
        q->cur_buf ^= 1;
    /*
     * For now, don't re-enable DDP after a connection fell out of DDP
     * mode.
     */
    q->ubuf_ddp_ready = 0;
    sockbuf_unlock(rcv);
}

/*
 * Process new data received for a connection.
*/ static void new_rx_data(struct toepcb *toep, struct mbuf *m) { struct cpl_rx_data *hdr = cplhdr(m); struct tcpcb *tp = toep->tp_tp; struct socket *so; struct sockbuf *rcv; int state; int len = be16toh(hdr->len); inp_wlock(tp->t_inpcb); so = inp_inpcbtosocket(tp->t_inpcb); if (__predict_false(so_no_receive(so))) { handle_excess_rx(toep, m); inp_wunlock(tp->t_inpcb); TRACE_EXIT; return; } if (toep->tp_ulp_mode == ULP_MODE_TCPDDP) handle_ddp_data(toep, m); m->m_seq = ntohl(hdr->seq); m->m_ulp_mode = 0; /* for iSCSI */ #if VALIDATE_SEQ if (__predict_false(m->m_seq != tp->rcv_nxt)) { log(LOG_ERR, "%s: TID %u: Bad sequence number %u, expected %u\n", toep->tp_toedev->name, toep->tp_tid, m->m_seq, tp->rcv_nxt); m_freem(m); inp_wunlock(tp->t_inpcb); return; } #endif m_adj(m, sizeof(*hdr)); #ifdef URGENT_DATA_SUPPORTED /* * We don't handle urgent data yet */ if (__predict_false(hdr->urg)) handle_urg_ptr(so, tp->rcv_nxt + ntohs(hdr->urg)); if (__predict_false(tp->urg_data == TCP_URG_NOTYET && tp->urg_seq - tp->rcv_nxt < skb->len)) tp->urg_data = TCP_URG_VALID | skb->data[tp->urg_seq - tp->rcv_nxt]; #endif if (__predict_false(hdr->dack_mode != toep->tp_delack_mode)) { toep->tp_delack_mode = hdr->dack_mode; toep->tp_delack_seq = tp->rcv_nxt; } CTR6(KTR_TOM, "appending mbuf=%p pktlen=%d m_len=%d len=%d rcv_nxt=0x%x enqueued_bytes=%d", m, m->m_pkthdr.len, m->m_len, len, tp->rcv_nxt, toep->tp_enqueued_bytes); if (len < m->m_pkthdr.len) m->m_pkthdr.len = m->m_len = len; tp->rcv_nxt += m->m_pkthdr.len; tp->t_rcvtime = ticks; toep->tp_enqueued_bytes += m->m_pkthdr.len; CTR2(KTR_TOM, "new_rx_data: seq 0x%x len %u", m->m_seq, m->m_pkthdr.len); inp_wunlock(tp->t_inpcb); rcv = so_sockbuf_rcv(so); sockbuf_lock(rcv); #if 0 if (sb_notify(rcv)) DPRINTF("rx_data so=%p flags=0x%x len=%d\n", so, rcv->sb_flags, m->m_pkthdr.len); #endif SBAPPEND(rcv, m); #ifdef notyet /* * We're giving too many credits to the card - but disable this check so we can keep on moving :-| * */ KASSERT(rcv->sb_cc < (rcv->sb_mbmax << 1), ("so=%p, data contents exceed mbmax, sb_cc=%d sb_mbmax=%d", so, rcv->sb_cc, rcv->sb_mbmax)); #endif CTR2(KTR_TOM, "sb_cc=%d sb_mbcnt=%d", rcv->sb_cc, rcv->sb_mbcnt); state = so_state_get(so); if (__predict_true((state & SS_NOFDREF) == 0)) so_sorwakeup_locked(so); else sockbuf_unlock(rcv); } /* * Handler for RX_DATA CPL messages. 
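 *
 * Like the other CPL handlers registered by this module, it returns 0
 * once it has taken ownership of the mbuf; a handler that cannot
 * process the message returns CPL_RET_BUF_DONE instead so that the
 * dispatcher frees the buffer (compare do_get_tcb_rpl above).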
 */
static int
do_rx_data(struct t3cdev *cdev, struct mbuf *m, void *ctx)
{
    struct toepcb *toep = (struct toepcb *)ctx;

    DPRINTF("rx_data len=%d\n", m->m_pkthdr.len);

    new_rx_data(toep, m);

    return (0);
}

static void
new_rx_data_ddp(struct toepcb *toep, struct mbuf *m)
{
    struct tcpcb *tp;
    struct ddp_state *q;
    struct ddp_buf_state *bsp;
    struct cpl_rx_data_ddp *hdr;
    struct socket *so;
    unsigned int ddp_len, rcv_nxt, ddp_report, end_offset, buf_idx;
    int nomoredata = 0;
    unsigned int delack_mode;
    struct sockbuf *rcv;

    tp = toep->tp_tp;
    inp_wlock(tp->t_inpcb);
    so = inp_inpcbtosocket(tp->t_inpcb);

    if (__predict_false(so_no_receive(so))) {
        handle_excess_rx(toep, m);
        inp_wunlock(tp->t_inpcb);
        return;
    }

    q = &toep->tp_ddp_state;
    hdr = cplhdr(m);
    ddp_report = ntohl(hdr->u.ddp_report);
    buf_idx = (ddp_report >> S_DDP_BUF_IDX) & 1;
    bsp = &q->buf_state[buf_idx];

    CTR4(KTR_TOM,
        "new_rx_data_ddp: tp->rcv_nxt 0x%x cur_offset %u "
        "hdr seq 0x%x len %u",
        tp->rcv_nxt, bsp->cur_offset, ntohl(hdr->seq),
        ntohs(hdr->len));
    CTR3(KTR_TOM,
        "new_rx_data_ddp: offset %u ddp_report 0x%x buf_idx=%d",
        G_DDP_OFFSET(ddp_report), ddp_report, buf_idx);

    ddp_len = ntohs(hdr->len);
    rcv_nxt = ntohl(hdr->seq) + ddp_len;

    delack_mode = G_DDP_DACK_MODE(ddp_report);
    if (__predict_false(G_DDP_DACK_MODE(ddp_report) != toep->tp_delack_mode)) {
        toep->tp_delack_mode = delack_mode;
        toep->tp_delack_seq = tp->rcv_nxt;
    }
    m->m_seq = tp->rcv_nxt;
    tp->rcv_nxt = rcv_nxt;

    tp->t_rcvtime = ticks;
    /*
     * Store the length in m->m_len. We are changing the meaning of
     * m->m_len here, we need to be very careful that nothing from now on
     * interprets ->len of this packet the usual way.
     */
    m->m_len = m->m_pkthdr.len = rcv_nxt - m->m_seq;
    inp_wunlock(tp->t_inpcb);
    CTR3(KTR_TOM,
        "new_rx_data_ddp: m_len=%u rcv_nxt 0x%08x rcv_nxt_prev=0x%08x ",
        m->m_len, rcv_nxt, m->m_seq);
    /*
     * Figure out where the new data was placed in the buffer and store it
     * in m->m_cur_offset. Assumes the buffer offset starts at 0, consumer
     * needs to account for page pod's pg_offset.
     */
    end_offset = G_DDP_OFFSET(ddp_report) + ddp_len;
    m->m_cur_offset = end_offset - m->m_pkthdr.len;

    rcv = so_sockbuf_rcv(so);
    sockbuf_lock(rcv);

    m->m_ddp_gl = (unsigned char *)bsp->gl;
    m->m_flags |= M_DDP;
    bsp->cur_offset = end_offset;
    toep->tp_enqueued_bytes += m->m_pkthdr.len;

    /*
     * Length is only meaningful for kbuf
     */
    if (!(bsp->flags & DDP_BF_NOCOPY))
        KASSERT(m->m_len <= bsp->gl->dgl_length,
            ("length received exceeds ddp pages: len=%d dgl_length=%d",
            m->m_len, bsp->gl->dgl_length));

    KASSERT(m->m_len > 0, ("%s m_len=%d", __FUNCTION__, m->m_len));
    KASSERT(m->m_next == NULL, ("m_len=%p", m->m_next));
    /*
     * Bit 0 of flags stores whether the DDP buffer is completed.
     * Note that other parts of the code depend on this being in bit 0.
*/ if ((bsp->flags & DDP_BF_NOINVAL) && end_offset != bsp->gl->dgl_length) { panic("spurious ddp completion"); } else { m->m_ddp_flags = !!(ddp_report & F_DDP_BUF_COMPLETE); if (m->m_ddp_flags && !(bsp->flags & DDP_BF_NOFLIP)) q->cur_buf ^= 1; /* flip buffers */ } if (bsp->flags & DDP_BF_NOCOPY) { m->m_ddp_flags |= (bsp->flags & DDP_BF_NOCOPY); bsp->flags &= ~DDP_BF_NOCOPY; } if (ddp_report & F_DDP_PSH) m->m_ddp_flags |= DDP_BF_PSH; if (nomoredata) m->m_ddp_flags |= DDP_BF_NODATA; #ifdef notyet skb_reset_transport_header(skb); tcp_hdr(skb)->fin = 0; /* changes original hdr->ddp_report */ #endif SBAPPEND(rcv, m); if ((so_state_get(so) & SS_NOFDREF) == 0 && ((ddp_report & F_DDP_PSH) || (((m->m_ddp_flags & (DDP_BF_NOCOPY|1)) == (DDP_BF_NOCOPY|1)) || !(m->m_ddp_flags & DDP_BF_NOCOPY)))) so_sorwakeup_locked(so); else sockbuf_unlock(rcv); } #define DDP_ERR (F_DDP_PPOD_MISMATCH | F_DDP_LLIMIT_ERR | F_DDP_ULIMIT_ERR |\ F_DDP_PPOD_PARITY_ERR | F_DDP_PADDING_ERR | F_DDP_OFFSET_ERR |\ F_DDP_INVALID_TAG | F_DDP_COLOR_ERR | F_DDP_TID_MISMATCH |\ F_DDP_INVALID_PPOD) /* * Handler for RX_DATA_DDP CPL messages. */ static int do_rx_data_ddp(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct toepcb *toep = ctx; const struct cpl_rx_data_ddp *hdr = cplhdr(m); VALIDATE_SOCK(so); if (__predict_false(ntohl(hdr->ddpvld_status) & DDP_ERR)) { log(LOG_ERR, "RX_DATA_DDP for TID %u reported error 0x%x\n", GET_TID(hdr), G_DDP_VALID(ntohl(hdr->ddpvld_status))); return (CPL_RET_BUF_DONE); } #if 0 skb->h.th = tcphdr_skb->h.th; #endif new_rx_data_ddp(toep, m); return (0); } static void process_ddp_complete(struct toepcb *toep, struct mbuf *m) { struct tcpcb *tp = toep->tp_tp; struct socket *so; struct ddp_state *q; struct ddp_buf_state *bsp; struct cpl_rx_ddp_complete *hdr; unsigned int ddp_report, buf_idx, when, delack_mode; int nomoredata = 0; struct sockbuf *rcv; inp_wlock(tp->t_inpcb); so = inp_inpcbtosocket(tp->t_inpcb); if (__predict_false(so_no_receive(so))) { struct inpcb *inp = so_sotoinpcb(so); handle_excess_rx(toep, m); inp_wunlock(inp); return; } q = &toep->tp_ddp_state; hdr = cplhdr(m); ddp_report = ntohl(hdr->ddp_report); buf_idx = (ddp_report >> S_DDP_BUF_IDX) & 1; m->m_pkthdr.csum_data = tp->rcv_nxt; rcv = so_sockbuf_rcv(so); sockbuf_lock(rcv); bsp = &q->buf_state[buf_idx]; when = bsp->cur_offset; m->m_len = m->m_pkthdr.len = G_DDP_OFFSET(ddp_report) - when; tp->rcv_nxt += m->m_len; tp->t_rcvtime = ticks; delack_mode = G_DDP_DACK_MODE(ddp_report); if (__predict_false(G_DDP_DACK_MODE(ddp_report) != toep->tp_delack_mode)) { toep->tp_delack_mode = delack_mode; toep->tp_delack_seq = tp->rcv_nxt; } #ifdef notyet skb_reset_transport_header(skb); tcp_hdr(skb)->fin = 0; /* changes valid memory past CPL */ #endif inp_wunlock(tp->t_inpcb); KASSERT(m->m_len >= 0, ("%s m_len=%d", __FUNCTION__, m->m_len)); CTR5(KTR_TOM, "process_ddp_complete: tp->rcv_nxt 0x%x cur_offset %u " "ddp_report 0x%x offset %u, len %u", tp->rcv_nxt, bsp->cur_offset, ddp_report, G_DDP_OFFSET(ddp_report), m->m_len); m->m_cur_offset = bsp->cur_offset; bsp->cur_offset += m->m_len; if (!(bsp->flags & DDP_BF_NOFLIP)) { q->cur_buf ^= 1; /* flip buffers */ if (G_DDP_OFFSET(ddp_report) < q->kbuf[0]->dgl_length) nomoredata=1; } CTR4(KTR_TOM, "process_ddp_complete: tp->rcv_nxt 0x%x cur_offset %u " "ddp_report %u offset %u", tp->rcv_nxt, bsp->cur_offset, ddp_report, G_DDP_OFFSET(ddp_report)); m->m_ddp_gl = (unsigned char *)bsp->gl; m->m_flags |= M_DDP; m->m_ddp_flags = (bsp->flags & DDP_BF_NOCOPY) | 1; if (bsp->flags & DDP_BF_NOCOPY) bsp->flags &= 
~DDP_BF_NOCOPY; if (nomoredata) m->m_ddp_flags |= DDP_BF_NODATA; SBAPPEND(rcv, m); if ((so_state_get(so) & SS_NOFDREF) == 0) so_sorwakeup_locked(so); else sockbuf_unlock(rcv); } /* * Handler for RX_DDP_COMPLETE CPL messages. */ static int do_rx_ddp_complete(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct toepcb *toep = ctx; VALIDATE_SOCK(so); #if 0 skb->h.th = tcphdr_skb->h.th; #endif process_ddp_complete(toep, m); return (0); } /* * Move a socket to TIME_WAIT state. We need to make some adjustments to the * socket state before calling tcp_time_wait to comply with its expectations. */ static void enter_timewait(struct tcpcb *tp) { /* * Bump rcv_nxt for the peer FIN. We don't do this at the time we * process peer_close because we don't want to carry the peer FIN in * the socket's receive queue and if we increment rcv_nxt without * having the FIN in the receive queue we'll confuse facilities such * as SIOCINQ. */ inp_wlock(tp->t_inpcb); tp->rcv_nxt++; tp->ts_recent_age = 0; /* defeat recycling */ tp->t_srtt = 0; /* defeat tcp_update_metrics */ inp_wunlock(tp->t_inpcb); tcp_offload_twstart(tp); } /* * For TCP DDP a PEER_CLOSE may also be an implicit RX_DDP_COMPLETE. This * function deals with the data that may be reported along with the FIN. * Returns -1 if no further processing of the PEER_CLOSE is needed, >= 0 to * perform normal FIN-related processing. In the latter case 1 indicates that * there was an implicit RX_DDP_COMPLETE and the skb should not be freed, 0 the * skb can be freed. */ static int handle_peer_close_data(struct socket *so, struct mbuf *m) { struct tcpcb *tp = so_sototcpcb(so); struct toepcb *toep = tp->t_toe; struct ddp_state *q; struct ddp_buf_state *bsp; struct cpl_peer_close *req = cplhdr(m); unsigned int rcv_nxt = ntohl(req->rcv_nxt) - 1; /* exclude FIN */ struct sockbuf *rcv; if (tp->rcv_nxt == rcv_nxt) /* no data */ return (0); CTR0(KTR_TOM, "handle_peer_close_data"); if (__predict_false(so_no_receive(so))) { handle_excess_rx(toep, m); /* * Although we discard the data we want to process the FIN so * that PEER_CLOSE + data behaves the same as RX_DATA_DDP + * PEER_CLOSE without data. In particular this PEER_CLOSE * may be what will close the connection. We return 1 because * handle_excess_rx() already freed the packet. */ return (1); } inp_lock_assert(tp->t_inpcb); q = &toep->tp_ddp_state; rcv = so_sockbuf_rcv(so); sockbuf_lock(rcv); bsp = &q->buf_state[q->cur_buf]; m->m_len = m->m_pkthdr.len = rcv_nxt - tp->rcv_nxt; KASSERT(m->m_len > 0, ("%s m_len=%d", __FUNCTION__, m->m_len)); m->m_ddp_gl = (unsigned char *)bsp->gl; m->m_flags |= M_DDP; m->m_cur_offset = bsp->cur_offset; m->m_ddp_flags = DDP_BF_PSH | (bsp->flags & DDP_BF_NOCOPY) | 1; m->m_seq = tp->rcv_nxt; tp->rcv_nxt = rcv_nxt; bsp->cur_offset += m->m_pkthdr.len; if (!(bsp->flags & DDP_BF_NOFLIP)) q->cur_buf ^= 1; #ifdef notyet skb_reset_transport_header(skb); tcp_hdr(skb)->fin = 0; /* changes valid memory past CPL */ #endif tp->t_rcvtime = ticks; SBAPPEND(rcv, m); if (__predict_true((so_state_get(so) & SS_NOFDREF) == 0)) so_sorwakeup_locked(so); else sockbuf_unlock(rcv); return (1); } /* * Handle a peer FIN. 
*/ static void do_peer_fin(struct toepcb *toep, struct mbuf *m) { struct socket *so; struct tcpcb *tp = toep->tp_tp; int keep, action; action = keep = 0; CTR1(KTR_TOM, "do_peer_fin state=%d", tp->t_state); if (!is_t3a(toep->tp_toedev) && (toep->tp_flags & TP_ABORT_RPL_PENDING)) { printf("abort_pending set\n"); goto out; } inp_wlock(tp->t_inpcb); so = inp_inpcbtosocket(toep->tp_tp->t_inpcb); if (toep->tp_ulp_mode == ULP_MODE_TCPDDP) { keep = handle_peer_close_data(so, m); if (keep < 0) { inp_wunlock(tp->t_inpcb); return; } } if (TCPS_HAVERCVDFIN(tp->t_state) == 0) { CTR1(KTR_TOM, "waking up waiters for cantrcvmore on %p ", so); socantrcvmore(so); /* * If connection is half-synchronized * (ie NEEDSYN flag on) then delay ACK, * so it may be piggybacked when SYN is sent. * Otherwise, since we received a FIN then no * more input can be expected, send ACK now. */ if (tp->t_flags & TF_NEEDSYN) tp->t_flags |= TF_DELACK; else tp->t_flags |= TF_ACKNOW; tp->rcv_nxt++; } switch (tp->t_state) { case TCPS_SYN_RECEIVED: tp->t_starttime = ticks; /* FALLTHROUGH */ case TCPS_ESTABLISHED: tp->t_state = TCPS_CLOSE_WAIT; break; case TCPS_FIN_WAIT_1: tp->t_state = TCPS_CLOSING; break; case TCPS_FIN_WAIT_2: /* * If we've sent an abort_req we must have sent it too late, * HW will send us a reply telling us so, and this peer_close * is really the last message for this connection and needs to * be treated as an abort_rpl, i.e., transition the connection * to TCP_CLOSE (note that the host stack does this at the * time of generating the RST but we must wait for HW). * Otherwise we enter TIME_WAIT. */ t3_release_offload_resources(toep); if (toep->tp_flags & TP_ABORT_RPL_PENDING) { action = TCP_CLOSE; } else { action = TCP_TIMEWAIT; } break; default: log(LOG_ERR, "%s: TID %u received PEER_CLOSE in bad state %d\n", toep->tp_toedev->tod_name, toep->tp_tid, tp->t_state); } inp_wunlock(tp->t_inpcb); if (action == TCP_TIMEWAIT) { enter_timewait(tp); } else if (action == TCP_DROP) { tcp_offload_drop(tp, 0); } else if (action == TCP_CLOSE) { tcp_offload_close(tp); } #ifdef notyet /* Do not send POLL_HUP for half duplex close. */ if ((sk->sk_shutdown & SEND_SHUTDOWN) || sk->sk_state == TCP_CLOSE) sk_wake_async(so, 1, POLL_HUP); else sk_wake_async(so, 1, POLL_IN); #endif out: if (!keep) m_free(m); } /* * Handler for PEER_CLOSE CPL messages. */ static int do_peer_close(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct toepcb *toep = (struct toepcb *)ctx; VALIDATE_SOCK(so); do_peer_fin(toep, m); return (0); } static void process_close_con_rpl(struct toepcb *toep, struct mbuf *m) { struct cpl_close_con_rpl *rpl = cplhdr(m); struct tcpcb *tp = toep->tp_tp; struct socket *so; int action = 0; struct sockbuf *rcv; inp_wlock(tp->t_inpcb); so = inp_inpcbtosocket(tp->t_inpcb); tp->snd_una = ntohl(rpl->snd_nxt) - 1; /* exclude FIN */ if (!is_t3a(toep->tp_toedev) && (toep->tp_flags & TP_ABORT_RPL_PENDING)) { inp_wunlock(tp->t_inpcb); goto out; } CTR3(KTR_TOM, "process_close_con_rpl(%p) state=%d dead=%d", toep, tp->t_state, !!(so_state_get(so) & SS_NOFDREF)); switch (tp->t_state) { case TCPS_CLOSING: /* see FIN_WAIT2 case in do_peer_fin */ t3_release_offload_resources(toep); if (toep->tp_flags & TP_ABORT_RPL_PENDING) { action = TCP_CLOSE; } else { action = TCP_TIMEWAIT; } break; case TCPS_LAST_ACK: /* * In this state we don't care about pending abort_rpl. * If we've sent abort_req it was post-close and was sent too * late, this close_con_rpl is the actual last message. 
*/ t3_release_offload_resources(toep); action = TCP_CLOSE; break; case TCPS_FIN_WAIT_1: /* * If we can't receive any more * data, then closing user can proceed. * Starting the timer is contrary to the * specification, but if we don't get a FIN * we'll hang forever. * * XXXjl: * we should release the tp also, and use a * compressed state. */ if (so) rcv = so_sockbuf_rcv(so); else break; if (rcv->sb_state & SBS_CANTRCVMORE) { int timeout; if (so) soisdisconnected(so); timeout = (tcp_fast_finwait2_recycle) ? tcp_finwait2_timeout : tcp_maxidle; tcp_timer_activate(tp, TT_2MSL, timeout); } tp->t_state = TCPS_FIN_WAIT_2; if ((so_options_get(so) & SO_LINGER) && so_linger_get(so) == 0 && (toep->tp_flags & TP_ABORT_SHUTDOWN) == 0) { action = TCP_DROP; } break; default: log(LOG_ERR, "%s: TID %u received CLOSE_CON_RPL in bad state %d\n", toep->tp_toedev->tod_name, toep->tp_tid, tp->t_state); } inp_wunlock(tp->t_inpcb); if (action == TCP_TIMEWAIT) { enter_timewait(tp); } else if (action == TCP_DROP) { tcp_offload_drop(tp, 0); } else if (action == TCP_CLOSE) { tcp_offload_close(tp); } out: m_freem(m); } /* * Handler for CLOSE_CON_RPL CPL messages. */ static int do_close_con_rpl(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct toepcb *toep = (struct toepcb *)ctx; process_close_con_rpl(toep, m); return (0); } /* * Process abort replies. We only process these messages if we anticipate * them as the coordination between SW and HW in this area is somewhat lacking * and sometimes we get ABORT_RPLs after we are done with the connection that * originated the ABORT_REQ. */ static void process_abort_rpl(struct toepcb *toep, struct mbuf *m) { struct tcpcb *tp = toep->tp_tp; struct socket *so; int needclose = 0; #ifdef T3_TRACE T3_TRACE1(TIDTB(sk), "process_abort_rpl: GTS rpl pending %d", sock_flag(sk, ABORT_RPL_PENDING)); #endif inp_wlock(tp->t_inpcb); so = inp_inpcbtosocket(tp->t_inpcb); if (toep->tp_flags & TP_ABORT_RPL_PENDING) { /* * XXX panic on tcpdrop */ if (!(toep->tp_flags & TP_ABORT_RPL_RCVD) && !is_t3a(toep->tp_toedev)) toep->tp_flags |= TP_ABORT_RPL_RCVD; else { toep->tp_flags &= ~(TP_ABORT_RPL_RCVD|TP_ABORT_RPL_PENDING); if (!(toep->tp_flags & TP_ABORT_REQ_RCVD) || !is_t3a(toep->tp_toedev)) { if (toep->tp_flags & TP_ABORT_REQ_RCVD) panic("TP_ABORT_REQ_RCVD set"); t3_release_offload_resources(toep); needclose = 1; } } } inp_wunlock(tp->t_inpcb); if (needclose) tcp_offload_close(tp); m_free(m); } /* * Handle an ABORT_RPL_RSS CPL message. */ static int do_abort_rpl(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct cpl_abort_rpl_rss *rpl = cplhdr(m); struct toepcb *toep; /* * Ignore replies to post-close aborts indicating that the abort was * requested too late. These connections are terminated when we get * PEER_CLOSE or CLOSE_CON_RPL and by the time the abort_rpl_rss * arrives the TID is either no longer used or it has been recycled. */ if (rpl->status == CPL_ERR_ABORT_FAILED) { discard: m_free(m); return (0); } toep = (struct toepcb *)ctx; /* * Sometimes we've already closed the socket, e.g., a post-close * abort races with ABORT_REQ_RSS, the latter frees the socket * expecting the ABORT_REQ will fail with CPL_ERR_ABORT_FAILED, * but FW turns the ABORT_REQ into a regular one and so we get * ABORT_RPL_RSS with status 0 and no socket. Only on T3A. 
*/ if (!toep) goto discard; if (toep->tp_tp == NULL) { log(LOG_NOTICE, "removing tid for abort\n"); cxgb_remove_tid(cdev, toep, toep->tp_tid); if (toep->tp_l2t) l2t_release(L2DATA(cdev), toep->tp_l2t); toepcb_release(toep); goto discard; } log(LOG_NOTICE, "toep=%p\n", toep); log(LOG_NOTICE, "tp=%p\n", toep->tp_tp); toepcb_hold(toep); process_abort_rpl(toep, m); toepcb_release(toep); return (0); } /* * Convert the status code of an ABORT_REQ into a FreeBSD error code. Also * indicate whether RST should be sent in response. */ static int abort_status_to_errno(struct socket *so, int abort_reason, int *need_rst) { struct tcpcb *tp = so_sototcpcb(so); switch (abort_reason) { case CPL_ERR_BAD_SYN: #if 0 NET_INC_STATS_BH(LINUX_MIB_TCPABORTONSYN); // fall through #endif case CPL_ERR_CONN_RESET: // XXX need to handle SYN_RECV due to crossed SYNs return (tp->t_state == TCPS_CLOSE_WAIT ? EPIPE : ECONNRESET); case CPL_ERR_XMIT_TIMEDOUT: case CPL_ERR_PERSIST_TIMEDOUT: case CPL_ERR_FINWAIT2_TIMEDOUT: case CPL_ERR_KEEPALIVE_TIMEDOUT: #if 0 NET_INC_STATS_BH(LINUX_MIB_TCPABORTONTIMEOUT); #endif return (ETIMEDOUT); default: return (EIO); } } static inline void set_abort_rpl_wr(struct mbuf *m, unsigned int tid, int cmd) { struct cpl_abort_rpl *rpl = cplhdr(m); rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_OFLD_HOST_ABORT_CON_RPL)); rpl->wr.wr_lo = htonl(V_WR_TID(tid)); m->m_len = m->m_pkthdr.len = sizeof(*rpl); OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_ABORT_RPL, tid)); rpl->cmd = cmd; } static void send_deferred_abort_rpl(struct toedev *tdev, struct mbuf *m) { struct mbuf *reply_mbuf; struct cpl_abort_req_rss *req = cplhdr(m); reply_mbuf = m_gethdr_nofail(sizeof(struct cpl_abort_rpl)); m_set_priority(m, CPL_PRIORITY_DATA); m->m_len = m->m_pkthdr.len = sizeof(struct cpl_abort_rpl); set_abort_rpl_wr(reply_mbuf, GET_TID(req), req->status); cxgb_ofld_send(TOM_DATA(tdev)->cdev, reply_mbuf); m_free(m); } /* * Returns whether an ABORT_REQ_RSS message is a negative advice. */ static inline int is_neg_adv_abort(unsigned int status) { return status == CPL_ERR_RTX_NEG_ADVICE || status == CPL_ERR_PERSIST_NEG_ADVICE; } static void send_abort_rpl(struct mbuf *m, struct toedev *tdev, int rst_status) { struct mbuf *reply_mbuf; struct cpl_abort_req_rss *req = cplhdr(m); reply_mbuf = m_gethdr(M_NOWAIT, MT_DATA); if (!reply_mbuf) { /* Defer the reply. Stick rst_status into req->cmd. */ req->status = rst_status; t3_defer_reply(m, tdev, send_deferred_abort_rpl); return; } m_set_priority(reply_mbuf, CPL_PRIORITY_DATA); set_abort_rpl_wr(reply_mbuf, GET_TID(req), rst_status); m_free(m); /* * XXX need to sync with ARP as for SYN_RECV connections we can send * these messages while ARP is pending. For other connection states * it's not a problem. */ cxgb_ofld_send(TOM_DATA(tdev)->cdev, reply_mbuf); } #ifdef notyet static void cleanup_syn_rcv_conn(struct socket *child, struct socket *parent) { CXGB_UNIMPLEMENTED(); #ifdef notyet struct request_sock *req = child->sk_user_data; inet_csk_reqsk_queue_removed(parent, req); synq_remove(tcp_sk(child)); __reqsk_free(req); child->sk_user_data = NULL; #endif } /* * Performs the actual work to abort a SYN_RECV connection. */ static void do_abort_syn_rcv(struct socket *child, struct socket *parent) { struct tcpcb *parenttp = so_sototcpcb(parent); struct tcpcb *childtp = so_sototcpcb(child); /* * If the server is still open we clean up the child connection, * otherwise the server already did the clean up as it was purging * its SYN queue and the skb was just sitting in its backlog. 
*/ if (__predict_false(parenttp->t_state == TCPS_LISTEN)) { cleanup_syn_rcv_conn(child, parent); inp_wlock(childtp->t_inpcb); t3_release_offload_resources(childtp->t_toe); inp_wunlock(childtp->t_inpcb); tcp_offload_close(childtp); } } #endif /* * Handle abort requests for a SYN_RECV connection. These need extra work * because the socket is on its parent's SYN queue. */ static int abort_syn_rcv(struct socket *so, struct mbuf *m) { CXGB_UNIMPLEMENTED(); #ifdef notyet struct socket *parent; struct toedev *tdev = toep->tp_toedev; struct t3cdev *cdev = TOM_DATA(tdev)->cdev; struct socket *oreq = so->so_incomp; struct t3c_tid_entry *t3c_stid; struct tid_info *t; if (!oreq) return -1; /* somehow we are not on the SYN queue */ t = &(T3C_DATA(cdev))->tid_maps; t3c_stid = lookup_stid(t, oreq->ts_recent); parent = ((struct listen_ctx *)t3c_stid->ctx)->lso; so_lock(parent); do_abort_syn_rcv(so, parent); send_abort_rpl(m, tdev, CPL_ABORT_NO_RST); so_unlock(parent); #endif return (0); } /* * Process abort requests. If we are waiting for an ABORT_RPL we ignore this * request except that we need to reply to it. */ static void process_abort_req(struct toepcb *toep, struct mbuf *m, struct toedev *tdev) { int rst_status = CPL_ABORT_NO_RST; const struct cpl_abort_req_rss *req = cplhdr(m); struct tcpcb *tp = toep->tp_tp; struct socket *so; int needclose = 0; inp_wlock(tp->t_inpcb); so = inp_inpcbtosocket(toep->tp_tp->t_inpcb); if ((toep->tp_flags & TP_ABORT_REQ_RCVD) == 0) { toep->tp_flags |= (TP_ABORT_REQ_RCVD|TP_ABORT_SHUTDOWN); m_free(m); goto skip; } toep->tp_flags &= ~TP_ABORT_REQ_RCVD; /* * Three cases to consider: * a) We haven't sent an abort_req; close the connection. * b) We have sent a post-close abort_req that will get to TP too late * and will generate a CPL_ERR_ABORT_FAILED reply. The reply will * be ignored and the connection should be closed now. * c) We have sent a regular abort_req that will get to TP too late. * That will generate an abort_rpl with status 0, wait for it. */ if (((toep->tp_flags & TP_ABORT_RPL_PENDING) == 0) || (is_t3a(toep->tp_toedev) && (toep->tp_flags & TP_CLOSE_CON_REQUESTED))) { int error; error = abort_status_to_errno(so, req->status, &rst_status); so_error_set(so, error); if (__predict_true((so_state_get(so) & SS_NOFDREF) == 0)) so_sorwakeup(so); /* * SYN_RECV needs special processing. If abort_syn_rcv() * returns 0 it has taken care of the abort. */ if ((tp->t_state == TCPS_SYN_RECEIVED) && !abort_syn_rcv(so, m)) goto skip; t3_release_offload_resources(toep); needclose = 1; } inp_wunlock(tp->t_inpcb); if (needclose) tcp_offload_close(tp); send_abort_rpl(m, tdev, rst_status); return; skip: inp_wunlock(tp->t_inpcb); } /* * Handle an ABORT_REQ_RSS CPL message.
*/ static int do_abort_req(struct t3cdev *cdev, struct mbuf *m, void *ctx) { const struct cpl_abort_req_rss *req = cplhdr(m); struct toepcb *toep = (struct toepcb *)ctx; if (is_neg_adv_abort(req->status)) { m_free(m); return (0); } log(LOG_NOTICE, "aborting tid=%d\n", toep->tp_tid); if ((toep->tp_flags & (TP_SYN_RCVD|TP_ABORT_REQ_RCVD)) == TP_SYN_RCVD) { cxgb_remove_tid(cdev, toep, toep->tp_tid); toep->tp_flags |= TP_ABORT_REQ_RCVD; send_abort_rpl(m, toep->tp_toedev, CPL_ABORT_NO_RST); if (toep->tp_l2t) l2t_release(L2DATA(cdev), toep->tp_l2t); /* * Unhook */ toep->tp_tp->t_toe = NULL; toep->tp_tp->t_flags &= ~TF_TOE; toep->tp_tp = NULL; /* * XXX need to call syncache_chkrst - but we don't * have a way of doing that yet */ toepcb_release(toep); log(LOG_ERR, "abort for unestablished connection :-(\n"); return (0); } if (toep->tp_tp == NULL) { log(LOG_NOTICE, "disconnected toepcb\n"); /* should be freed momentarily */ return (0); } toepcb_hold(toep); process_abort_req(toep, m, toep->tp_toedev); toepcb_release(toep); return (0); } #ifdef notyet static void pass_open_abort(struct socket *child, struct socket *parent, struct mbuf *m) { struct toedev *tdev = TOE_DEV(parent); do_abort_syn_rcv(child, parent); if (tdev->tod_ttid == TOE_ID_CHELSIO_T3) { struct cpl_pass_accept_rpl *rpl = cplhdr(m); rpl->opt0h = htonl(F_TCAM_BYPASS); rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); cxgb_ofld_send(TOM_DATA(tdev)->cdev, m); } else m_free(m); } #endif static void handle_pass_open_arp_failure(struct socket *so, struct mbuf *m) { CXGB_UNIMPLEMENTED(); #ifdef notyet struct t3cdev *cdev; struct socket *parent; struct socket *oreq; struct t3c_tid_entry *t3c_stid; struct tid_info *t; struct tcpcb *otp, *tp = so_sototcpcb(so); struct toepcb *toep = tp->t_toe; /* * If the connection is being aborted due to the parent listening * socket going away there's nothing to do, the ABORT_REQ will close * the connection. */ if (toep->tp_flags & TP_ABORT_RPL_PENDING) { m_free(m); return; } oreq = so->so_incomp; otp = so_sototcpcb(oreq); cdev = T3C_DEV(so); t = &(T3C_DATA(cdev))->tid_maps; t3c_stid = lookup_stid(t, otp->ts_recent); parent = ((struct listen_ctx *)t3c_stid->ctx)->lso; so_lock(parent); pass_open_abort(so, parent, m); so_unlock(parent); #endif } /* * Handle an ARP failure for a CPL_PASS_ACCEPT_RPL. This is treated similarly * to an ABORT_REQ_RSS in SYN_RECV as both events need to tear down a SYN_RECV * connection. */ static void pass_accept_rpl_arp_failure(struct t3cdev *cdev, struct mbuf *m) { #ifdef notyet TCP_INC_STATS_BH(TCP_MIB_ATTEMPTFAILS); BLOG_SKB_CB(skb)->dev = TOE_DEV(skb->sk); #endif handle_pass_open_arp_failure(m_get_socket(m), m); } /* * Populate a reject CPL_PASS_ACCEPT_RPL WR. */ static void mk_pass_accept_rpl(struct mbuf *reply_mbuf, struct mbuf *req_mbuf) { struct cpl_pass_accept_req *req = cplhdr(req_mbuf); struct cpl_pass_accept_rpl *rpl = cplhdr(reply_mbuf); unsigned int tid = GET_TID(req); m_set_priority(reply_mbuf, CPL_PRIORITY_SETUP); rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, tid)); rpl->peer_ip = req->peer_ip; // req->peer_ip not overwritten yet rpl->opt0h = htonl(F_TCAM_BYPASS); rpl->opt0l_status = htonl(CPL_PASS_OPEN_REJECT); rpl->opt2 = 0; rpl->rsvd = rpl->opt2; /* workaround for HW bug */ } /* * Send a deferred reject to an accept request. 
*/ static void reject_pass_request(struct toedev *tdev, struct mbuf *m) { struct mbuf *reply_mbuf; reply_mbuf = m_gethdr_nofail(sizeof(struct cpl_pass_accept_rpl)); mk_pass_accept_rpl(reply_mbuf, m); cxgb_ofld_send(TOM_DATA(tdev)->cdev, reply_mbuf); m_free(m); } static void handle_syncache_event(int event, void *arg) { struct toepcb *toep = arg; switch (event) { case TOE_SC_ENTRY_PRESENT: /* * entry already exists - free toepcb * and l2t */ printf("syncache entry present\n"); toepcb_release(toep); break; case TOE_SC_DROP: /* * The syncache has given up on this entry * either it timed out, or it was evicted * we need to explicitly release the tid */ printf("syncache entry dropped\n"); toepcb_release(toep); break; default: log(LOG_ERR, "unknown syncache event %d\n", event); break; } } static void syncache_add_accept_req(struct cpl_pass_accept_req *req, struct socket *lso, struct toepcb *toep) { struct in_conninfo inc; - struct tcpopt to; + struct toeopt toeo; struct tcphdr th; struct inpcb *inp; int mss, wsf, sack, ts; uint32_t rcv_isn = ntohl(req->rcv_isn); - bzero(&to, sizeof(struct tcpopt)); + bzero(&toeo, sizeof(struct toeopt)); inp = so_sotoinpcb(lso); /* * Fill out information for entering us into the syncache */ bzero(&inc, sizeof(inc)); inc.inc_fport = th.th_sport = req->peer_port; inc.inc_lport = th.th_dport = req->local_port; th.th_seq = req->rcv_isn; th.th_flags = TH_SYN; toep->tp_iss = toep->tp_delack_seq = toep->tp_rcv_wup = toep->tp_copied_seq = rcv_isn + 1; inc.inc_len = 0; inc.inc_faddr.s_addr = req->peer_ip; inc.inc_laddr.s_addr = req->local_ip; DPRINTF("syncache add of %d:%d %d:%d\n", ntohl(req->local_ip), ntohs(req->local_port), ntohl(req->peer_ip), ntohs(req->peer_port)); mss = req->tcp_options.mss; wsf = req->tcp_options.wsf; ts = req->tcp_options.tstamp; sack = req->tcp_options.sack; - to.to_mss = mss; - to.to_wscale = wsf; - to.to_flags = (mss ? TOF_MSS : 0) | (wsf ? TOF_SCALE : 0) | (ts ? TOF_TS : 0) | (sack ? TOF_SACKPERM : 0); - tcp_offload_syncache_add(&inc, &to, &th, inp, &lso, &cxgb_toe_usrreqs, toep); + toeo.to_mss = mss; + toeo.to_wscale = wsf; + toeo.to_flags = (mss ? TOF_MSS : 0) | (wsf ? TOF_SCALE : 0) | (ts ? TOF_TS : 0) | (sack ? TOF_SACKPERM : 0); + tcp_offload_syncache_add(&inc, &toeo, &th, inp, &lso, &cxgb_toe_usrreqs, +toep); } /* * Process a CPL_PASS_ACCEPT_REQ message. Does the part that needs the socket * lock held. Note that the sock here is a listening socket that is not owned * by the TOE. 
*/ static void process_pass_accept_req(struct socket *so, struct mbuf *m, struct toedev *tdev, struct listen_ctx *lctx) { int rt_flags; struct l2t_entry *e; struct iff_mac tim; struct mbuf *reply_mbuf, *ddp_mbuf = NULL; struct cpl_pass_accept_rpl *rpl; struct cpl_pass_accept_req *req = cplhdr(m); unsigned int tid = GET_TID(req); struct tom_data *d = TOM_DATA(tdev); struct t3cdev *cdev = d->cdev; struct tcpcb *tp = so_sototcpcb(so); struct toepcb *newtoep; struct rtentry *dst; struct sockaddr_in nam; struct t3c_data *td = T3C_DATA(cdev); reply_mbuf = m_gethdr(M_NOWAIT, MT_DATA); if (__predict_false(reply_mbuf == NULL)) { if (tdev->tod_ttid == TOE_ID_CHELSIO_T3) t3_defer_reply(m, tdev, reject_pass_request); else { cxgb_queue_tid_release(cdev, tid); m_free(m); } DPRINTF("failed to get reply_mbuf\n"); goto out; } if (tp->t_state != TCPS_LISTEN) { DPRINTF("socket not in listen state\n"); goto reject; } tim.mac_addr = req->dst_mac; tim.vlan_tag = ntohs(req->vlan_tag); if (cdev->ctl(cdev, GET_IFF_FROM_MAC, &tim) < 0 || !tim.dev) { DPRINTF("rejecting from failed GET_IFF_FROM_MAC\n"); goto reject; } #ifdef notyet /* * XXX do route lookup to confirm that we're still listening on this * address */ if (ip_route_input(skb, req->local_ip, req->peer_ip, G_PASS_OPEN_TOS(ntohl(req->tos_tid)), tim.dev)) goto reject; rt_flags = ((struct rtable *)skb->dst)->rt_flags & (RTCF_BROADCAST | RTCF_MULTICAST | RTCF_LOCAL); dst_release(skb->dst); // done with the input route, release it skb->dst = NULL; if ((rt_flags & RTF_LOCAL) == 0) goto reject; #endif /* * XXX */ rt_flags = RTF_LOCAL; if ((rt_flags & RTF_LOCAL) == 0) goto reject; /* * Calculate values and add to syncache */ newtoep = toepcb_alloc(); if (newtoep == NULL) goto reject; bzero(&nam, sizeof(struct sockaddr_in)); nam.sin_len = sizeof(struct sockaddr_in); nam.sin_family = AF_INET; nam.sin_addr.s_addr =req->peer_ip; dst = rtalloc2((struct sockaddr *)&nam, 1, 0); if (dst == NULL) { printf("failed to find route\n"); goto reject; } e = newtoep->tp_l2t = t3_l2t_get(d->cdev, dst, tim.dev, (struct sockaddr *)&nam); if (e == NULL) { DPRINTF("failed to get l2t\n"); } /* * Point to our listen socket until accept */ newtoep->tp_tp = tp; newtoep->tp_flags = TP_SYN_RCVD; newtoep->tp_tid = tid; newtoep->tp_toedev = tdev; tp->rcv_wnd = select_rcv_wnd(tdev, so); cxgb_insert_tid(cdev, d->client, newtoep, tid); so_lock(so); LIST_INSERT_HEAD(&lctx->synq_head, newtoep, synq_entry); so_unlock(so); newtoep->tp_ulp_mode = TOM_TUNABLE(tdev, ddp) && !(so_options_get(so) & SO_NO_DDP) && tp->rcv_wnd >= MIN_DDP_RCV_WIN ? 
ULP_MODE_TCPDDP : 0; if (newtoep->tp_ulp_mode) { ddp_mbuf = m_gethdr(M_NOWAIT, MT_DATA); if (ddp_mbuf == NULL) newtoep->tp_ulp_mode = 0; } CTR4(KTR_TOM, "ddp=%d rcv_wnd=%ld min_win=%d ulp_mode=%d", TOM_TUNABLE(tdev, ddp), tp->rcv_wnd, MIN_DDP_RCV_WIN, newtoep->tp_ulp_mode); set_arp_failure_handler(reply_mbuf, pass_accept_rpl_arp_failure); /* * XXX workaround for lack of syncache drop */ toepcb_hold(newtoep); syncache_add_accept_req(req, so, newtoep); rpl = cplhdr(reply_mbuf); reply_mbuf->m_pkthdr.len = reply_mbuf->m_len = sizeof(*rpl); rpl->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_FORWARD)); rpl->wr.wr_lo = 0; OPCODE_TID(rpl) = htonl(MK_OPCODE_TID(CPL_PASS_ACCEPT_RPL, tid)); rpl->opt2 = htonl(calc_opt2(so, tdev)); rpl->rsvd = rpl->opt2; /* workaround for HW bug */ rpl->peer_ip = req->peer_ip; // req->peer_ip is not overwritten rpl->opt0h = htonl(calc_opt0h(so, select_mss(td, NULL, dst->rt_ifp->if_mtu)) | V_L2T_IDX(e->idx) | V_TX_CHANNEL(e->smt_idx)); rpl->opt0l_status = htonl(calc_opt0l(so, newtoep->tp_ulp_mode) | CPL_PASS_OPEN_ACCEPT); DPRINTF("opt0l_status=%08x\n", rpl->opt0l_status); m_set_priority(reply_mbuf, mkprio(CPL_PRIORITY_SETUP, newtoep)); l2t_send(cdev, reply_mbuf, e); m_free(m); if (newtoep->tp_ulp_mode) { __set_tcb_field(newtoep, ddp_mbuf, W_TCB_RX_DDP_FLAGS, V_TF_DDP_OFF(1) | TP_DDP_TIMER_WORKAROUND_MASK, V_TF_DDP_OFF(1) | TP_DDP_TIMER_WORKAROUND_VAL, 1); } else DPRINTF("no DDP\n"); return; reject: if (tdev->tod_ttid == TOE_ID_CHELSIO_T3) mk_pass_accept_rpl(reply_mbuf, m); else mk_tid_release(reply_mbuf, newtoep, tid); cxgb_ofld_send(cdev, reply_mbuf); m_free(m); out: #if 0 TCP_INC_STATS_BH(TCP_MIB_ATTEMPTFAILS); #else return; #endif } /* * Handle a CPL_PASS_ACCEPT_REQ message. */ static int do_pass_accept_req(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct listen_ctx *listen_ctx = (struct listen_ctx *)ctx; struct socket *lso = listen_ctx->lso; /* XXX need an interlock against the listen socket going away */ struct tom_data *d = listen_ctx->tom_data; #if VALIDATE_TID struct cpl_pass_accept_req *req = cplhdr(m); unsigned int tid = GET_TID(req); struct tid_info *t = &(T3C_DATA(cdev))->tid_maps; if (unlikely(!lsk)) { printk(KERN_ERR "%s: PASS_ACCEPT_REQ had unknown STID %lu\n", cdev->name, (unsigned long)((union listen_entry *)ctx - t->stid_tab)); return CPL_RET_BUF_DONE; } if (unlikely(tid >= t->ntids)) { printk(KERN_ERR "%s: passive open TID %u too large\n", cdev->name, tid); return CPL_RET_BUF_DONE; } /* * For T3A the current user of the TID may have closed but its last * message(s) may have been backlogged so the TID appears to be still * in use. Just take the TID away, the connection can close at its * own leisure. For T3B this situation is a bug. */ if (!valid_new_tid(t, tid) && cdev->type != T3A) { printk(KERN_ERR "%s: passive open uses existing TID %u\n", cdev->name, tid); return CPL_RET_BUF_DONE; } #endif process_pass_accept_req(lso, m, &d->tdev, listen_ctx); return (0); } /* * Called when a connection is established to translate the TCP options * reported by HW to FreeBSD's native format. */ static void assign_rxopt(struct socket *so, unsigned int opt) { struct tcpcb *tp = so_sototcpcb(so); struct toepcb *toep = tp->t_toe; const struct t3c_data *td = T3C_DATA(TOEP_T3C_DEV(toep)); inp_lock_assert(tp->t_inpcb); toep->tp_mss_clamp = td->mtus[G_TCPOPT_MSS(opt)] - 40; tp->t_flags |= G_TCPOPT_TSTAMP(opt) ? TF_RCVD_TSTMP : 0; tp->t_flags |= G_TCPOPT_SACK(opt) ? TF_SACK_PERMIT : 0; tp->t_flags |= G_TCPOPT_WSCALE_OK(opt) ? 
TF_RCVD_SCALE : 0; if ((tp->t_flags & (TF_RCVD_SCALE|TF_REQ_SCALE)) == (TF_RCVD_SCALE|TF_REQ_SCALE)) tp->rcv_scale = tp->request_r_scale; } /* * Completes some final bits of initialization for just established connections * and changes their state to TCP_ESTABLISHED. * * snd_isn here is the ISN after the SYN, i.e., the true ISN + 1. */ static void make_established(struct socket *so, u32 snd_isn, unsigned int opt) { struct tcpcb *tp = so_sototcpcb(so); struct toepcb *toep = tp->t_toe; toep->tp_write_seq = tp->iss = tp->snd_max = tp->snd_nxt = tp->snd_una = snd_isn; assign_rxopt(so, opt); /* *XXXXXXXXXXX * */ #ifdef notyet so->so_proto->pr_ctloutput = t3_ctloutput; #endif #if 0 inet_sk(sk)->id = tp->write_seq ^ jiffies; #endif /* * XXX not clear what rcv_wup maps to */ /* * Causes the first RX_DATA_ACK to supply any Rx credits we couldn't * pass through opt0. */ if (tp->rcv_wnd > (M_RCV_BUFSIZ << 10)) toep->tp_rcv_wup -= tp->rcv_wnd - (M_RCV_BUFSIZ << 10); dump_toepcb(toep); #ifdef notyet /* * no clean interface for marking ARP up to date */ dst_confirm(sk->sk_dst_cache); #endif tp->t_starttime = ticks; tp->t_state = TCPS_ESTABLISHED; soisconnected(so); } static int syncache_expand_establish_req(struct cpl_pass_establish *req, struct socket **so, struct toepcb *toep) { struct in_conninfo inc; - struct tcpopt to; + struct toeopt toeo; struct tcphdr th; int mss, wsf, sack, ts; struct mbuf *m = NULL; const struct t3c_data *td = T3C_DATA(TOM_DATA(toep->tp_toedev)->cdev); unsigned int opt; #ifdef MAC #error "no MAC support" #endif opt = ntohs(req->tcp_opt); - bzero(&to, sizeof(struct tcpopt)); + bzero(&toeo, sizeof(struct toeopt)); /* * Fill out information for entering us into the syncache */ bzero(&inc, sizeof(inc)); inc.inc_fport = th.th_sport = req->peer_port; inc.inc_lport = th.th_dport = req->local_port; th.th_seq = req->rcv_isn; th.th_flags = TH_ACK; inc.inc_len = 0; inc.inc_faddr.s_addr = req->peer_ip; inc.inc_laddr.s_addr = req->local_ip; mss = td->mtus[G_TCPOPT_MSS(opt)] - 40; wsf = G_TCPOPT_WSCALE_OK(opt); ts = G_TCPOPT_TSTAMP(opt); sack = G_TCPOPT_SACK(opt); - to.to_mss = mss; - to.to_wscale = G_TCPOPT_SND_WSCALE(opt); - to.to_flags = (mss ? TOF_MSS : 0) | (wsf ? TOF_SCALE : 0) | (ts ? TOF_TS : 0) | (sack ? TOF_SACKPERM : 0); + toeo.to_mss = mss; + toeo.to_wscale = G_TCPOPT_SND_WSCALE(opt); + toeo.to_flags = (mss ? TOF_MSS : 0) | (wsf ? TOF_SCALE : 0) | (ts ? TOF_TS : 0) | (sack ? TOF_SACKPERM : 0); DPRINTF("syncache expand of %d:%d %d:%d mss:%d wsf:%d ts:%d sack:%d\n", ntohl(req->local_ip), ntohs(req->local_port), ntohl(req->peer_ip), ntohs(req->peer_port), mss, wsf, ts, sack); - return tcp_offload_syncache_expand(&inc, &to, &th, so, m); + return tcp_offload_syncache_expand(&inc, &toeo, &th, so, m); } /* * Process a CPL_PASS_ESTABLISH message.
XXX a lot of the locking doesn't work * if we are in TCP_SYN_RECV due to crossed SYNs */ static int do_pass_establish(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct cpl_pass_establish *req = cplhdr(m); struct toepcb *toep = (struct toepcb *)ctx; struct tcpcb *tp = toep->tp_tp; struct socket *so, *lso; struct t3c_data *td = T3C_DATA(cdev); struct sockbuf *snd, *rcv; // Complete socket initialization now that we have the SND_ISN struct toedev *tdev; tdev = toep->tp_toedev; inp_wlock(tp->t_inpcb); /* * * XXX need to add reference while we're manipulating */ so = lso = inp_inpcbtosocket(tp->t_inpcb); inp_wunlock(tp->t_inpcb); so_lock(so); LIST_REMOVE(toep, synq_entry); so_unlock(so); if (!syncache_expand_establish_req(req, &so, toep)) { /* * No entry */ CXGB_UNIMPLEMENTED(); } if (so == NULL) { /* * Couldn't create the socket */ CXGB_UNIMPLEMENTED(); } tp = so_sototcpcb(so); inp_wlock(tp->t_inpcb); snd = so_sockbuf_snd(so); rcv = so_sockbuf_rcv(so); snd->sb_flags |= SB_NOCOALESCE; rcv->sb_flags |= SB_NOCOALESCE; toep->tp_tp = tp; toep->tp_flags = 0; tp->t_toe = toep; reset_wr_list(toep); tp->rcv_wnd = select_rcv_wnd(tdev, so); tp->rcv_nxt = toep->tp_copied_seq; install_offload_ops(so); toep->tp_wr_max = toep->tp_wr_avail = TOM_TUNABLE(tdev, max_wrs); toep->tp_wr_unacked = 0; toep->tp_qset = G_QNUM(ntohl(m->m_pkthdr.csum_data)); toep->tp_qset_idx = 0; toep->tp_mtu_idx = select_mss(td, tp, toep->tp_l2t->neigh->rt_ifp->if_mtu); /* * XXX Cancel any keep alive timer */ make_established(so, ntohl(req->snd_isn), ntohs(req->tcp_opt)); /* * XXX workaround for lack of syncache drop */ toepcb_release(toep); inp_wunlock(tp->t_inpcb); CTR1(KTR_TOM, "do_pass_establish tid=%u", toep->tp_tid); cxgb_log_tcb(cdev->adapter, toep->tp_tid); #ifdef notyet /* * XXX not sure how these checks map to us */ if (unlikely(sk->sk_socket)) { // simultaneous opens only sk->sk_state_change(sk); sk_wake_async(so, 0, POLL_OUT); } /* * The state for the new connection is now up to date. * Next check if we should add the connection to the parent's * accept queue. When the parent closes it resets connections * on its SYN queue, so check if we are being reset. If so we * don't need to do anything more, the coming ABORT_RPL will * destroy this socket. Otherwise move the connection to the * accept queue. * * Note that we reset the synq before closing the server so if * we are not being reset the stid is still open. */ if (unlikely(!tp->forward_skb_hint)) { // removed from synq __kfree_skb(skb); goto unlock; } #endif m_free(m); return (0); } /* * Fill in the right TID for CPL messages waiting in the out-of-order queue * and send them to the TOE. */ static void fixup_and_send_ofo(struct toepcb *toep) { struct mbuf *m; struct toedev *tdev = toep->tp_toedev; struct tcpcb *tp = toep->tp_tp; unsigned int tid = toep->tp_tid; log(LOG_NOTICE, "fixup_and_send_ofo\n"); inp_lock_assert(tp->t_inpcb); while ((m = mbufq_dequeue(&toep->out_of_order_queue)) != NULL) { /* * A variety of messages can be waiting but the fields we'll * be touching are common to all so any message type will do. */ struct cpl_close_con_req *p = cplhdr(m); p->wr.wr_lo = htonl(V_WR_TID(tid)); OPCODE_TID(p) = htonl(MK_OPCODE_TID(p->ot.opcode, tid)); cxgb_ofld_send(TOM_DATA(tdev)->cdev, m); } } /* * Updates socket state from an active establish CPL message. Runs with the * socket lock held. 
*/ static void socket_act_establish(struct socket *so, struct mbuf *m) { INIT_VNET_INET(so->so_vnet); struct cpl_act_establish *req = cplhdr(m); u32 rcv_isn = ntohl(req->rcv_isn); /* real RCV_ISN + 1 */ struct tcpcb *tp = so_sototcpcb(so); struct toepcb *toep = tp->t_toe; if (__predict_false(tp->t_state != TCPS_SYN_SENT)) log(LOG_ERR, "TID %u expected SYN_SENT, found %d\n", toep->tp_tid, tp->t_state); tp->ts_recent_age = ticks; tp->irs = tp->rcv_wnd = tp->rcv_nxt = rcv_isn; toep->tp_delack_seq = toep->tp_rcv_wup = toep->tp_copied_seq = tp->irs; make_established(so, ntohl(req->snd_isn), ntohs(req->tcp_opt)); /* * Now that we finally have a TID send any CPL messages that we had to * defer for lack of a TID. */ if (mbufq_len(&toep->out_of_order_queue)) fixup_and_send_ofo(toep); if (__predict_false(so_state_get(so) & SS_NOFDREF)) { /* * XXX does this even make sense? */ so_sorwakeup(so); } m_free(m); #ifdef notyet /* * XXX assume no write requests permitted while socket connection is * incomplete */ /* * Currently the send queue must be empty at this point because the * socket layer does not send anything before a connection is * established. To be future proof though we handle the possibility * that there are pending buffers to send (either TX_DATA or * CLOSE_CON_REQ). First we need to adjust the sequence number of the * buffers according to the just learned write_seq, and then we send * them on their way. */ fixup_pending_writeq_buffers(sk); if (t3_push_frames(so, 1)) sk->sk_write_space(sk); #endif toep->tp_state = tp->t_state; TCPSTAT_INC(tcps_connects); } /* * Process a CPL_ACT_ESTABLISH message. */ static int do_act_establish(struct t3cdev *cdev, struct mbuf *m, void *ctx) { struct cpl_act_establish *req = cplhdr(m); unsigned int tid = GET_TID(req); unsigned int atid = G_PASS_OPEN_TID(ntohl(req->tos_tid)); struct toepcb *toep = (struct toepcb *)ctx; struct tcpcb *tp = toep->tp_tp; struct socket *so; struct toedev *tdev; struct tom_data *d; if (tp == NULL) { free_atid(cdev, atid); return (0); } inp_wlock(tp->t_inpcb); /* * XXX */ so = inp_inpcbtosocket(tp->t_inpcb); tdev = toep->tp_toedev; /* blow up here if link was down */ d = TOM_DATA(tdev); /* * It's OK if the TID is currently in use, the owning socket may have * backlogged its last CPL message(s). Just take it away. */ toep->tp_tid = tid; toep->tp_tp = tp; so_insert_tid(d, toep, tid); free_atid(cdev, atid); toep->tp_qset = G_QNUM(ntohl(m->m_pkthdr.csum_data)); socket_act_establish(so, m); inp_wunlock(tp->t_inpcb); CTR1(KTR_TOM, "do_act_establish tid=%u", toep->tp_tid); cxgb_log_tcb(cdev->adapter, toep->tp_tid); return (0); } /* * Process an acknowledgment of WR completion. Advance snd_una and send the * next batch of work requests from the write queue. 
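 * * An illustrative walk-through of the credit accounting below (the numbers * are invented): with WRs of 2, 3 and 1 credits pending and a WR_ACK * carrying 4 credits, the first WR (2 credits) is dequeued and its bytes * dropped from the send buffer; the second WR still needs 3 credits but * only 2 remain, so its outstanding count drops to 1 and the loop stops.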
*/ static void wr_ack(struct toepcb *toep, struct mbuf *m) { struct tcpcb *tp = toep->tp_tp; struct cpl_wr_ack *hdr = cplhdr(m); struct socket *so; unsigned int credits = ntohs(hdr->credits); u32 snd_una = ntohl(hdr->snd_una); int bytes = 0; struct sockbuf *snd; CTR2(KTR_SPARE2, "wr_ack: snd_una=%u credits=%d", snd_una, credits); inp_wlock(tp->t_inpcb); so = inp_inpcbtosocket(tp->t_inpcb); toep->tp_wr_avail += credits; if (toep->tp_wr_unacked > toep->tp_wr_max - toep->tp_wr_avail) toep->tp_wr_unacked = toep->tp_wr_max - toep->tp_wr_avail; while (credits) { struct mbuf *p = peek_wr(toep); if (__predict_false(!p)) { log(LOG_ERR, "%u WR_ACK credits for TID %u with " "nothing pending, state %u wr_avail=%u\n", credits, toep->tp_tid, tp->t_state, toep->tp_wr_avail); break; } CTR2(KTR_TOM, "wr_ack: p->credits=%d p->bytes=%d", p->m_pkthdr.csum_data, p->m_pkthdr.len); KASSERT(p->m_pkthdr.csum_data != 0, ("empty request still on list")); if (__predict_false(credits < p->m_pkthdr.csum_data)) { #if DEBUG_WR > 1 struct tx_data_wr *w = cplhdr(p); log(LOG_ERR, "TID %u got %u WR credits, need %u, len %u, " "main body %u, frags %u, seq # %u, ACK una %u," " ACK nxt %u, WR_AVAIL %u, WRs pending %u\n", toep->tp_tid, credits, p->csum, p->len, p->len - p->data_len, skb_shinfo(p)->nr_frags, ntohl(w->sndseq), snd_una, ntohl(hdr->snd_nxt), toep->tp_wr_avail, count_pending_wrs(tp) - credits); #endif p->m_pkthdr.csum_data -= credits; break; } else { dequeue_wr(toep); credits -= p->m_pkthdr.csum_data; bytes += p->m_pkthdr.len; CTR3(KTR_TOM, "wr_ack: done with wr of %d bytes remain credits=%d wr credits=%d", p->m_pkthdr.len, credits, p->m_pkthdr.csum_data); m_free(p); } } #if DEBUG_WR check_wr_invariants(tp); #endif if (__predict_false(SEQ_LT(snd_una, tp->snd_una))) { #if VALIDATE_SEQ struct tom_data *d = TOM_DATA(TOE_DEV(so)); log(LOG_ERR, "%s: unexpected sequence # %u in WR_ACK " "for TID %u, snd_una %u\n", (&d->tdev)->name, snd_una, toep->tp_tid, tp->snd_una); #endif goto out_free; } if (tp->snd_una != snd_una) { tp->snd_una = snd_una; tp->ts_recent_age = ticks; #ifdef notyet /* * Keep ARP entry "minty fresh" */ dst_confirm(sk->sk_dst_cache); #endif if (tp->snd_una == tp->snd_nxt) toep->tp_flags &= ~TP_TX_WAIT_IDLE; } snd = so_sockbuf_snd(so); if (bytes) { CTR1(KTR_SPARE2, "wr_ack: sbdrop(%d)", bytes); sockbuf_lock(snd); sbdrop_locked(snd, bytes); so_sowwakeup_locked(so); } if (snd->sb_sndptroff < snd->sb_cc) t3_push_frames(so, 0); out_free: inp_wunlock(tp->t_inpcb); m_free(m); } /* * Handler for TX_DATA_ACK CPL messages. */ static int do_wr_ack(struct t3cdev *dev, struct mbuf *m, void *ctx) { struct toepcb *toep = (struct toepcb *)ctx; VALIDATE_SOCK(so); wr_ack(toep, m); return 0; } /* * Handler for TRACE_PKT CPL messages. Just sink these packets. */ static int do_trace_pkt(struct t3cdev *dev, struct mbuf *m, void *ctx) { m_freem(m); return 0; } /* * Reset a connection that is on a listener's SYN queue or accept queue, * i.e., one that has not had a struct socket associated with it. * Must be called from process context. * * Modeled after code in inet_csk_listen_stop().
*/ static void t3_reset_listen_child(struct socket *child) { struct tcpcb *tp = so_sototcpcb(child); t3_send_reset(tp->t_toe); } static void t3_child_disconnect(struct socket *so, void *arg) { struct tcpcb *tp = so_sototcpcb(so); if (tp->t_flags & TF_TOE) { inp_wlock(tp->t_inpcb); t3_reset_listen_child(so); inp_wunlock(tp->t_inpcb); } } /* * Disconnect offloaded established but not yet accepted connections sitting * on a server's accept_queue. We just send an ABORT_REQ at this point and * finish off the disconnect later as we may need to wait for the ABORT_RPL. */ void t3_disconnect_acceptq(struct socket *listen_so) { so_lock(listen_so); so_listeners_apply_all(listen_so, t3_child_disconnect, NULL); so_unlock(listen_so); } /* * Reset offloaded connections sitting on a server's syn queue. As above * we send ABORT_REQ and finish off when we get ABORT_RPL. */ void t3_reset_synq(struct listen_ctx *lctx) { struct toepcb *toep; so_lock(lctx->lso); while (!LIST_EMPTY(&lctx->synq_head)) { toep = LIST_FIRST(&lctx->synq_head); LIST_REMOVE(toep, synq_entry); toep->tp_tp = NULL; t3_send_reset(toep); cxgb_remove_tid(TOEP_T3C_DEV(toep), toep, toep->tp_tid); toepcb_release(toep); } so_unlock(lctx->lso); } int t3_setup_ppods(struct toepcb *toep, const struct ddp_gather_list *gl, unsigned int nppods, unsigned int tag, unsigned int maxoff, unsigned int pg_off, unsigned int color) { unsigned int i, j, pidx; struct pagepod *p; struct mbuf *m; struct ulp_mem_io *req; unsigned int tid = toep->tp_tid; const struct tom_data *td = TOM_DATA(toep->tp_toedev); unsigned int ppod_addr = tag * PPOD_SIZE + td->ddp_llimit; CTR6(KTR_TOM, "t3_setup_ppods(gl=%p nppods=%u tag=%u maxoff=%u pg_off=%u color=%u)", gl, nppods, tag, maxoff, pg_off, color); for (i = 0; i < nppods; ++i) { m = m_gethdr_nofail(sizeof(*req) + PPOD_SIZE); m_set_priority(m, mkprio(CPL_PRIORITY_CONTROL, toep)); req = mtod(m, struct ulp_mem_io *); m->m_pkthdr.len = m->m_len = sizeof(*req) + PPOD_SIZE; req->wr.wr_hi = htonl(V_WR_OP(FW_WROPCODE_BYPASS)); req->wr.wr_lo = 0; req->cmd_lock_addr = htonl(V_ULP_MEMIO_ADDR(ppod_addr >> 5) | V_ULPTX_CMD(ULP_MEM_WRITE)); req->len = htonl(V_ULP_MEMIO_DATA_LEN(PPOD_SIZE / 32) | V_ULPTX_NFLITS(PPOD_SIZE / 8 + 1)); p = (struct pagepod *)(req + 1); if (__predict_false(i < nppods - NUM_SENTINEL_PPODS)) { p->pp_vld_tid = htonl(F_PPOD_VALID | V_PPOD_TID(tid)); p->pp_pgsz_tag_color = htonl(V_PPOD_TAG(tag) | V_PPOD_COLOR(color)); p->pp_max_offset = htonl(maxoff); p->pp_page_offset = htonl(pg_off); p->pp_rsvd = 0; for (pidx = 4 * i, j = 0; j < 5; ++j, ++pidx) p->pp_addr[j] = pidx < gl->dgl_nelem ? htobe64(VM_PAGE_TO_PHYS(gl->dgl_pages[pidx])) : 0; } else p->pp_vld_tid = 0; /* mark sentinel page pods invalid */ send_or_defer(toep, m, 0); ppod_addr += PPOD_SIZE; } return (0); } /* * Build a CPL_BARRIER message as payload of a ULP_TX_PKT command. */ static inline void mk_cpl_barrier_ulp(struct cpl_barrier *b) { struct ulp_txpkt *txpkt = (struct ulp_txpkt *)b; txpkt->cmd_dest = htonl(V_ULPTX_CMD(ULP_TXPKT)); txpkt->len = htonl(V_ULPTX_NFLITS(sizeof(*b) / 8)); b->opcode = CPL_BARRIER; } /* * Build a CPL_GET_TCB message as payload of a ULP_TX_PKT command. 
*/ static inline void mk_get_tcb_ulp(struct cpl_get_tcb *req, unsigned int tid, unsigned int cpuno) { struct ulp_txpkt *txpkt = (struct ulp_txpkt *)req; txpkt->cmd_dest = htonl(V_ULPTX_CMD(ULP_TXPKT)); txpkt->len = htonl(V_ULPTX_NFLITS(sizeof(*req) / 8)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_GET_TCB, tid)); req->cpuno = htons(cpuno); } /* * Build a CPL_SET_TCB_FIELD message as payload of a ULP_TX_PKT command. */ static inline void mk_set_tcb_field_ulp(struct cpl_set_tcb_field *req, unsigned int tid, unsigned int word, uint64_t mask, uint64_t val) { struct ulp_txpkt *txpkt = (struct ulp_txpkt *)req; CTR4(KTR_TCB, "mk_set_tcb_field_ulp(tid=%u word=0x%x mask=%jx val=%jx", tid, word, mask, val); txpkt->cmd_dest = htonl(V_ULPTX_CMD(ULP_TXPKT)); txpkt->len = htonl(V_ULPTX_NFLITS(sizeof(*req) / 8)); OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, tid)); req->reply = V_NO_REPLY(1); req->cpu_idx = 0; req->word = htons(word); req->mask = htobe64(mask); req->val = htobe64(val); } /* * Build a CPL_RX_DATA_ACK message as payload of a ULP_TX_PKT command. */ static void mk_rx_data_ack_ulp(struct toepcb *toep, struct cpl_rx_data_ack *ack, unsigned int tid, unsigned int credits) { struct ulp_txpkt *txpkt = (struct ulp_txpkt *)ack; txpkt->cmd_dest = htonl(V_ULPTX_CMD(ULP_TXPKT)); txpkt->len = htonl(V_ULPTX_NFLITS(sizeof(*ack) / 8)); OPCODE_TID(ack) = htonl(MK_OPCODE_TID(CPL_RX_DATA_ACK, tid)); ack->credit_dack = htonl(F_RX_MODULATE | F_RX_DACK_CHANGE | V_RX_DACK_MODE(TOM_TUNABLE(toep->tp_toedev, delack)) | V_RX_CREDITS(credits)); } void t3_cancel_ddpbuf(struct toepcb *toep, unsigned int bufidx) { unsigned int wrlen; struct mbuf *m; struct work_request_hdr *wr; struct cpl_barrier *lock; struct cpl_set_tcb_field *req; struct cpl_get_tcb *getreq; struct ddp_state *p = &toep->tp_ddp_state; #if 0 SOCKBUF_LOCK_ASSERT(&toeptoso(toep)->so_rcv); #endif wrlen = sizeof(*wr) + sizeof(*req) + 2 * sizeof(*lock) + sizeof(*getreq); m = m_gethdr_nofail(wrlen); m_set_priority(m, mkprio(CPL_PRIORITY_CONTROL, toep)); wr = mtod(m, struct work_request_hdr *); bzero(wr, wrlen); wr->wr_hi = htonl(V_WR_OP(FW_WROPCODE_BYPASS)); m->m_pkthdr.len = m->m_len = wrlen; lock = (struct cpl_barrier *)(wr + 1); mk_cpl_barrier_ulp(lock); req = (struct cpl_set_tcb_field *)(lock + 1); CTR1(KTR_TCB, "t3_cancel_ddpbuf(bufidx=%u)", bufidx); /* Hmm, not sure if this is actually a good thing: reactivating * the other buffer might be an issue if it has been completed * already. However, that is unlikely, since the fact that the UBUF * is not completed indicates that there is no outstanding data.
*/ if (bufidx == 0) mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_FLAGS, V_TF_DDP_ACTIVE_BUF(1) | V_TF_DDP_BUF0_VALID(1), V_TF_DDP_ACTIVE_BUF(1)); else mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_FLAGS, V_TF_DDP_ACTIVE_BUF(1) | V_TF_DDP_BUF1_VALID(1), 0); getreq = (struct cpl_get_tcb *)(req + 1); mk_get_tcb_ulp(getreq, toep->tp_tid, toep->tp_qset); mk_cpl_barrier_ulp((struct cpl_barrier *)(getreq + 1)); /* Keep track of the number of outstanding CPL_GET_TCB requests */ p->get_tcb_count++; #ifdef T3_TRACE T3_TRACE1(TIDTB(so), "t3_cancel_ddpbuf: bufidx %u", bufidx); #endif cxgb_ofld_send(TOEP_T3C_DEV(toep), m); } /** * t3_overlay_ddpbuf - overlay an existing DDP buffer with a new one * @toep: the toepcb associated with the buffers * @bufidx: index of HW DDP buffer (0 or 1) * @tag0: new tag for HW buffer 0 * @tag1: new tag for HW buffer 1 * @len: new length for HW buf @bufidx * * Sends a compound WR to overlay a new DDP buffer on top of an existing * buffer by changing the buffer tag and length and setting the valid and * active flag accordingly. The caller must ensure the new buffer is at * least as big as the existing one. Since we typically reprogram both HW * buffers this function sets both tags for convenience. Read the TCB to * determine how much data was written into the buffer before the overlay * took place. */ void t3_overlay_ddpbuf(struct toepcb *toep, unsigned int bufidx, unsigned int tag0, unsigned int tag1, unsigned int len) { unsigned int wrlen; struct mbuf *m; struct work_request_hdr *wr; struct cpl_get_tcb *getreq; struct cpl_set_tcb_field *req; struct ddp_state *p = &toep->tp_ddp_state; CTR4(KTR_TCB, "t3_overlay_ddpbuf(bufidx=%u tag0=%u tag1=%u len=%u)", bufidx, tag0, tag1, len); #if 0 SOCKBUF_LOCK_ASSERT(&toeptoso(toep)->so_rcv); #endif wrlen = sizeof(*wr) + 3 * sizeof(*req) + sizeof(*getreq); m = m_gethdr_nofail(wrlen); m_set_priority(m, mkprio(CPL_PRIORITY_CONTROL, toep)); wr = mtod(m, struct work_request_hdr *); m->m_pkthdr.len = m->m_len = wrlen; bzero(wr, wrlen); /* Set the ATOMIC flag to make sure that TP processes the following * CPLs in an atomic manner and no wire segments can be interleaved.
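 * * For reference, the compound WR assembled below is laid out as follows * (an illustrative summary matching the wrlen computation above): * *	work_request_hdr	(FW_WROPCODE_BYPASS | F_WR_ATOMIC) *	cpl_set_tcb_field	- both DDP buffer tags *	cpl_set_tcb_field	- new length for buffer @bufidx *	cpl_set_tcb_field	- DDP valid/active flag bits *	cpl_get_tcb		- reads the TCB back to learn how much data was placed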
*/ wr->wr_hi = htonl(V_WR_OP(FW_WROPCODE_BYPASS) | F_WR_ATOMIC); req = (struct cpl_set_tcb_field *)(wr + 1); mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_BUF0_TAG, V_TCB_RX_DDP_BUF0_TAG(M_TCB_RX_DDP_BUF0_TAG) | V_TCB_RX_DDP_BUF1_TAG(M_TCB_RX_DDP_BUF1_TAG) << 32, V_TCB_RX_DDP_BUF0_TAG(tag0) | V_TCB_RX_DDP_BUF1_TAG((uint64_t)tag1) << 32); req++; if (bufidx == 0) { mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_BUF0_LEN, V_TCB_RX_DDP_BUF0_LEN(M_TCB_RX_DDP_BUF0_LEN), V_TCB_RX_DDP_BUF0_LEN((uint64_t)len)); req++; mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_FLAGS, V_TF_DDP_PUSH_DISABLE_0(1) | V_TF_DDP_BUF0_VALID(1) | V_TF_DDP_ACTIVE_BUF(1), V_TF_DDP_PUSH_DISABLE_0(0) | V_TF_DDP_BUF0_VALID(1)); } else { mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_BUF1_LEN, V_TCB_RX_DDP_BUF1_LEN(M_TCB_RX_DDP_BUF1_LEN), V_TCB_RX_DDP_BUF1_LEN((uint64_t)len)); req++; mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_FLAGS, V_TF_DDP_PUSH_DISABLE_1(1) | V_TF_DDP_BUF1_VALID(1) | V_TF_DDP_ACTIVE_BUF(1), V_TF_DDP_PUSH_DISABLE_1(0) | V_TF_DDP_BUF1_VALID(1) | V_TF_DDP_ACTIVE_BUF(1)); } getreq = (struct cpl_get_tcb *)(req + 1); mk_get_tcb_ulp(getreq, toep->tp_tid, toep->tp_qset); /* Keep track of the number of outstanding CPL_GET_TCB requests */ p->get_tcb_count++; #ifdef T3_TRACE T3_TRACE4(TIDTB(sk), "t3_overlay_ddpbuf: bufidx %u tag0 %u tag1 %u " "len %d", bufidx, tag0, tag1, len); #endif cxgb_ofld_send(TOEP_T3C_DEV(toep), m); } /* * Sends a compound WR containing all the CPL messages needed to program the * two HW DDP buffers, namely optionally setting up the length and offset of * each buffer, programming the DDP flags, and optionally sending RX_DATA_ACK. */ void t3_setup_ddpbufs(struct toepcb *toep, unsigned int len0, unsigned int offset0, unsigned int len1, unsigned int offset1, uint64_t ddp_flags, uint64_t flag_mask, int modulate) { unsigned int wrlen; struct mbuf *m; struct work_request_hdr *wr; struct cpl_set_tcb_field *req; CTR6(KTR_TCB, "t3_setup_ddpbufs(len0=%u offset0=%u len1=%u offset1=%u ddp_flags=0x%08x%08x ", len0, offset0, len1, offset1, ddp_flags >> 32, ddp_flags & 0xffffffff); #if 0 SOCKBUF_LOCK_ASSERT(&toeptoso(toep)->so_rcv); #endif wrlen = sizeof(*wr) + sizeof(*req) + (len0 ? sizeof(*req) : 0) + (len1 ? sizeof(*req) : 0) + (modulate ?
sizeof(struct cpl_rx_data_ack) : 0); m = m_gethdr_nofail(wrlen); m_set_priority(m, mkprio(CPL_PRIORITY_CONTROL, toep)); wr = mtod(m, struct work_request_hdr *); bzero(wr, wrlen); wr->wr_hi = htonl(V_WR_OP(FW_WROPCODE_BYPASS)); m->m_pkthdr.len = m->m_len = wrlen; req = (struct cpl_set_tcb_field *)(wr + 1); if (len0) { /* program buffer 0 offset and length */ mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_BUF0_OFFSET, V_TCB_RX_DDP_BUF0_OFFSET(M_TCB_RX_DDP_BUF0_OFFSET) | V_TCB_RX_DDP_BUF0_LEN(M_TCB_RX_DDP_BUF0_LEN), V_TCB_RX_DDP_BUF0_OFFSET((uint64_t)offset0) | V_TCB_RX_DDP_BUF0_LEN((uint64_t)len0)); req++; } if (len1) { /* program buffer 1 offset and length */ mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_BUF1_OFFSET, V_TCB_RX_DDP_BUF1_OFFSET(M_TCB_RX_DDP_BUF1_OFFSET) | V_TCB_RX_DDP_BUF1_LEN(M_TCB_RX_DDP_BUF1_LEN) << 32, V_TCB_RX_DDP_BUF1_OFFSET((uint64_t)offset1) | V_TCB_RX_DDP_BUF1_LEN((uint64_t)len1) << 32); req++; } mk_set_tcb_field_ulp(req, toep->tp_tid, W_TCB_RX_DDP_FLAGS, flag_mask, ddp_flags); if (modulate) { mk_rx_data_ack_ulp(toep, (struct cpl_rx_data_ack *)(req + 1), toep->tp_tid, toep->tp_copied_seq - toep->tp_rcv_wup); toep->tp_rcv_wup = toep->tp_copied_seq; } #ifdef T3_TRACE T3_TRACE5(TIDTB(sk), "t3_setup_ddpbufs: len0 %u len1 %u ddp_flags 0x%08x%08x " "modulate %d", len0, len1, ddp_flags >> 32, ddp_flags & 0xffffffff, modulate); #endif cxgb_ofld_send(TOEP_T3C_DEV(toep), m); } void t3_init_wr_tab(unsigned int wr_len) { int i; if (mbuf_wrs[1]) /* already initialized */ return; for (i = 1; i < ARRAY_SIZE(mbuf_wrs); i++) { int sgl_len = (3 * i) / 2 + (i & 1); sgl_len += 3; mbuf_wrs[i] = sgl_len <= wr_len ? 1 : 1 + (sgl_len - 2) / (wr_len - 1); } wrlen = wr_len * 8; } int t3_init_cpl_io(void) { #ifdef notyet tcphdr_skb = alloc_skb(sizeof(struct tcphdr), GFP_KERNEL); if (!tcphdr_skb) { log(LOG_ERR, "Chelsio TCP offload: can't allocate sk_buff\n"); return -1; } skb_put(tcphdr_skb, sizeof(struct tcphdr)); tcphdr_skb->h.raw = tcphdr_skb->data; memset(tcphdr_skb->data, 0, tcphdr_skb->len); #endif t3tom_register_cpl_handler(CPL_ACT_ESTABLISH, do_act_establish); t3tom_register_cpl_handler(CPL_ACT_OPEN_RPL, do_act_open_rpl); t3tom_register_cpl_handler(CPL_TX_DMA_ACK, do_wr_ack); t3tom_register_cpl_handler(CPL_RX_DATA, do_rx_data); t3tom_register_cpl_handler(CPL_CLOSE_CON_RPL, do_close_con_rpl); t3tom_register_cpl_handler(CPL_PEER_CLOSE, do_peer_close); t3tom_register_cpl_handler(CPL_PASS_ESTABLISH, do_pass_establish); t3tom_register_cpl_handler(CPL_PASS_ACCEPT_REQ, do_pass_accept_req); t3tom_register_cpl_handler(CPL_ABORT_REQ_RSS, do_abort_req); t3tom_register_cpl_handler(CPL_ABORT_RPL_RSS, do_abort_rpl); t3tom_register_cpl_handler(CPL_RX_DATA_DDP, do_rx_data_ddp); t3tom_register_cpl_handler(CPL_RX_DDP_COMPLETE, do_rx_ddp_complete); t3tom_register_cpl_handler(CPL_RX_URG_NOTIFY, do_rx_urg_notify); t3tom_register_cpl_handler(CPL_TRACE_PKT, do_trace_pkt); t3tom_register_cpl_handler(CPL_GET_TCB_RPL, do_get_tcb_rpl); return (0); } Index: head/sys/netinet/tcp_offload.h =================================================================== --- head/sys/netinet/tcp_offload.h (revision 195653) +++ head/sys/netinet/tcp_offload.h (revision 195654) @@ -1,341 +1,354 @@ /*- * Copyright (c) 2007, Chelsio Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. 
Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Neither the name of the Chelsio Corporation nor the names of its * contributors may be used to endorse or promote products derived from * this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _NETINET_TCP_OFFLOAD_H_ #define _NETINET_TCP_OFFLOAD_H_ #ifndef _KERNEL #error "no user-serviceable parts inside" #endif /* * A driver publishes that it provides offload services * by setting IFCAP_TOE in the ifnet. The offload connect * will bypass any further work if the interface that a * connection would use does not support TCP offload. * * The TOE API assumes that the tcp offload engine can offload * the entire connection from set up to teardown, with some provision * being made to allow the software stack to handle time wait. If * the device does not meet these criteria, it is the driver's responsibility * to overload the functions that it needs to in tcp_usrreqs and make * its own calls to tcp_output if it needs to do so. * * There is currently no provision for the device advertising the congestion * control algorithms it supports as there is no API for querying * an operating system for the protocols that it has loaded. This is a desirable * future extension. * * * * It is assumed that individuals deploying TOE will want connections * to be offloaded without software changes so all connections on an * interface providing TOE are offloaded unless the SO_NO_OFFLOAD * flag is set on the socket. * * * The toe_usrreqs structure constitutes the TOE driver's * interface to the TCP stack for functionality that doesn't * interact directly with userspace. If one wants to provide * (optional) functionality to do zero-copy to/from * userspace, one still needs to override soreceive/sosend * with functions that fault in and pin the user buffers. * * + tu_send * - tells the driver that new data may have been added to the * socket's send buffer - the driver should not fail if the * buffer is in fact unchanged * - the driver is responsible for providing credits (bytes in the send window) * back to the socket by calling sbdrop() as segments are acknowledged. * - The driver expects the inpcb lock to be held - the driver is expected * not to drop the lock. Hence the driver is not allowed to acquire the * pcbinfo lock during this call.
* * + tu_rcvd * - returns credits to the driver and triggers window updates * to the peer (a credit as used here is a byte in the peer's receive window) * - the driver is expected to determine how many bytes have been * consumed and credit that back to the card so that it can grow * the window again by maintaining its own state between invocations. * - In principle this could be used to shrink the window as well as * grow the window, although it is not used for that now. * - this function needs to correctly handle being called any number of * times without any bytes being consumed from the receive buffer. * - The driver expects the inpcb lock to be held - the driver is expected * not to drop the lock. Hence the driver is not allowed to acquire the * pcbinfo lock during this call. * * + tu_disconnect * - tells the driver to send FIN to peer * - driver is expected to send the remaining data and then do a clean half close * - disconnect implies at least half-close so only send, reset, and detach * are legal * - the driver is expected to handle transition through the shutdown * state machine and allow the stack to support SO_LINGER. * - The driver expects the inpcb lock to be held - the driver is expected * not to drop the lock. Hence the driver is not allowed to acquire the * pcbinfo lock during this call. * * + tu_reset * - closes the connection and sends a RST to peer * - driver is expected to trigger an RST and detach the toepcb * - no further calls are legal after reset * - The driver expects the inpcb lock to be held - the driver is expected * not to drop the lock. Hence the driver is not allowed to acquire the * pcbinfo lock during this call. * * The following fields in the tcpcb are expected to be referenced by the driver: * + iss * + rcv_nxt * + rcv_wnd * + snd_isn * + snd_max * + snd_nxt * + snd_una * + t_flags * + t_inpcb * + t_maxseg * + t_toe * * The following fields in the inpcb are expected to be referenced by the driver: * + inp_lport * + inp_fport * + inp_laddr * + inp_faddr * + inp_socket * + inp_ip_tos * * The following fields in the socket are expected to be referenced by the * driver: * + so_comp * + so_error * + so_linger * + so_options * + so_rcv * + so_snd * + so_state * + so_timeo * * These functions all return 0 on success and can return the following errors * as appropriate: * + EPERM: * + ENOBUFS: memory allocation failed * + EMSGSIZE: MTU changed during the call * + EHOSTDOWN: * + EHOSTUNREACH: * + ENETDOWN: * + ENETUNREACH: the peer is no longer reachable * * + tu_detach * - tells driver that the socket is going away so disconnect * the toepcb and free appropriate resources * - allows the driver to cleanly handle the case of connection state * outliving the socket * - no further calls are legal after detach * - the driver is expected to provide its own synchronization between * detach and receiving new data. * * + tu_syncache_event * - even if it is not actually needed, the driver is expected to * call syncache_add for the initial SYN and then syncache_expand * for the SYN,ACK * - tells driver that a connection either has not been added or has * been dropped from the syncache * - the driver is expected to maintain state that lives outside the * software stack so the syncache needs to be able to notify the * toe driver that the software stack is not going to create a connection * for a received SYN * - The driver is responsible for any synchronization required between * the syncache dropping an entry and the driver processing the SYN,ACK.
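 *
 * As an illustration only (all names here are hypothetical, not part of
 * this API), a TOE driver would typically supply a statically
 * initialized instance of the structure declared below and install it
 * in tp->t_tu when it takes over a connection:
 *
 *	static struct toe_usrreqs example_usrreqs = {
 *		.tu_send = example_send,
 *		.tu_rcvd = example_rcvd,
 *		.tu_disconnect = example_disconnect,
 *		.tu_reset = example_reset,
 *		.tu_detach = example_detach,
 *		.tu_syncache_event = example_syncache_event,
 *	};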
* */ struct toe_usrreqs { int (*tu_send)(struct tcpcb *tp); int (*tu_rcvd)(struct tcpcb *tp); int (*tu_disconnect)(struct tcpcb *tp); int (*tu_reset)(struct tcpcb *tp); void (*tu_detach)(struct tcpcb *tp); void (*tu_syncache_event)(int event, void *toep); }; +/* + * Proxy for struct tcpopt between TOE drivers and TCP functions. + */ +struct toeopt { + u_int64_t to_flags; /* see tcpopt in tcp_var.h */ + u_int16_t to_mss; /* maximum segment size */ + u_int8_t to_wscale; /* window scaling */ + + u_int8_t _pad1; /* explicit pad for 64bit alignment */ + u_int32_t _pad2; /* explicit pad for 64bit alignment */ + u_int64_t _pad3[4]; /* TBD */ +}; + #define TOE_SC_ENTRY_PRESENT 1 /* 4-tuple already present */ #define TOE_SC_DROP 2 /* connection was timed out */ /* * Because listen is a one-to-many relationship (a socket can be listening * on all interfaces on a machine, some of which may be using different TCP * offload devices), listen uses a publish/subscribe mechanism. The TCP * offload driver registers a listen notification function with the stack. * When a listen socket is created, all TCP offload devices are notified * so that they can do the appropriate set up to offload connections on the * port to which the socket is bound. When the listen socket is closed, * the offload devices are notified so that they will stop listening on that * port and free any associated resources as well as sending RSTs on any * connections in the SYN_RCVD state. * */ typedef void (*tcp_offload_listen_start_fn)(void *, struct tcpcb *); typedef void (*tcp_offload_listen_stop_fn)(void *, struct tcpcb *); EVENTHANDLER_DECLARE(tcp_offload_listen_start, tcp_offload_listen_start_fn); EVENTHANDLER_DECLARE(tcp_offload_listen_stop, tcp_offload_listen_stop_fn); /* * Check if the socket can be offloaded by the following steps: * - determine the egress interface * - check the interface for TOE capability and that TOE is enabled * - check if the device has resources to offload the connection */ int tcp_offload_connect(struct socket *so, struct sockaddr *nam); /* * The tcp_output_* routines are wrappers around the toe_usrreqs calls * which trigger packet transmission. In the non-offloaded case they * translate to tcp_output. The tcp_offload_* routines notify TOE * of specific events. In the non-offloaded case they are no-ops. * * Listen is a special case because it is a one-to-many relationship * and there can be more than one offload driver in the system. */ /* * Connection is offloaded */ #define tp_offload(tp) ((tp)->t_flags & TF_TOE) /* * hackish way of allowing this file to also be included by TOE * which needs to be kept ignorant of socket implementation details */ #ifdef _SYS_SOCKETVAR_H_ /* * The socket has not been marked as "do not offload" */ #define SO_OFFLOADABLE(so) ((so->so_options & SO_NO_OFFLOAD) == 0) static __inline int tcp_output_connect(struct socket *so, struct sockaddr *nam) { struct tcpcb *tp = sototcpcb(so); int error; /* * If offload has been disabled for this socket or the * connection cannot be offloaded, just call tcp_output * to start the TCP state machine.
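 *
 * (As an aside, on the passive side a driver that completes the
 * handshake in hardware hands the SYN's parsed options back to the
 * stack through the struct toeopt declared above; a sketch, where
 * everything except the toeopt fields and TOF_* flags is hypothetical:
 *
 *	struct toeopt toeo;
 *
 *	bzero(&toeo, sizeof(toeo));
 *	toeo.to_flags = TOF_MSS | TOF_SCALE;
 *	toeo.to_mss = hw_parsed_mss;
 *	toeo.to_wscale = hw_parsed_wscale;
 *	tcp_offload_syncache_add(inc, &toeo, th, inp, &lso, tu, toepcb);)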
*/ #ifndef TCP_OFFLOAD_DISABLE if (!SO_OFFLOADABLE(so) || (error = tcp_offload_connect(so, nam)) != 0) #endif error = tcp_output(tp); return (error); } static __inline int tcp_output_send(struct tcpcb *tp) { #ifndef TCP_OFFLOAD_DISABLE if (tp_offload(tp)) return (tp->t_tu->tu_send(tp)); #endif return (tcp_output(tp)); } static __inline int tcp_output_rcvd(struct tcpcb *tp) { #ifndef TCP_OFFLOAD_DISABLE if (tp_offload(tp)) return (tp->t_tu->tu_rcvd(tp)); #endif return (tcp_output(tp)); } static __inline int tcp_output_disconnect(struct tcpcb *tp) { #ifndef TCP_OFFLOAD_DISABLE if (tp_offload(tp)) return (tp->t_tu->tu_disconnect(tp)); #endif return (tcp_output(tp)); } static __inline int tcp_output_reset(struct tcpcb *tp) { #ifndef TCP_OFFLOAD_DISABLE if (tp_offload(tp)) return (tp->t_tu->tu_reset(tp)); #endif return (tcp_output(tp)); } static __inline void tcp_offload_detach(struct tcpcb *tp) { #ifndef TCP_OFFLOAD_DISABLE if (tp_offload(tp)) tp->t_tu->tu_detach(tp); #endif } static __inline void tcp_offload_listen_open(struct tcpcb *tp) { #ifndef TCP_OFFLOAD_DISABLE if (SO_OFFLOADABLE(tp->t_inpcb->inp_socket)) EVENTHANDLER_INVOKE(tcp_offload_listen_start, tp); #endif } static __inline void tcp_offload_listen_close(struct tcpcb *tp) { #ifndef TCP_OFFLOAD_DISABLE EVENTHANDLER_INVOKE(tcp_offload_listen_stop, tp); #endif } #undef SO_OFFLOADABLE #endif /* _SYS_SOCKETVAR_H_ */ #undef tp_offload void tcp_offload_twstart(struct tcpcb *tp); struct tcpcb *tcp_offload_close(struct tcpcb *tp); struct tcpcb *tcp_offload_drop(struct tcpcb *tp, int error); #endif /* _NETINET_TCP_OFFLOAD_H_ */ Index: head/sys/netinet/tcp_syncache.c =================================================================== --- head/sys/netinet/tcp_syncache.c (revision 195653) +++ head/sys/netinet/tcp_syncache.c (revision 195654) @@ -1,1781 +1,1794 @@ /*- * Copyright (c) 2001 McAfee, Inc. * Copyright (c) 2006 Andre Oppermann, Internet Business Solutions AG * All rights reserved. * * This software was developed for the FreeBSD Project by Jonathan Lemon * and McAfee Research, the Security Research Division of McAfee, Inc. under * DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), as part of the * DARPA CHATS research program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_ipsec.h" #include #include #include #include #include #include #include #include #include #include #include /* for proc0 declaration */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #include #include #include #include #endif #include #include #include #include #include #include #include #ifdef INET6 #include #endif #include #ifdef IPSEC #include #ifdef INET6 #include #endif #include #endif /*IPSEC*/ #include #include #ifdef VIMAGE_GLOBALS static struct tcp_syncache tcp_syncache; static int tcp_syncookies; static int tcp_syncookiesonly; int tcp_sc_rst_sock_fail; #endif SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_tcp, OID_AUTO, syncookies, CTLFLAG_RW, tcp_syncookies, 0, "Use TCP SYN cookies if the syncache overflows"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_tcp, OID_AUTO, syncookies_only, CTLFLAG_RW, tcp_syncookiesonly, 0, "Use only TCP SYN cookies"); #ifdef TCP_OFFLOAD_DISABLE #define TOEPCB_ISSET(sc) (0) #else #define TOEPCB_ISSET(sc) ((sc)->sc_toepcb != NULL) #endif static void syncache_drop(struct syncache *, struct syncache_head *); static void syncache_free(struct syncache *); static void syncache_insert(struct syncache *, struct syncache_head *); struct syncache *syncache_lookup(struct in_conninfo *, struct syncache_head **); static int syncache_respond(struct syncache *); static struct socket *syncache_socket(struct syncache *, struct socket *, struct mbuf *m); static void syncache_timeout(struct syncache *sc, struct syncache_head *sch, int docallout); static void syncache_timer(void *); static void syncookie_generate(struct syncache_head *, struct syncache *, u_int32_t *); static struct syncache *syncookie_lookup(struct in_conninfo *, struct syncache_head *, struct syncache *, struct tcpopt *, struct tcphdr *, struct socket *); /* * Transmit the SYN,ACK fewer times than TCP_MAXRXTSHIFT specifies. * 3 retransmits corresponds to a timeout of 3 * (1 + 2 + 4 + 8) == 45 seconds, * the odds are that the user has given up attempting to connect by then. 
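 *
 * (Worked arithmetic, from syncache_timeout() below: transmission n is
 * given a timeout of TCPTV_RTOBASE * tcp_backoff[n], i.e. 3, 6, 12 and
 * 24 seconds for the initial SYN,ACK and its three retransmits, which
 * is where the 3 * (1 + 2 + 4 + 8) == 45 second figure comes from.)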
*/ #define SYNCACHE_MAXREXMTS 3 /* Arbitrary values */ #define TCP_SYNCACHE_HASHSIZE 512 #define TCP_SYNCACHE_BUCKETLIMIT 30 SYSCTL_NODE(_net_inet_tcp, OID_AUTO, syncache, CTLFLAG_RW, 0, "TCP SYN cache"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_tcp_syncache, OID_AUTO, bucketlimit, CTLFLAG_RDTUN, tcp_syncache.bucket_limit, 0, "Per-bucket hash limit for syncache"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_tcp_syncache, OID_AUTO, cachelimit, CTLFLAG_RDTUN, tcp_syncache.cache_limit, 0, "Overall entry limit for syncache"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_tcp_syncache, OID_AUTO, count, CTLFLAG_RD, tcp_syncache.cache_count, 0, "Current number of entries in syncache"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_tcp_syncache, OID_AUTO, hashsize, CTLFLAG_RDTUN, tcp_syncache.hashsize, 0, "Size of TCP syncache hashtable"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_tcp_syncache, OID_AUTO, rexmtlimit, CTLFLAG_RW, tcp_syncache.rexmt_limit, 0, "Limit on SYN/ACK retransmissions"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_tcp_syncache, OID_AUTO, rst_on_sock_fail, CTLFLAG_RW, tcp_sc_rst_sock_fail, 0, "Send reset on socket allocation failure"); static MALLOC_DEFINE(M_SYNCACHE, "syncache", "TCP syncache"); #define SYNCACHE_HASH(inc, mask) \ ((V_tcp_syncache.hash_secret ^ \ (inc)->inc_faddr.s_addr ^ \ ((inc)->inc_faddr.s_addr >> 16) ^ \ (inc)->inc_fport ^ (inc)->inc_lport) & mask) #define SYNCACHE_HASH6(inc, mask) \ ((V_tcp_syncache.hash_secret ^ \ (inc)->inc6_faddr.s6_addr32[0] ^ \ (inc)->inc6_faddr.s6_addr32[3] ^ \ (inc)->inc_fport ^ (inc)->inc_lport) & mask) #define ENDPTS_EQ(a, b) ( \ (a)->ie_fport == (b)->ie_fport && \ (a)->ie_lport == (b)->ie_lport && \ (a)->ie_faddr.s_addr == (b)->ie_faddr.s_addr && \ (a)->ie_laddr.s_addr == (b)->ie_laddr.s_addr \ ) #define ENDPTS6_EQ(a, b) (memcmp(a, b, sizeof(*a)) == 0) #define SCH_LOCK(sch) mtx_lock(&(sch)->sch_mtx) #define SCH_UNLOCK(sch) mtx_unlock(&(sch)->sch_mtx) #define SCH_LOCK_ASSERT(sch) mtx_assert(&(sch)->sch_mtx, MA_OWNED) /* * Requires the syncache entry to be already removed from the bucket list. */ static void syncache_free(struct syncache *sc) { INIT_VNET_INET(curvnet); if (sc->sc_ipopts) (void) m_free(sc->sc_ipopts); if (sc->sc_cred) crfree(sc->sc_cred); #ifdef MAC mac_syncache_destroy(&sc->sc_label); #endif uma_zfree(V_tcp_syncache.zone, sc); } void syncache_init(void) { INIT_VNET_INET(curvnet); int i; V_tcp_syncookies = 1; V_tcp_syncookiesonly = 0; V_tcp_sc_rst_sock_fail = 1; V_tcp_syncache.cache_count = 0; V_tcp_syncache.hashsize = TCP_SYNCACHE_HASHSIZE; V_tcp_syncache.bucket_limit = TCP_SYNCACHE_BUCKETLIMIT; V_tcp_syncache.rexmt_limit = SYNCACHE_MAXREXMTS; V_tcp_syncache.hash_secret = arc4random(); TUNABLE_INT_FETCH("net.inet.tcp.syncache.hashsize", &V_tcp_syncache.hashsize); TUNABLE_INT_FETCH("net.inet.tcp.syncache.bucketlimit", &V_tcp_syncache.bucket_limit); if (!powerof2(V_tcp_syncache.hashsize) || V_tcp_syncache.hashsize == 0) { printf("WARNING: syncache hash size is not a power of 2.\n"); V_tcp_syncache.hashsize = TCP_SYNCACHE_HASHSIZE; } V_tcp_syncache.hashmask = V_tcp_syncache.hashsize - 1; /* Set limits. */ V_tcp_syncache.cache_limit = V_tcp_syncache.hashsize * V_tcp_syncache.bucket_limit; TUNABLE_INT_FETCH("net.inet.tcp.syncache.cachelimit", &V_tcp_syncache.cache_limit); /* Allocate the hash table. */ V_tcp_syncache.hashbase = malloc(V_tcp_syncache.hashsize * sizeof(struct syncache_head), M_SYNCACHE, M_WAITOK | M_ZERO); /* Initialize the hash buckets. 
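 * Each row gets its own mutex and callout below, so lookups and timer
 * processing in different buckets can run concurrently.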
*/ for (i = 0; i < V_tcp_syncache.hashsize; i++) { #ifdef VIMAGE V_tcp_syncache.hashbase[i].sch_vnet = curvnet; #endif TAILQ_INIT(&V_tcp_syncache.hashbase[i].sch_bucket); mtx_init(&V_tcp_syncache.hashbase[i].sch_mtx, "tcp_sc_head", NULL, MTX_DEF); callout_init_mtx(&V_tcp_syncache.hashbase[i].sch_timer, &V_tcp_syncache.hashbase[i].sch_mtx, 0); V_tcp_syncache.hashbase[i].sch_length = 0; } /* Create the syncache entry zone. */ V_tcp_syncache.zone = uma_zcreate("syncache", sizeof(struct syncache), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); uma_zone_set_max(V_tcp_syncache.zone, V_tcp_syncache.cache_limit); } #ifdef VIMAGE void syncache_destroy(void) { INIT_VNET_INET(curvnet); /* XXX walk the cache, free remaining objects, stop timers */ uma_zdestroy(V_tcp_syncache.zone); FREE(V_tcp_syncache.hashbase, M_SYNCACHE); } #endif /* * Inserts a syncache entry into the specified bucket row. * Locks and unlocks the syncache_head autonomously. */ static void syncache_insert(struct syncache *sc, struct syncache_head *sch) { INIT_VNET_INET(sch->sch_vnet); struct syncache *sc2; SCH_LOCK(sch); /* * Make sure that we don't overflow the per-bucket limit. * If the bucket is full, toss the oldest element. */ if (sch->sch_length >= V_tcp_syncache.bucket_limit) { KASSERT(!TAILQ_EMPTY(&sch->sch_bucket), ("sch->sch_length incorrect")); sc2 = TAILQ_LAST(&sch->sch_bucket, sch_head); syncache_drop(sc2, sch); TCPSTAT_INC(tcps_sc_bucketoverflow); } /* Put it into the bucket. */ TAILQ_INSERT_HEAD(&sch->sch_bucket, sc, sc_hash); sch->sch_length++; /* Reinitialize the bucket row's timer. */ if (sch->sch_length == 1) sch->sch_nextc = ticks + INT_MAX; syncache_timeout(sc, sch, 1); SCH_UNLOCK(sch); V_tcp_syncache.cache_count++; TCPSTAT_INC(tcps_sc_added); } /* * Remove and free entry from syncache bucket row. * Expects locked syncache head. */ static void syncache_drop(struct syncache *sc, struct syncache_head *sch) { INIT_VNET_INET(sch->sch_vnet); SCH_LOCK_ASSERT(sch); TAILQ_REMOVE(&sch->sch_bucket, sc, sc_hash); sch->sch_length--; #ifndef TCP_OFFLOAD_DISABLE if (sc->sc_tu) sc->sc_tu->tu_syncache_event(TOE_SC_DROP, sc->sc_toepcb); #endif syncache_free(sc); V_tcp_syncache.cache_count--; } /* * Engage/reengage timer on bucket row. */ static void syncache_timeout(struct syncache *sc, struct syncache_head *sch, int docallout) { sc->sc_rxttime = ticks + TCPTV_RTOBASE * (tcp_backoff[sc->sc_rxmits]); sc->sc_rxmits++; if (TSTMP_LT(sc->sc_rxttime, sch->sch_nextc)) { sch->sch_nextc = sc->sc_rxttime; if (docallout) callout_reset(&sch->sch_timer, sch->sch_nextc - ticks, syncache_timer, (void *)sch); } } /* * Walk the timer queues, looking for SYN,ACKs that need to be retransmitted. * If we have retransmitted an entry the maximum number of times, expire it. * One separate timer for each bucket row. */ static void syncache_timer(void *xsch) { struct syncache_head *sch = (struct syncache_head *)xsch; struct syncache *sc, *nsc; int tick = ticks; char *s; CURVNET_SET(sch->sch_vnet); INIT_VNET_INET(sch->sch_vnet); /* NB: syncache_head has already been locked by the callout. */ SCH_LOCK_ASSERT(sch); /* * In the following cycle we may remove some entries and/or * advance some timeouts, so re-initialize the bucket timer. */ sch->sch_nextc = tick + INT_MAX; TAILQ_FOREACH_SAFE(sc, &sch->sch_bucket, sc_hash, nsc) { /* * We do not check if the listen socket still exists * and accept the case where the listen socket may be * gone by the time we resend the SYN/ACK. We do * not expect this to happen often.
If it does, * then the RST will be sent by the time the remote * host does the SYN/ACK->ACK. */ if (TSTMP_GT(sc->sc_rxttime, tick)) { if (TSTMP_LT(sc->sc_rxttime, sch->sch_nextc)) sch->sch_nextc = sc->sc_rxttime; continue; } if (sc->sc_rxmits > V_tcp_syncache.rexmt_limit) { if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Retransmits exhausted, " "giving up and removing syncache entry\n", s, __func__); free(s, M_TCPLOG); } syncache_drop(sc, sch); TCPSTAT_INC(tcps_sc_stale); continue; } if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Response timeout, " "retransmitting (%u) SYN|ACK\n", s, __func__, sc->sc_rxmits); free(s, M_TCPLOG); } (void) syncache_respond(sc); TCPSTAT_INC(tcps_sc_retransmitted); syncache_timeout(sc, sch, 0); } if (!TAILQ_EMPTY(&(sch)->sch_bucket)) callout_reset(&(sch)->sch_timer, (sch)->sch_nextc - tick, syncache_timer, (void *)(sch)); CURVNET_RESTORE(); } /* * Find an entry in the syncache. * Always returns with a locked syncache_head plus a matching entry or NULL. */ struct syncache * syncache_lookup(struct in_conninfo *inc, struct syncache_head **schp) { INIT_VNET_INET(curvnet); struct syncache *sc; struct syncache_head *sch; #ifdef INET6 if (inc->inc_flags & INC_ISIPV6) { sch = &V_tcp_syncache.hashbase[ SYNCACHE_HASH6(inc, V_tcp_syncache.hashmask)]; *schp = sch; SCH_LOCK(sch); /* Circle through bucket row to find matching entry. */ TAILQ_FOREACH(sc, &sch->sch_bucket, sc_hash) { if (ENDPTS6_EQ(&inc->inc_ie, &sc->sc_inc.inc_ie)) return (sc); } } else #endif { sch = &V_tcp_syncache.hashbase[ SYNCACHE_HASH(inc, V_tcp_syncache.hashmask)]; *schp = sch; SCH_LOCK(sch); /* Circle through bucket row to find matching entry. */ TAILQ_FOREACH(sc, &sch->sch_bucket, sc_hash) { #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) continue; #endif if (ENDPTS_EQ(&inc->inc_ie, &sc->sc_inc.inc_ie)) return (sc); } } SCH_LOCK_ASSERT(*schp); return (NULL); /* always returns with locked sch */ } /* * This function is called when we get a RST for a * non-existent connection, so that we can see if the * connection is in the syn cache. If it is, zap it. */ void syncache_chkrst(struct in_conninfo *inc, struct tcphdr *th) { INIT_VNET_INET(curvnet); struct syncache *sc; struct syncache_head *sch; char *s = NULL; sc = syncache_lookup(inc, &sch); /* returns locked sch */ SCH_LOCK_ASSERT(sch); /* * Any RST to our SYN|ACK must not carry ACK, SYN or FIN flags. * See RFC 793 page 65, section SEGMENT ARRIVES. */ if (th->th_flags & (TH_ACK|TH_SYN|TH_FIN)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Spurious RST with ACK, SYN or " "FIN flag set, segment ignored\n", s, __func__); TCPSTAT_INC(tcps_badrst); goto done; } /* * No corresponding connection was found in syncache. * If syncookies are enabled and possibly exclusively * used, or we are under memory pressure, a valid RST * may not find a syncache entry. In that case we're * done and no SYN|ACK retransmissions will happen. * Otherwise the RST was misdirected or spoofed. */ if (sc == NULL) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Spurious RST without matching " "syncache entry (possibly syncookie only), " "segment ignored\n", s, __func__); TCPSTAT_INC(tcps_badrst); goto done; } /* * If the RST bit is set, check the sequence number to see * if this is a valid reset segment. * RFC 793 page 37: * In all states except SYN-SENT, all reset (RST) segments * are validated by checking their SEQ-fields.
A reset is * valid if its sequence number is in the window. * * The sequence number in the reset segment is normally an * echo of our outgoing acknowledgement numbers, but some hosts * send a reset with the sequence number at the rightmost edge * of our receive window, and we have to handle this case. */ if (SEQ_GEQ(th->th_seq, sc->sc_irs) && SEQ_LEQ(th->th_seq, sc->sc_irs + sc->sc_wnd)) { syncache_drop(sc, sch); if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Our SYN|ACK was rejected, " "connection attempt aborted by remote endpoint\n", s, __func__); TCPSTAT_INC(tcps_sc_reset); } else { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: RST with invalid SEQ %u != " "IRS %u (+WND %u), segment ignored\n", s, __func__, th->th_seq, sc->sc_irs, sc->sc_wnd); TCPSTAT_INC(tcps_badrst); } done: if (s != NULL) free(s, M_TCPLOG); SCH_UNLOCK(sch); } void syncache_badack(struct in_conninfo *inc) { INIT_VNET_INET(curvnet); struct syncache *sc; struct syncache_head *sch; sc = syncache_lookup(inc, &sch); /* returns locked sch */ SCH_LOCK_ASSERT(sch); if (sc != NULL) { syncache_drop(sc, sch); TCPSTAT_INC(tcps_sc_badack); } SCH_UNLOCK(sch); } void syncache_unreach(struct in_conninfo *inc, struct tcphdr *th) { INIT_VNET_INET(curvnet); struct syncache *sc; struct syncache_head *sch; sc = syncache_lookup(inc, &sch); /* returns locked sch */ SCH_LOCK_ASSERT(sch); if (sc == NULL) goto done; /* If the sequence number != sc_iss, then it's a bogus ICMP msg */ if (ntohl(th->th_seq) != sc->sc_iss) goto done; /* * If we've retransmitted 3 times and this is our second error, * we remove the entry. Otherwise, we allow it to continue on. * This prevents us from incorrectly nuking an entry during a * spurious network outage. * * See tcp_notify(). */ if ((sc->sc_flags & SCF_UNREACH) == 0 || sc->sc_rxmits < 3 + 1) { sc->sc_flags |= SCF_UNREACH; goto done; } syncache_drop(sc, sch); TCPSTAT_INC(tcps_sc_unreach); done: SCH_UNLOCK(sch); } /* * Build a new TCP socket structure from a syncache entry. */ static struct socket * syncache_socket(struct syncache *sc, struct socket *lso, struct mbuf *m) { INIT_VNET_INET(lso->so_vnet); struct inpcb *inp = NULL; struct socket *so; struct tcpcb *tp; char *s; INP_INFO_WLOCK_ASSERT(&V_tcbinfo); /* * Ok, create the full blown connection, and set things up * as they would have been set up if we had created the * connection when the SYN arrived. If we can't create * the connection, abort it. */ so = sonewconn(lso, SS_ISCONNECTED); if (so == NULL) { /* * Drop the connection; we will either send a RST or * have the peer retransmit its SYN again after its * RTO and try again. */ TCPSTAT_INC(tcps_listendrop); if ((s = tcp_log_addrs(&sc->sc_inc, NULL, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Socket create failed " "due to limits or memory shortage\n", s, __func__); free(s, M_TCPLOG); } goto abort2; } #ifdef MAC mac_socketpeer_set_from_mbuf(m, so); #endif inp = sotoinpcb(so); inp->inp_inc.inc_fibnum = sc->sc_inc.inc_fibnum; so->so_fibnum = sc->sc_inc.inc_fibnum; INP_WLOCK(inp); /* Insert new socket into PCB hash list. */ inp->inp_inc.inc_flags = sc->sc_inc.inc_flags; #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) { inp->in6p_laddr = sc->sc_inc.inc6_laddr; } else { inp->inp_vflag &= ~INP_IPV6; inp->inp_vflag |= INP_IPV4; #endif inp->inp_laddr = sc->sc_inc.inc_laddr; #ifdef INET6 } #endif inp->inp_lport = sc->sc_inc.inc_lport; if (in_pcbinshash(inp) != 0) { /* * Undo the assignments above if we failed to * put the PCB on the hash lists.
*/ #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) inp->in6p_laddr = in6addr_any; else #endif inp->inp_laddr.s_addr = INADDR_ANY; inp->inp_lport = 0; goto abort; } #ifdef IPSEC /* Copy old policy into new socket's. */ if (ipsec_copy_policy(sotoinpcb(lso)->inp_sp, inp->inp_sp)) printf("syncache_socket: could not copy policy\n"); #endif #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) { struct inpcb *oinp = sotoinpcb(lso); struct in6_addr laddr6; struct sockaddr_in6 sin6; /* * Inherit socket options from the listening socket. * Note that in6p_inputopts is not (and should not be) * copied, since it stores previously received options and is * used to detect if each new option is different from the * previous one and hence should be passed to a user. * If we copied in6p_inputopts, a user would not be able to * receive options just after calling the accept system call. */ inp->inp_flags |= oinp->inp_flags & INP_CONTROLOPTS; if (oinp->in6p_outputopts) inp->in6p_outputopts = ip6_copypktopts(oinp->in6p_outputopts, M_NOWAIT); sin6.sin6_family = AF_INET6; sin6.sin6_len = sizeof(sin6); sin6.sin6_addr = sc->sc_inc.inc6_faddr; sin6.sin6_port = sc->sc_inc.inc_fport; sin6.sin6_flowinfo = sin6.sin6_scope_id = 0; laddr6 = inp->in6p_laddr; if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) inp->in6p_laddr = sc->sc_inc.inc6_laddr; if (in6_pcbconnect(inp, (struct sockaddr *)&sin6, thread0.td_ucred)) { inp->in6p_laddr = laddr6; goto abort; } /* Override flowlabel from in6_pcbconnect. */ inp->inp_flow &= ~IPV6_FLOWLABEL_MASK; inp->inp_flow |= sc->sc_flowlabel; } else #endif { struct in_addr laddr; struct sockaddr_in sin; inp->inp_options = (m) ? ip_srcroute(m) : NULL; if (inp->inp_options == NULL) { inp->inp_options = sc->sc_ipopts; sc->sc_ipopts = NULL; } sin.sin_family = AF_INET; sin.sin_len = sizeof(sin); sin.sin_addr = sc->sc_inc.inc_faddr; sin.sin_port = sc->sc_inc.inc_fport; bzero((caddr_t)sin.sin_zero, sizeof(sin.sin_zero)); laddr = inp->inp_laddr; if (inp->inp_laddr.s_addr == INADDR_ANY) inp->inp_laddr = sc->sc_inc.inc_laddr; if (in_pcbconnect(inp, (struct sockaddr *)&sin, thread0.td_ucred)) { inp->inp_laddr = laddr; goto abort; } } tp = intotcpcb(inp); tp->t_state = TCPS_SYN_RECEIVED; tp->iss = sc->sc_iss; tp->irs = sc->sc_irs; tcp_rcvseqinit(tp); tcp_sendseqinit(tp); tp->snd_wl1 = sc->sc_irs; tp->snd_max = tp->iss + 1; tp->snd_nxt = tp->iss + 1; tp->rcv_up = sc->sc_irs + 1; tp->rcv_wnd = sc->sc_wnd; tp->rcv_adv += tp->rcv_wnd; tp->last_ack_sent = tp->rcv_nxt; tp->t_flags = sototcpcb(lso)->t_flags & (TF_NOPUSH|TF_NODELAY); if (sc->sc_flags & SCF_NOOPT) tp->t_flags |= TF_NOOPT; else { if (sc->sc_flags & SCF_WINSCALE) { tp->t_flags |= TF_REQ_SCALE|TF_RCVD_SCALE; tp->snd_scale = sc->sc_requested_s_scale; tp->request_r_scale = sc->sc_requested_r_scale; } if (sc->sc_flags & SCF_TIMESTAMP) { tp->t_flags |= TF_REQ_TSTMP|TF_RCVD_TSTMP; tp->ts_recent = sc->sc_tsreflect; tp->ts_recent_age = ticks; tp->ts_offset = sc->sc_tsoff; } #ifdef TCP_SIGNATURE if (sc->sc_flags & SCF_SIGNATURE) tp->t_flags |= TF_SIGNATURE; #endif if (sc->sc_flags & SCF_SACK) tp->t_flags |= TF_SACK_PERMIT; } if (sc->sc_flags & SCF_ECN) tp->t_flags |= TF_ECN_PERMIT; /* * Set up MSS and get cached values from tcp_hostcache. * This might overwrite some of the defaults we just set. */ tcp_mss(tp, sc->sc_peer_mss); /* * If the SYN,ACK was retransmitted, reset cwnd to 1 segment.
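 * (A retransmitted SYN,ACK implies the handshake lost a segment, so the
 * congestion window is restarted at one segment, matching the RFC 3390
 * requirement for a lost SYN or SYN,ACK.)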
*/ if (sc->sc_rxmits) tp->snd_cwnd = tp->t_maxseg; tcp_timer_activate(tp, TT_KEEP, tcp_keepinit); INP_WUNLOCK(inp); TCPSTAT_INC(tcps_accepts); return (so); abort: INP_WUNLOCK(inp); abort2: if (so != NULL) soabort(so); return (NULL); } /* * This function gets called when we receive an ACK for a * socket in the LISTEN state. We look up the connection * in the syncache, and if it's there, we pull it out of * the cache and turn it into a full-blown connection in * the SYN-RECEIVED state. */ int syncache_expand(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th, struct socket **lsop, struct mbuf *m) { INIT_VNET_INET(curvnet); struct syncache *sc; struct syncache_head *sch; struct syncache scs; char *s; /* * Global TCP locks are held because we manipulate the PCB lists * and create a new socket. */ INP_INFO_WLOCK_ASSERT(&V_tcbinfo); KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_ACK, ("%s: can handle only ACK", __func__)); sc = syncache_lookup(inc, &sch); /* returns locked sch */ SCH_LOCK_ASSERT(sch); if (sc == NULL) { /* * There is no syncache entry, so see if this ACK is * a returning syncookie. To do this, first: * A. See if this socket has had a syncache entry dropped in * the past. We don't want to accept a bogus syncookie * if we've never received a SYN. * B. Check that the syncookie is valid. If it is, then * cobble up a fake syncache entry, and return. */ if (!V_tcp_syncookies) { SCH_UNLOCK(sch); if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Spurious ACK, " "segment rejected (syncookies disabled)\n", s, __func__); goto failed; } bzero(&scs, sizeof(scs)); sc = syncookie_lookup(inc, sch, &scs, to, th, *lsop); SCH_UNLOCK(sch); if (sc == NULL) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Segment failed " "SYNCOOKIE authentication, segment rejected " "(probably spoofed)\n", s, __func__); goto failed; } } else { /* Pull out the entry to unlock the bucket row. */ TAILQ_REMOVE(&sch->sch_bucket, sc, sc_hash); sch->sch_length--; V_tcp_syncache.cache_count--; SCH_UNLOCK(sch); } /* * Segment validation: * ACK must match our initial sequence number + 1 (the SYN|ACK). */ if (th->th_ack != sc->sc_iss + 1 && !TOEPCB_ISSET(sc)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: ACK %u != ISS+1 %u, segment " "rejected\n", s, __func__, th->th_ack, sc->sc_iss); goto failed; } /* * The SEQ must fall in the window starting at the received * initial receive sequence number + 1 (the SYN). */ if ((SEQ_LEQ(th->th_seq, sc->sc_irs) || SEQ_GT(th->th_seq, sc->sc_irs + sc->sc_wnd)) && !TOEPCB_ISSET(sc)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: SEQ %u != IRS+1 %u, segment " "rejected\n", s, __func__, th->th_seq, sc->sc_irs); goto failed; } if (!(sc->sc_flags & SCF_TIMESTAMP) && (to->to_flags & TOF_TS)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: Timestamp not expected, " "segment rejected\n", s, __func__); goto failed; } /* * If timestamps were negotiated the reflected timestamp * must be equal to what we actually sent in the SYN|ACK. */ if ((to->to_flags & TOF_TS) && to->to_tsecr != sc->sc_ts && !TOEPCB_ISSET(sc)) { if ((s = tcp_log_addrs(inc, th, NULL, NULL))) log(LOG_DEBUG, "%s; %s: TSECR %u != TS %u, " "segment rejected\n", s, __func__, to->to_tsecr, sc->sc_ts); goto failed; } *lsop = syncache_socket(sc, *lsop, m); if (*lsop == NULL) TCPSTAT_INC(tcps_sc_aborted); else TCPSTAT_INC(tcps_sc_completed); /* how do we find the inp for the new socket?
*/ if (sc != &scs) syncache_free(sc); return (1); failed: if (sc != NULL && sc != &scs) syncache_free(sc); if (s != NULL) free(s, M_TCPLOG); *lsop = NULL; return (0); } int -tcp_offload_syncache_expand(struct in_conninfo *inc, struct tcpopt *to, +tcp_offload_syncache_expand(struct in_conninfo *inc, struct toeopt *toeo, struct tcphdr *th, struct socket **lsop, struct mbuf *m) { INIT_VNET_INET(curvnet); + struct tcpopt to; int rc; + + bzero(&to, sizeof(struct tcpopt)); + to.to_mss = toeo->to_mss; + to.to_wscale = toeo->to_wscale; + to.to_flags = toeo->to_flags; INP_INFO_WLOCK(&V_tcbinfo); - rc = syncache_expand(inc, to, th, lsop, m); + rc = syncache_expand(inc, &to, th, lsop, m); INP_INFO_WUNLOCK(&V_tcbinfo); return (rc); } /* * Given a LISTEN socket and an inbound SYN request, add * this to the syn cache, and send back a segment: * <SEQ=ISS><ACK=RCV_NXT><CTL=SYN,ACK> * to the source. * * IMPORTANT NOTE: We do _NOT_ ACK data that might accompany the SYN. * Doing so would require that we hold onto the data and deliver it * to the application. However, if we are the target of a SYN-flood * DoS attack, an attacker could send data which would eventually * consume all available buffer space if it were ACKed. By not ACKing * the data, we avoid this DoS scenario. */ static void _syncache_add(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th, struct inpcb *inp, struct socket **lsop, struct mbuf *m, struct toe_usrreqs *tu, void *toepcb) { INIT_VNET_INET(inp->inp_vnet); struct tcpcb *tp; struct socket *so; struct syncache *sc = NULL; struct syncache_head *sch; struct mbuf *ipopts = NULL; u_int32_t flowtmp; int win, sb_hiwat, ip_ttl, ip_tos, noopt; char *s; #ifdef INET6 int autoflowlabel = 0; #endif #ifdef MAC struct label *maclabel; #endif struct syncache scs; struct ucred *cred; INP_INFO_WLOCK_ASSERT(&V_tcbinfo); INP_WLOCK_ASSERT(inp); /* listen socket */ KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_SYN, ("%s: unexpected tcp flags", __func__)); /* * Combine all so/tp operations very early to drop the INP lock as * soon as possible. */ so = *lsop; tp = sototcpcb(so); cred = crhold(so->so_cred); #ifdef INET6 if ((inc->inc_flags & INC_ISIPV6) && (inp->inp_flags & IN6P_AUTOFLOWLABEL)) autoflowlabel = 1; #endif ip_ttl = inp->inp_ip_ttl; ip_tos = inp->inp_ip_tos; win = sbspace(&so->so_rcv); sb_hiwat = so->so_rcv.sb_hiwat; noopt = (tp->t_flags & TF_NOOPT); /* By the time we drop the lock these should no longer be used. */ so = NULL; tp = NULL; #ifdef MAC if (mac_syncache_init(&maclabel) != 0) { INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_tcbinfo); goto done; } else mac_syncache_create(maclabel, inp); #endif INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_tcbinfo); /* * Remember the IP options, if any. */ #ifdef INET6 if (!(inc->inc_flags & INC_ISIPV6)) #endif ipopts = (m) ? ip_srcroute(m) : NULL; /* * See if we already have an entry for this connection. * If we do, resend the SYN,ACK, and reset the retransmit timer. * * XXX: should the syncache be re-initialized with the contents * of the new SYN here (which may have different options?) * * XXX: We do not check the sequence number to see if this is a * real retransmit or a new connection attempt. The question is * how to handle such a case; either ignore it as spoofed, or * drop the current entry and create a new one?
sc = syncache_lookup(inc, &sch); /* returns locked entry */ SCH_LOCK_ASSERT(sch); if (sc != NULL) { #ifndef TCP_OFFLOAD_DISABLE if (sc->sc_tu) sc->sc_tu->tu_syncache_event(TOE_SC_ENTRY_PRESENT, sc->sc_toepcb); #endif TCPSTAT_INC(tcps_sc_dupsyn); if (ipopts) { /* * If we were remembering a previous source route, * forget it and use the new one we've been given. */ if (sc->sc_ipopts) (void) m_free(sc->sc_ipopts); sc->sc_ipopts = ipopts; } /* * Update timestamp if present. */ if ((sc->sc_flags & SCF_TIMESTAMP) && (to->to_flags & TOF_TS)) sc->sc_tsreflect = to->to_tsval; else sc->sc_flags &= ~SCF_TIMESTAMP; #ifdef MAC /* * Since we have already unconditionally allocated label * storage, free it up. The syncache entry will already * have an initialized label we can use. */ mac_syncache_destroy(&maclabel); #endif /* Retransmit SYN|ACK and reset retransmit count. */ if ((s = tcp_log_addrs(&sc->sc_inc, th, NULL, NULL))) { log(LOG_DEBUG, "%s; %s: Received duplicate SYN, " "resetting timer and retransmitting SYN|ACK\n", s, __func__); free(s, M_TCPLOG); } if (!TOEPCB_ISSET(sc) && syncache_respond(sc) == 0) { sc->sc_rxmits = 0; syncache_timeout(sc, sch, 1); TCPSTAT_INC(tcps_sndacks); TCPSTAT_INC(tcps_sndtotal); } SCH_UNLOCK(sch); goto done; } sc = uma_zalloc(V_tcp_syncache.zone, M_NOWAIT | M_ZERO); if (sc == NULL) { /* * The zone allocator couldn't provide more entries. * Treat this as if the cache was full; drop the oldest * entry and insert the new one. */ TCPSTAT_INC(tcps_sc_zonefail); if ((sc = TAILQ_LAST(&sch->sch_bucket, sch_head)) != NULL) syncache_drop(sc, sch); sc = uma_zalloc(V_tcp_syncache.zone, M_NOWAIT | M_ZERO); if (sc == NULL) { if (V_tcp_syncookies) { bzero(&scs, sizeof(scs)); sc = &scs; } else { SCH_UNLOCK(sch); if (ipopts) (void) m_free(ipopts); goto done; } } } /* * Fill in the syncache values. */ #ifdef MAC sc->sc_label = maclabel; #endif sc->sc_cred = cred; cred = NULL; sc->sc_ipopts = ipopts; /* XXX-BZ this fib assignment is just useless. */ sc->sc_inc.inc_fibnum = inp->inp_inc.inc_fibnum; bcopy(inc, &sc->sc_inc, sizeof(struct in_conninfo)); #ifdef INET6 if (!(inc->inc_flags & INC_ISIPV6)) #endif { sc->sc_ip_tos = ip_tos; sc->sc_ip_ttl = ip_ttl; } #ifndef TCP_OFFLOAD_DISABLE sc->sc_tu = tu; sc->sc_toepcb = toepcb; #endif sc->sc_irs = th->th_seq; sc->sc_iss = arc4random(); sc->sc_flags = 0; sc->sc_flowlabel = 0; /* * Initial receive window: clip sbspace to [0 .. TCP_MAXWIN]. * win was derived from socket earlier in the function. */ win = imax(win, 0); win = imin(win, TCP_MAXWIN); sc->sc_wnd = win; if (V_tcp_do_rfc1323) { /* * A timestamp received in a SYN makes * it ok to send timestamp requests and replies. */ if (to->to_flags & TOF_TS) { sc->sc_tsreflect = to->to_tsval; sc->sc_ts = ticks; sc->sc_flags |= SCF_TIMESTAMP; } if (to->to_flags & TOF_SCALE) { int wscale = 0; /* * Pick the smallest possible scaling factor that * will still allow us to scale up to sb_max, aka * kern.ipc.maxsockbuf. * * We do this because there are broken firewalls that * will corrupt the window scale option, leading to * the other endpoint believing that our advertised * window is unscaled. At scale factors larger than * 5 the unscaled window will drop below 1500 bytes, * leading to serious problems when traversing these * broken firewalls. * * With the default maxsockbuf of 256K, a scale factor * of 3 will be chosen by this algorithm. Those who * choose a larger maxsockbuf should watch out * for the compatibility problems mentioned above.
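 *
 * (Worked out for those defaults: with sb_max = 262144 and
 * TCP_MAXWIN = 65535, the loop below stops at wscale = 3, since
 * 65535 << 2 = 262140 is still smaller than sb_max while
 * 65535 << 3 is not.)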
* * RFC1323: The Window field in a SYN (i.e., a <SYN> * or <SYN,ACK>) segment itself is never scaled. */ while (wscale < TCP_MAX_WINSHIFT && (TCP_MAXWIN << wscale) < sb_max) wscale++; sc->sc_requested_r_scale = wscale; sc->sc_requested_s_scale = to->to_wscale; sc->sc_flags |= SCF_WINSCALE; } } #ifdef TCP_SIGNATURE /* * If listening socket requested TCP digests, and received SYN * contains the option, flag this in the syncache so that * syncache_respond() will do the right thing with the SYN+ACK. * XXX: Currently we always record the option by default and will * attempt to use it in syncache_respond(). */ if (to->to_flags & TOF_SIGNATURE) sc->sc_flags |= SCF_SIGNATURE; #endif if (to->to_flags & TOF_SACKPERM) sc->sc_flags |= SCF_SACK; if (to->to_flags & TOF_MSS) sc->sc_peer_mss = to->to_mss; /* peer mss may be zero */ if (noopt) sc->sc_flags |= SCF_NOOPT; if ((th->th_flags & (TH_ECE|TH_CWR)) && V_tcp_do_ecn) sc->sc_flags |= SCF_ECN; if (V_tcp_syncookies) { syncookie_generate(sch, sc, &flowtmp); #ifdef INET6 if (autoflowlabel) sc->sc_flowlabel = flowtmp; #endif } else { #ifdef INET6 if (autoflowlabel) sc->sc_flowlabel = (htonl(ip6_randomflowlabel()) & IPV6_FLOWLABEL_MASK); #endif } SCH_UNLOCK(sch); /* * Do a standard 3-way handshake. */ if (TOEPCB_ISSET(sc) || syncache_respond(sc) == 0) { if (V_tcp_syncookies && V_tcp_syncookiesonly && sc != &scs) syncache_free(sc); else if (sc != &scs) syncache_insert(sc, sch); /* locks and unlocks sch */ TCPSTAT_INC(tcps_sndacks); TCPSTAT_INC(tcps_sndtotal); } else { if (sc != &scs) syncache_free(sc); TCPSTAT_INC(tcps_sc_dropped); } done: if (cred != NULL) crfree(cred); #ifdef MAC if (sc == &scs) mac_syncache_destroy(&maclabel); #endif if (m) { *lsop = NULL; m_freem(m); } } static int syncache_respond(struct syncache *sc) { INIT_VNET_INET(curvnet); struct ip *ip = NULL; struct mbuf *m; struct tcphdr *th; int optlen, error; u_int16_t hlen, tlen, mssopt; struct tcpopt to; #ifdef INET6 struct ip6_hdr *ip6 = NULL; #endif hlen = #ifdef INET6 (sc->sc_inc.inc_flags & INC_ISIPV6) ? sizeof(struct ip6_hdr) : #endif sizeof(struct ip); tlen = hlen + sizeof(struct tcphdr); /* Determine MSS we advertise to other end of connection. */ mssopt = tcp_mssopt(&sc->sc_inc); if (sc->sc_peer_mss) mssopt = max( min(sc->sc_peer_mss, mssopt), V_tcp_minmss); /* XXX: Assume that the entire packet will fit in a header mbuf. */ KASSERT(max_linkhdr + tlen + TCP_MAXOLEN <= MHLEN, ("syncache: mbuf too small")); /* Create the IP+TCP header from scratch. */ m = m_gethdr(M_DONTWAIT, MT_DATA); if (m == NULL) return (ENOBUFS); #ifdef MAC mac_syncache_create_mbuf(sc->sc_label, m); #endif m->m_data += max_linkhdr; m->m_len = tlen; m->m_pkthdr.len = tlen; m->m_pkthdr.rcvif = NULL; #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) { ip6 = mtod(m, struct ip6_hdr *); ip6->ip6_vfc = IPV6_VERSION; ip6->ip6_nxt = IPPROTO_TCP; ip6->ip6_src = sc->sc_inc.inc6_laddr; ip6->ip6_dst = sc->sc_inc.inc6_faddr; ip6->ip6_plen = htons(tlen - hlen); /* ip6_hlim is set after checksum */ ip6->ip6_flow &= ~IPV6_FLOWLABEL_MASK; ip6->ip6_flow |= sc->sc_flowlabel; th = (struct tcphdr *)(ip6 + 1); } else #endif { ip = mtod(m, struct ip *); ip->ip_v = IPVERSION; ip->ip_hl = sizeof(struct ip) >> 2; ip->ip_len = tlen; ip->ip_id = 0; ip->ip_off = 0; ip->ip_sum = 0; ip->ip_p = IPPROTO_TCP; ip->ip_src = sc->sc_inc.inc_laddr; ip->ip_dst = sc->sc_inc.inc_faddr; ip->ip_ttl = sc->sc_ip_ttl; ip->ip_tos = sc->sc_ip_tos; /* * See if we should do MTU discovery.
Route lookups are * expensive, so we will only unset the DF bit if either: * * 1) path_mtu_discovery is disabled, or * 2) the SCF_UNREACH flag has been set */ if (V_path_mtu_discovery && ((sc->sc_flags & SCF_UNREACH) == 0)) ip->ip_off |= IP_DF; th = (struct tcphdr *)(ip + 1); } th->th_sport = sc->sc_inc.inc_lport; th->th_dport = sc->sc_inc.inc_fport; th->th_seq = htonl(sc->sc_iss); th->th_ack = htonl(sc->sc_irs + 1); th->th_off = sizeof(struct tcphdr) >> 2; th->th_x2 = 0; th->th_flags = TH_SYN|TH_ACK; th->th_win = htons(sc->sc_wnd); th->th_urp = 0; if (sc->sc_flags & SCF_ECN) { th->th_flags |= TH_ECE; TCPSTAT_INC(tcps_ecn_shs); } /* Tack on the TCP options. */ if ((sc->sc_flags & SCF_NOOPT) == 0) { to.to_flags = 0; to.to_mss = mssopt; to.to_flags = TOF_MSS; if (sc->sc_flags & SCF_WINSCALE) { to.to_wscale = sc->sc_requested_r_scale; to.to_flags |= TOF_SCALE; } if (sc->sc_flags & SCF_TIMESTAMP) { /* Virgin timestamp or TCP cookie enhanced one. */ to.to_tsval = sc->sc_ts; to.to_tsecr = sc->sc_tsreflect; to.to_flags |= TOF_TS; } if (sc->sc_flags & SCF_SACK) to.to_flags |= TOF_SACKPERM; #ifdef TCP_SIGNATURE if (sc->sc_flags & SCF_SIGNATURE) to.to_flags |= TOF_SIGNATURE; #endif optlen = tcp_addoptions(&to, (u_char *)(th + 1)); /* Adjust headers by option size. */ th->th_off = (sizeof(struct tcphdr) + optlen) >> 2; m->m_len += optlen; m->m_pkthdr.len += optlen; #ifdef TCP_SIGNATURE if (sc->sc_flags & SCF_SIGNATURE) tcp_signature_compute(m, 0, 0, optlen, to.to_signature, IPSEC_DIR_OUTBOUND); #endif #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) ip6->ip6_plen = htons(ntohs(ip6->ip6_plen) + optlen); else #endif ip->ip_len += optlen; } else optlen = 0; #ifdef INET6 if (sc->sc_inc.inc_flags & INC_ISIPV6) { th->th_sum = 0; th->th_sum = in6_cksum(m, IPPROTO_TCP, hlen, tlen + optlen - hlen); ip6->ip6_hlim = in6_selecthlim(NULL, NULL); error = ip6_output(m, NULL, NULL, 0, NULL, NULL, NULL); } else #endif { th->th_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htons(tlen + optlen - hlen + IPPROTO_TCP)); m->m_pkthdr.csum_flags = CSUM_TCP; m->m_pkthdr.csum_data = offsetof(struct tcphdr, th_sum); error = ip_output(m, sc->sc_ipopts, NULL, 0, NULL, NULL); } return (error); } void syncache_add(struct in_conninfo *inc, struct tcpopt *to, struct tcphdr *th, struct inpcb *inp, struct socket **lsop, struct mbuf *m) { _syncache_add(inc, to, th, inp, lsop, m, NULL, NULL); } void -tcp_offload_syncache_add(struct in_conninfo *inc, struct tcpopt *to, +tcp_offload_syncache_add(struct in_conninfo *inc, struct toeopt *toeo, struct tcphdr *th, struct inpcb *inp, struct socket **lsop, struct toe_usrreqs *tu, void *toepcb) { INIT_VNET_INET(curvnet); + struct tcpopt to; + bzero(&to, sizeof(struct tcpopt)); + to.to_mss = toeo->to_mss; + to.to_wscale = toeo->to_wscale; + to.to_flags = toeo->to_flags; + INP_INFO_WLOCK(&V_tcbinfo); INP_WLOCK(inp); - _syncache_add(inc, to, th, inp, lsop, NULL, tu, toepcb); + + _syncache_add(inc, &to, th, inp, lsop, NULL, tu, toepcb); } /* * The purpose of SYN cookies is to avoid keeping track of all SYN's we * receive and to be able to handle SYN floods from bogus source addresses * (where we will never receive any reply). SYN floods try to exhaust all * our memory and available slots in the SYN cache table to cause a denial * of service to legitimate users of the local host.
* * The idea of SYN cookies is to encode and include all necessary information * about the connection setup state within the SYN-ACK we send back and thus * to get along without keeping any local state until the ACK to the SYN-ACK * arrives (if ever). Everything we need to know should be available from * the information we encoded in the SYN-ACK. * * More information about the theory behind SYN cookies and their first * discussion and specification can be found at: * http://cr.yp.to/syncookies.html (overview) * http://cr.yp.to/syncookies/archive (gory details) * * This implementation extends the original idea and first implementation * of FreeBSD by using not only the initial sequence number field to store * information but also the timestamp field if present. This way we can * keep track of the entire state we need to know to recreate the session in * its original form. Almost all TCP speakers implement RFC1323 timestamps * these days. For those that do not we still have to live with the known * shortcomings of the ISN-only SYN cookies. * * Cookie layers: * * Initial sequence number we send: * 31|................................|0 * DDDDDDDDDDDDDDDDDDDDDDDDDMMMRRRP * D = MD5 Digest (first dword) * M = MSS index * R = Rotation of secret * P = Odd or Even secret * * The MD5 Digest is computed over the following parameters: * a) randomly rotated secret * b) struct in_conninfo containing the remote/local ip/port (IPv4&IPv6) * c) the received initial sequence number from remote host * d) the rotation offset and odd/even bit * * Timestamp we send: * 31|................................|0 * DDDDDDDDDDDDDDDDDDDDDDSSSSRRRRA5 * D = MD5 Digest (third dword) (only as filler) * S = Requested send window scale * R = Requested receive window scale * A = SACK allowed * 5 = TCP-MD5 enabled (not implemented yet) * XORed with MD5 Digest (fourth dword) * * The timestamp isn't cryptographically secure and doesn't need to be. * The double use of the MD5 digest dwords ties it to a specific remote/ * local host/port, remote initial sequence number and our local time * limited secret. A received timestamp is reverted (XORed) and then * the contained MD5 dword is compared to the computed one to ensure the * timestamp belongs to the SYN-ACK we sent. The other parameters may * have been tampered with but this isn't different from supplying bogus * values in the SYN in the first place. * * Some problems with SYN cookies remain however: * Consider the problem of a recreated (and retransmitted) cookie. If the * original SYN was accepted, the connection is established. The second * SYN is in flight, and if it arrives with an ISN that falls within the * receive window, the connection is killed. * * Notes: * A heuristic to determine when to accept syn cookies is not necessary. * An ACK flood would cause the syncookie verification to be attempted, * but a SYN flood causes syncookies to be generated. Both are of equal * cost, so there's no point in trying to optimize the ACK flood case. * Also, if you don't process certain ACKs for some reason, then all someone * would have to do is launch a SYN and ACK flood at the same time, which * would stop cookie verification and defeat the entire purpose of syncookies.
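 *
 * As a worked example of the ISN layout above (the values are made up;
 * the bit positions come from syncookie_generate() and
 * syncookie_lookup() below): with the even secret (P = 0), rotation
 * offset 5 (R = 101) and MSS index 6 (M = 110, i.e. 1460 from
 * tcp_sc_msstab), the low seven bits of the cookie are 1101010 (0x6a)
 * and the remaining bits are md5_buffer[0] << 7. The ACK side reverts
 * this, after subtracting 1 from th_ack, with off = (ack >> 1) & 0x7
 * and mss = (ack >> 4) & 0x7.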
*/ static int tcp_sc_msstab[] = { 0, 256, 468, 536, 996, 1452, 1460, 8960 }; static void syncookie_generate(struct syncache_head *sch, struct syncache *sc, u_int32_t *flowlabel) { INIT_VNET_INET(curvnet); MD5_CTX ctx; u_int32_t md5_buffer[MD5_DIGEST_LENGTH / sizeof(u_int32_t)]; u_int32_t data; u_int32_t *secbits; u_int off, pmss, mss; int i; SCH_LOCK_ASSERT(sch); /* Which of the two secrets to use. */ secbits = sch->sch_oddeven ? sch->sch_secbits_odd : sch->sch_secbits_even; /* Reseed secret if too old. */ if (sch->sch_reseed < time_uptime) { sch->sch_oddeven = sch->sch_oddeven ? 0 : 1; /* toggle */ secbits = sch->sch_oddeven ? sch->sch_secbits_odd : sch->sch_secbits_even; for (i = 0; i < SYNCOOKIE_SECRET_SIZE; i++) secbits[i] = arc4random(); sch->sch_reseed = time_uptime + SYNCOOKIE_LIFETIME; } /* Secret rotation offset. */ off = sc->sc_iss & 0x7; /* iss was randomized before */ /* Maximum segment size calculation. */ pmss = max( min(sc->sc_peer_mss, tcp_mssopt(&sc->sc_inc)), V_tcp_minmss); for (mss = sizeof(tcp_sc_msstab) / sizeof(int) - 1; mss > 0; mss--) if (tcp_sc_msstab[mss] <= pmss) break; /* Fold parameters and MD5 digest into the ISN we will send. */ data = sch->sch_oddeven;/* odd or even secret, 1 bit */ data |= off << 1; /* secret offset, derived from iss, 3 bits */ data |= mss << 4; /* mss, 3 bits */ MD5Init(&ctx); MD5Update(&ctx, ((u_int8_t *)secbits) + off, SYNCOOKIE_SECRET_SIZE * sizeof(*secbits) - off); MD5Update(&ctx, secbits, off); MD5Update(&ctx, &sc->sc_inc, sizeof(sc->sc_inc)); MD5Update(&ctx, &sc->sc_irs, sizeof(sc->sc_irs)); MD5Update(&ctx, &data, sizeof(data)); MD5Final((u_int8_t *)&md5_buffer, &ctx); data |= (md5_buffer[0] << 7); sc->sc_iss = data; #ifdef INET6 *flowlabel = md5_buffer[1] & IPV6_FLOWLABEL_MASK; #endif /* Additional parameters are stored in the timestamp if present. */ if (sc->sc_flags & SCF_TIMESTAMP) { data = ((sc->sc_flags & SCF_SIGNATURE) ? 1 : 0); /* TCP-MD5, 1 bit */ data |= ((sc->sc_flags & SCF_SACK) ? 1 : 0) << 1; /* SACK, 1 bit */ data |= sc->sc_requested_s_scale << 2; /* SWIN scale, 4 bits */ data |= sc->sc_requested_r_scale << 6; /* RWIN scale, 4 bits */ data |= md5_buffer[2] << 10; /* more digest bits */ data ^= md5_buffer[3]; sc->sc_ts = data; sc->sc_tsoff = data - ticks; /* after XOR */ } TCPSTAT_INC(tcps_sc_sendcookie); } static struct syncache * syncookie_lookup(struct in_conninfo *inc, struct syncache_head *sch, struct syncache *sc, struct tcpopt *to, struct tcphdr *th, struct socket *so) { INIT_VNET_INET(curvnet); MD5_CTX ctx; u_int32_t md5_buffer[MD5_DIGEST_LENGTH / sizeof(u_int32_t)]; u_int32_t data = 0; u_int32_t *secbits; tcp_seq ack, seq; int off, mss, wnd, flags; SCH_LOCK_ASSERT(sch); /* * Pull information out of SYN-ACK/ACK and * revert sequence number advances. */ ack = th->th_ack - 1; seq = th->th_seq - 1; off = (ack >> 1) & 0x7; mss = (ack >> 4) & 0x7; flags = ack & 0x7f; /* Which of the two secrets to use. */ secbits = (flags & 0x1) ? sch->sch_secbits_odd : sch->sch_secbits_even; /* * The secret wasn't updated for the lifetime of a syncookie, * so this SYN-ACK/ACK is either too old (replay) or totally bogus. */ if (sch->sch_reseed + SYNCOOKIE_LIFETIME < time_uptime) { return (NULL); } /* Recompute the digest so we can compare it. 
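 * The digest only matches if the same inputs are hashed in the same
 * order as in syncookie_generate(): the rotated secret, the
 * in_conninfo, the peer's original ISN and the recovered flags word.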
*/ MD5Init(&ctx); MD5Update(&ctx, ((u_int8_t *)secbits) + off, SYNCOOKIE_SECRET_SIZE * sizeof(*secbits) - off); MD5Update(&ctx, secbits, off); MD5Update(&ctx, inc, sizeof(*inc)); MD5Update(&ctx, &seq, sizeof(seq)); MD5Update(&ctx, &flags, sizeof(flags)); MD5Final((u_int8_t *)&md5_buffer, &ctx); /* Does the digest part of our ACK'ed ISS match? */ if ((ack & (~0x7f)) != (md5_buffer[0] << 7)) return (NULL); /* Does the digest part of our reflected timestamp match? */ if (to->to_flags & TOF_TS) { data = md5_buffer[3] ^ to->to_tsecr; if ((data & (~0x3ff)) != (md5_buffer[2] << 10)) return (NULL); } /* Fill in the syncache values. */ bcopy(inc, &sc->sc_inc, sizeof(struct in_conninfo)); sc->sc_ipopts = NULL; sc->sc_irs = seq; sc->sc_iss = ack; #ifdef INET6 if (inc->inc_flags & INC_ISIPV6) { if (sotoinpcb(so)->inp_flags & IN6P_AUTOFLOWLABEL) sc->sc_flowlabel = md5_buffer[1] & IPV6_FLOWLABEL_MASK; } else #endif { sc->sc_ip_ttl = sotoinpcb(so)->inp_ip_ttl; sc->sc_ip_tos = sotoinpcb(so)->inp_ip_tos; } /* Additional parameters that were encoded in the timestamp. */ if (data) { sc->sc_flags |= SCF_TIMESTAMP; sc->sc_tsreflect = to->to_tsval; sc->sc_ts = to->to_tsecr; sc->sc_tsoff = to->to_tsecr - ticks; sc->sc_flags |= (data & 0x1) ? SCF_SIGNATURE : 0; sc->sc_flags |= ((data >> 1) & 0x1) ? SCF_SACK : 0; sc->sc_requested_s_scale = min((data >> 2) & 0xf, TCP_MAX_WINSHIFT); sc->sc_requested_r_scale = min((data >> 6) & 0xf, TCP_MAX_WINSHIFT); if (sc->sc_requested_s_scale || sc->sc_requested_r_scale) sc->sc_flags |= SCF_WINSCALE; } else sc->sc_flags |= SCF_NOOPT; wnd = sbspace(&so->so_rcv); wnd = imax(wnd, 0); wnd = imin(wnd, TCP_MAXWIN); sc->sc_wnd = wnd; sc->sc_rxmits = 0; sc->sc_peer_mss = tcp_sc_msstab[mss]; TCPSTAT_INC(tcps_sc_recvcookie); return (sc); } /* * Returns the current number of syncache entries. This number * will probably change before you get around to calling * syncache_pcblist. */ int syncache_pcbcount(void) { INIT_VNET_INET(curvnet); struct syncache_head *sch; int count, i; for (count = 0, i = 0; i < V_tcp_syncache.hashsize; i++) { /* No need to lock for a read. */ sch = &V_tcp_syncache.hashbase[i]; count += sch->sch_length; } return count; } /* * Exports the syncache entries to userland so that netstat can display * them alongside the other sockets. This function is intended to be * called only from tcp_pcblist. * * Due to concurrency on an active system, the number of pcbs exported * may have no relation to max_pcbs. max_pcbs merely indicates the * amount of space the caller allocated for this function to use.
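 *
 * The calling pattern is roughly (as in tcp_pcblist(), error handling
 * omitted):
 *
 *	n = syncache_pcbcount();
 *	error = syncache_pcblist(req, n, &pcbs_exported);
 *
 * and *pcbs_exported, not n, is the number of entries actually
 * copied out.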
 */
int
syncache_pcblist(struct sysctl_req *req, int max_pcbs, int *pcbs_exported)
{
	INIT_VNET_INET(curvnet);
	struct xtcpcb xt;
	struct syncache *sc;
	struct syncache_head *sch;
	int count, error, i;

	for (count = 0, error = 0, i = 0; i < V_tcp_syncache.hashsize; i++) {
		sch = &V_tcp_syncache.hashbase[i];
		SCH_LOCK(sch);
		TAILQ_FOREACH(sc, &sch->sch_bucket, sc_hash) {
			if (count >= max_pcbs) {
				SCH_UNLOCK(sch);
				goto exit;
			}
			if (cr_cansee(req->td->td_ucred, sc->sc_cred) != 0)
				continue;
			bzero(&xt, sizeof(xt));
			xt.xt_len = sizeof(xt);
			if (sc->sc_inc.inc_flags & INC_ISIPV6)
				xt.xt_inp.inp_vflag = INP_IPV6;
			else
				xt.xt_inp.inp_vflag = INP_IPV4;
			bcopy(&sc->sc_inc, &xt.xt_inp.inp_inc,
			    sizeof (struct in_conninfo));
			xt.xt_tp.t_inpcb = &xt.xt_inp;
			xt.xt_tp.t_state = TCPS_SYN_RECEIVED;
			xt.xt_socket.xso_protocol = IPPROTO_TCP;
			xt.xt_socket.xso_len = sizeof (struct xsocket);
			xt.xt_socket.so_type = SOCK_STREAM;
			xt.xt_socket.so_state = SS_ISCONNECTING;
			error = SYSCTL_OUT(req, &xt, sizeof xt);
			if (error) {
				SCH_UNLOCK(sch);
				goto exit;
			}
			count++;
		}
		SCH_UNLOCK(sch);
	}
exit:
	*pcbs_exported = count;
	return error;
}

Index: head/sys/netinet/tcp_syncache.h
===================================================================
--- head/sys/netinet/tcp_syncache.h	(revision 195653)
+++ head/sys/netinet/tcp_syncache.h	(revision 195654)
@@ -1,125 +1,127 @@
/*-
 * Copyright (c) 1982, 1986, 1993, 1994, 1995
 *	The Regents of the University of California.  All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	@(#)tcp_var.h	8.4 (Berkeley) 5/24/95
 * $FreeBSD$
 */

#ifndef _NETINET_TCP_SYNCACHE_H_
#define _NETINET_TCP_SYNCACHE_H_
#ifdef _KERNEL

+struct toeopt;
+
 void	 syncache_init(void);
 #ifdef VIMAGE
 void	 syncache_destroy(void);
 #endif
 void	 syncache_unreach(struct in_conninfo *, struct tcphdr *);
 int	 syncache_expand(struct in_conninfo *, struct tcpopt *,
	     struct tcphdr *, struct socket **, struct mbuf *);
-int	 tcp_offload_syncache_expand(struct in_conninfo *inc, struct tcpopt *to,
+int	 tcp_offload_syncache_expand(struct in_conninfo *inc, struct toeopt *toeo,
	     struct tcphdr *th, struct socket **lsop, struct mbuf *m);
 void	 syncache_add(struct in_conninfo *, struct tcpopt *,
	     struct tcphdr *, struct inpcb *, struct socket **, struct mbuf *);
-void	 tcp_offload_syncache_add(struct in_conninfo *, struct tcpopt *,
+void	 tcp_offload_syncache_add(struct in_conninfo *, struct toeopt *,
	     struct tcphdr *, struct inpcb *, struct socket **,
	     struct toe_usrreqs *tu, void *toepcb);
 void	 syncache_chkrst(struct in_conninfo *, struct tcphdr *);
 void	 syncache_badack(struct in_conninfo *);
 int	 syncache_pcbcount(void);
 int	 syncache_pcblist(struct sysctl_req *req, int max_pcbs,
	     int *pcbs_exported);

struct syncache {
	TAILQ_ENTRY(syncache)	sc_hash;
	struct		in_conninfo sc_inc;	/* addresses */
	int		sc_rxttime;		/* retransmit time */
	u_int16_t	sc_rxmits;		/* retransmit counter */
	u_int32_t	sc_tsreflect;		/* timestamp to reflect */
	u_int32_t	sc_ts;			/* our timestamp to send */
	u_int32_t	sc_tsoff;		/* ts offset w/ syncookies */
	u_int32_t	sc_flowlabel;		/* IPv6 flowlabel */
	tcp_seq		sc_irs;			/* seq from peer */
	tcp_seq		sc_iss;			/* our ISS */
	struct		mbuf *sc_ipopts;	/* source route */
	u_int16_t	sc_peer_mss;		/* peer's MSS */
	u_int16_t	sc_wnd;			/* advertised window */
	u_int8_t	sc_ip_ttl;		/* IPv4 TTL */
	u_int8_t	sc_ip_tos;		/* IPv4 TOS */
	u_int8_t	sc_requested_s_scale:4,
			sc_requested_r_scale:4;
	u_int16_t	sc_flags;
#ifndef TCP_OFFLOAD_DISABLE
	struct toe_usrreqs *sc_tu;		/* TOE operations */
	void		*sc_toepcb;		/* TOE protocol block */
#endif
	struct label	*sc_label;		/* MAC label reference */
	struct ucred	*sc_cred;		/* cred cache for jail checks */
};

/*
 * Flags for the sc_flags field.
 */
#define	SCF_NOOPT	0x01			/* no TCP options */
#define	SCF_WINSCALE	0x02			/* negotiated window scaling */
#define	SCF_TIMESTAMP	0x04			/* negotiated timestamps */
						/* MSS is implicit */
#define	SCF_UNREACH	0x10			/* icmp unreachable received */
#define	SCF_SIGNATURE	0x20			/* send MD5 digests */
#define	SCF_SACK	0x80			/* send SACK option */
#define	SCF_ECN		0x100			/* send ECN setup packet */

#define	SYNCOOKIE_SECRET_SIZE	8	/* dwords */
#define	SYNCOOKIE_LIFETIME	16	/* seconds */

struct syncache_head {
	struct vnet	*sch_vnet;
	struct mtx	sch_mtx;
	TAILQ_HEAD(sch_head, syncache)	sch_bucket;
	struct callout	sch_timer;
	int		sch_nextc;
	u_int		sch_length;
	u_int		sch_oddeven;
	u_int32_t	sch_secbits_odd[SYNCOOKIE_SECRET_SIZE];
	u_int32_t	sch_secbits_even[SYNCOOKIE_SECRET_SIZE];
	u_int		sch_reseed;		/* time_uptime, seconds */
};

struct tcp_syncache {
	struct	syncache_head *hashbase;
	uma_zone_t zone;
	u_int	hashsize;
	u_int	hashmask;
	u_int	bucket_limit;
	u_int	cache_count;		/* XXX: unprotected */
	u_int	cache_limit;
	u_int	rexmt_limit;
	u_int	hash_secret;
};

#endif /* _KERNEL */
#endif /* !_NETINET_TCP_SYNCACHE_H_ */

Index: head/sys/netinet/tcp_var.h
===================================================================
--- head/sys/netinet/tcp_var.h	(revision 195653)
+++ head/sys/netinet/tcp_var.h	(revision 195654)
@@ -1,668 +1,668 @@
/*-
 * Copyright (c) 1982, 1986, 1993, 1994, 1995
 *	The Regents of the University of California.  All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	@(#)tcp_var.h	8.4 (Berkeley) 5/24/95
 * $FreeBSD$
 */

#ifndef _NETINET_TCP_VAR_H_
#define _NETINET_TCP_VAR_H_

#include

struct vnet;

/*
 * Kernel variables for tcp.
 */
#ifdef VIMAGE_GLOBALS
extern int	tcp_do_rfc1323;
#endif

/* TCP segment queue entry */
struct tseg_qent {
	LIST_ENTRY(tseg_qent) tqe_q;
	int	tqe_len;		/* TCP segment data length */
	struct	tcphdr *tqe_th;		/* a pointer to tcp header */
	struct	mbuf	*tqe_m;		/* mbuf contains packet */
};
LIST_HEAD(tsegqe_head, tseg_qent);
#ifdef VIMAGE_GLOBALS
extern int	tcp_reass_qsize;
#endif
extern struct uma_zone *tcp_reass_zone;

struct sackblk {
	tcp_seq start;		/* start seq no. of sack block */
	tcp_seq end;		/* end seq no. */
};

struct sackhole {
	tcp_seq start;		/* start seq no. of hole */
	tcp_seq end;		/* end seq no. */
	tcp_seq rxmit;		/* next seq. no in hole to be retransmitted */
	TAILQ_ENTRY(sackhole) scblink;	/* scoreboard linkage */
};

struct sackhint {
	struct sackhole	*nexthole;
	int		sack_bytes_rexmit;
	int		ispare;		/* explicit pad for 64bit alignment */

	uint64_t	_pad[2];	/* 1 sacked_bytes, 1 TBD */
};

struct tcptemp {
	u_char	tt_ipgen[40]; /* the size must be of max ip header, now IPv6 */
	struct	tcphdr tt_t;
};

#define	tcp6cb		tcpcb  /* for KAME src sync over BSD*'s */

/* Neighbor Discovery, Neighbor Unreachability Detection Upper layer hint. */
#ifdef INET6
#define	ND6_HINT(tp) \
do { \
	if ((tp) && (tp)->t_inpcb && \
	    ((tp)->t_inpcb->inp_vflag & INP_IPV6) != 0) \
		nd6_nud_hint(NULL, NULL, 0); \
} while (0)
#else
#define	ND6_HINT(tp)
#endif

/*
 * Tcp control block, one per tcp; fields:
 * Organized for 16 byte cacheline efficiency.
 */
struct tcpcb {
	struct	tsegqe_head t_segq;	/* segment reassembly queue */
	void	*t_pspare[2];		/* new reassembly queue */
	int	t_segqlen;		/* segment reassembly queue length */
	int	t_dupacks;		/* consecutive dup acks recd */

	struct	tcp_timer *t_timers;	/* All the TCP timers in one struct */

	struct	inpcb *t_inpcb;		/* back pointer to internet pcb */
	int	t_state;		/* state of this connection */
	u_int	t_flags;

	struct	vnet *t_vnet;		/* back pointer to parent vnet */

	tcp_seq	snd_una;		/* send unacknowledged */
	tcp_seq	snd_max;		/* highest sequence number sent;
					 * used to recognize retransmits */
	tcp_seq	snd_nxt;		/* send next */
	tcp_seq	snd_up;			/* send urgent pointer */

	tcp_seq	snd_wl1;		/* window update seg seq number */
	tcp_seq	snd_wl2;		/* window update seg ack number */
	tcp_seq	iss;			/* initial send sequence number */
	tcp_seq	irs;			/* initial receive sequence number */

	tcp_seq	rcv_nxt;		/* receive next */
	tcp_seq	rcv_adv;		/* advertised window */
	u_long	rcv_wnd;		/* receive window */
	tcp_seq	rcv_up;			/* receive urgent pointer */

	u_long	snd_wnd;		/* send window */
	u_long	snd_cwnd;		/* congestion-controlled window */
	u_long	snd_bwnd;		/* bandwidth-controlled window */
	u_long	snd_ssthresh;		/* snd_cwnd size threshold for
					 * slow start exponential to
					 * linear switch */
	u_long	snd_bandwidth;		/* calculated bandwidth or 0 */
	tcp_seq	snd_recover;		/* for use in NewReno Fast Recovery */

	u_int	t_maxopd;		/* mss plus options */

	u_int	t_rcvtime;		/* inactivity time */
	u_int	t_starttime;		/* time connection was established */
	u_int	t_rtttime;		/* RTT measurement start time */
	tcp_seq	t_rtseq;		/* sequence number being timed */

	u_int	t_bw_rtttime;		/* used for bandwidth calculation */
	tcp_seq	t_bw_rtseq;		/* used for bandwidth calculation */

	int	t_rxtcur;		/* current retransmit value (ticks) */
	u_int	t_maxseg;		/* maximum segment size */
	int	t_srtt;			/* smoothed round-trip time */
	int	t_rttvar;		/* variance in round-trip time */

	int	t_rxtshift;		/* log(2) of rexmt exp. backoff */
	u_int	t_rttmin;		/* minimum rtt allowed */
	u_int	t_rttbest;		/* best rtt we've seen */
	u_long	t_rttupdated;		/* number of times rtt sampled */
	u_long	max_sndwnd;		/* largest window peer has offered */

	int	t_softerror;		/* possible error not yet reported */
/* out-of-band data */
	char	t_oobflags;		/* have some */
	char	t_iobc;			/* input character */
/* RFC 1323 variables */
	u_char	snd_scale;		/* window scaling for send window */
	u_char	rcv_scale;		/* window scaling for recv window */
	u_char	request_r_scale;	/* pending window scaling */
	u_int32_t  ts_recent;		/* timestamp echo data */
	u_int	ts_recent_age;		/* when last updated */
	u_int32_t  ts_offset;		/* our timestamp offset */

	tcp_seq	last_ack_sent;
/* experimental */
	u_long	snd_cwnd_prev;		/* cwnd prior to retransmit */
	u_long	snd_ssthresh_prev;	/* ssthresh prior to retransmit */
	tcp_seq	snd_recover_prev;	/* snd_recover prior to retransmit */
	u_int	t_badrxtwin;		/* window for retransmit recovery */
	u_char	snd_limited;		/* segments limited transmitted */
/* SACK related state */
	int	snd_numholes;		/* number of holes seen by sender */
	TAILQ_HEAD(sackhole_head, sackhole) snd_holes;
					/* SACK scoreboard (sorted) */
	tcp_seq	snd_fack;		/* last seq number(+1) sack'd by rcv'r*/
	int	rcv_numsacks;		/* # distinct sack blks present */
	struct sackblk sackblks[MAX_SACK_BLKS]; /* seq nos. of sack blocks */
	tcp_seq	sack_newdata;		/* New data xmitted in this recovery
					   episode starts at this seq number */
	struct sackhint	sackhint;	/* SACK scoreboard hint */
	int	t_rttlow;		/* smallest observed RTT */
	u_int32_t	rfbuf_ts;	/* recv buffer autoscaling timestamp */
	int	rfbuf_cnt;		/* recv buffer autoscaling byte count */
	struct toe_usrreqs *t_tu;	/* offload operations vector */
	void	*t_toe;			/* TOE pcb pointer */
	int	t_bytes_acked;		/* # bytes acked during current RTT */

	int	t_ispare;		/* explicit pad for 64bit alignment */
	void	*t_pspare2[6];		/* 2 CC / 4 TBD */
	uint64_t _pad[12];		/* 7 UTO, 5 TBD (1-2 CC/RTT?) */
};

/*
 * Flags and utility macros for the t_flags field.
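 *
 * A minimal usage sketch (the call site is hypothetical, not part of
 * this change): congestion control code gates recovery entry on the
 * flag via the utility macros defined below, e.g.
 *
 *	if (!IN_FASTRECOVERY(tp))
 *		ENTER_FASTRECOVERY(tp);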
 */
#define	TF_ACKNOW	0x000001	/* ack peer immediately */
#define	TF_DELACK	0x000002	/* ack, but try to delay it */
#define	TF_NODELAY	0x000004	/* don't delay packets to coalesce */
#define	TF_NOOPT	0x000008	/* don't use tcp options */
#define	TF_SENTFIN	0x000010	/* have sent FIN */
#define	TF_REQ_SCALE	0x000020	/* have/will request window scaling */
#define	TF_RCVD_SCALE	0x000040	/* other side has requested scaling */
#define	TF_REQ_TSTMP	0x000080	/* have/will request timestamps */
#define	TF_RCVD_TSTMP	0x000100	/* a timestamp was received in SYN */
#define	TF_SACK_PERMIT	0x000200	/* other side said I could SACK */
#define	TF_NEEDSYN	0x000400	/* send SYN (implicit state) */
#define	TF_NEEDFIN	0x000800	/* send FIN (implicit state) */
#define	TF_NOPUSH	0x001000	/* don't push */
#define	TF_MORETOCOME	0x010000	/* More data to be appended to sock */
#define	TF_LQ_OVERFLOW	0x020000	/* listen queue overflow */
#define	TF_LASTIDLE	0x040000	/* connection was previously idle */
#define	TF_RXWIN0SENT	0x080000	/* sent a receiver win 0 in response */
#define	TF_FASTRECOVERY	0x100000	/* in NewReno Fast Recovery */
#define	TF_WASFRECOVERY	0x200000	/* was in NewReno Fast Recovery */
#define	TF_SIGNATURE	0x400000	/* require MD5 digests (RFC2385) */
#define	TF_FORCEDATA	0x800000	/* force out a byte */
#define	TF_TSO		0x1000000	/* TSO enabled on this connection */
#define	TF_TOE		0x2000000	/* this connection is offloaded */
#define	TF_ECN_PERMIT	0x4000000	/* connection ECN-ready */
#define	TF_ECN_SND_CWR	0x8000000	/* ECN CWR in queue */
#define	TF_ECN_SND_ECE	0x10000000	/* ECN ECE in queue */

#define	IN_FASTRECOVERY(tp)	(tp->t_flags & TF_FASTRECOVERY)
#define	ENTER_FASTRECOVERY(tp)	tp->t_flags |= TF_FASTRECOVERY
#define	EXIT_FASTRECOVERY(tp)	tp->t_flags &= ~TF_FASTRECOVERY

/*
 * Flags for the t_oobflags field.
 */
#define	TCPOOB_HAVEDATA	0x01
#define	TCPOOB_HADDATA	0x02

#ifdef TCP_SIGNATURE
/*
 * Defines which are needed by the xform_tcp module and tcp_[in|out]put
 * for SADB verification and lookup.
 */
#define	TCP_SIGLEN	16	/* length of computed digest in bytes */
#define	TCP_KEYLEN_MIN	1	/* minimum length of TCP-MD5 key */
#define	TCP_KEYLEN_MAX	80	/* maximum length of TCP-MD5 key */
/*
 * Only a single SA per host may be specified at this time. An SPI is
 * needed in order for the KEY_ALLOCSA() lookup to work.
 */
#define	TCP_SIG_SPI	0x1000
#endif /* TCP_SIGNATURE */

/*
 * Structure to hold TCP options that are only used during segment
 * processing (in tcp_input), but not held in the tcpcb.
 * It's basically used to reduce the number of parameters
 * to tcp_dooptions and tcp_addoptions.
 * The binary order of the to_flags is relevant for packing of the
 * options in tcp_addoptions.
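 *
 * A hedged sketch of typical consumption (the call site is
 * illustrative): a field below is only meaningful once its TOF_* bit
 * has been checked, e.g.
 *
 *	if (to->to_flags & TOF_TS)
 *		tsecr = to->to_tsecr;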
 */
struct tcpopt {
-	u_long		to_flags;	/* which options are present */
+	u_int64_t	to_flags;	/* which options are present */
#define	TOF_MSS		0x0001		/* maximum segment size */
#define	TOF_SCALE	0x0002		/* window scaling */
#define	TOF_SACKPERM	0x0004		/* SACK permitted */
#define	TOF_TS		0x0010		/* timestamp */
#define	TOF_SIGNATURE	0x0040		/* TCP-MD5 signature option (RFC2385) */
#define	TOF_SACK	0x0080		/* Peer sent SACK option */
#define	TOF_MAXOPT	0x0100
	u_int32_t	to_tsval;	/* new timestamp */
	u_int32_t	to_tsecr;	/* reflected timestamp */
+	u_char		*to_sacks;	/* pointer to the first SACK blocks */
+	u_char		*to_signature;	/* pointer to the TCP-MD5 signature */
	u_int16_t	to_mss;		/* maximum segment size */
	u_int8_t	to_wscale;	/* window scaling */
	u_int8_t	to_nsacks;	/* number of SACK blocks */
-	u_char		*to_sacks;	/* pointer to the first SACK blocks */
-	u_char		*to_signature;	/* pointer to the TCP-MD5 signature */
};

/*
 * Flags for tcp_dooptions.
 */
#define	TO_SYN		0x01		/* parse SYN-only options */

struct hc_metrics_lite {	/* must stay in sync with hc_metrics */
	u_long	rmx_mtu;	/* MTU for this path */
	u_long	rmx_ssthresh;	/* outbound gateway buffer limit */
	u_long	rmx_rtt;	/* estimated round trip time */
	u_long	rmx_rttvar;	/* estimated rtt variance */
	u_long	rmx_bandwidth;	/* estimated bandwidth */
	u_long	rmx_cwnd;	/* congestion window */
	u_long	rmx_sendpipe;	/* outbound delay-bandwidth product */
	u_long	rmx_recvpipe;	/* inbound delay-bandwidth product */
};

#ifndef _NETINET_IN_PCB_H_
struct in_conninfo;
#endif /* _NETINET_IN_PCB_H_ */

struct tcptw {
	struct inpcb	*tw_inpcb;	/* XXX back pointer to internet pcb */
	tcp_seq		snd_nxt;
	tcp_seq		rcv_nxt;
	tcp_seq		iss;
	tcp_seq		irs;
	u_short		last_win;	/* cached window value */
	u_short		tw_so_options;	/* copy of so_options */
	struct ucred	*tw_cred;	/* user credentials */
	u_int32_t	t_recent;
	u_int32_t	ts_offset;	/* our timestamp offset */
	u_int		t_starttime;
	int		tw_time;
	TAILQ_ENTRY(tcptw) tw_2msl;
};

#define	intotcpcb(ip)	((struct tcpcb *)(ip)->inp_ppcb)
#define	intotw(ip)	((struct tcptw *)(ip)->inp_ppcb)
#define	sototcpcb(so)	(intotcpcb(sotoinpcb(so)))

/*
 * The smoothed round-trip time and estimated variance
 * are stored as fixed point numbers scaled by the values below.
 * For convenience, these scales are also used in smoothing the average
 * (smoothed = (1/scale)sample + ((scale-1)/scale)smoothed).
 * With these scales, srtt has 3 bits to the right of the binary point,
 * and thus an "ALPHA" of 0.875.  rttvar has 2 bits to the right of the
 * binary point, and is smoothed with an ALPHA of 0.75.
 */
#define	TCP_RTT_SCALE		32	/* multiplier for srtt; 3 bits frac. */
#define	TCP_RTT_SHIFT		5	/* shift for srtt; 3 bits frac. */
#define	TCP_RTTVAR_SCALE	16	/* multiplier for rttvar; 2 bits */
#define	TCP_RTTVAR_SHIFT	4	/* shift for rttvar; 2 bits */
#define	TCP_DELTA_SHIFT		2	/* see tcp_input.c */

/*
 * The initial retransmission should happen at rtt + 4 * rttvar.
 * Because of the way we do the smoothing, srtt and rttvar
 * will each average +1/2 tick of bias.  When we compute
 * the retransmit timer, we want 1/2 tick of rounding and
 * 1 extra tick because of +-1/2 tick uncertainty in the
 * firing of the timer.  The bias will give us exactly the
 * 1.5 tick we need.  But, because the bias is
 * statistical, we have to test that we don't drop below
 * the minimum feasible timer (which is 2 ticks).
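 * As a numeric sketch (the values are chosen for illustration): with
 * t_srtt = 64 (2 ticks at scale TCP_RTT_SCALE) and t_rttvar = 16
 * (1 tick at scale TCP_RTTVAR_SCALE), the macro below computes
 * ((64 >> 3) + 16) >> 2 = 6 ticks, i.e. rtt + 4 * rttvar.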
 * This version of the macro adapted from a paper by Lawrence
 * Brakmo and Larry Peterson which outlines a problem caused
 * by insufficient precision in the original implementation,
 * which results in inappropriately large RTO values for very
 * fast networks.
 */
#define	TCP_REXMTVAL(tp) \
	max((tp)->t_rttmin, (((tp)->t_srtt >> (TCP_RTT_SHIFT - TCP_DELTA_SHIFT)) \
	  + (tp)->t_rttvar) >> TCP_DELTA_SHIFT)

/*
 * TCP statistics.
 * Many of these should be kept per connection,
 * but that's inconvenient at the moment.
 */
struct	tcpstat {
	u_long	tcps_connattempt;	/* connections initiated */
	u_long	tcps_accepts;		/* connections accepted */
	u_long	tcps_connects;		/* connections established */
	u_long	tcps_drops;		/* connections dropped */
	u_long	tcps_conndrops;		/* embryonic connections dropped */
	u_long	tcps_minmssdrops;	/* average minmss too low drops */
	u_long	tcps_closed;		/* conn. closed (includes drops) */
	u_long	tcps_segstimed;		/* segs where we tried to get rtt */
	u_long	tcps_rttupdated;	/* times we succeeded */
	u_long	tcps_delack;		/* delayed acks sent */
	u_long	tcps_timeoutdrop;	/* conn. dropped in rxmt timeout */
	u_long	tcps_rexmttimeo;	/* retransmit timeouts */
	u_long	tcps_persisttimeo;	/* persist timeouts */
	u_long	tcps_keeptimeo;		/* keepalive timeouts */
	u_long	tcps_keepprobe;		/* keepalive probes sent */
	u_long	tcps_keepdrops;		/* connections dropped in keepalive */

	u_long	tcps_sndtotal;		/* total packets sent */
	u_long	tcps_sndpack;		/* data packets sent */
	u_long	tcps_sndbyte;		/* data bytes sent */
	u_long	tcps_sndrexmitpack;	/* data packets retransmitted */
	u_long	tcps_sndrexmitbyte;	/* data bytes retransmitted */
	u_long	tcps_sndrexmitbad;	/* unnecessary packet retransmissions */
	u_long	tcps_sndacks;		/* ack-only packets sent */
	u_long	tcps_sndprobe;		/* window probes sent */
	u_long	tcps_sndurg;		/* packets sent with URG only */
	u_long	tcps_sndwinup;		/* window update-only packets sent */
	u_long	tcps_sndctrl;		/* control (SYN|FIN|RST) packets sent */

	u_long	tcps_rcvtotal;		/* total packets received */
	u_long	tcps_rcvpack;		/* packets received in sequence */
	u_long	tcps_rcvbyte;		/* bytes received in sequence */
	u_long	tcps_rcvbadsum;		/* packets received with cksum errs */
	u_long	tcps_rcvbadoff;		/* packets received with bad offset */
	u_long	tcps_rcvmemdrop;	/* packets dropped for lack of memory */
	u_long	tcps_rcvshort;		/* packets received too short */
	u_long	tcps_rcvduppack;	/* duplicate-only packets received */
	u_long	tcps_rcvdupbyte;	/* duplicate-only bytes received */
	u_long	tcps_rcvpartduppack;	/* packets with some duplicate data */
	u_long	tcps_rcvpartdupbyte;	/* dup. bytes in part-dup. packets */
	u_long	tcps_rcvoopack;		/* out-of-order packets received */
	u_long	tcps_rcvoobyte;		/* out-of-order bytes received */
	u_long	tcps_rcvpackafterwin;	/* packets with data after window */
	u_long	tcps_rcvbyteafterwin;	/* bytes rcvd after window */
	u_long	tcps_rcvafterclose;	/* packets rcvd after "close" */
	u_long	tcps_rcvwinprobe;	/* rcvd window probe packets */
	u_long	tcps_rcvdupack;		/* rcvd duplicate acks */
	u_long	tcps_rcvacktoomuch;	/* rcvd acks for unsent data */
	u_long	tcps_rcvackpack;	/* rcvd ack packets */
	u_long	tcps_rcvackbyte;	/* bytes acked by rcvd acks */
	u_long	tcps_rcvwinupd;		/* rcvd window update packets */
	u_long	tcps_pawsdrop;		/* segments dropped due to PAWS */
	u_long	tcps_predack;		/* times hdr predict ok for acks */
	u_long	tcps_preddat;		/* times hdr predict ok for data pkts */
	u_long	tcps_pcbcachemiss;
	u_long	tcps_cachedrtt;		/* times cached RTT in route updated */
	u_long	tcps_cachedrttvar;	/* times cached rttvar updated */
	u_long	tcps_cachedssthresh;	/* times cached ssthresh updated */
	u_long	tcps_usedrtt;		/* times RTT initialized from route */
	u_long	tcps_usedrttvar;	/* times RTTVAR initialized from rt */
	u_long	tcps_usedssthresh;	/* times ssthresh initialized from rt*/
	u_long	tcps_persistdrop;	/* timeout in persist state */
	u_long	tcps_badsyn;		/* bogus SYN, e.g. premature ACK */
	u_long	tcps_mturesent;		/* resends due to MTU discovery */
	u_long	tcps_listendrop;	/* listen queue overflows */
	u_long	tcps_badrst;		/* ignored RSTs in the window */

	u_long	tcps_sc_added;		/* entry added to syncache */
	u_long	tcps_sc_retransmitted;	/* syncache entry was retransmitted */
	u_long	tcps_sc_dupsyn;		/* duplicate SYN packet */
	u_long	tcps_sc_dropped;	/* could not reply to packet */
	u_long	tcps_sc_completed;	/* successful extraction of entry */
	u_long	tcps_sc_bucketoverflow;	/* syncache per-bucket limit hit */
	u_long	tcps_sc_cacheoverflow;	/* syncache cache limit hit */
	u_long	tcps_sc_reset;		/* RST removed entry from syncache */
	u_long	tcps_sc_stale;		/* timed out or listen socket gone */
	u_long	tcps_sc_aborted;	/* syncache entry aborted */
	u_long	tcps_sc_badack;		/* removed due to bad ACK */
	u_long	tcps_sc_unreach;	/* ICMP unreachable received */
	u_long	tcps_sc_zonefail;	/* zalloc() failed */
	u_long	tcps_sc_sendcookie;	/* SYN cookie sent */
	u_long	tcps_sc_recvcookie;	/* SYN cookie received */

	u_long	tcps_hc_added;		/* entry added to hostcache */
	u_long	tcps_hc_bucketoverflow;	/* hostcache per bucket limit hit */

	u_long	tcps_finwait2_drops;	/* Drop FIN_WAIT_2 connection after time limit */

	/* SACK related stats */
	u_long	tcps_sack_recovery_episode; /* SACK recovery episodes */
	u_long	tcps_sack_rexmits;	    /* SACK rexmit segments */
	u_long	tcps_sack_rexmit_bytes;	    /* SACK rexmit bytes */
	u_long	tcps_sack_rcv_blocks;	    /* SACK blocks (options) received */
	u_long	tcps_sack_send_blocks;	    /* SACK blocks (options) sent */
	u_long	tcps_sack_sboverflow;	    /* times scoreboard overflowed */

	/* ECN related stats */
	u_long	tcps_ecn_ce;		/* ECN Congestion Experienced */
	u_long	tcps_ecn_ect0;		/* ECN Capable Transport */
	u_long	tcps_ecn_ect1;		/* ECN Capable Transport */
	u_long	tcps_ecn_shs;		/* ECN successful handshakes */
	u_long	tcps_ecn_rcwnd;		/* # times ECN reduced the cwnd */

	u_long	_pad[12];		/* 6 UTO, 6 TBD */
};

#ifdef _KERNEL
#define	TCPSTAT_ADD(name, val)	V_tcpstat.name += (val)
#define	TCPSTAT_INC(name)	TCPSTAT_ADD(name, 1)
#endif

/*
 * TCB structure exported to user-land via sysctl(3).
 * Evil hack: declare only if in_pcb.h and sys/socketvar.h have been
 * included.  Not all of our clients do.
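 *
 * A hedged sketch of userland consumption (names are illustrative, and
 * the loop shape is not lifted from netstat): records are self-sizing
 * via xt_len, so a consumer can step through the sysctl buffer with
 * something like
 *
 *	for (p = buf; p < buf + len; p += ((struct xtcpcb *)p)->xt_len)
 *		...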
 */
#if defined(_NETINET_IN_PCB_H_) && defined(_SYS_SOCKETVAR_H_)
struct	xtcpcb {
	size_t	xt_len;
	struct	inpcb	xt_inp;
	struct	tcpcb	xt_tp;
	struct	xsocket	xt_socket;
	u_quad_t	xt_alignment_hack;
};
#endif

/*
 * Names for TCP sysctl objects
 */
#define	TCPCTL_DO_RFC1323	1	/* use RFC-1323 extensions */
#define	TCPCTL_MSSDFLT		3	/* MSS default */
#define	TCPCTL_STATS		4	/* statistics (read-only) */
#define	TCPCTL_RTTDFLT		5	/* default RTT estimate */
#define	TCPCTL_KEEPIDLE		6	/* keepalive idle timer */
#define	TCPCTL_KEEPINTVL	7	/* interval to send keepalives */
#define	TCPCTL_SENDSPACE	8	/* send buffer space */
#define	TCPCTL_RECVSPACE	9	/* receive buffer space */
#define	TCPCTL_KEEPINIT		10	/* timeout for establishing syn */
#define	TCPCTL_PCBLIST		11	/* list of all outstanding PCBs */
#define	TCPCTL_DELACKTIME	12	/* time before sending delayed ACK */
#define	TCPCTL_V6MSSDFLT	13	/* MSS default for IPv6 */
#define	TCPCTL_SACK		14	/* Selective Acknowledgement,rfc 2018 */
#define	TCPCTL_DROP		15	/* drop tcp connection */
#define	TCPCTL_MAXID		16
#define	TCPCTL_FINWAIT2_TIMEOUT	17

#define	TCPCTL_NAMES { \
	{ 0, 0 }, \
	{ "rfc1323", CTLTYPE_INT }, \
	{ "mssdflt", CTLTYPE_INT }, \
	{ "stats", CTLTYPE_STRUCT }, \
	{ "rttdflt", CTLTYPE_INT }, \
	{ "keepidle", CTLTYPE_INT }, \
	{ "keepintvl", CTLTYPE_INT }, \
	{ "sendspace", CTLTYPE_INT }, \
	{ "recvspace", CTLTYPE_INT }, \
	{ "keepinit", CTLTYPE_INT }, \
	{ "pcblist", CTLTYPE_STRUCT }, \
	{ "delacktime", CTLTYPE_INT }, \
	{ "v6mssdflt", CTLTYPE_INT }, \
	{ "maxid", CTLTYPE_INT }, \
}

#ifdef _KERNEL
#ifdef SYSCTL_DECL
SYSCTL_DECL(_net_inet_tcp);
SYSCTL_DECL(_net_inet_tcp_sack);
MALLOC_DECLARE(M_TCPLOG);
#endif

extern	int tcp_log_in_vain;

#ifdef VIMAGE_GLOBALS
extern	struct inpcbhead tcb;		/* head of queue of active tcpcb's */
extern	struct inpcbinfo tcbinfo;
extern	struct tcpstat tcpstat;		/* tcp statistics */
extern	int tcp_mssdflt;	/* XXX */
extern	int tcp_minmss;
extern	int tcp_delack_enabled;
extern	int tcp_do_newreno;
extern	int path_mtu_discovery;
extern	int ss_fltsz;
extern	int ss_fltsz_local;
extern	int blackhole;
extern	int drop_synfin;
extern	int tcp_do_rfc3042;
extern	int tcp_do_rfc3390;
extern	int tcp_insecure_rst;
extern	int tcp_do_autorcvbuf;
extern	int tcp_autorcvbuf_inc;
extern	int tcp_autorcvbuf_max;
extern	int tcp_do_rfc3465;
extern	int tcp_abc_l_var;
extern	int tcp_do_tso;
extern	int tcp_do_autosndbuf;
extern	int tcp_autosndbuf_inc;
extern	int tcp_autosndbuf_max;
extern	int nolocaltimewait;

extern	int tcp_do_sack;		/* SACK enabled/disabled */
extern	int tcp_sack_maxholes;
extern	int tcp_sack_globalmaxholes;
extern	int tcp_sack_globalholes;
extern	int tcp_sc_rst_sock_fail;	/* RST on sock alloc failure */
extern	int tcp_do_ecn;			/* TCP ECN enabled/disabled */
extern	int tcp_ecn_maxretries;
#endif /* VIMAGE_GLOBALS */

int	 tcp_addoptions(struct tcpopt *, u_char *);
struct tcpcb *
	 tcp_close(struct tcpcb *);
void	 tcp_discardcb(struct tcpcb *);
void	 tcp_twstart(struct tcpcb *);
#if 0
int	 tcp_twrecycleable(struct tcptw *tw);
#endif
void	 tcp_twclose(struct tcptw *_tw, int _reuse);
void	 tcp_ctlinput(int, struct sockaddr *, void *);
int	 tcp_ctloutput(struct socket *, struct sockopt *);
struct tcpcb *
	 tcp_drop(struct tcpcb *, int);
void	 tcp_drain(void);
void	 tcp_fasttimo(void);
void	 tcp_init(void);
#ifdef VIMAGE
void	 tcp_destroy(void);
#endif
void	 tcp_fini(void *);
char	*tcp_log_addrs(struct in_conninfo *, struct tcphdr *, void *,
	    const void *);
int	 tcp_reass(struct tcpcb *, struct tcphdr *, int *, struct mbuf *);
void	 tcp_reass_init(void);
void	 tcp_input(struct mbuf *, int);
u_long	 tcp_maxmtu(struct in_conninfo *, int *);
u_long	 tcp_maxmtu6(struct in_conninfo *, int *);
void	 tcp_mss_update(struct tcpcb *, int, struct hc_metrics_lite *, int *);
void	 tcp_mss(struct tcpcb *, int);
int	 tcp_mssopt(struct in_conninfo *);
struct inpcb *
	 tcp_drop_syn_sent(struct inpcb *, int);
struct inpcb *
	 tcp_mtudisc(struct inpcb *, int);
struct tcpcb *
	 tcp_newtcpcb(struct inpcb *);
int	 tcp_output(struct tcpcb *);
void	 tcp_respond(struct tcpcb *, void *,
	    struct tcphdr *, struct mbuf *, tcp_seq, tcp_seq, int);
void	 tcp_tw_init(void);
#ifdef VIMAGE
void	 tcp_tw_destroy(void);
#endif
void	 tcp_tw_zone_change(void);
int	 tcp_twcheck(struct inpcb *, struct tcpopt *, struct tcphdr *,
	    struct mbuf *, int);
int	 tcp_twrespond(struct tcptw *, int);
void	 tcp_setpersist(struct tcpcb *);
#ifdef TCP_SIGNATURE
int	 tcp_signature_compute(struct mbuf *, int, int, int, u_char *, u_int);
#endif
void	 tcp_slowtimo(void);
struct tcptemp *
	 tcpip_maketemplate(struct inpcb *);
void	 tcpip_fillheaders(struct inpcb *, void *, void *);
void	 tcp_timer_activate(struct tcpcb *, int, u_int);
int	 tcp_timer_active(struct tcpcb *, int);
void	 tcp_trace(short, short, struct tcpcb *, void *, struct tcphdr *, int);
void	 tcp_xmit_bandwidth_limit(struct tcpcb *tp, tcp_seq ack_seq);
/*
 * All tcp_hc_* functions are IPv4 and IPv6 (via in_conninfo)
 */
void	 tcp_hc_init(void);
#ifdef VIMAGE
void	 tcp_hc_destroy(void);
#endif
void	 tcp_hc_get(struct in_conninfo *, struct hc_metrics_lite *);
u_long	 tcp_hc_getmtu(struct in_conninfo *);
void	 tcp_hc_updatemtu(struct in_conninfo *, u_long);
void	 tcp_hc_update(struct in_conninfo *, struct hc_metrics_lite *);

extern	struct pr_usrreqs tcp_usrreqs;
extern	u_long tcp_sendspace;
extern	u_long tcp_recvspace;
tcp_seq	 tcp_new_isn(struct tcpcb *);

void	 tcp_sack_doack(struct tcpcb *, struct tcpopt *, tcp_seq);
void	 tcp_update_sack_list(struct tcpcb *tp,
	    tcp_seq rcv_laststart, tcp_seq rcv_lastend);
void	 tcp_clean_sackreport(struct tcpcb *tp);
void	 tcp_sack_adjust(struct tcpcb *tp);
struct sackhole *tcp_sack_output(struct tcpcb *tp, int *sack_bytes_rexmt);
void	 tcp_sack_partialack(struct tcpcb *, struct tcphdr *);
void	 tcp_free_sackholes(struct tcpcb *tp);
int	 tcp_newreno(struct tcpcb *, struct tcphdr *);
u_long	 tcp_seq_subtract(u_long, u_long);

#endif /* _KERNEL */

#endif /* _NETINET_TCP_VAR_H_ */

Index: head/sys/sys/param.h
===================================================================
--- head/sys/sys/param.h	(revision 195653)
+++ head/sys/sys/param.h	(revision 195654)
@@ -1,314 +1,314 @@
/*-
 * Copyright (c) 1982, 1986, 1989, 1993
 *	The Regents of the University of California.  All rights reserved.
 * (c) UNIX System Laboratories, Inc.
 * All or some portions of this file are derived from material licensed
 * to the University of California by American Telephone and Telegraph
 * Co. or Unix System Laboratories, Inc. and are reproduced herein with
 * the permission of UNIX System Laboratories, Inc.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 *	@(#)param.h	8.3 (Berkeley) 4/4/95
 * $FreeBSD$
 */

#ifndef _SYS_PARAM_H_
#define _SYS_PARAM_H_

#include

#define	BSD	199506		/* System version (year & month). */
#define	BSD4_3	1
#define	BSD4_4	1

/*
 * __FreeBSD_version numbers are documented in the Porter's Handbook.
 * If you bump the version for any reason, you should update the documentation
 * there.
 * Currently this lives here:
 *
 *	doc/en_US.ISO8859-1/books/porters-handbook/book.sgml
 *
 * scheme is:  <major><two digit minor>Rxx
 *		'R' is in the range 0 to 4 if this is a release branch or
 *		x.0-CURRENT before RELENG_*_0 is created, otherwise 'R' is
 *		in the range 5 to 9.
 */
#undef __FreeBSD_version
-#define __FreeBSD_version 800102	/* Master, propagated to newvers */
+#define __FreeBSD_version 800103	/* Master, propagated to newvers */

#ifndef	LOCORE
#include
#endif

/*
 * Machine-independent constants (some used in following include files).
 * Redefined constants are from POSIX 1003.1 limits file.
 *
 * MAXCOMLEN should be >= sizeof(ac_comm) (see )
 * MAXLOGNAME should be == UT_NAMESIZE+1 (see )
 */
#include

#define	MAXCOMLEN	19		/* max command name remembered */
#define	MAXINTERP	32		/* max interpreter file name length */
#define	MAXLOGNAME	17		/* max login name length (incl. NUL) */
#define	MAXUPRC		CHILD_MAX	/* max simultaneous processes */
#define	NCARGS		ARG_MAX		/* max bytes for an exec function */
#define	NGROUPS		(NGROUPS_MAX+1)	/* max number groups */
#define	NOFILE		OPEN_MAX	/* max open files per process */
#define	NOGROUP		65535		/* marker for empty group set member */
#define	MAXHOSTNAMELEN	256		/* max hostname size */
#define	SPECNAMELEN	63		/* max length of devicename */

/* More types and definitions used throughout the kernel. */
#ifdef _KERNEL
#include
#include
#ifndef LOCORE
#include
#include
#endif

#ifndef FALSE
#define	FALSE	0
#endif
#ifndef TRUE
#define	TRUE	1
#endif
#endif

#ifndef _KERNEL
/* Signals. */
#include
#endif

/* Machine type dependent parameters. */
#include
#ifndef _KERNEL
#include
#endif

#ifndef _NO_NAMESPACE_POLLUTION

#ifndef DEV_BSHIFT
#define	DEV_BSHIFT	9		/* log2(DEV_BSIZE) */
#endif
#define	DEV_BSIZE	(1<<DEV_BSHIFT)

#ifndef BLKDEV_IOSIZE
#define	BLKDEV_IOSIZE	PAGE_SIZE	/* default block device I/O size */
#endif
#ifndef DFLTPHYS
#define	DFLTPHYS	(64 * 1024)	/* default max raw I/O transfer size */
#endif
#ifndef MAXPHYS
#define	MAXPHYS		(128 * 1024)	/* max raw I/O transfer size */
#endif
#ifndef MAXDUMPPGS
#define	MAXDUMPPGS	(DFLTPHYS/PAGE_SIZE)
#endif

/*
 * Constants related to network buffer management.
 * MCLBYTES must be no larger than PAGE_SIZE.
 */
#ifndef	MSIZE
#define	MSIZE		256		/* size of an mbuf */
#endif

#ifndef	MCLSHIFT
#define	MCLSHIFT	11		/* convert bytes to mbuf clusters */
#endif

#define	MCLBYTES	(1 << MCLSHIFT)	/* size of an mbuf cluster */

#define	MJUMPAGESIZE	PAGE_SIZE	/* jumbo cluster 4k */
#define	MJUM9BYTES	(9 * 1024)	/* jumbo cluster 9k */
#define	MJUM16BYTES	(16 * 1024)	/* jumbo cluster 16k */

/*
 * Some macros for units conversion
 */

/* clicks to bytes */
#ifndef ctob
#define	ctob(x)	((x)<<PAGE_SHIFT)
#endif

/* bytes to clicks */
#ifndef btoc
#define	btoc(x)	(((vm_offset_t)(x)+PAGE_MASK)>>PAGE_SHIFT)
#endif

/*
 * btodb() is messy and perhaps slow because `bytes' may be an off_t.  We
 * want to shift an unsigned type to avoid sign extension and we don't
 * want to widen `bytes' unnecessarily.  Assume that the result fits in
 * a daddr_t.
 */
#ifndef btodb
#define	btodb(bytes)			/* calculates (bytes / DEV_BSIZE) */ \
	(sizeof (bytes) > sizeof(long) \
	 ? (daddr_t)((unsigned long long)(bytes) >> DEV_BSHIFT) \
	 : (daddr_t)((unsigned long)(bytes) >> DEV_BSHIFT))
#endif

#ifndef dbtob
#define	dbtob(db)			/* calculates (db * DEV_BSIZE) */ \
	((off_t)(db) << DEV_BSHIFT)
#endif

#endif /* _NO_NAMESPACE_POLLUTION */

#define	PRIMASK	0x0ff
#define	PCATCH	0x100		/* OR'd with pri for tsleep to check signals */
#define	PDROP	0x200	/* OR'd with pri to stop re-entry of interlock mutex */

#define	NZERO	0		/* default "nice" */

#define	NBBY	8		/* number of bits in a byte */
#define	NBPW	sizeof(int)	/* number of bytes per word (integer) */

#define	CMASK	022		/* default file mask: S_IWGRP|S_IWOTH */

#define	NODEV	(dev_t)(-1)	/* non-existent device */

/*
 * File system parameters and macros.
 *
 * MAXBSIZE -	Filesystems are made out of blocks of at most MAXBSIZE bytes
 *		per block.  MAXBSIZE may be made larger without effecting
 *		any existing filesystems as long as it does not exceed MAXPHYS,
 *		and may be made smaller at the risk of not being able to use
 *		filesystems which require a block size exceeding MAXBSIZE.
 *
 * BKVASIZE -	Nominal buffer space per buffer, in bytes.  BKVASIZE is the
 *		minimum KVM memory reservation the kernel is willing to make.
 *		Filesystems can of course request smaller chunks.  Actual
 *		backing memory uses a chunk size of a page (PAGE_SIZE).
 *
 *		If you make BKVASIZE too small you risk seriously fragmenting
 *		the buffer KVM map which may slow things down a bit.  If you
 *		make it too big the kernel will not be able to optimally use
 *		the KVM memory reserved for the buffer cache and will wind
 *		up with too-few buffers.
 *
 *		The default is 16384, roughly 2x the block size used by a
 *		normal UFS filesystem.
 */
#define	MAXBSIZE	65536	/* must be power of 2 */
#define	BKVASIZE	16384	/* must be power of 2 */
#define	BKVAMASK	(BKVASIZE-1)

/*
 * MAXPATHLEN defines the longest permissible path length after expanding
 * symbolic links. It is used to allocate a temporary buffer from the buffer
 * pool in which to do the name expansion, hence should be a power of two,
 * and must be less than or equal to MAXBSIZE.  MAXSYMLINKS defines the
 * maximum number of symbolic links that may be expanded in a path name.
 * It should be set high enough to allow all legitimate uses, but halt
 * infinite loops reasonably quickly.
 */
#define	MAXPATHLEN	PATH_MAX
#define	MAXSYMLINKS	32

/* Bit map related macros. */
#define	setbit(a,i)	(((unsigned char *)(a))[(i)/NBBY] |= 1<<((i)%NBBY))
#define	clrbit(a,i)	(((unsigned char *)(a))[(i)/NBBY] &= ~(1<<((i)%NBBY)))
#define	isset(a,i) \
	(((const unsigned char *)(a))[(i)/NBBY] & (1<<((i)%NBBY)))
#define	isclr(a,i) \
	((((const unsigned char *)(a))[(i)/NBBY] & (1<<((i)%NBBY))) == 0)

/* Macros for counting and rounding. */
#ifndef howmany
#define	howmany(x, y)	(((x)+((y)-1))/(y))
#endif
#define	rounddown(x, y)	(((x)/(y))*(y))
#define	roundup(x, y)	((((x)+((y)-1))/(y))*(y))  /* to any y */
#define	roundup2(x, y)	(((x)+((y)-1))&(~((y)-1))) /* if y is powers of two */
#define	powerof2(x)	((((x)-1)&(x))==0)

/* Macros for min/max. */
#define	MIN(a,b) (((a)<(b))?(a):(b))
#define	MAX(a,b) (((a)>(b))?(a):(b))

#ifdef _KERNEL
/*
 * Basic byte order function prototypes for non-inline functions.
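 *
 * A one-line usage sketch (illustrative, not part of this change): a
 * little-endian host sending TCP port 80 stores htons(80) == 0x5000 in
 * memory, which appears as the bytes 0x00 0x50 on the wire.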
 */
#ifndef LOCORE
#ifndef _BYTEORDER_PROTOTYPED
#define	_BYTEORDER_PROTOTYPED
__BEGIN_DECLS
__uint32_t	 htonl(__uint32_t);
__uint16_t	 htons(__uint16_t);
__uint32_t	 ntohl(__uint32_t);
__uint16_t	 ntohs(__uint16_t);
__END_DECLS
#endif
#endif

#ifndef lint
#ifndef _BYTEORDER_FUNC_DEFINED
#define	_BYTEORDER_FUNC_DEFINED
#define	htonl(x)	__htonl(x)
#define	htons(x)	__htons(x)
#define	ntohl(x)	__ntohl(x)
#define	ntohs(x)	__ntohs(x)
#endif /* !_BYTEORDER_FUNC_DEFINED */
#endif /* lint */
#endif /* _KERNEL */

/*
 * Scale factor for scaled integers used to count %cpu time and load avgs.
 *
 * The number of CPU `tick's that map to a unique `%age' can be expressed
 * by the formula (1 / (2 ^ (FSHIFT - 11))).  The maximum load average that
 * can be calculated (assuming 32 bits) can be closely approximated using
 * the formula (2 ^ (2 * (16 - FSHIFT))) for (FSHIFT < 15).
 *
 * For the scheduler to maintain a 1:1 mapping of CPU `tick' to `%age',
 * FSHIFT must be at least 11; this gives us a maximum load avg of ~1024.
 */
#define	FSHIFT	11		/* bits to right of fixed binary point */
#define	FSCALE	(1<<FSHIFT)

#define	dbtoc(db)			/* calculates devblks to pages */ \
	((db + (ctodb(1) - 1)) >> (PAGE_SHIFT - DEV_BSHIFT))

#define	ctodb(db)			/* calculates pages to devblks */ \
	((db) << (PAGE_SHIFT - DEV_BSHIFT))

/*
 * Given the pointer x to the member m of the struct s, return
 * a pointer to the containing structure.
 */
#define	member2struct(s, m, x) \
	((struct s *)(void *)((char *)(x) - offsetof(struct s, m)))

#endif	/* _SYS_PARAM_H_ */