Index: head/UPDATING
===================================================================
--- head/UPDATING (revision 192894)
+++ head/UPDATING (revision 192895)
@@ -1,1538 +1,1542 @@
Updating Information for FreeBSD current users

This file is maintained and copyrighted by M. Warner Losh. See end of file for further details. For commonly done items, please see the COMMON ITEMS: section later in the file.

Items affecting the ports and packages system can be found in /usr/ports/UPDATING. Please read that file before running portupgrade.

NOTE TO PEOPLE WHO THINK THAT FreeBSD 8.x IS SLOW: FreeBSD 8.x has many debugging features turned on, in both the kernel and userland. These features attempt to detect incorrect use of system primitives, and encourage loud failure through extra sanity checking and fail stop semantics. They also substantially impact system performance. If you want to do performance measurement, benchmarking, and optimization, you'll want to turn them off. This includes various WITNESS-related kernel options, INVARIANTS, malloc debugging flags in userland, and various verbose features in the kernel. Many developers choose to disable these features on build machines to maximize performance. (To disable malloc debugging, run ln -s aj /etc/malloc.conf.)

+20090527:
+	Add support for hierarchical jails. Remove global securelevel.
+	Bump __FreeBSD_version to 800091.
+
20090523: The layout of struct vnet_net has changed, therefore modules need to be rebuilt. Bump __FreeBSD_version to 800090.

20090523: The newly imported zic(8) produces a new format in the output. Please run tzsetup(8) to install the newly created data to /etc/localtime.

20090520: The sysctl tree for the usb stack has been renamed from hw.usb2.* to hw.usb.* and is now consistent again with previous releases.

20090520: 802.11 monitor mode support was revised and driver APIs were changed. Drivers dependent on net80211 now support DLT_IEEE802_11_RADIO instead of DLT_IEEE802_11. No user-visible data structures were changed but applications that use DLT_IEEE802_11 may require changes. Bump __FreeBSD_version to 800088.

20090430: The layout of the following structs has changed: sysctl_oid, socket, ifnet, inpcbinfo, tcpcb, syncache_head, vnet_inet, vnet_inet6 and vnet_ipfw. Most modules need to be rebuilt or panics may be experienced. World rebuild is required for correctly checking networking state from userland. Bump __FreeBSD_version to 800085.

20090429: MLDv2 and Source-Specific Multicast (SSM) have been merged to the IPv6 stack. VIMAGE hooks are in but not yet used. The implementation of SSM within FreeBSD's IPv6 stack closely follows the IPv4 implementation. For kernel developers: * The most important changes are that the ip6_output() and ip6_input() paths no longer take the IN6_MULTI_LOCK, and this lock has been downgraded to a non-recursive mutex. * As with the changes to the IPv4 stack to support SSM, filtering of inbound multicast traffic must now be performed by transport protocols within the IPv6 stack. This does not apply to TCP and SCTP; however, it does apply to UDP in IPv6 and raw IPv6. * The KPIs used by IPv6 multicast are similar to those used by the IPv4 stack, with the following differences: * im6o_mc_filter() is analogous to imo_multicast_filter(). * The legacy KAME entry points in6_joingroup() and in6_leavegroup() are shimmed to in6_mc_join() and in6_mc_leave() respectively. * IN6_LOOKUP_MULTI() has been deprecated and removed. * IPv6 relies on MLD for the DAD mechanism.
KAME's internal KPIs for MLDv1 have an additional 'timer' argument which is used to jitter the initial membership report for the solicited-node multicast membership on-link. * This is not strictly needed for MLDv2, which already jitters its report transmissions. However, the 'timer' argument is preserved in case MLDv1 is active on the interface. * The KAME linked-list based IPv6 membership implementation has been refactored to use a vector similar to that used by the IPv4 stack. Code which maintains a list of its own multicast memberships internally, e.g. carp, has been updated to reflect the new semantics. * There is a known Lock Order Reversal (LOR) due to in6_setscope() acquiring the IF_AFDATA_LOCK and being called within ip6_output(). Whilst MLDv2 tries to avoid this otherwise benign LOR, it is an implementation constraint which needs to be addressed in HEAD. For application developers: * The changes are broadly similar to those made for the IPv4 stack. * The use of IPv4 and IPv6 multicast socket options on the same socket, using mapped addresses, HAS NOT been tested or supported. * There are a number of issues with the implementation of various IPv6 multicast APIs which need to be resolved in the API surface before the implementation is fully compatible with KAME userland use, and these are mostly to do with interface index treatment. * The literature available discusses the use of either the delta / ASM API with setsockopt(2)/getsockopt(2), or the full-state / ASM API using setsourcefilter(3)/getsourcefilter(3). For more information please refer to RFC 3678, 'Socket Interface Extensions for Multicast Source Filters'. * Applications which use the published RFC 3678 APIs should be fine (a short example appears after the 20090414 entry below). For systems administrators: * The mtest(8) utility has been refactored to support IPv6, in addition to IPv4. Interface addresses are no longer accepted as arguments; their names must be used instead. The utility will map the interface name to its first IPv4 address as returned by getifaddrs(3). * The ifmcstat(8) utility has also been updated to print the MLDv2 endpoint state and source filter lists via sysctl(3). * The net.inet6.ip6.mcast.loop sysctl may be tuned to 0 to disable loopback of IPv6 multicast datagrams by default; it defaults to 1 to preserve the existing behaviour. Disabling multicast loopback is recommended for optimal system performance. * The IPv6 MROUTING code has been changed to examine this sysctl instead of attempting to perform a group lookup before looping back forwarded datagrams. Bump __FreeBSD_version to 800084.

20090422: Implement low-level Bluetooth HCI API. Bump __FreeBSD_version to 800083.

20090419: The layout of struct malloc_type, used by modules to register new memory allocation types, has changed. Most modules will need to be rebuilt or panics may be experienced. Bump __FreeBSD_version to 800081.

20090415: Anticipate overflowing inp_flags - add inp_flags2. This changes most offsets in inpcb, so checking v4 connection state will require a world rebuild. Bump __FreeBSD_version to 800080.

20090415: Add an llentry to struct route and struct route_in6. Modules embedding a struct route will need to be recompiled. Bump __FreeBSD_version to 800079.

20090414: The size of rt_metrics_lite and by extension rtentry has changed. Networking administration apps will need to be recompiled. The route command now supports show as an alias for get, weighting of routes, and sticky and nostick flags to alter the behavior of stateful load balancing. Bump __FreeBSD_version to 800078.
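	(Example for the 20090429 entry above; a minimal sketch, not part of
	the original entry.)  An application can join an IPv6 source-specific
	group through the protocol-independent RFC 3678 API roughly as
	follows; the socket s, the interface index ifindex, and the group and
	source addresses are placeholders:

		#include <sys/socket.h>
		#include <netinet/in.h>
		#include <arpa/inet.h>
		#include <string.h>
		#include <err.h>

		struct group_source_req gsr;
		struct sockaddr_in6 *grp = (struct sockaddr_in6 *)&gsr.gsr_group;
		struct sockaddr_in6 *src = (struct sockaddr_in6 *)&gsr.gsr_source;

		memset(&gsr, 0, sizeof(gsr));
		gsr.gsr_interface = ifindex;	/* e.g. from if_nametoindex(3) */
		grp->sin6_family = AF_INET6;
		grp->sin6_len = sizeof(*grp);
		inet_pton(AF_INET6, "ff3e::1234", &grp->sin6_addr);
		src->sin6_family = AF_INET6;
		src->sin6_len = sizeof(*src);
		inet_pton(AF_INET6, "2001:db8::1", &src->sin6_addr);
		if (setsockopt(s, IPPROTO_IPV6, MCAST_JOIN_SOURCE_GROUP,
		    &gsr, sizeof(gsr)) == -1)
			err(1, "MCAST_JOIN_SOURCE_GROUP");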
20090408: Do not use Giant for kbdmux(4) locking. This is wrong and apparently causing more problems than it solves. This will re-open the issue where interrupt handlers may race with kbdmux(4) in polling mode. Typical symptoms include (but not limited to) duplicated and/or missing characters when low level console functions (such as gets) are used while interrupts are enabled (for example geli password prompt, mountroot prompt etc.). Disabling kbdmux(4) may help. 20090407: The size of structs vnet_net, vnet_inet and vnet_ipfw has changed; kernel modules referencing any of the above need to be recompiled. Bump __FreeBSD_version to 800075. 20090320: GEOM_PART has become the default partition slicer for storage devices, replacing GEOM_MBR, GEOM_BSD, GEOM_PC98 and GEOM_GPT slicers. It introduces some changes: MSDOS/EBR: the devices created from MSDOS extended partition entries (EBR) can be named differently than with GEOM_MBR and are now symlinks to devices with offset-based names. fstabs may need to be modified. BSD: the "geometry does not match label" warning is harmless in most cases but it points to problems in file system misalignment with disk geometry. The "c" partition is now implicit, covers the whole top-level drive and cannot be (mis)used by users. General: Kernel dumps are now not allowed to be written to devices whose partition types indicate they are meant to be used for file systems (or, in case of MSDOS partitions, as something else than the "386BSD" type). Most of these changes date approximately from 200812. 20090319: The uscanner(4) driver has been removed from the kernel. This follows Linux removing theirs in 2.6 and making libusb the default interface (supported by sane). 20090319: The multicast forwarding code has been cleaned up. netstat(1) only relies on KVM now for printing bandwidth upcall meters. The IPv4 and IPv6 modules are split into ip_mroute_mod and ip6_mroute_mod respectively. The config(5) options for statically compiling this code remain the same, i.e. 'options MROUTING'. 20090315: Support for the IFF_NEEDSGIANT network interface flag has been removed, which means that non-MPSAFE network device drivers are no longer supported. In particular, if_ar, if_sr, and network device drivers from the old (legacy) USB stack can no longer be built or used. 20090313: POSIX.1 Native Language Support (NLS) has been enabled in libc and a bunch of new language catalog files have also been added. This means that some common libc messages are now localized and they depend on the LC_MESSAGES environmental variable. 20090313: The k8temp(4) driver has been renamed to amdtemp(4) since support for K10 and K11 CPU families was added. 20090309: IGMPv3 and Source-Specific Multicast (SSM) have been merged to the IPv4 stack. VIMAGE hooks are in but not yet used. For kernel developers, the most important changes are that the ip_output() and ip_input() paths no longer take the IN_MULTI_LOCK(), and this lock has been downgraded to a non-recursive mutex. Transport protocols (UDP, Raw IP) are now responsible for filtering inbound multicast traffic according to group membership and source filters. The imo_multicast_filter() KPI exists for this purpose. Transports which do not use multicast (SCTP, TCP) already reject multicast by default. Forwarding and receive performance may improve as a mutex acquisition is no longer needed in the ip_input() low-level input path. in_addmulti() and in_delmulti() are shimmed to new KPIs which exist to support SSM in-kernel. 
For application developers, it is recommended that loopback of multicast datagrams be disabled for best performance, as this will still cause the lock to be taken for each looped-back datagram transmission. The net.inet.ip.mcast.loop sysctl may be tuned to 0 to disable loopback by default; it defaults to 1 to preserve the existing behaviour. For systems administrators, to obtain best performance with multicast reception and multiple groups, it is always recommended that a card with a suitably precise hash filter is used. Hash collisions will still result in the lock being taken within the transport protocol input path to check group membership. If deploying FreeBSD in an environment with IGMP snooping switches, it is recommended that the net.inet.igmp.sendlocal sysctl remain enabled; this forces 224.0.0.0/24 group membership to be announced via IGMP. The size of 'struct igmpstat' has changed; netstat needs to be recompiled to reflect this. Bump __FreeBSD_version to 800070.

20090309: libusb20.so.1 is now installed as libusb.so.1 and the ports system has been updated to use it. This requires a buildworld/installworld in order to update the library and dependencies (usbconfig, etc). It is advisable to rebuild all ports which use libusb. More specific directions are given in the ports collection UPDATING file. Any /etc/libmap.conf entries for libusb are no longer required and can be removed.

20090302: A workaround is committed to allow the creation of System V shared memory segments of size > 2 GB on the 64-bit architectures. Due to a limitation of the existing ABI, the shm_segsz member of struct shmid_ds, as returned by the shmctl(IPC_STAT) call, is wrong for large segments. Note that limits must be explicitly raised to allow such segments to be created.

20090301: The layout of struct ifnet has changed, requiring a rebuild of all network device driver modules.

20090227: The /dev handling for the new USB stack has changed; a buildworld/installworld is required for libusb20.

20090223: The new USB2 stack has now been permanently moved in and all kernel and module names reverted to their previous values (eg, usb, ehci, ohci, ums, ...). The old usb stack can be compiled in by prefixing the name with the letter 'o'; the old usb modules have been removed. Updating entry 20090216 for xorg and 20090215 for libmap may still apply.

20090217: The rc.conf(5) option if_up_delay has been renamed to defaultroute_delay to better reflect its purpose. If you have customized this setting in /etc/rc.conf you need to update it to use the new name.

20090216: xorg 7.4 wants to configure its input devices via hald which does not yet work with USB2. If the keyboard/mouse does not work in xorg then add Option "AllowEmptyInput" "off" to your ServerLayout section. This will cause X to use the configured kbd and mouse sections from your xorg.conf.

20090215: The GENERIC kernels for all architectures now default to the new USB2 stack. No kernel config options or code have been removed, so if a problem arises please report it and optionally revert to the old USB stack. If you are loading USB kernel modules or have a custom kernel that includes GENERIC then ensure that usb names are also changed over, eg uftdi -> usb2_serial_ftdi. Older programs linked against the ports libusb 0.1 need to be redirected to the new stack's libusb20.
/etc/libmap.conf can be used for this: # Map old usb library to new one for usb2 stack libusb-0.1.so.8 libusb20.so.1 20090203: The ichsmb(4) driver has been changed to require SMBus slave addresses be left-justified (xxxxxxx0b) rather than right-justified. All of the other SMBus controller drivers require left-justified slave addresses, so this change makes all the drivers provide the same interface. 20090201: INET6 statistics (struct ip6stat) was updated. netstat(1) needs to be recompiled. 20090119: NTFS has been removed from GENERIC kernel on amd64 to match GENERIC on i386. Should not cause any issues since mount_ntfs(8) will load ntfs.ko module automatically when NTFS support is actually needed, unless ntfs.ko is not installed or security level prohibits loading kernel modules. If either is the case, "options NTFS" has to be added into kernel config. 20090115: TCP Appropriate Byte Counting (RFC 3465) support added to kernel. New field in struct tcpcb breaks ABI, so bump __FreeBSD_version to 800061. User space tools that rely on the size of struct tcpcb in tcp_var.h (e.g. sockstat) need to be recompiled. 20081225: ng_tty(4) module updated to match the new TTY subsystem. Due to API change, user-level applications must be updated. New API support added to mpd5 CVS and expected to be present in next mpd5.3 release. 20081219: With __FreeBSD_version 800060 the makefs tool is part of the base system (it was a port). 20081216: The afdata and ifnet locks have been changed from mutexes to rwlocks, network modules will need to be re-compiled. 20081214: __FreeBSD_version 800059 incorporates the new arp-v2 rewrite. RTF_CLONING, RTF_LLINFO and RTF_WASCLONED flags are eliminated. The new code reduced struct rtentry{} by 16 bytes on 32-bit architecture and 40 bytes on 64-bit architecture. The userland applications "arp" and "ndp" have been updated accordingly. The output from "netstat -r" shows only routing entries and none of the L2 information. 20081130: __FreeBSD_version 800057 marks the switchover from the binary ath hal to source code. Users must add the line: options AH_SUPPORT_AR5416 to their kernel config files when specifying: device ath_hal The ath_hal module no longer exists; the code is now compiled together with the driver in the ath module. It is now possible to tailor chip support (i.e. reduce the set of chips and thereby the code size); consult ath_hal(4) for details. 20081121: __FreeBSD_version 800054 adds memory barriers to , new interfaces to ifnet to facilitate multiple hardware transmit queues for cards that support them, and a lock-less ring-buffer implementation to enable drivers to more efficiently manage queueing of packets. 20081117: A new version of ZFS (version 13) has been merged to -HEAD. This version has zpool attribute "listsnapshots" off by default, which means "zfs list" does not show snapshots, and is the same as Solaris behavior. 20081028: dummynet(4) ABI has changed. ipfw(8) needs to be recompiled. 20081009: The uhci, ohci, ehci and slhci USB Host controller drivers have been put into separate modules. If you load the usb module separately through loader.conf you will need to load the appropriate *hci module as well. E.g. for a UHCI-based USB 2.0 controller add the following to loader.conf: uhci_load="YES" ehci_load="YES" 20081009: The ABI used by the PMC toolset has changed. Please keep userland (libpmc(3)) and the kernel module (hwpmc(4)) in sync. 
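	(Illustration for the 20081121 entry above; a minimal sketch, not
	part of the original entry.)  The new memory barriers are used
	through the acquire/release variants of the atomic(9) operations,
	for example a simple producer/consumer flag handoff in kernel code;
	consume() is a placeholder:

		#include <sys/types.h>
		#include <machine/atomic.h>

		static volatile u_int ready;
		static u_int payload;

		/* producer: publish the data before the flag (release barrier) */
		payload = 42;
		atomic_store_rel_int(&ready, 1);

		/* consumer: read the flag before the data (acquire barrier) */
		if (atomic_load_acq_int(&ready) != 0)
			consume(payload);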
20080820: The TTY subsystem of the kernel has been replaced by a new implementation, which provides better scalability and an improved driver model. Most common drivers have been migrated to the new TTY subsystem, while others have not. The following drivers have not yet been ported to the new TTY layer: PCI/ISA: cy, digi, rc, rp, sio USB: ubser, ucycom Line disciplines: ng_h4, ng_tty, ppp, sl, snp Adding these drivers to your kernel configuration file will cause compilation to fail.

20080818: ntpd has been upgraded to 4.2.4p5.

20080801: OpenSSH has been upgraded to 5.1p1. For many years, FreeBSD's version of OpenSSH preferred DSA over RSA for host and user authentication keys. With this upgrade, we've switched to the vendor's default of RSA over DSA. This may cause upgraded clients to warn about unknown host keys even for previously known hosts. Users should follow the usual procedure for verifying host keys before accepting the RSA key. This can be circumvented by setting the "HostKeyAlgorithms" option to "ssh-dss,ssh-rsa" in ~/.ssh/config or on the ssh command line. Please note that the sequence of keys offered for authentication has been changed as well. You may want to specify IdentityFile in a different order to revert this behavior.

20080713: The sio(4) driver has been removed from the i386 and amd64 kernel configuration files. This means uart(4) is now the default serial port driver on those platforms as well. To prevent collisions with the sio(4) driver, the uart(4) driver uses different names for its device nodes. This means the onboard serial port will now most likely be called "ttyu0" instead of "ttyd0". You may need to reconfigure applications to use the new device names. When using the serial port as a boot console, be sure to update /boot/device.hints and /etc/ttys before booting the new kernel. If you forget to do so, you can still manually specify the hints at the loader prompt: set hint.uart.0.at="isa" set hint.uart.0.port="0x3F8" set hint.uart.0.flags="0x10" set hint.uart.0.irq="4" boot -s

20080609: The gpt(8) utility has been removed. Use gpart(8) to partition disks instead.

20080603: The Linux kernel version that the Linuxulator emulates was changed from 2.4.2 to 2.6.16. If you experience any problems with Linux binaries, please try setting the sysctl compat.linux.osrelease to 2.4.2 and, if that fixes the problem, contact the emulation mailing list.

20080525: ISDN4BSD (I4B) was removed from the src tree. You may need to update your kernel configuration and remove the relevant entries.

20080509: I have checked in code to support multiple routing tables. See the man pages setfib(1) and setfib(2). This is a hopefully backwards compatible version, but to make use of it you need to compile your kernel with options ROUTETABLES=2 (or more, up to 16).

20080420: The 802.11 wireless support was redone to enable multi-bss operation on devices that are capable. The underlying device is no longer used directly but instead wlanX devices are cloned with ifconfig. This requires changes to rc.conf files. For example, change: ifconfig_ath0="WPA DHCP" to wlans_ath0=wlan0 ifconfig_wlan0="WPA DHCP" see rc.conf(5) for more details. In addition, mergemaster of /etc/rc.d is highly recommended. Simultaneous update of userland and kernel wouldn't hurt either. As part of the multi-bss changes the wlan_scan_ap and wlan_scan_sta modules were merged into the base wlan module. All references to these modules (e.g. in kernel config files) must be removed.

20080408: psm(4) has gained write(2) support in native operation level.
Arbitrary commands can be written to /dev/psm%d and status can be read back from it. Therefore, an application is responsible for status validation and error recovery. It is a no-op in other operation levels. 20080312: Support for KSE threading has been removed from the kernel. To run legacy applications linked against KSE libmap.conf may be used. The following libmap.conf may be used to ensure compatibility with any prior release: libpthread.so.1 libthr.so.1 libpthread.so.2 libthr.so.2 libkse.so.3 libthr.so.3 20080301: The layout of struct vmspace has changed. This affects libkvm and any executables that link against libkvm and use the kvm_getprocs() function. In particular, but not exclusively, it affects ps(1), fstat(1), pkill(1), systat(1), top(1) and w(1). The effects are minimal, but it's advisable to upgrade world nonetheless. 20080229: The latest em driver no longer has support in it for the 82575 adapter, this is now moved to the igb driver. The split was done to make new features that are incompatible with older hardware easier to do. 20080220: The new geom_lvm(4) geom class has been renamed to geom_linux_lvm(4), likewise the kernel option is now GEOM_LINUX_LVM. 20080211: The default NFS mount mode has changed from UDP to TCP for increased reliability. If you rely on (insecurely) NFS mounting across a firewall you may need to update your firewall rules. 20080208: Belatedly note the addition of m_collapse for compacting mbuf chains. 20080126: The fts(3) structures have been changed to use adequate integer types for their members and so to be able to cope with huge file trees. The old fts(3) ABI is preserved through symbol versioning in libc, so third-party binaries using fts(3) should still work, although they will not take advantage of the extended types. At the same time, some third-party software might fail to build after this change due to unportable assumptions made in its source code about fts(3) structure members. Such software should be fixed by its vendor or, in the worst case, in the ports tree. FreeBSD_version 800015 marks this change for the unlikely case that a portable fix is impossible. 20080123: To upgrade to -current after this date, you must be running FreeBSD not older than 6.0-RELEASE. Upgrading to -current from 5.x now requires a stop over at RELENG_6 or RELENG_7 systems. 20071128: The ADAPTIVE_GIANT kernel option has been retired because its functionality is the default now. 20071118: The AT keyboard emulation of sunkbd(4) has been turned on by default. In order to make the special symbols of the Sun keyboards driven by sunkbd(4) work under X these now have to be configured the same way as Sun USB keyboards driven by ukbd(4) (which also does AT keyboard emulation), f.e.: Option "XkbLayout" "us" Option "XkbRules" "xorg" Option "XkbSymbols" "pc(pc105)+sun_vndr/usb(sun_usb)+us" 20071024: It has been decided that it is desirable to provide ABI backwards compatibility to the FreeBSD 4/5/6 versions of the PCIOCGETCONF, PCIOCREAD and PCIOCWRITE IOCTLs, which was broken with the introduction of PCI domain support (see the 20070930 entry). Unfortunately, this required the ABI of PCIOCGETCONF to be broken again in order to be able to provide backwards compatibility to the old version of that IOCTL. Thus consumers of PCIOCGETCONF have to be recompiled again. As for prominent ports this affects neither pciutils nor xorg-server this time, the hal port needs to be rebuilt however. 20071020: The misnamed kthread_create() and friends have been renamed to kproc_create() etc. 
Many of the callers already used kproc_start().. I will return kthread_create() and friends in a while with implementations that actually create threads, not procs. Renaming corresponds with version 800002. 20071010: RELENG_7 branched. 20071009: Setting WITHOUT_LIBPTHREAD now means WITHOUT_LIBKSE and WITHOUT_LIBTHR are set. 20070930: The PCI code has been made aware of PCI domains. This means that the location strings as used by pciconf(8) etc are now in the following format: pci::[:]. It also means that consumers of potentially need to be recompiled; this includes the hal and xorg-server ports. 20070928: The caching daemon (cached) was renamed to nscd. nscd.conf configuration file should be used instead of cached.conf and nscd_enable, nscd_pidfile and nscd_flags options should be used instead of cached_enable, cached_pidfile and cached_flags in rc.conf. 20070921: The getfacl(1) utility now prints owning user and group name instead of owning uid and gid in the three line comment header. This is the same behavior as getfacl(1) on Solaris and Linux. 20070704: The new IPsec code is now compiled in using the IPSEC option. The IPSEC option now requires "device crypto" be defined in your kernel configuration. The FAST_IPSEC kernel option is now deprecated. 20070702: The packet filter (pf) code has been updated to OpenBSD 4.1 Please note the changed syntax - keep state is now on by default. Also note the fact that ftp-proxy(8) has been changed from bottom up and has been moved from libexec to usr/sbin. Changes in the ALTQ handling also affect users of IPFW's ALTQ capabilities. 20070701: Remove KAME IPsec in favor of FAST_IPSEC, which is now the only IPsec supported by FreeBSD. The new IPsec stack supports both IPv4 and IPv6. The kernel option will change after the code changes have settled in. For now the kernel option IPSEC is deprecated and FAST_IPSEC is the only option, that will change after some settling time. 20070701: The wicontrol(8) utility has been removed from the base system. wi(4) cards should be configured using ifconfig(8), see the man page for more information. 20070612: The i386/amd64 GENERIC kernel now defaults to the nfe(4) driver instead of the nve(4) driver. Please update your configuration accordingly. 20070612: By default, /etc/rc.d/sendmail no longer rebuilds the aliases database if it is missing or older than the aliases file. If desired, set the new rc.conf option sendmail_rebuild_aliases to "YES" to restore that functionality. 20070612: The IPv4 multicast socket code has been considerably modified, and moved to the file sys/netinet/in_mcast.c. Initial support for the RFC 3678 Source-Specific Multicast Socket API has been added to the IPv4 network stack. Strict multicast and broadcast reception is now the default for UDP/IPv4 sockets; the net.inet.udp.strict_mcast_mship sysctl variable has now been removed. The RFC 1724 hack for interface selection has been removed; the use of the Linux-derived ip_mreqn structure with IP_MULTICAST_IF has been added to replace it. Consumers such as routed will soon be updated to reflect this. These changes affect users who are running routed(8) or rdisc(8) from the FreeBSD base system on point-to-point or unnumbered interfaces. 20070610: The net80211 layer has changed significantly and all wireless drivers that depend on it need to be recompiled. Further these changes require that any program that interacts with the wireless support in the kernel be recompiled; this includes: ifconfig, wpa_supplicant, hostapd, and wlanstats. 
Users must also, for the moment, kldload the wlan_scan_sta and/or wlan_scan_ap modules if they use modules for wireless support. These modules implement scanning support for station and ap modes, respectively. Failure to load the appropriate module before marking a wireless interface up will result in a message to the console and the device not operating properly. 20070610: The pam_nologin(8) module ceases to provide an authentication function and starts providing an account management function. Consequent changes to /etc/pam.d should be brought in using mergemaster(8). Third-party files in /usr/local/etc/pam.d may need manual editing as follows. Locate this line (or similar): auth required pam_nologin.so no_warn and change it according to this example: account required pam_nologin.so no_warn That is, the first word needs to be changed from "auth" to "account". The new line can be moved to the account section within the file for clarity. Not updating pam.conf(5) files will result in nologin(5) ignored by the respective services. 20070529: The ether_ioctl() function has been synchronized with ioctl(2) and ifnet.if_ioctl. Due to that, the size of one of its arguments has changed on 64-bit architectures. All kernel modules using ether_ioctl() need to be rebuilt on such architectures. 20070516: Improved INCLUDE_CONFIG_FILE support has been introduced to the config(8) utility. In order to take advantage of this new functionality, you are expected to recompile and install src/usr.sbin/config. If you don't rebuild config(8), and your kernel configuration depends on INCLUDE_CONFIG_FILE, the kernel build will be broken because of a missing "kernconfstring" symbol. 20070513: Symbol versioning is enabled by default. To disable it, use option WITHOUT_SYMVER. It is not advisable to attempt to disable symbol versioning once it is enabled; your installworld will break because a symbol version-less libc will get installed before the install tools. As a result, the old install tools, which previously had symbol dependencies to FBSD_1.0, will fail because the freshly installed libc will not have them. The default threading library (providing "libpthread") has been changed to libthr. If you wish to have libkse as your default, use option DEFAULT_THREAD_LIB=libkse for the buildworld. 20070423: The ABI breakage in sendmail(8)'s libmilter has been repaired so it is no longer necessary to recompile mail filters (aka, milters). If you recompiled mail filters after the 20070408 note, it is not necessary to recompile them again. 20070417: The new trunk(4) driver has been renamed to lagg(4) as it better reflects its purpose. ifconfig will need to be recompiled. 20070408: sendmail(8) has been updated to version 8.14.1. Mail filters (aka, milters) compiled against the libmilter included in the base operating system should be recompiled. 20070302: Firmwares for ipw(4) and iwi(4) are now included in the base tree. In order to use them one must agree to the respective LICENSE in share/doc/legal and define legal.intel_.license_ack=1 via loader.conf(5) or kenv(1). Make sure to deinstall the now deprecated modules from the respective firmware ports. 20070228: The name resolution/mapping functions addr2ascii(3) and ascii2addr(3) were removed from FreeBSD's libc. These originally came from INRIA IPv6. Nothing in FreeBSD ever used them. They may be regarded as deprecated in previous releases. The AF_LINK support for getnameinfo(3) was merged from NetBSD to replace it as a more portable (and re-entrant) API. 
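	(Illustration for the 20070228 entry above; a minimal sketch, not
	part of the original entry.)  With the AF_LINK support in
	getnameinfo(3), a link-level address can be rendered as a string
	without addr2ascii(3), for example by walking getifaddrs(3):

		#include <sys/types.h>
		#include <sys/socket.h>
		#include <ifaddrs.h>
		#include <netdb.h>
		#include <stdio.h>

		struct ifaddrs *ifap, *ifa;
		char addr[NI_MAXHOST];

		if (getifaddrs(&ifap) == 0) {
			for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
				if (ifa->ifa_addr == NULL ||
				    ifa->ifa_addr->sa_family != AF_LINK)
					continue;
				if (getnameinfo(ifa->ifa_addr,
				    ifa->ifa_addr->sa_len, addr, sizeof(addr),
				    NULL, 0, 0) == 0)
					printf("%s: %s\n", ifa->ifa_name, addr);
			}
			freeifaddrs(ifap);
		}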
20070224: To support interrupt filtering a modification to the newbus API has occurred, ABI was broken and __FreeBSD_version was bumped to 700031. Please make sure that your kernel and modules are in sync. For more info: http://docs.freebsd.org/cgi/mid.cgi?20070221233124.GA13941 20070224: The IPv6 multicast forwarding code may now be loaded into GENERIC kernels by loading the ip_mroute.ko module. This is built into the module unless WITHOUT_INET6 or WITHOUT_INET6_SUPPORT options are set; see src.conf(5) for more information. 20070214: The output of netstat -r has changed. Without -n, we now only print a "network name" without the prefix length if the network address and mask exactly match a Class A/B/C network, and an entry exists in the nsswitch "networks" map. With -n, we print the full unabbreviated CIDR network prefix in the form "a.b.c.d/p". 0.0.0.0/0 is always printed as "default". This change is in preparation for changes such as equal-cost multipath, and to more generally assist operational deployment of FreeBSD as a modern IPv4 router. 20070210: PIM has been turned on by default in the IPv4 multicast routing code. The kernel option 'PIM' has now been removed. PIM is now built by default if option 'MROUTING' is specified. It may now be loaded into GENERIC kernels by loading the ip_mroute.ko module. 20070207: Support for IPIP tunnels (VIFF_TUNNEL) in IPv4 multicast routing has been removed. Its functionality may be achieved by explicitly configuring gif(4) interfaces and using the 'phyint' keyword in mrouted.conf. XORP does not support source-routed IPv4 multicast tunnels nor the integrated IPIP tunneling, therefore it is not affected by this change. The __FreeBSD_version macro has been bumped to 700030. 20061221: Support for PCI Message Signalled Interrupts has been re-enabled in the bge driver, only for those chips which are believed to support it properly. If there are any problems, MSI can be disabled completely by setting the 'hw.pci.enable_msi' and 'hw.pci.enable_msix' tunables to 0 in the loader. 20061214: Support for PCI Message Signalled Interrupts has been disabled again in the bge driver. Many revisions of the hardware fail to support it properly. Support can be re-enabled by removing the #define of BGE_DISABLE_MSI in "src/sys/dev/bge/if_bge.c". 20061214: Support for PCI Message Signalled Interrupts has been added to the bge driver. If there are any problems, MSI can be disabled completely by setting the 'hw.pci.enable_msi' and 'hw.pci.enable_msix' tunables to 0 in the loader. 20061205: The removal of several facets of the experimental Threading system from the kernel means that the proc and thread structures have changed quite a bit. I suggest all kernel modules that might reference these structures be recompiled.. Especially the linux module. 20061126: Sound infrastructure has been updated with various fixes and improvements. Most of the changes are pretty much transparent, with exceptions of followings: 1) All sound driver specific sysctls (hw.snd.pcm%d.*) have been moved to their own dev sysctl nodes, for example: hw.snd.pcm0.vchans -> dev.pcm.0.vchans 2) /dev/dspr%d.%d has been deprecated. Each channel now has its own chardev in the form of "dsp%d.%d", where is p = playback, r = record and v = virtual, respectively. Users are encouraged to use these devs instead of (old) "/dev/dsp%d.%d". This does not affect those who are using "/dev/dsp". 20061122: geom(4)'s gmirror(8) class metadata structure has been rev'd from v3 to v4. 
If you update across this point and your metadata is converted for you, you will not be easily able to downgrade since the /boot/kernel.old/geom_mirror.ko kernel module will be unable to read the v4 metadata. You can resolve this by doing from the loader(8) prompt: set vfs.root.mountfrom="ufs:/dev/XXX" where XXX is the root slice of one of the disks that composed the mirror (i.e.: /dev/ad0s1a). You can then rebuild the array the same way you built it originally. 20061122: The following binaries have been disconnected from the build: mount_devfs, mount_ext2fs, mount_fdescfs, mount_procfs, mount_linprocfs, and mount_std. The functionality of these programs has been moved into the mount program. For example, to mount a devfs filesystem, instead of using mount_devfs, use: "mount -t devfs". This does not affect entries in /etc/fstab, since entries in /etc/fstab are always processed with "mount -t fstype". 20061113: Support for PCI Message Signalled Interrupts on i386 and amd64 has been added to the kernel and various drivers will soon be updated to use MSI when it is available. If there are any problems, MSI can be disabled completely by setting the 'hw.pci.enable_msi' and 'hw.pci.enable_msix' tunables to 0 in the loader. 20061110: The MUTEX_PROFILING option has been renamed to LOCK_PROFILING. The lockmgr object layout has been changed as a result of having a lock_object embedded in it. As a consequence all file system kernel modules must be re-compiled. The mutex profiling man page has not yet been updated to reflect this change. 20061026: KSE in the kernel has now been made optional and turned on by default. Use 'nooption KSE' in your kernel config to turn it off. All kernel modules *must* be recompiled after this change. There-after, modules from a KSE kernel should be compatible with modules from a NOKSE kernel due to the temporary padding fields added to 'struct proc'. 20060929: mrouted and its utilities have been removed from the base system. 20060927: Some ioctl(2) command codes have changed. Full backward ABI compatibility is provided if the "options COMPAT_FREEBSD6" is present in the kernel configuration file. Make sure to add this option to your kernel config file, or recompile X.Org and the rest of ports; otherwise they may refuse to work. 20060924: tcpslice has been removed from the base system. 20060913: The sizes of struct tcpcb (and struct xtcpcb) have changed due to the rewrite of TCP syncookies. Tools like netstat, sockstat, and systat needs to be rebuilt. 20060903: libpcap updated to v0.9.4 and tcpdump to v3.9.4 20060816: The IPFIREWALL_FORWARD_EXTENDED option is gone and the behaviour for IPFIREWALL_FORWARD is now as it was before when it was first committed and for years after. The behaviour is now ON. 20060725: enigma(1)/crypt(1) utility has been changed on 64 bit architectures. Now it can decrypt files created from different architectures. Unfortunately, it is no longer able to decrypt a cipher text generated with an older version on 64 bit architectures. If you have such a file, you need old utility to decrypt it. 20060709: The interface version of the i4b kernel part has changed. So after updating the kernel sources and compiling a new kernel, the i4b user space tools in "/usr/src/usr.sbin/i4b" must also be rebuilt, and vice versa. 20060627: The XBOX kernel now defaults to the nfe(4) driver instead of the nve(4) driver. Please update your configuration accordingly. 
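	(Illustration for the 20060913 entry above; a minimal sketch, not
	part of the original entry.)  Tools such as netstat and sockstat take
	a snapshot of the TCP control blocks through sysctl(3) and then parse
	fixed-layout records out of it, which is why they must be rebuilt
	whenever struct tcpcb/xtcpcb changes:

		#include <sys/types.h>
		#include <sys/sysctl.h>
		#include <err.h>
		#include <stdio.h>
		#include <stdlib.h>

		int
		main(void)
		{
			size_t len = 0;
			char *buf;

			/* ask the kernel how large the snapshot is */
			if (sysctlbyname("net.inet.tcp.pcblist", NULL, &len,
			    NULL, 0) == -1)
				err(1, "net.inet.tcp.pcblist");
			if ((buf = malloc(len)) == NULL)
				err(1, "malloc");
			if (sysctlbyname("net.inet.tcp.pcblist", buf, &len,
			    NULL, 0) == -1)
				err(1, "net.inet.tcp.pcblist");
			/*
			 * buf now holds struct xinpgen/xtcpcb records whose
			 * layout must match the headers this program was
			 * compiled against.
			 */
			printf("pcblist snapshot: %zu bytes\n", len);
			free(buf);
			return (0);
		}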
20060514: The i386-only lnc(4) driver for the AMD Am7900 LANCE and Am79C9xx PCnet family of NICs has been removed. The new le(4) driver serves as an equivalent but cross-platform replacement with the pcn(4) driver still providing performance-optimized support for the subset of AMD Am79C971 PCnet-FAST and greater chips as before. 20060511: The machdep.* sysctls and the adjkerntz utility have been modified a bit. The new adjkerntz utility uses the new sysctl names and sysctlbyname() calls, so it may be impossible to run an old /sbin/adjkerntz utility in single-user mode with a new kernel. Replace the `adjkerntz -i' step before `make installworld' with: /usr/obj/usr/src/sbin/adjkerntz/adjkerntz -i and proceed as usual with the rest of the installworld-stage steps. Otherwise, you risk installing binaries with their timestamp set several hours in the future, especially if you are running with local time set to GMT+X hours. 20060412: The ip6fw utility has been removed. The behavior provided by ip6fw has been in ipfw2 for a good while and the rc.d scripts have been updated to deal with it. There are some rules that might not migrate cleanly. Use rc.firewall6 as a template to rewrite rules. 20060428: The puc(4) driver has been overhauled. The ebus(4) and sbus(4) attachments have been removed. Make sure to configure scc(4) on sparc64. Note also that by default puc(4) will use uart(4) and not sio(4) for serial ports because interrupt handling has been optimized for multi-port serial cards and only uart(4) implements the interface to support it. 20060330: The scc(4) driver replaces puc(4) for Serial Communications Controllers (SCCs) like the Siemens SAB82532 and the Zilog Z8530. On sparc64, it is advised to add scc(4) to the kernel configuration to make sure that the serial ports remain functional. 20060317: Most world/kernel related NO_* build options changed names. New knobs have common prefixes WITHOUT_*/WITH_* (modelled after FreeBSD ports) and should be set in /etc/src.conf (the src.conf(5) manpage is provided). Full backwards compatibility is maintained for the time being though it's highly recommended to start moving old options out of the system-wide /etc/make.conf file into the new /etc/src.conf while also properly renaming them. More conversions will likely follow. Posting to current@: http://lists.freebsd.org/pipermail/freebsd-current/2006-March/061725.html 20060305: The NETSMBCRYPTO kernel option has been retired because its functionality is always included in NETSMB and smbfs.ko now. 20060303: The TDFX_LINUX kernel option was retired and replaced by the tdfx_linux device. The latter can be loaded as the 3dfx_linux.ko kernel module. Loading it alone should suffice to get 3dfx support for Linux apps because it will pull in 3dfx.ko and linux.ko through its dependencies. 20060204: The 'audit' group was added to support the new auditing functionality in the base system. Be sure to follow the directions for updating, including the requirement to run mergemaster -p. 20060201: The kernel ABI to file system modules was changed on i386. Please make sure that your kernel and modules are in sync. 20060118: This actually occured some time ago, but installing the kernel now also installs a bunch of symbol files for the kernel modules. This increases the size of /boot/kernel to about 67Mbytes. You will need twice this if you will eventually back this up to kernel.old on your next install. 
If you have a shortage of room in your root partition, you should add -DINSTALL_NODEBUG to your make arguments or add INSTALL_NODEBUG="yes" to your /etc/make.conf. 20060113: libc's malloc implementation has been replaced. This change has the potential to uncover application bugs that previously went unnoticed. See the malloc(3) manual page for more details. 20060112: The generic netgraph(4) cookie has been changed. If you upgrade kernel passing this point, you also need to upgrade userland and netgraph(4) utilities like ports/net/mpd or ports/net/mpd4. 20060106: si(4)'s device files now contain the unit number. Uses of {cua,tty}A[0-9a-f] should be replaced by {cua,tty}A0[0-9a-f]. 20060106: The kernel ABI was mostly destroyed due to a change in the size of struct lock_object which is nested in other structures such as mutexes which are nested in all sorts of other structures. Make sure your kernel and modules are in sync. 20051231: The page coloring algorithm in the VM subsystem was converted from tuning with kernel options to autotuning. Please remove any PQ_* option except PQ_NOOPT from your kernel config. 20051211: The net80211-related tools in the tools/tools/ath directory have been moved to tools/tools/net80211 and renamed with a "wlan" prefix. Scripts that use them should be adjusted accordingly. 20051202: Scripts in the local_startup directories (as defined in /etc/defaults/rc.conf) that have the new rc.d semantics will now be run as part of the base system rcorder. If there are errors or problems with one of these local scripts, it could cause boot problems. If you encounter such problems, boot in single user mode, remove that script from the */rc.d directory. Please report the problem to the port's maintainer, and the freebsd-ports@freebsd.org mailing list. 20051129: The nodev mount option was deprecated in RELENG_6 (where it was a no-op), and is now unsupported. If you have nodev or dev listed in /etc/fstab, remove it, otherwise it will result in a mount error. 20051129: ABI between ipfw(4) and ipfw(8) has been changed. You need to rebuild ipfw(8) when rebuilding kernel. 20051108: rp(4)'s device files now contain the unit number. Uses of {cua,tty}R[0-9a-f] should be replaced by {cua,tty}R0[0-9a-f]. 20051029: /etc/rc.d/ppp-user has been renamed to /etc/rc.d/ppp. Its /etc/rc.conf.d configuration file has been `ppp' from the beginning, and hence there is no need to touch it. 20051014: Now most modules get their build-time options from the kernel configuration file. A few modules still have fixed options due to their non-conformant implementation, but they will be corrected eventually. You may need to review the options of the modules in use, explicitly specify the non-default options in the kernel configuration file, and rebuild the kernel and modules afterwards. 20051001: kern.polling.enable sysctl MIB is now deprecated. Use ifconfig(8) to turn polling(4) on your interfaces. 20050927: The old bridge(4) implementation was retired. The new if_bridge(4) serves as a full functional replacement. 20050722: The ai_addrlen of a struct addrinfo was changed to a socklen_t to conform to POSIX-2001. This change broke an ABI compatibility on 64 bit architecture. You have to recompile userland programs that use getaddrinfo(3) on 64 bit architecture. 20050711: RELENG_6 branched here. 20050629: The pccard_ifconfig rc.conf variable has been removed and a new variable, ifconfig_DEFAULT has been introduced. 
Unlike pccard_ifconfig, ifconfig_DEFAULT applies to ALL interfaces that do not have ifconfig_ifn entries rather than just those in removable_interfaces.

20050616: Some previous versions of PAM have permitted the use of non-absolute paths in /etc/pam.conf or /etc/pam.d/* when referring to third party PAM modules in /usr/local/lib. A change has been made to require the use of absolute paths in order to avoid ambiguity and dependence on library path configuration, which may affect existing configurations.

20050610: Major changes to network interface API. All drivers must be recompiled. Drivers not in the base system will need to be updated to the new APIs.

20050609: Changes were made to kinfo_proc in sys/user.h. Please recompile userland, or commands like `fstat', `pkill', `ps', `top' and `w' will not behave correctly. The API and ABI for hwpmc(4) have changed with the addition of sampling support. Please recompile lib/libpmc(3) and usr.sbin/{pmcstat,pmccontrol}.

20050606: The OpenBSD dhclient was imported in place of the ISC dhclient and the network interface configuration scripts were updated accordingly. If you use DHCP to configure your interfaces, you must now run devd. Also, DNS updating was lost so you will need to find a workaround if you use this feature. The '_dhcp' user was added to support the OpenBSD dhclient. Be sure to run mergemaster -p (like you are supposed to do every time anyway).

20050605: if_bridge was added to the tree. This has changed struct ifnet. Please recompile userland and all network related modules.

20050603: The n_net of a struct netent was changed to a uint32_t, and the 1st argument of getnetbyaddr() was changed to a uint32_t, to conform to POSIX-2001. These changes broke ABI compatibility on 64-bit architectures. With these changes, the shlib major of libpcap was bumped. You have to recompile userland programs that use getnetbyaddr(3), getnetbyname(3), getnetent(3) and/or libpcap on 64-bit architectures.

20050528: Kernel parsing of extra options on '#!' first lines of shell scripts has changed. Lines with multiple options likely will fail after this date. For full details, please see http://people.freebsd.org/~gad/Updating-20050528.txt

20050503: The packet filter (pf) code has been updated to OpenBSD 3.7. Please note the changed anchor syntax and the fact that authpf(8) now needs a mounted fdescfs(5) to function.

20050415: The NO_MIXED_MODE kernel option has been removed from the i386 and amd64 platforms as its use has been superseded by the new local APIC timer code. Any kernel config files containing this option should be updated.

20050227: The on-disk format of LC_CTYPE files was changed to be machine independent. Please make sure NOT to use NO_CLEAN buildworld when crossing this point. Crossing this point also requires a recompile or reinstall of all locale-dependent packages.

20050225: The ifi_epoch member of struct if_data has been changed to contain the uptime at which the interface was created or the statistics zeroed rather than the wall clock time, because wall clock time may go backwards. This should have no impact unless an SNMP implementation is using this value (I know of none at this point.)

20050224: The acpi_perf and acpi_throttle drivers are now part of the acpi(4) main module. They are no longer built separately.

20050223: The layout of struct image_params has changed. You have to recompile all compatibility modules (linux, svr4, etc) for use with the new kernel.

20050223: The p4tcc driver has been merged into cpufreq(4).
This makes "options CPU_ENABLE_TCC" obsolete. Please load cpufreq.ko or compile in "device cpufreq" to restore this functionality. 20050220: The responsibility of recomputing the file system summary of a SoftUpdates-enabled dirty volume has been transferred to the background fsck. A rebuild of fsck(8) utility is recommended if you have updated the kernel. To get the old behavior (recompute file system summary at mount time), you can set vfs.ffs.compute_summary_at_mount=1 before mounting the new volume. 20050206: The cpufreq import is complete. As part of this, the sysctls for acpi(4) throttling have been removed. The power_profile script has been updated, so you can use performance/economy_cpu_freq in rc.conf(5) to set AC on/offline cpu frequencies. 20050206: NG_VERSION has been increased. Recompiling kernel (or ng_socket.ko) requires recompiling libnetgraph and userland netgraph utilities. 20050114: Support for abbreviated forms of a number of ipfw options is now deprecated. Warnings are printed to stderr indicating the correct full form when a match occurs. Some abbreviations may be supported at a later date based on user feedback. To be considered for support, abbreviations must be in use prior to this commit and unlikely to be confused with current key words. 20041221: By a popular demand, a lot of NOFOO options were renamed to NO_FOO (see bsd.compat.mk for a full list). The old spellings are still supported, but will cause annoying warnings on stderr. Make sure you upgrade properly (see the COMMON ITEMS: section later in this file). 20041219: Auto-loading of ancillary wlan modules such as wlan_wep has been temporarily disabled; you need to statically configure the modules you need into your kernel or explicitly load them prior to use. Specifically, if you intend to use WEP encryption with an 802.11 device load/configure wlan_wep; if you want to use WPA with the ath driver load/configure wlan_tkip, wlan_ccmp, and wlan_xauth as required. 20041213: The behaviour of ppp(8) has changed slightly. If lqr is enabled (``enable lqr''), older versions would revert to LCP ECHO mode on negotiation failure. Now, ``enable echo'' is required for this behaviour. The ppp version number has been bumped to 3.4.2 to reflect the change. 20041201: The wlan support has been updated to split the crypto support into separate modules. For static WEP you must configure the wlan_wep module in your system or build and install the module in place where it can be loaded (the kernel will auto-load the module when a wep key is configured). 20041201: The ath driver has been updated to split the tx rate control algorithm into a separate module. You need to include either ath_rate_onoe or ath_rate_amrr when configuring the kernel. 20041116: Support for systems with an 80386 CPU has been removed. Please use FreeBSD 5.x or earlier on systems with an 80386. 20041110: We have had a hack which would mount the root filesystem R/W if the device were named 'md*'. As part of the vnode work I'm doing I have had to remove this hack. People building systems which use preloaded MD root filesystems may need to insert a "/sbin/mount -u -o rw /dev/md0 /" in their /etc/rc scripts. 20041104: FreeBSD 5.3 shipped here. 20041102: The size of struct tcpcb has changed again due to the removal of RFC1644 T/TCP. You have to recompile userland programs that read kmem for tcp sockets directly (netstat, sockstat, etc.) 20041022: The size of struct tcpcb has changed. 
You have to recompile userland programs that read kmem for tcp sockets directly (netstat, sockstat, etc.) 20041016: RELENG_5 branched here. For older entries, please see updating in the RELENG_5 branch. COMMON ITEMS: General Notes ------------- Avoid using make -j when upgrading. From time to time in the past there have been problems using -j with buildworld and/or installworld. This is especially true when upgrading between "distant" versions (eg one that cross a major release boundary or several minor releases, or when several months have passed on the -current branch). Sometimes, obscure build problems are the result of environment poisoning. This can happen because the make utility reads its environment when searching for values for global variables. To run your build attempts in an "environmental clean room", prefix all make commands with 'env -i '. See the env(1) manual page for more details. When upgrading from one major version to another it is generally best to upgrade to the latest code in the currently installed branch first, then do an upgrade to the new branch. This is the best-tested upgrade path, and has the highest probability of being successful. Please try this approach before reporting problems with a major version upgrade. To build a kernel ----------------- If you are updating from a prior version of FreeBSD (even one just a few days old), you should follow this procedure. It is the most failsafe as it uses a /usr/obj tree with a fresh mini-buildworld, make kernel-toolchain make -DALWAYS_CHECK_MAKE buildkernel KERNCONF=YOUR_KERNEL_HERE make -DALWAYS_CHECK_MAKE installkernel KERNCONF=YOUR_KERNEL_HERE To test a kernel once --------------------- If you just want to boot a kernel once (because you are not sure if it works, or if you want to boot a known bad kernel to provide debugging information) run make installkernel KERNCONF=YOUR_KERNEL_HERE KODIR=/boot/testkernel nextboot -k testkernel To just build a kernel when you know that it won't mess you up -------------------------------------------------------------- This assumes you are already running a 5.X system. Replace ${arch} with the architecture of your machine (e.g. "i386", "alpha", "amd64", "ia64", "pc98", "sparc64", etc). cd src/sys/${arch}/conf config KERNEL_NAME_HERE cd ../compile/KERNEL_NAME_HERE make depend make make install If this fails, go to the "To build a kernel" section. To rebuild everything and install it on the current system. ----------------------------------------------------------- # Note: sometimes if you are running current you gotta do more than # is listed here if you are upgrading from a really old current. make buildworld make kernel KERNCONF=YOUR_KERNEL_HERE [1] [3] mergemaster -p [5] make installworld make delete-old mergemaster [4] To cross-install current onto a separate partition -------------------------------------------------- # In this approach we use a separate partition to hold # current's root, 'usr', and 'var' directories. A partition # holding "/", "/usr" and "/var" should be about 2GB in # size. 
make buildworld make buildkernel KERNCONF=YOUR_KERNEL_HERE make installworld DESTDIR=${CURRENT_ROOT} make distribution DESTDIR=${CURRENT_ROOT} # if newfs'd make installkernel KERNCONF=YOUR_KERNEL_HERE DESTDIR=${CURRENT_ROOT} cp /etc/fstab ${CURRENT_ROOT}/etc/fstab # if newfs'd To upgrade in-place from 5.x-stable to current ---------------------------------------------- make buildworld [9] make kernel KERNCONF=YOUR_KERNEL_HERE [8] [1] [3] mergemaster -p [5] make installworld make delete-old mergemaster -i [4] Make sure that you've read the UPDATING file to understand the tweaks to various things you need. At this point in the life cycle of current, things change often and you are on your own to cope. The defaults can also change, so please read ALL of the UPDATING entries. Also, if you are tracking -current, you must be subscribed to freebsd-current@freebsd.org. Make sure that before you update your sources that you have read and understood all the recent messages there. If in doubt, please track -stable which has much fewer pitfalls. [1] If you have third party modules, such as vmware, you should disable them at this point so they don't crash your system on reboot. [3] From the bootblocks, boot -s, and then do fsck -p mount -u / mount -a cd src adjkerntz -i # if CMOS is wall time Also, when doing a major release upgrade, it is required that you boot into single user mode to do the installworld. [4] Note: This step is non-optional. Failure to do this step can result in a significant reduction in the functionality of the system. Attempting to do it by hand is not recommended and those that pursue this avenue should read this file carefully, as well as the archives of freebsd-current and freebsd-hackers mailing lists for potential gotchas. [5] Usually this step is a noop. However, from time to time you may need to do this if you get unknown user in the following step. It never hurts to do it all the time. You may need to install a new mergemaster (cd src/usr.sbin/mergemaster && make install) after the buildworld before this step if you last updated from current before 20020224 or from -stable before 20020408. [8] In order to have a kernel that can run the 4.x binaries needed to do an installworld, you must include the COMPAT_FREEBSD4 option in your kernel. Failure to do so may leave you with a system that is hard to boot to recover. A similar kernel option COMPAT_FREEBSD5 is required to run the 5.x binaries on more recent kernels. Make sure that you merge any new devices from GENERIC since the last time you updated your kernel config file. [9] When checking out sources, you must include the -P flag to have cvs prune empty directories. If CPUTYPE is defined in your /etc/make.conf, make sure to use the "?=" instead of the "=" assignment operator, so that buildworld can override the CPUTYPE if it needs to. MAKEOBJDIRPREFIX must be defined in an environment variable, and not on the command line, or in /etc/make.conf. buildworld will warn if it is improperly defined. FORMAT: This file contains a list, in reverse chronological order, of major breakages in tracking -current. Not all things will be listed here, and it only starts on October 16, 2004. Updating files can found in previous releases if your system is older than this. Copyright information: Copyright 1998-2005 M. Warner Losh. All Rights Reserved. 
Redistribution, publication, translation and use, with or without modification, in full or in part, in any form or format of this document are permitted without further permission from the author. THIS DOCUMENT IS PROVIDED BY WARNER LOSH ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL WARNER LOSH BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. If you find this document useful, and you want to, you may buy the author a beer. Contact Warner Losh if you have any questions about your use of this document. $FreeBSD$ Index: head/lib/libc/sys/jail.2 =================================================================== --- head/lib/libc/sys/jail.2 (revision 192894) +++ head/lib/libc/sys/jail.2 (revision 192895) @@ -1,432 +1,448 @@ .\" Copyright (c) 1999 Poul-Henning Kamp. .\" Copyright (c) 2009 James Gritton. .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd April 29, 2009 +.Dd May 27, 2009 .Dt JAIL 2 .Os .Sh NAME .Nm jail , .Nm jail_get , .Nm jail_set , .Nm jail_remove , .Nm jail_attach .Nd create and manage system jails .Sh LIBRARY .Lb libc .Sh SYNOPSIS .In sys/param.h .In sys/jail.h .Ft int .Fn jail "struct jail *jail" .Ft int .Fn jail_attach "int jid" .Ft int .Fn jail_remove "int jid" .In sys/uio.h .Ft int .Fn jail_get "struct iovec *iov" "u_int niov" "int flags" .Ft int .Fn jail_set "struct iovec *iov" "u_int niov" "int flags" .Sh DESCRIPTION The .Fn jail system call sets up a jail and locks the current process in it. 
.Pp The argument is a pointer to a structure describing the prison: .Bd -literal -offset indent struct jail { u_int32_t version; char *path; char *hostname; char *jailname; unsigned int ip4s; unsigned int ip6s; struct in_addr *ip4; struct in6_addr *ip6; }; .Ed .Pp .Dq Li version defines the version of the API in use. .Dv JAIL_API_VERSION is defined for the current version. .Pp The .Dq Li path pointer should be set to the directory which is to be the root of the prison. .Pp The .Dq Li hostname pointer can be set to the hostname of the prison. This can be changed from the inside of the prison. .Pp The .Dq Li jailname pointer is an optional name that can be assigned to the jail, for example for management purposes. .Pp The .Dq Li ip4s and .Dq Li ip6s give the numbers of IPv4 and IPv6 addresses that will be passed via their respective pointers. .Pp The .Dq Li ip4 and .Dq Li ip6 pointers can be set to arrays of IPv4 and IPv6 addresses to be assigned to the prison, or NULL if none. IPv4 addresses must be in network byte order. .Pp This is equivalent to the .Fn jail_set system call (see below), with the parameters .Va path , .Va host.hostname , .Va name , .Va ip4.addr , and .Va ip6.addr , and with the .Dv JAIL_ATTACH flag. .Pp The .Fn jail_set system call creates a new jail, or modifies an existing one, and optionally locks the current process in it. Jail parameters are passed as an array of name-value pairs in the array .Fa iov , containing .Fa niov elements. Parameter names are null-terminated strings, and values may be strings, integers, or other arbitrary data. Some parameters are boolean, and do not have a value (their length is zero) but are set by the name alone with or without a .Dq no prefix, e.g. .Va persist or .Va nopersist . Any parameters not set will be given default values, generally based on the current environment. .Pp Jails have a set of core parameters, and modules can add their own jail parameters. The current set of available parameters, and their formats, can be retrieved via the .Va security.jail.param sysctl MIB entry. Notable parameters include those mentioned in the .Fn jail description above, as well as .Va jid and .Va name , which identify the jail being created or modified. See .Xr jail 8 for more information on the core jail parameters. .Pp The .Fa flags argument consists of one or more of the following flags: .Bl -tag -width indent .It Dv JAIL_CREATE Create a new jail. If a .Va jid or .Va name parameter exists, it must not refer to an existing jail. .It Dv JAIL_UPDATE Modify an existing jail. One of the .Va jid or .Va name parameters must exist, and must refer to an existing jail. If both .Dv JAIL_CREATE and .Dv JAIL_UPDATE are set, a jail will be created if it does not yet exist, and modified if it does exist. .It Dv JAIL_ATTACH In addition to creating or modifying the jail, attach the current process to it, as with the .Fn jail_attach system call. .It Dv JAIL_DYING Allow setting a jail that is in the process of being removed. .El .Pp The .Fn jail_get system call retrieves jail parameters, using the same name-value list as .Fn jail_set in the .Fa iov and .Fa niov arguments. The jail to read can be specified by either .Va jid or .Va name by including those parameters in the list. If they are included but are not intended to be the search key, they should be cleared (zero and the empty string respectively). .Pp The special parameter .Va lastjid can be used to retrieve a list of all jails.
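.Pp
The following is a minimal, hypothetical sketch of creating a
persistent jail with
.Fn jail_set ;
the jail name, path and hostname used here are placeholders, and error
handling is reduced to the essentials.
.Bd -literal -offset indent
#include <sys/param.h>
#include <sys/uio.h>
#include <sys/jail.h>

#include <stdio.h>
#include <string.h>

int
main(void)
{
	struct iovec iov[8];
	int jid;

	/* Placeholder values; adjust for the local system. */
	char name[] = "example";
	char path[] = "/var/jail/example";
	char host[] = "example.org";

	/* Each parameter is a name-value pair of iovecs. */
	iov[0].iov_base = "name";
	iov[0].iov_len = sizeof("name");
	iov[1].iov_base = name;
	iov[1].iov_len = strlen(name) + 1;
	iov[2].iov_base = "path";
	iov[2].iov_len = sizeof("path");
	iov[3].iov_base = path;
	iov[3].iov_len = strlen(path) + 1;
	iov[4].iov_base = "host.hostname";
	iov[4].iov_len = sizeof("host.hostname");
	iov[5].iov_base = host;
	iov[5].iov_len = strlen(host) + 1;

	/* Boolean parameter: the name alone, with a zero-length value. */
	iov[6].iov_base = "persist";
	iov[6].iov_len = sizeof("persist");
	iov[7].iov_base = NULL;
	iov[7].iov_len = 0;

	jid = jail_set(iov, 8, JAIL_CREATE);
	if (jid == -1) {
		perror("jail_set");
		return (1);
	}
	printf("created jail %d\en", jid);
	return (0);
}
.Ed
.Pp
The
.Va lastjid
parameter behaves as follows.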
It will fetch the jail with the jid above and closest to the passed value. The first jail (usually but not always jid 1) can be found by passing a .Va lastjid of zero. .Pp The .Fa flags argument consists of one or more of the following flags: .Bl -tag -width indent .It Dv JAIL_DYING Allow getting a jail that is in the process of being removed. .El .Pp The .Fn jail_attach system call attaches the current process to an existing jail, identified by .Fa jid . .Pp The .Fn jail_remove system call removes the jail identified by .Fa jid . It will kill all processes belonging to the jail, and remove any children of that jail. .Sh RETURN VALUES If successful, .Fn jail , .Fn jail_set , and .Fn jail_get return a non-negative integer, termed the jail identifier (JID). They return \-1 on failure, and set .Va errno to indicate the error. .Pp .Rv -std jail_attach jail_remove .Sh PRISON? Once a process has been put in a prison, it and its descendants cannot escape the prison. .Pp Inside the prison, the concept of .Dq superuser is very diluted. In general, it can be assumed that nothing can be mangled from inside a prison which does not exist entirely inside that prison. For instance the directory tree below .Dq Li path can be manipulated all the ways a root can normally do it, including .Dq Li "rm -rf /*" but new device special nodes cannot be created because they reference shared resources (the device drivers in the kernel). The effective .Dq securelevel for a process is the greater of the global .Dq securelevel or, if present, the per-jail .Dq securelevel . .Pp All IP activity will be forced to happen to/from the IP number specified, which should be an alias on one of the network interfaces. All connections to/from the loopback address .Pf ( Li 127.0.0.1 for IPv4, .Li ::1 for IPv6) will be changed to be to/from the primary address of the jail for the given address family. .Pp It is possible to identify a process as jailed by examining .Dq Li /proc/<pid>/status : it will show a field near the end of the line, either as -a single hyphen for a process at large, or the hostname currently +a single hyphen for a process at large, or the name currently set for the prison for jailed processes. .Sh ERRORS The .Fn jail system call will fail if: .Bl -tag -width Er .It Bq Er EPERM -This process is not allowed to create a jail. +This process is not allowed to create a jail, either because it is not +the super-user, or because it is in a jail where the +.Va allow.jails +parameter is not set. .It Bq Er EFAULT .Fa jail points to an address outside the allocated address space of the process. .It Bq Er EINVAL The version number of the argument is not correct. .It Bq Er EAGAIN No free JID could be found. .El .Pp The .Fn jail_set system call will fail if: .Bl -tag -width Er .It Bq Er EPERM -This process is not allowed to create a jail. +This process is not allowed to create a jail, either because it is not +the super-user, or because it is in a jail where the +.Va allow.jails +parameter is not set. .It Bq Er EPERM A jail parameter was set to a less restrictive value than the current environment. .It Bq Er EFAULT .Fa Iov , or one of the addresses contained within it, points to an address outside the allocated address space of the process. .It Bq Er ENOENT The jail referred to by a .Va jid or .Va name parameter does not exist, and the .Dv JAIL_CREATE flag is not set. +.It Bq Er ENOENT +The jail referred to by a +.Va jid +is not accessible by the process, because the process is in a different +jail.
.It Bq Er EEXIST The jail referred to by a .Va jid or .Va name parameter exists, and the .Dv JAIL_UPDATE flag is not set. .It Bq Er EINVAL A supplied parameter is the wrong size. .It Bq Er EINVAL A supplied parameter is out of range. .It Bq Er EINVAL A supplied string parameter is not null-terminated. .It Bq Er EINVAL A supplied parameter name does not match any known parameters. .It Bq Er EINVAL One of the .Dv JAIL_CREATE or .Dv JAIL_UPDATE flags is not set. .It Bq Er ENAMETOOLONG A supplied string parameter is longer than allowed. .It Bq Er EAGAIN There are no jail IDs left. .El .Pp The .Fn jail_get system call will fail if: .Bl -tag -width Er .It Bq Er EFAULT .Fa Iov , or one of the addresses contained within it, points to an address outside the allocated address space of the process. .It Bq Er ENOENT The jail referred to by a .Va jid or .Va name parameter does not exist. .It Bq Er ENOENT +The jail referred to by a +.Va jid +is not accessible by the process, because the process is in a different +jail. +.It Bq Er ENOENT The .Va lastjid parameter is greater than the highest current jail ID. .It Bq Er EINVAL A supplied parameter is the wrong size. .It Bq Er EINVAL A supplied parameter name does not match any known parameters. .El .Pp The .Fn jail_attach and .Fn jail_remove system calls will fail if: .Bl -tag -width Er .It Bq Er EINVAL The jail specified by .Fa jid does not exist. .El .Pp Further .Fn jail , .Fn jail_set , and .Fn jail_attach call .Xr chroot 2 internally, so they can fail for all the same reasons. Please consult the .Xr chroot 2 manual page for details. .Sh SEE ALSO .Xr chdir 2 , .Xr chroot 2 , .Xr jail 8 .Sh HISTORY The .Fn jail system call appeared in .Fx 4.0 . The .Fn jail_attach system call appeared in .Fx 5.1 . The .Fn jail_set , .Fn jail_get , and .Fn jail_remove system calls appeared in .Fx 8.0 . .Sh AUTHORS The jail feature was written by .An Poul-Henning Kamp for R&D Associates .Dq Li http://www.rndassociates.com/ who contributed it to .Fx . .An James Gritton -added the extensible jail parameters. +added the extensible jail parameters and hierarchical jails. Index: head/sys/compat/freebsd32/freebsd32_misc.c =================================================================== --- head/sys/compat/freebsd32/freebsd32_misc.c (revision 192894) +++ head/sys/compat/freebsd32/freebsd32_misc.c (revision 192895) @@ -1,2961 +1,2827 @@ /*- * Copyright (c) 2002 Doug Rabson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_inet.h" #include "opt_inet6.h" #include #include #include #include #include #include #include #include #include #include #include #include #include /* Must come after sys/malloc.h */ #include #include #include #include #include #include #include #include #include #include #include #include /* Must come after sys/selinfo.h */ #include /* Must come after sys/selinfo.h */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #include #include CTASSERT(sizeof(struct timeval32) == 8); CTASSERT(sizeof(struct timespec32) == 8); CTASSERT(sizeof(struct itimerval32) == 16); CTASSERT(sizeof(struct statfs32) == 256); CTASSERT(sizeof(struct rusage32) == 72); CTASSERT(sizeof(struct sigaltstack32) == 12); CTASSERT(sizeof(struct kevent32) == 20); CTASSERT(sizeof(struct iovec32) == 8); CTASSERT(sizeof(struct msghdr32) == 28); CTASSERT(sizeof(struct stat32) == 96); CTASSERT(sizeof(struct sigaction32) == 24); -extern int jail_max_af_ips; - static int freebsd32_kevent_copyout(void *arg, struct kevent *kevp, int count); static int freebsd32_kevent_copyin(void *arg, struct kevent *kevp, int count); int freebsd32_wait4(struct thread *td, struct freebsd32_wait4_args *uap) { int error, status; struct rusage32 ru32; struct rusage ru, *rup; if (uap->rusage != NULL) rup = &ru; else rup = NULL; error = kern_wait(td, uap->pid, &status, uap->options, rup); if (error) return (error); if (uap->status != NULL) error = copyout(&status, uap->status, sizeof(status)); if (uap->rusage != NULL && error == 0) { TV_CP(ru, ru32, ru_utime); TV_CP(ru, ru32, ru_stime); CP(ru, ru32, ru_maxrss); CP(ru, ru32, ru_ixrss); CP(ru, ru32, ru_idrss); CP(ru, ru32, ru_isrss); CP(ru, ru32, ru_minflt); CP(ru, ru32, ru_majflt); CP(ru, ru32, ru_nswap); CP(ru, ru32, ru_inblock); CP(ru, ru32, ru_oublock); CP(ru, ru32, ru_msgsnd); CP(ru, ru32, ru_msgrcv); CP(ru, ru32, ru_nsignals); CP(ru, ru32, ru_nvcsw); CP(ru, ru32, ru_nivcsw); error = copyout(&ru32, uap->rusage, sizeof(ru32)); } return (error); } #ifdef COMPAT_FREEBSD4 static void copy_statfs(struct statfs *in, struct statfs32 *out) { statfs_scale_blocks(in, INT32_MAX); bzero(out, sizeof(*out)); CP(*in, *out, f_bsize); out->f_iosize = MIN(in->f_iosize, INT32_MAX); CP(*in, *out, f_blocks); CP(*in, *out, f_bfree); CP(*in, *out, f_bavail); out->f_files = MIN(in->f_files, INT32_MAX); out->f_ffree = MIN(in->f_ffree, INT32_MAX); CP(*in, *out, f_fsid); CP(*in, *out, f_owner); CP(*in, *out, f_type); CP(*in, *out, f_flags); out->f_syncwrites = MIN(in->f_syncwrites, INT32_MAX); out->f_asyncwrites = MIN(in->f_asyncwrites, INT32_MAX); strlcpy(out->f_fstypename, in->f_fstypename, MFSNAMELEN); strlcpy(out->f_mntonname, in->f_mntonname, min(MNAMELEN, FREEBSD4_MNAMELEN)); 
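	/*
	 * The remaining 64-bit I/O counters are clamped to INT32_MAX below,
	 * so that the 32-bit statfs consumer never sees a value that would
	 * overflow its narrower fields.
	 */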
out->f_syncreads = MIN(in->f_syncreads, INT32_MAX); out->f_asyncreads = MIN(in->f_asyncreads, INT32_MAX); strlcpy(out->f_mntfromname, in->f_mntfromname, min(MNAMELEN, FREEBSD4_MNAMELEN)); } #endif #ifdef COMPAT_FREEBSD4 int freebsd4_freebsd32_getfsstat(struct thread *td, struct freebsd4_freebsd32_getfsstat_args *uap) { struct statfs *buf, *sp; struct statfs32 stat32; size_t count, size; int error; count = uap->bufsize / sizeof(struct statfs32); size = count * sizeof(struct statfs); error = kern_getfsstat(td, &buf, size, UIO_SYSSPACE, uap->flags); if (size > 0) { count = td->td_retval[0]; sp = buf; while (count > 0 && error == 0) { copy_statfs(sp, &stat32); error = copyout(&stat32, uap->buf, sizeof(stat32)); sp++; uap->buf++; count--; } free(buf, M_TEMP); } return (error); } #endif int freebsd32_sigaltstack(struct thread *td, struct freebsd32_sigaltstack_args *uap) { struct sigaltstack32 s32; struct sigaltstack ss, oss, *ssp; int error; if (uap->ss != NULL) { error = copyin(uap->ss, &s32, sizeof(s32)); if (error) return (error); PTRIN_CP(s32, ss, ss_sp); CP(s32, ss, ss_size); CP(s32, ss, ss_flags); ssp = &ss; } else ssp = NULL; error = kern_sigaltstack(td, ssp, &oss); if (error == 0 && uap->oss != NULL) { PTROUT_CP(oss, s32, ss_sp); CP(oss, s32, ss_size); CP(oss, s32, ss_flags); error = copyout(&s32, uap->oss, sizeof(s32)); } return (error); } /* * Custom version of exec_copyin_args() so that we can translate * the pointers. */ static int freebsd32_exec_copyin_args(struct image_args *args, char *fname, enum uio_seg segflg, u_int32_t *argv, u_int32_t *envv) { char *argp, *envp; u_int32_t *p32, arg; size_t length; int error; bzero(args, sizeof(*args)); if (argv == NULL) return (EFAULT); /* * Allocate temporary demand zeroed space for argument and * environment strings */ args->buf = (char *) kmem_alloc_wait(exec_map, PATH_MAX + ARG_MAX + MAXSHELLCMDLEN); if (args->buf == NULL) return (ENOMEM); args->begin_argv = args->buf; args->endp = args->begin_argv; args->stringspace = ARG_MAX; /* * Copy the file name. */ if (fname != NULL) { args->fname = args->buf + ARG_MAX; error = (segflg == UIO_SYSSPACE) ? 
copystr(fname, args->fname, PATH_MAX, &length) : copyinstr(fname, args->fname, PATH_MAX, &length); if (error != 0) goto err_exit; } else args->fname = NULL; /* * extract arguments first */ p32 = argv; for (;;) { error = copyin(p32++, &arg, sizeof(arg)); if (error) goto err_exit; if (arg == 0) break; argp = PTRIN(arg); error = copyinstr(argp, args->endp, args->stringspace, &length); if (error) { if (error == ENAMETOOLONG) error = E2BIG; goto err_exit; } args->stringspace -= length; args->endp += length; args->argc++; } args->begin_envv = args->endp; /* * extract environment strings */ if (envv) { p32 = envv; for (;;) { error = copyin(p32++, &arg, sizeof(arg)); if (error) goto err_exit; if (arg == 0) break; envp = PTRIN(arg); error = copyinstr(envp, args->endp, args->stringspace, &length); if (error) { if (error == ENAMETOOLONG) error = E2BIG; goto err_exit; } args->stringspace -= length; args->endp += length; args->envc++; } } return (0); err_exit: kmem_free_wakeup(exec_map, (vm_offset_t)args->buf, PATH_MAX + ARG_MAX + MAXSHELLCMDLEN); args->buf = NULL; return (error); } int freebsd32_execve(struct thread *td, struct freebsd32_execve_args *uap) { struct image_args eargs; int error; error = freebsd32_exec_copyin_args(&eargs, uap->fname, UIO_USERSPACE, uap->argv, uap->envv); if (error == 0) error = kern_execve(td, &eargs, NULL); return (error); } int freebsd32_fexecve(struct thread *td, struct freebsd32_fexecve_args *uap) { struct image_args eargs; int error; error = freebsd32_exec_copyin_args(&eargs, NULL, UIO_SYSSPACE, uap->argv, uap->envv); if (error == 0) { eargs.fd = uap->fd; error = kern_execve(td, &eargs, NULL); } return (error); } #ifdef __ia64__ static int freebsd32_mmap_partial(struct thread *td, vm_offset_t start, vm_offset_t end, int prot, int fd, off_t pos) { vm_map_t map; vm_map_entry_t entry; int rv; map = &td->td_proc->p_vmspace->vm_map; if (fd != -1) prot |= VM_PROT_WRITE; if (vm_map_lookup_entry(map, start, &entry)) { if ((entry->protection & prot) != prot) { rv = vm_map_protect(map, trunc_page(start), round_page(end), entry->protection | prot, FALSE); if (rv != KERN_SUCCESS) return (EINVAL); } } else { vm_offset_t addr = trunc_page(start); rv = vm_map_find(map, 0, 0, &addr, PAGE_SIZE, FALSE, prot, VM_PROT_ALL, 0); if (rv != KERN_SUCCESS) return (EINVAL); } if (fd != -1) { struct pread_args r; r.fd = fd; r.buf = (void *) start; r.nbyte = end - start; r.offset = pos; return (pread(td, &r)); } else { while (start < end) { subyte((void *) start, 0); start++; } return (0); } } #endif int freebsd32_mmap(struct thread *td, struct freebsd32_mmap_args *uap) { struct mmap_args ap; vm_offset_t addr = (vm_offset_t) uap->addr; vm_size_t len = uap->len; int prot = uap->prot; int flags = uap->flags; int fd = uap->fd; off_t pos = (uap->poslo | ((off_t)uap->poshi << 32)); #ifdef __ia64__ vm_size_t pageoff; int error; /* * Attempt to handle page size hassles. */ pageoff = (pos & PAGE_MASK); if (flags & MAP_FIXED) { vm_offset_t start, end; start = addr; end = addr + len; if (start != trunc_page(start)) { error = freebsd32_mmap_partial(td, start, round_page(start), prot, fd, pos); if (fd != -1) pos += round_page(start) - start; start = round_page(start); } if (end != round_page(end)) { vm_offset_t t = trunc_page(end); error = freebsd32_mmap_partial(td, t, end, prot, fd, pos + t - start); end = trunc_page(end); } if (end > start && fd != -1 && (pos & PAGE_MASK)) { /* * We can't map this region at all. The specified * address doesn't have the same alignment as the file * position. 
Fake the mapping by simply reading the * entire region into memory. First we need to make * sure the region exists. */ vm_map_t map; struct pread_args r; int rv; prot |= VM_PROT_WRITE; map = &td->td_proc->p_vmspace->vm_map; rv = vm_map_remove(map, start, end); if (rv != KERN_SUCCESS) return (EINVAL); rv = vm_map_find(map, 0, 0, &start, end - start, FALSE, prot, VM_PROT_ALL, 0); if (rv != KERN_SUCCESS) return (EINVAL); r.fd = fd; r.buf = (void *) start; r.nbyte = end - start; r.offset = pos; error = pread(td, &r); if (error) return (error); td->td_retval[0] = addr; return (0); } if (end == start) { /* * After dealing with the ragged ends, there * might be none left. */ td->td_retval[0] = addr; return (0); } addr = start; len = end - start; } #endif ap.addr = (void *) addr; ap.len = len; ap.prot = prot; ap.flags = flags; ap.fd = fd; ap.pos = pos; return (mmap(td, &ap)); } #ifdef COMPAT_FREEBSD6 int freebsd6_freebsd32_mmap(struct thread *td, struct freebsd6_freebsd32_mmap_args *uap) { struct freebsd32_mmap_args ap; ap.addr = uap->addr; ap.len = uap->len; ap.prot = uap->prot; ap.flags = uap->flags; ap.fd = uap->fd; ap.poslo = uap->poslo; ap.poshi = uap->poshi; return (freebsd32_mmap(td, &ap)); } #endif int freebsd32_setitimer(struct thread *td, struct freebsd32_setitimer_args *uap) { struct itimerval itv, oitv, *itvp; struct itimerval32 i32; int error; if (uap->itv != NULL) { error = copyin(uap->itv, &i32, sizeof(i32)); if (error) return (error); TV_CP(i32, itv, it_interval); TV_CP(i32, itv, it_value); itvp = &itv; } else itvp = NULL; error = kern_setitimer(td, uap->which, itvp, &oitv); if (error || uap->oitv == NULL) return (error); TV_CP(oitv, i32, it_interval); TV_CP(oitv, i32, it_value); return (copyout(&i32, uap->oitv, sizeof(i32))); } int freebsd32_getitimer(struct thread *td, struct freebsd32_getitimer_args *uap) { struct itimerval itv; struct itimerval32 i32; int error; error = kern_getitimer(td, uap->which, &itv); if (error || uap->itv == NULL) return (error); TV_CP(itv, i32, it_interval); TV_CP(itv, i32, it_value); return (copyout(&i32, uap->itv, sizeof(i32))); } int freebsd32_select(struct thread *td, struct freebsd32_select_args *uap) { struct timeval32 tv32; struct timeval tv, *tvp; int error; if (uap->tv != NULL) { error = copyin(uap->tv, &tv32, sizeof(tv32)); if (error) return (error); CP(tv32, tv, tv_sec); CP(tv32, tv, tv_usec); tvp = &tv; } else tvp = NULL; /* * XXX big-endian needs to convert the fd_sets too. * XXX Do pointers need PTRIN()? */ return (kern_select(td, uap->nd, uap->in, uap->ou, uap->ex, tvp)); } /* * Copy 'count' items into the destination list pointed to by uap->eventlist. */ static int freebsd32_kevent_copyout(void *arg, struct kevent *kevp, int count) { struct freebsd32_kevent_args *uap; struct kevent32 ks32[KQ_NEVENTS]; int i, error = 0; KASSERT(count <= KQ_NEVENTS, ("count (%d) > KQ_NEVENTS", count)); uap = (struct freebsd32_kevent_args *)arg; for (i = 0; i < count; i++) { CP(kevp[i], ks32[i], ident); CP(kevp[i], ks32[i], filter); CP(kevp[i], ks32[i], flags); CP(kevp[i], ks32[i], fflags); CP(kevp[i], ks32[i], data); PTROUT_CP(kevp[i], ks32[i], udata); } error = copyout(ks32, uap->eventlist, count * sizeof *ks32); if (error == 0) uap->eventlist += count; return (error); } /* * Copy 'count' items from the list pointed to by uap->changelist. 
*/ static int freebsd32_kevent_copyin(void *arg, struct kevent *kevp, int count) { struct freebsd32_kevent_args *uap; struct kevent32 ks32[KQ_NEVENTS]; int i, error = 0; KASSERT(count <= KQ_NEVENTS, ("count (%d) > KQ_NEVENTS", count)); uap = (struct freebsd32_kevent_args *)arg; error = copyin(uap->changelist, ks32, count * sizeof *ks32); if (error) goto done; uap->changelist += count; for (i = 0; i < count; i++) { CP(ks32[i], kevp[i], ident); CP(ks32[i], kevp[i], filter); CP(ks32[i], kevp[i], flags); CP(ks32[i], kevp[i], fflags); CP(ks32[i], kevp[i], data); PTRIN_CP(ks32[i], kevp[i], udata); } done: return (error); } int freebsd32_kevent(struct thread *td, struct freebsd32_kevent_args *uap) { struct timespec32 ts32; struct timespec ts, *tsp; struct kevent_copyops k_ops = { uap, freebsd32_kevent_copyout, freebsd32_kevent_copyin}; int error; if (uap->timeout) { error = copyin(uap->timeout, &ts32, sizeof(ts32)); if (error) return (error); CP(ts32, ts, tv_sec); CP(ts32, ts, tv_nsec); tsp = &ts; } else tsp = NULL; error = kern_kevent(td, uap->fd, uap->nchanges, uap->nevents, &k_ops, tsp); return (error); } int freebsd32_gettimeofday(struct thread *td, struct freebsd32_gettimeofday_args *uap) { struct timeval atv; struct timeval32 atv32; struct timezone rtz; int error = 0; if (uap->tp) { microtime(&atv); CP(atv, atv32, tv_sec); CP(atv, atv32, tv_usec); error = copyout(&atv32, uap->tp, sizeof (atv32)); } if (error == 0 && uap->tzp != NULL) { rtz.tz_minuteswest = tz_minuteswest; rtz.tz_dsttime = tz_dsttime; error = copyout(&rtz, uap->tzp, sizeof (rtz)); } return (error); } int freebsd32_getrusage(struct thread *td, struct freebsd32_getrusage_args *uap) { struct rusage32 s32; struct rusage s; int error; error = kern_getrusage(td, uap->who, &s); if (error) return (error); if (uap->rusage != NULL) { TV_CP(s, s32, ru_utime); TV_CP(s, s32, ru_stime); CP(s, s32, ru_maxrss); CP(s, s32, ru_ixrss); CP(s, s32, ru_idrss); CP(s, s32, ru_isrss); CP(s, s32, ru_minflt); CP(s, s32, ru_majflt); CP(s, s32, ru_nswap); CP(s, s32, ru_inblock); CP(s, s32, ru_oublock); CP(s, s32, ru_msgsnd); CP(s, s32, ru_msgrcv); CP(s, s32, ru_nsignals); CP(s, s32, ru_nvcsw); CP(s, s32, ru_nivcsw); error = copyout(&s32, uap->rusage, sizeof(s32)); } return (error); } static int freebsd32_copyinuio(struct iovec32 *iovp, u_int iovcnt, struct uio **uiop) { struct iovec32 iov32; struct iovec *iov; struct uio *uio; u_int iovlen; int error, i; *uiop = NULL; if (iovcnt > UIO_MAXIOV) return (EINVAL); iovlen = iovcnt * sizeof(struct iovec); uio = malloc(iovlen + sizeof *uio, M_IOV, M_WAITOK); iov = (struct iovec *)(uio + 1); for (i = 0; i < iovcnt; i++) { error = copyin(&iovp[i], &iov32, sizeof(struct iovec32)); if (error) { free(uio, M_IOV); return (error); } iov[i].iov_base = PTRIN(iov32.iov_base); iov[i].iov_len = iov32.iov_len; } uio->uio_iov = iov; uio->uio_iovcnt = iovcnt; uio->uio_segflg = UIO_USERSPACE; uio->uio_offset = -1; uio->uio_resid = 0; for (i = 0; i < iovcnt; i++) { if (iov->iov_len > INT_MAX - uio->uio_resid) { free(uio, M_IOV); return (EINVAL); } uio->uio_resid += iov->iov_len; iov++; } *uiop = uio; return (0); } int freebsd32_readv(struct thread *td, struct freebsd32_readv_args *uap) { struct uio *auio; int error; error = freebsd32_copyinuio(uap->iovp, uap->iovcnt, &auio); if (error) return (error); error = kern_readv(td, uap->fd, auio); free(auio, M_IOV); return (error); } int freebsd32_writev(struct thread *td, struct freebsd32_writev_args *uap) { struct uio *auio; int error; error = freebsd32_copyinuio(uap->iovp, 
uap->iovcnt, &auio); if (error) return (error); error = kern_writev(td, uap->fd, auio); free(auio, M_IOV); return (error); } int freebsd32_preadv(struct thread *td, struct freebsd32_preadv_args *uap) { struct uio *auio; int error; error = freebsd32_copyinuio(uap->iovp, uap->iovcnt, &auio); if (error) return (error); error = kern_preadv(td, uap->fd, auio, uap->offset); free(auio, M_IOV); return (error); } int freebsd32_pwritev(struct thread *td, struct freebsd32_pwritev_args *uap) { struct uio *auio; int error; error = freebsd32_copyinuio(uap->iovp, uap->iovcnt, &auio); if (error) return (error); error = kern_pwritev(td, uap->fd, auio, uap->offset); free(auio, M_IOV); return (error); } static int freebsd32_copyiniov(struct iovec32 *iovp32, u_int iovcnt, struct iovec **iovp, int error) { struct iovec32 iov32; struct iovec *iov; u_int iovlen; int i; *iovp = NULL; if (iovcnt > UIO_MAXIOV) return (error); iovlen = iovcnt * sizeof(struct iovec); iov = malloc(iovlen, M_IOV, M_WAITOK); for (i = 0; i < iovcnt; i++) { error = copyin(&iovp32[i], &iov32, sizeof(struct iovec32)); if (error) { free(iov, M_IOV); return (error); } iov[i].iov_base = PTRIN(iov32.iov_base); iov[i].iov_len = iov32.iov_len; } *iovp = iov; return (0); } static int freebsd32_copyinmsghdr(struct msghdr32 *msg32, struct msghdr *msg) { struct msghdr32 m32; int error; error = copyin(msg32, &m32, sizeof(m32)); if (error) return (error); msg->msg_name = PTRIN(m32.msg_name); msg->msg_namelen = m32.msg_namelen; msg->msg_iov = PTRIN(m32.msg_iov); msg->msg_iovlen = m32.msg_iovlen; msg->msg_control = PTRIN(m32.msg_control); msg->msg_controllen = m32.msg_controllen; msg->msg_flags = m32.msg_flags; return (0); } static int freebsd32_copyoutmsghdr(struct msghdr *msg, struct msghdr32 *msg32) { struct msghdr32 m32; int error; m32.msg_name = PTROUT(msg->msg_name); m32.msg_namelen = msg->msg_namelen; m32.msg_iov = PTROUT(msg->msg_iov); m32.msg_iovlen = msg->msg_iovlen; m32.msg_control = PTROUT(msg->msg_control); m32.msg_controllen = msg->msg_controllen; m32.msg_flags = msg->msg_flags; error = copyout(&m32, msg32, sizeof(m32)); return (error); } #define FREEBSD32_ALIGNBYTES (sizeof(int) - 1) #define FREEBSD32_ALIGN(p) \ (((u_long)(p) + FREEBSD32_ALIGNBYTES) & ~FREEBSD32_ALIGNBYTES) #define FREEBSD32_CMSG_SPACE(l) \ (FREEBSD32_ALIGN(sizeof(struct cmsghdr)) + FREEBSD32_ALIGN(l)) #define FREEBSD32_CMSG_DATA(cmsg) ((unsigned char *)(cmsg) + \ FREEBSD32_ALIGN(sizeof(struct cmsghdr))) static int freebsd32_copy_msg_out(struct msghdr *msg, struct mbuf *control) { struct cmsghdr *cm; void *data; socklen_t clen, datalen; int error; caddr_t ctlbuf; int len, maxlen, copylen; struct mbuf *m; error = 0; len = msg->msg_controllen; maxlen = msg->msg_controllen; msg->msg_controllen = 0; m = control; ctlbuf = msg->msg_control; while (m && len > 0) { cm = mtod(m, struct cmsghdr *); clen = m->m_len; while (cm != NULL) { if (sizeof(struct cmsghdr) > clen || cm->cmsg_len > clen) { error = EINVAL; break; } data = CMSG_DATA(cm); datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data; /* Adjust message length */ cm->cmsg_len = FREEBSD32_ALIGN(sizeof(struct cmsghdr)) + datalen; /* Copy cmsghdr */ copylen = sizeof(struct cmsghdr); if (len < copylen) { msg->msg_flags |= MSG_CTRUNC; copylen = len; } error = copyout(cm,ctlbuf,copylen); if (error) goto exit; ctlbuf += FREEBSD32_ALIGN(copylen); len -= FREEBSD32_ALIGN(copylen); if (len <= 0) break; /* Copy data */ copylen = datalen; if (len < copylen) { msg->msg_flags |= MSG_CTRUNC; copylen = len; } error = 
copyout(data,ctlbuf,copylen); if (error) goto exit; ctlbuf += FREEBSD32_ALIGN(copylen); len -= FREEBSD32_ALIGN(copylen); if (CMSG_SPACE(datalen) < clen) { clen -= CMSG_SPACE(datalen); cm = (struct cmsghdr *) ((caddr_t)cm + CMSG_SPACE(datalen)); } else { clen = 0; cm = NULL; } } m = m->m_next; } msg->msg_controllen = (len <= 0) ? maxlen : ctlbuf - (caddr_t)msg->msg_control; exit: return (error); } int freebsd32_recvmsg(td, uap) struct thread *td; struct freebsd32_recvmsg_args /* { int s; struct msghdr32 *msg; int flags; } */ *uap; { struct msghdr msg; struct msghdr32 m32; struct iovec *uiov, *iov; struct mbuf *control = NULL; struct mbuf **controlp; int error; error = copyin(uap->msg, &m32, sizeof(m32)); if (error) return (error); error = freebsd32_copyinmsghdr(uap->msg, &msg); if (error) return (error); error = freebsd32_copyiniov(PTRIN(m32.msg_iov), m32.msg_iovlen, &iov, EMSGSIZE); if (error) return (error); msg.msg_flags = uap->flags; uiov = msg.msg_iov; msg.msg_iov = iov; controlp = (msg.msg_control != NULL) ? &control : NULL; error = kern_recvit(td, uap->s, &msg, UIO_USERSPACE, controlp); if (error == 0) { msg.msg_iov = uiov; if (control != NULL) error = freebsd32_copy_msg_out(&msg, control); if (error == 0) error = freebsd32_copyoutmsghdr(&msg, uap->msg); } free(iov, M_IOV); if (control != NULL) m_freem(control); return (error); } static int freebsd32_convert_msg_in(struct mbuf **controlp) { struct mbuf *control = *controlp; struct cmsghdr *cm = mtod(control, struct cmsghdr *); void *data; socklen_t clen = control->m_len, datalen; int error; error = 0; *controlp = NULL; while (cm != NULL) { if (sizeof(struct cmsghdr) > clen || cm->cmsg_len > clen) { error = EINVAL; break; } data = FREEBSD32_CMSG_DATA(cm); datalen = (caddr_t)cm + cm->cmsg_len - (caddr_t)data; *controlp = sbcreatecontrol(data, datalen, cm->cmsg_type, cm->cmsg_level); controlp = &(*controlp)->m_next; if (FREEBSD32_CMSG_SPACE(datalen) < clen) { clen -= FREEBSD32_CMSG_SPACE(datalen); cm = (struct cmsghdr *) ((caddr_t)cm + FREEBSD32_CMSG_SPACE(datalen)); } else { clen = 0; cm = NULL; } } m_freem(control); return (error); } int freebsd32_sendmsg(struct thread *td, struct freebsd32_sendmsg_args *uap) { struct msghdr msg; struct msghdr32 m32; struct iovec *iov; struct mbuf *control = NULL; struct sockaddr *to = NULL; int error; error = copyin(uap->msg, &m32, sizeof(m32)); if (error) return (error); error = freebsd32_copyinmsghdr(uap->msg, &msg); if (error) return (error); error = freebsd32_copyiniov(PTRIN(m32.msg_iov), m32.msg_iovlen, &iov, EMSGSIZE); if (error) return (error); msg.msg_iov = iov; if (msg.msg_name != NULL) { error = getsockaddr(&to, msg.msg_name, msg.msg_namelen); if (error) { to = NULL; goto out; } msg.msg_name = to; } if (msg.msg_control) { if (msg.msg_controllen < sizeof(struct cmsghdr)) { error = EINVAL; goto out; } error = sockargs(&control, msg.msg_control, msg.msg_controllen, MT_CONTROL); if (error) goto out; error = freebsd32_convert_msg_in(&control); if (error) goto out; } error = kern_sendit(td, uap->s, &msg, uap->flags, control, UIO_USERSPACE); out: free(iov, M_IOV); if (to) free(to, M_SONAME); return (error); } int freebsd32_recvfrom(struct thread *td, struct freebsd32_recvfrom_args *uap) { struct msghdr msg; struct iovec aiov; int error; if (uap->fromlenaddr) { error = copyin(PTRIN(uap->fromlenaddr), &msg.msg_namelen, sizeof(msg.msg_namelen)); if (error) return (error); } else { msg.msg_namelen = 0; } msg.msg_name = PTRIN(uap->from); msg.msg_iov = &aiov; msg.msg_iovlen = 1; aiov.iov_base = 
PTRIN(uap->buf); aiov.iov_len = uap->len; msg.msg_control = NULL; msg.msg_flags = uap->flags; error = kern_recvit(td, uap->s, &msg, UIO_USERSPACE, NULL); if (error == 0 && uap->fromlenaddr) error = copyout(&msg.msg_namelen, PTRIN(uap->fromlenaddr), sizeof (msg.msg_namelen)); return (error); } int freebsd32_settimeofday(struct thread *td, struct freebsd32_settimeofday_args *uap) { struct timeval32 tv32; struct timeval tv, *tvp; struct timezone tz, *tzp; int error; if (uap->tv) { error = copyin(uap->tv, &tv32, sizeof(tv32)); if (error) return (error); CP(tv32, tv, tv_sec); CP(tv32, tv, tv_usec); tvp = &tv; } else tvp = NULL; if (uap->tzp) { error = copyin(uap->tzp, &tz, sizeof(tz)); if (error) return (error); tzp = &tz; } else tzp = NULL; return (kern_settimeofday(td, tvp, tzp)); } int freebsd32_utimes(struct thread *td, struct freebsd32_utimes_args *uap) { struct timeval32 s32[2]; struct timeval s[2], *sp; int error; if (uap->tptr != NULL) { error = copyin(uap->tptr, s32, sizeof(s32)); if (error) return (error); CP(s32[0], s[0], tv_sec); CP(s32[0], s[0], tv_usec); CP(s32[1], s[1], tv_sec); CP(s32[1], s[1], tv_usec); sp = s; } else sp = NULL; return (kern_utimes(td, uap->path, UIO_USERSPACE, sp, UIO_SYSSPACE)); } int freebsd32_lutimes(struct thread *td, struct freebsd32_lutimes_args *uap) { struct timeval32 s32[2]; struct timeval s[2], *sp; int error; if (uap->tptr != NULL) { error = copyin(uap->tptr, s32, sizeof(s32)); if (error) return (error); CP(s32[0], s[0], tv_sec); CP(s32[0], s[0], tv_usec); CP(s32[1], s[1], tv_sec); CP(s32[1], s[1], tv_usec); sp = s; } else sp = NULL; return (kern_lutimes(td, uap->path, UIO_USERSPACE, sp, UIO_SYSSPACE)); } int freebsd32_futimes(struct thread *td, struct freebsd32_futimes_args *uap) { struct timeval32 s32[2]; struct timeval s[2], *sp; int error; if (uap->tptr != NULL) { error = copyin(uap->tptr, s32, sizeof(s32)); if (error) return (error); CP(s32[0], s[0], tv_sec); CP(s32[0], s[0], tv_usec); CP(s32[1], s[1], tv_sec); CP(s32[1], s[1], tv_usec); sp = s; } else sp = NULL; return (kern_futimes(td, uap->fd, sp, UIO_SYSSPACE)); } int freebsd32_futimesat(struct thread *td, struct freebsd32_futimesat_args *uap) { struct timeval32 s32[2]; struct timeval s[2], *sp; int error; if (uap->times != NULL) { error = copyin(uap->times, s32, sizeof(s32)); if (error) return (error); CP(s32[0], s[0], tv_sec); CP(s32[0], s[0], tv_usec); CP(s32[1], s[1], tv_sec); CP(s32[1], s[1], tv_usec); sp = s; } else sp = NULL; return (kern_utimesat(td, uap->fd, uap->path, UIO_USERSPACE, sp, UIO_SYSSPACE)); } int freebsd32_adjtime(struct thread *td, struct freebsd32_adjtime_args *uap) { struct timeval32 tv32; struct timeval delta, olddelta, *deltap; int error; if (uap->delta) { error = copyin(uap->delta, &tv32, sizeof(tv32)); if (error) return (error); CP(tv32, delta, tv_sec); CP(tv32, delta, tv_usec); deltap = δ } else deltap = NULL; error = kern_adjtime(td, deltap, &olddelta); if (uap->olddelta && error == 0) { CP(olddelta, tv32, tv_sec); CP(olddelta, tv32, tv_usec); error = copyout(&tv32, uap->olddelta, sizeof(tv32)); } return (error); } #ifdef COMPAT_FREEBSD4 int freebsd4_freebsd32_statfs(struct thread *td, struct freebsd4_freebsd32_statfs_args *uap) { struct statfs32 s32; struct statfs s; int error; error = kern_statfs(td, uap->path, UIO_USERSPACE, &s); if (error) return (error); copy_statfs(&s, &s32); return (copyout(&s32, uap->buf, sizeof(s32))); } #endif #ifdef COMPAT_FREEBSD4 int freebsd4_freebsd32_fstatfs(struct thread *td, struct freebsd4_freebsd32_fstatfs_args *uap) { 
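	/*
	 * Fetch the native statfs for the descriptor, then narrow it to the
	 * 32-bit FreeBSD 4 layout before copying it out to user space.
	 */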
struct statfs32 s32; struct statfs s; int error; error = kern_fstatfs(td, uap->fd, &s); if (error) return (error); copy_statfs(&s, &s32); return (copyout(&s32, uap->buf, sizeof(s32))); } #endif #ifdef COMPAT_FREEBSD4 int freebsd4_freebsd32_fhstatfs(struct thread *td, struct freebsd4_freebsd32_fhstatfs_args *uap) { struct statfs32 s32; struct statfs s; fhandle_t fh; int error; if ((error = copyin(uap->u_fhp, &fh, sizeof(fhandle_t))) != 0) return (error); error = kern_fhstatfs(td, fh, &s); if (error) return (error); copy_statfs(&s, &s32); return (copyout(&s32, uap->buf, sizeof(s32))); } #endif static void freebsd32_ipcperm_in(struct ipc_perm32 *ip32, struct ipc_perm *ip) { CP(*ip32, *ip, cuid); CP(*ip32, *ip, cgid); CP(*ip32, *ip, uid); CP(*ip32, *ip, gid); CP(*ip32, *ip, mode); CP(*ip32, *ip, seq); CP(*ip32, *ip, key); } static void freebsd32_ipcperm_out(struct ipc_perm *ip, struct ipc_perm32 *ip32) { CP(*ip, *ip32, cuid); CP(*ip, *ip32, cgid); CP(*ip, *ip32, uid); CP(*ip, *ip32, gid); CP(*ip, *ip32, mode); CP(*ip, *ip32, seq); CP(*ip, *ip32, key); } int freebsd32_semsys(struct thread *td, struct freebsd32_semsys_args *uap) { switch (uap->which) { case 0: return (freebsd32_semctl(td, (struct freebsd32_semctl_args *)&uap->a2)); default: return (semsys(td, (struct semsys_args *)uap)); } } int freebsd32_semctl(struct thread *td, struct freebsd32_semctl_args *uap) { struct semid_ds32 dsbuf32; struct semid_ds dsbuf; union semun semun; union semun32 arg; register_t rval; int error; switch (uap->cmd) { case SEM_STAT: case IPC_SET: case IPC_STAT: case GETALL: case SETVAL: case SETALL: error = copyin(uap->arg, &arg, sizeof(arg)); if (error) return (error); break; } switch (uap->cmd) { case SEM_STAT: case IPC_STAT: semun.buf = &dsbuf; break; case IPC_SET: error = copyin(PTRIN(arg.buf), &dsbuf32, sizeof(dsbuf32)); if (error) return (error); freebsd32_ipcperm_in(&dsbuf32.sem_perm, &dsbuf.sem_perm); PTRIN_CP(dsbuf32, dsbuf, sem_base); CP(dsbuf32, dsbuf, sem_nsems); CP(dsbuf32, dsbuf, sem_otime); CP(dsbuf32, dsbuf, sem_pad1); CP(dsbuf32, dsbuf, sem_ctime); CP(dsbuf32, dsbuf, sem_pad2); CP(dsbuf32, dsbuf, sem_pad3[0]); CP(dsbuf32, dsbuf, sem_pad3[1]); CP(dsbuf32, dsbuf, sem_pad3[2]); CP(dsbuf32, dsbuf, sem_pad3[3]); semun.buf = &dsbuf; break; case GETALL: case SETALL: semun.array = PTRIN(arg.array); break; case SETVAL: semun.val = arg.val; break; } error = kern_semctl(td, uap->semid, uap->semnum, uap->cmd, &semun, &rval); if (error) return (error); switch (uap->cmd) { case SEM_STAT: case IPC_STAT: freebsd32_ipcperm_out(&dsbuf.sem_perm, &dsbuf32.sem_perm); PTROUT_CP(dsbuf, dsbuf32, sem_base); CP(dsbuf, dsbuf32, sem_nsems); CP(dsbuf, dsbuf32, sem_otime); CP(dsbuf, dsbuf32, sem_pad1); CP(dsbuf, dsbuf32, sem_ctime); CP(dsbuf, dsbuf32, sem_pad2); CP(dsbuf, dsbuf32, sem_pad3[0]); CP(dsbuf, dsbuf32, sem_pad3[1]); CP(dsbuf, dsbuf32, sem_pad3[2]); CP(dsbuf, dsbuf32, sem_pad3[3]); error = copyout(&dsbuf32, PTRIN(arg.buf), sizeof(dsbuf32)); break; } if (error == 0) td->td_retval[0] = rval; return (error); } int freebsd32_msgsys(struct thread *td, struct freebsd32_msgsys_args *uap) { switch (uap->which) { case 0: return (freebsd32_msgctl(td, (struct freebsd32_msgctl_args *)&uap->a2)); case 2: return (freebsd32_msgsnd(td, (struct freebsd32_msgsnd_args *)&uap->a2)); case 3: return (freebsd32_msgrcv(td, (struct freebsd32_msgrcv_args *)&uap->a2)); default: return (msgsys(td, (struct msgsys_args *)uap)); } } int freebsd32_msgctl(struct thread *td, struct freebsd32_msgctl_args *uap) { struct msqid_ds msqbuf; struct 
msqid_ds32 msqbuf32; int error; if (uap->cmd == IPC_SET) { error = copyin(uap->buf, &msqbuf32, sizeof(msqbuf32)); if (error) return (error); freebsd32_ipcperm_in(&msqbuf32.msg_perm, &msqbuf.msg_perm); PTRIN_CP(msqbuf32, msqbuf, msg_first); PTRIN_CP(msqbuf32, msqbuf, msg_last); CP(msqbuf32, msqbuf, msg_cbytes); CP(msqbuf32, msqbuf, msg_qnum); CP(msqbuf32, msqbuf, msg_qbytes); CP(msqbuf32, msqbuf, msg_lspid); CP(msqbuf32, msqbuf, msg_lrpid); CP(msqbuf32, msqbuf, msg_stime); CP(msqbuf32, msqbuf, msg_pad1); CP(msqbuf32, msqbuf, msg_rtime); CP(msqbuf32, msqbuf, msg_pad2); CP(msqbuf32, msqbuf, msg_ctime); CP(msqbuf32, msqbuf, msg_pad3); CP(msqbuf32, msqbuf, msg_pad4[0]); CP(msqbuf32, msqbuf, msg_pad4[1]); CP(msqbuf32, msqbuf, msg_pad4[2]); CP(msqbuf32, msqbuf, msg_pad4[3]); } error = kern_msgctl(td, uap->msqid, uap->cmd, &msqbuf); if (error) return (error); if (uap->cmd == IPC_STAT) { freebsd32_ipcperm_out(&msqbuf.msg_perm, &msqbuf32.msg_perm); PTROUT_CP(msqbuf, msqbuf32, msg_first); PTROUT_CP(msqbuf, msqbuf32, msg_last); CP(msqbuf, msqbuf32, msg_cbytes); CP(msqbuf, msqbuf32, msg_qnum); CP(msqbuf, msqbuf32, msg_qbytes); CP(msqbuf, msqbuf32, msg_lspid); CP(msqbuf, msqbuf32, msg_lrpid); CP(msqbuf, msqbuf32, msg_stime); CP(msqbuf, msqbuf32, msg_pad1); CP(msqbuf, msqbuf32, msg_rtime); CP(msqbuf, msqbuf32, msg_pad2); CP(msqbuf, msqbuf32, msg_ctime); CP(msqbuf, msqbuf32, msg_pad3); CP(msqbuf, msqbuf32, msg_pad4[0]); CP(msqbuf, msqbuf32, msg_pad4[1]); CP(msqbuf, msqbuf32, msg_pad4[2]); CP(msqbuf, msqbuf32, msg_pad4[3]); error = copyout(&msqbuf32, uap->buf, sizeof(struct msqid_ds32)); } return (error); } int freebsd32_msgsnd(struct thread *td, struct freebsd32_msgsnd_args *uap) { const void *msgp; long mtype; int32_t mtype32; int error; msgp = PTRIN(uap->msgp); if ((error = copyin(msgp, &mtype32, sizeof(mtype32))) != 0) return (error); mtype = mtype32; return (kern_msgsnd(td, uap->msqid, (const char *)msgp + sizeof(mtype32), uap->msgsz, uap->msgflg, mtype)); } int freebsd32_msgrcv(struct thread *td, struct freebsd32_msgrcv_args *uap) { void *msgp; long mtype; int32_t mtype32; int error; msgp = PTRIN(uap->msgp); if ((error = kern_msgrcv(td, uap->msqid, (char *)msgp + sizeof(mtype32), uap->msgsz, uap->msgtyp, uap->msgflg, &mtype)) != 0) return (error); mtype32 = (int32_t)mtype; return (copyout(&mtype32, msgp, sizeof(mtype32))); } int freebsd32_shmsys(struct thread *td, struct freebsd32_shmsys_args *uap) { switch (uap->which) { case 0: { /* shmat */ struct shmat_args ap; ap.shmid = uap->a2; ap.shmaddr = PTRIN(uap->a3); ap.shmflg = uap->a4; return (sysent[SYS_shmat].sy_call(td, &ap)); } case 2: { /* shmdt */ struct shmdt_args ap; ap.shmaddr = PTRIN(uap->a2); return (sysent[SYS_shmdt].sy_call(td, &ap)); } case 3: { /* shmget */ struct shmget_args ap; ap.key = uap->a2; ap.size = uap->a3; ap.shmflg = uap->a4; return (sysent[SYS_shmget].sy_call(td, &ap)); } case 4: { /* shmctl */ struct freebsd32_shmctl_args ap; ap.shmid = uap->a2; ap.cmd = uap->a3; ap.buf = PTRIN(uap->a4); return (freebsd32_shmctl(td, &ap)); } case 1: /* oshmctl */ default: return (EINVAL); } } int freebsd32_shmctl(struct thread *td, struct freebsd32_shmctl_args *uap) { int error = 0; union { struct shmid_ds shmid_ds; struct shm_info shm_info; struct shminfo shminfo; } u; union { struct shmid_ds32 shmid_ds32; struct shm_info32 shm_info32; struct shminfo32 shminfo32; } u32; size_t sz; if (uap->cmd == IPC_SET) { if ((error = copyin(uap->buf, &u32.shmid_ds32, sizeof(u32.shmid_ds32)))) goto done; freebsd32_ipcperm_in(&u32.shmid_ds32.shm_perm, 
&u.shmid_ds.shm_perm); CP(u32.shmid_ds32, u.shmid_ds, shm_segsz); CP(u32.shmid_ds32, u.shmid_ds, shm_lpid); CP(u32.shmid_ds32, u.shmid_ds, shm_cpid); CP(u32.shmid_ds32, u.shmid_ds, shm_nattch); CP(u32.shmid_ds32, u.shmid_ds, shm_atime); CP(u32.shmid_ds32, u.shmid_ds, shm_dtime); CP(u32.shmid_ds32, u.shmid_ds, shm_ctime); PTRIN_CP(u32.shmid_ds32, u.shmid_ds, shm_internal); } error = kern_shmctl(td, uap->shmid, uap->cmd, (void *)&u, &sz); if (error) goto done; /* Cases in which we need to copyout */ switch (uap->cmd) { case IPC_INFO: CP(u.shminfo, u32.shminfo32, shmmax); CP(u.shminfo, u32.shminfo32, shmmin); CP(u.shminfo, u32.shminfo32, shmmni); CP(u.shminfo, u32.shminfo32, shmseg); CP(u.shminfo, u32.shminfo32, shmall); error = copyout(&u32.shminfo32, uap->buf, sizeof(u32.shminfo32)); break; case SHM_INFO: CP(u.shm_info, u32.shm_info32, used_ids); CP(u.shm_info, u32.shm_info32, shm_rss); CP(u.shm_info, u32.shm_info32, shm_tot); CP(u.shm_info, u32.shm_info32, shm_swp); CP(u.shm_info, u32.shm_info32, swap_attempts); CP(u.shm_info, u32.shm_info32, swap_successes); error = copyout(&u32.shm_info32, uap->buf, sizeof(u32.shm_info32)); break; case SHM_STAT: case IPC_STAT: freebsd32_ipcperm_out(&u.shmid_ds.shm_perm, &u32.shmid_ds32.shm_perm); CP(u.shmid_ds, u32.shmid_ds32, shm_segsz); CP(u.shmid_ds, u32.shmid_ds32, shm_lpid); CP(u.shmid_ds, u32.shmid_ds32, shm_cpid); CP(u.shmid_ds, u32.shmid_ds32, shm_nattch); CP(u.shmid_ds, u32.shmid_ds32, shm_atime); CP(u.shmid_ds, u32.shmid_ds32, shm_dtime); CP(u.shmid_ds, u32.shmid_ds32, shm_ctime); PTROUT_CP(u.shmid_ds, u32.shmid_ds32, shm_internal); error = copyout(&u32.shmid_ds32, uap->buf, sizeof(u32.shmid_ds32)); break; } done: if (error) { /* Invalidate the return value */ td->td_retval[0] = -1; } return (error); } int freebsd32_pread(struct thread *td, struct freebsd32_pread_args *uap) { struct pread_args ap; ap.fd = uap->fd; ap.buf = uap->buf; ap.nbyte = uap->nbyte; ap.offset = (uap->offsetlo | ((off_t)uap->offsethi << 32)); return (pread(td, &ap)); } int freebsd32_pwrite(struct thread *td, struct freebsd32_pwrite_args *uap) { struct pwrite_args ap; ap.fd = uap->fd; ap.buf = uap->buf; ap.nbyte = uap->nbyte; ap.offset = (uap->offsetlo | ((off_t)uap->offsethi << 32)); return (pwrite(td, &ap)); } int freebsd32_lseek(struct thread *td, struct freebsd32_lseek_args *uap) { int error; struct lseek_args ap; off_t pos; ap.fd = uap->fd; ap.offset = (uap->offsetlo | ((off_t)uap->offsethi << 32)); ap.whence = uap->whence; error = lseek(td, &ap); /* Expand the quad return into two parts for eax and edx */ pos = *(off_t *)(td->td_retval); td->td_retval[0] = pos & 0xffffffff; /* %eax */ td->td_retval[1] = pos >> 32; /* %edx */ return error; } int freebsd32_truncate(struct thread *td, struct freebsd32_truncate_args *uap) { struct truncate_args ap; ap.path = uap->path; ap.length = (uap->lengthlo | ((off_t)uap->lengthhi << 32)); return (truncate(td, &ap)); } int freebsd32_ftruncate(struct thread *td, struct freebsd32_ftruncate_args *uap) { struct ftruncate_args ap; ap.fd = uap->fd; ap.length = (uap->lengthlo | ((off_t)uap->lengthhi << 32)); return (ftruncate(td, &ap)); } int freebsd32_getdirentries(struct thread *td, struct freebsd32_getdirentries_args *uap) { long base; int32_t base32; int error; error = kern_getdirentries(td, uap->fd, uap->buf, uap->count, &base); if (error) return (error); if (uap->basep != NULL) { base32 = base; error = copyout(&base32, uap->basep, sizeof(int32_t)); } return (error); } #ifdef COMPAT_FREEBSD6 /* versions with the 'int pad' argument */ 
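/*
 * As in the non-pad variants above, each 64-bit offset or length arrives
 * from the 32-bit ABI as two 32-bit words and is reassembled with
 * (lo | ((off_t)hi << 32)) before the native system call is invoked.
 */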
int freebsd6_freebsd32_pread(struct thread *td, struct freebsd6_freebsd32_pread_args *uap) { struct pread_args ap; ap.fd = uap->fd; ap.buf = uap->buf; ap.nbyte = uap->nbyte; ap.offset = (uap->offsetlo | ((off_t)uap->offsethi << 32)); return (pread(td, &ap)); } int freebsd6_freebsd32_pwrite(struct thread *td, struct freebsd6_freebsd32_pwrite_args *uap) { struct pwrite_args ap; ap.fd = uap->fd; ap.buf = uap->buf; ap.nbyte = uap->nbyte; ap.offset = (uap->offsetlo | ((off_t)uap->offsethi << 32)); return (pwrite(td, &ap)); } int freebsd6_freebsd32_lseek(struct thread *td, struct freebsd6_freebsd32_lseek_args *uap) { int error; struct lseek_args ap; off_t pos; ap.fd = uap->fd; ap.offset = (uap->offsetlo | ((off_t)uap->offsethi << 32)); ap.whence = uap->whence; error = lseek(td, &ap); /* Expand the quad return into two parts for eax and edx */ pos = *(off_t *)(td->td_retval); td->td_retval[0] = pos & 0xffffffff; /* %eax */ td->td_retval[1] = pos >> 32; /* %edx */ return error; } int freebsd6_freebsd32_truncate(struct thread *td, struct freebsd6_freebsd32_truncate_args *uap) { struct truncate_args ap; ap.path = uap->path; ap.length = (uap->lengthlo | ((off_t)uap->lengthhi << 32)); return (truncate(td, &ap)); } int freebsd6_freebsd32_ftruncate(struct thread *td, struct freebsd6_freebsd32_ftruncate_args *uap) { struct ftruncate_args ap; ap.fd = uap->fd; ap.length = (uap->lengthlo | ((off_t)uap->lengthhi << 32)); return (ftruncate(td, &ap)); } #endif /* COMPAT_FREEBSD6 */ struct sf_hdtr32 { uint32_t headers; int hdr_cnt; uint32_t trailers; int trl_cnt; }; static int freebsd32_do_sendfile(struct thread *td, struct freebsd32_sendfile_args *uap, int compat) { struct sendfile_args ap; struct sf_hdtr32 hdtr32; struct sf_hdtr hdtr; struct uio *hdr_uio, *trl_uio; struct iovec32 *iov32; int error; hdr_uio = trl_uio = NULL; ap.fd = uap->fd; ap.s = uap->s; ap.offset = (uap->offsetlo | ((off_t)uap->offsethi << 32)); ap.nbytes = uap->nbytes; ap.hdtr = (struct sf_hdtr *)uap->hdtr; /* XXX not used */ ap.sbytes = uap->sbytes; ap.flags = uap->flags; if (uap->hdtr != NULL) { error = copyin(uap->hdtr, &hdtr32, sizeof(hdtr32)); if (error) goto out; PTRIN_CP(hdtr32, hdtr, headers); CP(hdtr32, hdtr, hdr_cnt); PTRIN_CP(hdtr32, hdtr, trailers); CP(hdtr32, hdtr, trl_cnt); if (hdtr.headers != NULL) { iov32 = PTRIN(hdtr32.headers); error = freebsd32_copyinuio(iov32, hdtr32.hdr_cnt, &hdr_uio); if (error) goto out; } if (hdtr.trailers != NULL) { iov32 = PTRIN(hdtr32.trailers); error = freebsd32_copyinuio(iov32, hdtr32.trl_cnt, &trl_uio); if (error) goto out; } } error = kern_sendfile(td, &ap, hdr_uio, trl_uio, compat); out: if (hdr_uio) free(hdr_uio, M_IOV); if (trl_uio) free(trl_uio, M_IOV); return (error); } #ifdef COMPAT_FREEBSD4 int freebsd4_freebsd32_sendfile(struct thread *td, struct freebsd4_freebsd32_sendfile_args *uap) { return (freebsd32_do_sendfile(td, (struct freebsd32_sendfile_args *)uap, 1)); } #endif int freebsd32_sendfile(struct thread *td, struct freebsd32_sendfile_args *uap) { return (freebsd32_do_sendfile(td, uap, 0)); } static void copy_stat( struct stat *in, struct stat32 *out) { CP(*in, *out, st_dev); CP(*in, *out, st_ino); CP(*in, *out, st_mode); CP(*in, *out, st_nlink); CP(*in, *out, st_uid); CP(*in, *out, st_gid); CP(*in, *out, st_rdev); TS_CP(*in, *out, st_atimespec); TS_CP(*in, *out, st_mtimespec); TS_CP(*in, *out, st_ctimespec); CP(*in, *out, st_size); CP(*in, *out, st_blocks); CP(*in, *out, st_blksize); CP(*in, *out, st_flags); CP(*in, *out, st_gen); } int freebsd32_stat(struct thread *td, struct 
freebsd32_stat_args *uap) { struct stat sb; struct stat32 sb32; int error; error = kern_stat(td, uap->path, UIO_USERSPACE, &sb); if (error) return (error); copy_stat(&sb, &sb32); error = copyout(&sb32, uap->ub, sizeof (sb32)); return (error); } int freebsd32_fstat(struct thread *td, struct freebsd32_fstat_args *uap) { struct stat ub; struct stat32 ub32; int error; error = kern_fstat(td, uap->fd, &ub); if (error) return (error); copy_stat(&ub, &ub32); error = copyout(&ub32, uap->ub, sizeof(ub32)); return (error); } int freebsd32_fstatat(struct thread *td, struct freebsd32_fstatat_args *uap) { struct stat ub; struct stat32 ub32; int error; error = kern_statat(td, uap->flag, uap->fd, uap->path, UIO_USERSPACE, &ub); if (error) return (error); copy_stat(&ub, &ub32); error = copyout(&ub32, uap->buf, sizeof(ub32)); return (error); } int freebsd32_lstat(struct thread *td, struct freebsd32_lstat_args *uap) { struct stat sb; struct stat32 sb32; int error; error = kern_lstat(td, uap->path, UIO_USERSPACE, &sb); if (error) return (error); copy_stat(&sb, &sb32); error = copyout(&sb32, uap->ub, sizeof (sb32)); return (error); } /* * MPSAFE */ int freebsd32_sysctl(struct thread *td, struct freebsd32_sysctl_args *uap) { int error, name[CTL_MAXNAME]; size_t j, oldlen; if (uap->namelen > CTL_MAXNAME || uap->namelen < 2) return (EINVAL); error = copyin(uap->name, name, uap->namelen * sizeof(int)); if (error) return (error); if (uap->oldlenp) oldlen = fuword32(uap->oldlenp); else oldlen = 0; error = userland_sysctl(td, name, uap->namelen, uap->old, &oldlen, 1, uap->new, uap->newlen, &j, SCTL_MASK32); if (error && error != ENOMEM) return (error); if (uap->oldlenp) suword32(uap->oldlenp, j); return (0); } int freebsd32_jail(struct thread *td, struct freebsd32_jail_args *uap) { - struct iovec optiov[10]; - struct uio opt; - char *u_path, *u_hostname, *u_name; -#ifdef INET - struct in_addr *u_ip4; -#endif -#ifdef INET6 - struct in6_addr *u_ip6; -#endif uint32_t version; int error; + struct jail j; error = copyin(uap->jail, &version, sizeof(uint32_t)); if (error) return (error); switch (version) { case 0: { /* FreeBSD single IPv4 jails. 
*/ struct jail32_v0 j32_v0; + bzero(&j, sizeof(struct jail)); error = copyin(uap->jail, &j32_v0, sizeof(struct jail32_v0)); if (error) return (error); - u_path = malloc(MAXPATHLEN + MAXHOSTNAMELEN, M_TEMP, M_WAITOK); - u_hostname = u_path + MAXPATHLEN; - opt.uio_iov = optiov; - opt.uio_iovcnt = 4; - opt.uio_offset = -1; - opt.uio_resid = -1; - opt.uio_segflg = UIO_SYSSPACE; - opt.uio_rw = UIO_READ; - opt.uio_td = td; - optiov[0].iov_base = "path"; - optiov[0].iov_len = sizeof("path"); - optiov[1].iov_base = u_path; - error = copyinstr(PTRIN(j32_v0.path), u_path, MAXPATHLEN, - &optiov[1].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - optiov[2].iov_base = "host.hostname"; - optiov[2].iov_len = sizeof("host.hostname"); - optiov[3].iov_base = u_hostname; - error = copyinstr(PTRIN(j32_v0.hostname), u_hostname, - MAXHOSTNAMELEN, &optiov[3].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } -#ifdef INET - optiov[opt.uio_iovcnt].iov_base = "ip4.addr"; - optiov[opt.uio_iovcnt].iov_len = sizeof("ip4.addr"); - opt.uio_iovcnt++; - optiov[opt.uio_iovcnt].iov_base = &j32_v0.ip_number; - j32_v0.ip_number = htonl(j32_v0.ip_number); - optiov[opt.uio_iovcnt].iov_len = sizeof(j32_v0.ip_number); - opt.uio_iovcnt++; -#endif + CP(j32_v0, j, version); + PTRIN_CP(j32_v0, j, path); + PTRIN_CP(j32_v0, j, hostname); + j.ip4s = j32_v0.ip_number; break; } case 1: /* * Version 1 was used by multi-IPv4 jail implementations * that never made it into the official kernel. */ return (EINVAL); case 2: /* JAIL_API_VERSION */ { /* FreeBSD multi-IPv4/IPv6,noIP jails. */ struct jail32 j32; - size_t tmplen; error = copyin(uap->jail, &j32, sizeof(struct jail32)); if (error) return (error); - tmplen = MAXPATHLEN + MAXHOSTNAMELEN + MAXHOSTNAMELEN; -#ifdef INET - if (j32.ip4s > jail_max_af_ips) - return (EINVAL); - tmplen += j32.ip4s * sizeof(struct in_addr); -#else - if (j32.ip4s > 0) - return (EINVAL); -#endif -#ifdef INET6 - if (j32.ip6s > jail_max_af_ips) - return (EINVAL); - tmplen += j32.ip6s * sizeof(struct in6_addr); -#else - if (j32.ip6s > 0) - return (EINVAL); -#endif - u_path = malloc(tmplen, M_TEMP, M_WAITOK); - u_hostname = u_path + MAXPATHLEN; - u_name = u_hostname + MAXHOSTNAMELEN; -#ifdef INET - u_ip4 = (struct in_addr *)(u_name + MAXHOSTNAMELEN); -#endif -#ifdef INET6 -#ifdef INET - u_ip6 = (struct in6_addr *)(u_ip4 + j32.ip4s); -#else - u_ip6 = (struct in6_addr *)(u_name + MAXHOSTNAMELEN); -#endif -#endif - opt.uio_iov = optiov; - opt.uio_iovcnt = 4; - opt.uio_offset = -1; - opt.uio_resid = -1; - opt.uio_segflg = UIO_SYSSPACE; - opt.uio_rw = UIO_READ; - opt.uio_td = td; - optiov[0].iov_base = "path"; - optiov[0].iov_len = sizeof("path"); - optiov[1].iov_base = u_path; - error = copyinstr(PTRIN(j32.path), u_path, MAXPATHLEN, - &optiov[1].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - optiov[2].iov_base = "host.hostname"; - optiov[2].iov_len = sizeof("host.hostname"); - optiov[3].iov_base = u_hostname; - error = copyinstr(PTRIN(j32.hostname), u_hostname, - MAXHOSTNAMELEN, &optiov[3].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - if (PTRIN(j32.jailname) != NULL) { - optiov[opt.uio_iovcnt].iov_base = "name"; - optiov[opt.uio_iovcnt].iov_len = sizeof("name"); - opt.uio_iovcnt++; - optiov[opt.uio_iovcnt].iov_base = u_name; - error = copyinstr(PTRIN(j32.jailname), u_name, - MAXHOSTNAMELEN, &optiov[opt.uio_iovcnt].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - opt.uio_iovcnt++; - } -#ifdef INET - 
optiov[opt.uio_iovcnt].iov_base = "ip4.addr"; - optiov[opt.uio_iovcnt].iov_len = sizeof("ip4.addr"); - opt.uio_iovcnt++; - optiov[opt.uio_iovcnt].iov_base = u_ip4; - optiov[opt.uio_iovcnt].iov_len = - j32.ip4s * sizeof(struct in_addr); - error = copyin(PTRIN(j32.ip4), u_ip4, - optiov[opt.uio_iovcnt].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - opt.uio_iovcnt++; -#endif -#ifdef INET6 - optiov[opt.uio_iovcnt].iov_base = "ip6.addr"; - optiov[opt.uio_iovcnt].iov_len = sizeof("ip6.addr"); - opt.uio_iovcnt++; - optiov[opt.uio_iovcnt].iov_base = u_ip6; - optiov[opt.uio_iovcnt].iov_len = - j32.ip6s * sizeof(struct in6_addr); - error = copyin(PTRIN(j32.ip6), u_ip6, - optiov[opt.uio_iovcnt].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - opt.uio_iovcnt++; -#endif + CP(j32, j, version); + PTRIN_CP(j32, j, path); + PTRIN_CP(j32, j, hostname); + PTRIN_CP(j32, j, jailname); + CP(j32, j, ip4s); + CP(j32, j, ip6s); + PTRIN_CP(j32, j, ip4); + PTRIN_CP(j32, j, ip6); break; } default: /* Sci-Fi jails are not supported, sorry. */ return (EINVAL); } - error = kern_jail_set(td, &opt, JAIL_CREATE | JAIL_ATTACH); - free(u_path, M_TEMP); - return (error); + return (kern_jail(td, &j)); } int freebsd32_jail_set(struct thread *td, struct freebsd32_jail_set_args *uap) { struct uio *auio; int error; /* Check that we have an even number of iovecs. */ if (uap->iovcnt & 1) return (EINVAL); error = freebsd32_copyinuio(uap->iovp, uap->iovcnt, &auio); if (error) return (error); error = kern_jail_set(td, auio, uap->flags); free(auio, M_IOV); return (error); } int freebsd32_jail_get(struct thread *td, struct freebsd32_jail_get_args *uap) { struct iovec32 iov32; struct uio *auio; int error, i; /* Check that we have an even number of iovecs. 
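 *
 * (Editorial aside, not part of the change: the rewritten freebsd32_jail()
 * above no longer builds an option iovec by hand; it widens the 32-bit
 * struct jail into the native one with the CP()/PTRIN_CP() helpers and
 * hands it to kern_jail().  The sketch below is an illustrative userland
 * model of that widening pattern; the macro definitions and the two toy
 * struct layouts are assumptions, not the real compat headers.)
 */

#include <stdint.h>
#include <stdio.h>

/* Assumed shapes of the compat helpers; the real ones live in the
 * freebsd32 headers. */
#define PTRIN(v)                ((void *)(uintptr_t)(v))
#define CP(src, dst, fld)       do { (dst).fld = (src).fld; } while (0)
#define PTRIN_CP(src, dst, fld) do { (dst).fld = PTRIN((src).fld); } while (0)

struct jail32_like {            /* 32-bit layout: pointers are uint32_t */
        uint32_t        version;
        uint32_t        path;
        uint32_t        hostname;
};

struct jail_like {              /* native layout: real pointers */
        uint32_t        version;
        char            *path;
        char            *hostname;
};

static void
widen(const struct jail32_like *j32, struct jail_like *j)
{
        CP(*j32, *j, version);          /* plain field copy */
        PTRIN_CP(*j32, *j, path);       /* 32-bit user address -> pointer */
        PTRIN_CP(*j32, *j, hostname);
}

int
main(void)
{
        struct jail32_like j32 = { 0, 0x1000, 0x2000 }; /* fake 32-bit addrs */
        struct jail_like j;

        widen(&j32, &j);
        printf("path widened to %p\n", (void *)j.path);
        return (0);
}

/*
 * (End of editorial aside; the interrupted source resumes below.)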
*/ if (uap->iovcnt & 1) return (EINVAL); error = freebsd32_copyinuio(uap->iovp, uap->iovcnt, &auio); if (error) return (error); error = kern_jail_get(td, auio, uap->flags); if (error == 0) for (i = 0; i < uap->iovcnt; i++) { PTROUT_CP(auio->uio_iov[i], iov32, iov_base); CP(auio->uio_iov[i], iov32, iov_len); error = copyout(&iov32, uap->iovp + i, sizeof(iov32)); if (error != 0) break; } free(auio, M_IOV); return (error); } int freebsd32_sigaction(struct thread *td, struct freebsd32_sigaction_args *uap) { struct sigaction32 s32; struct sigaction sa, osa, *sap; int error; if (uap->act) { error = copyin(uap->act, &s32, sizeof(s32)); if (error) return (error); sa.sa_handler = PTRIN(s32.sa_u); CP(s32, sa, sa_flags); CP(s32, sa, sa_mask); sap = &sa; } else sap = NULL; error = kern_sigaction(td, uap->sig, sap, &osa, 0); if (error == 0 && uap->oact != NULL) { s32.sa_u = PTROUT(osa.sa_handler); CP(osa, s32, sa_flags); CP(osa, s32, sa_mask); error = copyout(&s32, uap->oact, sizeof(s32)); } return (error); } #ifdef COMPAT_FREEBSD4 int freebsd4_freebsd32_sigaction(struct thread *td, struct freebsd4_freebsd32_sigaction_args *uap) { struct sigaction32 s32; struct sigaction sa, osa, *sap; int error; if (uap->act) { error = copyin(uap->act, &s32, sizeof(s32)); if (error) return (error); sa.sa_handler = PTRIN(s32.sa_u); CP(s32, sa, sa_flags); CP(s32, sa, sa_mask); sap = &sa; } else sap = NULL; error = kern_sigaction(td, uap->sig, sap, &osa, KSA_FREEBSD4); if (error == 0 && uap->oact != NULL) { s32.sa_u = PTROUT(osa.sa_handler); CP(osa, s32, sa_flags); CP(osa, s32, sa_mask); error = copyout(&s32, uap->oact, sizeof(s32)); } return (error); } #endif #ifdef COMPAT_43 struct osigaction32 { u_int32_t sa_u; osigset_t sa_mask; int sa_flags; }; #define ONSIG 32 int ofreebsd32_sigaction(struct thread *td, struct ofreebsd32_sigaction_args *uap) { struct osigaction32 s32; struct sigaction sa, osa, *sap; int error; if (uap->signum <= 0 || uap->signum >= ONSIG) return (EINVAL); if (uap->nsa) { error = copyin(uap->nsa, &s32, sizeof(s32)); if (error) return (error); sa.sa_handler = PTRIN(s32.sa_u); CP(s32, sa, sa_flags); OSIG2SIG(s32.sa_mask, sa.sa_mask); sap = &sa; } else sap = NULL; error = kern_sigaction(td, uap->signum, sap, &osa, KSA_OSIGSET); if (error == 0 && uap->osa != NULL) { s32.sa_u = PTROUT(osa.sa_handler); CP(osa, s32, sa_flags); SIG2OSIG(osa.sa_mask, s32.sa_mask); error = copyout(&s32, uap->osa, sizeof(s32)); } return (error); } int ofreebsd32_sigprocmask(struct thread *td, struct ofreebsd32_sigprocmask_args *uap) { sigset_t set, oset; int error; OSIG2SIG(uap->mask, set); error = kern_sigprocmask(td, uap->how, &set, &oset, 1); SIG2OSIG(oset, td->td_retval[0]); return (error); } int ofreebsd32_sigpending(struct thread *td, struct ofreebsd32_sigpending_args *uap) { struct proc *p = td->td_proc; sigset_t siglist; PROC_LOCK(p); siglist = p->p_siglist; SIGSETOR(siglist, td->td_siglist); PROC_UNLOCK(p); SIG2OSIG(siglist, td->td_retval[0]); return (0); } struct sigvec32 { u_int32_t sv_handler; int sv_mask; int sv_flags; }; int ofreebsd32_sigvec(struct thread *td, struct ofreebsd32_sigvec_args *uap) { struct sigvec32 vec; struct sigaction sa, osa, *sap; int error; if (uap->signum <= 0 || uap->signum >= ONSIG) return (EINVAL); if (uap->nsv) { error = copyin(uap->nsv, &vec, sizeof(vec)); if (error) return (error); sa.sa_handler = PTRIN(vec.sv_handler); OSIG2SIG(vec.sv_mask, sa.sa_mask); sa.sa_flags = vec.sv_flags; sa.sa_flags ^= SA_RESTART; sap = &sa; } else sap = NULL; error = kern_sigaction(td, uap->signum, sap, 
&osa, KSA_OSIGSET); if (error == 0 && uap->osv != NULL) { vec.sv_handler = PTROUT(osa.sa_handler); SIG2OSIG(osa.sa_mask, vec.sv_mask); vec.sv_flags = osa.sa_flags; vec.sv_flags &= ~SA_NOCLDWAIT; vec.sv_flags ^= SA_RESTART; error = copyout(&vec, uap->osv, sizeof(vec)); } return (error); } int ofreebsd32_sigblock(struct thread *td, struct ofreebsd32_sigblock_args *uap) { struct proc *p = td->td_proc; sigset_t set; OSIG2SIG(uap->mask, set); SIG_CANTMASK(set); PROC_LOCK(p); SIG2OSIG(td->td_sigmask, td->td_retval[0]); SIGSETOR(td->td_sigmask, set); PROC_UNLOCK(p); return (0); } int ofreebsd32_sigsetmask(struct thread *td, struct ofreebsd32_sigsetmask_args *uap) { struct proc *p = td->td_proc; sigset_t set; OSIG2SIG(uap->mask, set); SIG_CANTMASK(set); PROC_LOCK(p); SIG2OSIG(td->td_sigmask, td->td_retval[0]); SIGSETLO(td->td_sigmask, set); signotify(td); PROC_UNLOCK(p); return (0); } int ofreebsd32_sigsuspend(struct thread *td, struct ofreebsd32_sigsuspend_args *uap) { struct proc *p = td->td_proc; sigset_t mask; PROC_LOCK(p); td->td_oldsigmask = td->td_sigmask; td->td_pflags |= TDP_OLDMASK; OSIG2SIG(uap->mask, mask); SIG_CANTMASK(mask); SIGSETLO(td->td_sigmask, mask); signotify(td); while (msleep(&p->p_sigacts, &p->p_mtx, PPAUSE|PCATCH, "opause", 0) == 0) /* void */; PROC_UNLOCK(p); /* always return EINTR rather than ERESTART... */ return (EINTR); } struct sigstack32 { u_int32_t ss_sp; int ss_onstack; }; int ofreebsd32_sigstack(struct thread *td, struct ofreebsd32_sigstack_args *uap) { struct sigstack32 s32; struct sigstack nss, oss; int error = 0, unss; if (uap->nss != NULL) { error = copyin(uap->nss, &s32, sizeof(s32)); if (error) return (error); nss.ss_sp = PTRIN(s32.ss_sp); CP(s32, nss, ss_onstack); unss = 1; } else { unss = 0; } oss.ss_sp = td->td_sigstk.ss_sp; oss.ss_onstack = sigonstack(cpu_getstack(td)); if (unss) { td->td_sigstk.ss_sp = nss.ss_sp; td->td_sigstk.ss_size = 0; td->td_sigstk.ss_flags |= (nss.ss_onstack & SS_ONSTACK); td->td_pflags |= TDP_ALTSTACK; } if (uap->oss != NULL) { s32.ss_sp = PTROUT(oss.ss_sp); CP(oss, s32, ss_onstack); error = copyout(&s32, uap->oss, sizeof(s32)); } return (error); } #endif int freebsd32_nanosleep(struct thread *td, struct freebsd32_nanosleep_args *uap) { struct timespec32 rmt32, rqt32; struct timespec rmt, rqt; int error; error = copyin(uap->rqtp, &rqt32, sizeof(rqt32)); if (error) return (error); CP(rqt32, rqt, tv_sec); CP(rqt32, rqt, tv_nsec); if (uap->rmtp && !useracc((caddr_t)uap->rmtp, sizeof(rmt), VM_PROT_WRITE)) return (EFAULT); error = kern_nanosleep(td, &rqt, &rmt); if (error && uap->rmtp) { int error2; CP(rmt, rmt32, tv_sec); CP(rmt, rmt32, tv_nsec); error2 = copyout(&rmt32, uap->rmtp, sizeof(rmt32)); if (error2) error = error2; } return (error); } int freebsd32_clock_gettime(struct thread *td, struct freebsd32_clock_gettime_args *uap) { struct timespec ats; struct timespec32 ats32; int error; error = kern_clock_gettime(td, uap->clock_id, &ats); if (error == 0) { CP(ats, ats32, tv_sec); CP(ats, ats32, tv_nsec); error = copyout(&ats32, uap->tp, sizeof(ats32)); } return (error); } int freebsd32_clock_settime(struct thread *td, struct freebsd32_clock_settime_args *uap) { struct timespec ats; struct timespec32 ats32; int error; error = copyin(uap->tp, &ats32, sizeof(ats32)); if (error) return (error); CP(ats32, ats, tv_sec); CP(ats32, ats, tv_nsec); return (kern_clock_settime(td, uap->clock_id, &ats)); } int freebsd32_clock_getres(struct thread *td, struct freebsd32_clock_getres_args *uap) { struct timespec ts; struct timespec32 ts32; int 
error; if (uap->tp == NULL) return (0); error = kern_clock_getres(td, uap->clock_id, &ts); if (error == 0) { CP(ts, ts32, tv_sec); CP(ts, ts32, tv_nsec); error = copyout(&ts32, uap->tp, sizeof(ts32)); } return (error); } int freebsd32_thr_new(struct thread *td, struct freebsd32_thr_new_args *uap) { struct thr_param32 param32; struct thr_param param; int error; if (uap->param_size < 0 || uap->param_size > sizeof(struct thr_param32)) return (EINVAL); bzero(&param, sizeof(struct thr_param)); bzero(&param32, sizeof(struct thr_param32)); error = copyin(uap->param, &param32, uap->param_size); if (error != 0) return (error); param.start_func = PTRIN(param32.start_func); param.arg = PTRIN(param32.arg); param.stack_base = PTRIN(param32.stack_base); param.stack_size = param32.stack_size; param.tls_base = PTRIN(param32.tls_base); param.tls_size = param32.tls_size; param.child_tid = PTRIN(param32.child_tid); param.parent_tid = PTRIN(param32.parent_tid); param.flags = param32.flags; param.rtp = PTRIN(param32.rtp); param.spare[0] = PTRIN(param32.spare[0]); param.spare[1] = PTRIN(param32.spare[1]); param.spare[2] = PTRIN(param32.spare[2]); return (kern_thr_new(td, &param)); } int freebsd32_thr_suspend(struct thread *td, struct freebsd32_thr_suspend_args *uap) { struct timespec32 ts32; struct timespec ts, *tsp; int error; error = 0; tsp = NULL; if (uap->timeout != NULL) { error = copyin((const void *)uap->timeout, (void *)&ts32, sizeof(struct timespec32)); if (error != 0) return (error); ts.tv_sec = ts32.tv_sec; ts.tv_nsec = ts32.tv_nsec; tsp = &ts; } return (kern_thr_suspend(td, tsp)); } void siginfo_to_siginfo32(siginfo_t *src, struct siginfo32 *dst) { bzero(dst, sizeof(*dst)); dst->si_signo = src->si_signo; dst->si_errno = src->si_errno; dst->si_code = src->si_code; dst->si_pid = src->si_pid; dst->si_uid = src->si_uid; dst->si_status = src->si_status; dst->si_addr = (uintptr_t)src->si_addr; dst->si_value.sigval_int = src->si_value.sival_int; dst->si_timerid = src->si_timerid; dst->si_overrun = src->si_overrun; } int freebsd32_sigtimedwait(struct thread *td, struct freebsd32_sigtimedwait_args *uap) { struct timespec32 ts32; struct timespec ts; struct timespec *timeout; sigset_t set; ksiginfo_t ksi; struct siginfo32 si32; int error; if (uap->timeout) { error = copyin(uap->timeout, &ts32, sizeof(ts32)); if (error) return (error); ts.tv_sec = ts32.tv_sec; ts.tv_nsec = ts32.tv_nsec; timeout = &ts; } else timeout = NULL; error = copyin(uap->set, &set, sizeof(set)); if (error) return (error); error = kern_sigtimedwait(td, set, &ksi, timeout); if (error) return (error); if (uap->info) { siginfo_to_siginfo32(&ksi.ksi_info, &si32); error = copyout(&si32, uap->info, sizeof(struct siginfo32)); } if (error == 0) td->td_retval[0] = ksi.ksi_signo; return (error); } /* * MPSAFE */ int freebsd32_sigwaitinfo(struct thread *td, struct freebsd32_sigwaitinfo_args *uap) { ksiginfo_t ksi; struct siginfo32 si32; sigset_t set; int error; error = copyin(uap->set, &set, sizeof(set)); if (error) return (error); error = kern_sigtimedwait(td, set, &ksi, NULL); if (error) return (error); if (uap->info) { siginfo_to_siginfo32(&ksi.ksi_info, &si32); error = copyout(&si32, uap->info, sizeof(struct siginfo32)); } if (error == 0) td->td_retval[0] = ksi.ksi_signo; return (error); } int freebsd32_cpuset_setid(struct thread *td, struct freebsd32_cpuset_setid_args *uap) { struct cpuset_setid_args ap; ap.which = uap->which; ap.id = (uap->idlo | ((id_t)uap->idhi << 32)); ap.setid = uap->setid; return (cpuset_setid(td, &ap)); } int
freebsd32_cpuset_getid(struct thread *td, struct freebsd32_cpuset_getid_args *uap) { struct cpuset_getid_args ap; ap.level = uap->level; ap.which = uap->which; ap.id = (uap->idlo | ((id_t)uap->idhi << 32)); ap.setid = uap->setid; return (cpuset_getid(td, &ap)); } int freebsd32_cpuset_getaffinity(struct thread *td, struct freebsd32_cpuset_getaffinity_args *uap) { struct cpuset_getaffinity_args ap; ap.level = uap->level; ap.which = uap->which; ap.id = (uap->idlo | ((id_t)uap->idhi << 32)); ap.cpusetsize = uap->cpusetsize; ap.mask = uap->mask; return (cpuset_getaffinity(td, &ap)); } int freebsd32_cpuset_setaffinity(struct thread *td, struct freebsd32_cpuset_setaffinity_args *uap) { struct cpuset_setaffinity_args ap; ap.level = uap->level; ap.which = uap->which; ap.id = (uap->idlo | ((id_t)uap->idhi << 32)); ap.cpusetsize = uap->cpusetsize; ap.mask = uap->mask; return (cpuset_setaffinity(td, &ap)); } int freebsd32_nmount(struct thread *td, struct freebsd32_nmount_args /* { struct iovec *iovp; unsigned int iovcnt; int flags; } */ *uap) { struct uio *auio; int error; AUDIT_ARG(fflags, uap->flags); /* * Filter out MNT_ROOTFS. We do not want clients of nmount() in * userspace to set this flag, but we must filter it out if we want * MNT_UPDATE on the root file system to work. * MNT_ROOTFS should only be set in the kernel in vfs_mountroot_try(). */ uap->flags &= ~MNT_ROOTFS; /* * check that we have an even number of iovec's * and that we have at least two options. */ if ((uap->iovcnt & 1) || (uap->iovcnt < 4)) return (EINVAL); error = freebsd32_copyinuio(uap->iovp, uap->iovcnt, &auio); if (error) return (error); error = vfs_donmount(td, uap->flags, auio); free(auio, M_IOV); return error; } #if 0 int freebsd32_xxx(struct thread *td, struct freebsd32_xxx_args *uap) { struct yyy32 *p32, s32; struct yyy *p = NULL, s; struct xxx_arg ap; int error; if (uap->zzz) { error = copyin(uap->zzz, &s32, sizeof(s32)); if (error) return (error); /* translate in */ p = &s; } error = kern_xxx(td, p); if (error) return (error); if (uap->zzz) { /* translate out */ error = copyout(&s32, p32, sizeof(s32)); } return (error); } #endif int syscall32_register(int *offset, struct sysent *new_sysent, struct sysent *old_sysent) { if (*offset == NO_SYSCALL) { int i; for (i = 1; i < SYS_MAXSYSCALL; ++i) if (freebsd32_sysent[i].sy_call == (sy_call_t *)lkmnosys) break; if (i == SYS_MAXSYSCALL) return (ENFILE); *offset = i; } else if (*offset < 0 || *offset >= SYS_MAXSYSCALL) return (EINVAL); else if (freebsd32_sysent[*offset].sy_call != (sy_call_t *)lkmnosys && freebsd32_sysent[*offset].sy_call != (sy_call_t *)lkmressys) return (EEXIST); *old_sysent = freebsd32_sysent[*offset]; freebsd32_sysent[*offset] = *new_sysent; return 0; } int syscall32_deregister(int *offset, struct sysent *old_sysent) { if (*offset) freebsd32_sysent[*offset] = *old_sysent; return 0; } int syscall32_module_handler(struct module *mod, int what, void *arg) { struct syscall_module_data *data = (struct syscall_module_data*)arg; modspecific_t ms; int error; switch (what) { case MOD_LOAD: error = syscall32_register(data->offset, data->new_sysent, &data->old_sysent); if (error) { /* Leave a mark so we know to safely unload below. 
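 *
 * (Editorial aside, not part of the change: the cpuset wrappers earlier in
 * this file receive the 64-bit id_t as two 32-bit arguments, idlo and idhi,
 * because the 32-bit syscall ABI has no 64-bit argument slot, and each
 * wrapper reassembles it with idlo | ((id_t)idhi << 32).  Below is a
 * minimal userland illustration of that split and reassembly; the sample
 * value is arbitrary.)
 */

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        int64_t id = 0x0000000500000007LL;      /* arbitrary 64-bit id */
        uint32_t idlo = (uint32_t)id;           /* low 32 bits, as passed in */
        uint32_t idhi = (uint32_t)(id >> 32);   /* high 32 bits */
        int64_t rebuilt;

        /* Same expression the wrappers use: widen idhi, shift, OR in idlo. */
        rebuilt = (idlo | ((int64_t)idhi << 32));
        printf("rebuilt %s original\n", rebuilt == id ? "matches" : "differs from");
        return (0);
}

/*
 * (End of editorial aside; the interrupted source resumes below.)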
*/ data->offset = NULL; return error; } ms.intval = *data->offset; MOD_XLOCK; module_setspecific(mod, &ms); MOD_XUNLOCK; if (data->chainevh) error = data->chainevh(mod, what, data->chainarg); return (error); case MOD_UNLOAD: /* * MOD_LOAD failed, so just return without calling the * chained handler since we didn't pass along the MOD_LOAD * event. */ if (data->offset == NULL) return (0); if (data->chainevh) { error = data->chainevh(mod, what, data->chainarg); if (error) return (error); } error = syscall32_deregister(data->offset, &data->old_sysent); return (error); default: error = EOPNOTSUPP; if (data->chainevh) error = data->chainevh(mod, what, data->chainarg); return (error); } } Index: head/sys/compat/linux/linux_mib.c =================================================================== --- head/sys/compat/linux/linux_mib.c (revision 192894) +++ head/sys/compat/linux/linux_mib.c (revision 192895) @@ -1,647 +1,599 @@ /*- * Copyright (c) 1999 Marcel Moolenaar * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer * in this position and unchanged. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. The name of the author may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include "opt_compat.h" #ifdef COMPAT_LINUX32 #include #else #include #endif #include struct linux_prison { char pr_osname[LINUX_MAX_UTSNAME]; char pr_osrelease[LINUX_MAX_UTSNAME]; int pr_oss_version; int pr_osrel; }; +static struct linux_prison lprison0 = { + .pr_osname = "Linux", + .pr_osrelease = "2.6.16", + .pr_oss_version = 0x030600, + .pr_osrel = 2006016 +}; + static unsigned linux_osd_jail_slot; SYSCTL_NODE(_compat, OID_AUTO, linux, CTLFLAG_RW, 0, "Linux mode"); -static struct mtx osname_lock; -MTX_SYSINIT(linux_osname, &osname_lock, "linux osname", MTX_DEF); - -static char linux_osname[LINUX_MAX_UTSNAME] = "Linux"; - static int linux_sysctl_osname(SYSCTL_HANDLER_ARGS) { char osname[LINUX_MAX_UTSNAME]; int error; linux_get_osname(req->td, osname); error = sysctl_handle_string(oidp, osname, LINUX_MAX_UTSNAME, req); if (error || req->newptr == NULL) return (error); error = linux_set_osname(req->td, osname); return (error); } SYSCTL_PROC(_compat_linux, OID_AUTO, osname, CTLTYPE_STRING | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE, 0, 0, linux_sysctl_osname, "A", "Linux kernel OS name"); -static char linux_osrelease[LINUX_MAX_UTSNAME] = "2.6.16"; -static int linux_osrel = 2006016; - static int linux_sysctl_osrelease(SYSCTL_HANDLER_ARGS) { char osrelease[LINUX_MAX_UTSNAME]; int error; linux_get_osrelease(req->td, osrelease); error = sysctl_handle_string(oidp, osrelease, LINUX_MAX_UTSNAME, req); if (error || req->newptr == NULL) return (error); error = linux_set_osrelease(req->td, osrelease); return (error); } SYSCTL_PROC(_compat_linux, OID_AUTO, osrelease, CTLTYPE_STRING | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE, 0, 0, linux_sysctl_osrelease, "A", "Linux kernel OS release"); -static int linux_oss_version = 0x030600; - static int linux_sysctl_oss_version(SYSCTL_HANDLER_ARGS) { int oss_version; int error; oss_version = linux_get_oss_version(req->td); error = sysctl_handle_int(oidp, &oss_version, 0, req); if (error || req->newptr == NULL) return (error); error = linux_set_oss_version(req->td, oss_version); return (error); } SYSCTL_PROC(_compat_linux, OID_AUTO, oss_version, CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_PRISON | CTLFLAG_MPSAFE, 0, 0, linux_sysctl_oss_version, "I", "Linux OSS version"); /* * Map the osrelease into integer */ static int linux_map_osrel(char *osrelease, int *osrel) { char *sep, *eosrelease; int len, v0, v1, v2, v; len = strlen(osrelease); eosrelease = osrelease + len; v0 = strtol(osrelease, &sep, 10); if (osrelease == sep || sep + 1 >= eosrelease || *sep != '.') return (EINVAL); osrelease = sep + 1; v1 = strtol(osrelease, &sep, 10); if (osrelease == sep || sep + 1 >= eosrelease || *sep != '.') return (EINVAL); osrelease = sep + 1; v2 = strtol(osrelease, &sep, 10); if (osrelease == sep || sep != eosrelease) return (EINVAL); v = v0 * 1000000 + v1 * 1000 + v2; if (v < 1000000) return (EINVAL); *osrel = v; return (0); } /* - * Returns holding the prison mutex if return non-NULL. + * Find a prison with Linux info. + * Return the Linux info and the (locked) prison. 
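 *
 * (Editorial aside, not part of the change: linux_map_osrel() above packs a
 * "major.minor.patch" release string into a single integer,
 * v0 * 1000000 + v1 * 1000 + v2, so "2.6.16" becomes 2006016, the value
 * preinstalled in lprison0.pr_osrel.  A simplified userland sketch follows;
 * it uses sscanf instead of the kernel's strtol loop and omits the range
 * checks, so it is illustrative only.)
 */

#include <stdio.h>

/* Simplified model of the mapping linux_map_osrel() performs. */
static int
map_osrel(const char *osrelease, int *osrel)
{
        int v0, v1, v2;

        if (sscanf(osrelease, "%d.%d.%d", &v0, &v1, &v2) != 3)
                return (-1);
        *osrel = v0 * 1000000 + v1 * 1000 + v2;
        return (0);
}

int
main(void)
{
        int osrel;

        if (map_osrel("2.6.16", &osrel) == 0)
                printf("2.6.16 -> %d\n", osrel);        /* prints 2006016 */
        return (0);
}

/*
 * (End of editorial aside; the interrupted source resumes below.)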
*/ static struct linux_prison * -linux_get_prison(struct thread *td, struct prison **prp) +linux_find_prison(struct prison *spr, struct prison **prp) { struct prison *pr; struct linux_prison *lpr; - KASSERT(td == curthread, ("linux_get_prison() called on !curthread")); - *prp = pr = td->td_ucred->cr_prison; - if (pr == NULL || !linux_osd_jail_slot) - return (NULL); - mtx_lock(&pr->pr_mtx); - lpr = osd_jail_get(pr, linux_osd_jail_slot); - if (lpr == NULL) + if (!linux_osd_jail_slot) + /* In case osd_register failed. */ + spr = &prison0; + for (pr = spr;; pr = pr->pr_parent) { + mtx_lock(&pr->pr_mtx); + lpr = (pr == &prison0) + ? &lprison0 + : osd_jail_get(pr, linux_osd_jail_slot); + if (lpr != NULL) + break; mtx_unlock(&pr->pr_mtx); + } + *prp = pr; return (lpr); } /* - * Ensure a prison has its own Linux info. The prison should be locked on - * entrance and will be locked on exit (though it may get unlocked in the - * interrim). + * Ensure a prison has its own Linux info. If lprp is non-null, point it to + * the Linux info and lock the prison. */ static int linux_alloc_prison(struct prison *pr, struct linux_prison **lprp) { + struct prison *ppr; struct linux_prison *lpr, *nlpr; int error; /* If this prison already has Linux info, return that. */ error = 0; - mtx_assert(&pr->pr_mtx, MA_OWNED); - lpr = osd_jail_get(pr, linux_osd_jail_slot); - if (lpr != NULL) + lpr = linux_find_prison(pr, &ppr); + if (ppr == pr) goto done; /* * Allocate a new info record. Then check again, in case something * changed during the allocation. */ - mtx_unlock(&pr->pr_mtx); + mtx_unlock(&ppr->pr_mtx); nlpr = malloc(sizeof(struct linux_prison), M_PRISON, M_WAITOK); - mtx_lock(&pr->pr_mtx); - lpr = osd_jail_get(pr, linux_osd_jail_slot); - if (lpr != NULL) { + lpr = linux_find_prison(pr, &ppr); + if (ppr == pr) { free(nlpr, M_PRISON); goto done; } + /* Inherit the initial values from the ancestor. */ + mtx_lock(&pr->pr_mtx); error = osd_jail_set(pr, linux_osd_jail_slot, nlpr); - if (error) - free(nlpr, M_PRISON); - else { + if (error == 0) { + bcopy(lpr, nlpr, sizeof(*lpr)); lpr = nlpr; - mtx_lock(&osname_lock); - strncpy(lpr->pr_osname, linux_osname, LINUX_MAX_UTSNAME); - strncpy(lpr->pr_osrelease, linux_osrelease, LINUX_MAX_UTSNAME); - lpr->pr_oss_version = linux_oss_version; - lpr->pr_osrel = linux_osrel; - mtx_unlock(&osname_lock); + } else { + free(nlpr, M_PRISON); + lpr = NULL; } -done: + mtx_unlock(&ppr->pr_mtx); + done: if (lprp != NULL) *lprp = lpr; + else + mtx_unlock(&pr->pr_mtx); return (error); } /* * Jail OSD methods for Linux prison data. */ static int linux_prison_create(void *obj, void *data) { - int error; struct prison *pr = obj; struct vfsoptlist *opts = data; if (vfs_flagopt(opts, "nolinux", NULL, 0)) return (0); /* * Inherit a prison's initial values from its parent * (different from NULL which also inherits changes). */ - mtx_lock(&pr->pr_mtx); - error = linux_alloc_prison(pr, NULL); - mtx_unlock(&pr->pr_mtx); - return (error); + return linux_alloc_prison(pr, NULL); } static int linux_prison_check(void *obj __unused, void *data) { struct vfsoptlist *opts = data; char *osname, *osrelease; - int error, len, oss_version; + int error, len, osrel, oss_version; /* Check that the parameters are correct. 
*/ (void)vfs_flagopt(opts, "linux", NULL, 0); (void)vfs_flagopt(opts, "nolinux", NULL, 0); error = vfs_getopt(opts, "linux.osname", (void **)&osname, &len); if (error != ENOENT) { if (error != 0) return (error); if (len == 0 || osname[len - 1] != '\0') return (EINVAL); if (len > LINUX_MAX_UTSNAME) { vfs_opterror(opts, "linux.osname too long"); return (ENAMETOOLONG); } } error = vfs_getopt(opts, "linux.osrelease", (void **)&osrelease, &len); if (error != ENOENT) { if (error != 0) return (error); if (len == 0 || osrelease[len - 1] != '\0') return (EINVAL); if (len > LINUX_MAX_UTSNAME) { vfs_opterror(opts, "linux.osrelease too long"); return (ENAMETOOLONG); } + error = linux_map_osrel(osrelease, &osrel); + if (error != 0) { + vfs_opterror(opts, "linux.osrelease format error"); + return (error); + } } error = vfs_copyopt(opts, "linux.oss_version", &oss_version, sizeof(oss_version)); return (error == ENOENT ? 0 : error); } static int linux_prison_set(void *obj, void *data) { struct linux_prison *lpr; struct prison *pr = obj; struct vfsoptlist *opts = data; char *osname, *osrelease; int error, gotversion, len, nolinux, oss_version, yeslinux; /* Set the parameters, which should be correct. */ yeslinux = vfs_flagopt(opts, "linux", NULL, 0); nolinux = vfs_flagopt(opts, "nolinux", NULL, 0); error = vfs_getopt(opts, "linux.osname", (void **)&osname, &len); if (error == ENOENT) osname = NULL; else yeslinux = 1; error = vfs_getopt(opts, "linux.osrelease", (void **)&osrelease, &len); if (error == ENOENT) osrelease = NULL; else yeslinux = 1; error = vfs_copyopt(opts, "linux.oss_version", &oss_version, sizeof(oss_version)); - gotversion = error == 0; + gotversion = (error == 0); yeslinux |= gotversion; if (nolinux) { /* "nolinux": inherit the parent's Linux info. */ mtx_lock(&pr->pr_mtx); osd_jail_del(pr, linux_osd_jail_slot); mtx_unlock(&pr->pr_mtx); } else if (yeslinux) { /* * "linux" or "linux.*": * the prison gets its own Linux info. */ - mtx_lock(&pr->pr_mtx); error = linux_alloc_prison(pr, &lpr); if (error) { mtx_unlock(&pr->pr_mtx); return (error); } if (osrelease) { error = linux_map_osrel(osrelease, &lpr->pr_osrel); if (error) { mtx_unlock(&pr->pr_mtx); return (error); } strlcpy(lpr->pr_osrelease, osrelease, LINUX_MAX_UTSNAME); } if (osname) strlcpy(lpr->pr_osname, osname, LINUX_MAX_UTSNAME); if (gotversion) lpr->pr_oss_version = oss_version; mtx_unlock(&pr->pr_mtx); } return (0); } SYSCTL_JAIL_PARAM_NODE(linux, "Jail Linux parameters"); SYSCTL_JAIL_PARAM(, nolinux, CTLTYPE_INT | CTLFLAG_RW, "BN", "Jail w/ no Linux parameters"); SYSCTL_JAIL_PARAM_STRING(_linux, osname, CTLFLAG_RW, LINUX_MAX_UTSNAME, "Jail Linux kernel OS name"); SYSCTL_JAIL_PARAM_STRING(_linux, osrelease, CTLFLAG_RW, LINUX_MAX_UTSNAME, "Jail Linux kernel OS release"); SYSCTL_JAIL_PARAM(_linux, oss_version, CTLTYPE_INT | CTLFLAG_RW, "I", "Jail Linux OSS version"); static int linux_prison_get(void *obj, void *data) { struct linux_prison *lpr; + struct prison *ppr; struct prison *pr = obj; struct vfsoptlist *opts = data; int error, i; - mtx_lock(&pr->pr_mtx); - /* Tell whether this prison has its own Linux info. */ - lpr = osd_jail_get(pr, linux_osd_jail_slot); - i = lpr != NULL; + static int version0; + + /* See if this prison is the one with the Linux info. 
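 *
 * (Editorial aside, not part of the change: with hierarchical jails a
 * prison that has no Linux info of its own inherits it from the nearest
 * ancestor that does, with prison0/lprison0 as the ultimate fallback;
 * linux_find_prison() implements that walk, and this function reports
 * empty parameters when the prison is merely inheriting.  The toy userland
 * model below illustrates the parent-chain lookup; the struct and names
 * are invented for the sketch.)
 */

#include <stddef.h>
#include <stdio.h>

/* Toy model of the parent-chain lookup linux_find_prison() performs. */
struct toy_prison {
        struct toy_prison       *parent;        /* NULL only for the root */
        const char              *linux_info;    /* NULL = inherit from ancestor */
};

static const char *
find_linux_info(struct toy_prison *pr)
{
        /* Walk towards the root; the root always has info. */
        for (; pr->linux_info == NULL; pr = pr->parent)
                ;
        return (pr->linux_info);
}

int
main(void)
{
        struct toy_prison root = { NULL, "Linux 2.6.16 (system default)" };
        struct toy_prison parent = { &root, "Linux 2.6.18 (set on parent)" };
        struct toy_prison child = { &parent, NULL };    /* inherits */

        printf("child sees: %s\n", find_linux_info(&child));
        return (0);
}

/*
 * (End of editorial aside; the interrupted source resumes below.)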
*/ + lpr = linux_find_prison(pr, &ppr); + i = (ppr == pr); error = vfs_setopt(opts, "linux", &i, sizeof(i)); if (error != 0 && error != ENOENT) goto done; i = !i; error = vfs_setopt(opts, "nolinux", &i, sizeof(i)); if (error != 0 && error != ENOENT) goto done; - /* - * It's kind of bogus to give the root info, but leave it to the caller - * to check the above flag. - */ - if (lpr != NULL) { - error = vfs_setopts(opts, "linux.osname", lpr->pr_osname); + if (i) { + /* + * If this prison is inheriting its Linux info, report + * empty/zero parameters. + */ + error = vfs_setopts(opts, "linux.osname", ""); if (error != 0 && error != ENOENT) goto done; - error = vfs_setopts(opts, "linux.osrelease", lpr->pr_osrelease); + error = vfs_setopts(opts, "linux.osrelease", ""); if (error != 0 && error != ENOENT) goto done; - error = vfs_setopt(opts, "linux.oss_version", - &lpr->pr_oss_version, sizeof(lpr->pr_oss_version)); + error = vfs_setopt(opts, "linux.oss_version", &version0, + sizeof(lpr->pr_oss_version)); if (error != 0 && error != ENOENT) goto done; } else { - mtx_lock(&osname_lock); - error = vfs_setopts(opts, "linux.osname", linux_osname); + error = vfs_setopts(opts, "linux.osname", lpr->pr_osname); if (error != 0 && error != ENOENT) goto done; - error = vfs_setopts(opts, "linux.osrelease", linux_osrelease); + error = vfs_setopts(opts, "linux.osrelease", lpr->pr_osrelease); if (error != 0 && error != ENOENT) goto done; error = vfs_setopt(opts, "linux.oss_version", - &linux_oss_version, sizeof(linux_oss_version)); + &lpr->pr_oss_version, sizeof(lpr->pr_oss_version)); if (error != 0 && error != ENOENT) goto done; - mtx_unlock(&osname_lock); } error = 0; done: - mtx_unlock(&pr->pr_mtx); + mtx_unlock(&ppr->pr_mtx); return (error); } static void linux_prison_destructor(void *data) { free(data, M_PRISON); } void linux_osd_jail_register(void) { struct prison *pr; osd_method_t methods[PR_MAXMETHOD] = { [PR_METHOD_CREATE] = linux_prison_create, [PR_METHOD_GET] = linux_prison_get, [PR_METHOD_SET] = linux_prison_set, [PR_METHOD_CHECK] = linux_prison_check }; linux_osd_jail_slot = osd_jail_register(linux_prison_destructor, methods); if (linux_osd_jail_slot > 0) { /* Copy the system linux info to any current prisons. 
*/ sx_xlock(&allprison_lock); - TAILQ_FOREACH(pr, &allprison, pr_list) { - mtx_lock(&pr->pr_mtx); + TAILQ_FOREACH(pr, &allprison, pr_list) (void)linux_alloc_prison(pr, NULL); - mtx_unlock(&pr->pr_mtx); - } sx_xunlock(&allprison_lock); } } void linux_osd_jail_deregister(void) { if (linux_osd_jail_slot) osd_jail_deregister(linux_osd_jail_slot); } void linux_get_osname(struct thread *td, char *dst) { struct prison *pr; struct linux_prison *lpr; - lpr = linux_get_prison(td, &pr); - if (lpr != NULL) { - bcopy(lpr->pr_osname, dst, LINUX_MAX_UTSNAME); - mtx_unlock(&pr->pr_mtx); - } else { - mtx_lock(&osname_lock); - bcopy(linux_osname, dst, LINUX_MAX_UTSNAME); - mtx_unlock(&osname_lock); - } + lpr = linux_find_prison(td->td_ucred->cr_prison, &pr); + bcopy(lpr->pr_osname, dst, LINUX_MAX_UTSNAME); + mtx_unlock(&pr->pr_mtx); } int linux_set_osname(struct thread *td, char *osname) { struct prison *pr; struct linux_prison *lpr; - lpr = linux_get_prison(td, &pr); - if (lpr != NULL) { - strlcpy(lpr->pr_osname, osname, LINUX_MAX_UTSNAME); - mtx_unlock(&pr->pr_mtx); - } else { - mtx_lock(&osname_lock); - strcpy(linux_osname, osname); - mtx_unlock(&osname_lock); - } - + lpr = linux_find_prison(td->td_ucred->cr_prison, &pr); + strlcpy(lpr->pr_osname, osname, LINUX_MAX_UTSNAME); + mtx_unlock(&pr->pr_mtx); return (0); } void linux_get_osrelease(struct thread *td, char *dst) { struct prison *pr; struct linux_prison *lpr; - lpr = linux_get_prison(td, &pr); - if (lpr != NULL) { - bcopy(lpr->pr_osrelease, dst, LINUX_MAX_UTSNAME); - mtx_unlock(&pr->pr_mtx); - } else { - mtx_lock(&osname_lock); - bcopy(linux_osrelease, dst, LINUX_MAX_UTSNAME); - mtx_unlock(&osname_lock); - } + lpr = linux_find_prison(td->td_ucred->cr_prison, &pr); + bcopy(lpr->pr_osrelease, dst, LINUX_MAX_UTSNAME); + mtx_unlock(&pr->pr_mtx); } int linux_kernver(struct thread *td) { struct prison *pr; struct linux_prison *lpr; int osrel; - lpr = linux_get_prison(td, &pr); - if (lpr != NULL) { - osrel = lpr->pr_osrel; - mtx_unlock(&pr->pr_mtx); - } else - osrel = linux_osrel; + lpr = linux_find_prison(td->td_ucred->cr_prison, &pr); + osrel = lpr->pr_osrel; + mtx_unlock(&pr->pr_mtx); return (osrel); } int linux_set_osrelease(struct thread *td, char *osrelease) { struct prison *pr; struct linux_prison *lpr; int error; - lpr = linux_get_prison(td, &pr); - if (lpr != NULL) { - error = linux_map_osrel(osrelease, &lpr->pr_osrel); - if (error) { - mtx_unlock(&pr->pr_mtx); - return (error); - } + lpr = linux_find_prison(td->td_ucred->cr_prison, &pr); + error = linux_map_osrel(osrelease, &lpr->pr_osrel); + if (error == 0) strlcpy(lpr->pr_osrelease, osrelease, LINUX_MAX_UTSNAME); - mtx_unlock(&pr->pr_mtx); - } else { - mtx_lock(&osname_lock); - error = linux_map_osrel(osrelease, &linux_osrel); - if (error) { - mtx_unlock(&osname_lock); - return (error); - } - strcpy(linux_osrelease, osrelease); - mtx_unlock(&osname_lock); - } - - return (0); + mtx_unlock(&pr->pr_mtx); + return (error); } int linux_get_oss_version(struct thread *td) { struct prison *pr; struct linux_prison *lpr; int version; - lpr = linux_get_prison(td, &pr); - if (lpr != NULL) { - version = lpr->pr_oss_version; - mtx_unlock(&pr->pr_mtx); - } else - version = linux_oss_version; + lpr = linux_find_prison(td->td_ucred->cr_prison, &pr); + version = lpr->pr_oss_version; + mtx_unlock(&pr->pr_mtx); return (version); } int linux_set_oss_version(struct thread *td, int oss_version) { struct prison *pr; struct linux_prison *lpr; - lpr = linux_get_prison(td, &pr); - if (lpr != NULL) { - 
lpr->pr_oss_version = oss_version; - mtx_unlock(&pr->pr_mtx); - } else { - mtx_lock(&osname_lock); - linux_oss_version = oss_version; - mtx_unlock(&osname_lock); - } - + lpr = linux_find_prison(td->td_ucred->cr_prison, &pr); + lpr->pr_oss_version = oss_version; + mtx_unlock(&pr->pr_mtx); return (0); } #if defined(DEBUG) || defined(KTR) u_char linux_debug_map[howmany(LINUX_SYS_MAXSYSCALL, sizeof(u_char))]; static int linux_debug(int syscall, int toggle, int global) { if (global) { char c = toggle ? 0 : 0xff; memset(linux_debug_map, c, sizeof(linux_debug_map)); return (0); } if (syscall < 0 || syscall >= LINUX_SYS_MAXSYSCALL) return (EINVAL); if (toggle) clrbit(linux_debug_map, syscall); else setbit(linux_debug_map, syscall); return (0); } /* * Usage: sysctl linux.debug=.<0/1> * * E.g.: sysctl linux.debug=21.0 * * As a special case, syscall "all" will apply to all syscalls globally. */ #define LINUX_MAX_DEBUGSTR 16 static int linux_sysctl_debug(SYSCTL_HANDLER_ARGS) { char value[LINUX_MAX_DEBUGSTR], *p; int error, sysc, toggle; int global = 0; value[0] = '\0'; error = sysctl_handle_string(oidp, value, LINUX_MAX_DEBUGSTR, req); if (error || req->newptr == NULL) return (error); for (p = value; *p != '\0' && *p != '.'; p++); if (*p == '\0') return (EINVAL); *p++ = '\0'; sysc = strtol(value, NULL, 0); toggle = strtol(p, NULL, 0); if (strcmp(value, "all") == 0) global = 1; error = linux_debug(sysc, toggle, global); return (error); } SYSCTL_PROC(_compat_linux, OID_AUTO, debug, CTLTYPE_STRING | CTLFLAG_RW, 0, 0, linux_sysctl_debug, "A", "Linux debugging control"); #endif /* DEBUG || KTR */ Index: head/sys/contrib/ipfilter/netinet/ip_fil_freebsd.c =================================================================== --- head/sys/contrib/ipfilter/netinet/ip_fil_freebsd.c (revision 192894) +++ head/sys/contrib/ipfilter/netinet/ip_fil_freebsd.c (revision 192895) @@ -1,1664 +1,1670 @@ /* $FreeBSD$ */ /* * Copyright (C) 1993-2003 by Darren Reed. * * See the IPFILTER.LICENCE file for details on licencing. 
*/ #if !defined(lint) static const char sccsid[] = "@(#)ip_fil.c 2.41 6/5/96 (C) 1993-2000 Darren Reed"; static const char rcsid[] = "@(#)$Id: ip_fil_freebsd.c,v 2.53.2.50 2007/09/20 12:51:50 darrenr Exp $"; #endif #if defined(KERNEL) || defined(_KERNEL) # undef KERNEL # undef _KERNEL # define KERNEL 1 # define _KERNEL 1 #endif #if defined(__FreeBSD_version) && (__FreeBSD_version >= 400000) && \ !defined(KLD_MODULE) && !defined(IPFILTER_LKM) # include "opt_inet6.h" #endif #if defined(__FreeBSD_version) && (__FreeBSD_version >= 440000) && \ !defined(KLD_MODULE) && !defined(IPFILTER_LKM) # include "opt_random_ip_id.h" #endif #include #if defined(__FreeBSD__) && !defined(__FreeBSD_version) # if defined(IPFILTER_LKM) # ifndef __FreeBSD_cc_version # include # else # if __FreeBSD_cc_version < 430000 # include # endif # endif # endif #endif #include #include #include #if __FreeBSD_version >= 220000 # include # include #else # include #endif #include #include #if (__FreeBSD_version >= 300000) # include #else # include #endif #if !defined(__hpux) # include #endif #include #include #if __FreeBSD_version >= 500043 # include #else # include #endif #if __FreeBSD_version >= 800044 # include #else #define V_path_mtu_discovery path_mtu_discovery #define V_ipforwarding ipforwarding #endif #include #if __FreeBSD_version >= 300000 # include # if __FreeBSD_version >= 500043 # include # endif # if !defined(IPFILTER_LKM) # include "opt_ipfilter.h" # endif #endif #include #include #include #include #include #include #include #if defined(__osf__) # include #endif #include #include #include #if defined(__FreeBSD_version) && (__FreeBSD_version >= 800056) # include #endif #ifndef _KERNEL # include "netinet/ipf.h" #endif #include "netinet/ip_compat.h" #ifdef USE_INET6 # include #endif #include "netinet/ip_fil.h" #include "netinet/ip_nat.h" #include "netinet/ip_frag.h" #include "netinet/ip_state.h" #include "netinet/ip_proxy.h" #include "netinet/ip_auth.h" #ifdef IPFILTER_SYNC #include "netinet/ip_sync.h" #endif #ifdef IPFILTER_SCAN #include "netinet/ip_scan.h" #endif #include "netinet/ip_pool.h" #if defined(__FreeBSD_version) && (__FreeBSD_version >= 300000) # include #endif #include #ifdef CSUM_DATA_VALID #include #endif extern int ip_optcopy __P((struct ip *, struct ip *)); #if (__FreeBSD_version > 460000) && (__FreeBSD_version < 800055) extern int path_mtu_discovery; #endif # ifdef IPFILTER_M_IPFILTER MALLOC_DEFINE(M_IPFILTER, "ipfilter", "IP Filter packet filter data structures"); # endif #if !defined(__osf__) extern struct protosw inetsw[]; #endif static int (*fr_savep) __P((ip_t *, int, void *, int, struct mbuf **)); static int fr_send_ip __P((fr_info_t *, mb_t *, mb_t **)); # ifdef USE_MUTEXES ipfmutex_t ipl_mutex, ipf_authmx, ipf_rw, ipf_stinsert; ipfmutex_t ipf_nat_new, ipf_natio, ipf_timeoutlock; ipfrwlock_t ipf_mutex, ipf_global, ipf_ipidfrag, ipf_frcache, ipf_tokens; ipfrwlock_t ipf_frag, ipf_state, ipf_nat, ipf_natfrag, ipf_auth; # endif int ipf_locks_done = 0; #if (__FreeBSD_version >= 300000) struct callout_handle fr_slowtimer_ch; #endif struct selinfo ipfselwait[IPL_LOGSIZE]; #if (__FreeBSD_version >= 500011) # include # if defined(NETBSD_PF) # include # if (__FreeBSD_version < 501108) # include # endif /* * We provide the fr_checkp name just to minimize changes later. 
*/ int (*fr_checkp) __P((ip_t *ip, int hlen, void *ifp, int out, mb_t **mp)); # endif /* NETBSD_PF */ #endif /* __FreeBSD_version >= 500011 */ #if (__FreeBSD_version >= 502103) static eventhandler_tag ipf_arrivetag, ipf_departtag, ipf_clonetag; static void ipf_ifevent(void *arg); static void ipf_ifevent(arg) void *arg; { frsync(NULL); } #endif #if (__FreeBSD_version >= 501108) && defined(_KERNEL) static int fr_check_wrapper(void *arg, struct mbuf **mp, struct ifnet *ifp, int dir) { struct ip *ip = mtod(*mp, struct ip *); return fr_check(ip, ip->ip_hl << 2, ifp, (dir == PFIL_OUT), mp); } # ifdef USE_INET6 # include static int fr_check_wrapper6(void *arg, struct mbuf **mp, struct ifnet *ifp, int dir) { return (fr_check(mtod(*mp, struct ip *), sizeof(struct ip6_hdr), ifp, (dir == PFIL_OUT), mp)); } # endif #endif /* __FreeBSD_version >= 501108 */ #if defined(IPFILTER_LKM) int iplidentify(s) char *s; { if (strcmp(s, "ipl") == 0) return 1; return 0; } #endif /* IPFILTER_LKM */ int ipfattach() { INIT_VNET_INET(curvnet); #ifdef USE_SPL int s; #endif SPL_NET(s); if (fr_running > 0) { SPL_X(s); return EBUSY; } MUTEX_INIT(&ipf_rw, "ipf rw mutex"); MUTEX_INIT(&ipf_timeoutlock, "ipf timeout queue mutex"); RWLOCK_INIT(&ipf_ipidfrag, "ipf IP NAT-Frag rwlock"); RWLOCK_INIT(&ipf_tokens, "ipf token rwlock"); ipf_locks_done = 1; if (fr_initialise() < 0) { SPL_X(s); return EIO; } if (fr_checkp != fr_check) { fr_savep = fr_checkp; fr_checkp = fr_check; } bzero((char *)ipfselwait, sizeof(ipfselwait)); bzero((char *)frcache, sizeof(frcache)); fr_running = 1; if (fr_control_forwarding & 1) V_ipforwarding = 1; SPL_X(s); #if (__FreeBSD_version >= 300000) fr_slowtimer_ch = timeout(fr_slowtimer, NULL, (hz / IPF_HZ_DIVIDE) * IPF_HZ_MULT); #else timeout(fr_slowtimer, NULL, (hz / IPF_HZ_DIVIDE) * IPF_HZ_MULT); #endif return 0; } /* * Disable the filter by removing the hooks from the IP input/output * stream. */ int ipfdetach() { INIT_VNET_INET(curvnet); #ifdef USE_SPL int s; #endif if (fr_control_forwarding & 2) V_ipforwarding = 0; SPL_NET(s); #if (__FreeBSD_version >= 300000) if (fr_slowtimer_ch.callout != NULL) untimeout(fr_slowtimer, NULL, fr_slowtimer_ch); bzero(&fr_slowtimer_ch, sizeof(fr_slowtimer_ch)); #else untimeout(fr_slowtimer, NULL); #endif /* FreeBSD */ #ifndef NETBSD_PF if (fr_checkp != NULL) fr_checkp = fr_savep; fr_savep = NULL; #endif fr_deinitialise(); fr_running = -2; (void) frflush(IPL_LOGIPF, 0, FR_INQUE|FR_OUTQUE|FR_INACTIVE); (void) frflush(IPL_LOGIPF, 0, FR_INQUE|FR_OUTQUE); if (ipf_locks_done == 1) { MUTEX_DESTROY(&ipf_timeoutlock); MUTEX_DESTROY(&ipf_rw); RW_DESTROY(&ipf_ipidfrag); RW_DESTROY(&ipf_tokens); ipf_locks_done = 0; } SPL_X(s); return 0; } /* * Filter ioctl interface. 
*/ int iplioctl(dev, cmd, data, mode # if defined(_KERNEL) && ((BSD >= 199506) || (__FreeBSD_version >= 220000)) , p) # if (__FreeBSD_version >= 500024) struct thread *p; # if (__FreeBSD_version >= 500043) +# define p_cred td_ucred # define p_uid td_ucred->cr_ruid # else +# define p_cred t_proc->p_cred # define p_uid t_proc->p_cred->p_ruid # endif # else struct proc *p; # define p_uid p_cred->p_ruid # endif /* __FreeBSD_version >= 500024 */ # else ) # endif #if defined(_KERNEL) && (__FreeBSD_version >= 502116) struct cdev *dev; #else dev_t dev; #endif ioctlcmd_t cmd; caddr_t data; int mode; { int error = 0, unit = 0; SPL_INT(s); #if (BSD >= 199306) && defined(_KERNEL) +# if (__FreeBSD_version >= 500034) + if (securelevel_ge(p->p_cred, 3) && (mode & FWRITE)) +# else if ((securelevel >= 3) && (mode & FWRITE)) +# endif return EPERM; #endif unit = GET_MINOR(dev); if ((IPL_LOGMAX < unit) || (unit < 0)) return ENXIO; if (fr_running <= 0) { if (unit != IPL_LOGIPF) return EIO; if (cmd != SIOCIPFGETNEXT && cmd != SIOCIPFGET && cmd != SIOCIPFSET && cmd != SIOCFRENB && cmd != SIOCGETFS && cmd != SIOCGETFF) return EIO; } SPL_NET(s); error = fr_ioctlswitch(unit, data, cmd, mode, p->p_uid, p); if (error != -1) { SPL_X(s); return error; } SPL_X(s); return error; } #if 0 void fr_forgetifp(ifp) void *ifp; { register frentry_t *f; WRITE_ENTER(&ipf_mutex); for (f = ipacct[0][fr_active]; (f != NULL); f = f->fr_next) if (f->fr_ifa == ifp) f->fr_ifa = (void *)-1; for (f = ipacct[1][fr_active]; (f != NULL); f = f->fr_next) if (f->fr_ifa == ifp) f->fr_ifa = (void *)-1; for (f = ipfilter[0][fr_active]; (f != NULL); f = f->fr_next) if (f->fr_ifa == ifp) f->fr_ifa = (void *)-1; for (f = ipfilter[1][fr_active]; (f != NULL); f = f->fr_next) if (f->fr_ifa == ifp) f->fr_ifa = (void *)-1; #ifdef USE_INET6 for (f = ipacct6[0][fr_active]; (f != NULL); f = f->fr_next) if (f->fr_ifa == ifp) f->fr_ifa = (void *)-1; for (f = ipacct6[1][fr_active]; (f != NULL); f = f->fr_next) if (f->fr_ifa == ifp) f->fr_ifa = (void *)-1; for (f = ipfilter6[0][fr_active]; (f != NULL); f = f->fr_next) if (f->fr_ifa == ifp) f->fr_ifa = (void *)-1; for (f = ipfilter6[1][fr_active]; (f != NULL); f = f->fr_next) if (f->fr_ifa == ifp) f->fr_ifa = (void *)-1; #endif RWLOCK_EXIT(&ipf_mutex); fr_natsync(ifp); } #endif /* * routines below for saving IP headers to buffer */ int iplopen(dev, flags #if ((BSD >= 199506) || (__FreeBSD_version >= 220000)) && defined(_KERNEL) , devtype, p) int devtype; # if (__FreeBSD_version >= 500024) struct thread *p; # else struct proc *p; # endif /* __FreeBSD_version >= 500024 */ #else ) #endif #if defined(_KERNEL) && (__FreeBSD_version >= 502116) struct cdev *dev; #else dev_t dev; #endif int flags; { u_int min = GET_MINOR(dev); if (IPL_LOGMAX < min) min = ENXIO; else min = 0; return min; } int iplclose(dev, flags #if ((BSD >= 199506) || (__FreeBSD_version >= 220000)) && defined(_KERNEL) , devtype, p) int devtype; # if (__FreeBSD_version >= 500024) struct thread *p; # else struct proc *p; # endif /* __FreeBSD_version >= 500024 */ #else ) #endif #if defined(_KERNEL) && (__FreeBSD_version >= 502116) struct cdev *dev; #else dev_t dev; #endif int flags; { u_int min = GET_MINOR(dev); if (IPL_LOGMAX < min) min = ENXIO; else min = 0; return min; } /* * iplread/ipllog * both of these must operate with at least splnet() lest they be * called during packet processing and cause an inconsistancy to appear in * the filter lists. 
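 *
 * (Editorial aside, not part of the change: on newer kernels the ioctl
 * handler above rejects writes with securelevel_ge(p->p_cred, 3) instead of
 * comparing a global securelevel, so the level is looked up through the
 * caller's credential and thus its jail.  The toy model below assumes
 * securelevel_ge() returns EPERM when the effective level is at least the
 * requested one and 0 otherwise; the struct names are invented for the
 * sketch.)
 */

#include <errno.h>
#include <stdio.h>

/* Toy model of a per-credential securelevel check. */
struct toy_prison { int securelevel; };
struct toy_ucred  { struct toy_prison *prison; };

static int
toy_securelevel_ge(struct toy_ucred *cr, int level)
{
        return (cr->prison->securelevel >= level ? EPERM : 0);
}

int
main(void)
{
        struct toy_prison pr = { .securelevel = 3 };
        struct toy_ucred cred = { .prison = &pr };

        if (toy_securelevel_ge(&cred, 3))
                printf("write access would be refused (EPERM)\n");
        return (0);
}

/*
 * (End of editorial aside; the interrupted source resumes below.)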
*/ #if (BSD >= 199306) int iplread(dev, uio, ioflag) int ioflag; #else int iplread(dev, uio) #endif #if defined(_KERNEL) && (__FreeBSD_version >= 502116) struct cdev *dev; #else dev_t dev; #endif register struct uio *uio; { u_int xmin = GET_MINOR(dev); if (fr_running < 1) return EIO; if (xmin < 0) return ENXIO; # ifdef IPFILTER_SYNC if (xmin == IPL_LOGSYNC) return ipfsync_read(uio); # endif #ifdef IPFILTER_LOG return ipflog_read(xmin, uio); #else return ENXIO; #endif } /* * iplwrite * both of these must operate with at least splnet() lest they be * called during packet processing and cause an inconsistancy to appear in * the filter lists. */ #if (BSD >= 199306) int iplwrite(dev, uio, ioflag) int ioflag; #else int iplwrite(dev, uio) #endif #if defined(_KERNEL) && (__FreeBSD_version >= 502116) struct cdev *dev; #else dev_t dev; #endif register struct uio *uio; { if (fr_running < 1) return EIO; #ifdef IPFILTER_SYNC if (GET_MINOR(dev) == IPL_LOGSYNC) return ipfsync_write(uio); #endif return ENXIO; } /* * fr_send_reset - this could conceivably be a call to tcp_respond(), but that * requires a large amount of setting up and isn't any more efficient. */ int fr_send_reset(fin) fr_info_t *fin; { struct tcphdr *tcp, *tcp2; int tlen = 0, hlen; struct mbuf *m; #ifdef USE_INET6 ip6_t *ip6; #endif ip_t *ip; tcp = fin->fin_dp; if (tcp->th_flags & TH_RST) return -1; /* feedback loop */ if (fr_checkl4sum(fin) == -1) return -1; tlen = fin->fin_dlen - (TCP_OFF(tcp) << 2) + ((tcp->th_flags & TH_SYN) ? 1 : 0) + ((tcp->th_flags & TH_FIN) ? 1 : 0); #ifdef USE_INET6 hlen = (fin->fin_v == 6) ? sizeof(ip6_t) : sizeof(ip_t); #else hlen = sizeof(ip_t); #endif #ifdef MGETHDR MGETHDR(m, M_DONTWAIT, MT_HEADER); #else MGET(m, M_DONTWAIT, MT_HEADER); #endif if (m == NULL) return -1; if (sizeof(*tcp2) + hlen > MLEN) { MCLGET(m, M_DONTWAIT); if ((m->m_flags & M_EXT) == 0) { FREE_MB_T(m); return -1; } } m->m_len = sizeof(*tcp2) + hlen; #if (BSD >= 199103) m->m_data += max_linkhdr; m->m_pkthdr.len = m->m_len; m->m_pkthdr.rcvif = (struct ifnet *)0; #endif ip = mtod(m, struct ip *); bzero((char *)ip, hlen); #ifdef USE_INET6 ip6 = (ip6_t *)ip; #endif tcp2 = (struct tcphdr *)((char *)ip + hlen); tcp2->th_sport = tcp->th_dport; tcp2->th_dport = tcp->th_sport; if (tcp->th_flags & TH_ACK) { tcp2->th_seq = tcp->th_ack; tcp2->th_flags = TH_RST; tcp2->th_ack = 0; } else { tcp2->th_seq = 0; tcp2->th_ack = ntohl(tcp->th_seq); tcp2->th_ack += tlen; tcp2->th_ack = htonl(tcp2->th_ack); tcp2->th_flags = TH_RST|TH_ACK; } TCP_X2_A(tcp2, 0); TCP_OFF_A(tcp2, sizeof(*tcp2) >> 2); tcp2->th_win = tcp->th_win; tcp2->th_sum = 0; tcp2->th_urp = 0; #ifdef USE_INET6 if (fin->fin_v == 6) { ip6->ip6_flow = ((ip6_t *)fin->fin_ip)->ip6_flow; ip6->ip6_plen = htons(sizeof(struct tcphdr)); ip6->ip6_nxt = IPPROTO_TCP; ip6->ip6_hlim = 0; ip6->ip6_src = fin->fin_dst6; ip6->ip6_dst = fin->fin_src6; tcp2->th_sum = in6_cksum(m, IPPROTO_TCP, sizeof(*ip6), sizeof(*tcp2)); return fr_send_ip(fin, m, &m); } #endif ip->ip_p = IPPROTO_TCP; ip->ip_len = htons(sizeof(struct tcphdr)); ip->ip_src.s_addr = fin->fin_daddr; ip->ip_dst.s_addr = fin->fin_saddr; tcp2->th_sum = in_cksum(m, hlen + sizeof(*tcp2)); ip->ip_len = hlen + sizeof(*tcp2); return fr_send_ip(fin, m, &m); } static int fr_send_ip(fin, m, mpp) fr_info_t *fin; mb_t *m, **mpp; { INIT_VNET_INET(curvnet); fr_info_t fnew; ip_t *ip, *oip; int hlen; ip = mtod(m, ip_t *); bzero((char *)&fnew, sizeof(fnew)); IP_V_A(ip, fin->fin_v); switch (fin->fin_v) { case 4 : fnew.fin_v = 4; oip = fin->fin_ip; IP_HL_A(ip, sizeof(*oip) 
>> 2); ip->ip_tos = oip->ip_tos; ip->ip_id = fin->fin_ip->ip_id; #if (__FreeBSD_version > 460000) ip->ip_off = V_path_mtu_discovery ? IP_DF : 0; #else ip->ip_off = 0; #endif ip->ip_ttl = V_ip_defttl; ip->ip_sum = 0; hlen = sizeof(*oip); break; #ifdef USE_INET6 case 6 : { ip6_t *ip6 = (ip6_t *)ip; ip6->ip6_vfc = 0x60; ip6->ip6_hlim = IPDEFTTL; fnew.fin_v = 6; hlen = sizeof(*ip6); break; } #endif default : return EINVAL; } #ifdef IPSEC m->m_pkthdr.rcvif = NULL; #endif fnew.fin_ifp = fin->fin_ifp; fnew.fin_flx = FI_NOCKSUM; fnew.fin_m = m; fnew.fin_ip = ip; fnew.fin_mp = mpp; fnew.fin_hlen = hlen; fnew.fin_dp = (char *)ip + hlen; (void) fr_makefrip(hlen, ip, &fnew); return fr_fastroute(m, mpp, &fnew, NULL); } int fr_send_icmp_err(type, fin, dst) int type; fr_info_t *fin; int dst; { int err, hlen, xtra, iclen, ohlen, avail, code; struct in_addr dst4; struct icmp *icmp; struct mbuf *m; void *ifp; #ifdef USE_INET6 ip6_t *ip6; struct in6_addr dst6; #endif ip_t *ip, *ip2; if ((type < 0) || (type >= ICMP_MAXTYPE)) return -1; code = fin->fin_icode; #ifdef USE_INET6 if ((code < 0) || (code > sizeof(icmptoicmp6unreach)/sizeof(int))) return -1; #endif if (fr_checkl4sum(fin) == -1) return -1; #ifdef MGETHDR MGETHDR(m, M_DONTWAIT, MT_HEADER); #else MGET(m, M_DONTWAIT, MT_HEADER); #endif if (m == NULL) return -1; avail = MHLEN; xtra = 0; hlen = 0; ohlen = 0; ifp = fin->fin_ifp; if (fin->fin_v == 4) { if ((fin->fin_p == IPPROTO_ICMP) && !(fin->fin_flx & FI_SHORT)) switch (ntohs(fin->fin_data[0]) >> 8) { case ICMP_ECHO : case ICMP_TSTAMP : case ICMP_IREQ : case ICMP_MASKREQ : break; default : FREE_MB_T(m); return 0; } if (dst == 0) { if (fr_ifpaddr(4, FRI_NORMAL, ifp, &dst4, NULL) == -1) { FREE_MB_T(m); return -1; } } else dst4.s_addr = fin->fin_daddr; hlen = sizeof(ip_t); ohlen = fin->fin_hlen; if (fin->fin_hlen < fin->fin_plen) xtra = MIN(fin->fin_dlen, 8); else xtra = 0; } #ifdef USE_INET6 else if (fin->fin_v == 6) { hlen = sizeof(ip6_t); ohlen = sizeof(ip6_t); type = icmptoicmp6types[type]; if (type == ICMP6_DST_UNREACH) code = icmptoicmp6unreach[code]; if (hlen + sizeof(*icmp) + max_linkhdr + fin->fin_plen > avail) { MCLGET(m, M_DONTWAIT); if ((m->m_flags & M_EXT) == 0) { FREE_MB_T(m); return -1; } avail = MCLBYTES; } xtra = MIN(fin->fin_plen, avail - hlen - sizeof(*icmp) - max_linkhdr); if (dst == 0) { if (fr_ifpaddr(6, FRI_NORMAL, ifp, (struct in_addr *)&dst6, NULL) == -1) { FREE_MB_T(m); return -1; } } else dst6 = fin->fin_dst6; } #endif else { FREE_MB_T(m); return -1; } iclen = hlen + sizeof(*icmp); avail -= (max_linkhdr + iclen); if (avail < 0) { FREE_MB_T(m); return -1; } if (xtra > avail) xtra = avail; iclen += xtra; m->m_data += max_linkhdr; m->m_pkthdr.rcvif = (struct ifnet *)0; m->m_pkthdr.len = iclen; m->m_len = iclen; ip = mtod(m, ip_t *); icmp = (struct icmp *)((char *)ip + hlen); ip2 = (ip_t *)&icmp->icmp_ip; icmp->icmp_type = type; icmp->icmp_code = fin->fin_icode; icmp->icmp_cksum = 0; #ifdef icmp_nextmtu if (type == ICMP_UNREACH && fin->fin_icode == ICMP_UNREACH_NEEDFRAG && ifp) icmp->icmp_nextmtu = htons(((struct ifnet *)ifp)->if_mtu); #endif bcopy((char *)fin->fin_ip, (char *)ip2, ohlen); #ifdef USE_INET6 ip6 = (ip6_t *)ip; if (fin->fin_v == 6) { ip6->ip6_flow = ((ip6_t *)fin->fin_ip)->ip6_flow; ip6->ip6_plen = htons(iclen - hlen); ip6->ip6_nxt = IPPROTO_ICMPV6; ip6->ip6_hlim = 0; ip6->ip6_src = dst6; ip6->ip6_dst = fin->fin_src6; if (xtra > 0) bcopy((char *)fin->fin_ip + ohlen, (char *)&icmp->icmp_ip + ohlen, xtra); icmp->icmp_cksum = in6_cksum(m, IPPROTO_ICMPV6, sizeof(*ip6), iclen 
- hlen); } else #endif { ip2->ip_len = htons(ip2->ip_len); ip2->ip_off = htons(ip2->ip_off); ip->ip_p = IPPROTO_ICMP; ip->ip_src.s_addr = dst4.s_addr; ip->ip_dst.s_addr = fin->fin_saddr; if (xtra > 0) bcopy((char *)fin->fin_ip + ohlen, (char *)&icmp->icmp_ip + ohlen, xtra); icmp->icmp_cksum = ipf_cksum((u_short *)icmp, sizeof(*icmp) + 8); ip->ip_len = iclen; ip->ip_p = IPPROTO_ICMP; } err = fr_send_ip(fin, m, &m); return err; } #if !defined(IPFILTER_LKM) && (__FreeBSD_version < 300000) # if (BSD < 199306) int iplinit __P((void)); int # else void iplinit __P((void)); void # endif iplinit() { if (ipfattach() != 0) printf("IP Filter failed to attach\n"); ip_init(); } #endif /* __FreeBSD_version < 300000 */ /* * m0 - pointer to mbuf where the IP packet starts * mpp - pointer to the mbuf pointer that is the start of the mbuf chain */ int fr_fastroute(m0, mpp, fin, fdp) mb_t *m0, **mpp; fr_info_t *fin; frdest_t *fdp; { register struct ip *ip, *mhip; register struct mbuf *m = *mpp; register struct route *ro; int len, off, error = 0, hlen, code; struct ifnet *ifp, *sifp; struct sockaddr_in *dst; struct route iproute; u_short ip_off; frentry_t *fr; ro = NULL; #ifdef M_WRITABLE /* * HOT FIX/KLUDGE: * * If the mbuf we're about to send is not writable (because of * a cluster reference, for example) we'll need to make a copy * of it since this routine modifies the contents. * * If you have non-crappy network hardware that can transmit data * from the mbuf, rather than making a copy, this is gonna be a * problem. */ if (M_WRITABLE(m) == 0) { m0 = m_dup(m, M_DONTWAIT); if (m0 != 0) { FREE_MB_T(m); m = m0; *mpp = m; } else { error = ENOBUFS; FREE_MB_T(m); goto done; } } #endif #ifdef USE_INET6 if (fin->fin_v == 6) { /* * currently "to " and "to :ip#" are not supported * for IPv6 */ #if (__FreeBSD_version >= 490000) return ip6_output(m0, NULL, NULL, 0, NULL, NULL, NULL); #else return ip6_output(m0, NULL, NULL, 0, NULL, NULL); #endif } #endif hlen = fin->fin_hlen; ip = mtod(m0, struct ip *); /* * Route packet. */ ro = &iproute; bzero((caddr_t)ro, sizeof (*ro)); dst = (struct sockaddr_in *)&ro->ro_dst; dst->sin_family = AF_INET; dst->sin_addr = ip->ip_dst; fr = fin->fin_fr; if (fdp != NULL) ifp = fdp->fd_ifp; else ifp = fin->fin_ifp; if ((ifp == NULL) && (!fr || !(fr->fr_flags & FR_FASTROUTE))) { error = -2; goto bad; } if ((fdp != NULL) && (fdp->fd_ip.s_addr != 0)) dst->sin_addr = fdp->fd_ip; dst->sin_len = sizeof(*dst); in_rtalloc(ro, 0); if ((ifp == NULL) && (ro->ro_rt != NULL)) ifp = ro->ro_rt->rt_ifp; if ((ro->ro_rt == NULL) || (ifp == NULL)) { if (in_localaddr(ip->ip_dst)) error = EHOSTUNREACH; else error = ENETUNREACH; goto bad; } if (ro->ro_rt->rt_flags & RTF_GATEWAY) dst = (struct sockaddr_in *)ro->ro_rt->rt_gateway; if (ro->ro_rt) ro->ro_rt->rt_use++; /* * For input packets which are being "fastrouted", they won't * go back through output filtering and miss their chance to get * NAT'd and counted. Duplicated packets aren't considered to be * part of the normal packet stream, so do not NAT them or pass * them through stateful checking, etc. 
*/ if ((fdp != &fr->fr_dif) && (fin->fin_out == 0)) { sifp = fin->fin_ifp; fin->fin_ifp = ifp; fin->fin_out = 1; (void) fr_acctpkt(fin, NULL); fin->fin_fr = NULL; if (!fr || !(fr->fr_flags & FR_RETMASK)) { u_32_t pass; if (fr_checkstate(fin, &pass) != NULL) fr_statederef((ipstate_t **)&fin->fin_state); } switch (fr_checknatout(fin, NULL)) { case 0 : break; case 1 : fr_natderef((nat_t **)&fin->fin_nat); ip->ip_sum = 0; break; case -1 : error = -1; goto bad; break; } fin->fin_ifp = sifp; fin->fin_out = 0; } else ip->ip_sum = 0; /* * If small enough for interface, can just send directly. */ if (ip->ip_len <= ifp->if_mtu) { ip->ip_len = htons(ip->ip_len); ip->ip_off = htons(ip->ip_off); if (!ip->ip_sum) ip->ip_sum = in_cksum(m, hlen); error = (*ifp->if_output)(ifp, m, (struct sockaddr *)dst, ro); goto done; } /* * Too large for interface; fragment if possible. * Must be able to put at least 8 bytes per fragment. */ ip_off = ntohs(ip->ip_off); if (ip_off & IP_DF) { error = EMSGSIZE; goto bad; } len = (ifp->if_mtu - hlen) &~ 7; if (len < 8) { error = EMSGSIZE; goto bad; } { int mhlen, firstlen = len; struct mbuf **mnext = &m->m_act; /* * Loop through length of segment after first fragment, * make new header and copy data of each part and link onto chain. */ m0 = m; mhlen = sizeof (struct ip); for (off = hlen + len; off < ip->ip_len; off += len) { #ifdef MGETHDR MGETHDR(m, M_DONTWAIT, MT_HEADER); #else MGET(m, M_DONTWAIT, MT_HEADER); #endif if (m == 0) { m = m0; error = ENOBUFS; goto bad; } m->m_data += max_linkhdr; mhip = mtod(m, struct ip *); bcopy((char *)ip, (char *)mhip, sizeof(*ip)); if (hlen > sizeof (struct ip)) { mhlen = ip_optcopy(ip, mhip) + sizeof (struct ip); IP_HL_A(mhip, mhlen >> 2); } m->m_len = mhlen; mhip->ip_off = ((off - hlen) >> 3) + ip_off; if (off + len >= ip->ip_len) len = ip->ip_len - off; else mhip->ip_off |= IP_MF; mhip->ip_len = htons((u_short)(len + mhlen)); *mnext = m; m->m_next = m_copy(m0, off, len); if (m->m_next == 0) { error = ENOBUFS; /* ??? */ goto sendorfree; } m->m_pkthdr.len = mhlen + len; m->m_pkthdr.rcvif = NULL; mhip->ip_off = htons((u_short)mhip->ip_off); mhip->ip_sum = 0; mhip->ip_sum = in_cksum(m, mhlen); mnext = &m->m_act; } /* * Update first fragment by trimming what's been copied out * and updating header, then send each fragment (in order). 
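 *
 * (Editorial aside, not part of the change: when the packet exceeds the
 * interface MTU and IP_DF is clear, fr_fastroute() fragments it.  Each
 * fragment payload is the largest multiple of 8 bytes that fits after the
 * header, len = (ifp->if_mtu - hlen) & ~7, and later fragments record
 * their offset in 8-byte units.  Below is a worked userland example with
 * MTU 1500, a 20-byte header and a 4000-byte packet; the numbers are
 * chosen only for illustration.)
 */

#include <stdio.h>

int
main(void)
{
        int mtu = 1500, hlen = 20, ip_len = 4000;       /* example packet */
        int len, off;

        /* Per-fragment payload: largest multiple of 8 after the header. */
        len = (mtu - hlen) & ~7;
        printf("first fragment carries %d payload bytes\n", len);
        for (off = hlen + len; off < ip_len; off += len) {
                int flen = (off + len >= ip_len) ? ip_len - off : len;

                printf("fragment at byte offset %d (ip_off %d): %d bytes\n",
                    off - hlen, (off - hlen) >> 3, flen);
        }
        return (0);
}

/*
 * (End of editorial aside; the interrupted source resumes below.)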
*/ m_adj(m0, hlen + firstlen - ip->ip_len); ip->ip_len = htons((u_short)(hlen + firstlen)); ip->ip_off = htons((u_short)IP_MF); ip->ip_sum = 0; ip->ip_sum = in_cksum(m0, hlen); sendorfree: for (m = m0; m; m = m0) { m0 = m->m_act; m->m_act = 0; if (error == 0) error = (*ifp->if_output)(ifp, m, (struct sockaddr *)dst, ro); else FREE_MB_T(m); } } done: if (!error) fr_frouteok[0]++; else fr_frouteok[1]++; if ((ro != NULL) && (ro->ro_rt != NULL)) { RTFREE(ro->ro_rt); } *mpp = NULL; return 0; bad: if (error == EMSGSIZE) { sifp = fin->fin_ifp; code = fin->fin_icode; fin->fin_icode = ICMP_UNREACH_NEEDFRAG; fin->fin_ifp = ifp; (void) fr_send_icmp_err(ICMP_UNREACH, fin, 1); fin->fin_ifp = sifp; fin->fin_icode = code; } FREE_MB_T(m); goto done; } int fr_verifysrc(fin) fr_info_t *fin; { struct sockaddr_in *dst; struct route iproute; bzero((char *)&iproute, sizeof(iproute)); dst = (struct sockaddr_in *)&iproute.ro_dst; dst->sin_len = sizeof(*dst); dst->sin_family = AF_INET; dst->sin_addr = fin->fin_src; in_rtalloc(&iproute, 0); if (iproute.ro_rt == NULL) return 0; return (fin->fin_ifp == iproute.ro_rt->rt_ifp); } /* * return the first IP Address associated with an interface */ int fr_ifpaddr(v, atype, ifptr, inp, inpmask) int v, atype; void *ifptr; struct in_addr *inp, *inpmask; { #ifdef USE_INET6 struct in6_addr *inp6 = NULL; #endif struct sockaddr *sock, *mask; struct sockaddr_in *sin; struct ifaddr *ifa; struct ifnet *ifp; if ((ifptr == NULL) || (ifptr == (void *)-1)) return -1; sin = NULL; ifp = ifptr; if (v == 4) inp->s_addr = 0; #ifdef USE_INET6 else if (v == 6) bzero((char *)inp, sizeof(struct in6_addr)); #endif #if (__FreeBSD_version >= 300000) ifa = TAILQ_FIRST(&ifp->if_addrhead); #else ifa = ifp->if_addrlist; #endif /* __FreeBSD_version >= 300000 */ sock = ifa->ifa_addr; while (sock != NULL && ifa != NULL) { sin = (struct sockaddr_in *)sock; if ((v == 4) && (sin->sin_family == AF_INET)) break; #ifdef USE_INET6 if ((v == 6) && (sin->sin_family == AF_INET6)) { inp6 = &((struct sockaddr_in6 *)sin)->sin6_addr; if (!IN6_IS_ADDR_LINKLOCAL(inp6) && !IN6_IS_ADDR_LOOPBACK(inp6)) break; } #endif #if (__FreeBSD_version >= 300000) ifa = TAILQ_NEXT(ifa, ifa_link); #else ifa = ifa->ifa_next; #endif /* __FreeBSD_version >= 300000 */ if (ifa != NULL) sock = ifa->ifa_addr; } if (ifa == NULL || sin == NULL) return -1; mask = ifa->ifa_netmask; if (atype == FRI_BROADCAST) sock = ifa->ifa_broadaddr; else if (atype == FRI_PEERADDR) sock = ifa->ifa_dstaddr; if (sock == NULL) return -1; #ifdef USE_INET6 if (v == 6) { return fr_ifpfillv6addr(atype, (struct sockaddr_in6 *)sock, (struct sockaddr_in6 *)mask, inp, inpmask); } #endif return fr_ifpfillv4addr(atype, (struct sockaddr_in *)sock, (struct sockaddr_in *)mask, inp, inpmask); } u_32_t fr_newisn(fin) fr_info_t *fin; { u_32_t newiss; #if (__FreeBSD_version >= 400000) newiss = arc4random(); #else static iss_seq_off = 0; u_char hash[16]; MD5_CTX ctx; /* * Compute the base value of the ISS. It is a hash * of (saddr, sport, daddr, dport, secret). */ MD5Init(&ctx); MD5Update(&ctx, (u_char *) &fin->fin_fi.fi_src, sizeof(fin->fin_fi.fi_src)); MD5Update(&ctx, (u_char *) &fin->fin_fi.fi_dst, sizeof(fin->fin_fi.fi_dst)); MD5Update(&ctx, (u_char *) &fin->fin_dat, sizeof(fin->fin_dat)); MD5Update(&ctx, ipf_iss_secret, sizeof(ipf_iss_secret)); MD5Final(hash, &ctx); memcpy(&newiss, hash, sizeof(newiss)); /* * Now increment our "timer", and add it in to * the computed value. * * XXX Use `addin'? * XXX TCP_ISSINCR too large to use? 
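The fragmentation path in fr_fastroute() above rounds the per-fragment payload down to a multiple of 8 bytes, stores fragment offsets in 8-byte units, and sets IP_MF on every fragment except the last. A minimal standalone sketch of that arithmetic, using hypothetical example_* names that are not part of this patch:

    #include <stdio.h>

    #define EX_IP_MF 0x2000     /* "more fragments" flag, as in <netinet/ip.h> */

    /* Print the ip_off/length values fr_fastroute()-style fragmentation yields. */
    static void
    example_fragment(int mtu, int hlen, int total_len)
    {
        int payload = total_len - hlen;
        int len = (mtu - hlen) & ~7;    /* per-fragment payload, multiple of 8 */
        int off;

        if (len < 8)
            return;                     /* cannot fragment */
        for (off = 0; off < payload; off += len) {
            int thislen = (payload - off < len) ? payload - off : len;
            int ip_off = off >> 3;      /* offsets are carried in 8-byte units */

            if (off + thislen < payload)
                ip_off |= EX_IP_MF;
            printf("fragment: ip_off=0x%04x, %d header + %d data bytes\n",
                ip_off, hlen, thislen);
        }
    }

    int
    main(void)
    {
        example_fragment(1500, 20, 4000);   /* a 4000-byte datagram over Ethernet */
        return 0;
    }

For a 1500-byte MTU and 20-byte header this produces two 1480-byte fragments with IP_MF set and a final 1020-byte fragment without it, matching the loop above.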
*/ iss_seq_off += 0x00010000; newiss += iss_seq_off; #endif return newiss; } /* ------------------------------------------------------------------------ */ /* Function: fr_nextipid */ /* Returns: int - 0 == success, -1 == error (packet should be droppped) */ /* Parameters: fin(I) - pointer to packet information */ /* */ /* Returns the next IPv4 ID to use for this packet. */ /* ------------------------------------------------------------------------ */ u_short fr_nextipid(fin) fr_info_t *fin; { #ifndef RANDOM_IP_ID static u_short ipid = 0; u_short id; MUTEX_ENTER(&ipf_rw); id = ipid++; MUTEX_EXIT(&ipf_rw); #else u_short id; id = ip_randomid(); #endif return id; } INLINE void fr_checkv4sum(fin) fr_info_t *fin; { #ifdef CSUM_DATA_VALID int manual = 0; u_short sum; ip_t *ip; mb_t *m; if ((fin->fin_flx & FI_NOCKSUM) != 0) return; if (fin->fin_cksum != 0) return; m = fin->fin_m; if (m == NULL) { manual = 1; goto skipauto; } ip = fin->fin_ip; if (m->m_pkthdr.csum_flags & CSUM_DATA_VALID) { if (m->m_pkthdr.csum_flags & CSUM_PSEUDO_HDR) sum = m->m_pkthdr.csum_data; else sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htonl(m->m_pkthdr.csum_data + fin->fin_ip->ip_len + fin->fin_p)); sum ^= 0xffff; if (sum != 0) { fin->fin_flx |= FI_BAD; fin->fin_cksum = -1; } else { fin->fin_cksum = 1; } } else manual = 1; skipauto: # ifdef IPFILTER_CKSUM if (manual != 0) if (fr_checkl4sum(fin) == -1) fin->fin_flx |= FI_BAD; # else ; # endif #else # ifdef IPFILTER_CKSUM if (fr_checkl4sum(fin) == -1) fin->fin_flx |= FI_BAD; # endif #endif } #ifdef USE_INET6 INLINE void fr_checkv6sum(fin) fr_info_t *fin; { # ifdef IPFILTER_CKSUM if (fr_checkl4sum(fin) == -1) fin->fin_flx |= FI_BAD; # endif } #endif /* USE_INET6 */ size_t mbufchainlen(m0) struct mbuf *m0; { size_t len; if ((m0->m_flags & M_PKTHDR) != 0) { len = m0->m_pkthdr.len; } else { struct mbuf *m; for (m = m0, len = 0; m != NULL; m = m->m_next) len += m->m_len; } return len; } /* ------------------------------------------------------------------------ */ /* Function: fr_pullup */ /* Returns: NULL == pullup failed, else pointer to protocol header */ /* Parameters: m(I) - pointer to buffer where data packet starts */ /* fin(I) - pointer to packet information */ /* len(I) - number of bytes to pullup */ /* */ /* Attempt to move at least len bytes (from the start of the buffer) into a */ /* single buffer for ease of access. Operating system native functions are */ /* used to manage buffers - if necessary. If the entire packet ends up in */ /* a single buffer, set the FI_COALESCE flag even though fr_coalesce() has */ /* not been called. Both fin_ip and fin_dp are updated before exiting _IF_ */ /* and ONLY if the pullup succeeds. */ /* */ /* We assume that 'min' is a pointer to a buffer that is part of the chain */ /* of buffers that starts at *fin->fin_mp. */ /* ------------------------------------------------------------------------ */ void *fr_pullup(min, fin, len) mb_t *min; fr_info_t *fin; int len; { int out = fin->fin_out, dpoff, ipoff; mb_t *m = min; char *ip; if (m == NULL) return NULL; ip = (char *)fin->fin_ip; if ((fin->fin_flx & FI_COALESCE) != 0) return ip; ipoff = fin->fin_ipoff; if (fin->fin_dp != NULL) dpoff = (char *)fin->fin_dp - (char *)ip; else dpoff = 0; if (M_LEN(m) < len) { #ifdef MHLEN /* * Assume that M_PKTHDR is set and just work with what is left * rather than check.. * Should not make any real difference, anyway. 
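fr_checkv4sum() above accepts a hardware-assisted checksum when the mbuf carries CSUM_DATA_VALID, folding in the pseudo-header itself when the NIC did not. The acceptance test rests on the usual one's-complement property: summing a correct segment, checksum field included, folds to 0xffff. A small standalone illustration of that property (hypothetical example_* names, not part of the ipfilter sources):

    #include <stdint.h>
    #include <stdio.h>

    /* Fold a buffer into a 16-bit one's-complement sum (RFC 1071 style). */
    static uint16_t
    example_cksum_fold(const void *buf, size_t len, uint32_t start)
    {
        const uint8_t *p = buf;
        uint32_t sum = start;

        while (len > 1) {
            sum += (uint32_t)p[0] << 8 | p[1];
            p += 2;
            len -= 2;
        }
        if (len == 1)
            sum += (uint32_t)p[0] << 8;
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)sum;
    }

    int
    main(void)
    {
        uint8_t seg[8] = { 0x45, 0x00, 0x00, 0x08, 0x12, 0x34, 0x00, 0x00 };
        uint16_t sum, check;

        /* Treat bytes 6-7 as this toy segment's checksum field. */
        check = ~example_cksum_fold(seg, sizeof(seg), 0);
        seg[6] = check >> 8;
        seg[7] = check & 0xff;

        /* A receiver summing the whole segment now folds to 0xffff. */
        sum = example_cksum_fold(seg, sizeof(seg), 0);
        printf("verification sum = 0x%04x (0xffff means good)\n", sum);
        return 0;
    }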
*/ if (len > MHLEN) #else if (len > MLEN) #endif { #ifdef HAVE_M_PULLDOWN if (m_pulldown(m, 0, len, NULL) == NULL) m = NULL; #else FREE_MB_T(*fin->fin_mp); m = NULL; #endif } else { m = m_pullup(m, len); } *fin->fin_mp = m; if (m == NULL) { fin->fin_m = NULL; ATOMIC_INCL(frstats[out].fr_pull[1]); return NULL; } while (M_LEN(m) == 0) { m = m->m_next; } fin->fin_m = m; ip = MTOD(m, char *) + ipoff; } ATOMIC_INCL(frstats[out].fr_pull[0]); fin->fin_ip = (ip_t *)ip; if (fin->fin_dp != NULL) fin->fin_dp = (char *)fin->fin_ip + dpoff; if (len == fin->fin_plen) fin->fin_flx |= FI_COALESCE; return ip; } int ipf_inject(fin, m) fr_info_t *fin; mb_t *m; { int error = 0; if (fin->fin_out == 0) { #if (__FreeBSD_version >= 501000) netisr_dispatch(NETISR_IP, m); #else struct ifqueue *ifq; ifq = &ipintrq; # ifdef _IF_QFULL if (_IF_QFULL(ifq)) # else if (IF_QFULL(ifq)) # endif { # ifdef _IF_DROP _IF_DROP(ifq); # else IF_DROP(ifq); # endif FREE_MB_T(m); error = ENOBUFS; } else { IF_ENQUEUE(ifq, m); } #endif } else { fin->fin_ip->ip_len = ntohs(fin->fin_ip->ip_len); fin->fin_ip->ip_off = ntohs(fin->fin_ip->ip_off); #if (__FreeBSD_version >= 470102) error = ip_output(m, NULL, NULL, IP_FORWARDING, NULL, NULL); #else error = ip_output(m, NULL, NULL, IP_FORWARDING, NULL); #endif } return error; } int ipf_pfil_unhook(void) { #if defined(NETBSD_PF) && (__FreeBSD_version >= 500011) # if __FreeBSD_version >= 501108 struct pfil_head *ph_inet; # ifdef USE_INET6 struct pfil_head *ph_inet6; # endif # endif #endif #ifdef NETBSD_PF # if (__FreeBSD_version >= 500011) # if (__FreeBSD_version >= 501108) ph_inet = pfil_head_get(PFIL_TYPE_AF, AF_INET); if (ph_inet != NULL) pfil_remove_hook((void *)fr_check_wrapper, NULL, PFIL_IN|PFIL_OUT|PFIL_WAITOK, ph_inet); # else pfil_remove_hook((void *)fr_check, PFIL_IN|PFIL_OUT|PFIL_WAITOK, &inetsw[ip_protox[IPPROTO_IP]].pr_pfh); # endif # else pfil_remove_hook((void *)fr_check, PFIL_IN|PFIL_OUT|PFIL_WAITOK); # endif # ifdef USE_INET6 # if (__FreeBSD_version >= 501108) ph_inet6 = pfil_head_get(PFIL_TYPE_AF, AF_INET6); if (ph_inet6 != NULL) pfil_remove_hook((void *)fr_check_wrapper6, NULL, PFIL_IN|PFIL_OUT|PFIL_WAITOK, ph_inet6); # else pfil_remove_hook((void *)fr_check, PFIL_IN|PFIL_OUT|PFIL_WAITOK, &inet6sw[ip6_protox[IPPROTO_IPV6]].pr_pfh); # endif # endif #endif return (0); } int ipf_pfil_hook(void) { #if defined(NETBSD_PF) && (__FreeBSD_version >= 500011) # if __FreeBSD_version >= 501108 struct pfil_head *ph_inet; # ifdef USE_INET6 struct pfil_head *ph_inet6; # endif # endif #endif # ifdef NETBSD_PF # if __FreeBSD_version >= 500011 # if __FreeBSD_version >= 501108 ph_inet = pfil_head_get(PFIL_TYPE_AF, AF_INET); # ifdef USE_INET6 ph_inet6 = pfil_head_get(PFIL_TYPE_AF, AF_INET6); # endif if (ph_inet == NULL # ifdef USE_INET6 && ph_inet6 == NULL # endif ) return ENODEV; if (ph_inet != NULL) pfil_add_hook((void *)fr_check_wrapper, NULL, PFIL_IN|PFIL_OUT|PFIL_WAITOK, ph_inet); # else pfil_add_hook((void *)fr_check, PFIL_IN|PFIL_OUT|PFIL_WAITOK, &inetsw[ip_protox[IPPROTO_IP]].pr_pfh); # endif # else pfil_add_hook((void *)fr_check, PFIL_IN|PFIL_OUT|PFIL_WAITOK); # endif # ifdef USE_INET6 # if __FreeBSD_version >= 501108 if (ph_inet6 != NULL) pfil_add_hook((void *)fr_check_wrapper6, NULL, PFIL_IN|PFIL_OUT|PFIL_WAITOK, ph_inet6); # else pfil_add_hook((void *)fr_check, PFIL_IN|PFIL_OUT|PFIL_WAITOK, &inet6sw[ip6_protox[IPPROTO_IPV6]].pr_pfh); # endif # endif # endif return (0); } void ipf_event_reg(void) { #if (__FreeBSD_version >= 502103) ipf_arrivetag = 
EVENTHANDLER_REGISTER(ifnet_arrival_event, \ ipf_ifevent, NULL, \ EVENTHANDLER_PRI_ANY); ipf_departtag = EVENTHANDLER_REGISTER(ifnet_departure_event, \ ipf_ifevent, NULL, \ EVENTHANDLER_PRI_ANY); ipf_clonetag = EVENTHANDLER_REGISTER(if_clone_event, ipf_ifevent, \ NULL, EVENTHANDLER_PRI_ANY); #endif } void ipf_event_dereg(void) { #if (__FreeBSD_version >= 502103) if (ipf_arrivetag != NULL) { EVENTHANDLER_DEREGISTER(ifnet_arrival_event, ipf_arrivetag); } if (ipf_departtag != NULL) { EVENTHANDLER_DEREGISTER(ifnet_departure_event, ipf_departtag); } if (ipf_clonetag != NULL) { EVENTHANDLER_DEREGISTER(if_clone_event, ipf_clonetag); } #endif } Index: head/sys/contrib/ipfilter/netinet/ip_nat.c =================================================================== --- head/sys/contrib/ipfilter/netinet/ip_nat.c (revision 192894) +++ head/sys/contrib/ipfilter/netinet/ip_nat.c (revision 192895) @@ -1,5489 +1,5493 @@ /* $FreeBSD$ */ /* * Copyright (C) 1995-2003 by Darren Reed. * * See the IPFILTER.LICENCE file for details on licencing. */ #if defined(KERNEL) || defined(_KERNEL) # undef KERNEL # undef _KERNEL # define KERNEL 1 # define _KERNEL 1 #endif #include #include #include #include #include #if defined(_KERNEL) && defined(__NetBSD_Version__) && \ (__NetBSD_Version__ >= 399002000) # include #endif #if defined(__NetBSD__) && (NetBSD >= 199905) && !defined(IPFILTER_LKM) && \ defined(_KERNEL) #if defined(__NetBSD_Version__) && (__NetBSD_Version__ < 399001400) # include "opt_ipfilter_log.h" # else # include "opt_ipfilter.h" # endif #endif #if !defined(_KERNEL) # include # include # include # define _KERNEL # ifdef __OpenBSD__ struct file; # endif # include # undef _KERNEL #endif #if defined(_KERNEL) && (__FreeBSD_version >= 220000) # include # include #else # include #endif #if !defined(AIX) # include #endif #if !defined(linux) # include #endif #include #if defined(_KERNEL) # include # if !defined(__SVR4) && !defined(__svr4__) # include # endif #endif #if defined(__SVR4) || defined(__svr4__) # include # include # ifdef _KERNEL # include # endif # include # include #endif #if __FreeBSD_version >= 300000 # include #endif #include #if __FreeBSD_version >= 300000 # include # if defined(_KERNEL) && !defined(IPFILTER_LKM) # include "opt_ipfilter.h" # endif #endif #ifdef sun # include #endif #include #include #include #include #ifdef RFC1825 # include # include extern struct ifnet vpnif; #endif #if !defined(linux) # include #endif #include #include #include #include "netinet/ip_compat.h" #include #include "netinet/ip_fil.h" #include "netinet/ip_nat.h" #include "netinet/ip_frag.h" #include "netinet/ip_state.h" #include "netinet/ip_proxy.h" #ifdef IPFILTER_SYNC #include "netinet/ip_sync.h" #endif #if (__FreeBSD_version >= 300000) # include #endif /* END OF INCLUDES */ #undef SOCKADDR_IN #define SOCKADDR_IN struct sockaddr_in #if !defined(lint) static const char sccsid[] = "@(#)ip_nat.c 1.11 6/5/96 (C) 1995 Darren Reed"; static const char rcsid[] = "@(#)$FreeBSD$"; /* static const char rcsid[] = "@(#)$Id: ip_nat.c,v 2.195.2.102 2007/10/16 10:08:10 darrenr Exp $"; */ #endif /* ======================================================================== */ /* How the NAT is organised and works. 
*/ /* */ /* Inside (interface y) NAT Outside (interface x) */ /* -------------------- -+- ------------------------------------- */ /* Packet going | out, processsed by fr_checknatout() for x */ /* ------------> | ------------> */ /* src=10.1.1.1 | src=192.1.1.1 */ /* | */ /* | in, processed by fr_checknatin() for x */ /* <------------ | <------------ */ /* dst=10.1.1.1 | dst=192.1.1.1 */ /* -------------------- -+- ------------------------------------- */ /* fr_checknatout() - changes ip_src and if required, sport */ /* - creates a new mapping, if required. */ /* fr_checknatin() - changes ip_dst and if required, dport */ /* */ /* In the NAT table, internal source is recorded as "in" and externally */ /* seen as "out". */ /* ======================================================================== */ nat_t **nat_table[2] = { NULL, NULL }, *nat_instances = NULL; ipnat_t *nat_list = NULL; u_int ipf_nattable_max = NAT_TABLE_MAX; u_int ipf_nattable_sz = NAT_TABLE_SZ; u_int ipf_natrules_sz = NAT_SIZE; u_int ipf_rdrrules_sz = RDR_SIZE; u_int ipf_hostmap_sz = HOSTMAP_SIZE; u_int fr_nat_maxbucket = 0, fr_nat_maxbucket_reset = 1; u_32_t nat_masks = 0; u_32_t rdr_masks = 0; u_long nat_last_force_flush = 0; ipnat_t **nat_rules = NULL; ipnat_t **rdr_rules = NULL; hostmap_t **ipf_hm_maptable = NULL; hostmap_t *ipf_hm_maplist = NULL; ipftq_t nat_tqb[IPF_TCP_NSTATES]; ipftq_t nat_udptq; ipftq_t nat_icmptq; ipftq_t nat_iptq; ipftq_t *nat_utqe = NULL; int fr_nat_doflush = 0; #ifdef IPFILTER_LOG int nat_logging = 1; #else int nat_logging = 0; #endif u_long fr_defnatage = DEF_NAT_AGE, fr_defnatipage = 120, /* 60 seconds */ fr_defnaticmpage = 6; /* 3 seconds */ natstat_t nat_stats; int fr_nat_lock = 0; int fr_nat_init = 0; #if SOLARIS && !defined(_INET_IP_STACK_H) extern int pfil_delayed_copy; #endif static int nat_flush_entry __P((void *)); static int nat_flushtable __P((void)); static int nat_clearlist __P((void)); static void nat_addnat __P((struct ipnat *)); static void nat_addrdr __P((struct ipnat *)); static void nat_delrdr __P((struct ipnat *)); static void nat_delnat __P((struct ipnat *)); static int fr_natgetent __P((caddr_t, int)); static int fr_natgetsz __P((caddr_t, int)); static int fr_natputent __P((caddr_t, int)); static int nat_extraflush __P((int)); static int nat_gettable __P((char *)); static void nat_tabmove __P((nat_t *)); static int nat_match __P((fr_info_t *, ipnat_t *)); static INLINE int nat_newmap __P((fr_info_t *, nat_t *, natinfo_t *)); static INLINE int nat_newrdr __P((fr_info_t *, nat_t *, natinfo_t *)); static hostmap_t *nat_hostmap __P((ipnat_t *, struct in_addr, struct in_addr, struct in_addr, u_32_t)); static int nat_icmpquerytype4 __P((int)); static int nat_siocaddnat __P((ipnat_t *, ipnat_t **, int)); static void nat_siocdelnat __P((ipnat_t *, ipnat_t **, int)); static int nat_finalise __P((fr_info_t *, nat_t *, natinfo_t *, tcphdr_t *, nat_t **, int)); static int nat_resolverule __P((ipnat_t *)); static nat_t *fr_natclone __P((fr_info_t *, nat_t *)); static void nat_mssclamp __P((tcphdr_t *, u_32_t, fr_info_t *, u_short *)); static int nat_wildok __P((nat_t *, int, int, int, int)); static int nat_getnext __P((ipftoken_t *, ipfgeniter_t *)); static int nat_iterator __P((ipftoken_t *, ipfgeniter_t *)); /* ------------------------------------------------------------------------ */ /* Function: fr_natinit */ /* Returns: int - 0 == success, -1 == failure */ /* Parameters: Nil */ /* */ /* Initialise all of the NAT locks, tables and other structures. 
*/ /* ------------------------------------------------------------------------ */ int fr_natinit() { int i; KMALLOCS(nat_table[0], nat_t **, sizeof(nat_t *) * ipf_nattable_sz); if (nat_table[0] != NULL) bzero((char *)nat_table[0], ipf_nattable_sz * sizeof(nat_t *)); else return -1; KMALLOCS(nat_table[1], nat_t **, sizeof(nat_t *) * ipf_nattable_sz); if (nat_table[1] != NULL) bzero((char *)nat_table[1], ipf_nattable_sz * sizeof(nat_t *)); else return -2; KMALLOCS(nat_rules, ipnat_t **, sizeof(ipnat_t *) * ipf_natrules_sz); if (nat_rules != NULL) bzero((char *)nat_rules, ipf_natrules_sz * sizeof(ipnat_t *)); else return -3; KMALLOCS(rdr_rules, ipnat_t **, sizeof(ipnat_t *) * ipf_rdrrules_sz); if (rdr_rules != NULL) bzero((char *)rdr_rules, ipf_rdrrules_sz * sizeof(ipnat_t *)); else return -4; KMALLOCS(ipf_hm_maptable, hostmap_t **, \ sizeof(hostmap_t *) * ipf_hostmap_sz); if (ipf_hm_maptable != NULL) bzero((char *)ipf_hm_maptable, sizeof(hostmap_t *) * ipf_hostmap_sz); else return -5; ipf_hm_maplist = NULL; KMALLOCS(nat_stats.ns_bucketlen[0], u_long *, ipf_nattable_sz * sizeof(u_long)); if (nat_stats.ns_bucketlen[0] == NULL) return -6; bzero((char *)nat_stats.ns_bucketlen[0], ipf_nattable_sz * sizeof(u_long)); KMALLOCS(nat_stats.ns_bucketlen[1], u_long *, ipf_nattable_sz * sizeof(u_long)); if (nat_stats.ns_bucketlen[1] == NULL) return -7; bzero((char *)nat_stats.ns_bucketlen[1], ipf_nattable_sz * sizeof(u_long)); if (fr_nat_maxbucket == 0) { for (i = ipf_nattable_sz; i > 0; i >>= 1) fr_nat_maxbucket++; fr_nat_maxbucket *= 2; } fr_sttab_init(nat_tqb); /* * Increase this because we may have "keep state" following this too * and packet storms can occur if this is removed too quickly. */ nat_tqb[IPF_TCPS_CLOSED].ifq_ttl = fr_tcplastack; nat_tqb[IPF_TCP_NSTATES - 1].ifq_next = &nat_udptq; nat_udptq.ifq_ttl = fr_defnatage; nat_udptq.ifq_ref = 1; nat_udptq.ifq_head = NULL; nat_udptq.ifq_tail = &nat_udptq.ifq_head; MUTEX_INIT(&nat_udptq.ifq_lock, "nat ipftq udp tab"); nat_udptq.ifq_next = &nat_icmptq; nat_icmptq.ifq_ttl = fr_defnaticmpage; nat_icmptq.ifq_ref = 1; nat_icmptq.ifq_head = NULL; nat_icmptq.ifq_tail = &nat_icmptq.ifq_head; MUTEX_INIT(&nat_icmptq.ifq_lock, "nat icmp ipftq tab"); nat_icmptq.ifq_next = &nat_iptq; nat_iptq.ifq_ttl = fr_defnatipage; nat_iptq.ifq_ref = 1; nat_iptq.ifq_head = NULL; nat_iptq.ifq_tail = &nat_iptq.ifq_head; MUTEX_INIT(&nat_iptq.ifq_lock, "nat ip ipftq tab"); nat_iptq.ifq_next = NULL; for (i = 0; i < IPF_TCP_NSTATES; i++) { if (nat_tqb[i].ifq_ttl < fr_defnaticmpage) nat_tqb[i].ifq_ttl = fr_defnaticmpage; #ifdef LARGE_NAT else if (nat_tqb[i].ifq_ttl > fr_defnatage) nat_tqb[i].ifq_ttl = fr_defnatage; #endif } /* * Increase this because we may have "keep state" following * this too and packet storms can occur if this is removed * too quickly. */ nat_tqb[IPF_TCPS_CLOSED].ifq_ttl = nat_tqb[IPF_TCPS_LAST_ACK].ifq_ttl; RWLOCK_INIT(&ipf_nat, "ipf IP NAT rwlock"); RWLOCK_INIT(&ipf_natfrag, "ipf IP NAT-Frag rwlock"); MUTEX_INIT(&ipf_nat_new, "ipf nat new mutex"); MUTEX_INIT(&ipf_natio, "ipf nat io mutex"); fr_nat_init = 1; return 0; } /* ------------------------------------------------------------------------ */ /* Function: nat_addrdr */ /* Returns: Nil */ /* Parameters: n(I) - pointer to NAT rule to add */ /* */ /* Adds a redirect rule to the hash table of redirect rules and the list of */ /* loaded NAT rules. Updates the bitmask indicating which netmasks are in */ /* use by redirect rules. 
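fr_natinit() above derives fr_nat_maxbucket, the per-bucket chain limit, from the hash table size: roughly twice the number of significant bits in ipf_nattable_sz. A tiny sketch of that heuristic, with hypothetical example_* names:

    #include <stdio.h>

    /* Mirror the fr_natinit() heuristic: about 2 * log2(table size). */
    static unsigned int
    example_maxbucket(unsigned int table_sz)
    {
        unsigned int i, maxbucket = 0;

        for (i = table_sz; i > 0; i >>= 1)
            maxbucket++;
        return maxbucket * 2;
    }

    int
    main(void)
    {
        /* A 2048-bucket table allows chains of up to 24 entries. */
        printf("maxbucket(2048) = %u\n", example_maxbucket(2048));
        return 0;
    }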
*/ /* ------------------------------------------------------------------------ */ static void nat_addrdr(n) ipnat_t *n; { ipnat_t **np; u_32_t j; u_int hv; int k; k = count4bits(n->in_outmsk); if ((k >= 0) && (k != 32)) rdr_masks |= 1 << k; j = (n->in_outip & n->in_outmsk); hv = NAT_HASH_FN(j, 0, ipf_rdrrules_sz); np = rdr_rules + hv; while (*np != NULL) np = &(*np)->in_rnext; n->in_rnext = NULL; n->in_prnext = np; n->in_hv = hv; *np = n; } /* ------------------------------------------------------------------------ */ /* Function: nat_addnat */ /* Returns: Nil */ /* Parameters: n(I) - pointer to NAT rule to add */ /* */ /* Adds a NAT map rule to the hash table of rules and the list of loaded */ /* NAT rules. Updates the bitmask indicating which netmasks are in use by */ /* redirect rules. */ /* ------------------------------------------------------------------------ */ static void nat_addnat(n) ipnat_t *n; { ipnat_t **np; u_32_t j; u_int hv; int k; k = count4bits(n->in_inmsk); if ((k >= 0) && (k != 32)) nat_masks |= 1 << k; j = (n->in_inip & n->in_inmsk); hv = NAT_HASH_FN(j, 0, ipf_natrules_sz); np = nat_rules + hv; while (*np != NULL) np = &(*np)->in_mnext; n->in_mnext = NULL; n->in_pmnext = np; n->in_hv = hv; *np = n; } /* ------------------------------------------------------------------------ */ /* Function: nat_delrdr */ /* Returns: Nil */ /* Parameters: n(I) - pointer to NAT rule to delete */ /* */ /* Removes a redirect rule from the hash table of redirect rules. */ /* ------------------------------------------------------------------------ */ static void nat_delrdr(n) ipnat_t *n; { if (n->in_rnext) n->in_rnext->in_prnext = n->in_prnext; *n->in_prnext = n->in_rnext; } /* ------------------------------------------------------------------------ */ /* Function: nat_delnat */ /* Returns: Nil */ /* Parameters: n(I) - pointer to NAT rule to delete */ /* */ /* Removes a NAT map rule from the hash table of NAT map rules. */ /* ------------------------------------------------------------------------ */ static void nat_delnat(n) ipnat_t *n; { if (n->in_mnext != NULL) n->in_mnext->in_pmnext = n->in_pmnext; *n->in_pmnext = n->in_mnext; } /* ------------------------------------------------------------------------ */ /* Function: nat_hostmap */ /* Returns: struct hostmap* - NULL if no hostmap could be created, */ /* else a pointer to the hostmapping to use */ /* Parameters: np(I) - pointer to NAT rule */ /* real(I) - real IP address */ /* map(I) - mapped IP address */ /* port(I) - destination port number */ /* Write Locks: ipf_nat */ /* */ /* Check if an ip address has already been allocated for a given mapping */ /* that is not doing port based translation. If is not yet allocated, then */ /* create a new entry if a non-NULL NAT rule pointer has been supplied. 
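nat_delrdr() and nat_delnat() above (and fr_hostmapdel() below) rely on the "pointer to the previous link" idiom: each entry stores the address of whatever pointer refers to it, so it can be unlinked in O(1) without walking the list or knowing the head. A generic sketch of the pattern, using hypothetical ex_* names:

    #include <stddef.h>

    struct ex_node {
        struct ex_node  *next;      /* next entry, or NULL */
        struct ex_node  **pnext;    /* address of the pointer that points here */
    };

    /* Insert at the head of a list whose head pointer is *headp. */
    static void
    ex_insert(struct ex_node **headp, struct ex_node *n)
    {
        n->next = *headp;
        if (n->next != NULL)
            n->next->pnext = &n->next;
        n->pnext = headp;
        *headp = n;
    }

    /* Remove without the head: repoint the neighbour's back pointer first. */
    static void
    ex_remove(struct ex_node *n)
    {
        if (n->next != NULL)
            n->next->pnext = n->pnext;
        *n->pnext = n->next;
        n->next = NULL;
        n->pnext = NULL;
    }

    int
    main(void)
    {
        struct ex_node a = { NULL, NULL }, *head = NULL;

        ex_insert(&head, &a);
        ex_remove(&a);
        return (head == NULL) ? 0 : 1;
    }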
*/ /* ------------------------------------------------------------------------ */ static struct hostmap *nat_hostmap(np, src, dst, map, port) ipnat_t *np; struct in_addr src; struct in_addr dst; struct in_addr map; u_32_t port; { hostmap_t *hm; u_int hv; hv = (src.s_addr ^ dst.s_addr); hv += src.s_addr; hv += dst.s_addr; hv %= HOSTMAP_SIZE; for (hm = ipf_hm_maptable[hv]; hm; hm = hm->hm_next) if ((hm->hm_srcip.s_addr == src.s_addr) && (hm->hm_dstip.s_addr == dst.s_addr) && ((np == NULL) || (np == hm->hm_ipnat)) && ((port == 0) || (port == hm->hm_port))) { hm->hm_ref++; return hm; } if (np == NULL) return NULL; KMALLOC(hm, hostmap_t *); if (hm) { hm->hm_next = ipf_hm_maplist; hm->hm_pnext = &ipf_hm_maplist; if (ipf_hm_maplist != NULL) ipf_hm_maplist->hm_pnext = &hm->hm_next; ipf_hm_maplist = hm; hm->hm_hnext = ipf_hm_maptable[hv]; hm->hm_phnext = ipf_hm_maptable + hv; if (ipf_hm_maptable[hv] != NULL) ipf_hm_maptable[hv]->hm_phnext = &hm->hm_hnext; ipf_hm_maptable[hv] = hm; hm->hm_ipnat = np; hm->hm_srcip = src; hm->hm_dstip = dst; hm->hm_mapip = map; hm->hm_ref = 1; hm->hm_port = port; } return hm; } /* ------------------------------------------------------------------------ */ /* Function: fr_hostmapdel */ /* Returns: Nil */ /* Parameters: hmp(I) - pointer to hostmap structure pointer */ /* Write Locks: ipf_nat */ /* */ /* Decrement the references to this hostmap structure by one. If this */ /* reaches zero then remove it and free it. */ /* ------------------------------------------------------------------------ */ void fr_hostmapdel(hmp) struct hostmap **hmp; { struct hostmap *hm; hm = *hmp; *hmp = NULL; hm->hm_ref--; if (hm->hm_ref == 0) { if (hm->hm_hnext) hm->hm_hnext->hm_phnext = hm->hm_phnext; *hm->hm_phnext = hm->hm_hnext; if (hm->hm_next) hm->hm_next->hm_pnext = hm->hm_pnext; *hm->hm_pnext = hm->hm_next; KFREE(hm); } } /* ------------------------------------------------------------------------ */ /* Function: fix_outcksum */ /* Returns: Nil */ /* Parameters: fin(I) - pointer to packet information */ /* sp(I) - location of 16bit checksum to update */ /* n((I) - amount to adjust checksum by */ /* */ /* Adjusts the 16bit checksum by "n" for packets going out. */ /* ------------------------------------------------------------------------ */ void fix_outcksum(fin, sp, n) fr_info_t *fin; u_short *sp; u_32_t n; { u_short sumshort; u_32_t sum1; if (n == 0) return; if (n & NAT_HW_CKSUM) { n &= 0xffff; n += fin->fin_dlen; n = (n & 0xffff) + (n >> 16); *sp = n & 0xffff; return; } sum1 = (~ntohs(*sp)) & 0xffff; sum1 += (n); sum1 = (sum1 >> 16) + (sum1 & 0xffff); /* Again */ sum1 = (sum1 >> 16) + (sum1 & 0xffff); sumshort = ~(u_short)sum1; *(sp) = htons(sumshort); } /* ------------------------------------------------------------------------ */ /* Function: fix_incksum */ /* Returns: Nil */ /* Parameters: fin(I) - pointer to packet information */ /* sp(I) - location of 16bit checksum to update */ /* n((I) - amount to adjust checksum by */ /* */ /* Adjusts the 16bit checksum by "n" for packets going in. 
*/ /* ------------------------------------------------------------------------ */ void fix_incksum(fin, sp, n) fr_info_t *fin; u_short *sp; u_32_t n; { u_short sumshort; u_32_t sum1; if (n == 0) return; if (n & NAT_HW_CKSUM) { n &= 0xffff; n += fin->fin_dlen; n = (n & 0xffff) + (n >> 16); *sp = n & 0xffff; return; } sum1 = (~ntohs(*sp)) & 0xffff; sum1 += ~(n) & 0xffff; sum1 = (sum1 >> 16) + (sum1 & 0xffff); /* Again */ sum1 = (sum1 >> 16) + (sum1 & 0xffff); sumshort = ~(u_short)sum1; *(sp) = htons(sumshort); } /* ------------------------------------------------------------------------ */ /* Function: fix_datacksum */ /* Returns: Nil */ /* Parameters: sp(I) - location of 16bit checksum to update */ /* n((I) - amount to adjust checksum by */ /* */ /* Fix_datacksum is used *only* for the adjustments of checksums in the */ /* data section of an IP packet. */ /* */ /* The only situation in which you need to do this is when NAT'ing an */ /* ICMP error message. Such a message, contains in its body the IP header */ /* of the original IP packet, that causes the error. */ /* */ /* You can't use fix_incksum or fix_outcksum in that case, because for the */ /* kernel the data section of the ICMP error is just data, and no special */ /* processing like hardware cksum or ntohs processing have been done by the */ /* kernel on the data section. */ /* ------------------------------------------------------------------------ */ void fix_datacksum(sp, n) u_short *sp; u_32_t n; { u_short sumshort; u_32_t sum1; if (n == 0) return; sum1 = (~ntohs(*sp)) & 0xffff; sum1 += (n); sum1 = (sum1 >> 16) + (sum1 & 0xffff); /* Again */ sum1 = (sum1 >> 16) + (sum1 & 0xffff); sumshort = ~(u_short)sum1; *(sp) = htons(sumshort); } /* ------------------------------------------------------------------------ */ /* Function: fr_nat_ioctl */ /* Returns: int - 0 == success, != 0 == failure */ /* Parameters: data(I) - pointer to ioctl data */ /* cmd(I) - ioctl command integer */ /* mode(I) - file mode bits used with open */ /* */ /* Processes an ioctl call made to operate on the IP Filter NAT device. */ /* ------------------------------------------------------------------------ */ int fr_nat_ioctl(data, cmd, mode, uid, ctx) ioctlcmd_t cmd; caddr_t data; int mode, uid; void *ctx; { ipnat_t *nat, *nt, *n = NULL, **np = NULL; int error = 0, ret, arg, getlock; ipnat_t natd; SPL_INT(s); #if (BSD >= 199306) && defined(_KERNEL) # if defined(__NetBSD_Version__) && (__NetBSD_Version__ >= 399002000) if ((mode & FWRITE) && kauth_authorize_network(curlwp->l_cred, KAUTH_NETWORK_FIREWALL, KAUTH_REQ_NETWORK_FIREWALL_FW, NULL, NULL, NULL)) { return EPERM; } # else +# if defined(__FreeBSD_version) && (__FreeBSD_version >= 500034) + if (securelevel_ge(curthread->td_ucred, 3) && (mode & FWRITE)) { +# else if ((securelevel >= 3) && (mode & FWRITE)) { +# endif return EPERM; } # endif #endif #if defined(__osf__) && defined(_KERNEL) getlock = 0; #else getlock = (mode & NAT_LOCKHELD) ? 0 : 1; #endif nat = NULL; /* XXX gcc -Wuninitialized */ if (cmd == (ioctlcmd_t)SIOCADNAT) { KMALLOC(nt, ipnat_t *); } else { nt = NULL; } if ((cmd == (ioctlcmd_t)SIOCADNAT) || (cmd == (ioctlcmd_t)SIOCRMNAT)) { if (mode & NAT_SYSSPACE) { bcopy(data, (char *)&natd, sizeof(natd)); error = 0; } else { error = fr_inobj(data, &natd, IPFOBJ_IPNAT); } } if (error != 0) goto done; /* * For add/delete, look to see if the NAT entry is already present */ if ((cmd == (ioctlcmd_t)SIOCADNAT) || (cmd == (ioctlcmd_t)SIOCRMNAT)) { nat = &natd; if (nat->in_v == 0) /* For backward compat. 
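fix_outcksum(), fix_incksum() and fix_datacksum() above all apply the same incremental update, in the style of RFC 1624: un-complement the stored checksum, add the caller's precomputed 16-bit delta with end-around carry, and complement again. A compact standalone sketch (hypothetical example_* names, not part of this patch):

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Adjust a 16-bit Internet checksum (shown in host byte order for clarity)
     * by "delta", the adjustment the NAT code precomputes for the rewritten
     * address/port words.
     */
    static uint16_t
    example_cksum_adjust(uint16_t cksum, uint32_t delta)
    {
        uint32_t sum;

        sum = (~cksum) & 0xffff;                /* back to the raw sum */
        sum += delta;                           /* apply the change */
        sum = (sum >> 16) + (sum & 0xffff);     /* fold the carries ... */
        sum = (sum >> 16) + (sum & 0xffff);     /* ... twice, as ip_nat.c does */
        return (uint16_t)(~sum & 0xffff);
    }

    int
    main(void)
    {
        /* e.g. the caller computed a delta of 0x1234 for the rewritten words */
        printf("0x%04x\n", example_cksum_adjust(0xbeef, 0x1234));
        return 0;
    }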
*/ nat->in_v = 4; nat->in_flags &= IPN_USERFLAGS; if ((nat->in_redir & NAT_MAPBLK) == 0) { if ((nat->in_flags & IPN_SPLIT) == 0) nat->in_inip &= nat->in_inmsk; if ((nat->in_flags & IPN_IPRANGE) == 0) nat->in_outip &= nat->in_outmsk; } MUTEX_ENTER(&ipf_natio); for (np = &nat_list; ((n = *np) != NULL); np = &n->in_next) if (bcmp((char *)&nat->in_flags, (char *)&n->in_flags, IPN_CMPSIZ) == 0) { if (nat->in_redir == NAT_REDIRECT && nat->in_pnext != n->in_pnext) continue; break; } } switch (cmd) { #ifdef IPFILTER_LOG case SIOCIPFFB : { int tmp; if (!(mode & FWRITE)) error = EPERM; else { tmp = ipflog_clear(IPL_LOGNAT); error = BCOPYOUT((char *)&tmp, (char *)data, sizeof(tmp)); if (error != 0) error = EFAULT; } break; } case SIOCSETLG : if (!(mode & FWRITE)) error = EPERM; else { error = BCOPYIN((char *)data, (char *)&nat_logging, sizeof(nat_logging)); if (error != 0) error = EFAULT; } break; case SIOCGETLG : error = BCOPYOUT((char *)&nat_logging, (char *)data, sizeof(nat_logging)); if (error != 0) error = EFAULT; break; case FIONREAD : arg = iplused[IPL_LOGNAT]; error = BCOPYOUT(&arg, data, sizeof(arg)); if (error != 0) error = EFAULT; break; #endif case SIOCADNAT : if (!(mode & FWRITE)) { error = EPERM; } else if (n != NULL) { error = EEXIST; } else if (nt == NULL) { error = ENOMEM; } if (error != 0) { MUTEX_EXIT(&ipf_natio); break; } bcopy((char *)nat, (char *)nt, sizeof(*n)); error = nat_siocaddnat(nt, np, getlock); MUTEX_EXIT(&ipf_natio); if (error == 0) nt = NULL; break; case SIOCRMNAT : if (!(mode & FWRITE)) { error = EPERM; n = NULL; } else if (n == NULL) { error = ESRCH; } if (error != 0) { MUTEX_EXIT(&ipf_natio); break; } nat_siocdelnat(n, np, getlock); MUTEX_EXIT(&ipf_natio); n = NULL; break; case SIOCGNATS : nat_stats.ns_table[0] = nat_table[0]; nat_stats.ns_table[1] = nat_table[1]; nat_stats.ns_list = nat_list; nat_stats.ns_maptable = ipf_hm_maptable; nat_stats.ns_maplist = ipf_hm_maplist; nat_stats.ns_nattab_sz = ipf_nattable_sz; nat_stats.ns_nattab_max = ipf_nattable_max; nat_stats.ns_rultab_sz = ipf_natrules_sz; nat_stats.ns_rdrtab_sz = ipf_rdrrules_sz; nat_stats.ns_hostmap_sz = ipf_hostmap_sz; nat_stats.ns_instances = nat_instances; nat_stats.ns_apslist = ap_sess_list; nat_stats.ns_ticks = fr_ticks; error = fr_outobj(data, &nat_stats, IPFOBJ_NATSTAT); break; case SIOCGNATL : { natlookup_t nl; error = fr_inobj(data, &nl, IPFOBJ_NATLOOKUP); if (error == 0) { void *ptr; if (getlock) { READ_ENTER(&ipf_nat); } ptr = nat_lookupredir(&nl); if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (ptr != NULL) { error = fr_outobj(data, &nl, IPFOBJ_NATLOOKUP); } else { error = ESRCH; } } break; } case SIOCIPFFL : /* old SIOCFLNAT & SIOCCNATL */ if (!(mode & FWRITE)) { error = EPERM; break; } if (getlock) { WRITE_ENTER(&ipf_nat); } error = BCOPYIN(data, &arg, sizeof(arg)); if (error != 0) error = EFAULT; else { if (arg == 0) ret = nat_flushtable(); else if (arg == 1) ret = nat_clearlist(); else ret = nat_extraflush(arg); } if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (error == 0) { error = BCOPYOUT(&ret, data, sizeof(ret)); } break; case SIOCPROXY : error = appr_ioctl(data, cmd, mode, ctx); break; case SIOCSTLCK : if (!(mode & FWRITE)) { error = EPERM; } else { error = fr_lock(data, &fr_nat_lock); } break; case SIOCSTPUT : if ((mode & FWRITE) != 0) { error = fr_natputent(data, getlock); } else { error = EACCES; } break; case SIOCSTGSZ : if (fr_nat_lock) { error = fr_natgetsz(data, getlock); } else error = EACCES; break; case SIOCSTGET : if (fr_nat_lock) { error = fr_natgetent(data, getlock); } else error 
= EACCES; break; case SIOCGENITER : { ipfgeniter_t iter; ipftoken_t *token; SPL_SCHED(s); error = fr_inobj(data, &iter, IPFOBJ_GENITER); if (error == 0) { token = ipf_findtoken(iter.igi_type, uid, ctx); if (token != NULL) { error = nat_iterator(token, &iter); } RWLOCK_EXIT(&ipf_tokens); } SPL_X(s); break; } case SIOCIPFDELTOK : error = BCOPYIN((caddr_t)data, (caddr_t)&arg, sizeof(arg)); if (error == 0) { SPL_SCHED(s); error = ipf_deltoken(arg, uid, ctx); SPL_X(s); } else { error = EFAULT; } break; case SIOCGTQTAB : error = fr_outobj(data, nat_tqb, IPFOBJ_STATETQTAB); break; case SIOCGTABL : error = nat_gettable(data); break; default : error = EINVAL; break; } done: if (nt != NULL) KFREE(nt); return error; } /* ------------------------------------------------------------------------ */ /* Function: nat_siocaddnat */ /* Returns: int - 0 == success, != 0 == failure */ /* Parameters: n(I) - pointer to new NAT rule */ /* np(I) - pointer to where to insert new NAT rule */ /* getlock(I) - flag indicating if lock on ipf_nat is held */ /* Mutex Locks: ipf_natio */ /* */ /* Handle SIOCADNAT. Resolve and calculate details inside the NAT rule */ /* from information passed to the kernel, then add it to the appropriate */ /* NAT rule table(s). */ /* ------------------------------------------------------------------------ */ static int nat_siocaddnat(n, np, getlock) ipnat_t *n, **np; int getlock; { int error = 0, i, j; if (nat_resolverule(n) != 0) return ENOENT; if ((n->in_age[0] == 0) && (n->in_age[1] != 0)) return EINVAL; n->in_use = 0; if (n->in_redir & NAT_MAPBLK) n->in_space = USABLE_PORTS * ~ntohl(n->in_outmsk); else if (n->in_flags & IPN_AUTOPORTMAP) n->in_space = USABLE_PORTS * ~ntohl(n->in_inmsk); else if (n->in_flags & IPN_IPRANGE) n->in_space = ntohl(n->in_outmsk) - ntohl(n->in_outip); else if (n->in_flags & IPN_SPLIT) n->in_space = 2; else if (n->in_outmsk != 0) n->in_space = ~ntohl(n->in_outmsk); else n->in_space = 1; /* * Calculate the number of valid IP addresses in the output * mapping range. In all cases, the range is inclusive of * the start and ending IP addresses. * If to a CIDR address, lose 2: broadcast + network address * (so subtract 1) * If to a range, add one. * If to a single IP address, set to 1. */ if (n->in_space) { if ((n->in_flags & IPN_IPRANGE) != 0) n->in_space += 1; else n->in_space -= 1; } else n->in_space = 1; if ((n->in_outmsk != 0xffffffff) && (n->in_outmsk != 0) && ((n->in_flags & (IPN_IPRANGE|IPN_SPLIT)) == 0)) n->in_nip = ntohl(n->in_outip) + 1; else if ((n->in_flags & IPN_SPLIT) && (n->in_redir & NAT_REDIRECT)) n->in_nip = ntohl(n->in_inip); else n->in_nip = ntohl(n->in_outip); if (n->in_redir & NAT_MAP) { n->in_pnext = ntohs(n->in_pmin); /* * Multiply by the number of ports made available. */ if (ntohs(n->in_pmax) >= ntohs(n->in_pmin)) { n->in_space *= (ntohs(n->in_pmax) - ntohs(n->in_pmin) + 1); /* * Because two different sources can map to * different destinations but use the same * local IP#/port #. * If the result is smaller than in_space, then * we may have wrapped around 32bits. */ i = n->in_inmsk; if ((i != 0) && (i != 0xffffffff)) { j = n->in_space * (~ntohl(i) + 1); if (j >= n->in_space) n->in_space = j; else n->in_space = 0xffffffff; } } /* * If no protocol is specified, multiple by 256 to allow for * at least one IP:IP mapping per protocol. 
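nat_siocaddnat() below sizes in_space, the number of distinct translations a map rule can hand out, as roughly (usable addresses in the target range) times (ports in the port range), with 0/32, range and map-block rules special-cased. A simplified worked sketch for an ordinary CIDR map rule with a portmap, assuming a prefix length between 8 and 30 (hypothetical example_* names):

    #include <stdio.h>

    /*
     * Rough sketch of the in_space sizing for a plain
     * "map ... -> a.b.c.d/nn portmap tcp/udp pmin:pmax" rule:
     * usable addresses times usable ports.
     */
    static unsigned long
    example_map_space(unsigned int prefixlen, unsigned int pmin, unsigned int pmax)
    {
        unsigned long addrs;

        addrs = (prefixlen >= 31) ? 1 :
            ((1UL << (32 - prefixlen)) - 1) - 1;    /* hosts minus net/broadcast */
        return addrs * (pmax - pmin + 1);
    }

    int
    main(void)
    {
        /* e.g. a /24 target with ports 10000-19999: 254 * 10000 translations */
        printf("%lu\n", example_map_space(24, 10000, 19999));
        return 0;
    }

The kernel code additionally scales this by 256 when no protocol is given and clamps on overflow, as the comments in nat_siocaddnat() describe.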
*/ if ((n->in_flags & IPN_TCPUDPICMP) == 0) { j = n->in_space * 256; if (j >= n->in_space) n->in_space = j; else n->in_space = 0xffffffff; } } /* Otherwise, these fields are preset */ if (getlock) { WRITE_ENTER(&ipf_nat); } n->in_next = NULL; *np = n; if (n->in_age[0] != 0) n->in_tqehead[0] = fr_addtimeoutqueue(&nat_utqe, n->in_age[0]); if (n->in_age[1] != 0) n->in_tqehead[1] = fr_addtimeoutqueue(&nat_utqe, n->in_age[1]); if (n->in_redir & NAT_REDIRECT) { n->in_flags &= ~IPN_NOTDST; nat_addrdr(n); } if (n->in_redir & (NAT_MAP|NAT_MAPBLK)) { n->in_flags &= ~IPN_NOTSRC; nat_addnat(n); } MUTEX_INIT(&n->in_lock, "ipnat rule lock"); n = NULL; nat_stats.ns_rules++; #if SOLARIS && !defined(_INET_IP_STACK_H) pfil_delayed_copy = 0; #endif if (getlock) { RWLOCK_EXIT(&ipf_nat); /* WRITE */ } return error; } /* ------------------------------------------------------------------------ */ /* Function: nat_resolvrule */ /* Returns: Nil */ /* Parameters: n(I) - pointer to NAT rule */ /* */ /* Handle SIOCADNAT. Resolve and calculate details inside the NAT rule */ /* from information passed to the kernel, then add it to the appropriate */ /* NAT rule table(s). */ /* ------------------------------------------------------------------------ */ static int nat_resolverule(n) ipnat_t *n; { n->in_ifnames[0][LIFNAMSIZ - 1] = '\0'; n->in_ifps[0] = fr_resolvenic(n->in_ifnames[0], 4); n->in_ifnames[1][LIFNAMSIZ - 1] = '\0'; if (n->in_ifnames[1][0] == '\0') { (void) strncpy(n->in_ifnames[1], n->in_ifnames[0], LIFNAMSIZ); n->in_ifps[1] = n->in_ifps[0]; } else { n->in_ifps[1] = fr_resolvenic(n->in_ifnames[1], 4); } if (n->in_plabel[0] != '\0') { n->in_apr = appr_lookup(n->in_p, n->in_plabel); if (n->in_apr == NULL) return -1; } return 0; } /* ------------------------------------------------------------------------ */ /* Function: nat_siocdelnat */ /* Returns: int - 0 == success, != 0 == failure */ /* Parameters: n(I) - pointer to new NAT rule */ /* np(I) - pointer to where to insert new NAT rule */ /* getlock(I) - flag indicating if lock on ipf_nat is held */ /* Mutex Locks: ipf_natio */ /* */ /* Handle SIOCADNAT. Resolve and calculate details inside the NAT rule */ /* from information passed to the kernel, then add it to the appropriate */ /* NAT rule table(s). */ /* ------------------------------------------------------------------------ */ static void nat_siocdelnat(n, np, getlock) ipnat_t *n, **np; int getlock; { if (getlock) { WRITE_ENTER(&ipf_nat); } if (n->in_redir & NAT_REDIRECT) nat_delrdr(n); if (n->in_redir & (NAT_MAPBLK|NAT_MAP)) nat_delnat(n); if (nat_list == NULL) { nat_masks = 0; rdr_masks = 0; } if (n->in_tqehead[0] != NULL) { if (fr_deletetimeoutqueue(n->in_tqehead[0]) == 0) { fr_freetimeoutqueue(n->in_tqehead[1]); } } if (n->in_tqehead[1] != NULL) { if (fr_deletetimeoutqueue(n->in_tqehead[1]) == 0) { fr_freetimeoutqueue(n->in_tqehead[1]); } } *np = n->in_next; if (n->in_use == 0) { if (n->in_apr) appr_free(n->in_apr); MUTEX_DESTROY(&n->in_lock); KFREE(n); nat_stats.ns_rules--; #if SOLARIS && !defined(_INET_IP_STACK_H) if (nat_stats.ns_rules == 0) pfil_delayed_copy = 1; #endif } else { n->in_flags |= IPN_DELETE; n->in_next = NULL; } if (getlock) { RWLOCK_EXIT(&ipf_nat); /* READ/WRITE */ } } /* ------------------------------------------------------------------------ */ /* Function: fr_natgetsz */ /* Returns: int - 0 == success, != 0 is the error value. */ /* Parameters: data(I) - pointer to natget structure with kernel pointer */ /* get the size of. */ /* */ /* Handle SIOCSTGSZ. 
*/ /* Return the size of the nat list entry to be copied back to user space. */ /* The size of the entry is stored in the ng_sz field and the enture natget */ /* structure is copied back to the user. */ /* ------------------------------------------------------------------------ */ static int fr_natgetsz(data, getlock) caddr_t data; int getlock; { ap_session_t *aps; nat_t *nat, *n; natget_t ng; if (BCOPYIN(data, &ng, sizeof(ng)) != 0) return EFAULT; if (getlock) { READ_ENTER(&ipf_nat); } nat = ng.ng_ptr; if (!nat) { nat = nat_instances; ng.ng_sz = 0; /* * Empty list so the size returned is 0. Simple. */ if (nat == NULL) { if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (BCOPYOUT(&ng, data, sizeof(ng)) != 0) return EFAULT; return 0; } } else { /* * Make sure the pointer we're copying from exists in the * current list of entries. Security precaution to prevent * copying of random kernel data. */ for (n = nat_instances; n; n = n->nat_next) if (n == nat) break; if (n == NULL) { if (getlock) { RWLOCK_EXIT(&ipf_nat); } return ESRCH; } } /* * Incluse any space required for proxy data structures. */ ng.ng_sz = sizeof(nat_save_t); aps = nat->nat_aps; if (aps != NULL) { ng.ng_sz += sizeof(ap_session_t) - 4; if (aps->aps_data != 0) ng.ng_sz += aps->aps_psiz; } if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (BCOPYOUT(&ng, data, sizeof(ng)) != 0) return EFAULT; return 0; } /* ------------------------------------------------------------------------ */ /* Function: fr_natgetent */ /* Returns: int - 0 == success, != 0 is the error value. */ /* Parameters: data(I) - pointer to natget structure with kernel pointer */ /* to NAT structure to copy out. */ /* */ /* Handle SIOCSTGET. */ /* Copies out NAT entry to user space. Any additional data held for a */ /* proxy is also copied, as to is the NAT rule which was responsible for it */ /* ------------------------------------------------------------------------ */ static int fr_natgetent(data, getlock) caddr_t data; int getlock; { int error, outsize; ap_session_t *aps; nat_save_t *ipn, ipns; nat_t *n, *nat; error = fr_inobj(data, &ipns, IPFOBJ_NATSAVE); if (error != 0) return error; if ((ipns.ipn_dsize < sizeof(ipns)) || (ipns.ipn_dsize > 81920)) return EINVAL; KMALLOCS(ipn, nat_save_t *, ipns.ipn_dsize); if (ipn == NULL) return ENOMEM; if (getlock) { READ_ENTER(&ipf_nat); } ipn->ipn_dsize = ipns.ipn_dsize; nat = ipns.ipn_next; if (nat == NULL) { nat = nat_instances; if (nat == NULL) { if (nat_instances == NULL) error = ENOENT; goto finished; } } else { /* * Make sure the pointer we're copying from exists in the * current list of entries. Security precaution to prevent * copying of random kernel data. */ for (n = nat_instances; n; n = n->nat_next) if (n == nat) break; if (n == NULL) { error = ESRCH; goto finished; } } ipn->ipn_next = nat->nat_next; /* * Copy the NAT structure. */ bcopy((char *)nat, &ipn->ipn_nat, sizeof(*nat)); /* * If we have a pointer to the NAT rule it belongs to, save that too. */ if (nat->nat_ptr != NULL) bcopy((char *)nat->nat_ptr, (char *)&ipn->ipn_ipnat, sizeof(ipn->ipn_ipnat)); /* * If we also know the NAT entry has an associated filter rule, * save that too. */ if (nat->nat_fr != NULL) bcopy((char *)nat->nat_fr, (char *)&ipn->ipn_fr, sizeof(ipn->ipn_fr)); /* * Last but not least, if there is an application proxy session set * up for this NAT entry, then copy that out too, including any * private data saved along side it by the proxy. 
*/ aps = nat->nat_aps; outsize = ipn->ipn_dsize - sizeof(*ipn) + sizeof(ipn->ipn_data); if (aps != NULL) { char *s; if (outsize < sizeof(*aps)) { error = ENOBUFS; goto finished; } s = ipn->ipn_data; bcopy((char *)aps, s, sizeof(*aps)); s += sizeof(*aps); outsize -= sizeof(*aps); if ((aps->aps_data != NULL) && (outsize >= aps->aps_psiz)) bcopy(aps->aps_data, s, aps->aps_psiz); else error = ENOBUFS; } if (error == 0) { if (getlock) { RWLOCK_EXIT(&ipf_nat); getlock = 0; } error = fr_outobjsz(data, ipn, IPFOBJ_NATSAVE, ipns.ipn_dsize); } finished: if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (ipn != NULL) { KFREES(ipn, ipns.ipn_dsize); } return error; } /* ------------------------------------------------------------------------ */ /* Function: fr_natputent */ /* Returns: int - 0 == success, != 0 is the error value. */ /* Parameters: data(I) - pointer to natget structure with NAT */ /* structure information to load into the kernel */ /* getlock(I) - flag indicating whether or not a write lock */ /* on ipf_nat is already held. */ /* */ /* Handle SIOCSTPUT. */ /* Loads a NAT table entry from user space, including a NAT rule, proxy and */ /* firewall rule data structures, if pointers to them indicate so. */ /* ------------------------------------------------------------------------ */ static int fr_natputent(data, getlock) caddr_t data; int getlock; { nat_save_t ipn, *ipnn; ap_session_t *aps; nat_t *n, *nat; frentry_t *fr; fr_info_t fin; ipnat_t *in; int error; error = fr_inobj(data, &ipn, IPFOBJ_NATSAVE); if (error != 0) return error; /* * Initialise early because of code at junkput label. */ in = NULL; aps = NULL; nat = NULL; ipnn = NULL; fr = NULL; /* * New entry, copy in the rest of the NAT entry if it's size is more * than just the nat_t structure. */ if (ipn.ipn_dsize > sizeof(ipn)) { if (ipn.ipn_dsize > 81920) { error = ENOMEM; goto junkput; } KMALLOCS(ipnn, nat_save_t *, ipn.ipn_dsize); if (ipnn == NULL) return ENOMEM; error = fr_inobjsz(data, ipnn, IPFOBJ_NATSAVE, ipn.ipn_dsize); if (error != 0) { error = EFAULT; goto junkput; } } else ipnn = &ipn; KMALLOC(nat, nat_t *); if (nat == NULL) { error = ENOMEM; goto junkput; } bcopy((char *)&ipnn->ipn_nat, (char *)nat, sizeof(*nat)); /* * Initialize all these so that nat_delete() doesn't cause a crash. */ bzero((char *)nat, offsetof(struct nat, nat_tqe)); nat->nat_tqe.tqe_pnext = NULL; nat->nat_tqe.tqe_next = NULL; nat->nat_tqe.tqe_ifq = NULL; nat->nat_tqe.tqe_parent = nat; /* * Restore the rule associated with this nat session */ in = ipnn->ipn_nat.nat_ptr; if (in != NULL) { KMALLOC(in, ipnat_t *); nat->nat_ptr = in; if (in == NULL) { error = ENOMEM; goto junkput; } bzero((char *)in, offsetof(struct ipnat, in_next6)); bcopy((char *)&ipnn->ipn_ipnat, (char *)in, sizeof(*in)); in->in_use = 1; in->in_flags |= IPN_DELETE; ATOMIC_INC(nat_stats.ns_rules); if (nat_resolverule(in) != 0) { error = ESRCH; goto junkput; } } /* * Check that the NAT entry doesn't already exist in the kernel. * * For NAT_OUTBOUND, we're lookup for a duplicate MAP entry. To do * this, we check to see if the inbound combination of addresses and * ports is already known. Similar logic is applied for NAT_INBOUND. 
* */ bzero((char *)&fin, sizeof(fin)); fin.fin_p = nat->nat_p; if (nat->nat_dir == NAT_OUTBOUND) { fin.fin_ifp = nat->nat_ifps[0]; fin.fin_data[0] = ntohs(nat->nat_oport); fin.fin_data[1] = ntohs(nat->nat_outport); if (getlock) { READ_ENTER(&ipf_nat); } n = nat_inlookup(&fin, nat->nat_flags, fin.fin_p, nat->nat_oip, nat->nat_inip); if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (n != NULL) { error = EEXIST; goto junkput; } } else if (nat->nat_dir == NAT_INBOUND) { fin.fin_ifp = nat->nat_ifps[0]; fin.fin_data[0] = ntohs(nat->nat_outport); fin.fin_data[1] = ntohs(nat->nat_oport); if (getlock) { READ_ENTER(&ipf_nat); } n = nat_outlookup(&fin, nat->nat_flags, fin.fin_p, nat->nat_outip, nat->nat_oip); if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (n != NULL) { error = EEXIST; goto junkput; } } else { error = EINVAL; goto junkput; } /* * Restore ap_session_t structure. Include the private data allocated * if it was there. */ aps = nat->nat_aps; if (aps != NULL) { KMALLOC(aps, ap_session_t *); nat->nat_aps = aps; if (aps == NULL) { error = ENOMEM; goto junkput; } bcopy(ipnn->ipn_data, (char *)aps, sizeof(*aps)); if (in != NULL) aps->aps_apr = in->in_apr; else aps->aps_apr = NULL; if (aps->aps_psiz != 0) { if (aps->aps_psiz > 81920) { error = ENOMEM; goto junkput; } KMALLOCS(aps->aps_data, void *, aps->aps_psiz); if (aps->aps_data == NULL) { error = ENOMEM; goto junkput; } bcopy(ipnn->ipn_data + sizeof(*aps), aps->aps_data, aps->aps_psiz); } else { aps->aps_psiz = 0; aps->aps_data = NULL; } } /* * If there was a filtering rule associated with this entry then * build up a new one. */ fr = nat->nat_fr; if (fr != NULL) { if ((nat->nat_flags & SI_NEWFR) != 0) { KMALLOC(fr, frentry_t *); nat->nat_fr = fr; if (fr == NULL) { error = ENOMEM; goto junkput; } ipnn->ipn_nat.nat_fr = fr; fr->fr_ref = 1; (void) fr_outobj(data, ipnn, IPFOBJ_NATSAVE); bcopy((char *)&ipnn->ipn_fr, (char *)fr, sizeof(*fr)); fr->fr_ref = 1; fr->fr_dsize = 0; fr->fr_data = NULL; fr->fr_type = FR_T_NONE; MUTEX_NUKE(&fr->fr_lock); MUTEX_INIT(&fr->fr_lock, "nat-filter rule lock"); } else { if (getlock) { READ_ENTER(&ipf_nat); } for (n = nat_instances; n; n = n->nat_next) if (n->nat_fr == fr) break; if (n != NULL) { MUTEX_ENTER(&fr->fr_lock); fr->fr_ref++; MUTEX_EXIT(&fr->fr_lock); } if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (!n) { error = ESRCH; goto junkput; } } } if (ipnn != &ipn) { KFREES(ipnn, ipn.ipn_dsize); ipnn = NULL; } if (getlock) { WRITE_ENTER(&ipf_nat); } error = nat_insert(nat, nat->nat_rev); if ((error == 0) && (aps != NULL)) { aps->aps_next = ap_sess_list; ap_sess_list = aps; } if (getlock) { RWLOCK_EXIT(&ipf_nat); } if (error == 0) return 0; error = ENOMEM; junkput: if (fr != NULL) (void) fr_derefrule(&fr); if ((ipnn != NULL) && (ipnn != &ipn)) { KFREES(ipnn, ipn.ipn_dsize); } if (nat != NULL) { if (aps != NULL) { if (aps->aps_data != NULL) { KFREES(aps->aps_data, aps->aps_psiz); } KFREE(aps); } if (in != NULL) { if (in->in_apr) appr_free(in->in_apr); KFREE(in); } KFREE(nat); } return error; } /* ------------------------------------------------------------------------ */ /* Function: nat_delete */ /* Returns: Nil */ /* Parameters: natd(I) - pointer to NAT structure to delete */ /* logtype(I) - type of LOG record to create before deleting */ /* Write Lock: ipf_nat */ /* */ /* Delete a nat entry from the various lists and table. If NAT logging is */ /* enabled then generate a NAT log record for this event. 
*/ /* ------------------------------------------------------------------------ */ void nat_delete(nat, logtype) struct nat *nat; int logtype; { struct ipnat *ipn; int removed = 0; if (logtype != 0 && nat_logging != 0) nat_log(nat, logtype); #if defined(NEED_LOCAL_RAND) && defined(_KERNEL) ipf_rand_push(nat, sizeof(*nat)); #endif /* * Take it as a general indication that all the pointers are set if * nat_pnext is set. */ if (nat->nat_pnext != NULL) { removed = 1; nat_stats.ns_bucketlen[0][nat->nat_hv[0]]--; nat_stats.ns_bucketlen[1][nat->nat_hv[1]]--; *nat->nat_pnext = nat->nat_next; if (nat->nat_next != NULL) { nat->nat_next->nat_pnext = nat->nat_pnext; nat->nat_next = NULL; } nat->nat_pnext = NULL; *nat->nat_phnext[0] = nat->nat_hnext[0]; if (nat->nat_hnext[0] != NULL) { nat->nat_hnext[0]->nat_phnext[0] = nat->nat_phnext[0]; nat->nat_hnext[0] = NULL; } nat->nat_phnext[0] = NULL; *nat->nat_phnext[1] = nat->nat_hnext[1]; if (nat->nat_hnext[1] != NULL) { nat->nat_hnext[1]->nat_phnext[1] = nat->nat_phnext[1]; nat->nat_hnext[1] = NULL; } nat->nat_phnext[1] = NULL; if ((nat->nat_flags & SI_WILDP) != 0) nat_stats.ns_wilds--; } if (nat->nat_me != NULL) { *nat->nat_me = NULL; nat->nat_me = NULL; } if (nat->nat_tqe.tqe_ifq != NULL) fr_deletequeueentry(&nat->nat_tqe); if (logtype == NL_EXPIRE) nat_stats.ns_expire++; MUTEX_ENTER(&nat->nat_lock); /* * NL_DESTROY should only be passed in when we've got nat_ref >= 2. * This happens when a nat'd packet is blocked and we want to throw * away the NAT session. */ if (logtype == NL_DESTROY) { if (nat->nat_ref > 2) { nat->nat_ref -= 2; MUTEX_EXIT(&nat->nat_lock); if (removed) nat_stats.ns_orphans++; return; } } else if (nat->nat_ref > 1) { nat->nat_ref--; MUTEX_EXIT(&nat->nat_lock); if (removed) nat_stats.ns_orphans++; return; } MUTEX_EXIT(&nat->nat_lock); /* * At this point, nat_ref is 1, doing "--" would make it 0.. */ nat->nat_ref = 0; if (!removed) nat_stats.ns_orphans--; #ifdef IPFILTER_SYNC if (nat->nat_sync) ipfsync_del(nat->nat_sync); #endif if (nat->nat_fr != NULL) (void) fr_derefrule(&nat->nat_fr); if (nat->nat_hm != NULL) fr_hostmapdel(&nat->nat_hm); /* * If there is an active reference from the nat entry to its parent * rule, decrement the rule's reference count and free it too if no * longer being used. */ ipn = nat->nat_ptr; if (ipn != NULL) { fr_ipnatderef(&ipn); } MUTEX_DESTROY(&nat->nat_lock); aps_free(nat->nat_aps); nat_stats.ns_inuse--; /* * If there's a fragment table entry too for this nat entry, then * dereference that as well. This is after nat_lock is released * because of Tru64. */ fr_forgetnat((void *)nat); KFREE(nat); } /* ------------------------------------------------------------------------ */ /* Function: nat_flushtable */ /* Returns: int - number of NAT rules deleted */ /* Parameters: Nil */ /* */ /* Deletes all currently active NAT sessions. In deleting each NAT entry a */ /* log record should be emitted in nat_delete() if NAT logging is enabled. */ /* ------------------------------------------------------------------------ */ /* * nat_flushtable - clear the NAT table of all mapping entries. */ static int nat_flushtable() { nat_t *nat; int j = 0; /* * ALL NAT mappings deleted, so lets just make the deletions * quicker. 
*/ if (nat_table[0] != NULL) bzero((char *)nat_table[0], sizeof(nat_table[0]) * ipf_nattable_sz); if (nat_table[1] != NULL) bzero((char *)nat_table[1], sizeof(nat_table[1]) * ipf_nattable_sz); while ((nat = nat_instances) != NULL) { nat_delete(nat, NL_FLUSH); j++; } nat_stats.ns_inuse = 0; return j; } /* ------------------------------------------------------------------------ */ /* Function: nat_clearlist */ /* Returns: int - number of NAT/RDR rules deleted */ /* Parameters: Nil */ /* */ /* Delete all rules in the current list of rules. There is nothing elegant */ /* about this cleanup: simply free all entries on the list of rules and */ /* clear out the tables used for hashed NAT rule lookups. */ /* ------------------------------------------------------------------------ */ static int nat_clearlist() { ipnat_t *n, **np = &nat_list; int i = 0; if (nat_rules != NULL) bzero((char *)nat_rules, sizeof(*nat_rules) * ipf_natrules_sz); if (rdr_rules != NULL) bzero((char *)rdr_rules, sizeof(*rdr_rules) * ipf_rdrrules_sz); while ((n = *np) != NULL) { *np = n->in_next; if (n->in_use == 0) { if (n->in_apr != NULL) appr_free(n->in_apr); MUTEX_DESTROY(&n->in_lock); KFREE(n); nat_stats.ns_rules--; } else { n->in_flags |= IPN_DELETE; n->in_next = NULL; } i++; } #if SOLARIS && !defined(_INET_IP_STACK_H) pfil_delayed_copy = 1; #endif nat_masks = 0; rdr_masks = 0; return i; } /* ------------------------------------------------------------------------ */ /* Function: nat_newmap */ /* Returns: int - -1 == error, 0 == success */ /* Parameters: fin(I) - pointer to packet information */ /* nat(I) - pointer to NAT entry */ /* ni(I) - pointer to structure with misc. information needed */ /* to create new NAT entry. */ /* */ /* Given an empty NAT structure, populate it with new information about a */ /* new NAT session, as defined by the matching NAT rule. */ /* ni.nai_ip is passed in uninitialised and must be set, in host byte order,*/ /* to the new IP address for the translation. */ /* ------------------------------------------------------------------------ */ static INLINE int nat_newmap(fin, nat, ni) fr_info_t *fin; nat_t *nat; natinfo_t *ni; { u_short st_port, dport, sport, port, sp, dp; struct in_addr in, inb; hostmap_t *hm; u_32_t flags; u_32_t st_ip; ipnat_t *np; nat_t *natl; int l; /* * If it's an outbound packet which doesn't match any existing * record, then create a new port */ l = 0; hm = NULL; np = ni->nai_np; st_ip = np->in_nip; st_port = np->in_pnext; flags = ni->nai_flags; sport = ni->nai_sport; dport = ni->nai_dport; /* * Do a loop until we either run out of entries to try or we find * a NAT mapping that isn't currently being used. This is done * because the change to the source is not (usually) being fixed. */ do { port = 0; in.s_addr = htonl(np->in_nip); if (l == 0) { /* * Check to see if there is an existing NAT * setup for this IP address pair. 
*/ hm = nat_hostmap(np, fin->fin_src, fin->fin_dst, in, 0); if (hm != NULL) in.s_addr = hm->hm_mapip.s_addr; } else if ((l == 1) && (hm != NULL)) { fr_hostmapdel(&hm); } in.s_addr = ntohl(in.s_addr); nat->nat_hm = hm; if ((np->in_outmsk == 0xffffffff) && (np->in_pnext == 0)) { if (l > 0) return -1; } if (np->in_redir == NAT_BIMAP && np->in_inmsk == np->in_outmsk) { /* * map the address block in a 1:1 fashion */ in.s_addr = np->in_outip; in.s_addr |= fin->fin_saddr & ~np->in_inmsk; in.s_addr = ntohl(in.s_addr); } else if (np->in_redir & NAT_MAPBLK) { if ((l >= np->in_ppip) || ((l > 0) && !(flags & IPN_TCPUDP))) return -1; /* * map-block - Calculate destination address. */ in.s_addr = ntohl(fin->fin_saddr); in.s_addr &= ntohl(~np->in_inmsk); inb.s_addr = in.s_addr; in.s_addr /= np->in_ippip; in.s_addr &= ntohl(~np->in_outmsk); in.s_addr += ntohl(np->in_outip); /* * Calculate destination port. */ if ((flags & IPN_TCPUDP) && (np->in_ppip != 0)) { port = ntohs(sport) + l; port %= np->in_ppip; port += np->in_ppip * (inb.s_addr % np->in_ippip); port += MAPBLK_MINPORT; port = htons(port); } } else if ((np->in_outip == 0) && (np->in_outmsk == 0xffffffff)) { /* * 0/32 - use the interface's IP address. */ if ((l > 0) || fr_ifpaddr(4, FRI_NORMAL, fin->fin_ifp, &in, NULL) == -1) return -1; in.s_addr = ntohl(in.s_addr); } else if ((np->in_outip == 0) && (np->in_outmsk == 0)) { /* * 0/0 - use the original source address/port. */ if (l > 0) return -1; in.s_addr = ntohl(fin->fin_saddr); } else if ((np->in_outmsk != 0xffffffff) && (np->in_pnext == 0) && ((l > 0) || (hm == NULL))) np->in_nip++; natl = NULL; if ((flags & IPN_TCPUDP) && ((np->in_redir & NAT_MAPBLK) == 0) && (np->in_flags & IPN_AUTOPORTMAP)) { /* * "ports auto" (without map-block) */ if ((l > 0) && (l % np->in_ppip == 0)) { if (l > np->in_space) { return -1; } else if ((l > np->in_ppip) && np->in_outmsk != 0xffffffff) np->in_nip++; } if (np->in_ppip != 0) { port = ntohs(sport); port += (l % np->in_ppip); port %= np->in_ppip; port += np->in_ppip * (ntohl(fin->fin_saddr) % np->in_ippip); port += MAPBLK_MINPORT; port = htons(port); } } else if (((np->in_redir & NAT_MAPBLK) == 0) && (flags & IPN_TCPUDPICMP) && (np->in_pnext != 0)) { /* * Standard port translation. Select next port. */ if (np->in_flags & IPN_SEQUENTIAL) { port = np->in_pnext; } else { port = ipf_random() % (ntohs(np->in_pmax) - ntohs(np->in_pmin)); port += ntohs(np->in_pmin); } port = htons(port); np->in_pnext++; if (np->in_pnext > ntohs(np->in_pmax)) { np->in_pnext = ntohs(np->in_pmin); if (np->in_outmsk != 0xffffffff) np->in_nip++; } } if (np->in_flags & IPN_IPRANGE) { if (np->in_nip > ntohl(np->in_outmsk)) np->in_nip = ntohl(np->in_outip); } else { if ((np->in_outmsk != 0xffffffff) && ((np->in_nip + 1) & ntohl(np->in_outmsk)) > ntohl(np->in_outip)) np->in_nip = ntohl(np->in_outip) + 1; } if ((port == 0) && (flags & (IPN_TCPUDPICMP|IPN_ICMPQUERY))) port = sport; /* * Here we do a lookup of the connection as seen from * the outside. If an IP# pair already exists, try * again. So if you have A->B becomes C->B, you can * also have D->E become C->E but not D->B causing * another C->B. Also take protocol and ports into * account when determining whether a pre-existing * NAT setup will cause an external conflict where * this is appropriate. 
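The NAT_BIMAP branch above rewrites addresses 1:1 by keeping the host bits of the source address and substituting the network bits of the target range. A short worked example in host byte order (hypothetical example_* names; the kernel code operates on network-order fields):

    #include <stdio.h>
    #include <stdint.h>

    /* 1:1 "bimap" rewrite: keep the host bits, swap the network bits. */
    static uint32_t
    example_bimap(uint32_t saddr, uint32_t inmsk, uint32_t outip)
    {
        return outip | (saddr & ~inmsk);
    }

    int
    main(void)
    {
        /* 10.1.1.77 mapped through "bimap 10.1.1.0/24 -> 192.0.2.0/24" */
        uint32_t saddr = (10u << 24) | (1 << 16) | (1 << 8) | 77;
        uint32_t addr = example_bimap(saddr, 0xffffff00u,
            (192u << 24) | (2 << 8));
        printf("%u.%u.%u.%u\n", (unsigned)(addr >> 24),
            (unsigned)((addr >> 16) & 0xff), (unsigned)((addr >> 8) & 0xff),
            (unsigned)(addr & 0xff));       /* prints 192.0.2.77 */
        return 0;
    }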
*/ inb.s_addr = htonl(in.s_addr); sp = fin->fin_data[0]; dp = fin->fin_data[1]; fin->fin_data[0] = fin->fin_data[1]; fin->fin_data[1] = htons(port); natl = nat_inlookup(fin, flags & ~(SI_WILDP|NAT_SEARCH), (u_int)fin->fin_p, fin->fin_dst, inb); fin->fin_data[0] = sp; fin->fin_data[1] = dp; /* * Has the search wrapped around and come back to the * start ? */ if ((natl != NULL) && (np->in_pnext != 0) && (st_port == np->in_pnext) && (np->in_nip != 0) && (st_ip == np->in_nip)) return -1; l++; } while (natl != NULL); if (np->in_space > 0) np->in_space--; /* Setup the NAT table */ nat->nat_inip = fin->fin_src; nat->nat_outip.s_addr = htonl(in.s_addr); nat->nat_oip = fin->fin_dst; if (nat->nat_hm == NULL) nat->nat_hm = nat_hostmap(np, fin->fin_src, fin->fin_dst, nat->nat_outip, 0); /* * The ICMP checksum does not have a pseudo header containing * the IP addresses */ ni->nai_sum1 = LONG_SUM(ntohl(fin->fin_saddr)); ni->nai_sum2 = LONG_SUM(in.s_addr); if ((flags & IPN_TCPUDP)) { ni->nai_sum1 += ntohs(sport); ni->nai_sum2 += ntohs(port); } if (flags & IPN_TCPUDP) { nat->nat_inport = sport; nat->nat_outport = port; /* sport */ nat->nat_oport = dport; ((tcphdr_t *)fin->fin_dp)->th_sport = port; } else if (flags & IPN_ICMPQUERY) { ((icmphdr_t *)fin->fin_dp)->icmp_id = port; nat->nat_inport = port; nat->nat_outport = port; } else if (fin->fin_p == IPPROTO_GRE) { #if 0 nat->nat_gre.gs_flags = ((grehdr_t *)fin->fin_dp)->gr_flags; if (GRE_REV(nat->nat_gre.gs_flags) == 1) { nat->nat_oport = 0;/*fin->fin_data[1];*/ nat->nat_inport = 0;/*fin->fin_data[0];*/ nat->nat_outport = 0;/*fin->fin_data[0];*/ nat->nat_call[0] = fin->fin_data[0]; nat->nat_call[1] = fin->fin_data[0]; } #endif } ni->nai_ip.s_addr = in.s_addr; ni->nai_port = port; ni->nai_nport = dport; return 0; } /* ------------------------------------------------------------------------ */ /* Function: nat_newrdr */ /* Returns: int - -1 == error, 0 == success (no move), 1 == success and */ /* allow rule to be moved if IPN_ROUNDR is set. */ /* Parameters: fin(I) - pointer to packet information */ /* nat(I) - pointer to NAT entry */ /* ni(I) - pointer to structure with misc. information needed */ /* to create new NAT entry. */ /* */ /* ni.nai_ip is passed in uninitialised and must be set, in host byte order,*/ /* to the new IP address for the translation. */ /* ------------------------------------------------------------------------ */ static INLINE int nat_newrdr(fin, nat, ni) fr_info_t *fin; nat_t *nat; natinfo_t *ni; { u_short nport, dport, sport; struct in_addr in, inb; u_short sp, dp; hostmap_t *hm; u_32_t flags; ipnat_t *np; nat_t *natl; int move; move = 1; hm = NULL; in.s_addr = 0; np = ni->nai_np; flags = ni->nai_flags; sport = ni->nai_sport; dport = ni->nai_dport; /* * If the matching rule has IPN_STICKY set, then we want to have the * same rule kick in as before. Why would this happen? If you have * a collection of rdr rules with "round-robin sticky", the current * packet might match a different one to the previous connection but * we want the same destination to be used. */ if (((np->in_flags & (IPN_ROUNDR|IPN_SPLIT)) != 0) && ((np->in_flags & IPN_STICKY) != 0)) { hm = nat_hostmap(NULL, fin->fin_src, fin->fin_dst, in, (u_32_t)dport); if (hm != NULL) { in.s_addr = ntohl(hm->hm_mapip.s_addr); np = hm->hm_ipnat; ni->nai_np = np; move = 0; } } /* * Otherwise, it's an inbound packet. Most likely, we don't * want to rewrite source ports and source addresses. Instead, * we want to rewrite to a fixed internal address and fixed * internal port. 
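	 *
	 * Illustrative only: the kind of rule handled here looks like
	 *   rdr fxp0 0.0.0.0/0 port 80 -> 192.168.1.10 port 8080 tcp
	 * (interface and addresses are examples, not taken from this
	 * code), i.e. connections arriving on the external address are
	 * sent to one fixed internal host and port.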
*/ if (np->in_flags & IPN_SPLIT) { in.s_addr = np->in_nip; if ((np->in_flags & (IPN_ROUNDR|IPN_STICKY)) == IPN_STICKY) { hm = nat_hostmap(NULL, fin->fin_src, fin->fin_dst, in, (u_32_t)dport); if (hm != NULL) { in.s_addr = hm->hm_mapip.s_addr; move = 0; } } if (hm == NULL || hm->hm_ref == 1) { if (np->in_inip == htonl(in.s_addr)) { np->in_nip = ntohl(np->in_inmsk); move = 0; } else { np->in_nip = ntohl(np->in_inip); } } } else if ((np->in_inip == 0) && (np->in_inmsk == 0xffffffff)) { /* * 0/32 - use the interface's IP address. */ if (fr_ifpaddr(4, FRI_NORMAL, fin->fin_ifp, &in, NULL) == -1) return -1; in.s_addr = ntohl(in.s_addr); } else if ((np->in_inip == 0) && (np->in_inmsk== 0)) { /* * 0/0 - use the original destination address/port. */ in.s_addr = ntohl(fin->fin_daddr); } else if (np->in_redir == NAT_BIMAP && np->in_inmsk == np->in_outmsk) { /* * map the address block in a 1:1 fashion */ in.s_addr = np->in_inip; in.s_addr |= fin->fin_daddr & ~np->in_inmsk; in.s_addr = ntohl(in.s_addr); } else { in.s_addr = ntohl(np->in_inip); } if ((np->in_pnext == 0) || ((flags & NAT_NOTRULEPORT) != 0)) nport = dport; else { /* * Whilst not optimized for the case where * pmin == pmax, the gain is not significant. */ if (((np->in_flags & IPN_FIXEDDPORT) == 0) && (np->in_pmin != np->in_pmax)) { nport = ntohs(dport) - ntohs(np->in_pmin) + ntohs(np->in_pnext); nport = htons(nport); } else nport = np->in_pnext; } /* * When the redirect-to address is set to 0.0.0.0, just * assume a blank `forwarding' of the packet. We don't * setup any translation for this either. */ if (in.s_addr == 0) { if (nport == dport) return -1; in.s_addr = ntohl(fin->fin_daddr); } /* * Check to see if this redirect mapping already exists and if * it does, return "failure" (allowing it to be created will just * cause one or both of these "connections" to stop working.) 
*/ inb.s_addr = htonl(in.s_addr); sp = fin->fin_data[0]; dp = fin->fin_data[1]; fin->fin_data[1] = fin->fin_data[0]; fin->fin_data[0] = ntohs(nport); natl = nat_outlookup(fin, flags & ~(SI_WILDP|NAT_SEARCH), (u_int)fin->fin_p, inb, fin->fin_src); fin->fin_data[0] = sp; fin->fin_data[1] = dp; if (natl != NULL) return -1; nat->nat_inip.s_addr = htonl(in.s_addr); nat->nat_outip = fin->fin_dst; nat->nat_oip = fin->fin_src; if ((nat->nat_hm == NULL) && ((np->in_flags & IPN_STICKY) != 0)) nat->nat_hm = nat_hostmap(np, fin->fin_src, fin->fin_dst, in, (u_32_t)dport); ni->nai_sum1 = LONG_SUM(ntohl(fin->fin_daddr)) + ntohs(dport); ni->nai_sum2 = LONG_SUM(in.s_addr) + ntohs(nport); ni->nai_ip.s_addr = in.s_addr; ni->nai_nport = nport; ni->nai_port = sport; if (flags & IPN_TCPUDP) { nat->nat_inport = nport; nat->nat_outport = dport; nat->nat_oport = sport; ((tcphdr_t *)fin->fin_dp)->th_dport = nport; } else if (flags & IPN_ICMPQUERY) { ((icmphdr_t *)fin->fin_dp)->icmp_id = nport; nat->nat_inport = nport; nat->nat_outport = nport; } else if (fin->fin_p == IPPROTO_GRE) { #if 0 nat->nat_gre.gs_flags = ((grehdr_t *)fin->fin_dp)->gr_flags; if (GRE_REV(nat->nat_gre.gs_flags) == 1) { nat->nat_call[0] = fin->fin_data[0]; nat->nat_call[1] = fin->fin_data[1]; nat->nat_oport = 0; /*fin->fin_data[0];*/ nat->nat_inport = 0; /*fin->fin_data[1];*/ nat->nat_outport = 0; /*fin->fin_data[1];*/ } #endif } return move; } /* ------------------------------------------------------------------------ */ /* Function: nat_new */ /* Returns: nat_t* - NULL == failure to create new NAT structure, */ /* else pointer to new NAT structure */ /* Parameters: fin(I) - pointer to packet information */ /* np(I) - pointer to NAT rule */ /* natsave(I) - pointer to where to store NAT struct pointer */ /* flags(I) - flags describing the current packet */ /* direction(I) - direction of packet (in/out) */ /* Write Lock: ipf_nat */ /* */ /* Attempts to create a new NAT entry. Does not actually change the packet */ /* in any way. */ /* */ /* This fucntion is in three main parts: (1) deal with creating a new NAT */ /* structure for a "MAP" rule (outgoing NAT translation); (2) deal with */ /* creating a new NAT structure for a "RDR" rule (incoming NAT translation) */ /* and (3) building that structure and putting it into the NAT table(s). */ /* */ /* NOTE: natsave should NOT be used top point back to an ipstate_t struct */ /* as it can result in memory being corrupted. */ /* ------------------------------------------------------------------------ */ nat_t *nat_new(fin, np, natsave, flags, direction) fr_info_t *fin; ipnat_t *np; nat_t **natsave; u_int flags; int direction; { u_short port = 0, sport = 0, dport = 0, nport = 0; tcphdr_t *tcp = NULL; hostmap_t *hm = NULL; struct in_addr in; nat_t *nat, *natl; u_int nflags; natinfo_t ni; u_32_t sumd; int move; #if SOLARIS && defined(_KERNEL) && (SOLARIS2 >= 6) && defined(ICK_M_CTL_MAGIC) qpktinfo_t *qpi = fin->fin_qpi; #endif if (nat_stats.ns_inuse >= ipf_nattable_max) { nat_stats.ns_memfail++; fr_nat_doflush = 1; return NULL; } move = 1; nflags = np->in_flags & flags; nflags &= NAT_FROMRULE; ni.nai_np = np; ni.nai_nflags = nflags; ni.nai_flags = flags; ni.nai_dport = 0; ni.nai_sport = 0; /* Give me a new nat */ KMALLOC(nat, nat_t *); if (nat == NULL) { nat_stats.ns_memfail++; /* * Try to automatically tune the max # of entries in the * table allowed to be less than what will cause kmem_alloc() * to fail and try to eliminate panics due to out of memory * conditions arising. 
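		 *
		 * A rough worked example: if KMALLOC() fails while
		 * nat_stats.ns_inuse is 30000 and the limit is still
		 * above the table size, ipf_nattable_max is pulled down
		 * to 29900 (ns_inuse - 100), so later nat_new() calls
		 * are refused up front rather than hammering the
		 * allocator while memory is this tight.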
*/ if (ipf_nattable_max > ipf_nattable_sz) { ipf_nattable_max = nat_stats.ns_inuse - 100; printf("ipf_nattable_max reduced to %d\n", ipf_nattable_max); } return NULL; } if (flags & IPN_TCPUDP) { tcp = fin->fin_dp; ni.nai_sport = htons(fin->fin_sport); ni.nai_dport = htons(fin->fin_dport); } else if (flags & IPN_ICMPQUERY) { /* * In the ICMP query NAT code, we translate the ICMP id fields * to make them unique. This is indepedent of the ICMP type * (e.g. in the unlikely event that a host sends an echo and * an tstamp request with the same id, both packets will have * their ip address/id field changed in the same way). */ /* The icmp_id field is used by the sender to identify the * process making the icmp request. (the receiver justs * copies it back in its response). So, it closely matches * the concept of source port. We overlay sport, so we can * maximally reuse the existing code. */ ni.nai_sport = ((icmphdr_t *)fin->fin_dp)->icmp_id; ni.nai_dport = ni.nai_sport; } bzero((char *)nat, sizeof(*nat)); nat->nat_flags = flags; nat->nat_redir = np->in_redir; if ((flags & NAT_SLAVE) == 0) { MUTEX_ENTER(&ipf_nat_new); } /* * Search the current table for a match. */ if (direction == NAT_OUTBOUND) { /* * We can now arrange to call this for the same connection * because ipf_nat_new doesn't protect the code path into * this function. */ natl = nat_outlookup(fin, nflags, (u_int)fin->fin_p, fin->fin_src, fin->fin_dst); if (natl != NULL) { KFREE(nat); nat = natl; goto done; } move = nat_newmap(fin, nat, &ni); if (move == -1) goto badnat; np = ni.nai_np; in = ni.nai_ip; } else { /* * NAT_INBOUND is used only for redirects rules */ natl = nat_inlookup(fin, nflags, (u_int)fin->fin_p, fin->fin_src, fin->fin_dst); if (natl != NULL) { KFREE(nat); nat = natl; goto done; } move = nat_newrdr(fin, nat, &ni); if (move == -1) goto badnat; np = ni.nai_np; in = ni.nai_ip; } port = ni.nai_port; nport = ni.nai_nport; if ((move == 1) && (np->in_flags & IPN_ROUNDR)) { if (np->in_redir == NAT_REDIRECT) { nat_delrdr(np); nat_addrdr(np); } else if (np->in_redir == NAT_MAP) { nat_delnat(np); nat_addnat(np); } } if (flags & IPN_TCPUDP) { sport = ni.nai_sport; dport = ni.nai_dport; } else if (flags & IPN_ICMPQUERY) { sport = ni.nai_sport; dport = 0; } CALC_SUMD(ni.nai_sum1, ni.nai_sum2, sumd); nat->nat_sumd[0] = (sumd & 0xffff) + (sumd >> 16); #if SOLARIS && defined(_KERNEL) && (SOLARIS2 >= 6) && defined(ICK_M_CTL_MAGIC) if ((flags & IPN_TCP) && dohwcksum && (((ill_t *)qpi->qpi_ill)->ill_ick.ick_magic == ICK_M_CTL_MAGIC)) { if (direction == NAT_OUTBOUND) ni.nai_sum1 = LONG_SUM(in.s_addr); else ni.nai_sum1 = LONG_SUM(ntohl(fin->fin_saddr)); ni.nai_sum1 += LONG_SUM(ntohl(fin->fin_daddr)); ni.nai_sum1 += 30; ni.nai_sum1 = (ni.nai_sum1 & 0xffff) + (ni.nai_sum1 >> 16); nat->nat_sumd[1] = NAT_HW_CKSUM|(ni.nai_sum1 & 0xffff); } else #endif nat->nat_sumd[1] = nat->nat_sumd[0]; if ((flags & IPN_TCPUDPICMP) && ((sport != port) || (dport != nport))) { if (direction == NAT_OUTBOUND) ni.nai_sum1 = LONG_SUM(ntohl(fin->fin_saddr)); else ni.nai_sum1 = LONG_SUM(ntohl(fin->fin_daddr)); ni.nai_sum2 = LONG_SUM(in.s_addr); CALC_SUMD(ni.nai_sum1, ni.nai_sum2, sumd); nat->nat_ipsumd = (sumd & 0xffff) + (sumd >> 16); } else { nat->nat_ipsumd = nat->nat_sumd[0]; if (!(flags & IPN_TCPUDPICMP)) { nat->nat_sumd[0] = 0; nat->nat_sumd[1] = 0; } } if (nat_finalise(fin, nat, &ni, tcp, natsave, direction) == -1) { fr_nat_doflush = 1; goto badnat; } if (flags & SI_WILDP) nat_stats.ns_wilds++; fin->fin_flx |= FI_NEWNAT; goto done; badnat: nat_stats.ns_badnat++; 
if ((hm = nat->nat_hm) != NULL) fr_hostmapdel(&hm); KFREE(nat); nat = NULL; done: if ((flags & NAT_SLAVE) == 0) { MUTEX_EXIT(&ipf_nat_new); } return nat; } /* ------------------------------------------------------------------------ */ /* Function: nat_finalise */ /* Returns: int - 0 == sucess, -1 == failure */ /* Parameters: fin(I) - pointer to packet information */ /* nat(I) - pointer to NAT entry */ /* ni(I) - pointer to structure with misc. information needed */ /* to create new NAT entry. */ /* Write Lock: ipf_nat */ /* */ /* This is the tail end of constructing a new NAT entry and is the same */ /* for both IPv4 and IPv6. */ /* ------------------------------------------------------------------------ */ /*ARGSUSED*/ static int nat_finalise(fin, nat, ni, tcp, natsave, direction) fr_info_t *fin; nat_t *nat; natinfo_t *ni; tcphdr_t *tcp; nat_t **natsave; int direction; { frentry_t *fr; ipnat_t *np; np = ni->nai_np; if (np->in_ifps[0] != NULL) { COPYIFNAME(4, np->in_ifps[0], nat->nat_ifnames[0]); } if (np->in_ifps[1] != NULL) { COPYIFNAME(4, np->in_ifps[1], nat->nat_ifnames[1]); } #ifdef IPFILTER_SYNC if ((nat->nat_flags & SI_CLONE) == 0) nat->nat_sync = ipfsync_new(SMC_NAT, fin, nat); #endif nat->nat_me = natsave; nat->nat_dir = direction; nat->nat_ifps[0] = np->in_ifps[0]; nat->nat_ifps[1] = np->in_ifps[1]; nat->nat_ptr = np; nat->nat_p = fin->fin_p; nat->nat_mssclamp = np->in_mssclamp; if (nat->nat_p == IPPROTO_TCP) nat->nat_seqnext[0] = ntohl(tcp->th_seq); if ((np->in_apr != NULL) && ((ni->nai_flags & NAT_SLAVE) == 0)) if (appr_new(fin, nat) == -1) return -1; if (nat_insert(nat, fin->fin_rev) == 0) { if (nat_logging) nat_log(nat, (u_int)np->in_redir); np->in_use++; fr = fin->fin_fr; nat->nat_fr = fr; if (fr != NULL) { MUTEX_ENTER(&fr->fr_lock); fr->fr_ref++; MUTEX_EXIT(&fr->fr_lock); } return 0; } /* * nat_insert failed, so cleanup time... */ return -1; } /* ------------------------------------------------------------------------ */ /* Function: nat_insert */ /* Returns: int - 0 == sucess, -1 == failure */ /* Parameters: nat(I) - pointer to NAT structure */ /* rev(I) - flag indicating forward/reverse direction of packet */ /* Write Lock: ipf_nat */ /* */ /* Insert a NAT entry into the hash tables for searching and add it to the */ /* list of active NAT entries. Adjust global counters when complete. */ /* ------------------------------------------------------------------------ */ int nat_insert(nat, rev) nat_t *nat; int rev; { u_int hv1, hv2; nat_t **natp; /* * Try and return an error as early as possible, so calculate the hash * entry numbers first and then proceed. 
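	 *
	 * Two hash values are computed: hv1 indexes nat_table[0] on
	 * the inside address/port plus the peer (used by
	 * nat_outlookup()), hv2 indexes nat_table[1] on the translated
	 * address/port plus the peer (used by nat_inlookup()).  While
	 * either port is still a wildcard (SI_W_SPORT/SI_W_DPORT) the
	 * ports are left out of the hash so the entry can be found
	 * before they are known; nat_tabmove() re-hashes it once the
	 * real ports have been filled in.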
*/ if ((nat->nat_flags & (SI_W_SPORT|SI_W_DPORT)) == 0) { hv1 = NAT_HASH_FN(nat->nat_inip.s_addr, nat->nat_inport, 0xffffffff); hv1 = NAT_HASH_FN(nat->nat_oip.s_addr, hv1 + nat->nat_oport, ipf_nattable_sz); hv2 = NAT_HASH_FN(nat->nat_outip.s_addr, nat->nat_outport, 0xffffffff); hv2 = NAT_HASH_FN(nat->nat_oip.s_addr, hv2 + nat->nat_oport, ipf_nattable_sz); } else { hv1 = NAT_HASH_FN(nat->nat_inip.s_addr, 0, 0xffffffff); hv1 = NAT_HASH_FN(nat->nat_oip.s_addr, hv1, ipf_nattable_sz); hv2 = NAT_HASH_FN(nat->nat_outip.s_addr, 0, 0xffffffff); hv2 = NAT_HASH_FN(nat->nat_oip.s_addr, hv2, ipf_nattable_sz); } if (nat_stats.ns_bucketlen[0][hv1] >= fr_nat_maxbucket || nat_stats.ns_bucketlen[1][hv2] >= fr_nat_maxbucket) { return -1; } nat->nat_hv[0] = hv1; nat->nat_hv[1] = hv2; MUTEX_INIT(&nat->nat_lock, "nat entry lock"); nat->nat_rev = rev; nat->nat_ref = 1; nat->nat_bytes[0] = 0; nat->nat_pkts[0] = 0; nat->nat_bytes[1] = 0; nat->nat_pkts[1] = 0; nat->nat_ifnames[0][LIFNAMSIZ - 1] = '\0'; nat->nat_ifps[0] = fr_resolvenic(nat->nat_ifnames[0], 4); if (nat->nat_ifnames[1][0] != '\0') { nat->nat_ifnames[1][LIFNAMSIZ - 1] = '\0'; nat->nat_ifps[1] = fr_resolvenic(nat->nat_ifnames[1], 4); } else { (void) strncpy(nat->nat_ifnames[1], nat->nat_ifnames[0], LIFNAMSIZ); nat->nat_ifnames[1][LIFNAMSIZ - 1] = '\0'; nat->nat_ifps[1] = nat->nat_ifps[0]; } nat->nat_next = nat_instances; nat->nat_pnext = &nat_instances; if (nat_instances) nat_instances->nat_pnext = &nat->nat_next; nat_instances = nat; natp = &nat_table[0][hv1]; if (*natp) (*natp)->nat_phnext[0] = &nat->nat_hnext[0]; nat->nat_phnext[0] = natp; nat->nat_hnext[0] = *natp; *natp = nat; nat_stats.ns_bucketlen[0][hv1]++; natp = &nat_table[1][hv2]; if (*natp) (*natp)->nat_phnext[1] = &nat->nat_hnext[1]; nat->nat_phnext[1] = natp; nat->nat_hnext[1] = *natp; *natp = nat; nat_stats.ns_bucketlen[1][hv2]++; fr_setnatqueue(nat, rev); nat_stats.ns_added++; nat_stats.ns_inuse++; return 0; } /* ------------------------------------------------------------------------ */ /* Function: nat_icmperrorlookup */ /* Returns: nat_t* - point to matching NAT structure */ /* Parameters: fin(I) - pointer to packet information */ /* dir(I) - direction of packet (in/out) */ /* */ /* Check if the ICMP error message is related to an existing TCP, UDP or */ /* ICMP query nat entry. It is assumed that the packet is already of the */ /* the required length. */ /* ------------------------------------------------------------------------ */ nat_t *nat_icmperrorlookup(fin, dir) fr_info_t *fin; int dir; { int flags = 0, type, minlen; icmphdr_t *icmp, *orgicmp; tcphdr_t *tcp = NULL; u_short data[2]; nat_t *nat; ip_t *oip; u_int p; icmp = fin->fin_dp; type = icmp->icmp_type; /* * Does it at least have the return (basic) IP header ? * Only a basic IP header (no options) should be with an ICMP error * header. Also, if it's not an error type, then return. */ if ((fin->fin_hlen != sizeof(ip_t)) || !(fin->fin_flx & FI_ICMPERR)) return NULL; /* * Check packet size */ oip = (ip_t *)((char *)fin->fin_dp + 8); minlen = IP_HL(oip) << 2; if ((minlen < sizeof(ip_t)) || (fin->fin_plen < ICMPERR_IPICMPHLEN + minlen)) return NULL; /* * Is the buffer big enough for all of it ? It's the size of the IP * header claimed in the encapsulated part which is of concern. It * may be too big to be in this buffer but not so big that it's * outside the ICMP packet, leading to TCP deref's causing problems. 
* This is possible because we don't know how big oip_hl is when we * do the pullup early in fr_check() and thus can't gaurantee it is * all here now. */ #ifdef _KERNEL { mb_t *m; m = fin->fin_m; # if defined(MENTAT) if ((char *)oip + fin->fin_dlen - ICMPERR_ICMPHLEN > (char *)m->b_wptr) return NULL; # else if ((char *)oip + fin->fin_dlen - ICMPERR_ICMPHLEN > (char *)fin->fin_ip + M_LEN(m)) return NULL; # endif } #endif if (fin->fin_daddr != oip->ip_src.s_addr) return NULL; p = oip->ip_p; if (p == IPPROTO_TCP) flags = IPN_TCP; else if (p == IPPROTO_UDP) flags = IPN_UDP; else if (p == IPPROTO_ICMP) { orgicmp = (icmphdr_t *)((char *)oip + (IP_HL(oip) << 2)); /* see if this is related to an ICMP query */ if (nat_icmpquerytype4(orgicmp->icmp_type)) { data[0] = fin->fin_data[0]; data[1] = fin->fin_data[1]; fin->fin_data[0] = 0; fin->fin_data[1] = orgicmp->icmp_id; flags = IPN_ICMPERR|IPN_ICMPQUERY; /* * NOTE : dir refers to the direction of the original * ip packet. By definition the icmp error * message flows in the opposite direction. */ if (dir == NAT_INBOUND) nat = nat_inlookup(fin, flags, p, oip->ip_dst, oip->ip_src); else nat = nat_outlookup(fin, flags, p, oip->ip_dst, oip->ip_src); fin->fin_data[0] = data[0]; fin->fin_data[1] = data[1]; return nat; } } if (flags & IPN_TCPUDP) { minlen += 8; /* + 64bits of data to get ports */ if (fin->fin_plen < ICMPERR_IPICMPHLEN + minlen) return NULL; data[0] = fin->fin_data[0]; data[1] = fin->fin_data[1]; tcp = (tcphdr_t *)((char *)oip + (IP_HL(oip) << 2)); fin->fin_data[0] = ntohs(tcp->th_dport); fin->fin_data[1] = ntohs(tcp->th_sport); if (dir == NAT_INBOUND) { nat = nat_inlookup(fin, flags, p, oip->ip_dst, oip->ip_src); } else { nat = nat_outlookup(fin, flags, p, oip->ip_dst, oip->ip_src); } fin->fin_data[0] = data[0]; fin->fin_data[1] = data[1]; return nat; } if (dir == NAT_INBOUND) return nat_inlookup(fin, 0, p, oip->ip_dst, oip->ip_src); else return nat_outlookup(fin, 0, p, oip->ip_dst, oip->ip_src); } /* ------------------------------------------------------------------------ */ /* Function: nat_icmperror */ /* Returns: nat_t* - point to matching NAT structure */ /* Parameters: fin(I) - pointer to packet information */ /* nflags(I) - NAT flags for this packet */ /* dir(I) - direction of packet (in/out) */ /* */ /* Fix up an ICMP packet which is an error message for an existing NAT */ /* session. This will correct both packet header data and checksums. */ /* */ /* This should *ONLY* be used for incoming ICMP error packets to make sure */ /* a NAT'd ICMP packet gets correctly recognised. */ /* ------------------------------------------------------------------------ */ nat_t *nat_icmperror(fin, nflags, dir) fr_info_t *fin; u_int *nflags; int dir; { u_32_t sum1, sum2, sumd, sumd2; struct in_addr a1, a2; int flags, dlen, odst; icmphdr_t *icmp; u_short *csump; tcphdr_t *tcp; nat_t *nat; ip_t *oip; void *dp; if ((fin->fin_flx & (FI_SHORT|FI_FRAGBODY))) return NULL; /* * nat_icmperrorlookup() will return NULL for `defective' packets. 
*/ if ((fin->fin_v != 4) || !(nat = nat_icmperrorlookup(fin, dir))) return NULL; tcp = NULL; csump = NULL; flags = 0; sumd2 = 0; *nflags = IPN_ICMPERR; icmp = fin->fin_dp; oip = (ip_t *)&icmp->icmp_ip; dp = (((char *)oip) + (IP_HL(oip) << 2)); if (oip->ip_p == IPPROTO_TCP) { tcp = (tcphdr_t *)dp; csump = (u_short *)&tcp->th_sum; flags = IPN_TCP; } else if (oip->ip_p == IPPROTO_UDP) { udphdr_t *udp; udp = (udphdr_t *)dp; tcp = (tcphdr_t *)dp; csump = (u_short *)&udp->uh_sum; flags = IPN_UDP; } else if (oip->ip_p == IPPROTO_ICMP) flags = IPN_ICMPQUERY; dlen = fin->fin_plen - ((char *)dp - (char *)fin->fin_ip); /* * Need to adjust ICMP header to include the real IP#'s and * port #'s. Only apply a checksum change relative to the * IP address change as it will be modified again in fr_checknatout * for both address and port. Two checksum changes are * necessary for the two header address changes. Be careful * to only modify the checksum once for the port # and twice * for the IP#. */ /* * Step 1 * Fix the IP addresses in the offending IP packet. You also need * to adjust the IP header checksum of that offending IP packet. * * Normally, you would expect that the ICMP checksum of the * ICMP error message needs to be adjusted as well for the * IP address change in oip. * However, this is a NOP, because the ICMP checksum is * calculated over the complete ICMP packet, which includes the * changed oip IP addresses and oip->ip_sum. However, these * two changes cancel each other out (if the delta for * the IP address is x, then the delta for ip_sum is minus x), * so no change in the icmp_cksum is necessary. * * Inbound ICMP * ------------ * MAP rule, SRC=a,DST=b -> SRC=c,DST=b * - response to outgoing packet (a,b)=>(c,b) (OIP_SRC=c,OIP_DST=b) * - OIP_SRC(c)=nat_outip, OIP_DST(b)=nat_oip * * RDR rule, SRC=a,DST=b -> SRC=a,DST=c * - response to outgoing packet (c,a)=>(b,a) (OIP_SRC=b,OIP_DST=a) * - OIP_SRC(b)=nat_outip, OIP_DST(a)=nat_oip * * Outbound ICMP * ------------- * MAP rule, SRC=a,DST=b -> SRC=c,DST=b * - response to incoming packet (b,c)=>(b,a) (OIP_SRC=b,OIP_DST=a) * - OIP_SRC(a)=nat_oip, OIP_DST(c)=nat_inip * * RDR rule, SRC=a,DST=b -> SRC=a,DST=c * - response to incoming packet (a,b)=>(a,c) (OIP_SRC=a,OIP_DST=c) * - OIP_SRC(a)=nat_oip, OIP_DST(c)=nat_inip * */ odst = (oip->ip_dst.s_addr == nat->nat_oip.s_addr) ? 1 : 0; if (odst == 1) { a1.s_addr = ntohl(nat->nat_inip.s_addr); a2.s_addr = ntohl(oip->ip_src.s_addr); oip->ip_src.s_addr = htonl(a1.s_addr); } else { a1.s_addr = ntohl(nat->nat_outip.s_addr); a2.s_addr = ntohl(oip->ip_dst.s_addr); oip->ip_dst.s_addr = htonl(a1.s_addr); } sumd = a2.s_addr - a1.s_addr; if (sumd != 0) { if (a1.s_addr > a2.s_addr) sumd--; sumd = ~sumd; fix_datacksum(&oip->ip_sum, sumd); } sumd2 = sumd; sum1 = 0; sum2 = 0; /* * Fix UDP pseudo header checksum to compensate for the * IP address change. */ if (((flags & IPN_TCPUDP) != 0) && (dlen >= 4)) { /* * Step 2 : * For offending TCP/UDP IP packets, translate the ports as * well, based on the NAT specification. Of course such * a change may be reflected in the ICMP checksum as well. * * Since the port fields are part of the TCP/UDP checksum * of the offending IP packet, you need to adjust that checksum * as well... except that the change in the port numbers should * be offset by the checksum change. However, the TCP/UDP * checksum will also need to change if there has been an * IP address change. 
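		 *
		 * The adjustments below are incremental, in the spirit of
		 * RFC 1624: rather than recomputing any checksum, the
		 * folded 16-bit difference between the old and new field
		 * values (the deltas sumd and sumd2 built up here) is
		 * applied to the existing checksum via fix_datacksum()/
		 * fix_incksum()/fix_outcksum().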
*/ if (odst == 1) { sum1 = ntohs(nat->nat_inport); sum2 = ntohs(tcp->th_sport); tcp->th_sport = htons(sum1); } else { sum1 = ntohs(nat->nat_outport); sum2 = ntohs(tcp->th_dport); tcp->th_dport = htons(sum1); } sumd += sum1 - sum2; if (sumd != 0 || sumd2 != 0) { /* * At this point, sumd is the delta to apply to the * TCP/UDP header, given the changes in both the IP * address and the ports and sumd2 is the delta to * apply to the ICMP header, given the IP address * change delta that may need to be applied to the * TCP/UDP checksum instead. * * If we will both the IP and TCP/UDP checksums * then the ICMP checksum changes by the address * delta applied to the TCP/UDP checksum. If we * do not change the TCP/UDP checksum them we * apply the delta in ports to the ICMP checksum. */ if (oip->ip_p == IPPROTO_UDP) { if ((dlen >= 8) && (*csump != 0)) { fix_datacksum(csump, sumd); } else { sumd2 = sum1 - sum2; if (sum2 > sum1) sumd2--; } } else if (oip->ip_p == IPPROTO_TCP) { if (dlen >= 18) { fix_datacksum(csump, sumd); } else { sumd2 = sum2 - sum1; if (sum1 > sum2) sumd2--; } } if (sumd2 != 0) { ipnat_t *np; np = nat->nat_ptr; sumd2 = (sumd2 & 0xffff) + (sumd2 >> 16); sumd2 = (sumd2 & 0xffff) + (sumd2 >> 16); sumd2 = (sumd2 & 0xffff) + (sumd2 >> 16); if ((odst == 0) && (dir == NAT_OUTBOUND) && (fin->fin_rev == 0) && (np != NULL) && (np->in_redir & NAT_REDIRECT)) { fix_outcksum(fin, &icmp->icmp_cksum, sumd2); } else { fix_incksum(fin, &icmp->icmp_cksum, sumd2); } } } } else if (((flags & IPN_ICMPQUERY) != 0) && (dlen >= 8)) { icmphdr_t *orgicmp; /* * XXX - what if this is bogus hl and we go off the end ? * In this case, nat_icmperrorlookup() will have returned NULL. */ orgicmp = (icmphdr_t *)dp; if (odst == 1) { if (orgicmp->icmp_id != nat->nat_inport) { /* * Fix ICMP checksum (of the offening ICMP * query packet) to compensate the change * in the ICMP id of the offending ICMP * packet. * * Since you modify orgicmp->icmp_id with * a delta (say x) and you compensate that * in origicmp->icmp_cksum with a delta * minus x, you don't have to adjust the * overall icmp->icmp_cksum */ sum1 = ntohs(orgicmp->icmp_id); sum2 = ntohs(nat->nat_inport); CALC_SUMD(sum1, sum2, sumd); orgicmp->icmp_id = nat->nat_inport; fix_datacksum(&orgicmp->icmp_cksum, sumd); } } /* nat_dir == NAT_INBOUND is impossible for icmp queries */ } return nat; } /* * NB: these lookups don't lock access to the list, it assumed that it has * already been done! */ /* ------------------------------------------------------------------------ */ /* Function: nat_inlookup */ /* Returns: nat_t* - NULL == no match, */ /* else pointer to matching NAT entry */ /* Parameters: fin(I) - pointer to packet information */ /* flags(I) - NAT flags for this packet */ /* p(I) - protocol for this packet */ /* src(I) - source IP address */ /* mapdst(I) - destination IP address */ /* */ /* Lookup a nat entry based on the mapped destination ip address/port and */ /* real source address/port. We use this lookup when receiving a packet, */ /* we're looking for a table entry, based on the destination address. */ /* */ /* NOTE: THE PACKET BEING CHECKED (IF FOUND) HAS A MAPPING ALREADY. */ /* */ /* NOTE: IT IS ASSUMED THAT ipf_nat IS ONLY HELD WITH A READ LOCK WHEN */ /* THIS FUNCTION IS CALLED WITH NAT_SEARCH SET IN nflags. 
*/ /* */ /* flags -> relevant are IPN_UDP/IPN_TCP/IPN_ICMPQUERY that indicate if */ /* the packet is of said protocol */ /* ------------------------------------------------------------------------ */ nat_t *nat_inlookup(fin, flags, p, src, mapdst) fr_info_t *fin; u_int flags, p; struct in_addr src , mapdst; { u_short sport, dport; grehdr_t *gre; ipnat_t *ipn; u_int sflags; nat_t *nat; int nflags; u_32_t dst; void *ifp; u_int hv; ifp = fin->fin_ifp; sport = 0; dport = 0; gre = NULL; dst = mapdst.s_addr; sflags = flags & NAT_TCPUDPICMP; switch (p) { case IPPROTO_TCP : case IPPROTO_UDP : sport = htons(fin->fin_data[0]); dport = htons(fin->fin_data[1]); break; case IPPROTO_ICMP : if (flags & IPN_ICMPERR) sport = fin->fin_data[1]; else dport = fin->fin_data[1]; break; default : break; } if ((flags & SI_WILDP) != 0) goto find_in_wild_ports; hv = NAT_HASH_FN(dst, dport, 0xffffffff); hv = NAT_HASH_FN(src.s_addr, hv + sport, ipf_nattable_sz); nat = nat_table[1][hv]; for (; nat; nat = nat->nat_hnext[1]) { if (nat->nat_ifps[0] != NULL) { if ((ifp != NULL) && (ifp != nat->nat_ifps[0])) continue; } else if (ifp != NULL) nat->nat_ifps[0] = ifp; nflags = nat->nat_flags; if (nat->nat_oip.s_addr == src.s_addr && nat->nat_outip.s_addr == dst && (((p == 0) && (sflags == (nat->nat_flags & IPN_TCPUDPICMP))) || (p == nat->nat_p))) { switch (p) { #if 0 case IPPROTO_GRE : if (nat->nat_call[1] != fin->fin_data[0]) continue; break; #endif case IPPROTO_ICMP : if ((flags & IPN_ICMPERR) != 0) { if (nat->nat_outport != sport) continue; } else { if (nat->nat_outport != dport) continue; } break; case IPPROTO_TCP : case IPPROTO_UDP : if (nat->nat_oport != sport) continue; if (nat->nat_outport != dport) continue; break; default : break; } ipn = nat->nat_ptr; if ((ipn != NULL) && (nat->nat_aps != NULL)) if (appr_match(fin, nat) != 0) continue; return nat; } } /* * So if we didn't find it but there are wildcard members in the hash * table, go back and look for them. We do this search and update here * because it is modifying the NAT table and we want to do this only * for the first packet that matches. The exception, of course, is * for "dummy" (FI_IGNORE) lookups. 
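	 *
	 * When a wildcard entry does match below, this packet's ports
	 * are written into it, the SI_W_SPORT/SI_W_DPORT flags are
	 * cleared and nat_tabmove() re-hashes the entry into the
	 * port-qualified buckets, so only the first matching packet
	 * pays for this slower search.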
*/ find_in_wild_ports: if (!(flags & NAT_TCPUDP) || !(flags & NAT_SEARCH)) return NULL; if (nat_stats.ns_wilds == 0) return NULL; RWLOCK_EXIT(&ipf_nat); hv = NAT_HASH_FN(dst, 0, 0xffffffff); hv = NAT_HASH_FN(src.s_addr, hv, ipf_nattable_sz); WRITE_ENTER(&ipf_nat); nat = nat_table[1][hv]; for (; nat; nat = nat->nat_hnext[1]) { if (nat->nat_ifps[0] != NULL) { if ((ifp != NULL) && (ifp != nat->nat_ifps[0])) continue; } else if (ifp != NULL) nat->nat_ifps[0] = ifp; if (nat->nat_p != fin->fin_p) continue; if (nat->nat_oip.s_addr != src.s_addr || nat->nat_outip.s_addr != dst) continue; nflags = nat->nat_flags; if (!(nflags & (NAT_TCPUDP|SI_WILDP))) continue; if (nat_wildok(nat, (int)sport, (int)dport, nflags, NAT_INBOUND) == 1) { if ((fin->fin_flx & FI_IGNORE) != 0) break; if ((nflags & SI_CLONE) != 0) { nat = fr_natclone(fin, nat); if (nat == NULL) break; } else { MUTEX_ENTER(&ipf_nat_new); nat_stats.ns_wilds--; MUTEX_EXIT(&ipf_nat_new); } nat->nat_oport = sport; nat->nat_outport = dport; nat->nat_flags &= ~(SI_W_DPORT|SI_W_SPORT); nat_tabmove(nat); break; } } MUTEX_DOWNGRADE(&ipf_nat); return nat; } /* ------------------------------------------------------------------------ */ /* Function: nat_tabmove */ /* Returns: Nil */ /* Parameters: nat(I) - pointer to NAT structure */ /* Write Lock: ipf_nat */ /* */ /* This function is only called for TCP/UDP NAT table entries where the */ /* original was placed in the table without hashing on the ports and we now */ /* want to include hashing on port numbers. */ /* ------------------------------------------------------------------------ */ static void nat_tabmove(nat) nat_t *nat; { nat_t **natp; u_int hv; if (nat->nat_flags & SI_CLONE) return; /* * Remove the NAT entry from the old location */ if (nat->nat_hnext[0]) nat->nat_hnext[0]->nat_phnext[0] = nat->nat_phnext[0]; *nat->nat_phnext[0] = nat->nat_hnext[0]; nat_stats.ns_bucketlen[0][nat->nat_hv[0]]--; if (nat->nat_hnext[1]) nat->nat_hnext[1]->nat_phnext[1] = nat->nat_phnext[1]; *nat->nat_phnext[1] = nat->nat_hnext[1]; nat_stats.ns_bucketlen[1][nat->nat_hv[1]]--; /* * Add into the NAT table in the new position */ hv = NAT_HASH_FN(nat->nat_inip.s_addr, nat->nat_inport, 0xffffffff); hv = NAT_HASH_FN(nat->nat_oip.s_addr, hv + nat->nat_oport, ipf_nattable_sz); nat->nat_hv[0] = hv; natp = &nat_table[0][hv]; if (*natp) (*natp)->nat_phnext[0] = &nat->nat_hnext[0]; nat->nat_phnext[0] = natp; nat->nat_hnext[0] = *natp; *natp = nat; nat_stats.ns_bucketlen[0][hv]++; hv = NAT_HASH_FN(nat->nat_outip.s_addr, nat->nat_outport, 0xffffffff); hv = NAT_HASH_FN(nat->nat_oip.s_addr, hv + nat->nat_oport, ipf_nattable_sz); nat->nat_hv[1] = hv; natp = &nat_table[1][hv]; if (*natp) (*natp)->nat_phnext[1] = &nat->nat_hnext[1]; nat->nat_phnext[1] = natp; nat->nat_hnext[1] = *natp; *natp = nat; nat_stats.ns_bucketlen[1][hv]++; } /* ------------------------------------------------------------------------ */ /* Function: nat_outlookup */ /* Returns: nat_t* - NULL == no match, */ /* else pointer to matching NAT entry */ /* Parameters: fin(I) - pointer to packet information */ /* flags(I) - NAT flags for this packet */ /* p(I) - protocol for this packet */ /* src(I) - source IP address */ /* dst(I) - destination IP address */ /* rw(I) - 1 == write lock on ipf_nat held, 0 == read lock. */ /* */ /* Lookup a nat entry based on the source 'real' ip address/port and */ /* destination address/port. We use this lookup when sending a packet out, */ /* we're looking for a table entry, based on the source address. 
*/ /* */ /* NOTE: THE PACKET BEING CHECKED (IF FOUND) HAS A MAPPING ALREADY. */ /* */ /* NOTE: IT IS ASSUMED THAT ipf_nat IS ONLY HELD WITH A READ LOCK WHEN */ /* THIS FUNCTION IS CALLED WITH NAT_SEARCH SET IN nflags. */ /* */ /* flags -> relevant are IPN_UDP/IPN_TCP/IPN_ICMPQUERY that indicate if */ /* the packet is of said protocol */ /* ------------------------------------------------------------------------ */ nat_t *nat_outlookup(fin, flags, p, src, dst) fr_info_t *fin; u_int flags, p; struct in_addr src , dst; { u_short sport, dport; u_int sflags; ipnat_t *ipn; u_32_t srcip; nat_t *nat; int nflags; void *ifp; u_int hv; ifp = fin->fin_ifp; srcip = src.s_addr; sflags = flags & IPN_TCPUDPICMP; sport = 0; dport = 0; switch (p) { case IPPROTO_TCP : case IPPROTO_UDP : sport = htons(fin->fin_data[0]); dport = htons(fin->fin_data[1]); break; case IPPROTO_ICMP : if (flags & IPN_ICMPERR) sport = fin->fin_data[1]; else dport = fin->fin_data[1]; break; default : break; } if ((flags & SI_WILDP) != 0) goto find_out_wild_ports; hv = NAT_HASH_FN(srcip, sport, 0xffffffff); hv = NAT_HASH_FN(dst.s_addr, hv + dport, ipf_nattable_sz); nat = nat_table[0][hv]; for (; nat; nat = nat->nat_hnext[0]) { if (nat->nat_ifps[1] != NULL) { if ((ifp != NULL) && (ifp != nat->nat_ifps[1])) continue; } else if (ifp != NULL) nat->nat_ifps[1] = ifp; nflags = nat->nat_flags; if (nat->nat_inip.s_addr == srcip && nat->nat_oip.s_addr == dst.s_addr && (((p == 0) && (sflags == (nflags & NAT_TCPUDPICMP))) || (p == nat->nat_p))) { switch (p) { #if 0 case IPPROTO_GRE : if (nat->nat_call[1] != fin->fin_data[0]) continue; break; #endif case IPPROTO_TCP : case IPPROTO_UDP : if (nat->nat_oport != dport) continue; if (nat->nat_inport != sport) continue; break; default : break; } ipn = nat->nat_ptr; if ((ipn != NULL) && (nat->nat_aps != NULL)) if (appr_match(fin, nat) != 0) continue; return nat; } } /* * So if we didn't find it but there are wildcard members in the hash * table, go back and look for them. We do this search and update here * because it is modifying the NAT table and we want to do this only * for the first packet that matches. The exception, of course, is * for "dummy" (FI_IGNORE) lookups. 
*/ find_out_wild_ports: if (!(flags & NAT_TCPUDP) || !(flags & NAT_SEARCH)) return NULL; if (nat_stats.ns_wilds == 0) return NULL; RWLOCK_EXIT(&ipf_nat); hv = NAT_HASH_FN(srcip, 0, 0xffffffff); hv = NAT_HASH_FN(dst.s_addr, hv, ipf_nattable_sz); WRITE_ENTER(&ipf_nat); nat = nat_table[0][hv]; for (; nat; nat = nat->nat_hnext[0]) { if (nat->nat_ifps[1] != NULL) { if ((ifp != NULL) && (ifp != nat->nat_ifps[1])) continue; } else if (ifp != NULL) nat->nat_ifps[1] = ifp; if (nat->nat_p != fin->fin_p) continue; if ((nat->nat_inip.s_addr != srcip) || (nat->nat_oip.s_addr != dst.s_addr)) continue; nflags = nat->nat_flags; if (!(nflags & (NAT_TCPUDP|SI_WILDP))) continue; if (nat_wildok(nat, (int)sport, (int)dport, nflags, NAT_OUTBOUND) == 1) { if ((fin->fin_flx & FI_IGNORE) != 0) break; if ((nflags & SI_CLONE) != 0) { nat = fr_natclone(fin, nat); if (nat == NULL) break; } else { MUTEX_ENTER(&ipf_nat_new); nat_stats.ns_wilds--; MUTEX_EXIT(&ipf_nat_new); } nat->nat_inport = sport; nat->nat_oport = dport; if (nat->nat_outport == 0) nat->nat_outport = sport; nat->nat_flags &= ~(SI_W_DPORT|SI_W_SPORT); nat_tabmove(nat); break; } } MUTEX_DOWNGRADE(&ipf_nat); return nat; } /* ------------------------------------------------------------------------ */ /* Function: nat_lookupredir */ /* Returns: nat_t* - NULL == no match, */ /* else pointer to matching NAT entry */ /* Parameters: np(I) - pointer to description of packet to find NAT table */ /* entry for. */ /* */ /* Lookup the NAT tables to search for a matching redirect */ /* The contents of natlookup_t should imitate those found in a packet that */ /* would be translated - ie a packet coming in for RDR or going out for MAP.*/ /* We can do the lookup in one of two ways, imitating an inbound or */ /* outbound packet. By default we assume outbound, unless IPN_IN is set. */ /* For IN, the fields are set as follows: */ /* nl_real* = source information */ /* nl_out* = destination information (translated) */ /* For an out packet, the fields are set like this: */ /* nl_in* = source information (untranslated) */ /* nl_out* = destination information (translated) */ /* ------------------------------------------------------------------------ */ nat_t *nat_lookupredir(np) natlookup_t *np; { fr_info_t fi; nat_t *nat; bzero((char *)&fi, sizeof(fi)); if (np->nl_flags & IPN_IN) { fi.fin_data[0] = ntohs(np->nl_realport); fi.fin_data[1] = ntohs(np->nl_outport); } else { fi.fin_data[0] = ntohs(np->nl_inport); fi.fin_data[1] = ntohs(np->nl_outport); } if (np->nl_flags & IPN_TCP) fi.fin_p = IPPROTO_TCP; else if (np->nl_flags & IPN_UDP) fi.fin_p = IPPROTO_UDP; else if (np->nl_flags & (IPN_ICMPERR|IPN_ICMPQUERY)) fi.fin_p = IPPROTO_ICMP; /* * We can do two sorts of lookups: * - IPN_IN: we have the `real' and `out' address, look for `in'. * - default: we have the `in' and `out' address, look for `real'. */ if (np->nl_flags & IPN_IN) { if ((nat = nat_inlookup(&fi, np->nl_flags, fi.fin_p, np->nl_realip, np->nl_outip))) { np->nl_inip = nat->nat_inip; np->nl_inport = nat->nat_inport; } } else { /* * If nl_inip is non null, this is a lookup based on the real * ip address. Else, we use the fake. 
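		 *
		 * Example (addresses illustrative): with an rdr session
		 * translating client 198.51.100.7:4242 -> 203.0.113.1:80
		 * into 192.168.1.10:8080, a caller supplying
		 * nl_inip/nl_inport = 192.168.1.10:8080 and
		 * nl_outip/nl_outport = 198.51.100.7:4242 matches via
		 * nat_outlookup() below and gets the original destination
		 * back in nl_realip/nl_realport = 203.0.113.1:80.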
*/ if ((nat = nat_outlookup(&fi, np->nl_flags, fi.fin_p, np->nl_inip, np->nl_outip))) { if ((np->nl_flags & IPN_FINDFORWARD) != 0) { fr_info_t fin; bzero((char *)&fin, sizeof(fin)); fin.fin_p = nat->nat_p; fin.fin_data[0] = ntohs(nat->nat_outport); fin.fin_data[1] = ntohs(nat->nat_oport); if (nat_inlookup(&fin, np->nl_flags, fin.fin_p, nat->nat_outip, nat->nat_oip) != NULL) { np->nl_flags &= ~IPN_FINDFORWARD; } } np->nl_realip = nat->nat_outip; np->nl_realport = nat->nat_outport; } } return nat; } /* ------------------------------------------------------------------------ */ /* Function: nat_match */ /* Returns: int - 0 == no match, 1 == match */ /* Parameters: fin(I) - pointer to packet information */ /* np(I) - pointer to NAT rule */ /* */ /* Pull the matching of a packet against a NAT rule out of that complex */ /* loop inside fr_checknatin() and lay it out properly in its own function. */ /* ------------------------------------------------------------------------ */ static int nat_match(fin, np) fr_info_t *fin; ipnat_t *np; { frtuc_t *ft; if (fin->fin_v != 4) return 0; if (np->in_p && fin->fin_p != np->in_p) return 0; if (fin->fin_out) { if (!(np->in_redir & (NAT_MAP|NAT_MAPBLK))) return 0; if (((fin->fin_fi.fi_saddr & np->in_inmsk) != np->in_inip) ^ ((np->in_flags & IPN_NOTSRC) != 0)) return 0; if (((fin->fin_fi.fi_daddr & np->in_srcmsk) != np->in_srcip) ^ ((np->in_flags & IPN_NOTDST) != 0)) return 0; } else { if (!(np->in_redir & NAT_REDIRECT)) return 0; if (((fin->fin_fi.fi_saddr & np->in_srcmsk) != np->in_srcip) ^ ((np->in_flags & IPN_NOTSRC) != 0)) return 0; if (((fin->fin_fi.fi_daddr & np->in_outmsk) != np->in_outip) ^ ((np->in_flags & IPN_NOTDST) != 0)) return 0; } ft = &np->in_tuc; if (!(fin->fin_flx & FI_TCPUDP) || (fin->fin_flx & (FI_SHORT|FI_FRAGBODY))) { if (ft->ftu_scmp || ft->ftu_dcmp) return 0; return 1; } return fr_tcpudpchk(fin, ft); } /* ------------------------------------------------------------------------ */ /* Function: nat_update */ /* Returns: Nil */ /* Parameters: nat(I) - pointer to NAT structure */ /* np(I) - pointer to NAT rule */ /* */ /* Updates the lifetime of a NAT table entry for non-TCP packets. Must be */ /* called with fin_rev updated - i.e. after calling nat_proto(). */ /* ------------------------------------------------------------------------ */ void nat_update(fin, nat, np) fr_info_t *fin; nat_t *nat; ipnat_t *np; { ipftq_t *ifq, *ifq2; ipftqent_t *tqe; MUTEX_ENTER(&nat->nat_lock); tqe = &nat->nat_tqe; ifq = tqe->tqe_ifq; /* * We allow over-riding of NAT timeouts from NAT rules, even for * TCP, however, if it is TCP and there is no rule timeout set, * then do not update the timeout here. */ if (np != NULL) ifq2 = np->in_tqehead[fin->fin_rev]; else ifq2 = NULL; if (nat->nat_p == IPPROTO_TCP && ifq2 == NULL) { u_32_t end, ack; u_char tcpflags; tcphdr_t *tcp; int dsize; tcp = fin->fin_dp; tcpflags = tcp->th_flags; dsize = fin->fin_dlen - (TCP_OFF(tcp) << 2) + ((tcpflags & TH_SYN) ? 1 : 0) + ((tcpflags & TH_FIN) ? 
1 : 0); ack = ntohl(tcp->th_ack); end = ntohl(tcp->th_seq) + dsize; if (SEQ_GT(ack, nat->nat_seqnext[1 - fin->fin_rev])) nat->nat_seqnext[1 - fin->fin_rev] = ack; if (nat->nat_seqnext[fin->fin_rev] == 0) nat->nat_seqnext[fin->fin_rev] = end; (void) fr_tcp_age(&nat->nat_tqe, fin, nat_tqb, 0); } else { if (ifq2 == NULL) { if (nat->nat_p == IPPROTO_UDP) ifq2 = &nat_udptq; else if (nat->nat_p == IPPROTO_ICMP) ifq2 = &nat_icmptq; else ifq2 = &nat_iptq; } fr_movequeue(tqe, ifq, ifq2); } MUTEX_EXIT(&nat->nat_lock); } /* ------------------------------------------------------------------------ */ /* Function: fr_checknatout */ /* Returns: int - -1 == packet failed NAT checks so block it, */ /* 0 == no packet translation occurred, */ /* 1 == packet was successfully translated. */ /* Parameters: fin(I) - pointer to packet information */ /* passp(I) - pointer to filtering result flags */ /* */ /* Check to see if an outcoming packet should be changed. ICMP packets are */ /* first checked to see if they match an existing entry (if an error), */ /* otherwise a search of the current NAT table is made. If neither results */ /* in a match then a search for a matching NAT rule is made. Create a new */ /* NAT entry if a we matched a NAT rule. Lastly, actually change the */ /* packet header(s) as required. */ /* ------------------------------------------------------------------------ */ int fr_checknatout(fin, passp) fr_info_t *fin; u_32_t *passp; { struct ifnet *ifp, *sifp; icmphdr_t *icmp = NULL; tcphdr_t *tcp = NULL; int rval, natfailed; ipnat_t *np = NULL; u_int nflags = 0; u_32_t ipa, iph; int natadd = 1; frentry_t *fr; nat_t *nat; if (nat_stats.ns_rules == 0 || fr_nat_lock != 0) return 0; natfailed = 0; fr = fin->fin_fr; sifp = fin->fin_ifp; if (fr != NULL) { ifp = fr->fr_tifs[fin->fin_rev].fd_ifp; if ((ifp != NULL) && (ifp != (void *)-1)) fin->fin_ifp = ifp; } ifp = fin->fin_ifp; if (!(fin->fin_flx & FI_SHORT) && (fin->fin_off == 0)) { switch (fin->fin_p) { case IPPROTO_TCP : nflags = IPN_TCP; break; case IPPROTO_UDP : nflags = IPN_UDP; break; case IPPROTO_ICMP : icmp = fin->fin_dp; /* * This is an incoming packet, so the destination is * the icmp_id and the source port equals 0 */ if (nat_icmpquerytype4(icmp->icmp_type)) nflags = IPN_ICMPQUERY; break; default : break; } if ((nflags & IPN_TCPUDP)) tcp = fin->fin_dp; } ipa = fin->fin_saddr; READ_ENTER(&ipf_nat); if (((fin->fin_flx & FI_ICMPERR) != 0) && (nat = nat_icmperror(fin, &nflags, NAT_OUTBOUND))) /*EMPTY*/; else if ((fin->fin_flx & FI_FRAG) && (nat = fr_nat_knownfrag(fin))) natadd = 0; else if ((nat = nat_outlookup(fin, nflags|NAT_SEARCH, (u_int)fin->fin_p, fin->fin_src, fin->fin_dst))) { nflags = nat->nat_flags; } else { u_32_t hv, msk, nmsk; /* * If there is no current entry in the nat table for this IP#, * create one for it (if there is a matching rule). 
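		 *
		 * The rule search below hashes the packet's source
		 * address under a netmask: nat_masks records which mask
		 * widths map rules actually use, and if nothing matches
		 * under the current mask, maskloop widens it to the next
		 * width in use and retries, so more specific rules are
		 * considered before broader ones.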
*/ RWLOCK_EXIT(&ipf_nat); msk = 0xffffffff; nmsk = nat_masks; WRITE_ENTER(&ipf_nat); maskloop: iph = ipa & htonl(msk); hv = NAT_HASH_FN(iph, 0, ipf_natrules_sz); for (np = nat_rules[hv]; np; np = np->in_mnext) { if ((np->in_ifps[1] && (np->in_ifps[1] != ifp))) continue; if (np->in_v != fin->fin_v) continue; if (np->in_p && (np->in_p != fin->fin_p)) continue; if ((np->in_flags & IPN_RF) && !(np->in_flags & nflags)) continue; if (np->in_flags & IPN_FILTER) { if (!nat_match(fin, np)) continue; } else if ((ipa & np->in_inmsk) != np->in_inip) continue; if ((fr != NULL) && !fr_matchtag(&np->in_tag, &fr->fr_nattag)) continue; if (*np->in_plabel != '\0') { if (((np->in_flags & IPN_FILTER) == 0) && (np->in_dport != tcp->th_dport)) continue; if (appr_ok(fin, tcp, np) == 0) continue; } if ((nat = nat_new(fin, np, NULL, nflags, NAT_OUTBOUND))) { np->in_hits++; break; } else natfailed = -1; } if ((np == NULL) && (nmsk != 0)) { while (nmsk) { msk <<= 1; if (nmsk & 0x80000000) break; nmsk <<= 1; } if (nmsk != 0) { nmsk <<= 1; goto maskloop; } } MUTEX_DOWNGRADE(&ipf_nat); } if (nat != NULL) { rval = fr_natout(fin, nat, natadd, nflags); if (rval == 1) { MUTEX_ENTER(&nat->nat_lock); nat->nat_ref++; MUTEX_EXIT(&nat->nat_lock); nat->nat_touched = fr_ticks; fin->fin_nat = nat; } } else rval = natfailed; RWLOCK_EXIT(&ipf_nat); if (rval == -1) { if (passp != NULL) *passp = FR_BLOCK; fin->fin_flx |= FI_BADNAT; } fin->fin_ifp = sifp; return rval; } /* ------------------------------------------------------------------------ */ /* Function: fr_natout */ /* Returns: int - -1 == packet failed NAT checks so block it, */ /* 1 == packet was successfully translated. */ /* Parameters: fin(I) - pointer to packet information */ /* nat(I) - pointer to NAT structure */ /* natadd(I) - flag indicating if it is safe to add frag cache */ /* nflags(I) - NAT flags set for this packet */ /* */ /* Translate a packet coming "out" on an interface. */ /* ------------------------------------------------------------------------ */ int fr_natout(fin, nat, natadd, nflags) fr_info_t *fin; nat_t *nat; int natadd; u_32_t nflags; { icmphdr_t *icmp; u_short *csump; tcphdr_t *tcp; ipnat_t *np; int i; tcp = NULL; icmp = NULL; csump = NULL; np = nat->nat_ptr; if ((natadd != 0) && (fin->fin_flx & FI_FRAG) && (np != NULL)) (void) fr_nat_newfrag(fin, 0, nat); MUTEX_ENTER(&nat->nat_lock); nat->nat_bytes[1] += fin->fin_plen; nat->nat_pkts[1]++; MUTEX_EXIT(&nat->nat_lock); /* * Fix up checksums, not by recalculating them, but * simply computing adjustments. * This is only done for STREAMS based IP implementations where the * checksum has already been calculated by IP. In all other cases, * IPFilter is called before the checksum needs calculating so there * is no call to modify whatever is in the header now. */ if (fin->fin_v == 4) { if (nflags == IPN_ICMPERR) { u_32_t s1, s2, sumd; s1 = LONG_SUM(ntohl(fin->fin_saddr)); s2 = LONG_SUM(ntohl(nat->nat_outip.s_addr)); CALC_SUMD(s1, s2, sumd); fix_outcksum(fin, &fin->fin_ip->ip_sum, sumd); } #if !defined(_KERNEL) || defined(MENTAT) || defined(__sgi) || \ defined(linux) || defined(BRIDGE_IPF) else { /* * Strictly speaking, this isn't necessary on BSD * kernels because they do checksum calculation after * this code has run BUT if ipfilter is being used * to do NAT as a bridge, that code doesn't exist. 
*/ if (nat->nat_dir == NAT_OUTBOUND) fix_outcksum(fin, &fin->fin_ip->ip_sum, nat->nat_ipsumd); else fix_incksum(fin, &fin->fin_ip->ip_sum, nat->nat_ipsumd); } #endif } if (!(fin->fin_flx & FI_SHORT) && (fin->fin_off == 0)) { if ((nat->nat_outport != 0) && (nflags & IPN_TCPUDP)) { tcp = fin->fin_dp; tcp->th_sport = nat->nat_outport; fin->fin_data[0] = ntohs(nat->nat_outport); } if ((nat->nat_outport != 0) && (nflags & IPN_ICMPQUERY)) { icmp = fin->fin_dp; icmp->icmp_id = nat->nat_outport; } csump = nat_proto(fin, nat, nflags); } fin->fin_ip->ip_src = nat->nat_outip; nat_update(fin, nat, np); /* * The above comments do not hold for layer 4 (or higher) checksums... */ if (csump != NULL) { if (nat->nat_dir == NAT_OUTBOUND) fix_outcksum(fin, csump, nat->nat_sumd[1]); else fix_incksum(fin, csump, nat->nat_sumd[1]); } #ifdef IPFILTER_SYNC ipfsync_update(SMC_NAT, fin, nat->nat_sync); #endif /* ------------------------------------------------------------- */ /* A few quick notes: */ /* Following are test conditions prior to calling the */ /* appr_check routine. */ /* */ /* A NULL tcp indicates a non TCP/UDP packet. When dealing */ /* with a redirect rule, we attempt to match the packet's */ /* source port against in_dport, otherwise we'd compare the */ /* packet's destination. */ /* ------------------------------------------------------------- */ if ((np != NULL) && (np->in_apr != NULL)) { i = appr_check(fin, nat); if (i == 0) i = 1; } else i = 1; ATOMIC_INCL(nat_stats.ns_mapped[1]); fin->fin_flx |= FI_NATED; return i; } /* ------------------------------------------------------------------------ */ /* Function: fr_checknatin */ /* Returns: int - -1 == packet failed NAT checks so block it, */ /* 0 == no packet translation occurred, */ /* 1 == packet was successfully translated. */ /* Parameters: fin(I) - pointer to packet information */ /* passp(I) - pointer to filtering result flags */ /* */ /* Check to see if an incoming packet should be changed. ICMP packets are */ /* first checked to see if they match an existing entry (if an error), */ /* otherwise a search of the current NAT table is made. If neither results */ /* in a match then a search for a matching NAT rule is made. Create a new */ /* NAT entry if a we matched a NAT rule. Lastly, actually change the */ /* packet header(s) as required. 
*/ /* ------------------------------------------------------------------------ */ int fr_checknatin(fin, passp) fr_info_t *fin; u_32_t *passp; { u_int nflags, natadd; int rval, natfailed; struct ifnet *ifp; struct in_addr in; icmphdr_t *icmp; tcphdr_t *tcp; u_short dport; ipnat_t *np; nat_t *nat; u_32_t iph; if (nat_stats.ns_rules == 0 || fr_nat_lock != 0) return 0; tcp = NULL; icmp = NULL; dport = 0; natadd = 1; nflags = 0; natfailed = 0; ifp = fin->fin_ifp; if (!(fin->fin_flx & FI_SHORT) && (fin->fin_off == 0)) { switch (fin->fin_p) { case IPPROTO_TCP : nflags = IPN_TCP; break; case IPPROTO_UDP : nflags = IPN_UDP; break; case IPPROTO_ICMP : icmp = fin->fin_dp; /* * This is an incoming packet, so the destination is * the icmp_id and the source port equals 0 */ if (nat_icmpquerytype4(icmp->icmp_type)) { nflags = IPN_ICMPQUERY; dport = icmp->icmp_id; } break; default : break; } if ((nflags & IPN_TCPUDP)) { tcp = fin->fin_dp; dport = tcp->th_dport; } } in = fin->fin_dst; READ_ENTER(&ipf_nat); if (((fin->fin_flx & FI_ICMPERR) != 0) && (nat = nat_icmperror(fin, &nflags, NAT_INBOUND))) /*EMPTY*/; else if ((fin->fin_flx & FI_FRAG) && (nat = fr_nat_knownfrag(fin))) natadd = 0; else if ((nat = nat_inlookup(fin, nflags|NAT_SEARCH, (u_int)fin->fin_p, fin->fin_src, in))) { nflags = nat->nat_flags; } else { u_32_t hv, msk, rmsk; RWLOCK_EXIT(&ipf_nat); rmsk = rdr_masks; msk = 0xffffffff; WRITE_ENTER(&ipf_nat); /* * If there is no current entry in the nat table for this IP#, * create one for it (if there is a matching rule). */ maskloop: iph = in.s_addr & htonl(msk); hv = NAT_HASH_FN(iph, 0, ipf_rdrrules_sz); for (np = rdr_rules[hv]; np; np = np->in_rnext) { if (np->in_ifps[0] && (np->in_ifps[0] != ifp)) continue; if (np->in_v != fin->fin_v) continue; if (np->in_p && (np->in_p != fin->fin_p)) continue; if ((np->in_flags & IPN_RF) && !(np->in_flags & nflags)) continue; if (np->in_flags & IPN_FILTER) { if (!nat_match(fin, np)) continue; } else { if ((in.s_addr & np->in_outmsk) != np->in_outip) continue; if (np->in_pmin && ((ntohs(np->in_pmax) < ntohs(dport)) || (ntohs(dport) < ntohs(np->in_pmin)))) continue; } if (*np->in_plabel != '\0') { if (!appr_ok(fin, tcp, np)) { continue; } } nat = nat_new(fin, np, NULL, nflags, NAT_INBOUND); if (nat != NULL) { np->in_hits++; break; } else natfailed = -1; } if ((np == NULL) && (rmsk != 0)) { while (rmsk) { msk <<= 1; if (rmsk & 0x80000000) break; rmsk <<= 1; } if (rmsk != 0) { rmsk <<= 1; goto maskloop; } } MUTEX_DOWNGRADE(&ipf_nat); } if (nat != NULL) { rval = fr_natin(fin, nat, natadd, nflags); if (rval == 1) { MUTEX_ENTER(&nat->nat_lock); nat->nat_ref++; MUTEX_EXIT(&nat->nat_lock); nat->nat_touched = fr_ticks; fin->fin_nat = nat; } } else rval = natfailed; RWLOCK_EXIT(&ipf_nat); if (rval == -1) { if (passp != NULL) *passp = FR_BLOCK; fin->fin_flx |= FI_BADNAT; } return rval; } /* ------------------------------------------------------------------------ */ /* Function: fr_natin */ /* Returns: int - -1 == packet failed NAT checks so block it, */ /* 1 == packet was successfully translated. */ /* Parameters: fin(I) - pointer to packet information */ /* nat(I) - pointer to NAT structure */ /* natadd(I) - flag indicating if it is safe to add frag cache */ /* nflags(I) - NAT flags set for this packet */ /* Locks Held: ipf_nat (READ) */ /* */ /* Translate a packet coming "in" on an interface. 
*/ /* ------------------------------------------------------------------------ */ int fr_natin(fin, nat, natadd, nflags) fr_info_t *fin; nat_t *nat; int natadd; u_32_t nflags; { icmphdr_t *icmp; u_short *csump; tcphdr_t *tcp; ipnat_t *np; int i; tcp = NULL; csump = NULL; np = nat->nat_ptr; fin->fin_fr = nat->nat_fr; if (np != NULL) { if ((natadd != 0) && (fin->fin_flx & FI_FRAG)) (void) fr_nat_newfrag(fin, 0, nat); /* ------------------------------------------------------------- */ /* A few quick notes: */ /* Following are test conditions prior to calling the */ /* appr_check routine. */ /* */ /* A NULL tcp indicates a non TCP/UDP packet. When dealing */ /* with a map rule, we attempt to match the packet's */ /* source port against in_dport, otherwise we'd compare the */ /* packet's destination. */ /* ------------------------------------------------------------- */ if (np->in_apr != NULL) { i = appr_check(fin, nat); if (i == -1) { return -1; } } } #ifdef IPFILTER_SYNC ipfsync_update(SMC_NAT, fin, nat->nat_sync); #endif MUTEX_ENTER(&nat->nat_lock); nat->nat_bytes[0] += fin->fin_plen; nat->nat_pkts[0]++; MUTEX_EXIT(&nat->nat_lock); fin->fin_ip->ip_dst = nat->nat_inip; fin->fin_fi.fi_daddr = nat->nat_inip.s_addr; if (nflags & IPN_TCPUDP) tcp = fin->fin_dp; /* * Fix up checksums, not by recalculating them, but * simply computing adjustments. * Why only do this for some platforms on inbound packets ? * Because for those that it is done, IP processing is yet to happen * and so the IPv4 header checksum has not yet been evaluated. * Perhaps it should always be done for the benefit of things like * fast forwarding (so that it doesn't need to be recomputed) but with * header checksum offloading, perhaps it is a moot point. */ #if !defined(_KERNEL) || defined(MENTAT) || defined(__sgi) || \ defined(__osf__) || defined(linux) if (nat->nat_dir == NAT_OUTBOUND) fix_incksum(fin, &fin->fin_ip->ip_sum, nat->nat_ipsumd); else fix_outcksum(fin, &fin->fin_ip->ip_sum, nat->nat_ipsumd); #endif if (!(fin->fin_flx & FI_SHORT) && (fin->fin_off == 0)) { if ((nat->nat_inport != 0) && (nflags & IPN_TCPUDP)) { tcp->th_dport = nat->nat_inport; fin->fin_data[1] = ntohs(nat->nat_inport); } if ((nat->nat_inport != 0) && (nflags & IPN_ICMPQUERY)) { icmp = fin->fin_dp; icmp->icmp_id = nat->nat_inport; } csump = nat_proto(fin, nat, nflags); } nat_update(fin, nat, np); /* * The above comments do not hold for layer 4 (or higher) checksums... */ if (csump != NULL) { if (nat->nat_dir == NAT_OUTBOUND) fix_incksum(fin, csump, nat->nat_sumd[0]); else fix_outcksum(fin, csump, nat->nat_sumd[0]); } ATOMIC_INCL(nat_stats.ns_mapped[0]); fin->fin_flx |= FI_NATED; if (np != NULL && np->in_tag.ipt_num[0] != 0) fin->fin_nattag = &np->in_tag; return 1; } /* ------------------------------------------------------------------------ */ /* Function: nat_proto */ /* Returns: u_short* - pointer to transport header checksum to update, */ /* NULL if the transport protocol is not recognised */ /* as needing a checksum update. */ /* Parameters: fin(I) - pointer to packet information */ /* nat(I) - pointer to NAT structure */ /* nflags(I) - NAT flags set for this packet */ /* */ /* Return the pointer to the checksum field for each protocol so understood.*/ /* If support for making other changes to a protocol header is required, */ /* that is not strictly 'address' translation, such as clamping the MSS in */ /* TCP down to a specific value, then do it from here. 
*/ /* ------------------------------------------------------------------------ */ u_short *nat_proto(fin, nat, nflags) fr_info_t *fin; nat_t *nat; u_int nflags; { icmphdr_t *icmp; u_short *csump; tcphdr_t *tcp; udphdr_t *udp; csump = NULL; if (fin->fin_out == 0) { fin->fin_rev = (nat->nat_dir == NAT_OUTBOUND); } else { fin->fin_rev = (nat->nat_dir == NAT_INBOUND); } switch (fin->fin_p) { case IPPROTO_TCP : tcp = fin->fin_dp; csump = &tcp->th_sum; /* * Do a MSS CLAMPING on a SYN packet, * only deal IPv4 for now. */ if ((nat->nat_mssclamp != 0) && (tcp->th_flags & TH_SYN) != 0) nat_mssclamp(tcp, nat->nat_mssclamp, fin, csump); break; case IPPROTO_UDP : udp = fin->fin_dp; if (udp->uh_sum) csump = &udp->uh_sum; break; case IPPROTO_ICMP : icmp = fin->fin_dp; if ((nflags & IPN_ICMPQUERY) != 0) { if (icmp->icmp_cksum != 0) csump = &icmp->icmp_cksum; } break; } return csump; } /* ------------------------------------------------------------------------ */ /* Function: fr_natunload */ /* Returns: Nil */ /* Parameters: Nil */ /* */ /* Free all memory used by NAT structures allocated at runtime. */ /* ------------------------------------------------------------------------ */ void fr_natunload() { ipftq_t *ifq, *ifqnext; (void) nat_clearlist(); (void) nat_flushtable(); /* * Proxy timeout queues are not cleaned here because although they * exist on the NAT list, appr_unload is called after fr_natunload * and the proxies actually are responsible for them being created. * Should the proxy timeouts have their own list? There's no real * justification as this is the only complication. */ for (ifq = nat_utqe; ifq != NULL; ifq = ifqnext) { ifqnext = ifq->ifq_next; if (((ifq->ifq_flags & IFQF_PROXY) == 0) && (fr_deletetimeoutqueue(ifq) == 0)) fr_freetimeoutqueue(ifq); } if (nat_table[0] != NULL) { KFREES(nat_table[0], sizeof(nat_t *) * ipf_nattable_sz); nat_table[0] = NULL; } if (nat_table[1] != NULL) { KFREES(nat_table[1], sizeof(nat_t *) * ipf_nattable_sz); nat_table[1] = NULL; } if (nat_rules != NULL) { KFREES(nat_rules, sizeof(ipnat_t *) * ipf_natrules_sz); nat_rules = NULL; } if (rdr_rules != NULL) { KFREES(rdr_rules, sizeof(ipnat_t *) * ipf_rdrrules_sz); rdr_rules = NULL; } if (ipf_hm_maptable != NULL) { KFREES(ipf_hm_maptable, sizeof(hostmap_t *) * ipf_hostmap_sz); ipf_hm_maptable = NULL; } if (nat_stats.ns_bucketlen[0] != NULL) { KFREES(nat_stats.ns_bucketlen[0], sizeof(u_long *) * ipf_nattable_sz); nat_stats.ns_bucketlen[0] = NULL; } if (nat_stats.ns_bucketlen[1] != NULL) { KFREES(nat_stats.ns_bucketlen[1], sizeof(u_long *) * ipf_nattable_sz); nat_stats.ns_bucketlen[1] = NULL; } if (fr_nat_maxbucket_reset == 1) fr_nat_maxbucket = 0; if (fr_nat_init == 1) { fr_nat_init = 0; fr_sttab_destroy(nat_tqb); RW_DESTROY(&ipf_natfrag); RW_DESTROY(&ipf_nat); MUTEX_DESTROY(&ipf_nat_new); MUTEX_DESTROY(&ipf_natio); MUTEX_DESTROY(&nat_udptq.ifq_lock); MUTEX_DESTROY(&nat_icmptq.ifq_lock); MUTEX_DESTROY(&nat_iptq.ifq_lock); } } /* ------------------------------------------------------------------------ */ /* Function: fr_natexpire */ /* Returns: Nil */ /* Parameters: Nil */ /* */ /* Check all of the timeout queues for entries at the top which need to be */ /* expired. 
*/ /* ------------------------------------------------------------------------ */ void fr_natexpire() { ipftq_t *ifq, *ifqnext; ipftqent_t *tqe, *tqn; int i; SPL_INT(s); SPL_NET(s); WRITE_ENTER(&ipf_nat); for (ifq = nat_tqb, i = 0; ifq != NULL; ifq = ifq->ifq_next) { for (tqn = ifq->ifq_head; ((tqe = tqn) != NULL); i++) { if (tqe->tqe_die > fr_ticks) break; tqn = tqe->tqe_next; nat_delete(tqe->tqe_parent, NL_EXPIRE); } } for (ifq = nat_utqe; ifq != NULL; ifq = ifqnext) { ifqnext = ifq->ifq_next; for (tqn = ifq->ifq_head; ((tqe = tqn) != NULL); i++) { if (tqe->tqe_die > fr_ticks) break; tqn = tqe->tqe_next; nat_delete(tqe->tqe_parent, NL_EXPIRE); } } for (ifq = nat_utqe; ifq != NULL; ifq = ifqnext) { ifqnext = ifq->ifq_next; if (((ifq->ifq_flags & IFQF_DELETE) != 0) && (ifq->ifq_ref == 0)) { fr_freetimeoutqueue(ifq); } } if (fr_nat_doflush != 0) { nat_extraflush(2); fr_nat_doflush = 0; } RWLOCK_EXIT(&ipf_nat); SPL_X(s); } /* ------------------------------------------------------------------------ */ /* Function: fr_natsync */ /* Returns: Nil */ /* Parameters: ifp(I) - pointer to network interface */ /* */ /* Walk through all of the currently active NAT sessions, looking for those */ /* which need to have their translated address updated. */ /* ------------------------------------------------------------------------ */ void fr_natsync(ifp) void *ifp; { u_32_t sum1, sum2, sumd; struct in_addr in; ipnat_t *n; nat_t *nat; void *ifp2; SPL_INT(s); if (fr_running <= 0) return; /* * Change IP addresses for NAT sessions for any protocol except TCP * since it will break the TCP connection anyway. The only rules * which will get changed are those which are "map ... -> 0/32", * where the rule specifies the address is taken from the interface. */ SPL_NET(s); WRITE_ENTER(&ipf_nat); if (fr_running <= 0) { RWLOCK_EXIT(&ipf_nat); return; } for (nat = nat_instances; nat; nat = nat->nat_next) { if ((nat->nat_flags & IPN_TCP) != 0) continue; n = nat->nat_ptr; if ((n == NULL) || (n->in_outip != 0) || (n->in_outmsk != 0xffffffff)) continue; if (((ifp == NULL) || (ifp == nat->nat_ifps[0]) || (ifp == nat->nat_ifps[1]))) { nat->nat_ifps[0] = GETIFP(nat->nat_ifnames[0], 4); if (nat->nat_ifnames[1][0] != '\0') { nat->nat_ifps[1] = GETIFP(nat->nat_ifnames[1], 4); } else nat->nat_ifps[1] = nat->nat_ifps[0]; ifp2 = nat->nat_ifps[0]; if (ifp2 == NULL) continue; /* * Change the map-to address to be the same as the * new one. */ sum1 = nat->nat_outip.s_addr; if (fr_ifpaddr(4, FRI_NORMAL, ifp2, &in, NULL) != -1) nat->nat_outip = in; sum2 = nat->nat_outip.s_addr; if (sum1 == sum2) continue; /* * Readjust the checksum adjustment to take into * account the new IP#. */ CALC_SUMD(sum1, sum2, sumd); /* XXX - dont change for TCP when solaris does * hardware checksumming. */ sumd += nat->nat_sumd[0]; nat->nat_sumd[0] = (sumd & 0xffff) + (sumd >> 16); nat->nat_sumd[1] = nat->nat_sumd[0]; } } for (n = nat_list; (n != NULL); n = n->in_next) { if ((ifp == NULL) || (n->in_ifps[0] == ifp)) n->in_ifps[0] = fr_resolvenic(n->in_ifnames[0], 4); if ((ifp == NULL) || (n->in_ifps[1] == ifp)) n->in_ifps[1] = fr_resolvenic(n->in_ifnames[1], 4); } RWLOCK_EXIT(&ipf_nat); SPL_X(s); } /* ------------------------------------------------------------------------ */ /* Function: nat_icmpquerytype4 */ /* Returns: int - 1 == success, 0 == failure */ /* Parameters: icmptype(I) - ICMP type number */ /* */ /* Tests to see if the ICMP type number passed is a query/response type or */ /* not. 
*/ /* ------------------------------------------------------------------------ */ static int nat_icmpquerytype4(icmptype) int icmptype; { /* * For the ICMP query NAT code, it is essential that both the query * and the reply match on the NAT rule. Because the NAT structure * does not keep track of the icmptype, and a single NAT structure * is used for all icmp types with the same src, dest and id, we * simply define the replies as queries as well. The funny thing is, * although it seems silly to call a reply a query, this is exactly * as it is defined in the IPv4 specification. */ switch (icmptype) { case ICMP_ECHOREPLY: case ICMP_ECHO: /* router advertisement/solicitation is currently unsupported: */ /* it would require rewriting the ICMP data section */ case ICMP_TSTAMP: case ICMP_TSTAMPREPLY: case ICMP_IREQ: case ICMP_IREQREPLY: case ICMP_MASKREQ: case ICMP_MASKREPLY: return 1; default: return 0; } } /* ------------------------------------------------------------------------ */ /* Function: nat_log */ /* Returns: Nil */ /* Parameters: nat(I) - pointer to NAT structure */ /* type(I) - type of log entry to create */ /* */ /* Creates a NAT log entry. */ /* ------------------------------------------------------------------------ */ void nat_log(nat, type) struct nat *nat; u_int type; { #ifdef IPFILTER_LOG # ifndef LARGE_NAT struct ipnat *np; int rulen; # endif struct natlog natl; void *items[1]; size_t sizes[1]; int types[1]; natl.nl_inip = nat->nat_inip; natl.nl_outip = nat->nat_outip; natl.nl_origip = nat->nat_oip; natl.nl_bytes[0] = nat->nat_bytes[0]; natl.nl_bytes[1] = nat->nat_bytes[1]; natl.nl_pkts[0] = nat->nat_pkts[0]; natl.nl_pkts[1] = nat->nat_pkts[1]; natl.nl_origport = nat->nat_oport; natl.nl_inport = nat->nat_inport; natl.nl_outport = nat->nat_outport; natl.nl_p = nat->nat_p; natl.nl_type = type; natl.nl_rule = -1; # ifndef LARGE_NAT if (nat->nat_ptr != NULL) { for (rulen = 0, np = nat_list; np; np = np->in_next, rulen++) if (np == nat->nat_ptr) { natl.nl_rule = rulen; break; } } # endif items[0] = &natl; sizes[0] = sizeof(natl); types[0] = 0; (void) ipllog(IPL_LOGNAT, NULL, items, sizes, types, 1); #endif } #if defined(__OpenBSD__) /* ------------------------------------------------------------------------ */ /* Function: nat_ifdetach */ /* Returns: Nil */ /* Parameters: ifp(I) - pointer to network interface */ /* */ /* Compatibility interface for OpenBSD to trigger the correct updating of */ /* interface references within IPFilter.
*/ /* ------------------------------------------------------------------------ */ void nat_ifdetach(ifp) void *ifp; { frsync(ifp); return; } #endif /* ------------------------------------------------------------------------ */ /* Function: fr_ipnatderef */ /* Returns: Nil */ /* Parameters: isp(I) - pointer to pointer to NAT rule */ /* Write Locks: ipf_nat */ /* */ /* ------------------------------------------------------------------------ */ void fr_ipnatderef(inp) ipnat_t **inp; { ipnat_t *in; in = *inp; *inp = NULL; in->in_space++; in->in_use--; if (in->in_use == 0 && (in->in_flags & IPN_DELETE)) { if (in->in_apr) appr_free(in->in_apr); MUTEX_DESTROY(&in->in_lock); KFREE(in); nat_stats.ns_rules--; #if SOLARIS && !defined(_INET_IP_STACK_H) if (nat_stats.ns_rules == 0) pfil_delayed_copy = 1; #endif } } /* ------------------------------------------------------------------------ */ /* Function: fr_natderef */ /* Returns: Nil */ /* Parameters: isp(I) - pointer to pointer to NAT table entry */ /* */ /* Decrement the reference counter for this NAT table entry and free it if */ /* there are no more things using it. */ /* */ /* IF nat_ref == 1 when this function is called, then we have an orphan nat */ /* structure *because* it only gets called on paths _after_ nat_ref has been*/ /* incremented. If nat_ref == 1 then we shouldn't decrement it here */ /* because nat_delete() will do that and send nat_ref to -1. */ /* */ /* Holding the lock on nat_lock is required to serialise nat_delete() being */ /* called from a NAT flush ioctl with a deref happening because of a packet.*/ /* ------------------------------------------------------------------------ */ void fr_natderef(natp) nat_t **natp; { nat_t *nat; nat = *natp; *natp = NULL; MUTEX_ENTER(&nat->nat_lock); if (nat->nat_ref > 1) { nat->nat_ref--; MUTEX_EXIT(&nat->nat_lock); return; } MUTEX_EXIT(&nat->nat_lock); WRITE_ENTER(&ipf_nat); nat_delete(nat, NL_EXPIRE); RWLOCK_EXIT(&ipf_nat); } /* ------------------------------------------------------------------------ */ /* Function: fr_natclone */ /* Returns: ipstate_t* - NULL == cloning failed, */ /* else pointer to new state structure */ /* Parameters: fin(I) - pointer to packet information */ /* is(I) - pointer to master state structure */ /* Write Lock: ipf_nat */ /* */ /* Create a "duplcate" state table entry from the master. */ /* ------------------------------------------------------------------------ */ static nat_t *fr_natclone(fin, nat) fr_info_t *fin; nat_t *nat; { frentry_t *fr; nat_t *clone; ipnat_t *np; KMALLOC(clone, nat_t *); if (clone == NULL) return NULL; bcopy((char *)nat, (char *)clone, sizeof(*clone)); MUTEX_NUKE(&clone->nat_lock); clone->nat_aps = NULL; /* * Initialize all these so that nat_delete() doesn't cause a crash. */ clone->nat_tqe.tqe_pnext = NULL; clone->nat_tqe.tqe_next = NULL; clone->nat_tqe.tqe_ifq = NULL; clone->nat_tqe.tqe_parent = clone; clone->nat_flags &= ~SI_CLONE; clone->nat_flags |= SI_CLONED; if (clone->nat_hm) clone->nat_hm->hm_ref++; if (nat_insert(clone, fin->fin_rev) == -1) { KFREE(clone); return NULL; } np = clone->nat_ptr; if (np != NULL) { if (nat_logging) nat_log(clone, (u_int)np->in_redir); np->in_use++; } fr = clone->nat_fr; if (fr != NULL) { MUTEX_ENTER(&fr->fr_lock); fr->fr_ref++; MUTEX_EXIT(&fr->fr_lock); } /* * Because the clone is created outside the normal loop of things and * TCP has special needs in terms of state, initialise the timeout * state of the new NAT from here. 
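/* ------------------------------------------------------------------------ */
/* Illustrative sketch (not part of the committed source): a user-space     */
/* analogue of the fr_natderef() pattern above - decrement under the lock   */
/* while other references remain, and only tear the object down when the    */
/* last reference goes.  The obj type and obj_deref() are invented for the  */
/* example and are not ipfilter structures.                                 */
/* ------------------------------------------------------------------------ */
#include <pthread.h>
#include <stdlib.h>

struct obj {
	pthread_mutex_t	lock;
	int		ref;
};

static void
obj_deref(struct obj **op)
{
	struct obj *o = *op;

	*op = NULL;
	pthread_mutex_lock(&o->lock);
	if (o->ref > 1) {
		o->ref--;			/* others still hold it */
		pthread_mutex_unlock(&o->lock);
		return;
	}
	pthread_mutex_unlock(&o->lock);
	pthread_mutex_destroy(&o->lock);	/* last holder: destroy  */
	free(o);
}

int
main(void)
{
	struct obj *o, *ref;

	if ((o = malloc(sizeof(*o))) == NULL)
		return 1;
	pthread_mutex_init(&o->lock, NULL);
	o->ref = 2;		/* two holders: o and ref */
	ref = o;
	obj_deref(&ref);	/* just drops the count   */
	obj_deref(&o);		/* frees the object       */
	return 0;
}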
*/ if (clone->nat_p == IPPROTO_TCP) { (void) fr_tcp_age(&clone->nat_tqe, fin, nat_tqb, clone->nat_flags); } #ifdef IPFILTER_SYNC clone->nat_sync = ipfsync_new(SMC_NAT, fin, clone); #endif if (nat_logging) nat_log(clone, NL_CLONE); return clone; } /* ------------------------------------------------------------------------ */ /* Function: nat_wildok */ /* Returns: int - 1 == packet's ports match wildcards */ /* 0 == packet's ports don't match wildcards */ /* Parameters: nat(I) - NAT entry */ /* sport(I) - source port */ /* dport(I) - destination port */ /* flags(I) - wildcard flags */ /* dir(I) - packet direction */ /* */ /* Use NAT entry and packet direction to determine which combination of */ /* wildcard flags should be used. */ /* ------------------------------------------------------------------------ */ static int nat_wildok(nat, sport, dport, flags, dir) nat_t *nat; int sport; int dport; int flags; int dir; { /* * When called by dir is set to * nat_inlookup NAT_INBOUND (0) * nat_outlookup NAT_OUTBOUND (1) * * We simply combine the packet's direction in dir with the original * "intended" direction of that NAT entry in nat->nat_dir to decide * which combination of wildcard flags to allow. */ switch ((dir << 1) | nat->nat_dir) { case 3: /* outbound packet / outbound entry */ if (((nat->nat_inport == sport) || (flags & SI_W_SPORT)) && ((nat->nat_oport == dport) || (flags & SI_W_DPORT))) return 1; break; case 2: /* outbound packet / inbound entry */ if (((nat->nat_outport == sport) || (flags & SI_W_DPORT)) && ((nat->nat_oport == dport) || (flags & SI_W_SPORT))) return 1; break; case 1: /* inbound packet / outbound entry */ if (((nat->nat_oport == sport) || (flags & SI_W_DPORT)) && ((nat->nat_outport == dport) || (flags & SI_W_SPORT))) return 1; break; case 0: /* inbound packet / inbound entry */ if (((nat->nat_oport == sport) || (flags & SI_W_SPORT)) && ((nat->nat_outport == dport) || (flags & SI_W_DPORT))) return 1; break; default: break; } return(0); } /* ------------------------------------------------------------------------ */ /* Function: nat_mssclamp */ /* Returns: Nil */ /* Parameters: tcp(I) - pointer to TCP header */ /* maxmss(I) - value to clamp the TCP MSS to */ /* fin(I) - pointer to packet information */ /* csump(I) - pointer to TCP checksum */ /* */ /* Check for MSS option and clamp it if necessary. If found and changed, */ /* then the TCP header checksum will be updated to reflect the change in */ /* the MSS. 
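/* ------------------------------------------------------------------------ */
/* Illustrative sketch (not part of the committed source): the switch in    */
/* nat_wildok() above keys on the two-bit value (dir << 1) | nat->nat_dir.  */
/* This stand-alone program simply enumerates that encoding so the four     */
/* case labels are easy to read back.                                       */
/* ------------------------------------------------------------------------ */
#include <stdio.h>

int
main(void)
{
	const char *name[] = { "inbound", "outbound" };	/* 0, 1 */
	int dir, entry;

	for (dir = 0; dir <= 1; dir++)
		for (entry = 0; entry <= 1; entry++)
			printf("case %d: %s packet / %s entry\n",
			    (dir << 1) | entry, name[dir], name[entry]);
	return 0;
}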
*/ /* ------------------------------------------------------------------------ */ static void nat_mssclamp(tcp, maxmss, fin, csump) tcphdr_t *tcp; u_32_t maxmss; fr_info_t *fin; u_short *csump; { u_char *cp, *ep, opt; int hlen, advance; u_32_t mss, sumd; hlen = TCP_OFF(tcp) << 2; if (hlen > sizeof(*tcp)) { cp = (u_char *)tcp + sizeof(*tcp); ep = (u_char *)tcp + hlen; while (cp < ep) { opt = cp[0]; if (opt == TCPOPT_EOL) break; else if (opt == TCPOPT_NOP) { cp++; continue; } if (cp + 1 >= ep) break; advance = cp[1]; if ((cp + advance > ep) || (advance <= 0)) break; switch (opt) { case TCPOPT_MAXSEG: if (advance != 4) break; mss = cp[2] * 256 + cp[3]; if (mss > maxmss) { cp[2] = maxmss / 256; cp[3] = maxmss & 0xff; CALC_SUMD(mss, maxmss, sumd); fix_outcksum(fin, csump, sumd); } break; default: /* ignore unknown options */ break; } cp += advance; } } } /* ------------------------------------------------------------------------ */ /* Function: fr_setnatqueue */ /* Returns: Nil */ /* Parameters: nat(I)- pointer to NAT structure */ /* rev(I) - forward(0) or reverse(1) direction */ /* Locks: ipf_nat (read or write) */ /* */ /* Put the NAT entry on its default queue entry, using rev as a helped in */ /* determining which queue it should be placed on. */ /* ------------------------------------------------------------------------ */ void fr_setnatqueue(nat, rev) nat_t *nat; int rev; { ipftq_t *oifq, *nifq; if (nat->nat_ptr != NULL) nifq = nat->nat_ptr->in_tqehead[rev]; else nifq = NULL; if (nifq == NULL) { switch (nat->nat_p) { case IPPROTO_UDP : nifq = &nat_udptq; break; case IPPROTO_ICMP : nifq = &nat_icmptq; break; case IPPROTO_TCP : nifq = nat_tqb + nat->nat_tqe.tqe_state[rev]; break; default : nifq = &nat_iptq; break; } } oifq = nat->nat_tqe.tqe_ifq; /* * If it's currently on a timeout queue, move it from one queue to * another, else put it on the end of the newly determined queue. */ if (oifq != NULL) fr_movequeue(&nat->nat_tqe, oifq, nifq); else fr_queueappend(&nat->nat_tqe, nifq, nat); return; } /* ------------------------------------------------------------------------ */ /* Function: nat_getnext */ /* Returns: int - 0 == ok, else error */ /* Parameters: t(I) - pointer to ipftoken structure */ /* itp(I) - pointer to ipfgeniter_t structure */ /* */ /* Fetch the next nat/ipnat structure pointer from the linked list and */ /* copy it out to the storage space pointed to by itp_data. The next item */ /* in the list to look at is put back in the ipftoken struture. */ /* If we call ipf_freetoken, the accompanying pointer is set to NULL because*/ /* ipf_freetoken will call a deref function for us and we dont want to call */ /* that twice (second time would be in the second switch statement below. 
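/* ------------------------------------------------------------------------ */
/* Illustrative sketch (not part of the committed source): the MSS option   */
/* rewrite in nat_mssclamp() above is two bytes of arithmetic on the option */
/* payload.  The stand-alone demo below applies that arithmetic to a sample */
/* option (kind 2, length 4, MSS 1460) clamped to 1260.                     */
/* ------------------------------------------------------------------------ */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint8_t opt[4] = { 2, 4, 0x05, 0xb4 };	/* TCPOPT_MAXSEG, len, 1460 */
	unsigned maxmss = 1260;
	unsigned mss = opt[2] * 256 + opt[3];

	if (mss > maxmss) {			/* clamp, as nat_mssclamp() does */
		opt[2] = maxmss / 256;
		opt[3] = maxmss & 0xff;
	}
	printf("MSS %u -> %u\n", mss, opt[2] * 256u + opt[3]);
	return 0;
}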
*/ /* ------------------------------------------------------------------------ */ static int nat_getnext(t, itp) ipftoken_t *t; ipfgeniter_t *itp; { hostmap_t *hm, *nexthm = NULL, zerohm; ipnat_t *ipn, *nextipnat = NULL, zeroipn; nat_t *nat, *nextnat = NULL, zeronat; int error = 0, count; char *dst; count = itp->igi_nitems; if (count < 1) return ENOSPC; READ_ENTER(&ipf_nat); switch (itp->igi_type) { case IPFGENITER_HOSTMAP : hm = t->ipt_data; if (hm == NULL) { nexthm = ipf_hm_maplist; } else { nexthm = hm->hm_next; } break; case IPFGENITER_IPNAT : ipn = t->ipt_data; if (ipn == NULL) { nextipnat = nat_list; } else { nextipnat = ipn->in_next; } break; case IPFGENITER_NAT : nat = t->ipt_data; if (nat == NULL) { nextnat = nat_instances; } else { nextnat = nat->nat_next; } break; default : RWLOCK_EXIT(&ipf_nat); return EINVAL; } dst = itp->igi_data; for (;;) { switch (itp->igi_type) { case IPFGENITER_HOSTMAP : if (nexthm != NULL) { if (count == 1) { ATOMIC_INC32(nexthm->hm_ref); t->ipt_data = nexthm; } } else { bzero(&zerohm, sizeof(zerohm)); nexthm = &zerohm; count = 1; t->ipt_data = NULL; } break; case IPFGENITER_IPNAT : if (nextipnat != NULL) { if (count == 1) { MUTEX_ENTER(&nextipnat->in_lock); nextipnat->in_use++; MUTEX_EXIT(&nextipnat->in_lock); t->ipt_data = nextipnat; } } else { bzero(&zeroipn, sizeof(zeroipn)); nextipnat = &zeroipn; count = 1; t->ipt_data = NULL; } break; case IPFGENITER_NAT : if (nextnat != NULL) { if (count == 1) { MUTEX_ENTER(&nextnat->nat_lock); nextnat->nat_ref++; MUTEX_EXIT(&nextnat->nat_lock); t->ipt_data = nextnat; } } else { bzero(&zeronat, sizeof(zeronat)); nextnat = &zeronat; count = 1; t->ipt_data = NULL; } break; default : break; } RWLOCK_EXIT(&ipf_nat); /* * Copying out to user space needs to be done without the lock. */ switch (itp->igi_type) { case IPFGENITER_HOSTMAP : error = COPYOUT(nexthm, dst, sizeof(*nexthm)); if (error != 0) error = EFAULT; else dst += sizeof(*nexthm); break; case IPFGENITER_IPNAT : error = COPYOUT(nextipnat, dst, sizeof(*nextipnat)); if (error != 0) error = EFAULT; else dst += sizeof(*nextipnat); break; case IPFGENITER_NAT : error = COPYOUT(nextnat, dst, sizeof(*nextnat)); if (error != 0) error = EFAULT; else dst += sizeof(*nextnat); break; } if ((count == 1) || (error != 0)) break; count--; READ_ENTER(&ipf_nat); /* * We need to have the lock again here to make sure that * using _next is consistent. */ switch (itp->igi_type) { case IPFGENITER_HOSTMAP : nexthm = nexthm->hm_next; break; case IPFGENITER_IPNAT : nextipnat = nextipnat->in_next; break; case IPFGENITER_NAT : nextnat = nextnat->nat_next; break; } } switch (itp->igi_type) { case IPFGENITER_HOSTMAP : if (hm != NULL) { WRITE_ENTER(&ipf_nat); fr_hostmapdel(&hm); RWLOCK_EXIT(&ipf_nat); } break; case IPFGENITER_IPNAT : if (ipn != NULL) { fr_ipnatderef(&ipn); } break; case IPFGENITER_NAT : if (nat != NULL) { fr_natderef(&nat); } break; default : break; } return error; } /* ------------------------------------------------------------------------ */ /* Function: nat_iterator */ /* Returns: int - 0 == ok, else error */ /* Parameters: token(I) - pointer to ipftoken structure */ /* itp(I) - pointer to ipfgeniter_t structure */ /* */ /* This function acts as a handler for the SIOCGENITER ioctls that use a */ /* generic structure to iterate through a list. There are three different */ /* linked lists of NAT related information to go through: NAT rules, active */ /* NAT mappings and the NAT fragment cache. 
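/* ------------------------------------------------------------------------ */
/* Illustrative sketch (not part of the committed source): nat_getnext()    */
/* above snapshots each element under the lock, drops the lock to copy the  */
/* data out, then retakes the lock before following the _next pointer.  A   */
/* minimal user-space analogue of that pattern, with invented types:        */
/* ------------------------------------------------------------------------ */
#include <pthread.h>
#include <string.h>
#include <stdio.h>

struct entry { int value; struct entry *next; };

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry e2 = { 2, NULL }, e1 = { 1, &e2 }, *list_head = &e1;

static int
list_getnext(struct entry *dst, int count)
{
	struct entry snap, *cur;
	int copied = 0;

	pthread_mutex_lock(&list_lock);
	cur = list_head;
	while (cur != NULL && copied < count) {
		snap = *cur;		/* snapshot and advance under lock */
		cur = cur->next;
		pthread_mutex_unlock(&list_lock);
		memcpy(&dst[copied++], &snap, sizeof(snap));	/* "copyout" */
		pthread_mutex_lock(&list_lock);
	}
	pthread_mutex_unlock(&list_lock);
	return copied;
}

int
main(void)
{
	struct entry out[4];
	int i, n = list_getnext(out, 4);

	for (i = 0; i < n; i++)
		printf("entry %d: value %d\n", i, out[i].value);
	return 0;
}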
*/ /* ------------------------------------------------------------------------ */ static int nat_iterator(token, itp) ipftoken_t *token; ipfgeniter_t *itp; { int error; if (itp->igi_data == NULL) return EFAULT; token->ipt_subtype = itp->igi_type; switch (itp->igi_type) { case IPFGENITER_HOSTMAP : case IPFGENITER_IPNAT : case IPFGENITER_NAT : error = nat_getnext(token, itp); break; case IPFGENITER_NATFRAG : #ifdef USE_MUTEXES error = fr_nextfrag(token, itp, &ipfr_natlist, &ipfr_nattail, &ipf_natfrag); #else error = fr_nextfrag(token, itp, &ipfr_natlist, &ipfr_nattail); #endif break; default : error = EINVAL; break; } return error; } /* ------------------------------------------------------------------------ */ /* Function: nat_extraflush */ /* Returns: int - 0 == success, -1 == failure */ /* Parameters: which(I) - how to flush the active NAT table */ /* Write Locks: ipf_nat */ /* */ /* Flush nat tables. Three actions currently defined: */ /* which == 0 : flush all nat table entries */ /* which == 1 : flush TCP connections which have started to close but are */ /* stuck for some reason. */ /* which == 2 : flush TCP connections which have been idle for a long time, */ /* starting at > 4 days idle and working back in successive half-*/ /* days to at most 12 hours old. If this fails to free enough */ /* slots then work backwards in half hour slots to 30 minutes. */ /* If that too fails, then work backwards in 30 second intervals */ /* for the last 30 minutes to at worst 30 seconds idle. */ /* ------------------------------------------------------------------------ */ static int nat_extraflush(which) int which; { ipftq_t *ifq, *ifqnext; nat_t *nat, **natp; ipftqent_t *tqn; int removed; SPL_INT(s); removed = 0; SPL_NET(s); switch (which) { case 0 : /* * Style 0 flush removes everything... */ for (natp = &nat_instances; ((nat = *natp) != NULL); ) { nat_delete(nat, NL_FLUSH); removed++; } break; case 1 : /* * Since we're only interested in things that are closing, * we can start with the appropriate timeout queue. */ for (ifq = nat_tqb + IPF_TCPS_CLOSE_WAIT; ifq != NULL; ifq = ifq->ifq_next) { for (tqn = ifq->ifq_head; tqn != NULL; ) { nat = tqn->tqe_parent; tqn = tqn->tqe_next; if (nat->nat_p != IPPROTO_TCP) break; nat_delete(nat, NL_EXPIRE); removed++; } } /* * Also need to look through the user defined queues. */ for (ifq = nat_utqe; ifq != NULL; ifq = ifqnext) { ifqnext = ifq->ifq_next; for (tqn = ifq->ifq_head; tqn != NULL; ) { nat = tqn->tqe_parent; tqn = tqn->tqe_next; if (nat->nat_p != IPPROTO_TCP) continue; if ((nat->nat_tcpstate[0] > IPF_TCPS_ESTABLISHED) && (nat->nat_tcpstate[1] > IPF_TCPS_ESTABLISHED)) { nat_delete(nat, NL_EXPIRE); removed++; } } } break; /* * Args 5-11 correspond to flushing those particular states * for TCP connections. */ case IPF_TCPS_CLOSE_WAIT : case IPF_TCPS_FIN_WAIT_1 : case IPF_TCPS_CLOSING : case IPF_TCPS_LAST_ACK : case IPF_TCPS_FIN_WAIT_2 : case IPF_TCPS_TIME_WAIT : case IPF_TCPS_CLOSED : tqn = nat_tqb[which].ifq_head; while (tqn != NULL) { nat = tqn->tqe_parent; tqn = tqn->tqe_next; nat_delete(nat, NL_FLUSH); removed++; } break; default : if (which < 30) break; /* * Take a large arbitrary number to mean the number of seconds * for which which consider to be the maximum value we'll allow * the expiration to be. 
*/ which = IPF_TTLVAL(which); for (natp = &nat_instances; ((nat = *natp) != NULL); ) { if (fr_ticks - nat->nat_touched > which) { nat_delete(nat, NL_FLUSH); removed++; } else natp = &nat->nat_next; } break; } if (which != 2) { SPL_X(s); return removed; } /* * Asked to remove inactive entries because the table is full. */ if (fr_ticks - nat_last_force_flush > IPF_TTLVAL(5)) { nat_last_force_flush = fr_ticks; removed = ipf_queueflush(nat_flush_entry, nat_tqb, nat_utqe); } SPL_X(s); return removed; } /* ------------------------------------------------------------------------ */ /* Function: nat_flush_entry */ /* Returns: 0 - always succeeds */ /* Parameters: entry(I) - pointer to NAT entry */ /* Write Locks: ipf_nat */ /* */ /* This function is a stepping stone between ipf_queueflush() and */ /* nat_dlete(). It is used so we can provide a uniform interface via the */ /* ipf_queueflush() function. Since the nat_delete() function returns void */ /* we translate that to mean it always succeeds in deleting something. */ /* ------------------------------------------------------------------------ */ static int nat_flush_entry(entry) void *entry; { nat_delete(entry, NL_FLUSH); return 0; } /* ------------------------------------------------------------------------ */ /* Function: nat_gettable */ /* Returns: int - 0 = success, else error */ /* Parameters: data(I) - pointer to ioctl data */ /* */ /* This function handles ioctl requests for tables of nat information. */ /* At present the only table it deals with is the hash bucket statistics. */ /* ------------------------------------------------------------------------ */ static int nat_gettable(data) char *data; { ipftable_t table; int error; error = fr_inobj(data, &table, IPFOBJ_GTABLE); if (error != 0) return error; switch (table.ita_type) { case IPFTABLE_BUCKETS_NATIN : error = COPYOUT(nat_stats.ns_bucketlen[0], table.ita_table, ipf_nattable_sz * sizeof(u_long)); break; case IPFTABLE_BUCKETS_NATOUT : error = COPYOUT(nat_stats.ns_bucketlen[1], table.ita_table, ipf_nattable_sz * sizeof(u_long)); break; default : return EINVAL; } if (error != 0) { error = EFAULT; } return error; } Index: head/sys/fs/procfs/procfs_status.c =================================================================== --- head/sys/fs/procfs/procfs_status.c (revision 192894) +++ head/sys/fs/procfs/procfs_status.c (revision 192895) @@ -1,216 +1,217 @@ /*- * Copyright (c) 1993 Jan-Simon Pendry * Copyright (c) 1993 * The Regents of the University of California. All rights reserved. * * This code is derived from software contributed to Berkeley by * Jan-Simon Pendry. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)procfs_status.c 8.4 (Berkeley) 6/15/94 * * From: * $Id: procfs_status.c,v 3.1 1993/12/15 09:40:17 jsp Exp $ * $FreeBSD$ */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include int procfs_doprocstatus(PFS_FILL_ARGS) { struct session *sess; struct thread *tdfirst; struct tty *tp; struct ucred *cr; const char *wmesg; char *pc; char *sep; int pid, ppid, pgid, sid; int i; pid = p->p_pid; PROC_LOCK(p); ppid = p->p_pptr ? p->p_pptr->p_pid : 0; pgid = p->p_pgrp->pg_id; sess = p->p_pgrp->pg_session; SESS_LOCK(sess); sid = sess->s_leader ? sess->s_leader->p_pid : 0; /* comm pid ppid pgid sid tty ctty,sldr start ut st wmsg euid ruid rgid,egid,groups[1 .. NGROUPS] */ pc = p->p_comm; do { if (*pc < 33 || *pc > 126 || *pc == '\\') sbuf_printf(sb, "\\%03o", *pc); else sbuf_putc(sb, *pc); } while (*++pc); sbuf_printf(sb, " %d %d %d %d ", pid, ppid, pgid, sid); if ((p->p_flag & P_CONTROLT) && (tp = sess->s_ttyp)) sbuf_printf(sb, "%s ", devtoname(tp->t_dev)); else sbuf_printf(sb, "- "); sep = ""; if (sess->s_ttyvp) { sbuf_printf(sb, "%sctty", sep); sep = ","; } if (SESS_LEADER(p)) { sbuf_printf(sb, "%ssldr", sep); sep = ","; } SESS_UNLOCK(sess); if (*sep != ',') { sbuf_printf(sb, "noflags"); } tdfirst = FIRST_THREAD_IN_PROC(p); if (tdfirst->td_wchan != NULL) { KASSERT(tdfirst->td_wmesg != NULL, ("wchan %p has no wmesg", tdfirst->td_wchan)); wmesg = tdfirst->td_wmesg; } else wmesg = "nochan"; if (p->p_flag & P_INMEM) { struct timeval start, ut, st; PROC_SLOCK(p); calcru(p, &ut, &st); PROC_SUNLOCK(p); start = p->p_stats->p_start; timevaladd(&start, &boottime); sbuf_printf(sb, " %jd,%ld %jd,%ld %jd,%ld", (intmax_t)start.tv_sec, start.tv_usec, (intmax_t)ut.tv_sec, ut.tv_usec, (intmax_t)st.tv_sec, st.tv_usec); } else sbuf_printf(sb, " -1,-1 -1,-1 -1,-1"); sbuf_printf(sb, " %s", wmesg); cr = p->p_ucred; sbuf_printf(sb, " %lu %lu %lu", (u_long)cr->cr_uid, (u_long)cr->cr_ruid, (u_long)cr->cr_rgid); /* egid (cr->cr_svgid) is equal to cr_ngroups[0] see also getegid(2) in /sys/kern/kern_prot.c */ for (i = 0; i < cr->cr_ngroups; i++) { sbuf_printf(sb, ",%lu", (u_long)cr->cr_groups[i]); } - if (jailed(p->p_ucred)) { - mtx_lock(&p->p_ucred->cr_prison->pr_mtx); - sbuf_printf(sb, " %s", p->p_ucred->cr_prison->pr_host); - mtx_unlock(&p->p_ucred->cr_prison->pr_mtx); + if (jailed(cr)) { + mtx_lock(&cr->cr_prison->pr_mtx); + sbuf_printf(sb, " %s", + prison_name(td->td_ucred->cr_prison, cr->cr_prison)); + mtx_unlock(&cr->cr_prison->pr_mtx); } else { sbuf_printf(sb, " -"); } PROC_UNLOCK(p); sbuf_printf(sb, "\n"); return (0); } int procfs_doproccmdline(PFS_FILL_ARGS) { struct ps_strings pstr; char **ps_argvstr; int error, i; /* * If we are using the ps/cmdline caching, use that. Otherwise * revert back to the old way which only implements full cmdline * for the currept process and just p->p_comm for all other * processes. 
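/*
 * Illustrative sketch (not part of the committed source): a tiny user
 * program that reads back the status line produced by procfs_doprocstatus()
 * above.  It assumes procfs(5) is mounted on /proc; with the change in this
 * revision the final field is either "-" or the jail name as seen from the
 * observer's prison.
 */
#include <stdio.h>

int
main(void)
{
	char buf[1024];
	FILE *fp;

	if ((fp = fopen("/proc/curproc/status", "r")) == NULL) {
		perror("fopen");
		return 1;
	}
	if (fgets(buf, sizeof(buf), fp) != NULL)
		fputs(buf, stdout);
	fclose(fp);
	return 0;
}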
* Note that if the argv is no longer available, we deliberately * don't fall back on p->p_comm or return an error: the authentic * Linux behaviour is to return zero-length in this case. */ PROC_LOCK(p); if (p->p_args && p_cansee(td, p) == 0) { sbuf_bcpy(sb, p->p_args->ar_args, p->p_args->ar_length); PROC_UNLOCK(p); return (0); } PROC_UNLOCK(p); if (p != td->td_proc) { sbuf_printf(sb, "%.*s", MAXCOMLEN, p->p_comm); } else { error = copyin((void *)p->p_sysent->sv_psstrings, &pstr, sizeof(pstr)); if (error) return (error); if (pstr.ps_nargvstr > ARG_MAX) return (E2BIG); ps_argvstr = malloc(pstr.ps_nargvstr * sizeof(char *), M_TEMP, M_WAITOK); error = copyin((void *)pstr.ps_argvstr, ps_argvstr, pstr.ps_nargvstr * sizeof(char *)); if (error) { free(ps_argvstr, M_TEMP); return (error); } for (i = 0; i < pstr.ps_nargvstr; i++) { sbuf_copyin(sb, ps_argvstr[i], 0); sbuf_printf(sb, "%c", '\0'); } free(ps_argvstr, M_TEMP); } return (0); } Index: head/sys/kern/init_main.c =================================================================== --- head/sys/kern/init_main.c (revision 192894) +++ head/sys/kern/init_main.c (revision 192895) @@ -1,773 +1,775 @@ /*- * Copyright (c) 1995 Terrence R. Lambert * All rights reserved. * * Copyright (c) 1982, 1986, 1989, 1991, 1992, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by the University of * California, Berkeley and its contributors. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * @(#)init_main.c 8.9 (Berkeley) 1/21/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_init_path.h" #include "opt_mac.h" #include #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include void mi_startup(void); /* Should be elsewhere */ /* Components of the first process -- never freed. */ static struct session session0; static struct pgrp pgrp0; struct proc proc0; struct thread thread0 __aligned(16); struct vmspace vmspace0; struct proc *initproc; int boothowto = 0; /* initialized so that it can be patched */ SYSCTL_INT(_debug, OID_AUTO, boothowto, CTLFLAG_RD, &boothowto, 0, ""); int bootverbose; SYSCTL_INT(_debug, OID_AUTO, bootverbose, CTLFLAG_RW, &bootverbose, 0, ""); /* * This ensures that there is at least one entry so that the sysinit_set * symbol is not undefined. A sybsystem ID of SI_SUB_DUMMY is never * executed. */ SYSINIT(placeholder, SI_SUB_DUMMY, SI_ORDER_ANY, NULL, NULL); /* * The sysinit table itself. Items are checked off as the are run. * If we want to register new sysinit types, add them to newsysinit. */ SET_DECLARE(sysinit_set, struct sysinit); struct sysinit **sysinit, **sysinit_end; struct sysinit **newsysinit, **newsysinit_end; /* * Merge a new sysinit set into the current set, reallocating it if * necessary. This can only be called after malloc is running. */ void sysinit_add(struct sysinit **set, struct sysinit **set_end) { struct sysinit **newset; struct sysinit **sipp; struct sysinit **xipp; int count; count = set_end - set; if (newsysinit) count += newsysinit_end - newsysinit; else count += sysinit_end - sysinit; newset = malloc(count * sizeof(*sipp), M_TEMP, M_NOWAIT); if (newset == NULL) panic("cannot malloc for sysinit"); xipp = newset; if (newsysinit) for (sipp = newsysinit; sipp < newsysinit_end; sipp++) *xipp++ = *sipp; else for (sipp = sysinit; sipp < sysinit_end; sipp++) *xipp++ = *sipp; for (sipp = set; sipp < set_end; sipp++) *xipp++ = *sipp; if (newsysinit) free(newsysinit, M_TEMP); newsysinit = newset; newsysinit_end = newset + count; } /* * System startup; initialize the world, create process 0, mount root * filesystem, and fork to create init and pagedaemon. Most of the * hard work is done in the lower-level initialization routines including * startup(), which does memory initialization and autoconfiguration. * * This allows simple addition of new kernel subsystems that require * boot time initialization. It also allows substitution of subsystem * (for instance, a scheduler, kernel profiler, or VM system) by object * module. Finally, it allows for optional "kernel threads". */ void mi_startup(void) { register struct sysinit **sipp; /* system initialization*/ register struct sysinit **xipp; /* interior loop of sort*/ register struct sysinit *save; /* bubble*/ #if defined(VERBOSE_SYSINIT) int last; int verbose; #endif if (sysinit == NULL) { sysinit = SET_BEGIN(sysinit_set); sysinit_end = SET_LIMIT(sysinit_set); } restart: /* * Perform a bubble sort of the system initialization objects by * their subsystem (primary key) and order (secondary key). 
*/ for (sipp = sysinit; sipp < sysinit_end; sipp++) { for (xipp = sipp + 1; xipp < sysinit_end; xipp++) { if ((*sipp)->subsystem < (*xipp)->subsystem || ((*sipp)->subsystem == (*xipp)->subsystem && (*sipp)->order <= (*xipp)->order)) continue; /* skip*/ save = *sipp; *sipp = *xipp; *xipp = save; } } #if defined(VERBOSE_SYSINIT) last = SI_SUB_COPYRIGHT; verbose = 0; #if !defined(DDB) printf("VERBOSE_SYSINIT: DDB not enabled, symbol lookups disabled.\n"); #endif #endif /* * Traverse the (now) ordered list of system initialization tasks. * Perform each task, and continue on to the next task. * * The last item on the list is expected to be the scheduler, * which will not return. */ for (sipp = sysinit; sipp < sysinit_end; sipp++) { if ((*sipp)->subsystem == SI_SUB_DUMMY) continue; /* skip dummy task(s)*/ if ((*sipp)->subsystem == SI_SUB_DONE) continue; #if defined(VERBOSE_SYSINIT) if ((*sipp)->subsystem > last) { verbose = 1; last = (*sipp)->subsystem; printf("subsystem %x\n", last); } if (verbose) { #if defined(DDB) const char *name; c_db_sym_t sym; db_expr_t offset; sym = db_search_symbol((vm_offset_t)(*sipp)->func, DB_STGY_PROC, &offset); db_symbol_values(sym, &name, NULL); if (name != NULL) printf(" %s(%p)... ", name, (*sipp)->udata); else #endif printf(" %p(%p)... ", (*sipp)->func, (*sipp)->udata); } #endif /* Call function */ (*((*sipp)->func))((*sipp)->udata); #if defined(VERBOSE_SYSINIT) if (verbose) printf("done.\n"); #endif /* Check off the one we're just done */ (*sipp)->subsystem = SI_SUB_DONE; /* Check if we've installed more sysinit items via KLD */ if (newsysinit != NULL) { if (sysinit != SET_BEGIN(sysinit_set)) free(sysinit, M_TEMP); sysinit = newsysinit; sysinit_end = newsysinit_end; newsysinit = NULL; newsysinit_end = NULL; goto restart; } } panic("Shouldn't get here!"); /* NOTREACHED*/ } /* *************************************************************************** **** **** The following SYSINIT's belong elsewhere, but have not yet **** been moved. 
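/*
 * Illustrative sketch (not part of the committed source): this is the usual
 * way a kernel module hooks into the sysinit machinery described above.
 * example_init and the chosen subsystem/order values are placeholders.
 */
#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/systm.h>

static void
example_init(void *arg __unused)
{

	printf("example: initialized\n");
}
/* Sorted into the table by (subsystem, order) and run once at boot/load. */
SYSINIT(example_init_id, SI_SUB_DRIVERS, SI_ORDER_MIDDLE, example_init, NULL);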
**** *************************************************************************** */ static void print_caddr_t(void *data __unused) { printf("%s", (char *)data); } SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright); SYSINIT(trademark, SI_SUB_COPYRIGHT, SI_ORDER_SECOND, print_caddr_t, trademark); SYSINIT(version, SI_SUB_COPYRIGHT, SI_ORDER_THIRD, print_caddr_t, version); #ifdef WITNESS static char wit_warn[] = "WARNING: WITNESS option enabled, expect reduced performance.\n"; SYSINIT(witwarn, SI_SUB_COPYRIGHT, SI_ORDER_THIRD + 1, print_caddr_t, wit_warn); SYSINIT(witwarn2, SI_SUB_RUN_SCHEDULER, SI_ORDER_THIRD + 1, print_caddr_t, wit_warn); #endif #ifdef DIAGNOSTIC static char diag_warn[] = "WARNING: DIAGNOSTIC option enabled, expect reduced performance.\n"; SYSINIT(diagwarn, SI_SUB_COPYRIGHT, SI_ORDER_THIRD + 2, print_caddr_t, diag_warn); SYSINIT(diagwarn2, SI_SUB_RUN_SCHEDULER, SI_ORDER_THIRD + 2, print_caddr_t, diag_warn); #endif static void set_boot_verbose(void *data __unused) { if (boothowto & RB_VERBOSE) bootverbose++; } SYSINIT(boot_verbose, SI_SUB_TUNABLES, SI_ORDER_ANY, set_boot_verbose, NULL); struct sysentvec null_sysvec = { .sv_size = 0, .sv_table = NULL, .sv_mask = 0, .sv_sigsize = 0, .sv_sigtbl = NULL, .sv_errsize = 0, .sv_errtbl = NULL, .sv_transtrap = NULL, .sv_fixup = NULL, .sv_sendsig = NULL, .sv_sigcode = NULL, .sv_szsigcode = NULL, .sv_prepsyscall = NULL, .sv_name = "null", .sv_coredump = NULL, .sv_imgact_try = NULL, .sv_minsigstksz = 0, .sv_pagesize = PAGE_SIZE, .sv_minuser = VM_MIN_ADDRESS, .sv_maxuser = VM_MAXUSER_ADDRESS, .sv_usrstack = USRSTACK, .sv_psstrings = PS_STRINGS, .sv_stackprot = VM_PROT_ALL, .sv_copyout_strings = NULL, .sv_setregs = NULL, .sv_fixlimit = NULL, .sv_maxssiz = NULL }; /* *************************************************************************** **** **** The two following SYSINIT's are proc0 specific glue code. I am not **** convinced that they can not be safely combined, but their order of **** operation has been maintained as the same as the original init_main.c **** for right now. **** **** These probably belong in init_proc.c or kern_proc.c, since they **** deal with proc0 (the fork template process). **** *************************************************************************** */ /* ARGSUSED*/ static void proc0_init(void *dummy __unused) { struct proc *p; unsigned i; struct thread *td; GIANT_REQUIRED; p = &proc0; td = &thread0; /* * Initialize magic number and osrel. */ p->p_magic = P_MAGIC; p->p_osrel = osreldate; /* * Initialize thread and process structures. */ procinit(); /* set up proc zone */ threadinit(); /* set up UMA zones */ /* * Initialise scheduler resources. * Add scheduler specific parts to proc, thread as needed. */ schedinit(); /* scheduler gets its house in order */ /* * Initialize sleep queue hash table */ sleepinit(); /* * additional VM structures */ vm_init2(); /* * Create process 0 (the swapper). 
*/ LIST_INSERT_HEAD(&allproc, p, p_list); LIST_INSERT_HEAD(PIDHASH(0), p, p_hash); mtx_init(&pgrp0.pg_mtx, "process group", NULL, MTX_DEF | MTX_DUPOK); p->p_pgrp = &pgrp0; LIST_INSERT_HEAD(PGRPHASH(0), &pgrp0, pg_hash); LIST_INIT(&pgrp0.pg_members); LIST_INSERT_HEAD(&pgrp0.pg_members, p, p_pglist); pgrp0.pg_session = &session0; mtx_init(&session0.s_mtx, "session", NULL, MTX_DEF); refcount_init(&session0.s_count, 1); session0.s_leader = p; p->p_sysent = &null_sysvec; p->p_flag = P_SYSTEM | P_INMEM; p->p_state = PRS_NORMAL; knlist_init(&p->p_klist, &p->p_mtx, NULL, NULL, NULL); STAILQ_INIT(&p->p_ktr); p->p_nice = NZERO; td->td_tid = PID_MAX + 1; td->td_state = TDS_RUNNING; td->td_pri_class = PRI_TIMESHARE; td->td_user_pri = PUSER; td->td_base_user_pri = PUSER; td->td_priority = PVM; td->td_base_pri = PUSER; td->td_oncpu = 0; td->td_flags = TDF_INMEM|TDP_KTHREAD; td->td_cpuset = cpuset_thread0(); + prison0.pr_cpuset = cpuset_ref(td->td_cpuset); p->p_peers = 0; p->p_leader = p; strncpy(p->p_comm, "kernel", sizeof (p->p_comm)); strncpy(td->td_name, "swapper", sizeof (td->td_name)); callout_init(&p->p_itcallout, CALLOUT_MPSAFE); callout_init_mtx(&p->p_limco, &p->p_mtx, 0); callout_init(&td->td_slpcallout, CALLOUT_MPSAFE); /* Create credentials. */ p->p_ucred = crget(); p->p_ucred->cr_ngroups = 1; /* group 0 */ p->p_ucred->cr_uidinfo = uifind(0); p->p_ucred->cr_ruidinfo = uifind(0); - p->p_ucred->cr_prison = NULL; /* Don't jail it. */ + p->p_ucred->cr_prison = &prison0; #ifdef VIMAGE KASSERT(LIST_FIRST(&vimage_head) != NULL, ("vimage_head empty")); P_TO_VIMAGE(p) = LIST_FIRST(&vimage_head); /* set ucred->cr_vimage */ refcount_acquire(&P_TO_VIMAGE(p)->vi_ucredrefc); LIST_FIRST(&vprocg_head)->nprocs++; #endif #ifdef AUDIT audit_cred_kproc0(p->p_ucred); #endif #ifdef MAC mac_cred_create_swapper(p->p_ucred); #endif td->td_ucred = crhold(p->p_ucred); /* Create sigacts. */ p->p_sigacts = sigacts_alloc(); /* Initialize signal state for process 0. */ siginit(&proc0); /* Create the file descriptor table. */ p->p_fd = fdinit(NULL); p->p_fdtol = NULL; /* Create the limits structures. */ p->p_limit = lim_alloc(); for (i = 0; i < RLIM_NLIMITS; i++) p->p_limit->pl_rlimit[i].rlim_cur = p->p_limit->pl_rlimit[i].rlim_max = RLIM_INFINITY; p->p_limit->pl_rlimit[RLIMIT_NOFILE].rlim_cur = p->p_limit->pl_rlimit[RLIMIT_NOFILE].rlim_max = maxfiles; p->p_limit->pl_rlimit[RLIMIT_NPROC].rlim_cur = p->p_limit->pl_rlimit[RLIMIT_NPROC].rlim_max = maxproc; i = ptoa(cnt.v_free_count); p->p_limit->pl_rlimit[RLIMIT_RSS].rlim_max = i; p->p_limit->pl_rlimit[RLIMIT_MEMLOCK].rlim_max = i; p->p_limit->pl_rlimit[RLIMIT_MEMLOCK].rlim_cur = i / 3; p->p_cpulimit = RLIM_INFINITY; p->p_stats = pstats_alloc(); /* Allocate a prototype map so we have something to fork. */ pmap_pinit0(vmspace_pmap(&vmspace0)); p->p_vmspace = &vmspace0; vmspace0.vm_refcnt = 1; vm_map_init(&vmspace0.vm_map, p->p_sysent->sv_minuser, p->p_sysent->sv_maxuser); vmspace0.vm_map.pmap = vmspace_pmap(&vmspace0); /*- * call the init and ctor for the new thread and proc * we wait to do this until all other structures * are fairly sane. */ EVENTHANDLER_INVOKE(process_init, p); EVENTHANDLER_INVOKE(thread_init, td); EVENTHANDLER_INVOKE(process_ctor, p); EVENTHANDLER_INVOKE(thread_ctor, td); /* * Charge root for one process. 
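/*
 * Illustrative sketch (not part of the committed source): with the change
 * above, every credential points at some prison and unjailed processes
 * point at prison0, so a "jailed?" test can reduce to a pointer compare.
 * The structures below are invented, user-space stand-ins; this is a sketch
 * of the idea, not a copy of kern_jail.c.
 */
#include <stdio.h>

struct prison { const char *pr_name; };
struct ucred { struct prison *cr_prison; };

static struct prison prison0 = { "0" };		/* never-jailed root prison */

static int
jailed_sketch(const struct ucred *cr)
{
	return (cr->cr_prison != &prison0);
}

int
main(void)
{
	struct prison pr = { "testjail" };
	struct ucred host = { &prison0 }, inmate = { &pr };

	printf("host jailed: %d, inmate jailed: %d\n",
	    jailed_sketch(&host), jailed_sketch(&inmate));
	return 0;
}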
*/ (void)chgproccnt(p->p_ucred->cr_ruidinfo, 1, 0); } SYSINIT(p0init, SI_SUB_INTRINSIC, SI_ORDER_FIRST, proc0_init, NULL); /* ARGSUSED*/ static void proc0_post(void *dummy __unused) { struct timespec ts; struct proc *p; struct rusage ru; struct thread *td; /* * Now we can look at the time, having had a chance to verify the * time from the filesystem. Pretend that proc0 started now. */ sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { microuptime(&p->p_stats->p_start); PROC_SLOCK(p); rufetch(p, &ru); /* Clears thread stats */ PROC_SUNLOCK(p); p->p_rux.rux_runtime = 0; p->p_rux.rux_uticks = 0; p->p_rux.rux_sticks = 0; p->p_rux.rux_iticks = 0; FOREACH_THREAD_IN_PROC(p, td) { td->td_runtime = 0; } } sx_sunlock(&allproc_lock); PCPU_SET(switchtime, cpu_ticks()); PCPU_SET(switchticks, ticks); /* * Give the ``random'' number generator a thump. */ nanotime(&ts); srandom(ts.tv_sec ^ ts.tv_nsec); } SYSINIT(p0post, SI_SUB_INTRINSIC_POST, SI_ORDER_FIRST, proc0_post, NULL); /* *************************************************************************** **** **** The following SYSINIT's and glue code should be moved to the **** respective files on a per subsystem basis. **** *************************************************************************** */ /* *************************************************************************** **** **** The following code probably belongs in another file, like **** kern/init_init.c. **** *************************************************************************** */ /* * List of paths to try when searching for "init". */ static char init_path[MAXPATHLEN] = #ifdef INIT_PATH __XSTRING(INIT_PATH); #else "/sbin/init:/sbin/oinit:/sbin/init.bak:/rescue/init:/stand/sysinstall"; #endif SYSCTL_STRING(_kern, OID_AUTO, init_path, CTLFLAG_RD, init_path, 0, "Path used to search the init process"); /* * Shutdown timeout of init(8). * Unused within kernel, but used to control init(8), hence do not remove. */ #ifndef INIT_SHUTDOWN_TIMEOUT #define INIT_SHUTDOWN_TIMEOUT 120 #endif static int init_shutdown_timeout = INIT_SHUTDOWN_TIMEOUT; SYSCTL_INT(_kern, OID_AUTO, init_shutdown_timeout, CTLFLAG_RW, &init_shutdown_timeout, 0, ""); /* * Start the initial user process; try exec'ing each pathname in init_path. * The program is invoked with one argument containing the boot flags. */ static void start_init(void *dummy) { vm_offset_t addr; struct execve_args args; int options, error; char *var, *path, *next, *s; char *ucp, **uap, *arg0, *arg1; struct thread *td; struct proc *p; mtx_lock(&Giant); GIANT_REQUIRED; td = curthread; p = td->td_proc; vfs_mountroot(); /* * Need just enough stack to hold the faked-up "execve()" arguments. */ addr = p->p_sysent->sv_usrstack - PAGE_SIZE; if (vm_map_find(&p->p_vmspace->vm_map, NULL, 0, &addr, PAGE_SIZE, FALSE, VM_PROT_ALL, VM_PROT_ALL, 0) != 0) panic("init: couldn't allocate argument space"); p->p_vmspace->vm_maxsaddr = (caddr_t)addr; p->p_vmspace->vm_ssize = 1; if ((var = getenv("init_path")) != NULL) { strlcpy(init_path, var, sizeof(init_path)); freeenv(var); } for (path = init_path; *path != '\0'; path = next) { while (*path == ':') path++; if (*path == '\0') break; for (next = path; *next != '\0' && *next != ':'; next++) /* nothing */ ; if (bootverbose) printf("start_init: trying %.*s\n", (int)(next - path), path); /* * Move out the boot flag argument. 
*/ options = 0; ucp = (char *)p->p_sysent->sv_usrstack; (void)subyte(--ucp, 0); /* trailing zero */ if (boothowto & RB_SINGLE) { (void)subyte(--ucp, 's'); options = 1; } #ifdef notyet if (boothowto & RB_FASTBOOT) { (void)subyte(--ucp, 'f'); options = 1; } #endif #ifdef BOOTCDROM (void)subyte(--ucp, 'C'); options = 1; #endif if (options == 0) (void)subyte(--ucp, '-'); (void)subyte(--ucp, '-'); /* leading hyphen */ arg1 = ucp; /* * Move out the file name (also arg 0). */ (void)subyte(--ucp, 0); for (s = next - 1; s >= path; s--) (void)subyte(--ucp, *s); arg0 = ucp; /* * Move out the arg pointers. */ uap = (char **)((intptr_t)ucp & ~(sizeof(intptr_t)-1)); (void)suword((caddr_t)--uap, (long)0); /* terminator */ (void)suword((caddr_t)--uap, (long)(intptr_t)arg1); (void)suword((caddr_t)--uap, (long)(intptr_t)arg0); /* * Point at the arguments. */ args.fname = arg0; args.argv = uap; args.envv = NULL; /* * Now try to exec the program. If can't for any reason * other than it doesn't exist, complain. * * Otherwise, return via fork_trampoline() all the way * to user mode as init! */ if ((error = execve(td, &args)) == 0) { mtx_unlock(&Giant); return; } if (error != ENOENT) printf("exec %.*s: error %d\n", (int)(next - path), path, error); } printf("init: not found in path %s\n", init_path); panic("no init"); } /* * Like kproc_create(), but runs in it's own address space. * We do this early to reserve pid 1. * * Note special case - do not make it runnable yet. Other work * in progress will change this more. */ static void create_init(const void *udata __unused) { struct ucred *newcred, *oldcred; int error; error = fork1(&thread0, RFFDG | RFPROC | RFSTOPPED, 0, &initproc); if (error) panic("cannot fork init: %d\n", error); KASSERT(initproc->p_pid == 1, ("create_init: initproc->p_pid != 1")); /* divorce init's credentials from the kernel's */ newcred = crget(); PROC_LOCK(initproc); initproc->p_flag |= P_SYSTEM | P_INMEM; oldcred = initproc->p_ucred; crcopy(newcred, oldcred); #ifdef MAC mac_cred_create_init(newcred); #endif #ifdef AUDIT audit_cred_proc1(newcred); #endif initproc->p_ucred = newcred; PROC_UNLOCK(initproc); crfree(oldcred); cred_update_thread(FIRST_THREAD_IN_PROC(initproc)); cpu_set_fork_handler(FIRST_THREAD_IN_PROC(initproc), start_init, NULL); } SYSINIT(init, SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL); /* * Make it runnable now. */ static void kick_init(const void *udata __unused) { struct thread *td; td = FIRST_THREAD_IN_PROC(initproc); thread_lock(td); TD_SET_CAN_RUN(td); sched_add(td, SRQ_BORING); thread_unlock(td); } SYSINIT(kickinit, SI_SUB_KTHREAD_INIT, SI_ORDER_FIRST, kick_init, NULL); Index: head/sys/kern/kern_cpuset.c =================================================================== --- head/sys/kern/kern_cpuset.c (revision 192894) +++ head/sys/kern/kern_cpuset.c (revision 192895) @@ -1,1127 +1,1103 @@ /*- * Copyright (c) 2008, Jeffrey Roberson * All rights reserved. * * Copyright (c) 2008 Nokia Corporation * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include -#include /* Must come after sys/proc.h */ #include #ifdef DDB #include #endif /* DDB */ /* * cpusets provide a mechanism for creating and manipulating sets of * processors for the purpose of constraining the scheduling of threads to * specific processors. * * Each process belongs to an identified set, by default this is set 1. Each * thread may further restrict the cpus it may run on to a subset of this * named set. This creates an anonymous set which other threads and processes * may not join by number. * * The named set is referred to herein as the 'base' set to avoid ambiguity. * This set is usually a child of a 'root' set while the anonymous set may * simply be referred to as a mask. In the syscall api these are referred to * as the ROOT, CPUSET, and MASK levels where CPUSET is called 'base' here. * * Threads inherit their set from their creator whether it be anonymous or * not. This means that anonymous sets are immutable because they may be * shared. To modify an anonymous set a new set is created with the desired * mask and the same parent as the existing anonymous set. This gives the * illusion of each thread having a private mask.A * * Via the syscall apis a user may ask to retrieve or modify the root, base, * or mask that is discovered via a pid, tid, or setid. Modifying a set * modifies all numbered and anonymous child sets to comply with the new mask. * Modifying a pid or tid's mask applies only to that tid but must still * exist within the assigned parent set. * * A thread may not be assigned to a a group seperate from other threads in * the process. This is to remove ambiguity when the setid is queried with * a pid argument. There is no other technical limitation. * * This somewhat complex arrangement is intended to make it easy for * applications to query available processors and bind their threads to * specific processors while also allowing administrators to dynamically * reprovision by changing sets which apply to groups of processes. * * A simple application should not concern itself with sets at all and * rather apply masks to its own threads via CPU_WHICH_TID and a -1 id * meaning 'curthread'. It may query availble cpus for that tid with a * getaffinity call using (CPU_LEVEL_CPUSET, CPU_WHICH_PID, -1, ...). */ static uma_zone_t cpuset_zone; static struct mtx cpuset_lock; static struct setlist cpuset_ids; static struct unrhdr *cpuset_unr; static struct cpuset *cpuset_zero; cpuset_t *cpuset_root; /* * Acquire a reference to a cpuset, all pointers must be tracked with refs. 
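/*
 * Illustrative sketch (not part of the committed source): the comment above
 * mentions querying available cpus with (CPU_LEVEL_CPUSET, CPU_WHICH_PID,
 * -1, ...).  A minimal user program doing exactly that with
 * cpuset_getaffinity(2):
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <stdio.h>

int
main(void)
{
	cpuset_t mask;
	int i;

	CPU_ZERO(&mask);
	if (cpuset_getaffinity(CPU_LEVEL_CPUSET, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask) != 0) {
		perror("cpuset_getaffinity");
		return 1;
	}
	printf("cpus in this process's set:");
	for (i = 0; i < CPU_SETSIZE; i++)
		if (CPU_ISSET(i, &mask))
			printf(" %d", i);
	printf("\n");
	return 0;
}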
*/ struct cpuset * cpuset_ref(struct cpuset *set) { refcount_acquire(&set->cs_ref); return (set); } /* * Walks up the tree from 'set' to find the root. Returns the root * referenced. */ static struct cpuset * cpuset_refroot(struct cpuset *set) { for (; set->cs_parent != NULL; set = set->cs_parent) if (set->cs_flags & CPU_SET_ROOT) break; cpuset_ref(set); return (set); } /* * Find the first non-anonymous set starting from 'set'. Returns this set * referenced. May return the passed in set with an extra ref if it is * not anonymous. */ static struct cpuset * cpuset_refbase(struct cpuset *set) { if (set->cs_id == CPUSET_INVALID) set = set->cs_parent; cpuset_ref(set); return (set); } /* * Release a reference in a context where it is safe to allocte. */ void cpuset_rel(struct cpuset *set) { cpusetid_t id; if (refcount_release(&set->cs_ref) == 0) return; mtx_lock_spin(&cpuset_lock); LIST_REMOVE(set, cs_siblings); id = set->cs_id; if (id != CPUSET_INVALID) LIST_REMOVE(set, cs_link); mtx_unlock_spin(&cpuset_lock); cpuset_rel(set->cs_parent); uma_zfree(cpuset_zone, set); if (id != CPUSET_INVALID) free_unr(cpuset_unr, id); } /* * Deferred release must be used when in a context that is not safe to * allocate/free. This places any unreferenced sets on the list 'head'. */ static void cpuset_rel_defer(struct setlist *head, struct cpuset *set) { if (refcount_release(&set->cs_ref) == 0) return; mtx_lock_spin(&cpuset_lock); LIST_REMOVE(set, cs_siblings); if (set->cs_id != CPUSET_INVALID) LIST_REMOVE(set, cs_link); LIST_INSERT_HEAD(head, set, cs_link); mtx_unlock_spin(&cpuset_lock); } /* * Complete a deferred release. Removes the set from the list provided to * cpuset_rel_defer. */ static void cpuset_rel_complete(struct cpuset *set) { LIST_REMOVE(set, cs_link); cpuset_rel(set->cs_parent); uma_zfree(cpuset_zone, set); } /* * Find a set based on an id. Returns it with a ref. */ static struct cpuset * cpuset_lookup(cpusetid_t setid, struct thread *td) { struct cpuset *set; if (setid == CPUSET_INVALID) return (NULL); mtx_lock_spin(&cpuset_lock); LIST_FOREACH(set, &cpuset_ids, cs_link) if (set->cs_id == setid) break; if (set) cpuset_ref(set); mtx_unlock_spin(&cpuset_lock); KASSERT(td != NULL, ("[%s:%d] td is NULL", __func__, __LINE__)); if (set != NULL && jailed(td->td_ucred)) { - struct cpuset *rset, *jset; - struct prison *pr; + struct cpuset *jset, *tset; - rset = cpuset_refroot(set); - - pr = td->td_ucred->cr_prison; - mtx_lock(&pr->pr_mtx); - cpuset_ref(pr->pr_cpuset); - jset = pr->pr_cpuset; - mtx_unlock(&pr->pr_mtx); - - if (jset->cs_id != rset->cs_id) { + jset = td->td_ucred->cr_prison->pr_cpuset; + for (tset = set; tset != NULL; tset = tset->cs_parent) + if (tset == jset) + break; + if (tset == NULL) { cpuset_rel(set); set = NULL; } - cpuset_rel(jset); - cpuset_rel(rset); } return (set); } /* * Create a set in the space provided in 'set' with the provided parameters. * The set is returned with a single ref. May return EDEADLK if the set * will have no valid cpu based on restrictions from the parent. 
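 *
 * For example: if the parent's mask covers only CPUs 0-1 and the
 * requested mask names only CPUs 2-3, the masks do not overlap, the new
 * set would be empty, and EDEADLK is returned before anything is linked
 * into the tree.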
*/ static int _cpuset_create(struct cpuset *set, struct cpuset *parent, cpuset_t *mask, cpusetid_t id) { if (!CPU_OVERLAP(&parent->cs_mask, mask)) return (EDEADLK); CPU_COPY(mask, &set->cs_mask); LIST_INIT(&set->cs_children); refcount_init(&set->cs_ref, 1); set->cs_flags = 0; mtx_lock_spin(&cpuset_lock); CPU_AND(mask, &parent->cs_mask); set->cs_id = id; set->cs_parent = cpuset_ref(parent); LIST_INSERT_HEAD(&parent->cs_children, set, cs_siblings); if (set->cs_id != CPUSET_INVALID) LIST_INSERT_HEAD(&cpuset_ids, set, cs_link); mtx_unlock_spin(&cpuset_lock); return (0); } /* * Create a new non-anonymous set with the requested parent and mask. May * return failures if the mask is invalid or a new number can not be * allocated. */ static int cpuset_create(struct cpuset **setp, struct cpuset *parent, cpuset_t *mask) { struct cpuset *set; cpusetid_t id; int error; id = alloc_unr(cpuset_unr); if (id == -1) return (ENFILE); *setp = set = uma_zalloc(cpuset_zone, M_WAITOK); error = _cpuset_create(set, parent, mask, id); if (error == 0) return (0); free_unr(cpuset_unr, id); uma_zfree(cpuset_zone, set); return (error); } /* * Recursively check for errors that would occur from applying mask to * the tree of sets starting at 'set'. Checks for sets that would become * empty as well as RDONLY flags. */ static int cpuset_testupdate(struct cpuset *set, cpuset_t *mask) { struct cpuset *nset; cpuset_t newmask; int error; mtx_assert(&cpuset_lock, MA_OWNED); if (set->cs_flags & CPU_SET_RDONLY) return (EPERM); if (!CPU_OVERLAP(&set->cs_mask, mask)) return (EDEADLK); CPU_COPY(&set->cs_mask, &newmask); CPU_AND(&newmask, mask); error = 0; LIST_FOREACH(nset, &set->cs_children, cs_siblings) if ((error = cpuset_testupdate(nset, &newmask)) != 0) break; return (error); } /* * Applies the mask 'mask' without checking for empty sets or permissions. */ static void cpuset_update(struct cpuset *set, cpuset_t *mask) { struct cpuset *nset; mtx_assert(&cpuset_lock, MA_OWNED); CPU_AND(&set->cs_mask, mask); LIST_FOREACH(nset, &set->cs_children, cs_siblings) cpuset_update(nset, &set->cs_mask); return; } /* * Modify the set 'set' to use a copy of the mask provided. Apply this new * mask to restrict all children in the tree. Checks for validity before * applying the changes. */ static int cpuset_modify(struct cpuset *set, cpuset_t *mask) { struct cpuset *root; int error; error = priv_check(curthread, PRIV_SCHED_CPUSET); if (error) return (error); /* * In case we are called from within the jail * we do not allow modifying the dedicated root * cpuset of the jail but may still allow to * change child sets. */ if (jailed(curthread->td_ucred) && set->cs_flags & CPU_SET_ROOT) return (EPERM); /* * Verify that we have access to this set of * cpus. */ root = set->cs_parent; if (root && !CPU_SUBSET(&root->cs_mask, mask)) return (EINVAL); mtx_lock_spin(&cpuset_lock); error = cpuset_testupdate(set, mask); if (error) goto out; cpuset_update(set, mask); CPU_COPY(mask, &set->cs_mask); out: mtx_unlock_spin(&cpuset_lock); return (error); } /* * Resolve the 'which' parameter of several cpuset apis. * * For WHICH_PID and WHICH_TID return a locked proc and valid proc/tid. Also * checks for permission via p_cansched(). * * For WHICH_SET returns a valid set with a new reference. * * -1 may be supplied for any argument to mean the current proc/thread or * the base set of the current thread. May fail with ESRCH/EPERM. 
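 *
 * The -1 convention is what allows userland to act on itself without
 * first looking up its own ids; a minimal, purely illustrative sketch
 * (userland code, assuming <sys/param.h>, <sys/cpuset.h> and <err.h>):
 *
 *	cpuset_t mask;
 *
 *	if (cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, (id_t)-1,
 *	    sizeof(mask), &mask) != 0)
 *		err(1, "cpuset_getaffinity");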
*/ static int cpuset_which(cpuwhich_t which, id_t id, struct proc **pp, struct thread **tdp, struct cpuset **setp) { struct cpuset *set; struct thread *td; struct proc *p; int error; *pp = p = NULL; *tdp = td = NULL; *setp = set = NULL; switch (which) { case CPU_WHICH_PID: if (id == -1) { PROC_LOCK(curproc); p = curproc; break; } if ((p = pfind(id)) == NULL) return (ESRCH); break; case CPU_WHICH_TID: if (id == -1) { PROC_LOCK(curproc); p = curproc; td = curthread; break; } sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { PROC_LOCK(p); FOREACH_THREAD_IN_PROC(p, td) if (td->td_tid == id) break; if (td != NULL) break; PROC_UNLOCK(p); } sx_sunlock(&allproc_lock); if (td == NULL) return (ESRCH); break; case CPU_WHICH_CPUSET: if (id == -1) { thread_lock(curthread); set = cpuset_refbase(curthread->td_cpuset); thread_unlock(curthread); } else set = cpuset_lookup(id, curthread); if (set) { *setp = set; return (0); } return (ESRCH); case CPU_WHICH_JAIL: { /* Find `set' for prison with given id. */ struct prison *pr; sx_slock(&allprison_lock); - pr = prison_find(id); + pr = prison_find_child(curthread->td_ucred->cr_prison, id); sx_sunlock(&allprison_lock); if (pr == NULL) return (ESRCH); - if (jailed(curthread->td_ucred)) { - if (curthread->td_ucred->cr_prison == pr) { - cpuset_ref(pr->pr_cpuset); - set = pr->pr_cpuset; - } - } else { - cpuset_ref(pr->pr_cpuset); - set = pr->pr_cpuset; - } + cpuset_ref(pr->pr_cpuset); + *setp = pr->pr_cpuset; mtx_unlock(&pr->pr_mtx); - if (set) { - *setp = set; - return (0); - } - return (ESRCH); + return (0); } case CPU_WHICH_IRQ: return (0); default: return (EINVAL); } error = p_cansched(curthread, p); if (error) { PROC_UNLOCK(p); return (error); } if (td == NULL) td = FIRST_THREAD_IN_PROC(p); *pp = p; *tdp = td; return (0); } /* * Create an anonymous set with the provided mask in the space provided by * 'fset'. If the passed in set is anonymous we use its parent otherwise * the new set is a child of 'set'. */ static int cpuset_shadow(struct cpuset *set, struct cpuset *fset, cpuset_t *mask) { struct cpuset *parent; if (set->cs_id == CPUSET_INVALID) parent = set->cs_parent; else parent = set; if (!CPU_SUBSET(&parent->cs_mask, mask)) return (EDEADLK); return (_cpuset_create(fset, parent, mask, CPUSET_INVALID)); } /* * Handle two cases for replacing the base set or mask of an entire process. * * 1) Set is non-null and mask is null. This reparents all anonymous sets * to the provided set and replaces all non-anonymous td_cpusets with the * provided set. * 2) Mask is non-null and set is null. This replaces or creates anonymous * sets for every thread with the existing base as a parent. * * This is overly complicated because we can't allocate while holding a * spinlock and spinlocks must be held while changing and examining thread * state. */ static int cpuset_setproc(pid_t pid, struct cpuset *set, cpuset_t *mask) { struct setlist freelist; struct setlist droplist; struct cpuset *tdset; struct cpuset *nset; struct thread *td; struct proc *p; int threads; int nfree; int error; /* * The algorithm requires two passes due to locking considerations. * * 1) Lookup the process and acquire the locks in the required order. * 2) If enough cpusets have not been allocated release the locks and * allocate them. Loop. 
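 *
 * Schematically (this only restates the loop below, it adds no behaviour):
 *
 *	for (;;) {
 *		look up and lock the process;
 *		if (preallocated sets >= p_numthreads)
 *			break;
 *		unlock, uma_zalloc() the shortfall onto 'freelist', retry;
 *	}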
*/ LIST_INIT(&freelist); LIST_INIT(&droplist); nfree = 0; for (;;) { error = cpuset_which(CPU_WHICH_PID, pid, &p, &td, &nset); if (error) goto out; if (nfree >= p->p_numthreads) break; threads = p->p_numthreads; PROC_UNLOCK(p); for (; nfree < threads; nfree++) { nset = uma_zalloc(cpuset_zone, M_WAITOK); LIST_INSERT_HEAD(&freelist, nset, cs_link); } } PROC_LOCK_ASSERT(p, MA_OWNED); /* * Now that the appropriate locks are held and we have enough cpusets, * make sure the operation will succeed before applying changes. The * proc lock prevents td_cpuset from changing between calls. */ error = 0; FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); tdset = td->td_cpuset; /* * Verify that a new mask doesn't specify cpus outside of * the set the thread is a member of. */ if (mask) { if (tdset->cs_id == CPUSET_INVALID) tdset = tdset->cs_parent; if (!CPU_SUBSET(&tdset->cs_mask, mask)) error = EDEADLK; /* * Verify that a new set won't leave an existing thread * mask without a cpu to run on. It can, however, restrict * the set. */ } else if (tdset->cs_id == CPUSET_INVALID) { if (!CPU_OVERLAP(&set->cs_mask, &tdset->cs_mask)) error = EDEADLK; } thread_unlock(td); if (error) goto unlock_out; } /* * Replace each thread's cpuset while using deferred release. We * must do this because the thread lock must be held while operating * on the thread and this limits the type of operations allowed. */ FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); /* * If we presently have an anonymous set or are applying a * mask we must create an anonymous shadow set. That is * either parented to our existing base or the supplied set. * * If we have a base set with no anonymous shadow we simply * replace it outright. */ tdset = td->td_cpuset; if (tdset->cs_id == CPUSET_INVALID || mask) { nset = LIST_FIRST(&freelist); LIST_REMOVE(nset, cs_link); if (mask) error = cpuset_shadow(tdset, nset, mask); else error = _cpuset_create(nset, set, &tdset->cs_mask, CPUSET_INVALID); if (error) { LIST_INSERT_HEAD(&freelist, nset, cs_link); thread_unlock(td); break; } } else nset = cpuset_ref(set); cpuset_rel_defer(&droplist, tdset); td->td_cpuset = nset; sched_affinity(td); thread_unlock(td); } unlock_out: PROC_UNLOCK(p); out: while ((nset = LIST_FIRST(&droplist)) != NULL) cpuset_rel_complete(nset); while ((nset = LIST_FIRST(&freelist)) != NULL) { LIST_REMOVE(nset, cs_link); uma_zfree(cpuset_zone, nset); } return (error); } /* * Apply an anonymous mask to a single thread. */ int cpuset_setthread(lwpid_t id, cpuset_t *mask) { struct cpuset *nset; struct cpuset *set; struct thread *td; struct proc *p; int error; nset = uma_zalloc(cpuset_zone, M_WAITOK); error = cpuset_which(CPU_WHICH_TID, id, &p, &td, &set); if (error) goto out; set = NULL; thread_lock(td); error = cpuset_shadow(td->td_cpuset, nset, mask); if (error == 0) { set = td->td_cpuset; td->td_cpuset = nset; sched_affinity(td); nset = NULL; } thread_unlock(td); PROC_UNLOCK(p); if (set) cpuset_rel(set); out: if (nset) uma_zfree(cpuset_zone, nset); return (error); } /* * Creates the cpuset for thread0. We make two sets: * * 0 - The root set which should represent all valid processors in the * system. It is initially created with a mask of all processors * because we don't know what processors are valid until cpuset_init() * runs. This set is immutable. * 1 - The default set which all processes are a member of until changed. * This allows an administrator to move all threads off of given cpus to * dedicate them to high priority tasks or save power etc. 
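 *
 * Roughly, the resulting tree looks like:
 *
 *	set 0  (root, CPU_SET_ROOT; made read-only by cpuset_init())
 *	  `-- set 1  (default base set that all processes start in)
 *	  `-- numbered sets later created via cpuset(2)
 *
 * with anonymous per-thread masks parented to whatever base set the
 * thread belonged to when the mask was applied.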
*/ struct cpuset * cpuset_thread0(void) { struct cpuset *set; int error; cpuset_zone = uma_zcreate("cpuset", sizeof(struct cpuset), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); mtx_init(&cpuset_lock, "cpuset", NULL, MTX_SPIN | MTX_RECURSE); /* * Create the root system set for the whole machine. Doesn't use * cpuset_create() due to NULL parent. */ set = uma_zalloc(cpuset_zone, M_WAITOK | M_ZERO); set->cs_mask.__bits[0] = -1; LIST_INIT(&set->cs_children); LIST_INSERT_HEAD(&cpuset_ids, set, cs_link); set->cs_ref = 1; set->cs_flags = CPU_SET_ROOT; cpuset_zero = set; cpuset_root = &set->cs_mask; /* * Now derive a default, modifiable set from that to give out. */ set = uma_zalloc(cpuset_zone, M_WAITOK); error = _cpuset_create(set, cpuset_zero, &cpuset_zero->cs_mask, 1); KASSERT(error == 0, ("Error creating default set: %d\n", error)); /* * Initialize the unit allocator. 0 and 1 are allocated above. */ cpuset_unr = new_unrhdr(2, INT_MAX, NULL); return (set); } /* * Create a cpuset, which would be cpuset_create() but * mark the new 'set' as root. * * We are not going to reparent the td to it. Use cpuset_setproc_update_set() * for that. * * In case of no error, returns the set in *setp locked with a reference. */ int -cpuset_create_root(struct thread *td, struct cpuset **setp) +cpuset_create_root(struct prison *pr, struct cpuset **setp) { - struct cpuset *root; struct cpuset *set; int error; - KASSERT(td != NULL, ("[%s:%d] invalid td", __func__, __LINE__)); + KASSERT(pr != NULL, ("[%s:%d] invalid pr", __func__, __LINE__)); KASSERT(setp != NULL, ("[%s:%d] invalid setp", __func__, __LINE__)); - thread_lock(td); - root = cpuset_refroot(td->td_cpuset); - thread_unlock(td); - - error = cpuset_create(setp, td->td_cpuset, &root->cs_mask); - cpuset_rel(root); + error = cpuset_create(setp, pr->pr_cpuset, &pr->pr_cpuset->cs_mask); if (error) return (error); KASSERT(*setp != NULL, ("[%s:%d] cpuset_create returned invalid data", __func__, __LINE__)); /* Mark the set as root. */ set = *setp; set->cs_flags |= CPU_SET_ROOT; return (0); } int cpuset_setproc_update_set(struct proc *p, struct cpuset *set) { int error; KASSERT(p != NULL, ("[%s:%d] invalid proc", __func__, __LINE__)); KASSERT(set != NULL, ("[%s:%d] invalid set", __func__, __LINE__)); cpuset_ref(set); error = cpuset_setproc(p->p_pid, set, NULL); if (error) return (error); cpuset_rel(set); return (0); } /* * This is called once the final set of system cpus is known. Modifies * the root set and all children and mark the root readonly. 
*/ static void cpuset_init(void *arg) { cpuset_t mask; CPU_ZERO(&mask); #ifdef SMP mask.__bits[0] = all_cpus; #else mask.__bits[0] = 1; #endif if (cpuset_modify(cpuset_zero, &mask)) panic("Can't set initial cpuset mask.\n"); cpuset_zero->cs_flags |= CPU_SET_RDONLY; } SYSINIT(cpuset, SI_SUB_SMP, SI_ORDER_ANY, cpuset_init, NULL); #ifndef _SYS_SYSPROTO_H_ struct cpuset_args { cpusetid_t *setid; }; #endif int cpuset(struct thread *td, struct cpuset_args *uap) { struct cpuset *root; struct cpuset *set; int error; thread_lock(td); root = cpuset_refroot(td->td_cpuset); thread_unlock(td); error = cpuset_create(&set, root, &root->cs_mask); cpuset_rel(root); if (error) return (error); error = copyout(&set->cs_id, uap->setid, sizeof(set->cs_id)); if (error == 0) error = cpuset_setproc(-1, set, NULL); cpuset_rel(set); return (error); } #ifndef _SYS_SYSPROTO_H_ struct cpuset_setid_args { cpuwhich_t which; id_t id; cpusetid_t setid; }; #endif int cpuset_setid(struct thread *td, struct cpuset_setid_args *uap) { struct cpuset *set; int error; /* * Presently we only support per-process sets. */ if (uap->which != CPU_WHICH_PID) return (EINVAL); set = cpuset_lookup(uap->setid, td); if (set == NULL) return (ESRCH); error = cpuset_setproc(uap->id, set, NULL); cpuset_rel(set); return (error); } #ifndef _SYS_SYSPROTO_H_ struct cpuset_getid_args { cpulevel_t level; cpuwhich_t which; id_t id; cpusetid_t *setid; #endif int cpuset_getid(struct thread *td, struct cpuset_getid_args *uap) { struct cpuset *nset; struct cpuset *set; struct thread *ttd; struct proc *p; cpusetid_t id; int error; if (uap->level == CPU_LEVEL_WHICH && uap->which != CPU_WHICH_CPUSET) return (EINVAL); error = cpuset_which(uap->which, uap->id, &p, &ttd, &set); if (error) return (error); switch (uap->which) { case CPU_WHICH_TID: case CPU_WHICH_PID: thread_lock(ttd); set = cpuset_refbase(ttd->td_cpuset); thread_unlock(ttd); PROC_UNLOCK(p); break; case CPU_WHICH_CPUSET: case CPU_WHICH_JAIL: break; case CPU_WHICH_IRQ: return (EINVAL); } switch (uap->level) { case CPU_LEVEL_ROOT: nset = cpuset_refroot(set); cpuset_rel(set); set = nset; break; case CPU_LEVEL_CPUSET: break; case CPU_LEVEL_WHICH: break; } id = set->cs_id; cpuset_rel(set); if (error == 0) error = copyout(&id, uap->setid, sizeof(id)); return (error); } #ifndef _SYS_SYSPROTO_H_ struct cpuset_getaffinity_args { cpulevel_t level; cpuwhich_t which; id_t id; size_t cpusetsize; cpuset_t *mask; }; #endif int cpuset_getaffinity(struct thread *td, struct cpuset_getaffinity_args *uap) { struct thread *ttd; struct cpuset *nset; struct cpuset *set; struct proc *p; cpuset_t *mask; int error; size_t size; if (uap->cpusetsize < sizeof(cpuset_t) || uap->cpusetsize > CPU_MAXSIZE / NBBY) return (ERANGE); size = uap->cpusetsize; mask = malloc(size, M_TEMP, M_WAITOK | M_ZERO); error = cpuset_which(uap->which, uap->id, &p, &ttd, &set); if (error) goto out; switch (uap->level) { case CPU_LEVEL_ROOT: case CPU_LEVEL_CPUSET: switch (uap->which) { case CPU_WHICH_TID: case CPU_WHICH_PID: thread_lock(ttd); set = cpuset_ref(ttd->td_cpuset); thread_unlock(ttd); break; case CPU_WHICH_CPUSET: case CPU_WHICH_JAIL: break; case CPU_WHICH_IRQ: error = EINVAL; goto out; } if (uap->level == CPU_LEVEL_ROOT) nset = cpuset_refroot(set); else nset = cpuset_refbase(set); CPU_COPY(&nset->cs_mask, mask); cpuset_rel(nset); break; case CPU_LEVEL_WHICH: switch (uap->which) { case CPU_WHICH_TID: thread_lock(ttd); CPU_COPY(&ttd->td_cpuset->cs_mask, mask); thread_unlock(ttd); break; case CPU_WHICH_PID: FOREACH_THREAD_IN_PROC(p, ttd) { 
thread_lock(ttd); CPU_OR(mask, &ttd->td_cpuset->cs_mask); thread_unlock(ttd); } break; case CPU_WHICH_CPUSET: case CPU_WHICH_JAIL: CPU_COPY(&set->cs_mask, mask); break; case CPU_WHICH_IRQ: error = intr_getaffinity(uap->id, mask); break; } break; default: error = EINVAL; break; } if (set) cpuset_rel(set); if (p) PROC_UNLOCK(p); if (error == 0) error = copyout(mask, uap->mask, size); out: free(mask, M_TEMP); return (error); } #ifndef _SYS_SYSPROTO_H_ struct cpuset_setaffinity_args { cpulevel_t level; cpuwhich_t which; id_t id; size_t cpusetsize; const cpuset_t *mask; }; #endif int cpuset_setaffinity(struct thread *td, struct cpuset_setaffinity_args *uap) { struct cpuset *nset; struct cpuset *set; struct thread *ttd; struct proc *p; cpuset_t *mask; int error; if (uap->cpusetsize < sizeof(cpuset_t) || uap->cpusetsize > CPU_MAXSIZE / NBBY) return (ERANGE); mask = malloc(uap->cpusetsize, M_TEMP, M_WAITOK | M_ZERO); error = copyin(uap->mask, mask, uap->cpusetsize); if (error) goto out; /* * Verify that no high bits are set. */ if (uap->cpusetsize > sizeof(cpuset_t)) { char *end; char *cp; end = cp = (char *)&mask->__bits; end += uap->cpusetsize; cp += sizeof(cpuset_t); while (cp != end) if (*cp++ != 0) { error = EINVAL; goto out; } } switch (uap->level) { case CPU_LEVEL_ROOT: case CPU_LEVEL_CPUSET: error = cpuset_which(uap->which, uap->id, &p, &ttd, &set); if (error) break; switch (uap->which) { case CPU_WHICH_TID: case CPU_WHICH_PID: thread_lock(ttd); set = cpuset_ref(ttd->td_cpuset); thread_unlock(ttd); PROC_UNLOCK(p); break; case CPU_WHICH_CPUSET: case CPU_WHICH_JAIL: break; case CPU_WHICH_IRQ: error = EINVAL; goto out; } if (uap->level == CPU_LEVEL_ROOT) nset = cpuset_refroot(set); else nset = cpuset_refbase(set); error = cpuset_modify(nset, mask); cpuset_rel(nset); cpuset_rel(set); break; case CPU_LEVEL_WHICH: switch (uap->which) { case CPU_WHICH_TID: error = cpuset_setthread(uap->id, mask); break; case CPU_WHICH_PID: error = cpuset_setproc(uap->id, NULL, mask); break; case CPU_WHICH_CPUSET: case CPU_WHICH_JAIL: error = cpuset_which(uap->which, uap->id, &p, &ttd, &set); if (error == 0) { error = cpuset_modify(set, mask); cpuset_rel(set); } break; case CPU_WHICH_IRQ: error = intr_setaffinity(uap->id, mask); break; default: error = EINVAL; break; } break; default: error = EINVAL; break; } out: free(mask, M_TEMP); return (error); } #ifdef DDB DB_SHOW_COMMAND(cpusets, db_show_cpusets) { struct cpuset *set; int cpu, once; LIST_FOREACH(set, &cpuset_ids, cs_link) { db_printf("set=%p id=%-6u ref=%-6d flags=0x%04x parent id=%d\n", set, set->cs_id, set->cs_ref, set->cs_flags, (set->cs_parent != NULL) ? set->cs_parent->cs_id : 0); db_printf(" mask="); for (once = 0, cpu = 0; cpu < CPU_SETSIZE; cpu++) { if (CPU_ISSET(cpu, &set->cs_mask)) { if (once == 0) { db_printf("%d", cpu); once = 1; } else db_printf(",%d", cpu); } } db_printf("\n"); if (db_pager_quit) break; } } #endif /* DDB */ Index: head/sys/kern/kern_descrip.c =================================================================== --- head/sys/kern/kern_descrip.c (revision 192894) +++ head/sys/kern/kern_descrip.c (revision 192895) @@ -1,3334 +1,3358 @@ /*- * Copyright (c) 1982, 1986, 1989, 1991, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. 
and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_descrip.c 8.6 (Berkeley) 4/19/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_ddb.h" #include "opt_ktrace.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef KTRACE #include #endif #include #include #include static MALLOC_DEFINE(M_FILEDESC, "filedesc", "Open file descriptor table"); static MALLOC_DEFINE(M_FILEDESC_TO_LEADER, "filedesc_to_leader", "file desc to leader structures"); static MALLOC_DEFINE(M_SIGIO, "sigio", "sigio structures"); static uma_zone_t file_zone; /* Flags for do_dup() */ #define DUP_FIXED 0x1 /* Force fixed allocation */ #define DUP_FCNTL 0x2 /* fcntl()-style errors */ static int do_dup(struct thread *td, int flags, int old, int new, register_t *retval); static int fd_first_free(struct filedesc *, int, int); static int fd_last_used(struct filedesc *, int, int); static void fdgrowtable(struct filedesc *, int); static void fdunused(struct filedesc *fdp, int fd); static void fdused(struct filedesc *fdp, int fd); /* * A process is initially started out with NDFILE descriptors stored within * this structure, selected to be enough for typical applications based on * the historical limit of 20 open files (and the usage of descriptors by * shells). If these descriptors are exhausted, a larger descriptor table * may be allocated, up to a process' resource limit; the internal arrays * are then unused. */ #define NDFILE 20 #define NDSLOTSIZE sizeof(NDSLOTTYPE) #define NDENTRIES (NDSLOTSIZE * __CHAR_BIT) #define NDSLOT(x) ((x) / NDENTRIES) #define NDBIT(x) ((NDSLOTTYPE)1 << ((x) % NDENTRIES)) #define NDSLOTS(x) (((x) + NDENTRIES - 1) / NDENTRIES) /* * Storage required per open file descriptor. 
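 * (That is: one struct file pointer in fd_ofiles plus one flag byte in
 * fd_ofileflags; fdgrowtable() below carves both arrays out of a single
 * allocation sized with this macro.)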
*/ #define OFILESIZE (sizeof(struct file *) + sizeof(char)) /* * Storage to hold unused ofiles that need to be reclaimed. */ struct freetable { struct file **ft_table; SLIST_ENTRY(freetable) ft_next; }; /* * Basic allocation of descriptors: * one of the above, plus arrays for NDFILE descriptors. */ struct filedesc0 { struct filedesc fd_fd; /* * ofiles which need to be reclaimed on free. */ SLIST_HEAD(,freetable) fd_free; /* * These arrays are used when the number of open files is * <= NDFILE, and are then pointed to by the pointers above. */ struct file *fd_dfiles[NDFILE]; char fd_dfileflags[NDFILE]; NDSLOTTYPE fd_dmap[NDSLOTS(NDFILE)]; }; /* * Descriptor management. */ volatile int openfiles; /* actual number of open files */ struct mtx sigio_lock; /* mtx to protect pointers to sigio */ void (*mq_fdclose)(struct thread *td, int fd, struct file *fp); /* A mutex to protect the association between a proc and filedesc. */ static struct mtx fdesc_mtx; /* * Find the first zero bit in the given bitmap, starting at low and not * exceeding size - 1. */ static int fd_first_free(struct filedesc *fdp, int low, int size) { NDSLOTTYPE *map = fdp->fd_map; NDSLOTTYPE mask; int off, maxoff; if (low >= size) return (low); off = NDSLOT(low); if (low % NDENTRIES) { mask = ~(~(NDSLOTTYPE)0 >> (NDENTRIES - (low % NDENTRIES))); if ((mask &= ~map[off]) != 0UL) return (off * NDENTRIES + ffsl(mask) - 1); ++off; } for (maxoff = NDSLOTS(size); off < maxoff; ++off) if (map[off] != ~0UL) return (off * NDENTRIES + ffsl(~map[off]) - 1); return (size); } /* * Find the highest non-zero bit in the given bitmap, starting at low and * not exceeding size - 1. */ static int fd_last_used(struct filedesc *fdp, int low, int size) { NDSLOTTYPE *map = fdp->fd_map; NDSLOTTYPE mask; int off, minoff; if (low >= size) return (-1); off = NDSLOT(size); if (size % NDENTRIES) { mask = ~(~(NDSLOTTYPE)0 << (size % NDENTRIES)); if ((mask &= map[off]) != 0) return (off * NDENTRIES + flsl(mask) - 1); --off; } for (minoff = NDSLOT(low); off >= minoff; --off) if (map[off] != 0) return (off * NDENTRIES + flsl(map[off]) - 1); return (low - 1); } static int fdisused(struct filedesc *fdp, int fd) { KASSERT(fd >= 0 && fd < fdp->fd_nfiles, ("file descriptor %d out of range (0, %d)", fd, fdp->fd_nfiles)); return ((fdp->fd_map[NDSLOT(fd)] & NDBIT(fd)) != 0); } /* * Mark a file descriptor as used. */ static void fdused(struct filedesc *fdp, int fd) { FILEDESC_XLOCK_ASSERT(fdp); KASSERT(!fdisused(fdp, fd), ("fd already used")); fdp->fd_map[NDSLOT(fd)] |= NDBIT(fd); if (fd > fdp->fd_lastfile) fdp->fd_lastfile = fd; if (fd == fdp->fd_freefile) fdp->fd_freefile = fd_first_free(fdp, fd, fdp->fd_nfiles); } /* * Mark a file descriptor as unused. */ static void fdunused(struct filedesc *fdp, int fd) { FILEDESC_XLOCK_ASSERT(fdp); KASSERT(fdisused(fdp, fd), ("fd is already unused")); KASSERT(fdp->fd_ofiles[fd] == NULL, ("fd is still in use")); fdp->fd_map[NDSLOT(fd)] &= ~NDBIT(fd); if (fd < fdp->fd_freefile) fdp->fd_freefile = fd; if (fd == fdp->fd_lastfile) fdp->fd_lastfile = fd_last_used(fdp, 0, fd); } /* * System calls on descriptors. */ #ifndef _SYS_SYSPROTO_H_ struct getdtablesize_args { int dummy; }; #endif /* ARGSUSED */ int getdtablesize(struct thread *td, struct getdtablesize_args *uap) { struct proc *p = td->td_proc; PROC_LOCK(p); td->td_retval[0] = min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc); PROC_UNLOCK(p); return (0); } /* * Duplicate a file descriptor to a particular value. 
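 *
 * A typical userland use (shown purely for illustration) is redirecting
 * stdout before exec'ing a child:
 *
 *	if (dup2(fd, STDOUT_FILENO) == -1)
 *		err(1, "dup2");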
* * Note: keep in mind that a potential race condition exists when closing * descriptors from a shared descriptor table (via rfork). */ #ifndef _SYS_SYSPROTO_H_ struct dup2_args { u_int from; u_int to; }; #endif /* ARGSUSED */ int dup2(struct thread *td, struct dup2_args *uap) { return (do_dup(td, DUP_FIXED, (int)uap->from, (int)uap->to, td->td_retval)); } /* * Duplicate a file descriptor. */ #ifndef _SYS_SYSPROTO_H_ struct dup_args { u_int fd; }; #endif /* ARGSUSED */ int dup(struct thread *td, struct dup_args *uap) { return (do_dup(td, 0, (int)uap->fd, 0, td->td_retval)); } /* * The file control system call. */ #ifndef _SYS_SYSPROTO_H_ struct fcntl_args { int fd; int cmd; long arg; }; #endif /* ARGSUSED */ int fcntl(struct thread *td, struct fcntl_args *uap) { struct flock fl; struct oflock ofl; intptr_t arg; int error; int cmd; error = 0; cmd = uap->cmd; switch (uap->cmd) { case F_OGETLK: case F_OSETLK: case F_OSETLKW: /* * Convert old flock structure to new. */ error = copyin((void *)(intptr_t)uap->arg, &ofl, sizeof(ofl)); fl.l_start = ofl.l_start; fl.l_len = ofl.l_len; fl.l_pid = ofl.l_pid; fl.l_type = ofl.l_type; fl.l_whence = ofl.l_whence; fl.l_sysid = 0; switch (uap->cmd) { case F_OGETLK: cmd = F_GETLK; break; case F_OSETLK: cmd = F_SETLK; break; case F_OSETLKW: cmd = F_SETLKW; break; } arg = (intptr_t)&fl; break; case F_GETLK: case F_SETLK: case F_SETLKW: case F_SETLK_REMOTE: error = copyin((void *)(intptr_t)uap->arg, &fl, sizeof(fl)); arg = (intptr_t)&fl; break; default: arg = uap->arg; break; } if (error) return (error); error = kern_fcntl(td, uap->fd, cmd, arg); if (error) return (error); if (uap->cmd == F_OGETLK) { ofl.l_start = fl.l_start; ofl.l_len = fl.l_len; ofl.l_pid = fl.l_pid; ofl.l_type = fl.l_type; ofl.l_whence = fl.l_whence; error = copyout(&ofl, (void *)(intptr_t)uap->arg, sizeof(ofl)); } else if (uap->cmd == F_GETLK) { error = copyout(&fl, (void *)(intptr_t)uap->arg, sizeof(fl)); } return (error); } static inline struct file * fdtofp(int fd, struct filedesc *fdp) { struct file *fp; FILEDESC_LOCK_ASSERT(fdp); if ((unsigned)fd >= fdp->fd_nfiles || (fp = fdp->fd_ofiles[fd]) == NULL) return (NULL); return (fp); } int kern_fcntl(struct thread *td, int fd, int cmd, intptr_t arg) { struct filedesc *fdp; struct flock *flp; struct file *fp; struct proc *p; char *pop; struct vnode *vp; int error, flg, tmp; int vfslocked; vfslocked = 0; error = 0; flg = F_POSIX; p = td->td_proc; fdp = p->p_fd; switch (cmd) { case F_DUPFD: tmp = arg; error = do_dup(td, DUP_FCNTL, fd, tmp, td->td_retval); break; case F_DUP2FD: tmp = arg; error = do_dup(td, DUP_FIXED, fd, tmp, td->td_retval); break; case F_GETFD: FILEDESC_SLOCK(fdp); if ((fp = fdtofp(fd, fdp)) == NULL) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } pop = &fdp->fd_ofileflags[fd]; td->td_retval[0] = (*pop & UF_EXCLOSE) ? FD_CLOEXEC : 0; FILEDESC_SUNLOCK(fdp); break; case F_SETFD: FILEDESC_XLOCK(fdp); if ((fp = fdtofp(fd, fdp)) == NULL) { FILEDESC_XUNLOCK(fdp); error = EBADF; break; } pop = &fdp->fd_ofileflags[fd]; *pop = (*pop &~ UF_EXCLOSE) | (arg & FD_CLOEXEC ? 
UF_EXCLOSE : 0); FILEDESC_XUNLOCK(fdp); break; case F_GETFL: FILEDESC_SLOCK(fdp); if ((fp = fdtofp(fd, fdp)) == NULL) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } td->td_retval[0] = OFLAGS(fp->f_flag); FILEDESC_SUNLOCK(fdp); break; case F_SETFL: FILEDESC_SLOCK(fdp); if ((fp = fdtofp(fd, fdp)) == NULL) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } fhold(fp); FILEDESC_SUNLOCK(fdp); do { tmp = flg = fp->f_flag; tmp &= ~FCNTLFLAGS; tmp |= FFLAGS(arg & ~O_ACCMODE) & FCNTLFLAGS; } while(atomic_cmpset_int(&fp->f_flag, flg, tmp) == 0); tmp = fp->f_flag & FNONBLOCK; error = fo_ioctl(fp, FIONBIO, &tmp, td->td_ucred, td); if (error) { fdrop(fp, td); break; } tmp = fp->f_flag & FASYNC; error = fo_ioctl(fp, FIOASYNC, &tmp, td->td_ucred, td); if (error == 0) { fdrop(fp, td); break; } atomic_clear_int(&fp->f_flag, FNONBLOCK); tmp = 0; (void)fo_ioctl(fp, FIONBIO, &tmp, td->td_ucred, td); fdrop(fp, td); break; case F_GETOWN: FILEDESC_SLOCK(fdp); if ((fp = fdtofp(fd, fdp)) == NULL) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } fhold(fp); FILEDESC_SUNLOCK(fdp); error = fo_ioctl(fp, FIOGETOWN, &tmp, td->td_ucred, td); if (error == 0) td->td_retval[0] = tmp; fdrop(fp, td); break; case F_SETOWN: FILEDESC_SLOCK(fdp); if ((fp = fdtofp(fd, fdp)) == NULL) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } fhold(fp); FILEDESC_SUNLOCK(fdp); tmp = arg; error = fo_ioctl(fp, FIOSETOWN, &tmp, td->td_ucred, td); fdrop(fp, td); break; case F_SETLK_REMOTE: error = priv_check(td, PRIV_NFS_LOCKD); if (error) return (error); flg = F_REMOTE; goto do_setlk; case F_SETLKW: flg |= F_WAIT; /* FALLTHROUGH F_SETLK */ case F_SETLK: do_setlk: FILEDESC_SLOCK(fdp); if ((fp = fdtofp(fd, fdp)) == NULL) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } if (fp->f_type != DTYPE_VNODE) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } flp = (struct flock *)arg; if (flp->l_whence == SEEK_CUR) { if (fp->f_offset < 0 || (flp->l_start > 0 && fp->f_offset > OFF_MAX - flp->l_start)) { FILEDESC_SUNLOCK(fdp); error = EOVERFLOW; break; } flp->l_start += fp->f_offset; } /* * VOP_ADVLOCK() may block. */ fhold(fp); FILEDESC_SUNLOCK(fdp); vp = fp->f_vnode; vfslocked = VFS_LOCK_GIANT(vp->v_mount); switch (flp->l_type) { case F_RDLCK: if ((fp->f_flag & FREAD) == 0) { error = EBADF; break; } PROC_LOCK(p->p_leader); p->p_leader->p_flag |= P_ADVLOCK; PROC_UNLOCK(p->p_leader); error = VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_SETLK, flp, flg); break; case F_WRLCK: if ((fp->f_flag & FWRITE) == 0) { error = EBADF; break; } PROC_LOCK(p->p_leader); p->p_leader->p_flag |= P_ADVLOCK; PROC_UNLOCK(p->p_leader); error = VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_SETLK, flp, flg); break; case F_UNLCK: error = VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_UNLCK, flp, flg); break; case F_UNLCKSYS: /* * Temporary api for testing remote lock * infrastructure. 
*/ if (flg != F_REMOTE) { error = EINVAL; break; } error = VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_UNLCKSYS, flp, flg); break; default: error = EINVAL; break; } VFS_UNLOCK_GIANT(vfslocked); vfslocked = 0; /* Check for race with close */ FILEDESC_SLOCK(fdp); if ((unsigned) fd >= fdp->fd_nfiles || fp != fdp->fd_ofiles[fd]) { FILEDESC_SUNLOCK(fdp); flp->l_whence = SEEK_SET; flp->l_start = 0; flp->l_len = 0; flp->l_type = F_UNLCK; vfslocked = VFS_LOCK_GIANT(vp->v_mount); (void) VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_UNLCK, flp, F_POSIX); VFS_UNLOCK_GIANT(vfslocked); vfslocked = 0; } else FILEDESC_SUNLOCK(fdp); fdrop(fp, td); break; case F_GETLK: FILEDESC_SLOCK(fdp); if ((fp = fdtofp(fd, fdp)) == NULL) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } if (fp->f_type != DTYPE_VNODE) { FILEDESC_SUNLOCK(fdp); error = EBADF; break; } flp = (struct flock *)arg; if (flp->l_type != F_RDLCK && flp->l_type != F_WRLCK && flp->l_type != F_UNLCK) { FILEDESC_SUNLOCK(fdp); error = EINVAL; break; } if (flp->l_whence == SEEK_CUR) { if ((flp->l_start > 0 && fp->f_offset > OFF_MAX - flp->l_start) || (flp->l_start < 0 && fp->f_offset < OFF_MIN - flp->l_start)) { FILEDESC_SUNLOCK(fdp); error = EOVERFLOW; break; } flp->l_start += fp->f_offset; } /* * VOP_ADVLOCK() may block. */ fhold(fp); FILEDESC_SUNLOCK(fdp); vp = fp->f_vnode; vfslocked = VFS_LOCK_GIANT(vp->v_mount); error = VOP_ADVLOCK(vp, (caddr_t)p->p_leader, F_GETLK, flp, F_POSIX); VFS_UNLOCK_GIANT(vfslocked); vfslocked = 0; fdrop(fp, td); break; default: error = EINVAL; break; } VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Common code for dup, dup2, fcntl(F_DUPFD) and fcntl(F_DUP2FD). */ static int do_dup(struct thread *td, int flags, int old, int new, register_t *retval) { struct filedesc *fdp; struct proc *p; struct file *fp; struct file *delfp; int error, holdleaders, maxfd; p = td->td_proc; fdp = p->p_fd; /* * Verify we have a valid descriptor to dup from and possibly to * dup to. Unlike dup() and dup2(), fcntl()'s F_DUPFD should * return EINVAL when the new descriptor is out of bounds. */ if (old < 0) return (EBADF); if (new < 0) return (flags & DUP_FCNTL ? EINVAL : EBADF); PROC_LOCK(p); maxfd = min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc); PROC_UNLOCK(p); if (new >= maxfd) return (flags & DUP_FCNTL ? EINVAL : EMFILE); FILEDESC_XLOCK(fdp); if (old >= fdp->fd_nfiles || fdp->fd_ofiles[old] == NULL) { FILEDESC_XUNLOCK(fdp); return (EBADF); } if (flags & DUP_FIXED && old == new) { *retval = new; FILEDESC_XUNLOCK(fdp); return (0); } fp = fdp->fd_ofiles[old]; fhold(fp); /* * If the caller specified a file descriptor, make sure the file * table is large enough to hold it, and grab it. Otherwise, just * allocate a new descriptor the usual way. Since the filedesc * lock may be temporarily dropped in the process, we have to look * out for a race. */ if (flags & DUP_FIXED) { if (new >= fdp->fd_nfiles) fdgrowtable(fdp, new + 1); if (fdp->fd_ofiles[new] == NULL) fdused(fdp, new); } else { if ((error = fdalloc(td, new, &new)) != 0) { FILEDESC_XUNLOCK(fdp); fdrop(fp, td); return (error); } } /* * If the old file changed out from under us then treat it as a * bad file descriptor. Userland should do its own locking to * avoid this case. */ if (fdp->fd_ofiles[old] != fp) { /* we've allocated a descriptor which we won't use */ if (fdp->fd_ofiles[new] == NULL) fdunused(fdp, new); FILEDESC_XUNLOCK(fdp); fdrop(fp, td); return (EBADF); } KASSERT(old != new, ("new fd is same as old")); /* * Save info on the descriptor being overwritten. 
We cannot close * it without introducing an ownership race for the slot, since we * need to drop the filedesc lock to call closef(). * * XXX this duplicates parts of close(). */ delfp = fdp->fd_ofiles[new]; holdleaders = 0; if (delfp != NULL) { if (td->td_proc->p_fdtol != NULL) { /* * Ask fdfree() to sleep to ensure that all relevant * process leaders can be traversed in closef(). */ fdp->fd_holdleaderscount++; holdleaders = 1; } } /* * Duplicate the source descriptor */ fdp->fd_ofiles[new] = fp; fdp->fd_ofileflags[new] = fdp->fd_ofileflags[old] &~ UF_EXCLOSE; if (new > fdp->fd_lastfile) fdp->fd_lastfile = new; *retval = new; /* * If we dup'd over a valid file, we now own the reference to it * and must dispose of it using closef() semantics (as if a * close() were performed on it). * * XXX this duplicates parts of close(). */ if (delfp != NULL) { knote_fdclose(td, new); if (delfp->f_type == DTYPE_MQUEUE) mq_fdclose(td, new, delfp); FILEDESC_XUNLOCK(fdp); (void) closef(delfp, td); if (holdleaders) { FILEDESC_XLOCK(fdp); fdp->fd_holdleaderscount--; if (fdp->fd_holdleaderscount == 0 && fdp->fd_holdleaderswakeup != 0) { fdp->fd_holdleaderswakeup = 0; wakeup(&fdp->fd_holdleaderscount); } FILEDESC_XUNLOCK(fdp); } } else { FILEDESC_XUNLOCK(fdp); } return (0); } /* * If sigio is on the list associated with a process or process group, * disable signalling from the device, remove sigio from the list and * free sigio. */ void funsetown(struct sigio **sigiop) { struct sigio *sigio; SIGIO_LOCK(); sigio = *sigiop; if (sigio == NULL) { SIGIO_UNLOCK(); return; } *(sigio->sio_myref) = NULL; if ((sigio)->sio_pgid < 0) { struct pgrp *pg = (sigio)->sio_pgrp; PGRP_LOCK(pg); SLIST_REMOVE(&sigio->sio_pgrp->pg_sigiolst, sigio, sigio, sio_pgsigio); PGRP_UNLOCK(pg); } else { struct proc *p = (sigio)->sio_proc; PROC_LOCK(p); SLIST_REMOVE(&sigio->sio_proc->p_sigiolst, sigio, sigio, sio_pgsigio); PROC_UNLOCK(p); } SIGIO_UNLOCK(); crfree(sigio->sio_ucred); free(sigio, M_SIGIO); } /* * Free a list of sigio structures. * We only need to lock the SIGIO_LOCK because we have made ourselves * inaccessible to callers of fsetown and therefore do not need to lock * the proc or pgrp struct for the list manipulation. */ void funsetownlst(struct sigiolst *sigiolst) { struct proc *p; struct pgrp *pg; struct sigio *sigio; sigio = SLIST_FIRST(sigiolst); if (sigio == NULL) return; p = NULL; pg = NULL; /* * Every entry of the list should belong * to a single proc or pgrp. */ if (sigio->sio_pgid < 0) { pg = sigio->sio_pgrp; PGRP_LOCK_ASSERT(pg, MA_NOTOWNED); } else /* if (sigio->sio_pgid > 0) */ { p = sigio->sio_proc; PROC_LOCK_ASSERT(p, MA_NOTOWNED); } SIGIO_LOCK(); while ((sigio = SLIST_FIRST(sigiolst)) != NULL) { *(sigio->sio_myref) = NULL; if (pg != NULL) { KASSERT(sigio->sio_pgid < 0, ("Proc sigio in pgrp sigio list")); KASSERT(sigio->sio_pgrp == pg, ("Bogus pgrp in sigio list")); PGRP_LOCK(pg); SLIST_REMOVE(&pg->pg_sigiolst, sigio, sigio, sio_pgsigio); PGRP_UNLOCK(pg); } else /* if (p != NULL) */ { KASSERT(sigio->sio_pgid > 0, ("Pgrp sigio in proc sigio list")); KASSERT(sigio->sio_proc == p, ("Bogus proc in sigio list")); PROC_LOCK(p); SLIST_REMOVE(&p->p_sigiolst, sigio, sigio, sio_pgsigio); PROC_UNLOCK(p); } SIGIO_UNLOCK(); crfree(sigio->sio_ucred); free(sigio, M_SIGIO); SIGIO_LOCK(); } SIGIO_UNLOCK(); } /* * This is common code for FIOSETOWN ioctl called by fcntl(fd, F_SETOWN, arg). * * After permission checking, add a sigio structure to the sigio list for * the process or process group. 
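 *
 * For reference only: the corresponding userland request is
 * "fcntl(fd, F_SETOWN, getpid())" to have SIGIO/SIGURG delivered to the
 * calling process, or a negated process group id to target a whole pgrp.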
*/ int fsetown(pid_t pgid, struct sigio **sigiop) { struct proc *proc; struct pgrp *pgrp; struct sigio *sigio; int ret; if (pgid == 0) { funsetown(sigiop); return (0); } ret = 0; /* Allocate and fill in the new sigio out of locks. */ sigio = malloc(sizeof(struct sigio), M_SIGIO, M_WAITOK); sigio->sio_pgid = pgid; sigio->sio_ucred = crhold(curthread->td_ucred); sigio->sio_myref = sigiop; sx_slock(&proctree_lock); if (pgid > 0) { proc = pfind(pgid); if (proc == NULL) { ret = ESRCH; goto fail; } /* * Policy - Don't allow a process to FSETOWN a process * in another session. * * Remove this test to allow maximum flexibility or * restrict FSETOWN to the current process or process * group for maximum safety. */ PROC_UNLOCK(proc); if (proc->p_session != curthread->td_proc->p_session) { ret = EPERM; goto fail; } pgrp = NULL; } else /* if (pgid < 0) */ { pgrp = pgfind(-pgid); if (pgrp == NULL) { ret = ESRCH; goto fail; } PGRP_UNLOCK(pgrp); /* * Policy - Don't allow a process to FSETOWN a process * in another session. * * Remove this test to allow maximum flexibility or * restrict FSETOWN to the current process or process * group for maximum safety. */ if (pgrp->pg_session != curthread->td_proc->p_session) { ret = EPERM; goto fail; } proc = NULL; } funsetown(sigiop); if (pgid > 0) { PROC_LOCK(proc); /* * Since funsetownlst() is called without the proctree * locked, we need to check for P_WEXIT. * XXX: is ESRCH correct? */ if ((proc->p_flag & P_WEXIT) != 0) { PROC_UNLOCK(proc); ret = ESRCH; goto fail; } SLIST_INSERT_HEAD(&proc->p_sigiolst, sigio, sio_pgsigio); sigio->sio_proc = proc; PROC_UNLOCK(proc); } else { PGRP_LOCK(pgrp); SLIST_INSERT_HEAD(&pgrp->pg_sigiolst, sigio, sio_pgsigio); sigio->sio_pgrp = pgrp; PGRP_UNLOCK(pgrp); } sx_sunlock(&proctree_lock); SIGIO_LOCK(); *sigiop = sigio; SIGIO_UNLOCK(); return (0); fail: sx_sunlock(&proctree_lock); crfree(sigio->sio_ucred); free(sigio, M_SIGIO); return (ret); } /* * This is common code for FIOGETOWN ioctl called by fcntl(fd, F_GETOWN, arg). */ pid_t fgetown(sigiop) struct sigio **sigiop; { pid_t pgid; SIGIO_LOCK(); pgid = (*sigiop != NULL) ? (*sigiop)->sio_pgid : 0; SIGIO_UNLOCK(); return (pgid); } /* * Close a file descriptor. */ #ifndef _SYS_SYSPROTO_H_ struct close_args { int fd; }; #endif /* ARGSUSED */ int close(td, uap) struct thread *td; struct close_args *uap; { return (kern_close(td, uap->fd)); } int kern_close(td, fd) struct thread *td; int fd; { struct filedesc *fdp; struct file *fp; int error; int holdleaders; error = 0; holdleaders = 0; fdp = td->td_proc->p_fd; AUDIT_SYSCLOSE(td, fd); FILEDESC_XLOCK(fdp); if ((unsigned)fd >= fdp->fd_nfiles || (fp = fdp->fd_ofiles[fd]) == NULL) { FILEDESC_XUNLOCK(fdp); return (EBADF); } fdp->fd_ofiles[fd] = NULL; fdp->fd_ofileflags[fd] = 0; fdunused(fdp, fd); if (td->td_proc->p_fdtol != NULL) { /* * Ask fdfree() to sleep to ensure that all relevant * process leaders can be traversed in closef(). */ fdp->fd_holdleaderscount++; holdleaders = 1; } /* * We now hold the fp reference that used to be owned by the * descriptor array. We have to unlock the FILEDESC *AFTER* * knote_fdclose to prevent a race of the fd getting opened, a knote * added, and deleteing a knote for the new fd. 
*/ knote_fdclose(td, fd); if (fp->f_type == DTYPE_MQUEUE) mq_fdclose(td, fd, fp); FILEDESC_XUNLOCK(fdp); error = closef(fp, td); if (holdleaders) { FILEDESC_XLOCK(fdp); fdp->fd_holdleaderscount--; if (fdp->fd_holdleaderscount == 0 && fdp->fd_holdleaderswakeup != 0) { fdp->fd_holdleaderswakeup = 0; wakeup(&fdp->fd_holdleaderscount); } FILEDESC_XUNLOCK(fdp); } return (error); } #if defined(COMPAT_43) /* * Return status information about a file descriptor. */ #ifndef _SYS_SYSPROTO_H_ struct ofstat_args { int fd; struct ostat *sb; }; #endif /* ARGSUSED */ int ofstat(struct thread *td, struct ofstat_args *uap) { struct ostat oub; struct stat ub; int error; error = kern_fstat(td, uap->fd, &ub); if (error == 0) { cvtstat(&ub, &oub); error = copyout(&oub, uap->sb, sizeof(oub)); } return (error); } #endif /* COMPAT_43 */ /* * Return status information about a file descriptor. */ #ifndef _SYS_SYSPROTO_H_ struct fstat_args { int fd; struct stat *sb; }; #endif /* ARGSUSED */ int fstat(struct thread *td, struct fstat_args *uap) { struct stat ub; int error; error = kern_fstat(td, uap->fd, &ub); if (error == 0) error = copyout(&ub, uap->sb, sizeof(ub)); return (error); } int kern_fstat(struct thread *td, int fd, struct stat *sbp) { struct file *fp; int error; AUDIT_ARG(fd, fd); if ((error = fget(td, fd, &fp)) != 0) return (error); AUDIT_ARG(file, td->td_proc, fp); error = fo_stat(fp, sbp, td->td_ucred, td); fdrop(fp, td); #ifdef KTRACE if (error == 0 && KTRPOINT(td, KTR_STRUCT)) ktrstat(sbp); #endif return (error); } /* * Return status information about a file descriptor. */ #ifndef _SYS_SYSPROTO_H_ struct nfstat_args { int fd; struct nstat *sb; }; #endif /* ARGSUSED */ int nfstat(struct thread *td, struct nfstat_args *uap) { struct nstat nub; struct stat ub; int error; error = kern_fstat(td, uap->fd, &ub); if (error == 0) { cvtnstat(&ub, &nub); error = copyout(&nub, uap->sb, sizeof(nub)); } return (error); } /* * Return pathconf information about a file descriptor. */ #ifndef _SYS_SYSPROTO_H_ struct fpathconf_args { int fd; int name; }; #endif /* ARGSUSED */ int fpathconf(struct thread *td, struct fpathconf_args *uap) { struct file *fp; struct vnode *vp; int error; if ((error = fget(td, uap->fd, &fp)) != 0) return (error); /* If asynchronous I/O is available, it works for all descriptors. */ if (uap->name == _PC_ASYNC_IO) { td->td_retval[0] = async_io_version; goto out; } vp = fp->f_vnode; if (vp != NULL) { int vfslocked; vfslocked = VFS_LOCK_GIANT(vp->v_mount); vn_lock(vp, LK_SHARED | LK_RETRY); error = VOP_PATHCONF(vp, uap->name, td->td_retval); VOP_UNLOCK(vp, 0); VFS_UNLOCK_GIANT(vfslocked); } else if (fp->f_type == DTYPE_PIPE || fp->f_type == DTYPE_SOCKET) { if (uap->name != _PC_PIPE_BUF) { error = EINVAL; } else { td->td_retval[0] = PIPE_BUF; error = 0; } } else { error = EOPNOTSUPP; } out: fdrop(fp, td); return (error); } /* * Grow the file table to accomodate (at least) nfd descriptors. This may * block and drop the filedesc lock, but it will reacquire it before * returning. 
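 *
 * Note that the new size is rounded up to a whole number of bitmap
 * slots (NDSLOTS(nfd) * NDENTRIES), so on a platform with a 64-bit
 * NDSLOTTYPE a request for 65 descriptors yields a 128-entry table.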
*/ static void fdgrowtable(struct filedesc *fdp, int nfd) { struct filedesc0 *fdp0; struct freetable *fo; struct file **ntable; struct file **otable; char *nfileflags; int nnfiles, onfiles; NDSLOTTYPE *nmap; FILEDESC_XLOCK_ASSERT(fdp); KASSERT(fdp->fd_nfiles > 0, ("zero-length file table")); /* compute the size of the new table */ onfiles = fdp->fd_nfiles; nnfiles = NDSLOTS(nfd) * NDENTRIES; /* round up */ if (nnfiles <= onfiles) /* the table is already large enough */ return; /* allocate a new table and (if required) new bitmaps */ FILEDESC_XUNLOCK(fdp); ntable = malloc((nnfiles * OFILESIZE) + sizeof(struct freetable), M_FILEDESC, M_ZERO | M_WAITOK); nfileflags = (char *)&ntable[nnfiles]; if (NDSLOTS(nnfiles) > NDSLOTS(onfiles)) nmap = malloc(NDSLOTS(nnfiles) * NDSLOTSIZE, M_FILEDESC, M_ZERO | M_WAITOK); else nmap = NULL; FILEDESC_XLOCK(fdp); /* * We now have new tables ready to go. Since we dropped the * filedesc lock to call malloc(), watch out for a race. */ onfiles = fdp->fd_nfiles; if (onfiles >= nnfiles) { /* we lost the race, but that's OK */ free(ntable, M_FILEDESC); if (nmap != NULL) free(nmap, M_FILEDESC); return; } bcopy(fdp->fd_ofiles, ntable, onfiles * sizeof(*ntable)); bcopy(fdp->fd_ofileflags, nfileflags, onfiles); otable = fdp->fd_ofiles; fdp->fd_ofileflags = nfileflags; fdp->fd_ofiles = ntable; /* * We must preserve ofiles until the process exits because we can't * be certain that no threads have references to the old table via * _fget(). */ if (onfiles > NDFILE) { fo = (struct freetable *)&otable[onfiles]; fdp0 = (struct filedesc0 *)fdp; fo->ft_table = otable; SLIST_INSERT_HEAD(&fdp0->fd_free, fo, ft_next); } if (NDSLOTS(nnfiles) > NDSLOTS(onfiles)) { bcopy(fdp->fd_map, nmap, NDSLOTS(onfiles) * sizeof(*nmap)); if (NDSLOTS(onfiles) > NDSLOTS(NDFILE)) free(fdp->fd_map, M_FILEDESC); fdp->fd_map = nmap; } fdp->fd_nfiles = nnfiles; } /* * Allocate a file descriptor for the process. */ int fdalloc(struct thread *td, int minfd, int *result) { struct proc *p = td->td_proc; struct filedesc *fdp = p->p_fd; int fd = -1, maxfd; FILEDESC_XLOCK_ASSERT(fdp); if (fdp->fd_freefile > minfd) minfd = fdp->fd_freefile; PROC_LOCK(p); maxfd = min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc); PROC_UNLOCK(p); /* * Search the bitmap for a free descriptor. If none is found, try * to grow the file table. Keep at it until we either get a file * descriptor or run into process or system limits; fdgrowtable() * may drop the filedesc lock, so we're in a race. */ for (;;) { fd = fd_first_free(fdp, minfd, fdp->fd_nfiles); if (fd >= maxfd) return (EMFILE); if (fd < fdp->fd_nfiles) break; fdgrowtable(fdp, min(fdp->fd_nfiles * 2, maxfd)); } /* * Perform some sanity checks, then mark the file descriptor as * used and return it to the caller. */ KASSERT(!fdisused(fdp, fd), ("fd_first_free() returned non-free descriptor")); KASSERT(fdp->fd_ofiles[fd] == NULL, ("free descriptor isn't")); fdp->fd_ofileflags[fd] = 0; /* XXX needed? */ fdused(fdp, fd); *result = fd; return (0); } /* * Check to see whether n user file descriptors are available to the process * p. 
*/ int fdavail(struct thread *td, int n) { struct proc *p = td->td_proc; struct filedesc *fdp = td->td_proc->p_fd; struct file **fpp; int i, lim, last; FILEDESC_LOCK_ASSERT(fdp); PROC_LOCK(p); lim = min((int)lim_cur(p, RLIMIT_NOFILE), maxfilesperproc); PROC_UNLOCK(p); if ((i = lim - fdp->fd_nfiles) > 0 && (n -= i) <= 0) return (1); last = min(fdp->fd_nfiles, lim); fpp = &fdp->fd_ofiles[fdp->fd_freefile]; for (i = last - fdp->fd_freefile; --i >= 0; fpp++) { if (*fpp == NULL && --n <= 0) return (1); } return (0); } /* * Create a new open file structure and allocate a file decriptor for the * process that refers to it. We add one reference to the file for the * descriptor table and one reference for resultfp. This is to prevent us * being preempted and the entry in the descriptor table closed after we * release the FILEDESC lock. */ int falloc(struct thread *td, struct file **resultfp, int *resultfd) { struct proc *p = td->td_proc; struct file *fp; int error, i; int maxuserfiles = maxfiles - (maxfiles / 20); static struct timeval lastfail; static int curfail; fp = uma_zalloc(file_zone, M_WAITOK | M_ZERO); if ((openfiles >= maxuserfiles && priv_check(td, PRIV_MAXFILES) != 0) || openfiles >= maxfiles) { if (ppsratecheck(&lastfail, &curfail, 1)) { printf("kern.maxfiles limit exceeded by uid %i, please see tuning(7).\n", td->td_ucred->cr_ruid); } uma_zfree(file_zone, fp); return (ENFILE); } atomic_add_int(&openfiles, 1); /* * If the process has file descriptor zero open, add the new file * descriptor to the list of open files at that point, otherwise * put it at the front of the list of open files. */ refcount_init(&fp->f_count, 1); if (resultfp) fhold(fp); fp->f_cred = crhold(td->td_ucred); fp->f_ops = &badfileops; fp->f_data = NULL; fp->f_vnode = NULL; FILEDESC_XLOCK(p->p_fd); if ((error = fdalloc(td, 0, &i))) { FILEDESC_XUNLOCK(p->p_fd); fdrop(fp, td); if (resultfp) fdrop(fp, td); return (error); } p->p_fd->fd_ofiles[i] = fp; FILEDESC_XUNLOCK(p->p_fd); if (resultfp) *resultfp = fp; if (resultfd) *resultfd = i; return (0); } /* * Build a new filedesc structure from another. * Copy the current, root, and jail root vnode references. */ struct filedesc * fdinit(struct filedesc *fdp) { struct filedesc0 *newfdp; newfdp = malloc(sizeof *newfdp, M_FILEDESC, M_WAITOK | M_ZERO); FILEDESC_LOCK_INIT(&newfdp->fd_fd); if (fdp != NULL) { FILEDESC_XLOCK(fdp); newfdp->fd_fd.fd_cdir = fdp->fd_cdir; if (newfdp->fd_fd.fd_cdir) VREF(newfdp->fd_fd.fd_cdir); newfdp->fd_fd.fd_rdir = fdp->fd_rdir; if (newfdp->fd_fd.fd_rdir) VREF(newfdp->fd_fd.fd_rdir); newfdp->fd_fd.fd_jdir = fdp->fd_jdir; if (newfdp->fd_fd.fd_jdir) VREF(newfdp->fd_fd.fd_jdir); FILEDESC_XUNLOCK(fdp); } /* Create the file descriptor table. 
*/ newfdp->fd_fd.fd_refcnt = 1; newfdp->fd_fd.fd_holdcnt = 1; newfdp->fd_fd.fd_cmask = CMASK; newfdp->fd_fd.fd_ofiles = newfdp->fd_dfiles; newfdp->fd_fd.fd_ofileflags = newfdp->fd_dfileflags; newfdp->fd_fd.fd_nfiles = NDFILE; newfdp->fd_fd.fd_map = newfdp->fd_dmap; newfdp->fd_fd.fd_lastfile = -1; return (&newfdp->fd_fd); } static struct filedesc * fdhold(struct proc *p) { struct filedesc *fdp; mtx_lock(&fdesc_mtx); fdp = p->p_fd; if (fdp != NULL) fdp->fd_holdcnt++; mtx_unlock(&fdesc_mtx); return (fdp); } static void fddrop(struct filedesc *fdp) { struct filedesc0 *fdp0; struct freetable *ft; int i; mtx_lock(&fdesc_mtx); i = --fdp->fd_holdcnt; mtx_unlock(&fdesc_mtx); if (i > 0) return; FILEDESC_LOCK_DESTROY(fdp); fdp0 = (struct filedesc0 *)fdp; while ((ft = SLIST_FIRST(&fdp0->fd_free)) != NULL) { SLIST_REMOVE_HEAD(&fdp0->fd_free, ft_next); free(ft->ft_table, M_FILEDESC); } free(fdp, M_FILEDESC); } /* * Share a filedesc structure. */ struct filedesc * fdshare(struct filedesc *fdp) { FILEDESC_XLOCK(fdp); fdp->fd_refcnt++; FILEDESC_XUNLOCK(fdp); return (fdp); } /* * Unshare a filedesc structure, if necessary by making a copy */ void fdunshare(struct proc *p, struct thread *td) { FILEDESC_XLOCK(p->p_fd); if (p->p_fd->fd_refcnt > 1) { struct filedesc *tmp; FILEDESC_XUNLOCK(p->p_fd); tmp = fdcopy(p->p_fd); fdfree(td); p->p_fd = tmp; } else FILEDESC_XUNLOCK(p->p_fd); } /* * Copy a filedesc structure. A NULL pointer in returns a NULL reference, * this is to ease callers, not catch errors. */ struct filedesc * fdcopy(struct filedesc *fdp) { struct filedesc *newfdp; int i; /* Certain daemons might not have file descriptors. */ if (fdp == NULL) return (NULL); newfdp = fdinit(fdp); FILEDESC_SLOCK(fdp); while (fdp->fd_lastfile >= newfdp->fd_nfiles) { FILEDESC_SUNLOCK(fdp); FILEDESC_XLOCK(newfdp); fdgrowtable(newfdp, fdp->fd_lastfile + 1); FILEDESC_XUNLOCK(newfdp); FILEDESC_SLOCK(fdp); } /* copy everything except kqueue descriptors */ newfdp->fd_freefile = -1; for (i = 0; i <= fdp->fd_lastfile; ++i) { if (fdisused(fdp, i) && fdp->fd_ofiles[i]->f_type != DTYPE_KQUEUE && fdp->fd_ofiles[i]->f_ops != &badfileops) { newfdp->fd_ofiles[i] = fdp->fd_ofiles[i]; newfdp->fd_ofileflags[i] = fdp->fd_ofileflags[i]; fhold(newfdp->fd_ofiles[i]); newfdp->fd_lastfile = i; } else { if (newfdp->fd_freefile == -1) newfdp->fd_freefile = i; } } newfdp->fd_cmask = fdp->fd_cmask; FILEDESC_SUNLOCK(fdp); FILEDESC_XLOCK(newfdp); for (i = 0; i <= newfdp->fd_lastfile; ++i) if (newfdp->fd_ofiles[i] != NULL) fdused(newfdp, i); if (newfdp->fd_freefile == -1) newfdp->fd_freefile = i; FILEDESC_XUNLOCK(newfdp); return (newfdp); } /* * Release a filedesc structure. */ void fdfree(struct thread *td) { struct filedesc *fdp; struct file **fpp; int i, locked; struct filedesc_to_leader *fdtol; struct file *fp; struct vnode *cdir, *jdir, *rdir, *vp; struct flock lf; /* Certain daemons might not have file descriptors. 
*/ fdp = td->td_proc->p_fd; if (fdp == NULL) return; /* Check for special need to clear POSIX style locks */ fdtol = td->td_proc->p_fdtol; if (fdtol != NULL) { FILEDESC_XLOCK(fdp); KASSERT(fdtol->fdl_refcount > 0, ("filedesc_to_refcount botch: fdl_refcount=%d", fdtol->fdl_refcount)); if (fdtol->fdl_refcount == 1 && (td->td_proc->p_leader->p_flag & P_ADVLOCK) != 0) { for (i = 0, fpp = fdp->fd_ofiles; i <= fdp->fd_lastfile; i++, fpp++) { if (*fpp == NULL || (*fpp)->f_type != DTYPE_VNODE) continue; fp = *fpp; fhold(fp); FILEDESC_XUNLOCK(fdp); lf.l_whence = SEEK_SET; lf.l_start = 0; lf.l_len = 0; lf.l_type = F_UNLCK; vp = fp->f_vnode; locked = VFS_LOCK_GIANT(vp->v_mount); (void) VOP_ADVLOCK(vp, (caddr_t)td->td_proc-> p_leader, F_UNLCK, &lf, F_POSIX); VFS_UNLOCK_GIANT(locked); FILEDESC_XLOCK(fdp); fdrop(fp, td); fpp = fdp->fd_ofiles + i; } } retry: if (fdtol->fdl_refcount == 1) { if (fdp->fd_holdleaderscount > 0 && (td->td_proc->p_leader->p_flag & P_ADVLOCK) != 0) { /* * close() or do_dup() has cleared a reference * in a shared file descriptor table. */ fdp->fd_holdleaderswakeup = 1; sx_sleep(&fdp->fd_holdleaderscount, FILEDESC_LOCK(fdp), PLOCK, "fdlhold", 0); goto retry; } if (fdtol->fdl_holdcount > 0) { /* * Ensure that fdtol->fdl_leader remains * valid in closef(). */ fdtol->fdl_wakeup = 1; sx_sleep(fdtol, FILEDESC_LOCK(fdp), PLOCK, "fdlhold", 0); goto retry; } } fdtol->fdl_refcount--; if (fdtol->fdl_refcount == 0 && fdtol->fdl_holdcount == 0) { fdtol->fdl_next->fdl_prev = fdtol->fdl_prev; fdtol->fdl_prev->fdl_next = fdtol->fdl_next; } else fdtol = NULL; td->td_proc->p_fdtol = NULL; FILEDESC_XUNLOCK(fdp); if (fdtol != NULL) free(fdtol, M_FILEDESC_TO_LEADER); } FILEDESC_XLOCK(fdp); i = --fdp->fd_refcnt; FILEDESC_XUNLOCK(fdp); if (i > 0) return; fpp = fdp->fd_ofiles; for (i = fdp->fd_lastfile; i-- >= 0; fpp++) { if (*fpp) { FILEDESC_XLOCK(fdp); fp = *fpp; *fpp = NULL; FILEDESC_XUNLOCK(fdp); (void) closef(fp, td); } } FILEDESC_XLOCK(fdp); /* XXX This should happen earlier. */ mtx_lock(&fdesc_mtx); td->td_proc->p_fd = NULL; mtx_unlock(&fdesc_mtx); if (fdp->fd_nfiles > NDFILE) free(fdp->fd_ofiles, M_FILEDESC); if (NDSLOTS(fdp->fd_nfiles) > NDSLOTS(NDFILE)) free(fdp->fd_map, M_FILEDESC); fdp->fd_nfiles = 0; cdir = fdp->fd_cdir; fdp->fd_cdir = NULL; rdir = fdp->fd_rdir; fdp->fd_rdir = NULL; jdir = fdp->fd_jdir; fdp->fd_jdir = NULL; FILEDESC_XUNLOCK(fdp); if (cdir) { locked = VFS_LOCK_GIANT(cdir->v_mount); vrele(cdir); VFS_UNLOCK_GIANT(locked); } if (rdir) { locked = VFS_LOCK_GIANT(rdir->v_mount); vrele(rdir); VFS_UNLOCK_GIANT(locked); } if (jdir) { locked = VFS_LOCK_GIANT(jdir->v_mount); vrele(jdir); VFS_UNLOCK_GIANT(locked); } fddrop(fdp); } /* * For setugid programs, we don't want to people to use that setugidness * to generate error messages which write to a file which otherwise would * otherwise be off-limits to the process. We check for filesystems where * the vnode can change out from under us after execve (like [lin]procfs). * * Since setugidsafety calls this only for fd 0, 1 and 2, this check is * sufficient. We also don't check for setugidness since we know we are. */ static int is_unsafe(struct file *fp) { if (fp->f_type == DTYPE_VNODE) { struct vnode *vp = fp->f_vnode; if ((vp->v_vflag & VV_PROCDEP) != 0) return (1); } return (0); } /* * Make this setguid thing safe, if at all possible. */ void setugidsafety(struct thread *td) { struct filedesc *fdp; int i; /* Certain daemons might not have file descriptors. 
*/ fdp = td->td_proc->p_fd; if (fdp == NULL) return; /* * Note: fdp->fd_ofiles may be reallocated out from under us while * we are blocked in a close. Be careful! */ FILEDESC_XLOCK(fdp); for (i = 0; i <= fdp->fd_lastfile; i++) { if (i > 2) break; if (fdp->fd_ofiles[i] && is_unsafe(fdp->fd_ofiles[i])) { struct file *fp; knote_fdclose(td, i); /* * NULL-out descriptor prior to close to avoid * a race while close blocks. */ fp = fdp->fd_ofiles[i]; fdp->fd_ofiles[i] = NULL; fdp->fd_ofileflags[i] = 0; fdunused(fdp, i); FILEDESC_XUNLOCK(fdp); (void) closef(fp, td); FILEDESC_XLOCK(fdp); } } FILEDESC_XUNLOCK(fdp); } /* * If a specific file object occupies a specific file descriptor, close the * file descriptor entry and drop a reference on the file object. This is a * convenience function to handle a subsequent error in a function that calls * falloc() that handles the race that another thread might have closed the * file descriptor out from under the thread creating the file object. */ void fdclose(struct filedesc *fdp, struct file *fp, int idx, struct thread *td) { FILEDESC_XLOCK(fdp); if (fdp->fd_ofiles[idx] == fp) { fdp->fd_ofiles[idx] = NULL; fdunused(fdp, idx); FILEDESC_XUNLOCK(fdp); fdrop(fp, td); } else FILEDESC_XUNLOCK(fdp); } /* * Close any files on exec? */ void fdcloseexec(struct thread *td) { struct filedesc *fdp; int i; /* Certain daemons might not have file descriptors. */ fdp = td->td_proc->p_fd; if (fdp == NULL) return; FILEDESC_XLOCK(fdp); /* * We cannot cache fd_ofiles or fd_ofileflags since operations * may block and rip them out from under us. */ for (i = 0; i <= fdp->fd_lastfile; i++) { if (fdp->fd_ofiles[i] != NULL && (fdp->fd_ofiles[i]->f_type == DTYPE_MQUEUE || (fdp->fd_ofileflags[i] & UF_EXCLOSE))) { struct file *fp; knote_fdclose(td, i); /* * NULL-out descriptor prior to close to avoid * a race while close blocks. */ fp = fdp->fd_ofiles[i]; fdp->fd_ofiles[i] = NULL; fdp->fd_ofileflags[i] = 0; fdunused(fdp, i); if (fp->f_type == DTYPE_MQUEUE) mq_fdclose(td, i, fp); FILEDESC_XUNLOCK(fdp); (void) closef(fp, td); FILEDESC_XLOCK(fdp); } } FILEDESC_XUNLOCK(fdp); } /* * It is unsafe for set[ug]id processes to be started with file * descriptors 0..2 closed, as these descriptors are given implicit * significance in the Standard C library. fdcheckstd() will create a * descriptor referencing /dev/null for each of stdin, stdout, and * stderr that is not already open. */ int fdcheckstd(struct thread *td) { struct filedesc *fdp; register_t retval, save; int i, error, devnull; fdp = td->td_proc->p_fd; if (fdp == NULL) return (0); KASSERT(fdp->fd_refcnt == 1, ("the fdtable should not be shared")); devnull = -1; error = 0; for (i = 0; i < 3; i++) { if (fdp->fd_ofiles[i] != NULL) continue; if (devnull < 0) { save = td->td_retval[0]; error = kern_open(td, "/dev/null", UIO_SYSSPACE, O_RDWR, 0); devnull = td->td_retval[0]; KASSERT(devnull == i, ("oof, we didn't get our fd")); td->td_retval[0] = save; if (error) break; } else { error = do_dup(td, DUP_FIXED, devnull, i, &retval); if (error != 0) break; } } return (error); } /* * Internal form of close. Decrement reference count on file structure. * Note: td may be NULL when closing a file that was being passed in a * message. * * XXXRW: Giant is not required for the caller, but often will be held; this * makes it moderately likely the Giant will be recursed in the VFS case. 
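 *
 * A hypothetical userland sketch (not part of this file) of the POSIX
 * record-lock semantics that closef() below has to honour: closing any
 * one descriptor for a file releases every fcntl(2)-style lock the
 * process holds on that file, even a lock taken through a different
 * descriptor.
 */
#if 0	/* illustrative sketch only; not built */
#include <fcntl.h>
#include <unistd.h>

static void
posix_lock_pitfall(const char *path)
{
	struct flock lf;
	int fd1, fd2;

	fd1 = open(path, O_RDWR);
	fd2 = open(path, O_RDWR);
	lf.l_whence = SEEK_SET;
	lf.l_start = 0;
	lf.l_len = 0;
	lf.l_type = F_WRLCK;
	(void) fcntl(fd1, F_SETLKW, &lf);	/* lock taken via fd1 */
	close(fd2);				/* drops the fd1 lock as well */
}
#endif
/*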
*/ int closef(struct file *fp, struct thread *td) { struct vnode *vp; struct flock lf; struct filedesc_to_leader *fdtol; struct filedesc *fdp; /* * POSIX record locking dictates that any close releases ALL * locks owned by this process. This is handled by setting * a flag in the unlock to free ONLY locks obeying POSIX * semantics, and not to free BSD-style file locks. * If the descriptor was in a message, POSIX-style locks * aren't passed with the descriptor, and the thread pointer * will be NULL. Callers should be careful only to pass a * NULL thread pointer when there really is no owning * context that might have locks, or the locks will be * leaked. */ if (fp->f_type == DTYPE_VNODE && td != NULL) { int vfslocked; vp = fp->f_vnode; vfslocked = VFS_LOCK_GIANT(vp->v_mount); if ((td->td_proc->p_leader->p_flag & P_ADVLOCK) != 0) { lf.l_whence = SEEK_SET; lf.l_start = 0; lf.l_len = 0; lf.l_type = F_UNLCK; (void) VOP_ADVLOCK(vp, (caddr_t)td->td_proc->p_leader, F_UNLCK, &lf, F_POSIX); } fdtol = td->td_proc->p_fdtol; if (fdtol != NULL) { /* * Handle special case where file descriptor table is * shared between multiple process leaders. */ fdp = td->td_proc->p_fd; FILEDESC_XLOCK(fdp); for (fdtol = fdtol->fdl_next; fdtol != td->td_proc->p_fdtol; fdtol = fdtol->fdl_next) { if ((fdtol->fdl_leader->p_flag & P_ADVLOCK) == 0) continue; fdtol->fdl_holdcount++; FILEDESC_XUNLOCK(fdp); lf.l_whence = SEEK_SET; lf.l_start = 0; lf.l_len = 0; lf.l_type = F_UNLCK; vp = fp->f_vnode; (void) VOP_ADVLOCK(vp, (caddr_t)fdtol->fdl_leader, F_UNLCK, &lf, F_POSIX); FILEDESC_XLOCK(fdp); fdtol->fdl_holdcount--; if (fdtol->fdl_holdcount == 0 && fdtol->fdl_wakeup != 0) { fdtol->fdl_wakeup = 0; wakeup(fdtol); } } FILEDESC_XUNLOCK(fdp); } VFS_UNLOCK_GIANT(vfslocked); } return (fdrop(fp, td)); } /* * Initialize the file pointer with the specified properties. * * The ops are set with release semantics to be certain that the flags, type, * and data are visible when ops is. This is to prevent ops methods from being * called with bad data. */ void finit(struct file *fp, u_int flag, short type, void *data, struct fileops *ops) { fp->f_data = data; fp->f_flag = flag; fp->f_type = type; atomic_store_rel_ptr((volatile uintptr_t *)&fp->f_ops, (uintptr_t)ops); } struct file * fget_unlocked(struct filedesc *fdp, int fd) { struct file *fp; u_int count; if (fd < 0 || fd >= fdp->fd_nfiles) return (NULL); /* * Fetch the descriptor locklessly. We avoid fdrop() races by * never raising a refcount above 0. To accomplish this we have * to use a cmpset loop rather than an atomic_add. The descriptor * must be re-verified once we acquire a reference to be certain * that the identity is still correct and we did not lose a race * due to preemption. */ for (;;) { fp = fdp->fd_ofiles[fd]; if (fp == NULL) break; count = fp->f_count; if (count == 0) continue; if (atomic_cmpset_int(&fp->f_count, count, count + 1) != 1) continue; if (fp == ((struct file *volatile*)fdp->fd_ofiles)[fd]) break; fdrop(fp, curthread); } return (fp); } /* * Extract the file pointer associated with the specified descriptor for the * current user process. * * If the descriptor doesn't exist or doesn't match 'flags', EBADF is * returned. * * If an error occured the non-zero error is returned and *fpp is set to * NULL. Otherwise *fpp is held and set and zero is returned. Caller is * responsible for fdrop(). 
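 *
 * A minimal sketch (hypothetical, not part of this file) of a caller
 * honouring this contract: take the reference with fget(), use the file,
 * and release it with fdrop() on every path.
 */
#if 0	/* illustrative sketch only; not built */
static int
example_stat_fd(struct thread *td, int fd, struct stat *sb)
{
	struct file *fp;
	int error;

	error = fget(td, fd, &fp);	/* holds a reference on success */
	if (error != 0)
		return (error);		/* EBADF; *fpp was set to NULL */
	error = fo_stat(fp, sb, td->td_ucred, td);
	fdrop(fp, td);			/* release the held reference */
	return (error);
}
#endif
/*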
*/ static __inline int _fget(struct thread *td, int fd, struct file **fpp, int flags) { struct filedesc *fdp; struct file *fp; *fpp = NULL; if (td == NULL || (fdp = td->td_proc->p_fd) == NULL) return (EBADF); if ((fp = fget_unlocked(fdp, fd)) == NULL) return (EBADF); if (fp->f_ops == &badfileops) { fdrop(fp, td); return (EBADF); } /* * FREAD and FWRITE failure return EBADF as per POSIX. * * Only one flag, or 0, may be specified. */ if ((flags == FREAD && (fp->f_flag & FREAD) == 0) || (flags == FWRITE && (fp->f_flag & FWRITE) == 0)) { fdrop(fp, td); return (EBADF); } *fpp = fp; return (0); } int fget(struct thread *td, int fd, struct file **fpp) { return(_fget(td, fd, fpp, 0)); } int fget_read(struct thread *td, int fd, struct file **fpp) { return(_fget(td, fd, fpp, FREAD)); } int fget_write(struct thread *td, int fd, struct file **fpp) { return(_fget(td, fd, fpp, FWRITE)); } /* * Like fget() but loads the underlying vnode, or returns an error if the * descriptor does not represent a vnode. Note that pipes use vnodes but * never have VM objects. The returned vnode will be vref()'d. * * XXX: what about the unused flags ? */ static __inline int _fgetvp(struct thread *td, int fd, struct vnode **vpp, int flags) { struct file *fp; int error; *vpp = NULL; if ((error = _fget(td, fd, &fp, flags)) != 0) return (error); if (fp->f_vnode == NULL) { error = EINVAL; } else { *vpp = fp->f_vnode; vref(*vpp); } fdrop(fp, td); return (error); } int fgetvp(struct thread *td, int fd, struct vnode **vpp) { return (_fgetvp(td, fd, vpp, 0)); } int fgetvp_read(struct thread *td, int fd, struct vnode **vpp) { return (_fgetvp(td, fd, vpp, FREAD)); } #ifdef notyet int fgetvp_write(struct thread *td, int fd, struct vnode **vpp) { return (_fgetvp(td, fd, vpp, FWRITE)); } #endif /* * Like fget() but loads the underlying socket, or returns an error if the * descriptor does not represent a socket. * * We bump the ref count on the returned socket. XXX Also obtain the SX lock * in the future. * * Note: fgetsock() and fputsock() are deprecated, as consumers should rely * on their file descriptor reference to prevent the socket from being free'd * during use. */ int fgetsock(struct thread *td, int fd, struct socket **spp, u_int *fflagp) { struct file *fp; int error; *spp = NULL; if (fflagp != NULL) *fflagp = 0; if ((error = _fget(td, fd, &fp, 0)) != 0) return (error); if (fp->f_type != DTYPE_SOCKET) { error = ENOTSOCK; } else { *spp = fp->f_data; if (fflagp) *fflagp = fp->f_flag; SOCK_LOCK(*spp); soref(*spp); SOCK_UNLOCK(*spp); } fdrop(fp, td); return (error); } /* * Drop the reference count on the socket and XXX release the SX lock in the * future. The last reference closes the socket. * * Note: fputsock() is deprecated, see comment for fgetsock(). */ void fputsock(struct socket *so) { ACCEPT_LOCK(); SOCK_LOCK(so); sorele(so); } /* * Handle the last reference to a file being closed. */ int _fdrop(struct file *fp, struct thread *td) { int error; error = 0; if (fp->f_count != 0) panic("fdrop: count %d", fp->f_count); if (fp->f_ops != &badfileops) error = fo_close(fp, td); /* * The f_cdevpriv cannot be assigned non-NULL value while we * are destroying the file. */ if (fp->f_cdevpriv != NULL) devfs_fpdrop(fp); atomic_subtract_int(&openfiles, 1); crfree(fp->f_cred); uma_zfree(file_zone, fp); return (error); } /* * Apply an advisory lock on a file descriptor. * * Just attempt to get a record lock of the requested type on the entire file * (l_whence = SEEK_SET, l_start = 0, l_len = 0). 
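 *
 * A hypothetical userland sketch (not part of this file) of the flock(2)
 * interface this syscall implements: the lock always covers the whole
 * file, and LOCK_NB turns the blocking request into a try-lock.
 */
#if 0	/* illustrative sketch only; not built */
#include <sys/file.h>
#include <fcntl.h>
#include <err.h>

static int
lock_whole_file(const char *path)
{
	int fd;

	if ((fd = open(path, O_RDWR)) == -1)
		err(1, "open");
	if (flock(fd, LOCK_EX) == -1)	/* blocks until the lock is granted */
		err(1, "flock");
	/* ... exclusive use of the file ... */
	(void) flock(fd, LOCK_UN);
	return (fd);
}
#endif
/*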
*/ #ifndef _SYS_SYSPROTO_H_ struct flock_args { int fd; int how; }; #endif /* ARGSUSED */ int flock(struct thread *td, struct flock_args *uap) { struct file *fp; struct vnode *vp; struct flock lf; int vfslocked; int error; if ((error = fget(td, uap->fd, &fp)) != 0) return (error); if (fp->f_type != DTYPE_VNODE) { fdrop(fp, td); return (EOPNOTSUPP); } vp = fp->f_vnode; vfslocked = VFS_LOCK_GIANT(vp->v_mount); lf.l_whence = SEEK_SET; lf.l_start = 0; lf.l_len = 0; if (uap->how & LOCK_UN) { lf.l_type = F_UNLCK; atomic_clear_int(&fp->f_flag, FHASLOCK); error = VOP_ADVLOCK(vp, (caddr_t)fp, F_UNLCK, &lf, F_FLOCK); goto done2; } if (uap->how & LOCK_EX) lf.l_type = F_WRLCK; else if (uap->how & LOCK_SH) lf.l_type = F_RDLCK; else { error = EBADF; goto done2; } atomic_set_int(&fp->f_flag, FHASLOCK); error = VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &lf, (uap->how & LOCK_NB) ? F_FLOCK : F_FLOCK | F_WAIT); done2: fdrop(fp, td); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Duplicate the specified descriptor to a free descriptor. */ int dupfdopen(struct thread *td, struct filedesc *fdp, int indx, int dfd, int mode, int error) { struct file *wfp; struct file *fp; /* * If the to-be-dup'd fd number is greater than the allowed number * of file descriptors, or the fd to be dup'd has already been * closed, then reject. */ FILEDESC_XLOCK(fdp); if (dfd < 0 || dfd >= fdp->fd_nfiles || (wfp = fdp->fd_ofiles[dfd]) == NULL) { FILEDESC_XUNLOCK(fdp); return (EBADF); } /* * There are two cases of interest here. * * For ENODEV simply dup (dfd) to file descriptor (indx) and return. * * For ENXIO steal away the file structure from (dfd) and store it in * (indx). (dfd) is effectively closed by this operation. * * Any other error code is just returned. */ switch (error) { case ENODEV: /* * Check that the mode the file is being opened for is a * subset of the mode of the existing descriptor. */ if (((mode & (FREAD|FWRITE)) | wfp->f_flag) != wfp->f_flag) { FILEDESC_XUNLOCK(fdp); return (EACCES); } fp = fdp->fd_ofiles[indx]; fdp->fd_ofiles[indx] = wfp; fdp->fd_ofileflags[indx] = fdp->fd_ofileflags[dfd]; if (fp == NULL) fdused(fdp, indx); fhold(wfp); FILEDESC_XUNLOCK(fdp); if (fp != NULL) /* * We now own the reference to fp that the ofiles[] * array used to own. Release it. */ fdrop(fp, td); return (0); case ENXIO: /* * Steal away the file pointer from dfd and stuff it into indx. */ fp = fdp->fd_ofiles[indx]; fdp->fd_ofiles[indx] = fdp->fd_ofiles[dfd]; fdp->fd_ofiles[dfd] = NULL; fdp->fd_ofileflags[indx] = fdp->fd_ofileflags[dfd]; fdp->fd_ofileflags[dfd] = 0; fdunused(fdp, dfd); if (fp == NULL) fdused(fdp, indx); FILEDESC_XUNLOCK(fdp); /* * We now own the reference to fp that the ofiles[] array * used to own. Release it. */ if (fp != NULL) fdrop(fp, td); return (0); default: FILEDESC_XUNLOCK(fdp); return (error); } /* NOTREACHED */ } /* - * Scan all active processes to see if any of them have a current or root - * directory of `olddp'. If so, replace them with the new mount point. + * Scan all active processes and prisons to see if any of them have a current + * or root directory of `olddp'. If so, replace them with the new mount point. 
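 *
 * The replacement references are taken with vref() while the relevant
 * locks are held, but the matching vrele() calls on `olddp' are counted
 * in `nrele' and deferred until every filedesc, prison, and rootvnode
 * reference has been switched; vrele() may end up locking the vnode and
 * sleeping, which would presumably not be safe under the prison mutexes
 * held in the loops below.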
*/ void mountcheckdirs(struct vnode *olddp, struct vnode *newdp) { struct filedesc *fdp; + struct prison *pr; struct proc *p; int nrele; if (vrefcnt(olddp) == 1) return; + nrele = 0; sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { fdp = fdhold(p); if (fdp == NULL) continue; - nrele = 0; FILEDESC_XLOCK(fdp); if (fdp->fd_cdir == olddp) { vref(newdp); fdp->fd_cdir = newdp; nrele++; } if (fdp->fd_rdir == olddp) { vref(newdp); fdp->fd_rdir = newdp; nrele++; } + if (fdp->fd_jdir == olddp) { + vref(newdp); + fdp->fd_jdir = newdp; + nrele++; + } FILEDESC_XUNLOCK(fdp); fddrop(fdp); - while (nrele--) - vrele(olddp); } sx_sunlock(&allproc_lock); if (rootvnode == olddp) { - vrele(rootvnode); vref(newdp); rootvnode = newdp; + nrele++; } + mtx_lock(&prison0.pr_mtx); + if (prison0.pr_root == olddp) { + vref(newdp); + prison0.pr_root = newdp; + nrele++; + } + mtx_unlock(&prison0.pr_mtx); + sx_slock(&allprison_lock); + TAILQ_FOREACH(pr, &allprison, pr_list) { + mtx_lock(&pr->pr_mtx); + if (pr->pr_root == olddp) { + vref(newdp); + pr->pr_root = newdp; + nrele++; + } + mtx_unlock(&pr->pr_mtx); + } + sx_sunlock(&allprison_lock); + while (nrele--) + vrele(olddp); } struct filedesc_to_leader * filedesc_to_leader_alloc(struct filedesc_to_leader *old, struct filedesc *fdp, struct proc *leader) { struct filedesc_to_leader *fdtol; fdtol = malloc(sizeof(struct filedesc_to_leader), M_FILEDESC_TO_LEADER, M_WAITOK); fdtol->fdl_refcount = 1; fdtol->fdl_holdcount = 0; fdtol->fdl_wakeup = 0; fdtol->fdl_leader = leader; if (old != NULL) { FILEDESC_XLOCK(fdp); fdtol->fdl_next = old->fdl_next; fdtol->fdl_prev = old; old->fdl_next = fdtol; fdtol->fdl_next->fdl_prev = fdtol; FILEDESC_XUNLOCK(fdp); } else { fdtol->fdl_next = fdtol; fdtol->fdl_prev = fdtol; } return (fdtol); } /* * Get file structures globally. */ static int sysctl_kern_file(SYSCTL_HANDLER_ARGS) { struct xfile xf; struct filedesc *fdp; struct file *fp; struct proc *p; int error, n; error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); if (req->oldptr == NULL) { n = 0; sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { if (p->p_state == PRS_NEW) continue; fdp = fdhold(p); if (fdp == NULL) continue; /* overestimates sparse tables. 
*/ if (fdp->fd_lastfile > 0) n += fdp->fd_lastfile; fddrop(fdp); } sx_sunlock(&allproc_lock); return (SYSCTL_OUT(req, 0, n * sizeof(xf))); } error = 0; bzero(&xf, sizeof(xf)); xf.xf_size = sizeof(xf); sx_slock(&allproc_lock); FOREACH_PROC_IN_SYSTEM(p) { if (p->p_state == PRS_NEW) continue; PROC_LOCK(p); if (p_cansee(req->td, p) != 0) { PROC_UNLOCK(p); continue; } xf.xf_pid = p->p_pid; xf.xf_uid = p->p_ucred->cr_uid; PROC_UNLOCK(p); fdp = fdhold(p); if (fdp == NULL) continue; FILEDESC_SLOCK(fdp); for (n = 0; fdp->fd_refcnt > 0 && n < fdp->fd_nfiles; ++n) { if ((fp = fdp->fd_ofiles[n]) == NULL) continue; xf.xf_fd = n; xf.xf_file = fp; xf.xf_data = fp->f_data; xf.xf_vnode = fp->f_vnode; xf.xf_type = fp->f_type; xf.xf_count = fp->f_count; xf.xf_msgcount = 0; xf.xf_offset = fp->f_offset; xf.xf_flag = fp->f_flag; error = SYSCTL_OUT(req, &xf, sizeof(xf)); if (error) break; } FILEDESC_SUNLOCK(fdp); fddrop(fdp); if (error) break; } sx_sunlock(&allproc_lock); return (error); } SYSCTL_PROC(_kern, KERN_FILE, file, CTLTYPE_OPAQUE|CTLFLAG_RD, 0, 0, sysctl_kern_file, "S,xfile", "Entire file table"); #ifdef KINFO_OFILE_SIZE CTASSERT(sizeof(struct kinfo_ofile) == KINFO_OFILE_SIZE); #endif #ifdef COMPAT_FREEBSD7 static int export_vnode_for_osysctl(struct vnode *vp, int type, struct kinfo_ofile *kif, struct filedesc *fdp, struct sysctl_req *req) { int error; char *fullpath, *freepath; int vfslocked; bzero(kif, sizeof(*kif)); kif->kf_structsize = sizeof(*kif); vref(vp); kif->kf_fd = type; kif->kf_type = KF_TYPE_VNODE; /* This function only handles directories. */ if (vp->v_type != VDIR) { vrele(vp); return (ENOTDIR); } kif->kf_vnode_type = KF_VTYPE_VDIR; /* * This is not a true file descriptor, so we set a bogus refcount * and offset to indicate these fields should be ignored. */ kif->kf_ref_count = -1; kif->kf_offset = -1; freepath = NULL; fullpath = "-"; FILEDESC_SUNLOCK(fdp); vn_fullpath(curthread, vp, &fullpath, &freepath); vfslocked = VFS_LOCK_GIANT(vp->v_mount); vrele(vp); VFS_UNLOCK_GIANT(vfslocked); strlcpy(kif->kf_path, fullpath, sizeof(kif->kf_path)); if (freepath != NULL) free(freepath, M_TEMP); error = SYSCTL_OUT(req, kif, sizeof(*kif)); FILEDESC_SLOCK(fdp); return (error); } /* * Get per-process file descriptors for use by procstat(1), et al. 
*/ static int sysctl_kern_proc_ofiledesc(SYSCTL_HANDLER_ARGS) { char *fullpath, *freepath; struct kinfo_ofile *kif; struct filedesc *fdp; int error, i, *name; struct socket *so; struct vnode *vp; struct file *fp; struct proc *p; struct tty *tp; int vfslocked; name = (int *)arg1; if ((p = pfind((pid_t)name[0])) == NULL) return (ESRCH); if ((error = p_candebug(curthread, p))) { PROC_UNLOCK(p); return (error); } fdp = fdhold(p); PROC_UNLOCK(p); if (fdp == NULL) return (ENOENT); kif = malloc(sizeof(*kif), M_TEMP, M_WAITOK); FILEDESC_SLOCK(fdp); if (fdp->fd_cdir != NULL) export_vnode_for_osysctl(fdp->fd_cdir, KF_FD_TYPE_CWD, kif, fdp, req); if (fdp->fd_rdir != NULL) export_vnode_for_osysctl(fdp->fd_rdir, KF_FD_TYPE_ROOT, kif, fdp, req); if (fdp->fd_jdir != NULL) export_vnode_for_osysctl(fdp->fd_jdir, KF_FD_TYPE_JAIL, kif, fdp, req); for (i = 0; i < fdp->fd_nfiles; i++) { if ((fp = fdp->fd_ofiles[i]) == NULL) continue; bzero(kif, sizeof(*kif)); kif->kf_structsize = sizeof(*kif); vp = NULL; so = NULL; tp = NULL; kif->kf_fd = i; switch (fp->f_type) { case DTYPE_VNODE: kif->kf_type = KF_TYPE_VNODE; vp = fp->f_vnode; break; case DTYPE_SOCKET: kif->kf_type = KF_TYPE_SOCKET; so = fp->f_data; break; case DTYPE_PIPE: kif->kf_type = KF_TYPE_PIPE; break; case DTYPE_FIFO: kif->kf_type = KF_TYPE_FIFO; vp = fp->f_vnode; vref(vp); break; case DTYPE_KQUEUE: kif->kf_type = KF_TYPE_KQUEUE; break; case DTYPE_CRYPTO: kif->kf_type = KF_TYPE_CRYPTO; break; case DTYPE_MQUEUE: kif->kf_type = KF_TYPE_MQUEUE; break; case DTYPE_SHM: kif->kf_type = KF_TYPE_SHM; break; case DTYPE_SEM: kif->kf_type = KF_TYPE_SEM; break; case DTYPE_PTS: kif->kf_type = KF_TYPE_PTS; tp = fp->f_data; break; default: kif->kf_type = KF_TYPE_UNKNOWN; break; } kif->kf_ref_count = fp->f_count; if (fp->f_flag & FREAD) kif->kf_flags |= KF_FLAG_READ; if (fp->f_flag & FWRITE) kif->kf_flags |= KF_FLAG_WRITE; if (fp->f_flag & FAPPEND) kif->kf_flags |= KF_FLAG_APPEND; if (fp->f_flag & FASYNC) kif->kf_flags |= KF_FLAG_ASYNC; if (fp->f_flag & FFSYNC) kif->kf_flags |= KF_FLAG_FSYNC; if (fp->f_flag & FNONBLOCK) kif->kf_flags |= KF_FLAG_NONBLOCK; if (fp->f_flag & O_DIRECT) kif->kf_flags |= KF_FLAG_DIRECT; if (fp->f_flag & FHASLOCK) kif->kf_flags |= KF_FLAG_HASLOCK; kif->kf_offset = fp->f_offset; if (vp != NULL) { vref(vp); switch (vp->v_type) { case VNON: kif->kf_vnode_type = KF_VTYPE_VNON; break; case VREG: kif->kf_vnode_type = KF_VTYPE_VREG; break; case VDIR: kif->kf_vnode_type = KF_VTYPE_VDIR; break; case VBLK: kif->kf_vnode_type = KF_VTYPE_VBLK; break; case VCHR: kif->kf_vnode_type = KF_VTYPE_VCHR; break; case VLNK: kif->kf_vnode_type = KF_VTYPE_VLNK; break; case VSOCK: kif->kf_vnode_type = KF_VTYPE_VSOCK; break; case VFIFO: kif->kf_vnode_type = KF_VTYPE_VFIFO; break; case VBAD: kif->kf_vnode_type = KF_VTYPE_VBAD; break; default: kif->kf_vnode_type = KF_VTYPE_UNKNOWN; break; } /* * It is OK to drop the filedesc lock here as we will * re-validate and re-evaluate its properties when * the loop continues. 
*/ freepath = NULL; fullpath = "-"; FILEDESC_SUNLOCK(fdp); vn_fullpath(curthread, vp, &fullpath, &freepath); vfslocked = VFS_LOCK_GIANT(vp->v_mount); vrele(vp); VFS_UNLOCK_GIANT(vfslocked); strlcpy(kif->kf_path, fullpath, sizeof(kif->kf_path)); if (freepath != NULL) free(freepath, M_TEMP); FILEDESC_SLOCK(fdp); } if (so != NULL) { struct sockaddr *sa; if (so->so_proto->pr_usrreqs->pru_sockaddr(so, &sa) == 0 && sa->sa_len <= sizeof(kif->kf_sa_local)) { bcopy(sa, &kif->kf_sa_local, sa->sa_len); free(sa, M_SONAME); } if (so->so_proto->pr_usrreqs->pru_peeraddr(so, &sa) == 00 && sa->sa_len <= sizeof(kif->kf_sa_peer)) { bcopy(sa, &kif->kf_sa_peer, sa->sa_len); free(sa, M_SONAME); } kif->kf_sock_domain = so->so_proto->pr_domain->dom_family; kif->kf_sock_type = so->so_type; kif->kf_sock_protocol = so->so_proto->pr_protocol; } if (tp != NULL) { strlcpy(kif->kf_path, tty_devname(tp), sizeof(kif->kf_path)); } error = SYSCTL_OUT(req, kif, sizeof(*kif)); if (error) break; } FILEDESC_SUNLOCK(fdp); fddrop(fdp); free(kif, M_TEMP); return (0); } static SYSCTL_NODE(_kern_proc, KERN_PROC_OFILEDESC, ofiledesc, CTLFLAG_RD, sysctl_kern_proc_ofiledesc, "Process ofiledesc entries"); #endif /* COMPAT_FREEBSD7 */ #ifdef KINFO_FILE_SIZE CTASSERT(sizeof(struct kinfo_file) == KINFO_FILE_SIZE); #endif static int export_vnode_for_sysctl(struct vnode *vp, int type, struct kinfo_file *kif, struct filedesc *fdp, struct sysctl_req *req) { int error; char *fullpath, *freepath; int vfslocked; bzero(kif, sizeof(*kif)); vref(vp); kif->kf_fd = type; kif->kf_type = KF_TYPE_VNODE; /* This function only handles directories. */ if (vp->v_type != VDIR) { vrele(vp); return (ENOTDIR); } kif->kf_vnode_type = KF_VTYPE_VDIR; /* * This is not a true file descriptor, so we set a bogus refcount * and offset to indicate these fields should be ignored. */ kif->kf_ref_count = -1; kif->kf_offset = -1; freepath = NULL; fullpath = "-"; FILEDESC_SUNLOCK(fdp); vn_fullpath(curthread, vp, &fullpath, &freepath); vfslocked = VFS_LOCK_GIANT(vp->v_mount); vrele(vp); VFS_UNLOCK_GIANT(vfslocked); strlcpy(kif->kf_path, fullpath, sizeof(kif->kf_path)); if (freepath != NULL) free(freepath, M_TEMP); /* Pack record size down */ kif->kf_structsize = offsetof(struct kinfo_file, kf_path) + strlen(kif->kf_path) + 1; kif->kf_structsize = roundup(kif->kf_structsize, sizeof(uint64_t)); error = SYSCTL_OUT(req, kif, kif->kf_structsize); FILEDESC_SLOCK(fdp); return (error); } /* * Get per-process file descriptors for use by procstat(1), et al. 
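 *
 * A hypothetical userland sketch (not part of this file) of a consumer of
 * this interface, in the style of procstat(1): fetch the records for one
 * pid and walk them by kf_structsize, since the handler below packs each
 * record down to the length of the path it carries.
 */
#if 0	/* illustrative sketch only; not built */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <sys/user.h>
#include <stdio.h>
#include <stdlib.h>

static void
print_fds(pid_t pid)
{
	struct kinfo_file *kif;
	int mib[4];
	size_t len;
	char *buf, *p;

	mib[0] = CTL_KERN;
	mib[1] = KERN_PROC;
	mib[2] = KERN_PROC_FILEDESC;
	mib[3] = pid;
	len = 0;
	if (sysctl(mib, 4, NULL, &len, NULL, 0) == -1)
		return;
	if ((buf = malloc(len)) == NULL)
		return;
	if (sysctl(mib, 4, buf, &len, NULL, 0) == 0) {
		for (p = buf; p < buf + len; p += kif->kf_structsize) {
			kif = (struct kinfo_file *)(void *)p;
			printf("fd %d: %s\n", kif->kf_fd, kif->kf_path);
		}
	}
	free(buf);
}
#endif
/*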
*/ static int sysctl_kern_proc_filedesc(SYSCTL_HANDLER_ARGS) { char *fullpath, *freepath; struct kinfo_file *kif; struct filedesc *fdp; int error, i, *name; struct socket *so; struct vnode *vp; struct file *fp; struct proc *p; struct tty *tp; int vfslocked; size_t oldidx; name = (int *)arg1; if ((p = pfind((pid_t)name[0])) == NULL) return (ESRCH); if ((error = p_candebug(curthread, p))) { PROC_UNLOCK(p); return (error); } fdp = fdhold(p); PROC_UNLOCK(p); if (fdp == NULL) return (ENOENT); kif = malloc(sizeof(*kif), M_TEMP, M_WAITOK); FILEDESC_SLOCK(fdp); if (fdp->fd_cdir != NULL) export_vnode_for_sysctl(fdp->fd_cdir, KF_FD_TYPE_CWD, kif, fdp, req); if (fdp->fd_rdir != NULL) export_vnode_for_sysctl(fdp->fd_rdir, KF_FD_TYPE_ROOT, kif, fdp, req); if (fdp->fd_jdir != NULL) export_vnode_for_sysctl(fdp->fd_jdir, KF_FD_TYPE_JAIL, kif, fdp, req); for (i = 0; i < fdp->fd_nfiles; i++) { if ((fp = fdp->fd_ofiles[i]) == NULL) continue; bzero(kif, sizeof(*kif)); vp = NULL; so = NULL; tp = NULL; kif->kf_fd = i; switch (fp->f_type) { case DTYPE_VNODE: kif->kf_type = KF_TYPE_VNODE; vp = fp->f_vnode; break; case DTYPE_SOCKET: kif->kf_type = KF_TYPE_SOCKET; so = fp->f_data; break; case DTYPE_PIPE: kif->kf_type = KF_TYPE_PIPE; break; case DTYPE_FIFO: kif->kf_type = KF_TYPE_FIFO; vp = fp->f_vnode; vref(vp); break; case DTYPE_KQUEUE: kif->kf_type = KF_TYPE_KQUEUE; break; case DTYPE_CRYPTO: kif->kf_type = KF_TYPE_CRYPTO; break; case DTYPE_MQUEUE: kif->kf_type = KF_TYPE_MQUEUE; break; case DTYPE_SHM: kif->kf_type = KF_TYPE_SHM; break; case DTYPE_SEM: kif->kf_type = KF_TYPE_SEM; break; case DTYPE_PTS: kif->kf_type = KF_TYPE_PTS; tp = fp->f_data; break; default: kif->kf_type = KF_TYPE_UNKNOWN; break; } kif->kf_ref_count = fp->f_count; if (fp->f_flag & FREAD) kif->kf_flags |= KF_FLAG_READ; if (fp->f_flag & FWRITE) kif->kf_flags |= KF_FLAG_WRITE; if (fp->f_flag & FAPPEND) kif->kf_flags |= KF_FLAG_APPEND; if (fp->f_flag & FASYNC) kif->kf_flags |= KF_FLAG_ASYNC; if (fp->f_flag & FFSYNC) kif->kf_flags |= KF_FLAG_FSYNC; if (fp->f_flag & FNONBLOCK) kif->kf_flags |= KF_FLAG_NONBLOCK; if (fp->f_flag & O_DIRECT) kif->kf_flags |= KF_FLAG_DIRECT; if (fp->f_flag & FHASLOCK) kif->kf_flags |= KF_FLAG_HASLOCK; kif->kf_offset = fp->f_offset; if (vp != NULL) { vref(vp); switch (vp->v_type) { case VNON: kif->kf_vnode_type = KF_VTYPE_VNON; break; case VREG: kif->kf_vnode_type = KF_VTYPE_VREG; break; case VDIR: kif->kf_vnode_type = KF_VTYPE_VDIR; break; case VBLK: kif->kf_vnode_type = KF_VTYPE_VBLK; break; case VCHR: kif->kf_vnode_type = KF_VTYPE_VCHR; break; case VLNK: kif->kf_vnode_type = KF_VTYPE_VLNK; break; case VSOCK: kif->kf_vnode_type = KF_VTYPE_VSOCK; break; case VFIFO: kif->kf_vnode_type = KF_VTYPE_VFIFO; break; case VBAD: kif->kf_vnode_type = KF_VTYPE_VBAD; break; default: kif->kf_vnode_type = KF_VTYPE_UNKNOWN; break; } /* * It is OK to drop the filedesc lock here as we will * re-validate and re-evaluate its properties when * the loop continues. 
*/ freepath = NULL; fullpath = "-"; FILEDESC_SUNLOCK(fdp); vn_fullpath(curthread, vp, &fullpath, &freepath); vfslocked = VFS_LOCK_GIANT(vp->v_mount); vrele(vp); VFS_UNLOCK_GIANT(vfslocked); strlcpy(kif->kf_path, fullpath, sizeof(kif->kf_path)); if (freepath != NULL) free(freepath, M_TEMP); FILEDESC_SLOCK(fdp); } if (so != NULL) { struct sockaddr *sa; if (so->so_proto->pr_usrreqs->pru_sockaddr(so, &sa) == 0 && sa->sa_len <= sizeof(kif->kf_sa_local)) { bcopy(sa, &kif->kf_sa_local, sa->sa_len); free(sa, M_SONAME); } if (so->so_proto->pr_usrreqs->pru_peeraddr(so, &sa) == 00 && sa->sa_len <= sizeof(kif->kf_sa_peer)) { bcopy(sa, &kif->kf_sa_peer, sa->sa_len); free(sa, M_SONAME); } kif->kf_sock_domain = so->so_proto->pr_domain->dom_family; kif->kf_sock_type = so->so_type; kif->kf_sock_protocol = so->so_proto->pr_protocol; } if (tp != NULL) { strlcpy(kif->kf_path, tty_devname(tp), sizeof(kif->kf_path)); } /* Pack record size down */ kif->kf_structsize = offsetof(struct kinfo_file, kf_path) + strlen(kif->kf_path) + 1; kif->kf_structsize = roundup(kif->kf_structsize, sizeof(uint64_t)); oldidx = req->oldidx; error = SYSCTL_OUT(req, kif, kif->kf_structsize); if (error) { if (error == ENOMEM) { /* * The hack to keep the ABI of sysctl * kern.proc.filedesc intact, but not * to account a partially copied * kinfo_file into the oldidx. */ req->oldidx = oldidx; error = 0; } break; } } FILEDESC_SUNLOCK(fdp); fddrop(fdp); free(kif, M_TEMP); return (error); } static SYSCTL_NODE(_kern_proc, KERN_PROC_FILEDESC, filedesc, CTLFLAG_RD, sysctl_kern_proc_filedesc, "Process filedesc entries"); #ifdef DDB /* * For the purposes of debugging, generate a human-readable string for the * file type. */ static const char * file_type_to_name(short type) { switch (type) { case 0: return ("zero"); case DTYPE_VNODE: return ("vnod"); case DTYPE_SOCKET: return ("sock"); case DTYPE_PIPE: return ("pipe"); case DTYPE_FIFO: return ("fifo"); case DTYPE_KQUEUE: return ("kque"); case DTYPE_CRYPTO: return ("crpt"); case DTYPE_MQUEUE: return ("mque"); case DTYPE_SHM: return ("shm"); case DTYPE_SEM: return ("ksem"); default: return ("unkn"); } } /* * For the purposes of debugging, identify a process (if any, perhaps one of * many) that references the passed file in its file descriptor array. Return * NULL if none. */ static struct proc * file_to_first_proc(struct file *fp) { struct filedesc *fdp; struct proc *p; int n; FOREACH_PROC_IN_SYSTEM(p) { if (p->p_state == PRS_NEW) continue; fdp = p->p_fd; if (fdp == NULL) continue; for (n = 0; n < fdp->fd_nfiles; n++) { if (fp == fdp->fd_ofiles[n]) return (p); } } return (NULL); } static void db_print_file(struct file *fp, int header) { struct proc *p; if (header) db_printf("%8s %4s %8s %8s %4s %5s %6s %8s %5s %12s\n", "File", "Type", "Data", "Flag", "GCFl", "Count", "MCount", "Vnode", "FPID", "FCmd"); p = file_to_first_proc(fp); db_printf("%8p %4s %8p %08x %04x %5d %6d %8p %5d %12s\n", fp, file_type_to_name(fp->f_type), fp->f_data, fp->f_flag, 0, fp->f_count, 0, fp->f_vnode, p != NULL ? p->p_pid : -1, p != NULL ? 
p->p_comm : "-"); } DB_SHOW_COMMAND(file, db_show_file) { struct file *fp; if (!have_addr) { db_printf("usage: show file \n"); return; } fp = (struct file *)addr; db_print_file(fp, 1); } DB_SHOW_COMMAND(files, db_show_files) { struct filedesc *fdp; struct file *fp; struct proc *p; int header; int n; header = 1; FOREACH_PROC_IN_SYSTEM(p) { if (p->p_state == PRS_NEW) continue; if ((fdp = p->p_fd) == NULL) continue; for (n = 0; n < fdp->fd_nfiles; ++n) { if ((fp = fdp->fd_ofiles[n]) == NULL) continue; db_print_file(fp, header); header = 0; } } } #endif SYSCTL_INT(_kern, KERN_MAXFILESPERPROC, maxfilesperproc, CTLFLAG_RW, &maxfilesperproc, 0, "Maximum files allowed open per process"); SYSCTL_INT(_kern, KERN_MAXFILES, maxfiles, CTLFLAG_RW, &maxfiles, 0, "Maximum number of files"); SYSCTL_INT(_kern, OID_AUTO, openfiles, CTLFLAG_RD, __DEVOLATILE(int *, &openfiles), 0, "System-wide number of open files"); /* ARGSUSED*/ static void filelistinit(void *dummy) { file_zone = uma_zcreate("Files", sizeof(struct file), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); mtx_init(&sigio_lock, "sigio lock", NULL, MTX_DEF); mtx_init(&fdesc_mtx, "fdesc", NULL, MTX_DEF); } SYSINIT(select, SI_SUB_LOCK, SI_ORDER_FIRST, filelistinit, NULL); /*-------------------------------------------------------------------*/ static int badfo_readwrite(struct file *fp, struct uio *uio, struct ucred *active_cred, int flags, struct thread *td) { return (EBADF); } static int badfo_truncate(struct file *fp, off_t length, struct ucred *active_cred, struct thread *td) { return (EINVAL); } static int badfo_ioctl(struct file *fp, u_long com, void *data, struct ucred *active_cred, struct thread *td) { return (EBADF); } static int badfo_poll(struct file *fp, int events, struct ucred *active_cred, struct thread *td) { return (0); } static int badfo_kqfilter(struct file *fp, struct knote *kn) { return (EBADF); } static int badfo_stat(struct file *fp, struct stat *sb, struct ucred *active_cred, struct thread *td) { return (EBADF); } static int badfo_close(struct file *fp, struct thread *td) { return (EBADF); } struct fileops badfileops = { .fo_read = badfo_readwrite, .fo_write = badfo_readwrite, .fo_truncate = badfo_truncate, .fo_ioctl = badfo_ioctl, .fo_poll = badfo_poll, .fo_kqfilter = badfo_kqfilter, .fo_stat = badfo_stat, .fo_close = badfo_close, }; /*-------------------------------------------------------------------*/ /* * File Descriptor pseudo-device driver (/dev/fd/). * * Opening minor device N dup()s the file (if any) connected to file * descriptor N belonging to the calling process. Note that this driver * consists of only the ``open()'' routine, because all subsequent * references to this file will be direct to the other driver. * * XXX: we could give this one a cloning event handler if necessary. */ /* ARGSUSED */ static int fdopen(struct cdev *dev, int mode, int type, struct thread *td) { /* * XXX Kludge: set curthread->td_dupfd to contain the value of the * the file descriptor being sought for duplication. The error * return ensures that the vnode for this device will be released * by vn_open. Open will detect this special error and take the * actions in dupfdopen below. Other callers of vn_open or VOP_OPEN * will simply report the error. 
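 *
 * A hypothetical userland sketch (not part of this file) of the effect:
 * opening /dev/fd/N gives the caller a new descriptor referring to its
 * own descriptor N, roughly as dup(N) would, provided the requested open
 * mode is compatible with how N is already open.
 */
#if 0	/* illustrative sketch only; not built */
#include <fcntl.h>

static int
reopen_stdin(void)
{
	/* New fd sharing the same open file as descriptor 0. */
	return (open("/dev/fd/0", O_RDONLY));
}
#endif
/*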
*/ td->td_dupfd = dev2unit(dev); return (ENODEV); } static struct cdevsw fildesc_cdevsw = { .d_version = D_VERSION, .d_open = fdopen, .d_name = "FD", }; static void fildesc_drvinit(void *unused) { struct cdev *dev; dev = make_dev(&fildesc_cdevsw, 0, UID_ROOT, GID_WHEEL, 0666, "fd/0"); make_dev_alias(dev, "stdin"); dev = make_dev(&fildesc_cdevsw, 1, UID_ROOT, GID_WHEEL, 0666, "fd/1"); make_dev_alias(dev, "stdout"); dev = make_dev(&fildesc_cdevsw, 2, UID_ROOT, GID_WHEEL, 0666, "fd/2"); make_dev_alias(dev, "stderr"); } SYSINIT(fildescdev, SI_SUB_DRIVERS, SI_ORDER_MIDDLE, fildesc_drvinit, NULL); Index: head/sys/kern/kern_exit.c =================================================================== --- head/sys/kern/kern_exit.c (revision 192894) +++ head/sys/kern/kern_exit.c (revision 192895) @@ -1,924 +1,923 @@ /*- * Copyright (c) 1982, 1986, 1989, 1991, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * @(#)kern_exit.c 8.7 (Berkeley) 2/12/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_kdtrace.h" #include "opt_ktrace.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* for acct_process() function prototype */ #include #include #include #include #include #ifdef KTRACE #include #endif #include #include #include #include #include #include #include #include #include #ifdef KDTRACE_HOOKS #include dtrace_execexit_func_t dtrace_fasttrap_exit; #endif SDT_PROVIDER_DECLARE(proc); SDT_PROBE_DEFINE(proc, kernel, , exit); SDT_PROBE_ARGTYPE(proc, kernel, , exit, 0, "int"); /* Required to be non-static for SysVR4 emulator */ MALLOC_DEFINE(M_ZOMBIE, "zombie", "zombie proc status"); /* Hook for NFS teardown procedure. */ void (*nlminfo_release_p)(struct proc *p); /* * exit -- death of process. */ void sys_exit(struct thread *td, struct sys_exit_args *uap) { exit1(td, W_EXITCODE(uap->rval, 0)); /* NOTREACHED */ } /* * Exit: deallocate address space and other resources, change proc state to * zombie, and unlink proc from allproc and parent's lists. Save exit status * and rusage for wait(). Check for child processes and orphan them. */ void exit1(struct thread *td, int rv) { struct proc *p, *nq, *q; struct vnode *vtmp; struct vnode *ttyvp = NULL; #ifdef KTRACE struct vnode *tracevp; struct ucred *tracecred; #endif struct plimit *plim; int locked; mtx_assert(&Giant, MA_NOTOWNED); p = td->td_proc; if (p == initproc) { printf("init died (signal %d, exit %d)\n", WTERMSIG(rv), WEXITSTATUS(rv)); panic("Going nowhere without my init!"); } /* * MUST abort all other threads before proceeding past here. */ PROC_LOCK(p); while (p->p_flag & P_HADTHREADS) { /* * First check if some other thread got here before us.. * if so, act apropriatly, (exit or suspend); */ thread_suspend_check(0); /* * Kill off the other threads. This requires * some co-operation from other parts of the kernel * so it may not be instantaneous. With this state set * any thread entering the kernel from userspace will * thread_exit() in trap(). Any thread attempting to * sleep will return immediately with EINTR or EWOULDBLOCK * which will hopefully force them to back out to userland * freeing resources as they go. Any thread attempting * to return to userland will thread_exit() from userret(). * thread_exit() will unsuspend us when the last of the * other threads exits. * If there is already a thread singler after resumption, * calling thread_single will fail; in that case, we just * re-check all suspension request, the thread should * either be suspended there or exit. */ if (! thread_single(SINGLE_EXIT)) break; /* * All other activity in this process is now stopped. * Threading support has been turned off. */ } KASSERT(p->p_numthreads == 1, ("exit1: proc %p exiting with %d threads", p, p->p_numthreads)); /* * Wakeup anyone in procfs' PIOCWAIT. They should have a hold * on our vmspace, so we should block below until they have * released their reference to us. Note that if they have * requested S_EXIT stops we will block here until they ack * via PIOCCONT. */ _STOPEVENT(p, S_EXIT, rv); /* * Note that we are exiting and do another wakeup of anyone in * PIOCWAIT in case they aren't listening for S_EXIT stops or * decided to wait again after we told them we are exiting. 
*/ p->p_flag |= P_WEXIT; wakeup(&p->p_stype); /* * Wait for any processes that have a hold on our vmspace to * release their reference. */ while (p->p_lock > 0) msleep(&p->p_lock, &p->p_mtx, PWAIT, "exithold", 0); PROC_UNLOCK(p); /* Drain the limit callout while we don't have the proc locked */ callout_drain(&p->p_limco); #ifdef AUDIT /* * The Sun BSM exit token contains two components: an exit status as * passed to exit(), and a return value to indicate what sort of exit * it was. The exit status is WEXITSTATUS(rv), but it's not clear * what the return value is. */ AUDIT_ARG(exit, WEXITSTATUS(rv), 0); AUDIT_SYSCALL_EXIT(0, td); #endif /* Are we a task leader? */ if (p == p->p_leader) { mtx_lock(&ppeers_lock); q = p->p_peers; while (q != NULL) { PROC_LOCK(q); psignal(q, SIGKILL); PROC_UNLOCK(q); q = q->p_peers; } while (p->p_peers != NULL) msleep(p, &ppeers_lock, PWAIT, "exit1", 0); mtx_unlock(&ppeers_lock); } /* * Check if any loadable modules need anything done at process exit. * E.g. SYSV IPC stuff * XXX what if one of these generates an error? */ EVENTHANDLER_INVOKE(process_exit, p); /* * If parent is waiting for us to exit or exec, * P_PPWAIT is set; we will wakeup the parent below. */ PROC_LOCK(p); stopprofclock(p); p->p_flag &= ~(P_TRACED | P_PPWAIT); /* * Stop the real interval timer. If the handler is currently * executing, prevent it from rearming itself and let it finish. */ if (timevalisset(&p->p_realtimer.it_value) && callout_stop(&p->p_itcallout) == 0) { timevalclear(&p->p_realtimer.it_interval); msleep(&p->p_itcallout, &p->p_mtx, PWAIT, "ritwait", 0); KASSERT(!timevalisset(&p->p_realtimer.it_value), ("realtime timer is still armed")); } PROC_UNLOCK(p); /* * Reset any sigio structures pointing to us as a result of * F_SETOWN with our pid. */ funsetownlst(&p->p_sigiolst); /* * If this process has an nlminfo data area (for lockd), release it */ if (nlminfo_release_p != NULL && p->p_nlminfo != NULL) (*nlminfo_release_p)(p); /* * Close open files and release open-file table. * This may block! */ fdfree(td); /* * If this thread tickled GEOM, we need to wait for the giggling to * stop before we return to userland */ if (td->td_pflags & TDP_GEOM) g_waitidle(); /* * Remove ourself from our leader's peer list and wake our leader. */ mtx_lock(&ppeers_lock); if (p->p_leader->p_peers) { q = p->p_leader; while (q->p_peers != p) q = q->p_peers; q->p_peers = p->p_peers; wakeup(p->p_leader); } mtx_unlock(&ppeers_lock); vmspace_exit(td); sx_xlock(&proctree_lock); if (SESS_LEADER(p)) { struct session *sp; sp = p->p_session; SESS_LOCK(sp); ttyvp = sp->s_ttyvp; sp->s_ttyvp = NULL; SESS_UNLOCK(sp); if (ttyvp != NULL) { /* * Controlling process. * Signal foreground pgrp and revoke access to * controlling terminal. * * There is no need to drain the terminal here, * because this will be done on revocation. */ if (sp->s_ttyp != NULL) { struct tty *tp = sp->s_ttyp; tty_lock(tp); tty_signal_pgrp(tp, SIGHUP); tty_unlock(tp); /* * The tty could have been revoked * if we blocked. */ if (ttyvp->v_type != VBAD) { sx_xunlock(&proctree_lock); VOP_LOCK(ttyvp, LK_EXCLUSIVE); VOP_REVOKE(ttyvp, REVOKEALL); VOP_UNLOCK(ttyvp, 0); sx_xlock(&proctree_lock); } } /* * s_ttyp is not zero'd; we use this to indicate that * the session once had a controlling terminal. * (for logging and informational purposes) */ } SESS_LOCK(p->p_session); sp->s_leader = NULL; SESS_UNLOCK(p->p_session); } fixjobc(p, p->p_pgrp, 0); sx_xunlock(&proctree_lock); (void)acct_process(td); /* Release the TTY now we've unlocked everything. 
*/ if (ttyvp != NULL) vrele(ttyvp); #ifdef KTRACE /* * Disable tracing, then drain any pending records and release * the trace file. */ if (p->p_traceflag != 0) { PROC_LOCK(p); mtx_lock(&ktrace_mtx); p->p_traceflag = 0; mtx_unlock(&ktrace_mtx); PROC_UNLOCK(p); ktrprocexit(td); PROC_LOCK(p); mtx_lock(&ktrace_mtx); tracevp = p->p_tracevp; p->p_tracevp = NULL; tracecred = p->p_tracecred; p->p_tracecred = NULL; mtx_unlock(&ktrace_mtx); PROC_UNLOCK(p); if (tracevp != NULL) { locked = VFS_LOCK_GIANT(tracevp->v_mount); vrele(tracevp); VFS_UNLOCK_GIANT(locked); } if (tracecred != NULL) crfree(tracecred); } #endif /* * Release reference to text vnode */ if ((vtmp = p->p_textvp) != NULL) { p->p_textvp = NULL; locked = VFS_LOCK_GIANT(vtmp->v_mount); vrele(vtmp); VFS_UNLOCK_GIANT(locked); } /* * Release our limits structure. */ PROC_LOCK(p); plim = p->p_limit; p->p_limit = NULL; PROC_UNLOCK(p); lim_free(plim); /* * Remove proc from allproc queue and pidhash chain. * Place onto zombproc. Unlink from parent's child list. */ sx_xlock(&allproc_lock); LIST_REMOVE(p, p_list); LIST_INSERT_HEAD(&zombproc, p, p_list); LIST_REMOVE(p, p_hash); sx_xunlock(&allproc_lock); /* * Call machine-dependent code to release any * machine-dependent resources other than the address space. * The address space is released by "vmspace_exitfree(p)" in * vm_waitproc(). */ cpu_exit(td); WITNESS_WARN(WARN_PANIC, NULL, "process (pid %d) exiting", p->p_pid); /* * Reparent all of our children to init. */ sx_xlock(&proctree_lock); q = LIST_FIRST(&p->p_children); if (q != NULL) /* only need this if any child is S_ZOMB */ wakeup(initproc); for (; q != NULL; q = nq) { nq = LIST_NEXT(q, p_sibling); PROC_LOCK(q); proc_reparent(q, initproc); q->p_sigparent = SIGCHLD; /* * Traced processes are killed * since their existence means someone is screwing up. */ if (q->p_flag & P_TRACED) { struct thread *temp; q->p_flag &= ~(P_TRACED | P_STOPPED_TRACE); FOREACH_THREAD_IN_PROC(q, temp) temp->td_dbgflags &= ~TDB_SUSPEND; psignal(q, SIGKILL); } PROC_UNLOCK(q); } /* Save exit status. */ PROC_LOCK(p); p->p_xstat = rv; p->p_xthread = td; - /* In case we are jailed tell the prison that we are gone. */ - if (jailed(p->p_ucred)) - prison_proc_free(p->p_ucred->cr_prison); + /* Tell the prison that we are gone. */ + prison_proc_free(p->p_ucred->cr_prison); #ifdef KDTRACE_HOOKS /* * Tell the DTrace fasttrap provider about the exit if it * has declared an interest. */ if (dtrace_fasttrap_exit) dtrace_fasttrap_exit(p); #endif /* * Notify interested parties of our demise. */ KNOTE_LOCKED(&p->p_klist, NOTE_EXIT); #ifdef KDTRACE_HOOKS int reason = CLD_EXITED; if (WCOREDUMP(rv)) reason = CLD_DUMPED; else if (WIFSIGNALED(rv)) reason = CLD_KILLED; SDT_PROBE(proc, kernel, , exit, reason, 0, 0, 0, 0); #endif /* * Just delete all entries in the p_klist. At this point we won't * report any more events, and there are nasty race conditions that * can beat us if we don't. */ knlist_clear(&p->p_klist, 1); /* * Notify parent that we're gone. If parent has the PS_NOCLDWAIT * flag set, or if the handler is set to SIG_IGN, notify process * 1 instead (and hope it will handle this situation). 
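 *
 * A hypothetical userland sketch (not part of this file) of what that
 * policy looks like from the parent's side: with SIGCHLD set to SIG_IGN,
 * exiting children are handed to init instead of being kept as zombies
 * for the parent, so wait(2) eventually fails with ECHILD.
 */
#if 0	/* illustrative sketch only; not built */
#include <sys/wait.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

static int
spawn_without_reaping(void)
{
	(void) signal(SIGCHLD, SIG_IGN);
	if (fork() == 0)
		_exit(0);		/* child is reparented to init */
	return (wait(NULL) == -1 && errno == ECHILD);
}
#endif
/*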
*/ PROC_LOCK(p->p_pptr); mtx_lock(&p->p_pptr->p_sigacts->ps_mtx); if (p->p_pptr->p_sigacts->ps_flag & (PS_NOCLDWAIT | PS_CLDSIGIGN)) { struct proc *pp; mtx_unlock(&p->p_pptr->p_sigacts->ps_mtx); pp = p->p_pptr; PROC_UNLOCK(pp); proc_reparent(p, initproc); p->p_sigparent = SIGCHLD; PROC_LOCK(p->p_pptr); /* * Notify parent, so in case he was wait(2)ing or * executing waitpid(2) with our pid, he will * continue. */ wakeup(pp); } else mtx_unlock(&p->p_pptr->p_sigacts->ps_mtx); if (p->p_pptr == initproc) psignal(p->p_pptr, SIGCHLD); else if (p->p_sigparent != 0) { if (p->p_sigparent == SIGCHLD) childproc_exited(p); else /* LINUX thread */ psignal(p->p_pptr, p->p_sigparent); } sx_xunlock(&proctree_lock); /* * The state PRS_ZOMBIE prevents other proesses from sending * signal to the process, to avoid memory leak, we free memory * for signal queue at the time when the state is set. */ sigqueue_flush(&p->p_sigqueue); sigqueue_flush(&td->td_sigqueue); /* * We have to wait until after acquiring all locks before * changing p_state. We need to avoid all possible context * switches (including ones from blocking on a mutex) while * marked as a zombie. We also have to set the zombie state * before we release the parent process' proc lock to avoid * a lost wakeup. So, we first call wakeup, then we grab the * sched lock, update the state, and release the parent process' * proc lock. */ wakeup(p->p_pptr); cv_broadcast(&p->p_pwait); sched_exit(p->p_pptr, td); PROC_SLOCK(p); p->p_state = PRS_ZOMBIE; PROC_UNLOCK(p->p_pptr); /* * Hopefully no one will try to deliver a signal to the process this * late in the game. */ knlist_destroy(&p->p_klist); /* * Save our children's rusage information in our exit rusage. */ ruadd(&p->p_ru, &p->p_rux, &p->p_stats->p_cru, &p->p_crux); /* * Make sure the scheduler takes this thread out of its tables etc. * This will also release this thread's reference to the ucred. * Other thread parts to release include pcb bits and such. */ thread_exit(); } #ifndef _SYS_SYSPROTO_H_ struct abort2_args { char *why; int nargs; void **args; }; #endif int abort2(struct thread *td, struct abort2_args *uap) { struct proc *p = td->td_proc; struct sbuf *sb; void *uargs[16]; int error, i, sig; /* * Do it right now so we can log either proper call of abort2(), or * note, that invalid argument was passed. 512 is big enough to * handle 16 arguments' descriptions with additional comments. */ sb = sbuf_new(NULL, NULL, 512, SBUF_FIXEDLEN); sbuf_clear(sb); sbuf_printf(sb, "%s(pid %d uid %d) aborted: ", p->p_comm, p->p_pid, td->td_ucred->cr_uid); /* * Since we can't return from abort2(), send SIGKILL in cases, where * abort2() was called improperly */ sig = SIGKILL; /* Prevent from DoSes from user-space. */ if (uap->nargs < 0 || uap->nargs > 16) goto out; if (uap->nargs > 0) { if (uap->args == NULL) goto out; error = copyin(uap->args, uargs, uap->nargs * sizeof(void *)); if (error != 0) goto out; } /* * Limit size of 'reason' string to 128. Will fit even when * maximal number of arguments was chosen to be logged. */ if (uap->why != NULL) { error = sbuf_copyin(sb, uap->why, 128); if (error < 0) goto out; } else { sbuf_printf(sb, "(null)"); } if (uap->nargs > 0) { sbuf_printf(sb, "("); for (i = 0;i < uap->nargs; i++) sbuf_printf(sb, "%s%p", i == 0 ? "" : ", ", uargs[i]); sbuf_printf(sb, ")"); } /* * Final stage: arguments were proper, string has been * successfully copied from userspace, and copying pointers * from user-space succeed. 
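 *
 * A hypothetical userland sketch (not part of this file) of a well-formed
 * abort2(2) call as validated above: a short reason string plus at most
 * 16 pointer arguments, after which the process dies with SIGABRT.
 */
#if 0	/* illustrative sketch only; not built */
#include <stdlib.h>

static void
fatal_invariant(void *offending_ptr)
{
	void *args[1];

	args[0] = offending_ptr;
	abort2("invariant violated", 1, args);
}
#endif
/*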
*/ sig = SIGABRT; out: if (sig == SIGKILL) { sbuf_trim(sb); sbuf_printf(sb, " (Reason text inaccessible)"); } sbuf_cat(sb, "\n"); sbuf_finish(sb); log(LOG_INFO, "%s", sbuf_data(sb)); sbuf_delete(sb); exit1(td, W_EXITCODE(0, sig)); return (0); } #ifdef COMPAT_43 /* * The dirty work is handled by kern_wait(). */ int owait(struct thread *td, struct owait_args *uap __unused) { int error, status; error = kern_wait(td, WAIT_ANY, &status, 0, NULL); if (error == 0) td->td_retval[1] = status; return (error); } #endif /* COMPAT_43 */ /* * The dirty work is handled by kern_wait(). */ int wait4(struct thread *td, struct wait_args *uap) { struct rusage ru, *rup; int error, status; if (uap->rusage != NULL) rup = &ru; else rup = NULL; error = kern_wait(td, uap->pid, &status, uap->options, rup); if (uap->status != NULL && error == 0) error = copyout(&status, uap->status, sizeof(status)); if (uap->rusage != NULL && error == 0) error = copyout(&ru, uap->rusage, sizeof(struct rusage)); return (error); } int kern_wait(struct thread *td, pid_t pid, int *status, int options, struct rusage *rusage) { struct proc *p, *q, *t; int error, nfound; AUDIT_ARG(pid, pid); q = td->td_proc; if (pid == 0) { PROC_LOCK(q); pid = -q->p_pgid; PROC_UNLOCK(q); } if (options &~ (WUNTRACED|WNOHANG|WCONTINUED|WNOWAIT|WLINUXCLONE)) return (EINVAL); loop: if (q->p_flag & P_STATCHILD) { PROC_LOCK(q); q->p_flag &= ~P_STATCHILD; PROC_UNLOCK(q); } nfound = 0; sx_xlock(&proctree_lock); LIST_FOREACH(p, &q->p_children, p_sibling) { PROC_LOCK(p); if (pid != WAIT_ANY && p->p_pid != pid && p->p_pgid != -pid) { PROC_UNLOCK(p); continue; } if (p_canwait(td, p)) { PROC_UNLOCK(p); continue; } /* * This special case handles a kthread spawned by linux_clone * (see linux_misc.c). The linux_wait4 and linux_waitpid * functions need to be able to distinguish between waiting * on a process and waiting on a thread. It is a thread if * p_sigparent is not SIGCHLD, and the WLINUXCLONE option * signifies we want to wait for threads and not processes. */ if ((p->p_sigparent != SIGCHLD) ^ ((options & WLINUXCLONE) != 0)) { PROC_UNLOCK(p); continue; } nfound++; PROC_SLOCK(p); if (p->p_state == PRS_ZOMBIE) { INIT_VPROCG(P_TO_VPROCG(p)); if (rusage) { *rusage = p->p_ru; calcru(p, &rusage->ru_utime, &rusage->ru_stime); } PROC_SUNLOCK(p); td->td_retval[0] = p->p_pid; if (status) *status = p->p_xstat; /* convert to int */ if (options & WNOWAIT) { /* * Only poll, returning the status. * Caller does not wish to release the proc * struct just yet. */ PROC_UNLOCK(p); sx_xunlock(&proctree_lock); return (0); } PROC_LOCK(q); sigqueue_take(p->p_ksi); PROC_UNLOCK(q); PROC_UNLOCK(p); /* * If we got the child via a ptrace 'attach', * we need to give it back to the old parent. */ if (p->p_oppid && (t = pfind(p->p_oppid)) != NULL) { PROC_LOCK(p); p->p_oppid = 0; proc_reparent(p, t); PROC_UNLOCK(p); tdsignal(t, NULL, SIGCHLD, p->p_ksi); wakeup(t); cv_broadcast(&p->p_pwait); PROC_UNLOCK(t); sx_xunlock(&proctree_lock); return (0); } /* * Remove other references to this process to ensure * we have an exclusive reference. */ sx_xlock(&allproc_lock); LIST_REMOVE(p, p_list); /* off zombproc */ sx_xunlock(&allproc_lock); LIST_REMOVE(p, p_sibling); leavepgrp(p); sx_xunlock(&proctree_lock); /* * As a side effect of this lock, we know that * all other writes to this proc are visible now, so * no more locking is needed for p. */ PROC_LOCK(p); p->p_xstat = 0; /* XXX: why? 
*/ PROC_UNLOCK(p); PROC_LOCK(q); ruadd(&q->p_stats->p_cru, &q->p_crux, &p->p_ru, &p->p_rux); PROC_UNLOCK(q); /* * Decrement the count of procs running with this uid. */ (void)chgproccnt(p->p_ucred->cr_ruidinfo, -1, 0); /* * Free credentials, arguments, and sigacts. */ crfree(p->p_ucred); p->p_ucred = NULL; pargs_drop(p->p_args); p->p_args = NULL; sigacts_free(p->p_sigacts); p->p_sigacts = NULL; /* * Do any thread-system specific cleanups. */ thread_wait(p); /* * Give vm and machine-dependent layer a chance * to free anything that cpu_exit couldn't * release while still running in process context. */ vm_waitproc(p); #ifdef MAC mac_proc_destroy(p); #endif KASSERT(FIRST_THREAD_IN_PROC(p), ("kern_wait: no residual thread!")); uma_zfree(proc_zone, p); sx_xlock(&allproc_lock); nprocs--; #ifdef VIMAGE vprocg->nprocs--; #endif sx_xunlock(&allproc_lock); return (0); } if ((p->p_flag & P_STOPPED_SIG) && (p->p_suspcount == p->p_numthreads) && (p->p_flag & P_WAITED) == 0 && (p->p_flag & P_TRACED || options & WUNTRACED)) { PROC_SUNLOCK(p); p->p_flag |= P_WAITED; sx_xunlock(&proctree_lock); td->td_retval[0] = p->p_pid; if (status) *status = W_STOPCODE(p->p_xstat); PROC_LOCK(q); sigqueue_take(p->p_ksi); PROC_UNLOCK(q); PROC_UNLOCK(p); return (0); } PROC_SUNLOCK(p); if (options & WCONTINUED && (p->p_flag & P_CONTINUED)) { sx_xunlock(&proctree_lock); td->td_retval[0] = p->p_pid; p->p_flag &= ~P_CONTINUED; PROC_LOCK(q); sigqueue_take(p->p_ksi); PROC_UNLOCK(q); PROC_UNLOCK(p); if (status) *status = SIGCONT; return (0); } PROC_UNLOCK(p); } if (nfound == 0) { sx_xunlock(&proctree_lock); return (ECHILD); } if (options & WNOHANG) { sx_xunlock(&proctree_lock); td->td_retval[0] = 0; return (0); } PROC_LOCK(q); sx_xunlock(&proctree_lock); if (q->p_flag & P_STATCHILD) { q->p_flag &= ~P_STATCHILD; error = 0; } else error = msleep(q, &q->p_mtx, PWAIT | PCATCH, "wait", 0); PROC_UNLOCK(q); if (error) return (error); goto loop; } /* * Make process 'parent' the new parent of process 'child'. * Must be called with an exclusive hold of proctree lock. */ void proc_reparent(struct proc *child, struct proc *parent) { sx_assert(&proctree_lock, SX_XLOCKED); PROC_LOCK_ASSERT(child, MA_OWNED); if (child->p_pptr == parent) return; PROC_LOCK(child->p_pptr); sigqueue_take(child->p_ksi); PROC_UNLOCK(child->p_pptr); LIST_REMOVE(child, p_sibling); LIST_INSERT_HEAD(&parent->p_children, child, p_sibling); child->p_pptr = parent; } Index: head/sys/kern/kern_fork.c =================================================================== --- head/sys/kern/kern_fork.c (revision 192894) +++ head/sys/kern/kern_fork.c (revision 192895) @@ -1,864 +1,863 @@ /*- * Copyright (c) 1982, 1986, 1989, 1991, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_fork.c 8.6 (Berkeley) 4/8/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_kdtrace.h" #include "opt_ktrace.h" #include "opt_mac.h" #include #include #include #include #include +#include #include #include #include #include #include #include #include #include -#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef KDTRACE_HOOKS #include dtrace_fork_func_t dtrace_fasttrap_fork; #endif SDT_PROVIDER_DECLARE(proc); SDT_PROBE_DEFINE(proc, kernel, , create); SDT_PROBE_ARGTYPE(proc, kernel, , create, 0, "struct proc *"); SDT_PROBE_ARGTYPE(proc, kernel, , create, 1, "struct proc *"); SDT_PROBE_ARGTYPE(proc, kernel, , create, 2, "int"); #ifndef _SYS_SYSPROTO_H_ struct fork_args { int dummy; }; #endif /* ARGSUSED */ int fork(td, uap) struct thread *td; struct fork_args *uap; { int error; struct proc *p2; error = fork1(td, RFFDG | RFPROC, 0, &p2); if (error == 0) { td->td_retval[0] = p2->p_pid; td->td_retval[1] = 0; } return (error); } /* ARGSUSED */ int vfork(td, uap) struct thread *td; struct vfork_args *uap; { int error, flags; struct proc *p2; #ifdef XEN flags = RFFDG | RFPROC; /* validate that this is still an issue */ #else flags = RFFDG | RFPROC | RFPPWAIT | RFMEM; #endif error = fork1(td, flags, 0, &p2); if (error == 0) { td->td_retval[0] = p2->p_pid; td->td_retval[1] = 0; } return (error); } int rfork(td, uap) struct thread *td; struct rfork_args *uap; { struct proc *p2; int error; /* Don't allow kernel-only flags. */ if ((uap->flags & RFKERNELONLY) != 0) return (EINVAL); AUDIT_ARG(fflags, uap->flags); error = fork1(td, uap->flags, 0, &p2); if (error == 0) { td->td_retval[0] = p2 ? p2->p_pid : 0; td->td_retval[1] = 0; } return (error); } int nprocs = 1; /* process 0 */ int lastpid = 0; SYSCTL_INT(_kern, OID_AUTO, lastpid, CTLFLAG_RD, &lastpid, 0, "Last used PID"); /* * Random component to lastpid generation. We mix in a random factor to make * it a little harder to predict. We sanity check the modulus value to avoid * doing it in critical paths. Don't let it be too small or we pointlessly * waste randomness entropy, and don't let it be impossibly large. Using a * modulus that is too big causes a LOT more process table scans and slows * down fork processing as the pidchecked caching is defeated. 
*/ static int randompid = 0; static int sysctl_kern_randompid(SYSCTL_HANDLER_ARGS) { int error, pid; error = sysctl_wire_old_buffer(req, sizeof(int)); if (error != 0) return(error); sx_xlock(&allproc_lock); pid = randompid; error = sysctl_handle_int(oidp, &pid, 0, req); if (error == 0 && req->newptr != NULL) { if (pid < 0 || pid > PID_MAX - 100) /* out of range */ pid = PID_MAX - 100; else if (pid < 2) /* NOP */ pid = 0; else if (pid < 100) /* Make it reasonable */ pid = 100; randompid = pid; } sx_xunlock(&allproc_lock); return (error); } SYSCTL_PROC(_kern, OID_AUTO, randompid, CTLTYPE_INT|CTLFLAG_RW, 0, 0, sysctl_kern_randompid, "I", "Random PID modulus"); int fork1(td, flags, pages, procp) struct thread *td; int flags; int pages; struct proc **procp; { struct proc *p1, *p2, *pptr; struct proc *newproc; int ok, trypid; static int curfail, pidchecked = 0; static struct timeval lastfail; struct filedesc *fd; struct filedesc_to_leader *fdtol; struct thread *td2; struct sigacts *newsigacts; struct vmspace *vm2; int error; /* Can't copy and clear. */ if ((flags & (RFFDG|RFCFDG)) == (RFFDG|RFCFDG)) return (EINVAL); p1 = td->td_proc; /* * Here we don't create a new process, but we divorce * certain parts of a process from itself. */ if ((flags & RFPROC) == 0) { if (((p1->p_flag & (P_HADTHREADS|P_SYSTEM)) == P_HADTHREADS) && (flags & (RFCFDG | RFFDG))) { PROC_LOCK(p1); if (thread_single(SINGLE_BOUNDARY)) { PROC_UNLOCK(p1); return (ERESTART); } PROC_UNLOCK(p1); } error = vm_forkproc(td, NULL, NULL, NULL, flags); if (error) goto norfproc_fail; /* * Close all file descriptors. */ if (flags & RFCFDG) { struct filedesc *fdtmp; fdtmp = fdinit(td->td_proc->p_fd); fdfree(td); p1->p_fd = fdtmp; } /* * Unshare file descriptors (from parent). */ if (flags & RFFDG) fdunshare(p1, td); norfproc_fail: if (((p1->p_flag & (P_HADTHREADS|P_SYSTEM)) == P_HADTHREADS) && (flags & (RFCFDG | RFFDG))) { PROC_LOCK(p1); thread_single_end(); PROC_UNLOCK(p1); } *procp = NULL; return (error); } /* * XXX * We did have single-threading code here * however it proved un-needed and caused problems */ vm2 = NULL; /* Allocate new proc. */ newproc = uma_zalloc(proc_zone, M_WAITOK); if (TAILQ_EMPTY(&newproc->p_threads)) { td2 = thread_alloc(); if (td2 == NULL) { error = ENOMEM; goto fail1; } proc_linkup(newproc, td2); } else td2 = FIRST_THREAD_IN_PROC(newproc); /* Allocate and switch to an alternate kstack if specified. */ if (pages != 0) { if (!vm_thread_new_altkstack(td2, pages)) { error = ENOMEM; goto fail1; } } if ((flags & RFMEM) == 0) { vm2 = vmspace_fork(p1->p_vmspace); if (vm2 == NULL) { error = ENOMEM; goto fail1; } } #ifdef MAC mac_proc_init(newproc); #endif knlist_init(&newproc->p_klist, &newproc->p_mtx, NULL, NULL, NULL); STAILQ_INIT(&newproc->p_ktr); /* We have to lock the process tree while we look for a pid. */ sx_slock(&proctree_lock); /* * Although process entries are dynamically created, we still keep * a global limit on the maximum number we will create. Don't allow * a nonprivileged user to use the last ten processes; don't let root * exceed the limit. The variable nprocs is the current number of * processes, maxproc is the limit. */ sx_xlock(&allproc_lock); if ((nprocs >= maxproc - 10 && priv_check_cred(td->td_ucred, PRIV_MAXPROC, 0) != 0) || nprocs >= maxproc) { error = EAGAIN; goto fail; } /* * Increment the count of procs running with this uid. Don't allow * a nonprivileged user to exceed their current limit. * * XXXRW: Can we avoid privilege here if it's not needed? 
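Taken together, the clamping in sysctl_kern_randompid() above and the PID selection a little further down in fork1() amount to the following stand-alone sketch (PID_MAX and arc4random() as in the kernel; the helper names are made up for illustration):

/*
 * Illustrative only: sanitize the randompid modulus and apply it to the
 * next candidate PID, mirroring sysctl_kern_randompid() and fork1().
 */
static int
clamp_randompid(int pid)
{
        if (pid < 0 || pid > PID_MAX - 100)     /* out of range */
                return (PID_MAX - 100);
        if (pid < 2)                            /* effectively off */
                return (0);
        if (pid < 100)                          /* keep the modulus useful */
                return (100);
        return (pid);
}

static int
next_trypid(int lastpid, int randompid)
{
        int trypid;

        trypid = lastpid + 1;
        if (randompid != 0)
                trypid += arc4random() % randompid;
        return (trypid);
}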
*/ error = priv_check_cred(td->td_ucred, PRIV_PROC_LIMIT, 0); if (error == 0) ok = chgproccnt(td->td_ucred->cr_ruidinfo, 1, 0); else { PROC_LOCK(p1); ok = chgproccnt(td->td_ucred->cr_ruidinfo, 1, lim_cur(p1, RLIMIT_NPROC)); PROC_UNLOCK(p1); } if (!ok) { error = EAGAIN; goto fail; } /* * Increment the nprocs resource before blocking can occur. There * are hard-limits as to the number of processes that can run. */ nprocs++; #ifdef VIMAGE P_TO_VPROCG(p1)->nprocs++; #endif /* * Find an unused process ID. We remember a range of unused IDs * ready to use (from lastpid+1 through pidchecked-1). * * If RFHIGHPID is set (used during system boot), do not allocate * low-numbered pids. */ trypid = lastpid + 1; if (flags & RFHIGHPID) { if (trypid < 10) trypid = 10; } else { if (randompid) trypid += arc4random() % randompid; } retry: /* * If the process ID prototype has wrapped around, * restart somewhat above 0, as the low-numbered procs * tend to include daemons that don't exit. */ if (trypid >= PID_MAX) { trypid = trypid % PID_MAX; if (trypid < 100) trypid += 100; pidchecked = 0; } if (trypid >= pidchecked) { int doingzomb = 0; pidchecked = PID_MAX; /* * Scan the active and zombie procs to check whether this pid * is in use. Remember the lowest pid that's greater * than trypid, so we can avoid checking for a while. */ p2 = LIST_FIRST(&allproc); again: for (; p2 != NULL; p2 = LIST_NEXT(p2, p_list)) { while (p2->p_pid == trypid || (p2->p_pgrp != NULL && (p2->p_pgrp->pg_id == trypid || (p2->p_session != NULL && p2->p_session->s_sid == trypid)))) { trypid++; if (trypid >= pidchecked) goto retry; } if (p2->p_pid > trypid && pidchecked > p2->p_pid) pidchecked = p2->p_pid; if (p2->p_pgrp != NULL) { if (p2->p_pgrp->pg_id > trypid && pidchecked > p2->p_pgrp->pg_id) pidchecked = p2->p_pgrp->pg_id; if (p2->p_session != NULL && p2->p_session->s_sid > trypid && pidchecked > p2->p_session->s_sid) pidchecked = p2->p_session->s_sid; } } if (!doingzomb) { doingzomb = 1; p2 = LIST_FIRST(&zombproc); goto again; } } sx_sunlock(&proctree_lock); /* * RFHIGHPID does not mess with the lastpid counter during boot. */ if (flags & RFHIGHPID) pidchecked = 0; else lastpid = trypid; p2 = newproc; p2->p_state = PRS_NEW; /* protect against others */ p2->p_pid = trypid; /* * Allow the scheduler to initialize the child. */ thread_lock(td); sched_fork(td, td2); thread_unlock(td); AUDIT_ARG(pid, p2->p_pid); LIST_INSERT_HEAD(&allproc, p2, p_list); LIST_INSERT_HEAD(PIDHASH(p2->p_pid), p2, p_hash); PROC_LOCK(p2); PROC_LOCK(p1); sx_xunlock(&allproc_lock); bcopy(&p1->p_startcopy, &p2->p_startcopy, __rangeof(struct proc, p_startcopy, p_endcopy)); pargs_hold(p2->p_args); PROC_UNLOCK(p1); bzero(&p2->p_startzero, __rangeof(struct proc, p_startzero, p_endzero)); p2->p_ucred = crhold(td->td_ucred); - /* In case we are jailed tell the prison that we exist. */ - if (jailed(p2->p_ucred)) - prison_proc_hold(p2->p_ucred->cr_prison); + /* Tell the prison that we exist. */ + prison_proc_hold(p2->p_ucred->cr_prison); PROC_UNLOCK(p2); /* * Malloc things while we don't hold any locks. */ if (flags & RFSIGSHARE) newsigacts = NULL; else newsigacts = sigacts_alloc(); /* * Copy filedesc. */ if (flags & RFCFDG) { fd = fdinit(p1->p_fd); fdtol = NULL; } else if (flags & RFFDG) { fd = fdcopy(p1->p_fd); fdtol = NULL; } else { fd = fdshare(p1->p_fd); if (p1->p_fdtol == NULL) p1->p_fdtol = filedesc_to_leader_alloc(NULL, NULL, p1->p_leader); if ((flags & RFTHREAD) != 0) { /* * Shared file descriptor table and * shared process leaders. 
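For quick reference, the descriptor-table handling just above reduces to a three-way choice on the rfork(2) flags, sketched here with the same fdinit()/fdcopy()/fdshare() calls (the wrapper function itself is made up):

/*
 * Illustrative only: fork1()'s file descriptor table choice by flag.
 */
static struct filedesc *
fork_fd_choice(struct proc *p1, int flags)
{
        if (flags & RFCFDG)
                return (fdinit(p1->p_fd));      /* fresh, empty table */
        if (flags & RFFDG)
                return (fdcopy(p1->p_fd));      /* private copy of parent's */
        return (fdshare(p1->p_fd));             /* share the parent's table */
}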
*/ fdtol = p1->p_fdtol; FILEDESC_XLOCK(p1->p_fd); fdtol->fdl_refcount++; FILEDESC_XUNLOCK(p1->p_fd); } else { /* * Shared file descriptor table, and * different process leaders */ fdtol = filedesc_to_leader_alloc(p1->p_fdtol, p1->p_fd, p2); } } /* * Make a proc table entry for the new process. * Start by zeroing the section of proc that is zero-initialized, * then copy the section that is copied directly from the parent. */ PROC_LOCK(p2); PROC_LOCK(p1); bzero(&td2->td_startzero, __rangeof(struct thread, td_startzero, td_endzero)); bcopy(&td->td_startcopy, &td2->td_startcopy, __rangeof(struct thread, td_startcopy, td_endcopy)); bcopy(&p2->p_comm, &td2->td_name, sizeof(td2->td_name)); td2->td_sigstk = td->td_sigstk; td2->td_sigmask = td->td_sigmask; td2->td_flags = TDF_INMEM; #ifdef VIMAGE td2->td_vnet = NULL; td2->td_vnet_lpush = NULL; #endif /* * Duplicate sub-structures as needed. * Increase reference counts on shared objects. */ p2->p_flag = P_INMEM; p2->p_swtick = ticks; if (p1->p_flag & P_PROFIL) startprofclock(p2); td2->td_ucred = crhold(p2->p_ucred); if (flags & RFSIGSHARE) { p2->p_sigacts = sigacts_hold(p1->p_sigacts); } else { sigacts_copy(newsigacts, p1->p_sigacts); p2->p_sigacts = newsigacts; } if (flags & RFLINUXTHPN) p2->p_sigparent = SIGUSR1; else p2->p_sigparent = SIGCHLD; p2->p_textvp = p1->p_textvp; p2->p_fd = fd; p2->p_fdtol = fdtol; /* * p_limit is copy-on-write. Bump its refcount. */ lim_fork(p1, p2); pstats_fork(p1->p_stats, p2->p_stats); PROC_UNLOCK(p1); PROC_UNLOCK(p2); /* Bump references to the text vnode (for procfs) */ if (p2->p_textvp) vref(p2->p_textvp); /* * Set up linkage for kernel based threading. */ if ((flags & RFTHREAD) != 0) { mtx_lock(&ppeers_lock); p2->p_peers = p1->p_peers; p1->p_peers = p2; p2->p_leader = p1->p_leader; mtx_unlock(&ppeers_lock); PROC_LOCK(p1->p_leader); if ((p1->p_leader->p_flag & P_WEXIT) != 0) { PROC_UNLOCK(p1->p_leader); /* * The task leader is exiting, so process p1 is * going to be killed shortly. Since p1 obviously * isn't dead yet, we know that the leader is either * sending SIGKILL's to all the processes in this * task or is sleeping waiting for all the peers to * exit. We let p1 complete the fork, but we need * to go ahead and kill the new process p2 since * the task leader may not get a chance to send * SIGKILL to it. We leave it on the list so that * the task leader will wait for this new process * to commit suicide. */ PROC_LOCK(p2); psignal(p2, SIGKILL); PROC_UNLOCK(p2); } else PROC_UNLOCK(p1->p_leader); } else { p2->p_peers = NULL; p2->p_leader = p2; } sx_xlock(&proctree_lock); PGRP_LOCK(p1->p_pgrp); PROC_LOCK(p2); PROC_LOCK(p1); /* * Preserve some more flags in subprocess. P_PROFIL has already * been preserved. */ p2->p_flag |= p1->p_flag & P_SUGID; td2->td_pflags |= td->td_pflags & TDP_ALTSTACK; SESS_LOCK(p1->p_session); if (p1->p_session->s_ttyvp != NULL && p1->p_flag & P_CONTROLT) p2->p_flag |= P_CONTROLT; SESS_UNLOCK(p1->p_session); if (flags & RFPPWAIT) p2->p_flag |= P_PPWAIT; p2->p_pgrp = p1->p_pgrp; LIST_INSERT_AFTER(p1, p2, p_pglist); PGRP_UNLOCK(p1->p_pgrp); LIST_INIT(&p2->p_children); callout_init(&p2->p_itcallout, CALLOUT_MPSAFE); #ifdef KTRACE /* * Copy traceflag and tracefile if enabled. 
*/ mtx_lock(&ktrace_mtx); KASSERT(p2->p_tracevp == NULL, ("new process has a ktrace vnode")); if (p1->p_traceflag & KTRFAC_INHERIT) { p2->p_traceflag = p1->p_traceflag; if ((p2->p_tracevp = p1->p_tracevp) != NULL) { VREF(p2->p_tracevp); KASSERT(p1->p_tracecred != NULL, ("ktrace vnode with no cred")); p2->p_tracecred = crhold(p1->p_tracecred); } } mtx_unlock(&ktrace_mtx); #endif /* * If PF_FORK is set, the child process inherits the * procfs ioctl flags from its parent. */ if (p1->p_pfsflags & PF_FORK) { p2->p_stops = p1->p_stops; p2->p_pfsflags = p1->p_pfsflags; } #ifdef KDTRACE_HOOKS /* * Tell the DTrace fasttrap provider about the new process * if it has registered an interest. */ if (dtrace_fasttrap_fork) dtrace_fasttrap_fork(p1, p2); #endif /* * This begins the section where we must prevent the parent * from being swapped. */ _PHOLD(p1); PROC_UNLOCK(p1); /* * Attach the new process to its parent. * * If RFNOWAIT is set, the newly created process becomes a child * of init. This effectively disassociates the child from the * parent. */ if (flags & RFNOWAIT) pptr = initproc; else pptr = p1; p2->p_pptr = pptr; LIST_INSERT_HEAD(&pptr->p_children, p2, p_sibling); sx_xunlock(&proctree_lock); /* Inform accounting that we have forked. */ p2->p_acflag = AFORK; PROC_UNLOCK(p2); /* * Finish creating the child process. It will return via a different * execution path later. (ie: directly into user mode) */ vm_forkproc(td, p2, td2, vm2, flags); if (flags == (RFFDG | RFPROC)) { PCPU_INC(cnt.v_forks); PCPU_ADD(cnt.v_forkpages, p2->p_vmspace->vm_dsize + p2->p_vmspace->vm_ssize); } else if (flags == (RFFDG | RFPROC | RFPPWAIT | RFMEM)) { PCPU_INC(cnt.v_vforks); PCPU_ADD(cnt.v_vforkpages, p2->p_vmspace->vm_dsize + p2->p_vmspace->vm_ssize); } else if (p1 == &proc0) { PCPU_INC(cnt.v_kthreads); PCPU_ADD(cnt.v_kthreadpages, p2->p_vmspace->vm_dsize + p2->p_vmspace->vm_ssize); } else { PCPU_INC(cnt.v_rforks); PCPU_ADD(cnt.v_rforkpages, p2->p_vmspace->vm_dsize + p2->p_vmspace->vm_ssize); } /* * Both processes are set up, now check if any loadable modules want * to adjust anything. * What if they have an error? XXX */ EVENTHANDLER_INVOKE(process_fork, p1, p2, flags); /* * Set the child start time and mark the process as being complete. */ microuptime(&p2->p_stats->p_start); PROC_SLOCK(p2); p2->p_state = PRS_NORMAL; PROC_SUNLOCK(p2); /* * If RFSTOPPED not requested, make child runnable and add to * run queue. */ if ((flags & RFSTOPPED) == 0) { thread_lock(td2); TD_SET_CAN_RUN(td2); sched_add(td2, SRQ_BORING); thread_unlock(td2); } /* * Now can be swapped. */ PROC_LOCK(p1); _PRELE(p1); PROC_UNLOCK(p1); /* * Tell any interested parties about the new process. */ knote_fork(&p1->p_klist, p2->p_pid); SDT_PROBE(proc, kernel, , create, p2, p1, flags, 0, 0); /* * Preserve synchronization semantics of vfork. If waiting for * child to exec or exit, set P_PPWAIT on child, and sleep on our * proc (in case of exit). */ PROC_LOCK(p2); while (p2->p_flag & P_PPWAIT) cv_wait(&p2->p_pwait, &p2->p_mtx); PROC_UNLOCK(p2); /* * Return child proc pointer to parent. */ *procp = p2; return (0); fail: sx_sunlock(&proctree_lock); if (ppsratecheck(&lastfail, &curfail, 1)) printf("maxproc limit exceeded by uid %i, please see tuning(7) and login.conf(5).\n", td->td_ucred->cr_ruid); sx_xunlock(&allproc_lock); #ifdef MAC mac_proc_destroy(newproc); #endif fail1: if (vm2 != NULL) vmspace_free(vm2); uma_zfree(proc_zone, newproc); pause("fork", hz / 2); return (error); } /* * Handle the return of a child process from fork1(). 
This function * is called from the MD fork_trampoline() entry point. */ void fork_exit(callout, arg, frame) void (*callout)(void *, struct trapframe *); void *arg; struct trapframe *frame; { struct proc *p; struct thread *td; struct thread *dtd; td = curthread; p = td->td_proc; KASSERT(p->p_state == PRS_NORMAL, ("executing process is still new")); CTR4(KTR_PROC, "fork_exit: new thread %p (td_sched %p, pid %d, %s)", td, td->td_sched, p->p_pid, td->td_name); sched_fork_exit(td); /* * Processes normally resume in mi_switch() after being * cpu_switch()'ed to, but when children start up they arrive here * instead, so we must do much the same things as mi_switch() would. */ if ((dtd = PCPU_GET(deadthread))) { PCPU_SET(deadthread, NULL); thread_stash(dtd); } thread_unlock(td); /* * cpu_set_fork_handler intercepts this function call to * have this call a non-return function to stay in kernel mode. * initproc has its own fork handler, but it does return. */ KASSERT(callout != NULL, ("NULL callout in fork_exit")); callout(arg, frame); /* * Check if a kernel thread misbehaved and returned from its main * function. */ if (p->p_flag & P_KTHREAD) { printf("Kernel thread \"%s\" (pid %d) exited prematurely.\n", td->td_name, p->p_pid); kproc_exit(0); } mtx_assert(&Giant, MA_NOTOWNED); EVENTHANDLER_INVOKE(schedtail, p); } /* * Simplified back end of syscall(), used when returning from fork() * directly into user mode. Giant is not held on entry, and must not * be held on return. This function is passed in to fork_exit() as the * first parameter and is called when returning to a new userland process. */ void fork_return(td, frame) struct thread *td; struct trapframe *frame; { userret(td, frame); #ifdef KTRACE if (KTRPOINT(td, KTR_SYSRET)) ktrsysret(SYS_fork, 0, 0); #endif mtx_assert(&Giant, MA_NOTOWNED); } Index: head/sys/kern/kern_jail.c =================================================================== --- head/sys/kern/kern_jail.c (revision 192894) +++ head/sys/kern/kern_jail.c (revision 192895) @@ -1,2721 +1,3820 @@ /*- * Copyright (c) 1999 Poul-Henning Kamp. * Copyright (c) 2008 Bjoern A. Zeeb. * Copyright (c) 2009 James Gritton. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_inet.h" #include "opt_inet6.h" #include "opt_mac.h" #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include -#include #include #include #include #include #include #include #include #include #include #include #include #ifdef DDB #include #ifdef INET6 #include #endif /* INET6 */ #endif /* DDB */ #include MALLOC_DEFINE(M_PRISON, "prison", "Prison structures"); -SYSCTL_NODE(_security, OID_AUTO, jail, CTLFLAG_RW, 0, - "Jail rules"); +/* prison0 describes what is "real" about the system. */ +struct prison prison0 = { + .pr_id = 0, + .pr_name = "0", + .pr_ref = 1, + .pr_uref = 1, + .pr_path = "/", + .pr_securelevel = -1, + .pr_children = LIST_HEAD_INITIALIZER(&prison0.pr_children), + .pr_allow = PR_ALLOW_ALL, +}; +MTX_SYSINIT(prison0, &prison0.pr_mtx, "jail mutex", MTX_DEF); -int jail_set_hostname_allowed = 1; -SYSCTL_INT(_security_jail, OID_AUTO, set_hostname_allowed, CTLFLAG_RW, - &jail_set_hostname_allowed, 0, - "Processes in jail can set their hostnames"); - -int jail_socket_unixiproute_only = 1; -SYSCTL_INT(_security_jail, OID_AUTO, socket_unixiproute_only, CTLFLAG_RW, - &jail_socket_unixiproute_only, 0, - "Processes in jail are limited to creating UNIX/IP/route sockets only"); - -int jail_sysvipc_allowed = 0; -SYSCTL_INT(_security_jail, OID_AUTO, sysvipc_allowed, CTLFLAG_RW, - &jail_sysvipc_allowed, 0, - "Processes in jail can use System V IPC primitives"); - -static int jail_enforce_statfs = 2; -SYSCTL_INT(_security_jail, OID_AUTO, enforce_statfs, CTLFLAG_RW, - &jail_enforce_statfs, 0, - "Processes in jail cannot see all mounted file systems"); - -int jail_allow_raw_sockets = 0; -SYSCTL_INT(_security_jail, OID_AUTO, allow_raw_sockets, CTLFLAG_RW, - &jail_allow_raw_sockets, 0, - "Prison root can create raw sockets"); - -int jail_chflags_allowed = 0; -SYSCTL_INT(_security_jail, OID_AUTO, chflags_allowed, CTLFLAG_RW, - &jail_chflags_allowed, 0, - "Processes in jail can alter system file flags"); - -int jail_mount_allowed = 0; -SYSCTL_INT(_security_jail, OID_AUTO, mount_allowed, CTLFLAG_RW, - &jail_mount_allowed, 0, - "Processes in jail can mount/unmount jail-friendly file systems"); - -int jail_max_af_ips = 255; -SYSCTL_INT(_security_jail, OID_AUTO, jail_max_af_ips, CTLFLAG_RW, - &jail_max_af_ips, 0, - "Number of IP addresses a jail may have at most per address family"); - -/* allprison, lastprid, and prisoncount are protected by allprison_lock. */ +/* allprison and lastprid are protected by allprison_lock. 
*/ struct sx allprison_lock; SX_SYSINIT(allprison_lock, &allprison_lock, "allprison"); struct prisonlist allprison = TAILQ_HEAD_INITIALIZER(allprison); int lastprid = 0; -int prisoncount = 0; static int do_jail_attach(struct thread *td, struct prison *pr); static void prison_complete(void *context, int pending); static void prison_deref(struct prison *pr, int flags); +static char *prison_path(struct prison *pr1, struct prison *pr2); +static void prison_remove_one(struct prison *pr); #ifdef INET static int _prison_check_ip4(struct prison *pr, struct in_addr *ia); +static int prison_restrict_ip4(struct prison *pr, struct in_addr *newip4); #endif #ifdef INET6 static int _prison_check_ip6(struct prison *pr, struct in6_addr *ia6); +static int prison_restrict_ip6(struct prison *pr, struct in6_addr *newip6); #endif -static int sysctl_jail_list(SYSCTL_HANDLER_ARGS); /* Flags for prison_deref */ #define PD_DEREF 0x01 #define PD_DEUREF 0x02 #define PD_LOCKED 0x04 #define PD_LIST_SLOCKED 0x08 #define PD_LIST_XLOCKED 0x10 +/* + * Parameter names corresponding to PR_* flag values + */ +static char *pr_flag_names[] = { + [0] = "persist", #ifdef INET + [2] = "ip4", +#endif +#ifdef INET6 + [3] = "ip6", +#endif +}; + +static char *pr_flag_nonames[] = { + [0] = "nopersist", +#ifdef INET + [2] = "noip4", +#endif +#ifdef INET6 + [3] = "noip6", +#endif +}; + +static char *pr_allow_names[] = { + "allow.set_hostname", + "allow.sysvipc", + "allow.raw_sockets", + "allow.chflags", + "allow.mount", + "allow.quotas", + "allow.jails", + "allow.socket_af", +}; + +static char *pr_allow_nonames[] = { + "allow.noset_hostname", + "allow.nosysvipc", + "allow.noraw_sockets", + "allow.nochflags", + "allow.nomount", + "allow.noquotas", + "allow.nojails", + "allow.nosocket_af", +}; + +#define JAIL_DEFAULT_ALLOW PR_ALLOW_SET_HOSTNAME +static unsigned jail_default_allow = JAIL_DEFAULT_ALLOW; +static int jail_default_enforce_statfs = 2; +#if defined(INET) || defined(INET6) +static int jail_max_af_ips = 255; +#endif + +#ifdef INET static int qcmp_v4(const void *ip1, const void *ip2) { in_addr_t iaa, iab; /* * We need to compare in HBO here to get the list sorted as expected * by the result of the code. Sorting NBO addresses gives you * interesting results. If you do not understand, do not try. */ iaa = ntohl(((const struct in_addr *)ip1)->s_addr); iab = ntohl(((const struct in_addr *)ip2)->s_addr); /* * Do not simply return the difference of the two numbers, the int is * not wide enough. */ if (iaa > iab) return (1); else if (iaa < iab) return (-1); else return (0); } #endif #ifdef INET6 static int qcmp_v6(const void *ip1, const void *ip2) { const struct in6_addr *ia6a, *ia6b; int i, rc; ia6a = (const struct in6_addr *)ip1; ia6b = (const struct in6_addr *)ip2; rc = 0; for (i = 0; rc == 0 && i < sizeof(struct in6_addr); i++) { if (ia6a->s6_addr[i] > ia6b->s6_addr[i]) rc = 1; else if (ia6a->s6_addr[i] < ia6b->s6_addr[i]) rc = -1; } return (rc); } #endif /* * struct jail_args { * struct jail *jail; * }; */ int jail(struct thread *td, struct jail_args *uap) { - struct iovec optiov[10]; - struct uio opt; - char *u_path, *u_hostname, *u_name; -#ifdef INET - struct in_addr *u_ip4; -#endif -#ifdef INET6 - struct in6_addr *u_ip6; -#endif uint32_t version; int error; + struct jail j; error = copyin(uap->jail, &version, sizeof(uint32_t)); if (error) return (error); switch (version) { case 0: { - /* FreeBSD single IPv4 jails. */ struct jail_v0 j0; + /* FreeBSD single IPv4 jails. 
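Backing up to qcmp_v4() above, its warning that "the int is not wide enough" is easy to see with concrete numbers; a small stand-alone userland check with host-byte-order values (the sign of the converted value assumes the usual two's-complement int):

#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: why qcmp_v4() compares instead of returning
 * "iaa - iab".  255.255.255.254 is clearly greater than 0.0.0.1, but the
 * 32-bit subtraction wraps and the naive comparator would report the
 * opposite ordering.
 */
int
main(void)
{
        uint32_t iaa = 0xFFFFFFFEU;     /* 255.255.255.254 in host order */
        uint32_t iab = 0x00000001U;     /* 0.0.0.1 */

        assert(iaa > iab);
        assert((int)(iaa - iab) < 0);   /* the "difference" has the wrong sign */
        return (0);
}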
*/ + bzero(&j, sizeof(struct jail)); error = copyin(uap->jail, &j0, sizeof(struct jail_v0)); if (error) return (error); - u_path = malloc(MAXPATHLEN + MAXHOSTNAMELEN, M_TEMP, M_WAITOK); - u_hostname = u_path + MAXPATHLEN; - opt.uio_iov = optiov; - opt.uio_iovcnt = 4; - opt.uio_offset = -1; - opt.uio_resid = -1; - opt.uio_segflg = UIO_SYSSPACE; - opt.uio_rw = UIO_READ; - opt.uio_td = td; - optiov[0].iov_base = "path"; - optiov[0].iov_len = sizeof("path"); - optiov[1].iov_base = u_path; - error = - copyinstr(j0.path, u_path, MAXPATHLEN, &optiov[1].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - optiov[2].iov_base = "host.hostname"; - optiov[2].iov_len = sizeof("host.hostname"); - optiov[3].iov_base = u_hostname; - error = copyinstr(j0.hostname, u_hostname, MAXHOSTNAMELEN, - &optiov[3].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } -#ifdef INET - optiov[opt.uio_iovcnt].iov_base = "ip4.addr"; - optiov[opt.uio_iovcnt].iov_len = sizeof("ip4.addr"); - opt.uio_iovcnt++; - optiov[opt.uio_iovcnt].iov_base = &j0.ip_number; - j0.ip_number = htonl(j0.ip_number); - optiov[opt.uio_iovcnt].iov_len = sizeof(j0.ip_number); - opt.uio_iovcnt++; -#endif + j.version = j0.version; + j.path = j0.path; + j.hostname = j0.hostname; + j.ip4s = j0.ip_number; break; } case 1: /* * Version 1 was used by multi-IPv4 jail implementations * that never made it into the official kernel. */ return (EINVAL); case 2: /* JAIL_API_VERSION */ - { /* FreeBSD multi-IPv4/IPv6,noIP jails. */ - struct jail j; - size_t tmplen; - error = copyin(uap->jail, &j, sizeof(struct jail)); if (error) return (error); - tmplen = MAXPATHLEN + MAXHOSTNAMELEN + MAXHOSTNAMELEN; + break; + + default: + /* Sci-Fi jails are not supported, sorry. */ + return (EINVAL); + } + return (kern_jail(td, &j)); +} + +int +kern_jail(struct thread *td, struct jail *j) +{ + struct iovec optiov[24]; + struct uio opt; + char *u_path, *u_hostname, *u_name; #ifdef INET - if (j.ip4s > jail_max_af_ips) - return (EINVAL); - tmplen += j.ip4s * sizeof(struct in_addr); + int ip4s; + struct in_addr *u_ip4; +#endif +#ifdef INET6 + struct in6_addr *u_ip6; +#endif + size_t tmplen; + int error, enforce_statfs, fi; + + bzero(&optiov, sizeof(optiov)); + opt.uio_iov = optiov; + opt.uio_iovcnt = 0; + opt.uio_offset = -1; + opt.uio_resid = -1; + opt.uio_segflg = UIO_SYSSPACE; + opt.uio_rw = UIO_READ; + opt.uio_td = td; + + /* Set permissions for top-level jails from sysctls. */ + if (!jailed(td->td_ucred)) { + for (fi = 0; fi < sizeof(pr_allow_names) / + sizeof(pr_allow_names[0]); fi++) { + optiov[opt.uio_iovcnt].iov_base = + (jail_default_allow & (1 << fi)) + ? pr_allow_names[fi] : pr_allow_nonames[fi]; + optiov[opt.uio_iovcnt].iov_len = + strlen(optiov[opt.uio_iovcnt].iov_base) + 1; + opt.uio_iovcnt += 2; + } + optiov[opt.uio_iovcnt].iov_base = "enforce_statfs"; + optiov[opt.uio_iovcnt].iov_len = sizeof("enforce_statfs"); + opt.uio_iovcnt++; + enforce_statfs = jail_default_enforce_statfs; + optiov[opt.uio_iovcnt].iov_base = &enforce_statfs; + optiov[opt.uio_iovcnt].iov_len = sizeof(enforce_statfs); + opt.uio_iovcnt++; + } + + tmplen = MAXPATHLEN + MAXHOSTNAMELEN + MAXHOSTNAMELEN; +#ifdef INET + ip4s = (j->version == 0) ? 
1 : j->ip4s; + if (ip4s > jail_max_af_ips) + return (EINVAL); + tmplen += ip4s * sizeof(struct in_addr); #else - if (j.ip4s > 0) - return (EINVAL); + if (j->ip4s > 0) + return (EINVAL); #endif #ifdef INET6 - if (j.ip6s > jail_max_af_ips) - return (EINVAL); - tmplen += j.ip6s * sizeof(struct in6_addr); + if (j->ip6s > jail_max_af_ips) + return (EINVAL); + tmplen += j->ip6s * sizeof(struct in6_addr); #else - if (j.ip6s > 0) - return (EINVAL); + if (j->ip6s > 0) + return (EINVAL); #endif - u_path = malloc(tmplen, M_TEMP, M_WAITOK); - u_hostname = u_path + MAXPATHLEN; - u_name = u_hostname + MAXHOSTNAMELEN; + u_path = malloc(tmplen, M_TEMP, M_WAITOK); + u_hostname = u_path + MAXPATHLEN; + u_name = u_hostname + MAXHOSTNAMELEN; #ifdef INET - u_ip4 = (struct in_addr *)(u_name + MAXHOSTNAMELEN); + u_ip4 = (struct in_addr *)(u_name + MAXHOSTNAMELEN); #endif #ifdef INET6 #ifdef INET - u_ip6 = (struct in6_addr *)(u_ip4 + j.ip4s); + u_ip6 = (struct in6_addr *)(u_ip4 + ip4s); #else - u_ip6 = (struct in6_addr *)(u_name + MAXHOSTNAMELEN); + u_ip6 = (struct in6_addr *)(u_name + MAXHOSTNAMELEN); #endif #endif - opt.uio_iov = optiov; - opt.uio_iovcnt = 4; - opt.uio_offset = -1; - opt.uio_resid = -1; - opt.uio_segflg = UIO_SYSSPACE; - opt.uio_rw = UIO_READ; - opt.uio_td = td; - optiov[0].iov_base = "path"; - optiov[0].iov_len = sizeof("path"); - optiov[1].iov_base = u_path; - error = - copyinstr(j.path, u_path, MAXPATHLEN, &optiov[1].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - optiov[2].iov_base = "host.hostname"; - optiov[2].iov_len = sizeof("host.hostname"); - optiov[3].iov_base = u_hostname; - error = copyinstr(j.hostname, u_hostname, MAXHOSTNAMELEN, - &optiov[3].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - if (j.jailname != NULL) { - optiov[opt.uio_iovcnt].iov_base = "name"; - optiov[opt.uio_iovcnt].iov_len = sizeof("name"); - opt.uio_iovcnt++; - optiov[opt.uio_iovcnt].iov_base = u_name; - error = copyinstr(j.jailname, u_name, MAXHOSTNAMELEN, - &optiov[opt.uio_iovcnt].iov_len); - if (error) { - free(u_path, M_TEMP); - return (error); - } - opt.uio_iovcnt++; - } -#ifdef INET - optiov[opt.uio_iovcnt].iov_base = "ip4.addr"; - optiov[opt.uio_iovcnt].iov_len = sizeof("ip4.addr"); + optiov[opt.uio_iovcnt].iov_base = "path"; + optiov[opt.uio_iovcnt].iov_len = sizeof("path"); + opt.uio_iovcnt++; + optiov[opt.uio_iovcnt].iov_base = u_path; + error = copyinstr(j->path, u_path, MAXPATHLEN, + &optiov[opt.uio_iovcnt].iov_len); + if (error) { + free(u_path, M_TEMP); + return (error); + } + opt.uio_iovcnt++; + optiov[opt.uio_iovcnt].iov_base = "host.hostname"; + optiov[opt.uio_iovcnt].iov_len = sizeof("host.hostname"); + opt.uio_iovcnt++; + optiov[opt.uio_iovcnt].iov_base = u_hostname; + error = copyinstr(j->hostname, u_hostname, MAXHOSTNAMELEN, + &optiov[opt.uio_iovcnt].iov_len); + if (error) { + free(u_path, M_TEMP); + return (error); + } + opt.uio_iovcnt++; + if (j->jailname != NULL) { + optiov[opt.uio_iovcnt].iov_base = "name"; + optiov[opt.uio_iovcnt].iov_len = sizeof("name"); opt.uio_iovcnt++; - optiov[opt.uio_iovcnt].iov_base = u_ip4; - optiov[opt.uio_iovcnt].iov_len = - j.ip4s * sizeof(struct in_addr); - error = copyin(j.ip4, u_ip4, optiov[opt.uio_iovcnt].iov_len); + optiov[opt.uio_iovcnt].iov_base = u_name; + error = copyinstr(j->jailname, u_name, MAXHOSTNAMELEN, + &optiov[opt.uio_iovcnt].iov_len); if (error) { free(u_path, M_TEMP); return (error); } opt.uio_iovcnt++; -#endif -#ifdef INET6 - optiov[opt.uio_iovcnt].iov_base = "ip6.addr"; - 
optiov[opt.uio_iovcnt].iov_len = sizeof("ip6.addr"); - opt.uio_iovcnt++; - optiov[opt.uio_iovcnt].iov_base = u_ip6; - optiov[opt.uio_iovcnt].iov_len = - j.ip6s * sizeof(struct in6_addr); - error = copyin(j.ip6, u_ip6, optiov[opt.uio_iovcnt].iov_len); + } +#ifdef INET + optiov[opt.uio_iovcnt].iov_base = "ip4.addr"; + optiov[opt.uio_iovcnt].iov_len = sizeof("ip4.addr"); + opt.uio_iovcnt++; + optiov[opt.uio_iovcnt].iov_base = u_ip4; + optiov[opt.uio_iovcnt].iov_len = ip4s * sizeof(struct in_addr); + if (j->version == 0) + u_ip4->s_addr = j->ip4s; + else { + error = copyin(j->ip4, u_ip4, optiov[opt.uio_iovcnt].iov_len); if (error) { free(u_path, M_TEMP); return (error); } - opt.uio_iovcnt++; + } + opt.uio_iovcnt++; #endif - break; +#ifdef INET6 + optiov[opt.uio_iovcnt].iov_base = "ip6.addr"; + optiov[opt.uio_iovcnt].iov_len = sizeof("ip6.addr"); + opt.uio_iovcnt++; + optiov[opt.uio_iovcnt].iov_base = u_ip6; + optiov[opt.uio_iovcnt].iov_len = j->ip6s * sizeof(struct in6_addr); + error = copyin(j->ip6, u_ip6, optiov[opt.uio_iovcnt].iov_len); + if (error) { + free(u_path, M_TEMP); + return (error); } - - default: - /* Sci-Fi jails are not supported, sorry. */ - return (EINVAL); - } + opt.uio_iovcnt++; +#endif + KASSERT(opt.uio_iovcnt <= sizeof(optiov) / sizeof(optiov[0]), + ("kern_jail: too many iovecs (%d)", opt.uio_iovcnt)); error = kern_jail_set(td, &opt, JAIL_CREATE | JAIL_ATTACH); free(u_path, M_TEMP); return (error); } + /* * struct jail_set_args { * struct iovec *iovp; * unsigned int iovcnt; * int flags; * }; */ int jail_set(struct thread *td, struct jail_set_args *uap) { struct uio *auio; int error; /* Check that we have an even number of iovecs. */ if (uap->iovcnt & 1) return (EINVAL); error = copyinuio(uap->iovp, uap->iovcnt, &auio); if (error) return (error); error = kern_jail_set(td, auio, uap->flags); free(auio, M_IOV); return (error); } int kern_jail_set(struct thread *td, struct uio *optuio, int flags) { struct nameidata nd; #ifdef INET struct in_addr *ip4; #endif #ifdef INET6 struct in6_addr *ip6; #endif struct vfsopt *opt; struct vfsoptlist *opts; - struct prison *pr, *deadpr, *tpr; + struct prison *pr, *deadpr, *mypr, *ppr, *tpr; struct vnode *root; char *errmsg, *host, *name, *p, *path; +#if defined(INET) || defined(INET6) void *op; - int created, cuflags, error, errmsg_len, errmsg_pos; - int gotslevel, jid, len; +#endif + size_t namelen, onamelen; + int created, cuflags, descend, enforce, error, errmsg_len, errmsg_pos; + int gotenforce, gotslevel, fi, jid, len; int slevel, vfslocked; #if defined(INET) || defined(INET6) - int ii; + int ii, ij; #endif #ifdef INET - int ip4s; + int ip4s, ip4a, redo_ip4; #endif #ifdef INET6 - int ip6s; + int ip6s, ip6a, redo_ip6; #endif unsigned pr_flags, ch_flags; + unsigned pr_allow, ch_allow, tallow; char numbuf[12]; error = priv_check(td, PRIV_JAIL_SET); if (!error && (flags & JAIL_ATTACH)) error = priv_check(td, PRIV_JAIL_ATTACH); if (error) return (error); + mypr = ppr = td->td_ucred->cr_prison; + if ((flags & JAIL_CREATE) && !(mypr->pr_allow & PR_ALLOW_JAILS)) + return (EPERM); if (flags & ~JAIL_SET_MASK) return (EINVAL); /* * Check all the parameters before committing to anything. Not all * errors can be caught early, but we may as well try. Also, this * takes care of some expensive stuff (path lookup) before getting * the allprison lock. * * XXX Jails are not filesystems, and jail parameters are not mount * options. But it makes more sense to re-use the vfsopt code * than duplicate it under a different name. 
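With jail_set(2) in place (defined just above), userland builds the same kind of name/value iovec list that kern_jail() constructs internally. A minimal sketch, assuming the userland prototype jail_set(struct iovec *, unsigned int, int), that string values include their terminating NUL, and that a boolean parameter carries a NULL value of length zero; the jail name, path, and hostname are made-up examples:

#include <sys/param.h>
#include <sys/uio.h>
#include <sys/jail.h>

/*
 * Illustrative only: create a persistent jail named "demo" and attach to
 * it, passing parameters as the name/value iovec pairs that
 * kern_jail_set() parses.
 */
static int
make_demo_jail(void)
{
        struct iovec iov[] = {
                { "name",               sizeof("name") },
                { "demo",               sizeof("demo") },
                { "path",               sizeof("path") },
                { "/jails/demo",        sizeof("/jails/demo") },
                { "host.hostname",      sizeof("host.hostname") },
                { "demo.example.org",   sizeof("demo.example.org") },
                { "persist",            sizeof("persist") },
                { NULL,                 0 },    /* boolean: presence sets it */
        };

        return (jail_set(iov, sizeof(iov) / sizeof(iov[0]),
            JAIL_CREATE | JAIL_ATTACH));
}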
*/ error = vfs_buildopts(optuio, &opts); if (error) return (error); #ifdef INET + ip4a = 0; ip4 = NULL; #endif #ifdef INET6 + ip6a = 0; ip6 = NULL; #endif +#if defined(INET) || defined(INET6) + again: +#endif error = vfs_copyopt(opts, "jid", &jid, sizeof(jid)); if (error == ENOENT) jid = 0; else if (error != 0) goto done_free; error = vfs_copyopt(opts, "securelevel", &slevel, sizeof(slevel)); if (error == ENOENT) gotslevel = 0; else if (error != 0) goto done_free; else gotslevel = 1; + error = vfs_copyopt(opts, "enforce_statfs", &enforce, sizeof(enforce)); + gotenforce = (error == 0); + if (gotenforce) { + if (enforce < 0 || enforce > 2) + return (EINVAL); + } else if (error != ENOENT) + goto done_free; + pr_flags = ch_flags = 0; - vfs_flagopt(opts, "persist", &pr_flags, PR_PERSIST); - vfs_flagopt(opts, "nopersist", &ch_flags, PR_PERSIST); + for (fi = 0; fi < sizeof(pr_flag_names) / sizeof(pr_flag_names[0]); + fi++) { + if (pr_flag_names[fi] == NULL) + continue; + vfs_flagopt(opts, pr_flag_names[fi], &pr_flags, 1 << fi); + vfs_flagopt(opts, pr_flag_nonames[fi], &ch_flags, 1 << fi); + } ch_flags |= pr_flags; if ((flags & (JAIL_CREATE | JAIL_UPDATE | JAIL_ATTACH)) == JAIL_CREATE && !(pr_flags & PR_PERSIST)) { error = EINVAL; vfs_opterror(opts, "new jail must persist or attach"); goto done_errmsg; } + pr_allow = ch_allow = 0; + for (fi = 0; fi < sizeof(pr_allow_names) / sizeof(pr_allow_names[0]); + fi++) { + vfs_flagopt(opts, pr_allow_names[fi], &pr_allow, 1 << fi); + vfs_flagopt(opts, pr_allow_nonames[fi], &ch_allow, 1 << fi); + } + ch_allow |= pr_allow; + error = vfs_getopt(opts, "name", (void **)&name, &len); if (error == ENOENT) name = NULL; else if (error != 0) goto done_free; else { if (len == 0 || name[len - 1] != '\0') { error = EINVAL; goto done_free; } if (len > MAXHOSTNAMELEN) { error = ENAMETOOLONG; goto done_free; } } error = vfs_getopt(opts, "host.hostname", (void **)&host, &len); if (error == ENOENT) host = NULL; else if (error != 0) goto done_free; else { if (len == 0 || host[len - 1] != '\0') { error = EINVAL; goto done_free; } if (len > MAXHOSTNAMELEN) { error = ENAMETOOLONG; goto done_free; } } + /* This might be the second time around for this option. */ #ifdef INET error = vfs_getopt(opts, "ip4.addr", &op, &ip4s); if (error == ENOENT) ip4s = -1; else if (error != 0) goto done_free; else if (ip4s & (sizeof(*ip4) - 1)) { error = EINVAL; goto done_free; - } else if (ip4s > 0) { - ip4s /= sizeof(*ip4); - if (ip4s > jail_max_af_ips) { - error = EINVAL; - vfs_opterror(opts, "too many IPv4 addresses"); - goto done_errmsg; - } - ip4 = malloc(ip4s * sizeof(*ip4), M_PRISON, M_WAITOK); - bcopy(op, ip4, ip4s * sizeof(*ip4)); - /* - * IP addresses are all sorted but ip[0] to preserve the - * primary IP address as given from userland. This special IP - * is used for unbound outgoing connections as well for - * "loopback" traffic. - */ - if (ip4s > 1) - qsort(ip4 + 1, ip4s - 1, sizeof(*ip4), qcmp_v4); - /* - * Check for duplicate addresses and do some simple zero and - * broadcast checks. If users give other bogus addresses it is - * their problem. - * - * We do not have to care about byte order for these checks so - * we will do them in NBO. 
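As a concrete illustration of the "all sorted but ip[0]" rule described above (addresses drawn from the 192.0.2.0/24 documentation range), a sketch that reuses qcmp_v4():

/*
 * Illustrative only: the primary address keeps its place, the rest are
 * sorted, so duplicates can then be caught in a single pass.
 *
 *   from userland:  192.0.2.20, 192.0.2.30, 192.0.2.10
 *   stored order:   192.0.2.20, 192.0.2.10, 192.0.2.30
 */
static void
sort_jail_ip4(struct in_addr *ip4, int ip4s)
{
        if (ip4s > 1)
                qsort(ip4 + 1, ip4s - 1, sizeof(*ip4), qcmp_v4);
}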
- */ - for (ii = 0; ii < ip4s; ii++) { - if (ip4[ii].s_addr == INADDR_ANY || - ip4[ii].s_addr == INADDR_BROADCAST) { + } else { + ch_flags |= PR_IP4_USER; + pr_flags |= PR_IP4_USER; + if (ip4s > 0) { + ip4s /= sizeof(*ip4); + if (ip4s > jail_max_af_ips) { error = EINVAL; - goto done_free; + vfs_opterror(opts, "too many IPv4 addresses"); + goto done_errmsg; } - if ((ii+1) < ip4s && - (ip4[0].s_addr == ip4[ii+1].s_addr || - ip4[ii].s_addr == ip4[ii+1].s_addr)) { - error = EINVAL; - goto done_free; + if (ip4a < ip4s) { + ip4a = ip4s; + free(ip4, M_PRISON); + ip4 = NULL; } + if (ip4 == NULL) + ip4 = malloc(ip4a * sizeof(*ip4), M_PRISON, + M_WAITOK); + bcopy(op, ip4, ip4s * sizeof(*ip4)); + /* + * IP addresses are all sorted but ip[0] to preserve + * the primary IP address as given from userland. + * This special IP is used for unbound outgoing + * connections as well for "loopback" traffic. + */ + if (ip4s > 1) + qsort(ip4 + 1, ip4s - 1, sizeof(*ip4), qcmp_v4); + /* + * Check for duplicate addresses and do some simple + * zero and broadcast checks. If users give other bogus + * addresses it is their problem. + * + * We do not have to care about byte order for these + * checks so we will do them in NBO. + */ + for (ii = 0; ii < ip4s; ii++) { + if (ip4[ii].s_addr == INADDR_ANY || + ip4[ii].s_addr == INADDR_BROADCAST) { + error = EINVAL; + goto done_free; + } + if ((ii+1) < ip4s && + (ip4[0].s_addr == ip4[ii+1].s_addr || + ip4[ii].s_addr == ip4[ii+1].s_addr)) { + error = EINVAL; + goto done_free; + } + } } } #endif #ifdef INET6 error = vfs_getopt(opts, "ip6.addr", &op, &ip6s); if (error == ENOENT) ip6s = -1; else if (error != 0) goto done_free; else if (ip6s & (sizeof(*ip6) - 1)) { error = EINVAL; goto done_free; - } else if (ip6s > 0) { - ip6s /= sizeof(*ip6); - if (ip6s > jail_max_af_ips) { - error = EINVAL; - vfs_opterror(opts, "too many IPv6 addresses"); - goto done_errmsg; - } - ip6 = malloc(ip6s * sizeof(*ip6), M_PRISON, M_WAITOK); - bcopy(op, ip6, ip6s * sizeof(*ip6)); - if (ip6s > 1) - qsort(ip6 + 1, ip6s - 1, sizeof(*ip6), qcmp_v6); - for (ii = 0; ii < ip6s; ii++) { - if (IN6_IS_ADDR_UNSPECIFIED(&ip6[0])) { + } else { + ch_flags |= PR_IP6_USER; + pr_flags |= PR_IP6_USER; + if (ip6s > 0) { + ip6s /= sizeof(*ip6); + if (ip6s > jail_max_af_ips) { error = EINVAL; - goto done_free; + vfs_opterror(opts, "too many IPv6 addresses"); + goto done_errmsg; } - if ((ii+1) < ip6s && - (IN6_ARE_ADDR_EQUAL(&ip6[0], &ip6[ii+1]) || - IN6_ARE_ADDR_EQUAL(&ip6[ii], &ip6[ii+1]))) - { - error = EINVAL; - goto done_free; + if (ip6a < ip6s) { + ip6a = ip6s; + free(ip6, M_PRISON); + ip6 = NULL; } + if (ip6 == NULL) + ip6 = malloc(ip6a * sizeof(*ip6), M_PRISON, + M_WAITOK); + bcopy(op, ip6, ip6s * sizeof(*ip6)); + if (ip6s > 1) + qsort(ip6 + 1, ip6s - 1, sizeof(*ip6), qcmp_v6); + for (ii = 0; ii < ip6s; ii++) { + if (IN6_IS_ADDR_UNSPECIFIED(&ip6[ii])) { + error = EINVAL; + goto done_free; + } + if ((ii+1) < ip6s && + (IN6_ARE_ADDR_EQUAL(&ip6[0], &ip6[ii+1]) || + IN6_ARE_ADDR_EQUAL(&ip6[ii], &ip6[ii+1]))) + { + error = EINVAL; + goto done_free; + } + } } } #endif root = NULL; error = vfs_getopt(opts, "path", (void **)&path, &len); if (error == ENOENT) path = NULL; else if (error != 0) goto done_free; else { if (flags & JAIL_UPDATE) { error = EINVAL; vfs_opterror(opts, "path cannot be changed after creation"); goto done_errmsg; } if (len == 0 || path[len - 1] != '\0') { error = EINVAL; goto done_free; } - if (len > MAXPATHLEN) { - error = ENAMETOOLONG; - goto done_free; - } if (len < 2 || (len == 2 && path[0] == 
'/')) path = NULL; else { + /* Leave room for a real-root full pathname. */ + if (len + (path[0] == '/' && strcmp(mypr->pr_path, "/") + ? strlen(mypr->pr_path) : 0) > MAXPATHLEN) { + error = ENAMETOOLONG; + goto done_free; + } NDINIT(&nd, LOOKUP, MPSAFE | FOLLOW, UIO_SYSSPACE, path, td); error = namei(&nd); if (error) goto done_free; vfslocked = NDHASGIANT(&nd); root = nd.ni_vp; NDFREE(&nd, NDF_ONLY_PNBUF); if (root->v_type != VDIR) { error = ENOTDIR; vrele(root); VFS_UNLOCK_GIANT(vfslocked); goto done_free; } VFS_UNLOCK_GIANT(vfslocked); } } /* * Grab the allprison lock before letting modules check their * parameters. Once we have it, do not let go so we'll have a * consistent view of the OSD list. */ sx_xlock(&allprison_lock); error = osd_jail_call(NULL, PR_METHOD_CHECK, opts); if (error) goto done_unlock_list; /* By now, all parameters should have been noted. */ TAILQ_FOREACH(opt, opts, link) { if (!opt->seen && strcmp(opt->name, "errmsg")) { error = EINVAL; vfs_opterror(opts, "unknown parameter: %s", opt->name); goto done_unlock_list; } } /* * See if we are creating a new record or updating an existing one. * This abuses the file error codes ENOENT and EEXIST. */ cuflags = flags & (JAIL_CREATE | JAIL_UPDATE); if (!cuflags) { error = EINVAL; vfs_opterror(opts, "no valid operation (create or update)"); goto done_unlock_list; } pr = NULL; if (jid != 0) { - /* See if a requested jid already exists. */ + /* + * See if a requested jid already exists. There is an + * information leak here if the jid exists but is not within + * the caller's jail hierarchy. Jail creators will get EEXIST + * even though they cannot see the jail, and CREATE | UPDATE + * will return ENOENT which is not normally a valid error. + */ if (jid < 0) { error = EINVAL; vfs_opterror(opts, "negative jid"); goto done_unlock_list; } pr = prison_find(jid); if (pr != NULL) { + ppr = pr->pr_parent; /* Create: jid must not exist. */ if (cuflags == JAIL_CREATE) { mtx_unlock(&pr->pr_mtx); error = EEXIST; vfs_opterror(opts, "jail %d already exists", jid); goto done_unlock_list; } - if (pr->pr_uref == 0) { + if (!prison_ischild(mypr, pr)) { + mtx_unlock(&pr->pr_mtx); + pr = NULL; + } else if (pr->pr_uref == 0) { if (!(flags & JAIL_DYING)) { mtx_unlock(&pr->pr_mtx); error = ENOENT; vfs_opterror(opts, "jail %d is dying", jid); goto done_unlock_list; } else if ((flags & JAIL_ATTACH) || (pr_flags & PR_PERSIST)) { /* * A dying jail might be resurrected * (via attach or persist), but first * it must determine if another jail * has claimed its name. Accomplish * this by implicitly re-setting the * name. */ if (name == NULL) - name = pr->pr_name; + name = prison_name(mypr, pr); } } } if (pr == NULL) { /* Update: jid must exist. */ if (cuflags == JAIL_UPDATE) { error = ENOENT; vfs_opterror(opts, "jail %d not found", jid); goto done_unlock_list; } } } /* * If the caller provided a name, look for a jail by that name. * This has different semantics for creates and updates keyed by jid * (where the name must not already exist in a different jail), * and updates keyed by the name itself (where the name must exist * because that is the jail being updated). */ if (name != NULL) { + p = strrchr(name, '.'); + if (p != NULL) { + /* + * This is a hierarchical name. Split it into the + * parent and child names, and make sure the parent + * exists or matches an already found jail. 
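The split described in the comment above is plain string surgery; a stand-alone sketch (the prison_find_name() lookup and locking are left out, and the helper name is made up). For a name like "foo.bar", the parent component is "foo" and the jail's own name is "bar":

#include <string.h>

/*
 * Illustrative only: split a hierarchical jail name at its last '.'.
 * The buffer is modified in place, as kern_jail_set() does.
 */
static void
split_jail_name(char *name, char **parent, char **child)
{
        char *p;

        p = strrchr(name, '.');
        if (p == NULL) {
                *parent = NULL;         /* top level: parent is the caller's prison */
                *child = name;
        } else {
                *p = '\0';              /* terminate the parent component */
                *parent = name;
                *child = p + 1;
        }
}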
+ */ + *p = '\0'; + if (pr != NULL) { + if (strncmp(name, ppr->pr_name, p - name) || + ppr->pr_name[p - name] != '\0') { + mtx_unlock(&pr->pr_mtx); + error = EINVAL; + vfs_opterror(opts, + "cannot change jail's parent"); + goto done_unlock_list; + } + } else { + ppr = prison_find_name(mypr, name); + if (ppr == NULL) { + error = ENOENT; + vfs_opterror(opts, + "jail \"%s\" not found", name); + goto done_unlock_list; + } + mtx_unlock(&ppr->pr_mtx); + } + name = p + 1; + } if (name[0] != '\0') { - deadpr = NULL; + namelen = + (ppr == &prison0) ? 0 : strlen(ppr->pr_name) + 1; name_again: - TAILQ_FOREACH(tpr, &allprison, pr_list) { + deadpr = NULL; + FOREACH_PRISON_CHILD(ppr, tpr) { if (tpr != pr && tpr->pr_ref > 0 && - !strcmp(tpr->pr_name, name)) { + !strcmp(tpr->pr_name + namelen, name)) { if (pr == NULL && cuflags != JAIL_CREATE) { mtx_lock(&tpr->pr_mtx); if (tpr->pr_ref > 0) { /* * Use this jail * for updates. */ if (tpr->pr_uref > 0) { pr = tpr; break; } deadpr = tpr; } mtx_unlock(&tpr->pr_mtx); } else if (tpr->pr_uref > 0) { /* * Create, or update(jid): * name must not exist in an - * active jail. + * active sibling jail. */ error = EEXIST; if (pr != NULL) mtx_unlock(&pr->pr_mtx); vfs_opterror(opts, "jail \"%s\" already exists", name); goto done_unlock_list; } } } /* If no active jail is found, use a dying one. */ if (deadpr != NULL && pr == NULL) { if (flags & JAIL_DYING) { mtx_lock(&deadpr->pr_mtx); if (deadpr->pr_ref == 0) { mtx_unlock(&deadpr->pr_mtx); goto name_again; } pr = deadpr; } else if (cuflags == JAIL_UPDATE) { error = ENOENT; vfs_opterror(opts, "jail \"%s\" is dying", name); goto done_unlock_list; } } /* Update: name must exist if no jid. */ else if (cuflags == JAIL_UPDATE && pr == NULL) { error = ENOENT; vfs_opterror(opts, "jail \"%s\" not found", name); goto done_unlock_list; } } } /* Update: must provide a jid or name. */ else if (cuflags == JAIL_UPDATE && pr == NULL) { error = ENOENT; vfs_opterror(opts, "update specified no jail"); goto done_unlock_list; } /* If there's no prison to update, create a new one and link it in. */ if (pr == NULL) { created = 1; + mtx_lock(&ppr->pr_mtx); + if (ppr->pr_ref == 0 || (ppr->pr_flags & PR_REMOVE)) { + mtx_unlock(&ppr->pr_mtx); + error = ENOENT; + vfs_opterror(opts, "parent jail went away!"); + goto done_unlock_list; + } + ppr->pr_ref++; + ppr->pr_uref++; + mtx_unlock(&ppr->pr_mtx); pr = malloc(sizeof(*pr), M_PRISON, M_WAITOK | M_ZERO); if (jid == 0) { /* Find the next free jid. */ jid = lastprid + 1; findnext: if (jid == JAIL_MAX) jid = 1; TAILQ_FOREACH(tpr, &allprison, pr_list) { if (tpr->pr_id < jid) continue; if (tpr->pr_id > jid || tpr->pr_ref == 0) { TAILQ_INSERT_BEFORE(tpr, pr, pr_list); break; } if (jid == lastprid) { error = EAGAIN; vfs_opterror(opts, "no available jail IDs"); free(pr, M_PRISON); - goto done_unlock_list; + prison_deref(ppr, PD_DEREF | + PD_DEUREF | PD_LIST_XLOCKED); + goto done_releroot; } jid++; goto findnext; } lastprid = jid; } else { /* * The jail already has a jid (that did not yet exist), * so just find where to insert it. */ TAILQ_FOREACH(tpr, &allprison, pr_list) if (tpr->pr_id >= jid) { TAILQ_INSERT_BEFORE(tpr, pr, pr_list); break; } } if (tpr == NULL) TAILQ_INSERT_TAIL(&allprison, pr, pr_list); - prisoncount++; + LIST_INSERT_HEAD(&ppr->pr_children, pr, pr_sibling); + for (tpr = ppr; tpr != NULL; tpr = tpr->pr_parent) + tpr->pr_prisoncount++; + pr->pr_parent = ppr; pr->pr_id = jid; + + /* Set some default values, and inherit some from the parent. 
*/ if (name == NULL) name = ""; if (path == NULL) { path = "/"; - root = rootvnode; + root = mypr->pr_root; vref(root); } +#ifdef INET + pr->pr_flags |= ppr->pr_flags & PR_IP4; + pr->pr_ip4s = ppr->pr_ip4s; + if (ppr->pr_ip4 != NULL) { + pr->pr_ip4 = malloc(pr->pr_ip4s * + sizeof(struct in_addr), M_PRISON, M_WAITOK); + bcopy(ppr->pr_ip4, pr->pr_ip4, + pr->pr_ip4s * sizeof(*pr->pr_ip4)); + } +#endif +#ifdef INET6 + pr->pr_flags |= ppr->pr_flags & PR_IP6; + pr->pr_ip6s = ppr->pr_ip6s; + if (ppr->pr_ip6 != NULL) { + pr->pr_ip6 = malloc(pr->pr_ip6s * + sizeof(struct in6_addr), M_PRISON, M_WAITOK); + bcopy(ppr->pr_ip6, pr->pr_ip6, + pr->pr_ip6s * sizeof(*pr->pr_ip6)); + } +#endif + pr->pr_securelevel = ppr->pr_securelevel; + pr->pr_allow = JAIL_DEFAULT_ALLOW & ppr->pr_allow; + pr->pr_enforce_statfs = ppr->pr_enforce_statfs; - mtx_init(&pr->pr_mtx, "jail mutex", NULL, MTX_DEF); + LIST_INIT(&pr->pr_children); + mtx_init(&pr->pr_mtx, "jail mutex", NULL, MTX_DEF | MTX_DUPOK); /* * Allocate a dedicated cpuset for each jail. * Unlike other initial settings, this may return an erorr. */ - error = cpuset_create_root(td, &pr->pr_cpuset); + error = cpuset_create_root(ppr, &pr->pr_cpuset); if (error) { prison_deref(pr, PD_LIST_XLOCKED); goto done_releroot; } mtx_lock(&pr->pr_mtx); /* * New prisons do not yet have a reference, because we do not * want other to see the incomplete prison once the * allprison_lock is downgraded. */ } else { created = 0; /* * Grab a reference for existing prisons, to ensure they * continue to exist for the duration of the call. */ pr->pr_ref++; } /* Do final error checking before setting anything. */ - error = 0; -#if defined(INET) || defined(INET6) - if ( + if (gotslevel) { + if (slevel < ppr->pr_securelevel) { + error = EPERM; + goto done_deref_locked; + } + } + if (gotenforce) { + if (enforce < ppr->pr_enforce_statfs) { + error = EPERM; + goto done_deref_locked; + } + } #ifdef INET - ip4s > 0 -#ifdef INET6 - || -#endif -#endif -#ifdef INET6 - ip6s > 0 -#endif - ) - /* - * Check for conflicting IP addresses. We permit them if there - * is no more than 1 IP on each jail. If there is a duplicate - * on a jail with more than one IP stop checking and return - * error. - */ - TAILQ_FOREACH(tpr, &allprison, pr_list) { - if (tpr == pr || tpr->pr_uref == 0) - continue; -#ifdef INET - if ((ip4s > 0 && tpr->pr_ip4s > 1) || - (ip4s > 1 && tpr->pr_ip4s > 0)) - for (ii = 0; ii < ip4s; ii++) + if (ch_flags & PR_IP4_USER) { + if (ppr->pr_flags & PR_IP4) { + if (!(pr_flags & PR_IP4_USER)) { + /* + * Silently ignore attempts to make the IP + * addresses unrestricted when the parent is + * restricted; in other words, interpret + * "unrestricted" as "as unrestricted as + * possible". + */ + ip4s = ppr->pr_ip4s; + if (ip4s == 0) { + free(ip4, M_PRISON); + ip4 = NULL; + } else if (ip4s <= ip4a) { + /* Inherit the parent's address(es). */ + bcopy(ppr->pr_ip4, ip4, + ip4s * sizeof(*ip4)); + } else { + /* + * There's no room for the parent's + * address list. Allocate some more. + */ + ip4a = ip4s; + free(ip4, M_PRISON); + ip4 = malloc(ip4a * sizeof(*ip4), + M_PRISON, M_NOWAIT); + if (ip4 != NULL) + bcopy(ppr->pr_ip4, ip4, + ip4s * sizeof(*ip4)); + else { + /* Allocation failed without + * sleeping. Unlocking the + * prison now will invalidate + * some checks and prematurely + * show an unfinished new jail. + * So let go of everything and + * start over. + */ + prison_deref(pr, created + ? 
PD_LOCKED | + PD_LIST_XLOCKED + : PD_DEREF | PD_LOCKED | + PD_LIST_XLOCKED); + if (root != NULL) { + vfslocked = + VFS_LOCK_GIANT( + root->v_mount); + vrele(root); + VFS_UNLOCK_GIANT( + vfslocked); + } + ip4 = malloc(ip4a * + sizeof(*ip4), M_PRISON, + M_WAITOK); + goto again; + } + } + } else if (ip4s > 0) { + /* + * Make sure the new set of IP addresses is a + * subset of the parent's list. Don't worry + * about the parent being unlocked, as any + * setting is done with allprison_lock held. + */ + for (ij = 0; ij < ppr->pr_ip4s; ij++) + if (ip4[0].s_addr == + ppr->pr_ip4[ij].s_addr) + break; + if (ij == ppr->pr_ip4s) { + error = EPERM; + goto done_deref_locked; + } + if (ip4s > 1) { + for (ii = ij = 1; ii < ip4s; ii++) { + if (ip4[ii].s_addr == + ppr->pr_ip4[0].s_addr) + continue; + for (; ij < ppr->pr_ip4s; ij++) + if (ip4[ii].s_addr == + ppr->pr_ip4[ij].s_addr) + break; + if (ij == ppr->pr_ip4s) + break; + } + if (ij == ppr->pr_ip4s) { + error = EPERM; + goto done_deref_locked; + } + } + } + } + if (ip4s > 0) { + /* + * Check for conflicting IP addresses. We permit them + * if there is no more than one IP on each jail. If + * there is a duplicate on a jail with more than one + * IP stop checking and return error. + */ + FOREACH_PRISON_DESCENDANT(&prison0, tpr, descend) { + if (tpr == pr || tpr->pr_uref == 0) { + descend = 0; + continue; + } + if (!(tpr->pr_flags & PR_IP4_USER)) + continue; + descend = 0; + if (tpr->pr_ip4 == NULL || + (ip4s == 1 && tpr->pr_ip4s == 1)) + continue; + for (ii = 0; ii < ip4s; ii++) { if (_prison_check_ip4(tpr, &ip4[ii]) == 0) { - error = EINVAL; + error = EADDRINUSE; vfs_opterror(opts, "IPv4 addresses clash"); goto done_deref_locked; } + } + } + } + } #endif #ifdef INET6 - if ((ip6s > 0 && tpr->pr_ip6s > 1) || - (ip6s > 1 && tpr->pr_ip6s > 0)) - for (ii = 0; ii < ip6s; ii++) + if (ch_flags & PR_IP6_USER) { + if (ppr->pr_flags & PR_IP6) { + if (!(pr_flags & PR_IP6_USER)) { + /* + * Silently ignore attempts to make the IP + * addresses unrestricted when the parent is + * restricted. + */ + ip6s = ppr->pr_ip6s; + if (ip6s == 0) { + free(ip6, M_PRISON); + ip6 = NULL; + } else if (ip6s <= ip6a) { + /* Inherit the parent's address(es). */ + bcopy(ppr->pr_ip6, ip6, + ip6s * sizeof(*ip6)); + } else { + /* + * There's no room for the parent's + * address list. + */ + ip6a = ip6s; + free(ip6, M_PRISON); + ip6 = malloc(ip6a * sizeof(*ip6), + M_PRISON, M_NOWAIT); + if (ip6 != NULL) + bcopy(ppr->pr_ip6, ip6, + ip6s * sizeof(*ip6)); + else { + prison_deref(pr, created + ? PD_LOCKED | + PD_LIST_XLOCKED + : PD_DEREF | PD_LOCKED | + PD_LIST_XLOCKED); + if (root != NULL) { + vfslocked = + VFS_LOCK_GIANT( + root->v_mount); + vrele(root); + VFS_UNLOCK_GIANT( + vfslocked); + } + ip6 = malloc(ip6a * + sizeof(*ip6), M_PRISON, + M_WAITOK); + goto again; + } + } + } else if (ip6s > 0) { + /* + * Make sure the new set of IP addresses is a + * subset of the parent's list. + */ + for (ij = 0; ij < ppr->pr_ip6s; ij++) + if (IN6_ARE_ADDR_EQUAL(&ip6[0], + &ppr->pr_ip6[ij])) + break; + if (ij == ppr->pr_ip6s) { + error = EPERM; + goto done_deref_locked; + } + if (ip6s > 1) { + for (ii = ij = 1; ii < ip6s; ii++) { + if (IN6_ARE_ADDR_EQUAL(&ip6[ii], + &ppr->pr_ip6[0])) + continue; + for (; ij < ppr->pr_ip6s; ij++) + if (IN6_ARE_ADDR_EQUAL( + &ip6[ii], + &ppr->pr_ip6[ij])) + break; + if (ij == ppr->pr_ip6s) + break; + } + if (ij == ppr->pr_ip6s) { + error = EPERM; + goto done_deref_locked; + } + } + } + } + if (ip6s > 0) { + /* Check for conflicting IP addresses. 
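The subset rule being enforced here, shown for IPv4 just above and repeated for IPv6 with IN6_ARE_ADDR_EQUAL(), reads more easily as a stand-alone predicate. Both lists are sorted past their first element, which is what lets a single forward scan over the parent's list suffice (the function name is made up for illustration):

#include <netinet/in.h>

/*
 * Illustrative only: is the child's address list a subset of the parent's?
 * child[0] (the primary address) may match any parent entry; the remaining
 * entries of both lists are sorted.
 */
static int
jail_ip4_subset(const struct in_addr *child, int ns,
    const struct in_addr *parent, int np)
{
        int ii, ij;

        for (ij = 0; ij < np; ij++)
                if (child[0].s_addr == parent[ij].s_addr)
                        break;
        if (ij == np)
                return (0);
        for (ii = ij = 1; ii < ns; ii++) {
                if (child[ii].s_addr == parent[0].s_addr)
                        continue;
                for (; ij < np; ij++)
                        if (child[ii].s_addr == parent[ij].s_addr)
                                break;
                if (ij == np)
                        return (0);
        }
        return (1);
}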
*/ + FOREACH_PRISON_DESCENDANT(&prison0, tpr, descend) { + if (tpr == pr || tpr->pr_uref == 0) { + descend = 0; + continue; + } + if (!(tpr->pr_flags & PR_IP6_USER)) + continue; + descend = 0; + if (tpr->pr_ip6 == NULL || + (ip6s == 1 && tpr->pr_ip6s == 1)) + continue; + for (ii = 0; ii < ip6s; ii++) { if (_prison_check_ip6(tpr, &ip6[ii]) == 0) { - error = EINVAL; + error = EADDRINUSE; vfs_opterror(opts, "IPv6 addresses clash"); goto done_deref_locked; } -#endif + } + } } + } #endif - if (error == 0 && name != NULL) { + onamelen = namelen = 0; + if (name != NULL) { /* Give a default name of the jid. */ if (name[0] == '\0') snprintf(name = numbuf, sizeof(numbuf), "%d", jid); else if (strtoul(name, &p, 10) != jid && *p == '\0') { error = EINVAL; vfs_opterror(opts, "name cannot be numeric"); + goto done_deref_locked; } - } - if (error) { - done_deref_locked: /* - * Some parameter had an error so do not set anything. - * If this is a new jail, it will go away without ever - * having been seen. + * Make sure the name isn't too long for the prison or its + * children. */ - prison_deref(pr, created - ? PD_LOCKED | PD_LIST_XLOCKED - : PD_DEREF | PD_LOCKED | PD_LIST_XLOCKED); - goto done_releroot; + onamelen = strlen(pr->pr_name); + namelen = strlen(name); + if (strlen(ppr->pr_name) + namelen + 2 > sizeof(pr->pr_name)) { + error = ENAMETOOLONG; + goto done_deref_locked; + } + FOREACH_PRISON_DESCENDANT(pr, tpr, descend) { + if (strlen(tpr->pr_name) + (namelen - onamelen) >= + sizeof(pr->pr_name)) { + error = ENAMETOOLONG; + goto done_deref_locked; + } + } } + if (pr_allow & ~ppr->pr_allow) { + error = EPERM; + goto done_deref_locked; + } /* Set the parameters of the prison. */ #ifdef INET - if (ip4s >= 0) { - pr->pr_ip4s = ip4s; - free(pr->pr_ip4, M_PRISON); - pr->pr_ip4 = ip4; - ip4 = NULL; + redo_ip4 = 0; + if (ch_flags & PR_IP4_USER) { + if (pr_flags & PR_IP4_USER) { + /* Some restriction set. */ + pr->pr_flags |= PR_IP4; + if (ip4s >= 0) { + free(pr->pr_ip4, M_PRISON); + pr->pr_ip4s = ip4s; + pr->pr_ip4 = ip4; + ip4 = NULL; + } + } else if (ppr->pr_flags & PR_IP4) { + /* This restriction cleared, but keep inherited. */ + free(pr->pr_ip4, M_PRISON); + pr->pr_ip4s = ip4s; + pr->pr_ip4 = ip4; + ip4 = NULL; + } else { + /* Restriction cleared, now unrestricted. */ + pr->pr_flags &= ~PR_IP4; + free(pr->pr_ip4, M_PRISON); + pr->pr_ip4s = 0; + } + FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) { + if (prison_restrict_ip4(tpr, NULL)) { + redo_ip4 = 1; + descend = 0; + } + } } #endif #ifdef INET6 - if (ip6s >= 0) { - pr->pr_ip6s = ip6s; - free(pr->pr_ip6, M_PRISON); - pr->pr_ip6 = ip6; - ip6 = NULL; + redo_ip6 = 0; + if (ch_flags & PR_IP6_USER) { + if (pr_flags & PR_IP6_USER) { + /* Some restriction set. */ + pr->pr_flags |= PR_IP6; + if (ip6s >= 0) { + free(pr->pr_ip6, M_PRISON); + pr->pr_ip6s = ip6s; + pr->pr_ip6 = ip6; + ip6 = NULL; + } + } else if (ppr->pr_flags & PR_IP6) { + /* This restriction cleared, but keep inherited. */ + free(pr->pr_ip6, M_PRISON); + pr->pr_ip6s = ip6s; + pr->pr_ip6 = ip6; + ip6 = NULL; + } else { + /* Restriction cleared, now unrestricted. */ + pr->pr_flags &= ~PR_IP6; + free(pr->pr_ip6, M_PRISON); + pr->pr_ip6s = 0; + } + FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) { + if (prison_restrict_ip6(tpr, NULL)) { + redo_ip6 = 1; + descend = 0; + } + } } #endif - if (gotslevel) + if (gotslevel) { pr->pr_securelevel = slevel; - if (name != NULL) - strlcpy(pr->pr_name, name, sizeof(pr->pr_name)); + /* Set all child jails to be at least this level. 
*/ + FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) + if (tpr->pr_securelevel < slevel) + tpr->pr_securelevel = slevel; + } + if (gotenforce) { + pr->pr_enforce_statfs = enforce; + /* Pass this restriction on to the children. */ + FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) + if (tpr->pr_enforce_statfs < enforce) + tpr->pr_enforce_statfs = enforce; + } + if (name != NULL) { + if (ppr == &prison0) + strlcpy(pr->pr_name, name, sizeof(pr->pr_name)); + else + snprintf(pr->pr_name, sizeof(pr->pr_name), "%s.%s", + ppr->pr_name, name); + /* Change this component of child names. */ + FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) { + bcopy(tpr->pr_name + onamelen, tpr->pr_name + namelen, + strlen(tpr->pr_name + onamelen) + 1); + bcopy(pr->pr_name, tpr->pr_name, namelen); + } + } if (path != NULL) { - strlcpy(pr->pr_path, path, sizeof(pr->pr_path)); + /* Try to keep a real-rooted full pathname. */ + if (path[0] == '/' && strcmp(mypr->pr_path, "/")) + snprintf(pr->pr_path, sizeof(pr->pr_path), "%s%s", + mypr->pr_path, path); + else + strlcpy(pr->pr_path, path, sizeof(pr->pr_path)); pr->pr_root = root; } if (host != NULL) strlcpy(pr->pr_host, host, sizeof(pr->pr_host)); + if ((tallow = ch_allow & ~pr_allow)) { + /* Clear allow bits in all children. */ + FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) + tpr->pr_allow &= ~tallow; + } + pr->pr_allow = (pr->pr_allow & ~ch_allow) | pr_allow; /* * Persistent prisons get an extra reference, and prisons losing their * persist flag lose that reference. Only do this for existing prisons * for now, so new ones will remain unseen until after the module * handlers have completed. */ if (!created && (ch_flags & PR_PERSIST & (pr_flags ^ pr->pr_flags))) { if (pr_flags & PR_PERSIST) { pr->pr_ref++; pr->pr_uref++; } else { pr->pr_ref--; pr->pr_uref--; } } pr->pr_flags = (pr->pr_flags & ~ch_flags) | pr_flags; mtx_unlock(&pr->pr_mtx); + /* Locks may have prevented a complete restriction of child IP + * addresses. If so, allocate some more memory and try again. + */ +#ifdef INET + while (redo_ip4) { + ip4s = pr->pr_ip4s; + ip4 = malloc(ip4s * sizeof(*ip4), M_PRISON, M_WAITOK); + mtx_lock(&pr->pr_mtx); + redo_ip4 = 0; + FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) { + if (prison_restrict_ip4(tpr, ip4)) { + if (ip4 != NULL) + ip4 = NULL; + else + redo_ip4 = 1; + } + } + mtx_unlock(&pr->pr_mtx); + } +#endif +#ifdef INET6 + while (redo_ip6) { + ip6s = pr->pr_ip6s; + ip6 = malloc(ip6s * sizeof(*ip6), M_PRISON, M_WAITOK); + mtx_lock(&pr->pr_mtx); + redo_ip6 = 0; + FOREACH_PRISON_DESCENDANT_LOCKED(pr, tpr, descend) { + if (prison_restrict_ip6(tpr, ip6)) { + if (ip6 != NULL) + ip6 = NULL; + else + redo_ip6 = 1; + } + } + mtx_unlock(&pr->pr_mtx); + } +#endif + /* Let the modules do their work. */ sx_downgrade(&allprison_lock); if (created) { error = osd_jail_call(pr, PR_METHOD_CREATE, opts); if (error) { prison_deref(pr, PD_LIST_SLOCKED); goto done_errmsg; } } error = osd_jail_call(pr, PR_METHOD_SET, opts); if (error) { prison_deref(pr, created ? PD_LIST_SLOCKED : PD_DEREF | PD_LIST_SLOCKED); goto done_errmsg; } /* Attach this process to the prison if requested. */ if (flags & JAIL_ATTACH) { mtx_lock(&pr->pr_mtx); error = do_jail_attach(td, pr); if (error) { vfs_opterror(opts, "attach failed"); if (!created) prison_deref(pr, PD_DEREF); goto done_errmsg; } } /* * Now that it is all there, drop the temporary reference from existing * prisons. 
Or add a reference to newly created persistent prisons * (which was not done earlier so that the prison would not be publicly * visible). */ if (!created) { prison_deref(pr, (flags & JAIL_ATTACH) ? PD_DEREF : PD_DEREF | PD_LIST_SLOCKED); } else { if (pr_flags & PR_PERSIST) { mtx_lock(&pr->pr_mtx); pr->pr_ref++; pr->pr_uref++; mtx_unlock(&pr->pr_mtx); } if (!(flags & JAIL_ATTACH)) sx_sunlock(&allprison_lock); } td->td_retval[0] = pr->pr_id; goto done_errmsg; + done_deref_locked: + prison_deref(pr, created + ? PD_LOCKED | PD_LIST_XLOCKED + : PD_DEREF | PD_LOCKED | PD_LIST_XLOCKED); + goto done_releroot; done_unlock_list: sx_xunlock(&allprison_lock); done_releroot: if (root != NULL) { vfslocked = VFS_LOCK_GIANT(root->v_mount); vrele(root); VFS_UNLOCK_GIANT(vfslocked); } done_errmsg: if (error) { vfs_getopt(opts, "errmsg", (void **)&errmsg, &errmsg_len); if (errmsg_len > 0) { errmsg_pos = 2 * vfs_getopt_pos(opts, "errmsg") + 1; if (errmsg_pos > 0) { if (optuio->uio_segflg == UIO_SYSSPACE) bcopy(errmsg, optuio->uio_iov[errmsg_pos].iov_base, errmsg_len); else copyout(errmsg, optuio->uio_iov[errmsg_pos].iov_base, errmsg_len); } } } done_free: #ifdef INET free(ip4, M_PRISON); #endif #ifdef INET6 free(ip6, M_PRISON); #endif vfs_freeopts(opts); return (error); } -/* - * Sysctl nodes to describe jail parameters. Maximum length of string - * parameters is returned in the string itself, and the other parameters - * exist merely to make themselves and their types known. - */ -SYSCTL_NODE(_security_jail, OID_AUTO, param, CTLFLAG_RW, 0, - "Jail parameters"); -int -sysctl_jail_param(SYSCTL_HANDLER_ARGS) -{ - int i; - long l; - size_t s; - char numbuf[12]; - - switch (oidp->oid_kind & CTLTYPE) - { - case CTLTYPE_LONG: - case CTLTYPE_ULONG: - l = 0; -#ifdef SCTL_MASK32 - if (!(req->flags & SCTL_MASK32)) -#endif - return (SYSCTL_OUT(req, &l, sizeof(l))); - case CTLTYPE_INT: - case CTLTYPE_UINT: - i = 0; - return (SYSCTL_OUT(req, &i, sizeof(i))); - case CTLTYPE_STRING: - snprintf(numbuf, sizeof(numbuf), "%d", arg2); - return - (sysctl_handle_string(oidp, numbuf, sizeof(numbuf), req)); - case CTLTYPE_STRUCT: - s = (size_t)arg2; - return (SYSCTL_OUT(req, &s, sizeof(s))); - } - return (0); -} - -SYSCTL_JAIL_PARAM(, jid, CTLTYPE_INT | CTLFLAG_RD, "I", "Jail ID"); -SYSCTL_JAIL_PARAM_STRING(, name, CTLFLAG_RW, MAXHOSTNAMELEN, "Jail name"); -SYSCTL_JAIL_PARAM(, cpuset, CTLTYPE_INT | CTLFLAG_RD, "I", "Jail cpuset ID"); -SYSCTL_JAIL_PARAM_STRING(, path, CTLFLAG_RD, MAXPATHLEN, "Jail root path"); -SYSCTL_JAIL_PARAM(, securelevel, CTLTYPE_INT | CTLFLAG_RW, - "I", "Jail secure level"); -SYSCTL_JAIL_PARAM(, persist, CTLTYPE_INT | CTLFLAG_RW, - "B", "Jail persistence"); -SYSCTL_JAIL_PARAM(, dying, CTLTYPE_INT | CTLFLAG_RD, - "B", "Jail is in the process of shutting down"); - -SYSCTL_JAIL_PARAM_NODE(host, "Jail host info"); -SYSCTL_JAIL_PARAM_STRING(_host, hostname, CTLFLAG_RW, MAXHOSTNAMELEN, - "Jail hostname"); - -#ifdef INET -SYSCTL_JAIL_PARAM_NODE(ip4, "Jail IPv4 address virtualization"); -SYSCTL_JAIL_PARAM_STRUCT(_ip4, addr, CTLFLAG_RW, sizeof(struct in_addr), - "S,in_addr,a", "Jail IPv4 addresses"); -#endif -#ifdef INET6 -SYSCTL_JAIL_PARAM_NODE(ip6, "Jail IPv6 address virtualization"); -SYSCTL_JAIL_PARAM_STRUCT(_ip6, addr, CTLFLAG_RW, sizeof(struct in6_addr), - "S,in6_addr,a", "Jail IPv6 addresses"); -#endif - - /* * struct jail_get_args { * struct iovec *iovp; * unsigned int iovcnt; * int flags; * }; */ int jail_get(struct thread *td, struct jail_get_args *uap) { struct uio *auio; int error; /* Check that we have an 
even number of iovecs. */ if (uap->iovcnt & 1) return (EINVAL); error = copyinuio(uap->iovp, uap->iovcnt, &auio); if (error) return (error); error = kern_jail_get(td, auio, uap->flags); if (error == 0) error = copyout(auio->uio_iov, uap->iovp, uap->iovcnt * sizeof (struct iovec)); free(auio, M_IOV); return (error); } int kern_jail_get(struct thread *td, struct uio *optuio, int flags) { - struct prison *pr; + struct prison *pr, *mypr; struct vfsopt *opt; struct vfsoptlist *opts; char *errmsg, *name; - int error, errmsg_len, errmsg_pos, i, jid, len, locked, pos; + int error, errmsg_len, errmsg_pos, fi, i, jid, len, locked, pos; if (flags & ~JAIL_GET_MASK) return (EINVAL); /* Get the parameter list. */ error = vfs_buildopts(optuio, &opts); if (error) return (error); errmsg_pos = vfs_getopt_pos(opts, "errmsg"); + mypr = td->td_ucred->cr_prison; - /* Don't allow a jailed process to see any jails, not even its own. */ - if (jailed(td->td_ucred)) { - vfs_opterror(opts, "jail not found"); - return (ENOENT); - } - /* * Find the prison specified by one of: lastjid, jid, name. */ sx_slock(&allprison_lock); error = vfs_copyopt(opts, "lastjid", &jid, sizeof(jid)); if (error == 0) { TAILQ_FOREACH(pr, &allprison, pr_list) { - if (pr->pr_id > jid) { + if (pr->pr_id > jid && prison_ischild(mypr, pr)) { mtx_lock(&pr->pr_mtx); if (pr->pr_ref > 0 && (pr->pr_uref > 0 || (flags & JAIL_DYING))) break; mtx_unlock(&pr->pr_mtx); } } if (pr != NULL) goto found_prison; error = ENOENT; vfs_opterror(opts, "no jail after %d", jid); goto done_unlock_list; } else if (error != ENOENT) goto done_unlock_list; error = vfs_copyopt(opts, "jid", &jid, sizeof(jid)); if (error == 0) { if (jid != 0) { - pr = prison_find(jid); + pr = prison_find_child(mypr, jid); if (pr != NULL) { if (pr->pr_uref == 0 && !(flags & JAIL_DYING)) { mtx_unlock(&pr->pr_mtx); error = ENOENT; vfs_opterror(opts, "jail %d is dying", jid); goto done_unlock_list; } goto found_prison; } error = ENOENT; vfs_opterror(opts, "jail %d not found", jid); goto done_unlock_list; } } else if (error != ENOENT) goto done_unlock_list; error = vfs_getopt(opts, "name", (void **)&name, &len); if (error == 0) { if (len == 0 || name[len - 1] != '\0') { error = EINVAL; goto done_unlock_list; } - pr = prison_find_name(name); + pr = prison_find_name(mypr, name); if (pr != NULL) { if (pr->pr_uref == 0 && !(flags & JAIL_DYING)) { mtx_unlock(&pr->pr_mtx); error = ENOENT; vfs_opterror(opts, "jail \"%s\" is dying", name); goto done_unlock_list; } goto found_prison; } error = ENOENT; vfs_opterror(opts, "jail \"%s\" not found", name); goto done_unlock_list; } else if (error != ENOENT) goto done_unlock_list; vfs_opterror(opts, "no jail specified"); error = ENOENT; goto done_unlock_list; found_prison: /* Get the parameters of the prison. */ pr->pr_ref++; locked = PD_LOCKED; td->td_retval[0] = pr->pr_id; error = vfs_setopt(opts, "jid", &pr->pr_id, sizeof(pr->pr_id)); if (error != 0 && error != ENOENT) goto done_deref; - error = vfs_setopts(opts, "name", pr->pr_name); + i = (pr->pr_parent == mypr) ? 
0 : pr->pr_parent->pr_id; + error = vfs_setopt(opts, "parent", &i, sizeof(i)); if (error != 0 && error != ENOENT) goto done_deref; - error = vfs_setopt(opts, "cpuset", &pr->pr_cpuset->cs_id, + error = vfs_setopts(opts, "name", prison_name(mypr, pr)); + if (error != 0 && error != ENOENT) + goto done_deref; + error = vfs_setopt(opts, "cpuset.id", &pr->pr_cpuset->cs_id, sizeof(pr->pr_cpuset->cs_id)); if (error != 0 && error != ENOENT) goto done_deref; - error = vfs_setopts(opts, "path", pr->pr_path); + error = vfs_setopts(opts, "path", prison_path(mypr, pr)); if (error != 0 && error != ENOENT) goto done_deref; #ifdef INET error = vfs_setopt_part(opts, "ip4.addr", pr->pr_ip4, pr->pr_ip4s * sizeof(*pr->pr_ip4)); if (error != 0 && error != ENOENT) goto done_deref; #endif #ifdef INET6 error = vfs_setopt_part(opts, "ip6.addr", pr->pr_ip6, pr->pr_ip6s * sizeof(*pr->pr_ip6)); if (error != 0 && error != ENOENT) goto done_deref; #endif error = vfs_setopt(opts, "securelevel", &pr->pr_securelevel, sizeof(pr->pr_securelevel)); if (error != 0 && error != ENOENT) goto done_deref; error = vfs_setopts(opts, "host.hostname", pr->pr_host); if (error != 0 && error != ENOENT) goto done_deref; - i = pr->pr_flags & PR_PERSIST ? 1 : 0; - error = vfs_setopt(opts, "persist", &i, sizeof(i)); + error = vfs_setopt(opts, "enforce_statfs", &pr->pr_enforce_statfs, + sizeof(pr->pr_enforce_statfs)); if (error != 0 && error != ENOENT) goto done_deref; - i = !i; - error = vfs_setopt(opts, "nopersist", &i, sizeof(i)); - if (error != 0 && error != ENOENT) - goto done_deref; + for (fi = 0; fi < sizeof(pr_flag_names) / sizeof(pr_flag_names[0]); + fi++) { + if (pr_flag_names[fi] == NULL) + continue; + i = (pr->pr_flags & (1 << fi)) ? 1 : 0; + error = vfs_setopt(opts, pr_flag_names[fi], &i, sizeof(i)); + if (error != 0 && error != ENOENT) + goto done_deref; + i = !i; + error = vfs_setopt(opts, pr_flag_nonames[fi], &i, sizeof(i)); + if (error != 0 && error != ENOENT) + goto done_deref; + } + for (fi = 0; fi < sizeof(pr_allow_names) / sizeof(pr_allow_names[0]); + fi++) { + if (pr_allow_names[fi] == NULL) + continue; + i = (pr->pr_allow & (1 << fi)) ? 1 : 0; + error = vfs_setopt(opts, pr_allow_names[fi], &i, sizeof(i)); + if (error != 0 && error != ENOENT) + goto done_deref; + i = !i; + error = vfs_setopt(opts, pr_allow_nonames[fi], &i, sizeof(i)); + if (error != 0 && error != ENOENT) + goto done_deref; + } i = (pr->pr_uref == 0); error = vfs_setopt(opts, "dying", &i, sizeof(i)); if (error != 0 && error != ENOENT) goto done_deref; i = !i; error = vfs_setopt(opts, "nodying", &i, sizeof(i)); if (error != 0 && error != ENOENT) goto done_deref; /* Get the module parameters. */ mtx_unlock(&pr->pr_mtx); locked = 0; error = osd_jail_call(pr, PR_METHOD_GET, opts); if (error) goto done_deref; prison_deref(pr, PD_DEREF | PD_LIST_SLOCKED); /* By now, all parameters should have been noted. */ TAILQ_FOREACH(opt, opts, link) { if (!opt->seen && strcmp(opt->name, "errmsg")) { error = EINVAL; vfs_opterror(opts, "unknown parameter: %s", opt->name); goto done_errmsg; } } /* Write the fetched parameters back to userspace. 
*/ error = 0; TAILQ_FOREACH(opt, opts, link) { if (opt->pos >= 0 && opt->pos != errmsg_pos) { pos = 2 * opt->pos + 1; optuio->uio_iov[pos].iov_len = opt->len; if (opt->value != NULL) { if (optuio->uio_segflg == UIO_SYSSPACE) { bcopy(opt->value, optuio->uio_iov[pos].iov_base, opt->len); } else { error = copyout(opt->value, optuio->uio_iov[pos].iov_base, opt->len); if (error) break; } } } } goto done_errmsg; done_deref: prison_deref(pr, locked | PD_DEREF | PD_LIST_SLOCKED); goto done_errmsg; done_unlock_list: sx_sunlock(&allprison_lock); done_errmsg: if (error && errmsg_pos >= 0) { vfs_getopt(opts, "errmsg", (void **)&errmsg, &errmsg_len); errmsg_pos = 2 * errmsg_pos + 1; if (errmsg_len > 0) { if (optuio->uio_segflg == UIO_SYSSPACE) bcopy(errmsg, optuio->uio_iov[errmsg_pos].iov_base, errmsg_len); else copyout(errmsg, optuio->uio_iov[errmsg_pos].iov_base, errmsg_len); } } vfs_freeopts(opts); return (error); } + /* * struct jail_remove_args { * int jid; * }; */ int jail_remove(struct thread *td, struct jail_remove_args *uap) { - struct prison *pr; - struct proc *p; - int deuref, error; + struct prison *pr, *cpr, *lpr, *tpr; + int descend, error; error = priv_check(td, PRIV_JAIL_REMOVE); if (error) return (error); sx_xlock(&allprison_lock); - pr = prison_find(uap->jid); + pr = prison_find_child(td->td_ucred->cr_prison, uap->jid); if (pr == NULL) { sx_xunlock(&allprison_lock); return (EINVAL); } + /* Remove all descendants of this prison, then remove this prison. */ + pr->pr_ref++; + pr->pr_flags |= PR_REMOVE; + if (!LIST_EMPTY(&pr->pr_children)) { + mtx_unlock(&pr->pr_mtx); + lpr = NULL; + FOREACH_PRISON_DESCENDANT(pr, cpr, descend) { + mtx_lock(&cpr->pr_mtx); + if (cpr->pr_ref > 0) { + tpr = cpr; + cpr->pr_ref++; + cpr->pr_flags |= PR_REMOVE; + } else { + /* Already removed - do not do it again. */ + tpr = NULL; + } + mtx_unlock(&cpr->pr_mtx); + if (lpr != NULL) { + mtx_lock(&lpr->pr_mtx); + prison_remove_one(lpr); + sx_xlock(&allprison_lock); + } + lpr = tpr; + } + if (lpr != NULL) { + mtx_lock(&lpr->pr_mtx); + prison_remove_one(lpr); + sx_xlock(&allprison_lock); + } + mtx_lock(&pr->pr_mtx); + } + prison_remove_one(pr); + return (0); +} + +static void +prison_remove_one(struct prison *pr) +{ + struct proc *p; + int deuref; + /* If the prison was persistent, it is not anymore. */ deuref = 0; if (pr->pr_flags & PR_PERSIST) { pr->pr_ref--; deuref = PD_DEUREF; pr->pr_flags &= ~PR_PERSIST; } - /* If there are no references left, remove the prison now. */ - if (pr->pr_ref == 0) { + /* + * jail_remove added a reference. If that's the only one, remove + * the prison now. + */ + KASSERT(pr->pr_ref > 0, + ("prison_remove_one removing a dead prison (jid=%d)", pr->pr_id)); + if (pr->pr_ref == 1) { prison_deref(pr, deuref | PD_DEREF | PD_LOCKED | PD_LIST_XLOCKED); - return (0); + return; } - /* - * Keep a temporary reference to make sure this prison sticks around. - */ - pr->pr_ref++; mtx_unlock(&pr->pr_mtx); sx_xunlock(&allprison_lock); /* * Kill all processes unfortunate enough to be attached to this prison. */ sx_slock(&allproc_lock); LIST_FOREACH(p, &allproc, p_list) { PROC_LOCK(p); if (p->p_state != PRS_NEW && p->p_ucred && p->p_ucred->cr_prison == pr) psignal(p, SIGKILL); PROC_UNLOCK(p); } sx_sunlock(&allproc_lock); - /* Remove the temporary reference. */ + /* Remove the temporary reference added by jail_remove. 
*/ prison_deref(pr, deuref | PD_DEREF); - return (0); } /* * struct jail_attach_args { * int jid; * }; */ int jail_attach(struct thread *td, struct jail_attach_args *uap) { struct prison *pr; int error; error = priv_check(td, PRIV_JAIL_ATTACH); if (error) return (error); sx_slock(&allprison_lock); - pr = prison_find(uap->jid); + pr = prison_find_child(td->td_ucred->cr_prison, uap->jid); if (pr == NULL) { sx_sunlock(&allprison_lock); return (EINVAL); } /* * Do not allow a process to attach to a prison that is not * considered to be "alive". */ if (pr->pr_uref == 0) { mtx_unlock(&pr->pr_mtx); sx_sunlock(&allprison_lock); return (EINVAL); } return (do_jail_attach(td, pr)); } static int do_jail_attach(struct thread *td, struct prison *pr) { + struct prison *ppr; struct proc *p; struct ucred *newcred, *oldcred; int vfslocked, error; /* * XXX: Note that there is a slight race here if two threads * in the same privileged process attempt to attach to two * different jails at the same time. It is important for * user processes not to do this, or they might end up with * a process root from one prison, but attached to the jail * of another. */ pr->pr_ref++; pr->pr_uref++; mtx_unlock(&pr->pr_mtx); /* Let modules do whatever they need to prepare for attaching. */ error = osd_jail_call(pr, PR_METHOD_ATTACH, td); if (error) { prison_deref(pr, PD_DEREF | PD_DEUREF | PD_LIST_SLOCKED); return (error); } sx_sunlock(&allprison_lock); /* * Reparent the newly attached process to this jail. */ + ppr = td->td_ucred->cr_prison; p = td->td_proc; error = cpuset_setproc_update_set(p, pr->pr_cpuset); if (error) goto e_revert_osd; vfslocked = VFS_LOCK_GIANT(pr->pr_root->v_mount); vn_lock(pr->pr_root, LK_EXCLUSIVE | LK_RETRY); if ((error = change_dir(pr->pr_root, td)) != 0) goto e_unlock; #ifdef MAC if ((error = mac_vnode_check_chroot(td->td_ucred, pr->pr_root))) goto e_unlock; #endif VOP_UNLOCK(pr->pr_root, 0); if ((error = change_root(pr->pr_root, td))) goto e_unlock_giant; VFS_UNLOCK_GIANT(vfslocked); newcred = crget(); PROC_LOCK(p); oldcred = p->p_ucred; setsugid(p); crcopy(newcred, oldcred); newcred->cr_prison = pr; p->p_ucred = newcred; PROC_UNLOCK(p); crfree(oldcred); + prison_deref(ppr, PD_DEREF | PD_DEUREF); return (0); e_unlock: VOP_UNLOCK(pr->pr_root, 0); e_unlock_giant: VFS_UNLOCK_GIANT(vfslocked); e_revert_osd: /* Tell modules this thread is still in its old jail after all. */ - (void)osd_jail_call(td->td_ucred->cr_prison, PR_METHOD_ATTACH, td); + (void)osd_jail_call(ppr, PR_METHOD_ATTACH, td); prison_deref(pr, PD_DEREF | PD_DEUREF); return (error); } + /* * Returns a locked prison instance, or NULL on failure. */ struct prison * prison_find(int prid) { struct prison *pr; sx_assert(&allprison_lock, SX_LOCKED); TAILQ_FOREACH(pr, &allprison, pr_list) { if (pr->pr_id == prid) { mtx_lock(&pr->pr_mtx); if (pr->pr_ref > 0) return (pr); mtx_unlock(&pr->pr_mtx); } } return (NULL); } /* - * Look for the named prison. Returns a locked prison or NULL. + * Find a prison that is a descendant of mypr. Returns a locked prison or NULL. */ struct prison * -prison_find_name(const char *name) +prison_find_child(struct prison *mypr, int prid) { + struct prison *pr; + int descend; + + sx_assert(&allprison_lock, SX_LOCKED); + FOREACH_PRISON_DESCENDANT(mypr, pr, descend) { + if (pr->pr_id == prid) { + mtx_lock(&pr->pr_mtx); + if (pr->pr_ref > 0) + return (pr); + mtx_unlock(&pr->pr_mtx); + } + } + return (NULL); +} + +/* + * Look for the name relative to mypr. Returns a locked prison or NULL. 
+ */ +struct prison * +prison_find_name(struct prison *mypr, const char *name) +{ struct prison *pr, *deadpr; + size_t mylen; + int descend; sx_assert(&allprison_lock, SX_LOCKED); + mylen = (mypr == &prison0) ? 0 : strlen(mypr->pr_name) + 1; again: deadpr = NULL; - TAILQ_FOREACH(pr, &allprison, pr_list) { - if (!strcmp(pr->pr_name, name)) { + FOREACH_PRISON_DESCENDANT(mypr, pr, descend) { + if (!strcmp(pr->pr_name + mylen, name)) { mtx_lock(&pr->pr_mtx); if (pr->pr_ref > 0) { if (pr->pr_uref > 0) return (pr); deadpr = pr; } mtx_unlock(&pr->pr_mtx); } } - /* There was no valid prison - perhaps there was a dying one */ + /* There was no valid prison - perhaps there was a dying one. */ if (deadpr != NULL) { mtx_lock(&deadpr->pr_mtx); if (deadpr->pr_ref == 0) { mtx_unlock(&deadpr->pr_mtx); goto again; } } return (deadpr); } /* + * See if a prison has the specific flag set. + */ +int +prison_flag(struct ucred *cred, unsigned flag) +{ + + /* This is an atomic read, so no locking is necessary. */ + return (cred->cr_prison->pr_flags & flag); +} + +int +prison_allow(struct ucred *cred, unsigned flag) +{ + + /* This is an atomic read, so no locking is necessary. */ + return (cred->cr_prison->pr_allow & flag); +} + +/* * Remove a prison reference. If that was the last reference, remove the * prison itself - but not in this context in case there are locks held. */ void prison_free_locked(struct prison *pr) { mtx_assert(&pr->pr_mtx, MA_OWNED); pr->pr_ref--; if (pr->pr_ref == 0) { mtx_unlock(&pr->pr_mtx); TASK_INIT(&pr->pr_task, 0, prison_complete, pr); taskqueue_enqueue(taskqueue_thread, &pr->pr_task); return; } mtx_unlock(&pr->pr_mtx); } void prison_free(struct prison *pr) { mtx_lock(&pr->pr_mtx); prison_free_locked(pr); } static void prison_complete(void *context, int pending) { prison_deref((struct prison *)context, 0); } /* * Remove a prison reference (usually). This internal version assumes no * mutexes are held, except perhaps the prison itself. If there are no more * references, release and delist the prison. On completion, the prison lock * and the allprison lock are both unlocked. */ static void prison_deref(struct prison *pr, int flags) { + struct prison *ppr, *tpr; int vfslocked; if (!(flags & PD_LOCKED)) mtx_lock(&pr->pr_mtx); + /* Decrement the user references in a separate loop. */ if (flags & PD_DEUREF) { - pr->pr_uref--; + for (tpr = pr;; tpr = tpr->pr_parent) { + if (tpr != pr) + mtx_lock(&tpr->pr_mtx); + if (--tpr->pr_uref > 0) + break; + KASSERT(tpr != &prison0, ("prison0 pr_uref=0")); + mtx_unlock(&tpr->pr_mtx); + } /* Done if there were only user references to remove. */ if (!(flags & PD_DEREF)) { - mtx_unlock(&pr->pr_mtx); + mtx_unlock(&tpr->pr_mtx); if (flags & PD_LIST_SLOCKED) sx_sunlock(&allprison_lock); else if (flags & PD_LIST_XLOCKED) sx_xunlock(&allprison_lock); return; } + if (tpr != pr) { + mtx_unlock(&tpr->pr_mtx); + mtx_lock(&pr->pr_mtx); + } } - if (flags & PD_DEREF) - pr->pr_ref--; - /* If the prison still has references, nothing else to do. 
*/ - if (pr->pr_ref > 0) { - mtx_unlock(&pr->pr_mtx); - if (flags & PD_LIST_SLOCKED) - sx_sunlock(&allprison_lock); - else if (flags & PD_LIST_XLOCKED) - sx_xunlock(&allprison_lock); - return; - } - KASSERT(pr->pr_uref == 0, - ("%s: Trying to remove an active prison (jid=%d).", __func__, - pr->pr_id)); - mtx_unlock(&pr->pr_mtx); - if (flags & PD_LIST_SLOCKED) { - if (!sx_try_upgrade(&allprison_lock)) { - sx_sunlock(&allprison_lock); - sx_xlock(&allprison_lock); + for (;;) { + if (flags & PD_DEREF) + pr->pr_ref--; + /* If the prison still has references, nothing else to do. */ + if (pr->pr_ref > 0) { + mtx_unlock(&pr->pr_mtx); + if (flags & PD_LIST_SLOCKED) + sx_sunlock(&allprison_lock); + else if (flags & PD_LIST_XLOCKED) + sx_xunlock(&allprison_lock); + return; } - } else if (!(flags & PD_LIST_XLOCKED)) - sx_xlock(&allprison_lock); - TAILQ_REMOVE(&allprison, pr, pr_list); - prisoncount--; - sx_xunlock(&allprison_lock); + mtx_unlock(&pr->pr_mtx); + if (flags & PD_LIST_SLOCKED) { + if (!sx_try_upgrade(&allprison_lock)) { + sx_sunlock(&allprison_lock); + sx_xlock(&allprison_lock); + } + } else if (!(flags & PD_LIST_XLOCKED)) + sx_xlock(&allprison_lock); - if (pr->pr_root != NULL) { - vfslocked = VFS_LOCK_GIANT(pr->pr_root->v_mount); - vrele(pr->pr_root); - VFS_UNLOCK_GIANT(vfslocked); - } - mtx_destroy(&pr->pr_mtx); + TAILQ_REMOVE(&allprison, pr, pr_list); + LIST_REMOVE(pr, pr_sibling); + ppr = pr->pr_parent; + for (tpr = ppr; tpr != NULL; tpr = tpr->pr_parent) + tpr->pr_prisoncount--; + sx_downgrade(&allprison_lock); + + if (pr->pr_root != NULL) { + vfslocked = VFS_LOCK_GIANT(pr->pr_root->v_mount); + vrele(pr->pr_root); + VFS_UNLOCK_GIANT(vfslocked); + } + mtx_destroy(&pr->pr_mtx); #ifdef INET - free(pr->pr_ip4, M_PRISON); + free(pr->pr_ip4, M_PRISON); #endif #ifdef INET6 - free(pr->pr_ip6, M_PRISON); + free(pr->pr_ip6, M_PRISON); #endif - if (pr->pr_cpuset != NULL) - cpuset_rel(pr->pr_cpuset); - osd_jail_exit(pr); - free(pr, M_PRISON); + if (pr->pr_cpuset != NULL) + cpuset_rel(pr->pr_cpuset); + osd_jail_exit(pr); + free(pr, M_PRISON); + + /* Removing a prison frees a reference on its parent. */ + pr = ppr; + mtx_lock(&pr->pr_mtx); + flags = PD_DEREF | PD_LIST_SLOCKED; + } } void prison_hold_locked(struct prison *pr) { mtx_assert(&pr->pr_mtx, MA_OWNED); KASSERT(pr->pr_ref > 0, ("Trying to hold dead prison (jid=%d).", pr->pr_id)); pr->pr_ref++; } void prison_hold(struct prison *pr) { mtx_lock(&pr->pr_mtx); prison_hold_locked(pr); mtx_unlock(&pr->pr_mtx); } void prison_proc_hold(struct prison *pr) { mtx_lock(&pr->pr_mtx); KASSERT(pr->pr_uref > 0, ("Cannot add a process to a non-alive prison (jid=%d)", pr->pr_id)); pr->pr_uref++; mtx_unlock(&pr->pr_mtx); } void prison_proc_free(struct prison *pr) { mtx_lock(&pr->pr_mtx); KASSERT(pr->pr_uref > 0, ("Trying to kill a process in a dead prison (jid=%d)", pr->pr_id)); prison_deref(pr, PD_DEUREF | PD_LOCKED); } #ifdef INET /* + * Restrict a prison's IP address list with its parent's, possibly replacing + * it. Return true if the replacement buffer was used (or would have been). + */ +static int +prison_restrict_ip4(struct prison *pr, struct in_addr *newip4) +{ + int ii, ij, used; + struct prison *ppr; + + ppr = pr->pr_parent; + if (!(pr->pr_flags & PR_IP4_USER)) { + /* This has no user settings, so just copy the parent's list. */ + if (pr->pr_ip4s < ppr->pr_ip4s) { + /* + * There's no room for the parent's list. Use the + * new list buffer, which is assumed to be big enough + * (if it was passed). If there's no buffer, try to + * allocate one. 
+ */ + used = 1; + if (newip4 == NULL) { + newip4 = malloc(ppr->pr_ip4s * sizeof(*newip4), + M_PRISON, M_NOWAIT); + if (newip4 != NULL) + used = 0; + } + if (newip4 != NULL) { + bcopy(ppr->pr_ip4, newip4, + ppr->pr_ip4s * sizeof(*newip4)); + free(pr->pr_ip4, M_PRISON); + pr->pr_ip4 = newip4; + pr->pr_ip4s = ppr->pr_ip4s; + pr->pr_flags |= PR_IP4; + } + return (used); + } + pr->pr_ip4s = ppr->pr_ip4s; + if (pr->pr_ip4s > 0) + bcopy(ppr->pr_ip4, pr->pr_ip4, + pr->pr_ip4s * sizeof(*newip4)); + else if (pr->pr_ip4 != NULL) { + free(pr->pr_ip4, M_PRISON); + pr->pr_ip4 = NULL; + } + pr->pr_flags = + (pr->pr_flags & ~PR_IP4) | (ppr->pr_flags & PR_IP4); + } else if (pr->pr_ip4s > 0 && (ppr->pr_flags & PR_IP4)) { + /* Remove addresses that aren't in the parent. */ + for (ij = 0; ij < ppr->pr_ip4s; ij++) + if (pr->pr_ip4[0].s_addr == ppr->pr_ip4[ij].s_addr) + break; + if (ij < ppr->pr_ip4s) + ii = 1; + else { + bcopy(pr->pr_ip4 + 1, pr->pr_ip4, + --pr->pr_ip4s * sizeof(*pr->pr_ip4)); + ii = 0; + } + for (ij = 1; ii < pr->pr_ip4s; ) { + if (pr->pr_ip4[ii].s_addr == ppr->pr_ip4[0].s_addr) { + ii++; + continue; + } + switch (ij >= ppr->pr_ip4s ? -1 : + qcmp_v4(&pr->pr_ip4[ii], &ppr->pr_ip4[ij])) { + case -1: + bcopy(pr->pr_ip4 + ii + 1, pr->pr_ip4 + ii, + (--pr->pr_ip4s - ii) * sizeof(*pr->pr_ip4)); + break; + case 0: + ii++; + ij++; + break; + case 1: + ij++; + break; + } + } + if (pr->pr_ip4s == 0) { + free(pr->pr_ip4, M_PRISON); + pr->pr_ip4 = NULL; + } + } + return (0); +} + +/* * Pass back primary IPv4 address of this jail. * - * If not jailed return success but do not alter the address. Caller has to - * make sure to initialize it correctly (e.g. INADDR_ANY). + * If not restricted return success but do not alter the address. Caller has + * to make sure to initialize it correctly (e.g. INADDR_ANY). * * Returns 0 on success, EAFNOSUPPORT if the jail doesn't allow IPv4. * Address returned in NBO. */ int prison_get_ip4(struct ucred *cred, struct in_addr *ia) { struct prison *pr; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(ia != NULL, ("%s: ia is NULL", __func__)); - if (!jailed(cred)) - return (0); pr = cred->cr_prison; + if (!(pr->pr_flags & PR_IP4)) + return (0); mtx_lock(&pr->pr_mtx); + if (!(pr->pr_flags & PR_IP4)) { + mtx_unlock(&pr->pr_mtx); + return (0); + } if (pr->pr_ip4 == NULL) { mtx_unlock(&pr->pr_mtx); return (EAFNOSUPPORT); } ia->s_addr = pr->pr_ip4[0].s_addr; mtx_unlock(&pr->pr_mtx); return (0); } /* + * Return true if pr1 and pr2 have the same IPv4 address restrictions. + */ +int +prison_equal_ip4(struct prison *pr1, struct prison *pr2) +{ + + if (pr1 == pr2) + return (1); + + /* + * jail_set maintains an exclusive hold on allprison_lock while it + * changes the IP addresses, so only a shared hold is needed. This is + * easier than locking the two prisons which would require finding the + * proper locking order and end up needing allprison_lock anyway. + */ + sx_slock(&allprison_lock); + while (pr1 != &prison0 && !(pr1->pr_flags & PR_IP4_USER)) + pr1 = pr1->pr_parent; + while (pr2 != &prison0 && !(pr2->pr_flags & PR_IP4_USER)) + pr2 = pr2->pr_parent; + sx_sunlock(&allprison_lock); + return (pr1 == pr2); +} + +/* * Make sure our (source) address is set to something meaningful to this * jail. * - * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if - * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow IPv4. - * Address passed in in NBO and returned in NBO. 
+ * Returns 0 if jail doesn't restrict IPv4 or if address belongs to jail, + * EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if the jail + * doesn't allow IPv4. Address passed in in NBO and returned in NBO. */ int prison_local_ip4(struct ucred *cred, struct in_addr *ia) { struct prison *pr; struct in_addr ia0; int error; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(ia != NULL, ("%s: ia is NULL", __func__)); - if (!jailed(cred)) - return (0); pr = cred->cr_prison; + if (!(pr->pr_flags & PR_IP4)) + return (0); mtx_lock(&pr->pr_mtx); + if (!(pr->pr_flags & PR_IP4)) { + mtx_unlock(&pr->pr_mtx); + return (0); + } if (pr->pr_ip4 == NULL) { mtx_unlock(&pr->pr_mtx); return (EAFNOSUPPORT); } ia0.s_addr = ntohl(ia->s_addr); if (ia0.s_addr == INADDR_LOOPBACK) { ia->s_addr = pr->pr_ip4[0].s_addr; mtx_unlock(&pr->pr_mtx); return (0); } if (ia0.s_addr == INADDR_ANY) { /* * In case there is only 1 IPv4 address, bind directly. */ if (pr->pr_ip4s == 1) ia->s_addr = pr->pr_ip4[0].s_addr; mtx_unlock(&pr->pr_mtx); return (0); } error = _prison_check_ip4(pr, ia); mtx_unlock(&pr->pr_mtx); return (error); } /* * Rewrite destination address in case we will connect to loopback address. * * Returns 0 on success, EAFNOSUPPORT if the jail doesn't allow IPv4. * Address passed in in NBO and returned in NBO. */ int prison_remote_ip4(struct ucred *cred, struct in_addr *ia) { struct prison *pr; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(ia != NULL, ("%s: ia is NULL", __func__)); - if (!jailed(cred)) - return (0); pr = cred->cr_prison; + if (!(pr->pr_flags & PR_IP4)) + return (0); mtx_lock(&pr->pr_mtx); + if (!(pr->pr_flags & PR_IP4)) { + mtx_unlock(&pr->pr_mtx); + return (0); + } if (pr->pr_ip4 == NULL) { mtx_unlock(&pr->pr_mtx); return (EAFNOSUPPORT); } if (ntohl(ia->s_addr) == INADDR_LOOPBACK) { ia->s_addr = pr->pr_ip4[0].s_addr; mtx_unlock(&pr->pr_mtx); return (0); } /* * Return success because nothing had to be changed. */ mtx_unlock(&pr->pr_mtx); return (0); } /* * Check if given address belongs to the jail referenced by cred/prison. * - * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if - * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow IPv4. - * Address passed in in NBO. + * Returns 0 if jail doesn't restrict IPv4 or if address belongs to jail, + * EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if the jail + * doesn't allow IPv4. Address passed in in NBO. */ static int _prison_check_ip4(struct prison *pr, struct in_addr *ia) { int i, a, z, d; /* * Check the primary IP. */ if (pr->pr_ip4[0].s_addr == ia->s_addr) return (0); /* * All the other IPs are sorted so we can do a binary search. 
*/ a = 0; z = pr->pr_ip4s - 2; while (a <= z) { i = (a + z) / 2; d = qcmp_v4(&pr->pr_ip4[i+1], ia); if (d > 0) z = i - 1; else if (d < 0) a = i + 1; else return (0); } return (EADDRNOTAVAIL); } int prison_check_ip4(struct ucred *cred, struct in_addr *ia) { struct prison *pr; int error; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(ia != NULL, ("%s: ia is NULL", __func__)); - if (!jailed(cred)) - return (0); pr = cred->cr_prison; + if (!(pr->pr_flags & PR_IP4)) + return (0); mtx_lock(&pr->pr_mtx); + if (!(pr->pr_flags & PR_IP4)) { + mtx_unlock(&pr->pr_mtx); + return (0); + } if (pr->pr_ip4 == NULL) { mtx_unlock(&pr->pr_mtx); return (EAFNOSUPPORT); } error = _prison_check_ip4(pr, ia); mtx_unlock(&pr->pr_mtx); return (error); } #endif #ifdef INET6 +static int +prison_restrict_ip6(struct prison *pr, struct in6_addr *newip6) +{ + int ii, ij, used; + struct prison *ppr; + + ppr = pr->pr_parent; + if (!(pr->pr_flags & PR_IP6_USER)) { + /* This has no user settings, so just copy the parent's list. */ + if (pr->pr_ip6s < ppr->pr_ip6s) { + /* + * There's no room for the parent's list. Use the + * new list buffer, which is assumed to be big enough + * (if it was passed). If there's no buffer, try to + * allocate one. + */ + used = 1; + if (newip6 == NULL) { + newip6 = malloc(ppr->pr_ip6s * sizeof(*newip6), + M_PRISON, M_NOWAIT); + if (newip6 != NULL) + used = 0; + } + if (newip6 != NULL) { + bcopy(ppr->pr_ip6, newip6, + ppr->pr_ip6s * sizeof(*newip6)); + free(pr->pr_ip6, M_PRISON); + pr->pr_ip6 = newip6; + pr->pr_ip6s = ppr->pr_ip6s; + pr->pr_flags |= PR_IP6; + } + return (used); + } + pr->pr_ip6s = ppr->pr_ip6s; + if (pr->pr_ip6s > 0) + bcopy(ppr->pr_ip6, pr->pr_ip6, + pr->pr_ip6s * sizeof(*newip6)); + else if (pr->pr_ip6 != NULL) { + free(pr->pr_ip6, M_PRISON); + pr->pr_ip6 = NULL; + } + pr->pr_flags = + (pr->pr_flags & ~PR_IP6) | (ppr->pr_flags & PR_IP6); + } else if (pr->pr_ip6s > 0 && (ppr->pr_flags & PR_IP6)) { + /* Remove addresses that aren't in the parent. */ + for (ij = 0; ij < ppr->pr_ip6s; ij++) + if (IN6_ARE_ADDR_EQUAL(&pr->pr_ip6[0], + &ppr->pr_ip6[ij])) + break; + if (ij < ppr->pr_ip6s) + ii = 1; + else { + bcopy(pr->pr_ip6 + 1, pr->pr_ip6, + --pr->pr_ip6s * sizeof(*pr->pr_ip6)); + ii = 0; + } + for (ij = 1; ii < pr->pr_ip6s; ) { + if (IN6_ARE_ADDR_EQUAL(&pr->pr_ip6[ii], + &ppr->pr_ip6[0])) { + ii++; + continue; + } + switch (ij >= ppr->pr_ip6s ? -1 : + qcmp_v6(&pr->pr_ip6[ii], &ppr->pr_ip6[ij])) { + case -1: + bcopy(pr->pr_ip6 + ii + 1, pr->pr_ip6 + ii, + (--pr->pr_ip6s - ii) * sizeof(*pr->pr_ip6)); + break; + case 0: + ii++; + ij++; + break; + case 1: + ij++; + break; + } + } + if (pr->pr_ip6s == 0) { + free(pr->pr_ip6, M_PRISON); + pr->pr_ip6 = NULL; + } + } + return (0); +} + /* * Pass back primary IPv6 address for this jail. * - * If not jailed return success but do not alter the address. Caller has to - * make sure to initialize it correctly (e.g. IN6ADDR_ANY_INIT). + * If not restricted return success but do not alter the address. Caller has + * to make sure to initialize it correctly (e.g. IN6ADDR_ANY_INIT). * * Returns 0 on success, EAFNOSUPPORT if the jail doesn't allow IPv6. 
*/ int prison_get_ip6(struct ucred *cred, struct in6_addr *ia6) { struct prison *pr; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(ia6 != NULL, ("%s: ia6 is NULL", __func__)); - if (!jailed(cred)) - return (0); pr = cred->cr_prison; + if (!(pr->pr_flags & PR_IP6)) + return (0); mtx_lock(&pr->pr_mtx); + if (!(pr->pr_flags & PR_IP6)) { + mtx_unlock(&pr->pr_mtx); + return (0); + } if (pr->pr_ip6 == NULL) { mtx_unlock(&pr->pr_mtx); return (EAFNOSUPPORT); } bcopy(&pr->pr_ip6[0], ia6, sizeof(struct in6_addr)); mtx_unlock(&pr->pr_mtx); return (0); } /* + * Return true if pr1 and pr2 have the same IPv6 address restrictions. + */ +int +prison_equal_ip6(struct prison *pr1, struct prison *pr2) +{ + + if (pr1 == pr2) + return (1); + + sx_slock(&allprison_lock); + while (pr1 != &prison0 && !(pr1->pr_flags & PR_IP6_USER)) + pr1 = pr1->pr_parent; + while (pr2 != &prison0 && !(pr2->pr_flags & PR_IP6_USER)) + pr2 = pr2->pr_parent; + sx_sunlock(&allprison_lock); + return (pr1 == pr2); +} + +/* * Make sure our (source) address is set to something meaningful to this jail. * * v6only should be set based on (inp->inp_flags & IN6P_IPV6_V6ONLY != 0) * when needed while binding. * - * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if - * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow IPv6. + * Returns 0 if jail doesn't restrict IPv6 or if address belongs to jail, + * EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if the jail + * doesn't allow IPv6. */ int prison_local_ip6(struct ucred *cred, struct in6_addr *ia6, int v6only) { struct prison *pr; int error; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(ia6 != NULL, ("%s: ia6 is NULL", __func__)); - if (!jailed(cred)) - return (0); pr = cred->cr_prison; + if (!(pr->pr_flags & PR_IP6)) + return (0); mtx_lock(&pr->pr_mtx); + if (!(pr->pr_flags & PR_IP6)) { + mtx_unlock(&pr->pr_mtx); + return (0); + } if (pr->pr_ip6 == NULL) { mtx_unlock(&pr->pr_mtx); return (EAFNOSUPPORT); } if (IN6_IS_ADDR_LOOPBACK(ia6)) { bcopy(&pr->pr_ip6[0], ia6, sizeof(struct in6_addr)); mtx_unlock(&pr->pr_mtx); return (0); } if (IN6_IS_ADDR_UNSPECIFIED(ia6)) { /* * In case there is only 1 IPv6 address, and v6only is true, * then bind directly. */ if (v6only != 0 && pr->pr_ip6s == 1) bcopy(&pr->pr_ip6[0], ia6, sizeof(struct in6_addr)); mtx_unlock(&pr->pr_mtx); return (0); } error = _prison_check_ip6(pr, ia6); mtx_unlock(&pr->pr_mtx); return (error); } /* * Rewrite destination address in case we will connect to loopback address. * * Returns 0 on success, EAFNOSUPPORT if the jail doesn't allow IPv6. */ int prison_remote_ip6(struct ucred *cred, struct in6_addr *ia6) { struct prison *pr; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(ia6 != NULL, ("%s: ia6 is NULL", __func__)); - if (!jailed(cred)) - return (0); pr = cred->cr_prison; + if (!(pr->pr_flags & PR_IP6)) + return (0); mtx_lock(&pr->pr_mtx); + if (!(pr->pr_flags & PR_IP6)) { + mtx_unlock(&pr->pr_mtx); + return (0); + } if (pr->pr_ip6 == NULL) { mtx_unlock(&pr->pr_mtx); return (EAFNOSUPPORT); } if (IN6_IS_ADDR_LOOPBACK(ia6)) { bcopy(&pr->pr_ip6[0], ia6, sizeof(struct in6_addr)); mtx_unlock(&pr->pr_mtx); return (0); } /* * Return success because nothing had to be changed. */ mtx_unlock(&pr->pr_mtx); return (0); } /* * Check if given address belongs to the jail referenced by cred/prison. 
* - * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if - * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow IPv6. + * Returns 0 if jail doesn't restrict IPv6 or if address belongs to jail, + * EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if the jail + * doesn't allow IPv6. */ static int _prison_check_ip6(struct prison *pr, struct in6_addr *ia6) { int i, a, z, d; /* * Check the primary IP. */ if (IN6_ARE_ADDR_EQUAL(&pr->pr_ip6[0], ia6)) return (0); /* * All the other IPs are sorted so we can do a binary search. */ a = 0; z = pr->pr_ip6s - 2; while (a <= z) { i = (a + z) / 2; d = qcmp_v6(&pr->pr_ip6[i+1], ia6); if (d > 0) z = i - 1; else if (d < 0) a = i + 1; else return (0); } return (EADDRNOTAVAIL); } int prison_check_ip6(struct ucred *cred, struct in6_addr *ia6) { struct prison *pr; int error; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(ia6 != NULL, ("%s: ia6 is NULL", __func__)); - if (!jailed(cred)) - return (0); pr = cred->cr_prison; + if (!(pr->pr_flags & PR_IP6)) + return (0); mtx_lock(&pr->pr_mtx); + if (!(pr->pr_flags & PR_IP6)) { + mtx_unlock(&pr->pr_mtx); + return (0); + } if (pr->pr_ip6 == NULL) { mtx_unlock(&pr->pr_mtx); return (EAFNOSUPPORT); } error = _prison_check_ip6(pr, ia6); mtx_unlock(&pr->pr_mtx); return (error); } #endif /* * Check if a jail supports the given address family. * * Returns 0 if not jailed or the address family is supported, EAFNOSUPPORT * if not. */ int prison_check_af(struct ucred *cred, int af) { + struct prison *pr; int error; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); - - if (!jailed(cred)) - return (0); - + pr = cred->cr_prison; error = 0; switch (af) { #ifdef INET case AF_INET: - if (cred->cr_prison->pr_ip4 == NULL) - error = EAFNOSUPPORT; + if (pr->pr_flags & PR_IP4) + { + mtx_lock(&pr->pr_mtx); + if ((pr->pr_flags & PR_IP4) && pr->pr_ip4 == NULL) + error = EAFNOSUPPORT; + mtx_unlock(&pr->pr_mtx); + } break; #endif #ifdef INET6 case AF_INET6: - if (cred->cr_prison->pr_ip6 == NULL) - error = EAFNOSUPPORT; + if (pr->pr_flags & PR_IP6) + { + mtx_lock(&pr->pr_mtx); + if ((pr->pr_flags & PR_IP6) && pr->pr_ip6 == NULL) + error = EAFNOSUPPORT; + mtx_unlock(&pr->pr_mtx); + } break; #endif case AF_LOCAL: case AF_ROUTE: break; default: - if (jail_socket_unixiproute_only) + if (!(pr->pr_allow & PR_ALLOW_SOCKET_AF)) error = EAFNOSUPPORT; } return (error); } /* * Check if given address belongs to the jail referenced by cred (wrapper to * prison_check_ip[46]). * - * Returns 0 if not jailed or if address belongs to jail, EADDRNOTAVAIL if - * the address doesn't belong, or EAFNOSUPPORT if the jail doesn't allow - * the address family. IPv4 Address passed in in NBO. + * Returns 0 if jail doesn't restrict the address family or if address belongs + * to jail, EADDRNOTAVAIL if the address doesn't belong, or EAFNOSUPPORT if + * the jail doesn't allow the address family. IPv4 Address passed in in NBO. 
*/ int prison_if(struct ucred *cred, struct sockaddr *sa) { #ifdef INET struct sockaddr_in *sai; #endif #ifdef INET6 struct sockaddr_in6 *sai6; #endif int error; KASSERT(cred != NULL, ("%s: cred is NULL", __func__)); KASSERT(sa != NULL, ("%s: sa is NULL", __func__)); error = 0; switch (sa->sa_family) { #ifdef INET case AF_INET: sai = (struct sockaddr_in *)sa; error = prison_check_ip4(cred, &sai->sin_addr); break; #endif #ifdef INET6 case AF_INET6: sai6 = (struct sockaddr_in6 *)sa; error = prison_check_ip6(cred, &sai6->sin6_addr); break; #endif default: - if (jailed(cred) && jail_socket_unixiproute_only) + if (!(cred->cr_prison->pr_allow & PR_ALLOW_SOCKET_AF)) error = EAFNOSUPPORT; } return (error); } /* * Return 0 if jails permit p1 to frob p2, otherwise ESRCH. */ int prison_check(struct ucred *cred1, struct ucred *cred2) { - if (jailed(cred1)) { - if (!jailed(cred2)) - return (ESRCH); - if (cred2->cr_prison != cred1->cr_prison) - return (ESRCH); - } #ifdef VIMAGE if (cred2->cr_vimage->v_procg != cred1->cr_vimage->v_procg) return (ESRCH); #endif + return ((cred1->cr_prison == cred2->cr_prison || + prison_ischild(cred1->cr_prison, cred2->cr_prison)) ? 0 : ESRCH); +} +/* + * Return 1 if p2 is a child of p1, otherwise 0. + */ +int +prison_ischild(struct prison *pr1, struct prison *pr2) +{ + + for (pr2 = pr2->pr_parent; pr2 != NULL; pr2 = pr2->pr_parent) + if (pr1 == pr2) + return (1); return (0); } /* * Return 1 if the passed credential is in a jail, otherwise 0. */ int jailed(struct ucred *cred) { - return (cred->cr_prison != NULL); + return (cred->cr_prison != &prison0); } /* * Return the correct hostname for the passed credential. */ void getcredhostname(struct ucred *cred, char *buf, size_t size) { INIT_VPROCG(cred->cr_vimage->v_procg); if (jailed(cred)) { mtx_lock(&cred->cr_prison->pr_mtx); strlcpy(buf, cred->cr_prison->pr_host, size); mtx_unlock(&cred->cr_prison->pr_mtx); } else { mtx_lock(&hostname_mtx); strlcpy(buf, V_hostname, size); mtx_unlock(&hostname_mtx); } } /* * Determine whether the subject represented by cred can "see" * status of a mount point. * Returns: 0 for permitted, ENOENT otherwise. * XXX: This function should be called cr_canseemount() and should be * placed in kern_prot.c. */ int prison_canseemount(struct ucred *cred, struct mount *mp) { struct prison *pr; struct statfs *sp; size_t len; - if (!jailed(cred) || jail_enforce_statfs == 0) - return (0); pr = cred->cr_prison; + if (pr->pr_enforce_statfs == 0) + return (0); if (pr->pr_root->v_mount == mp) return (0); - if (jail_enforce_statfs == 2) + if (pr->pr_enforce_statfs == 2) return (ENOENT); /* * If jail's chroot directory is set to "/" we should be able to see * all mount-points from inside a jail. * This is ugly check, but this is the only situation when jail's * directory ends with '/'. */ if (strcmp(pr->pr_path, "/") == 0) return (0); len = strlen(pr->pr_path); sp = &mp->mnt_stat; if (strncmp(pr->pr_path, sp->f_mntonname, len) != 0) return (ENOENT); /* * Be sure that we don't have situation where jail's root directory * is "/some/path" and mount point is "/some/pathpath". 
*/ if (sp->f_mntonname[len] != '\0' && sp->f_mntonname[len] != '/') return (ENOENT); return (0); } void prison_enforce_statfs(struct ucred *cred, struct mount *mp, struct statfs *sp) { char jpath[MAXPATHLEN]; struct prison *pr; size_t len; - if (!jailed(cred) || jail_enforce_statfs == 0) - return; pr = cred->cr_prison; + if (pr->pr_enforce_statfs == 0) + return; if (prison_canseemount(cred, mp) != 0) { bzero(sp->f_mntonname, sizeof(sp->f_mntonname)); strlcpy(sp->f_mntonname, "[restricted]", sizeof(sp->f_mntonname)); return; } if (pr->pr_root->v_mount == mp) { /* * Clear current buffer data, so we are sure nothing from * the valid path left there. */ bzero(sp->f_mntonname, sizeof(sp->f_mntonname)); *sp->f_mntonname = '/'; return; } /* * If jail's chroot directory is set to "/" we should be able to see * all mount-points from inside a jail. */ if (strcmp(pr->pr_path, "/") == 0) return; len = strlen(pr->pr_path); strlcpy(jpath, sp->f_mntonname + len, sizeof(jpath)); /* * Clear current buffer data, so we are sure nothing from * the valid path left there. */ bzero(sp->f_mntonname, sizeof(sp->f_mntonname)); if (*jpath == '\0') { /* Should never happen. */ *sp->f_mntonname = '/'; } else { strlcpy(sp->f_mntonname, jpath, sizeof(sp->f_mntonname)); } } /* * Check with permission for a specific privilege is granted within jail. We * have a specific list of accepted privileges; the rest are denied. */ int prison_priv_check(struct ucred *cred, int priv) { if (!jailed(cred)) return (0); switch (priv) { /* * Allow ktrace privileges for root in jail. */ case PRIV_KTRACE: #if 0 /* * Allow jailed processes to configure audit identity and * submit audit records (login, etc). In the future we may * want to further refine the relationship between audit and * jail. */ case PRIV_AUDIT_GETAUDIT: case PRIV_AUDIT_SETAUDIT: case PRIV_AUDIT_SUBMIT: #endif /* * Allow jailed processes to manipulate process UNIX * credentials in any way they see fit. */ case PRIV_CRED_SETUID: case PRIV_CRED_SETEUID: case PRIV_CRED_SETGID: case PRIV_CRED_SETEGID: case PRIV_CRED_SETGROUPS: case PRIV_CRED_SETREUID: case PRIV_CRED_SETREGID: case PRIV_CRED_SETRESUID: case PRIV_CRED_SETRESGID: /* * Jail implements visibility constraints already, so allow * jailed root to override uid/gid-based constraints. */ case PRIV_SEEOTHERGIDS: case PRIV_SEEOTHERUIDS: /* * Jail implements inter-process debugging limits already, so * allow jailed root various debugging privileges. */ case PRIV_DEBUG_DIFFCRED: case PRIV_DEBUG_SUGID: case PRIV_DEBUG_UNPRIV: /* * Allow jail to set various resource limits and login * properties, and for now, exceed process resource limits. */ case PRIV_PROC_LIMIT: case PRIV_PROC_SETLOGIN: case PRIV_PROC_SETRLIMIT: /* * System V and POSIX IPC privileges are granted in jail. */ case PRIV_IPC_READ: case PRIV_IPC_WRITE: case PRIV_IPC_ADMIN: case PRIV_IPC_MSGSIZE: case PRIV_MQ_ADMIN: /* + * Jail operations within a jail work on child jails. + */ + case PRIV_JAIL_ATTACH: + case PRIV_JAIL_SET: + case PRIV_JAIL_REMOVE: + + /* * Jail implements its own inter-process limits, so allow * root processes in jail to change scheduling on other * processes in the same jail. Likewise for signalling. */ case PRIV_SCHED_DIFFCRED: case PRIV_SCHED_CPUSET: case PRIV_SIGNAL_DIFFCRED: case PRIV_SIGNAL_SUGID: /* * Allow jailed processes to write to sysctls marked as jail * writable. */ case PRIV_SYSCTL_WRITEJAIL: /* * Allow root in jail to manage a variety of quota * properties. These should likely be conditional on a * configuration option. 
*/ case PRIV_VFS_GETQUOTA: case PRIV_VFS_SETQUOTA: /* * Since Jail relies on chroot() to implement file system * protections, grant many VFS privileges to root in jail. * Be careful to exclude mount-related and NFS-related * privileges. */ case PRIV_VFS_READ: case PRIV_VFS_WRITE: case PRIV_VFS_ADMIN: case PRIV_VFS_EXEC: case PRIV_VFS_LOOKUP: case PRIV_VFS_BLOCKRESERVE: /* XXXRW: Slightly surprising. */ case PRIV_VFS_CHFLAGS_DEV: case PRIV_VFS_CHOWN: case PRIV_VFS_CHROOT: case PRIV_VFS_RETAINSUGID: case PRIV_VFS_FCHROOT: case PRIV_VFS_LINK: case PRIV_VFS_SETGID: case PRIV_VFS_STAT: case PRIV_VFS_STICKYFILE: return (0); /* * Depending on the global setting, allow privilege of * setting system flags. */ case PRIV_VFS_SYSFLAGS: - if (jail_chflags_allowed) + if (cred->cr_prison->pr_allow & PR_ALLOW_CHFLAGS) return (0); else return (EPERM); /* * Depending on the global setting, allow privilege of * mounting/unmounting file systems. */ case PRIV_VFS_MOUNT: case PRIV_VFS_UNMOUNT: case PRIV_VFS_MOUNT_NONUSER: case PRIV_VFS_MOUNT_OWNER: - if (jail_mount_allowed) + if (cred->cr_prison->pr_allow & PR_ALLOW_MOUNT) return (0); else return (EPERM); /* * Allow jailed root to bind reserved ports and reuse in-use * ports. */ case PRIV_NETINET_RESERVEDPORT: case PRIV_NETINET_REUSEPORT: return (0); /* * Allow jailed root to set certain IPv4/6 (option) headers. */ case PRIV_NETINET_SETHDROPTS: return (0); /* * Conditionally allow creating raw sockets in jail. */ case PRIV_NETINET_RAW: - if (jail_allow_raw_sockets) + if (cred->cr_prison->pr_allow & PR_ALLOW_RAW_SOCKETS) return (0); else return (EPERM); /* * Since jail implements its own visibility limits on netstat * sysctls, allow getcred. This allows identd to work in * jail. */ case PRIV_NETINET_GETCRED: return (0); default: /* * In all remaining cases, deny the privilege request. This * includes almost all network privileges, many system * configuration privileges. */ return (EPERM); } } +/* + * Return the part of pr2's name that is relative to pr1, or the whole name + * if it does not directly follow. + */ + +char * +prison_name(struct prison *pr1, struct prison *pr2) +{ + char *name; + + /* Jails see themselves as "0" (if they see themselves at all). */ + if (pr1 == pr2) + return "0"; + name = pr2->pr_name; + if (prison_ischild(pr1, pr2)) { + /* + * pr1 isn't locked (and allprison_lock may not be either) + * so its length can't be counted on. But the number of dots + * can be counted on - and counted. + */ + for (; pr1 != &prison0; pr1 = pr1->pr_parent) + name = strchr(name, '.') + 1; + } + return (name); +} + +/* + * Return the part of pr2's path that is relative to pr1, or the whole path + * if it does not directly follow. + */ +static char * +prison_path(struct prison *pr1, struct prison *pr2) +{ + char *path1, *path2; + int len1; + + path1 = pr1->pr_path; + path2 = pr2->pr_path; + if (!strcmp(path1, "/")) + return (path2); + len1 = strlen(path1); + if (strncmp(path1, path2, len1)) + return (path2); + if (path2[len1] == '\0') + return "/"; + if (path2[len1] == '/') + return (path2 + len1); + return (path2); +} + + +/* + * Jail-related sysctls. 
+ */ +SYSCTL_NODE(_security, OID_AUTO, jail, CTLFLAG_RW, 0, + "Jails"); + static int sysctl_jail_list(SYSCTL_HANDLER_ARGS) { struct xprison *xp; - struct prison *pr; + struct prison *pr, *cpr; #ifdef INET struct in_addr *ip4 = NULL; int ip4s = 0; #endif #ifdef INET6 struct in_addr *ip6 = NULL; int ip6s = 0; #endif - int error; + int descend, error; - if (jailed(req->td->td_ucred)) - return (0); - xp = malloc(sizeof(*xp), M_TEMP, M_WAITOK); + pr = req->td->td_ucred->cr_prison; error = 0; sx_slock(&allprison_lock); - TAILQ_FOREACH(pr, &allprison, pr_list) { + FOREACH_PRISON_DESCENDANT(pr, cpr, descend) { +#if defined(INET) || defined(INET6) again: - mtx_lock(&pr->pr_mtx); +#endif + mtx_lock(&cpr->pr_mtx); #ifdef INET - if (pr->pr_ip4s > 0) { - if (ip4s < pr->pr_ip4s) { - ip4s = pr->pr_ip4s; - mtx_unlock(&pr->pr_mtx); + if (cpr->pr_ip4s > 0) { + if (ip4s < cpr->pr_ip4s) { + ip4s = cpr->pr_ip4s; + mtx_unlock(&cpr->pr_mtx); ip4 = realloc(ip4, ip4s * sizeof(struct in_addr), M_TEMP, M_WAITOK); goto again; } - bcopy(pr->pr_ip4, ip4, - pr->pr_ip4s * sizeof(struct in_addr)); + bcopy(cpr->pr_ip4, ip4, + cpr->pr_ip4s * sizeof(struct in_addr)); } #endif #ifdef INET6 - if (pr->pr_ip6s > 0) { - if (ip6s < pr->pr_ip6s) { - ip6s = pr->pr_ip6s; - mtx_unlock(&pr->pr_mtx); + if (cpr->pr_ip6s > 0) { + if (ip6s < cpr->pr_ip6s) { + ip6s = cpr->pr_ip6s; + mtx_unlock(&cpr->pr_mtx); ip6 = realloc(ip6, ip6s * sizeof(struct in6_addr), M_TEMP, M_WAITOK); goto again; } - bcopy(pr->pr_ip6, ip6, - pr->pr_ip6s * sizeof(struct in6_addr)); + bcopy(cpr->pr_ip6, ip6, + cpr->pr_ip6s * sizeof(struct in6_addr)); } #endif - if (pr->pr_ref == 0) { - mtx_unlock(&pr->pr_mtx); + if (cpr->pr_ref == 0) { + mtx_unlock(&cpr->pr_mtx); continue; } bzero(xp, sizeof(*xp)); xp->pr_version = XPRISON_VERSION; - xp->pr_id = pr->pr_id; - xp->pr_state = pr->pr_uref > 0 + xp->pr_id = cpr->pr_id; + xp->pr_state = cpr->pr_uref > 0 ? 
PRISON_STATE_ALIVE : PRISON_STATE_DYING; - strlcpy(xp->pr_path, pr->pr_path, sizeof(xp->pr_path)); - strlcpy(xp->pr_host, pr->pr_host, sizeof(xp->pr_host)); - strlcpy(xp->pr_name, pr->pr_name, sizeof(xp->pr_name)); + strlcpy(xp->pr_path, prison_path(pr, cpr), sizeof(xp->pr_path)); + strlcpy(xp->pr_host, cpr->pr_host, sizeof(xp->pr_host)); + strlcpy(xp->pr_name, prison_name(pr, cpr), sizeof(xp->pr_name)); #ifdef INET - xp->pr_ip4s = pr->pr_ip4s; + xp->pr_ip4s = cpr->pr_ip4s; #endif #ifdef INET6 - xp->pr_ip6s = pr->pr_ip6s; + xp->pr_ip6s = cpr->pr_ip6s; #endif - mtx_unlock(&pr->pr_mtx); + mtx_unlock(&cpr->pr_mtx); error = SYSCTL_OUT(req, xp, sizeof(*xp)); if (error) break; #ifdef INET if (xp->pr_ip4s > 0) { error = SYSCTL_OUT(req, ip4, xp->pr_ip4s * sizeof(struct in_addr)); if (error) break; } #endif #ifdef INET6 if (xp->pr_ip6s > 0) { error = SYSCTL_OUT(req, ip6, xp->pr_ip6s * sizeof(struct in6_addr)); if (error) break; } #endif } sx_sunlock(&allprison_lock); free(xp, M_TEMP); #ifdef INET free(ip4, M_TEMP); #endif #ifdef INET6 free(ip6, M_TEMP); #endif return (error); } SYSCTL_OID(_security_jail, OID_AUTO, list, CTLTYPE_STRUCT | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0, sysctl_jail_list, "S", "List of active jails"); static int sysctl_jail_jailed(SYSCTL_HANDLER_ARGS) { int error, injail; injail = jailed(req->td->td_ucred); error = SYSCTL_OUT(req, &injail, sizeof(injail)); return (error); } + SYSCTL_PROC(_security_jail, OID_AUTO, jailed, CTLTYPE_INT | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0, sysctl_jail_jailed, "I", "Process in jail?"); +#if defined(INET) || defined(INET6) +SYSCTL_INT(_security_jail, OID_AUTO, jail_max_af_ips, CTLFLAG_RW, + &jail_max_af_ips, 0, + "Number of IP addresses a jail may have at most per address family"); +#endif + +/* + * Default parameters for jail(2) compatability. For historical reasons, + * the sysctl names have varying similarity to the parameter names. Prisons + * just see their own parameters, and can't change them. + */ +static int +sysctl_jail_default_allow(SYSCTL_HANDLER_ARGS) +{ + struct prison *pr; + int allow, error, i; + + pr = req->td->td_ucred->cr_prison; + allow = (pr == &prison0) ? jail_default_allow : pr->pr_allow; + + /* Get the current flag value, and convert it to a boolean. */ + i = (allow & arg2) ? 1 : 0; + if (arg1 != NULL) + i = !i; + error = sysctl_handle_int(oidp, &i, 0, req); + if (error || !req->newptr) + return (error); + i = i ? arg2 : 0; + if (arg1 != NULL) + i ^= arg2; + /* + * The sysctls don't have CTLFLAGS_PRISON, so assume prison0 + * for writing. 
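+ *
+ * Sketch of the intended use (illustrative only): from outside any jail,
+ * writing one of these nodes changes the default pr_allow mask applied to
+ * jails created without explicit allow.* parameters, e.g.
+ *
+ *	int one = 1;
+ *	sysctlbyname("security.jail.mount_allowed", NULL, NULL,
+ *	    &one, sizeof(one));
+ *
+ * while a jailed reader just sees its own pr_allow bits (the prison0 test
+ * above).  A non-NULL arg1 marks nodes with inverted sense, such as
+ * socket_unixiproute_only versus PR_ALLOW_SOCKET_AF.
+ *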
+ */ + mtx_lock(&prison0.pr_mtx); + jail_default_allow = (jail_default_allow & ~arg2) | i; + mtx_unlock(&prison0.pr_mtx); + return (0); +} + +SYSCTL_PROC(_security_jail, OID_AUTO, set_hostname_allowed, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, + NULL, PR_ALLOW_SET_HOSTNAME, sysctl_jail_default_allow, "I", + "Processes in jail can set their hostnames"); +SYSCTL_PROC(_security_jail, OID_AUTO, socket_unixiproute_only, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, + (void *)1, PR_ALLOW_SOCKET_AF, sysctl_jail_default_allow, "I", + "Processes in jail are limited to creating UNIX/IP/route sockets only"); +SYSCTL_PROC(_security_jail, OID_AUTO, sysvipc_allowed, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, + NULL, PR_ALLOW_SYSVIPC, sysctl_jail_default_allow, "I", + "Processes in jail can use System V IPC primitives"); +SYSCTL_PROC(_security_jail, OID_AUTO, allow_raw_sockets, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, + NULL, PR_ALLOW_RAW_SOCKETS, sysctl_jail_default_allow, "I", + "Prison root can create raw sockets"); +SYSCTL_PROC(_security_jail, OID_AUTO, chflags_allowed, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, + NULL, PR_ALLOW_CHFLAGS, sysctl_jail_default_allow, "I", + "Processes in jail can alter system file flags"); +SYSCTL_PROC(_security_jail, OID_AUTO, mount_allowed, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, + NULL, PR_ALLOW_MOUNT, sysctl_jail_default_allow, "I", + "Processes in jail can mount/unmount jail-friendly file systems"); + +static int +sysctl_jail_default_level(SYSCTL_HANDLER_ARGS) +{ + struct prison *pr; + int level, error; + + pr = req->td->td_ucred->cr_prison; + level = (pr == &prison0) ? *(int *)arg1 : *(int *)((char *)pr + arg2); + error = sysctl_handle_int(oidp, &level, 0, req); + if (error || !req->newptr) + return (error); + *(int *)arg1 = level; + return (0); +} + +SYSCTL_PROC(_security_jail, OID_AUTO, enforce_statfs, + CTLTYPE_INT | CTLFLAG_RW | CTLFLAG_MPSAFE, + &jail_default_enforce_statfs, offsetof(struct prison, pr_enforce_statfs), + sysctl_jail_default_level, "I", + "Processes in jail cannot see all mounted file systems"); + +/* + * Nodes to describe jail parameters. Maximum length of string parameters + * is returned in the string itself, and the other parameters exist merely + * to make themselves and their types known. 
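+ *
+ * For example (sketch, not part of this change), a management tool can
+ * learn a string parameter's maximum length by reading its node; the
+ * CTLTYPE_STRING case below reports it as decimal text:
+ *
+ *	char buf[16];
+ *	size_t len = sizeof(buf);
+ *	if (sysctlbyname("security.jail.param.name", buf, &len, NULL, 0) == 0)
+ *		long maxlen = strtol(buf, NULL, 10);
+ *
+ * where the value for this particular node is MAXHOSTNAMELEN.
+ *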
+ */ +SYSCTL_NODE(_security_jail, OID_AUTO, param, CTLFLAG_RW, 0, + "Jail parameters"); + +int +sysctl_jail_param(SYSCTL_HANDLER_ARGS) +{ + int i; + long l; + size_t s; + char numbuf[12]; + + switch (oidp->oid_kind & CTLTYPE) + { + case CTLTYPE_LONG: + case CTLTYPE_ULONG: + l = 0; +#ifdef SCTL_MASK32 + if (!(req->flags & SCTL_MASK32)) +#endif + return (SYSCTL_OUT(req, &l, sizeof(l))); + case CTLTYPE_INT: + case CTLTYPE_UINT: + i = 0; + return (SYSCTL_OUT(req, &i, sizeof(i))); + case CTLTYPE_STRING: + snprintf(numbuf, sizeof(numbuf), "%d", arg2); + return + (sysctl_handle_string(oidp, numbuf, sizeof(numbuf), req)); + case CTLTYPE_STRUCT: + s = (size_t)arg2; + return (SYSCTL_OUT(req, &s, sizeof(s))); + } + return (0); +} + +SYSCTL_JAIL_PARAM(, jid, CTLTYPE_INT | CTLFLAG_RDTUN, "I", "Jail ID"); +SYSCTL_JAIL_PARAM(, parent, CTLTYPE_INT | CTLFLAG_RD, "I", "Jail parent ID"); +SYSCTL_JAIL_PARAM_STRING(, name, CTLFLAG_RW, MAXHOSTNAMELEN, "Jail name"); +SYSCTL_JAIL_PARAM_STRING(, path, CTLFLAG_RDTUN, MAXPATHLEN, "Jail root path"); +SYSCTL_JAIL_PARAM(, securelevel, CTLTYPE_INT | CTLFLAG_RW, + "I", "Jail secure level"); +SYSCTL_JAIL_PARAM(, enforce_statfs, CTLTYPE_INT | CTLFLAG_RW, + "I", "Jail cannot see all mounted file systems"); +SYSCTL_JAIL_PARAM(, persist, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail persistence"); +SYSCTL_JAIL_PARAM(, dying, CTLTYPE_INT | CTLFLAG_RD, + "B", "Jail is in the process of shutting down"); + +SYSCTL_JAIL_PARAM_NODE(host, "Jail host info"); +SYSCTL_JAIL_PARAM_STRING(_host, hostname, CTLFLAG_RW, MAXHOSTNAMELEN, + "Jail hostname"); + +SYSCTL_JAIL_PARAM_NODE(cpuset, "Jail cpuset"); +SYSCTL_JAIL_PARAM(_cpuset, id, CTLTYPE_INT | CTLFLAG_RD, "I", "Jail cpuset ID"); + +#ifdef INET +SYSCTL_JAIL_PARAM_NODE(ip4, "Jail IPv4 address virtualization"); +SYSCTL_JAIL_PARAM(, noip4, CTLTYPE_INT | CTLFLAG_RW, + "BN", "Jail w/ no IP address virtualization"); +SYSCTL_JAIL_PARAM_STRUCT(_ip4, addr, CTLFLAG_RW, sizeof(struct in_addr), + "S,in_addr,a", "Jail IPv4 addresses"); +#endif +#ifdef INET6 +SYSCTL_JAIL_PARAM_NODE(ip6, "Jail IPv6 address virtualization"); +SYSCTL_JAIL_PARAM(, noip6, CTLTYPE_INT | CTLFLAG_RW, + "BN", "Jail w/ no IP address virtualization"); +SYSCTL_JAIL_PARAM_STRUCT(_ip6, addr, CTLFLAG_RW, sizeof(struct in6_addr), + "S,in6_addr,a", "Jail IPv6 addresses"); +#endif + +SYSCTL_JAIL_PARAM_NODE(allow, "Jail permission flags"); +SYSCTL_JAIL_PARAM(_allow, set_hostname, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail may set hostname"); +SYSCTL_JAIL_PARAM(_allow, sysvipc, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail may use SYSV IPC"); +SYSCTL_JAIL_PARAM(_allow, raw_sockets, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail may create raw sockets"); +SYSCTL_JAIL_PARAM(_allow, chflags, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail may alter system file flags"); +SYSCTL_JAIL_PARAM(_allow, mount, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail may mount/unmount jail-friendly file systems"); +SYSCTL_JAIL_PARAM(_allow, quotas, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail may set file quotas"); +SYSCTL_JAIL_PARAM(_allow, jails, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail may create child jails"); +SYSCTL_JAIL_PARAM(_allow, socket_af, CTLTYPE_INT | CTLFLAG_RW, + "B", "Jail may create sockets other than just UNIX/IPv4/IPv6/route"); + + #ifdef DDB static void db_show_prison(struct prison *pr) { + int fi; #if defined(INET) || defined(INET6) int ii; #endif #ifdef INET6 char ip6buf[INET6_ADDRSTRLEN]; #endif db_printf("prison %p:\n", pr); db_printf(" jid = %d\n", pr->pr_id); db_printf(" name = %s\n", pr->pr_name); + db_printf(" parent = %p\n", 
pr->pr_parent); db_printf(" ref = %d\n", pr->pr_ref); db_printf(" uref = %d\n", pr->pr_uref); db_printf(" path = %s\n", pr->pr_path); db_printf(" cpuset = %d\n", pr->pr_cpuset ? pr->pr_cpuset->cs_id : -1); db_printf(" root = %p\n", pr->pr_root); db_printf(" securelevel = %d\n", pr->pr_securelevel); + db_printf(" child = %p\n", LIST_FIRST(&pr->pr_children)); + db_printf(" sibling = %p\n", LIST_NEXT(pr, pr_sibling)); db_printf(" flags = %x", pr->pr_flags); - if (pr->pr_flags & PR_PERSIST) - db_printf(" persist"); + for (fi = 0; fi < sizeof(pr_flag_names) / sizeof(pr_flag_names[0]); + fi++) + if (pr_flag_names[fi] != NULL && (pr->pr_flags & (1 << fi))) + db_printf(" %s", pr_flag_names[fi]); + db_printf(" allow = %x", pr->pr_allow); + for (fi = 0; fi < sizeof(pr_allow_names) / sizeof(pr_allow_names[0]); + fi++) + if (pr_allow_names[fi] != NULL && (pr->pr_allow & (1 << fi))) + db_printf(" %s", pr_allow_names[fi]); db_printf("\n"); + db_printf(" enforce_statfs = %d\n", pr->pr_enforce_statfs); db_printf(" host.hostname = %s\n", pr->pr_host); #ifdef INET db_printf(" ip4s = %d\n", pr->pr_ip4s); for (ii = 0; ii < pr->pr_ip4s; ii++) db_printf(" %s %s\n", ii == 0 ? "ip4 =" : " ", inet_ntoa(pr->pr_ip4[ii])); #endif #ifdef INET6 db_printf(" ip6s = %d\n", pr->pr_ip6s); for (ii = 0; ii < pr->pr_ip6s; ii++) db_printf(" %s %s\n", ii == 0 ? "ip6 =" : " ", ip6_sprintf(ip6buf, &pr->pr_ip6[ii])); #endif } DB_SHOW_COMMAND(prison, db_show_prison_command) { struct prison *pr; if (!have_addr) { - /* Show all prisons in the list. */ - TAILQ_FOREACH(pr, &allprison, pr_list) { - db_show_prison(pr); - if (db_pager_quit) - break; + /* + * Show all prisons in the list, and prison0 which is not + * listed. + */ + db_show_prison(&prison0); + if (!db_pager_quit) { + TAILQ_FOREACH(pr, &allprison, pr_list) { + db_show_prison(pr); + if (db_pager_quit) + break; + } } return; } - /* Look for a prison with the ID and with references. */ - TAILQ_FOREACH(pr, &allprison, pr_list) - if (pr->pr_id == addr && pr->pr_ref > 0) - break; - if (pr == NULL) - /* Look again, without requiring a reference. */ + if (addr == 0) + pr = &prison0; + else { + /* Look for a prison with the ID and with references. */ TAILQ_FOREACH(pr, &allprison, pr_list) - if (pr->pr_id == addr) + if (pr->pr_id == addr && pr->pr_ref > 0) break; - if (pr == NULL) - /* Assume address points to a valid prison. */ - pr = (struct prison *)addr; + if (pr == NULL) + /* Look again, without requiring a reference. */ + TAILQ_FOREACH(pr, &allprison, pr_list) + if (pr->pr_id == addr) + break; + if (pr == NULL) + /* Assume address points to a valid prison. */ + pr = (struct prison *)addr; + } db_show_prison(pr); } #endif /* DDB */ Index: head/sys/kern/kern_linker.c =================================================================== --- head/sys/kern/kern_linker.c (revision 192894) +++ head/sys/kern/kern_linker.c (revision 192895) @@ -1,2182 +1,2183 @@ /*- * Copyright (c) 1997-2000 Doug Rabson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_hwpmc_hooks.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include #include #include #include #include #include "linker_if.h" #ifdef HWPMC_HOOKS #include #endif #ifdef KLD_DEBUG int kld_debug = 0; #endif #define KLD_LOCK() sx_xlock(&kld_sx) #define KLD_UNLOCK() sx_xunlock(&kld_sx) #define KLD_LOCKED() sx_xlocked(&kld_sx) #define KLD_LOCK_ASSERT() do { \ if (!cold) \ sx_assert(&kld_sx, SX_XLOCKED); \ } while (0) /* * static char *linker_search_path(const char *name, struct mod_depend * *verinfo); */ static const char *linker_basename(const char *path); /* * Find a currently loaded file given its filename. */ static linker_file_t linker_find_file_by_name(const char* _filename); /* * Find a currently loaded file given its file id. */ static linker_file_t linker_find_file_by_id(int _fileid); /* Metadata from the static kernel */ SET_DECLARE(modmetadata_set, struct mod_metadata); MALLOC_DEFINE(M_LINKER, "linker", "kernel linker"); linker_file_t linker_kernel_file; static struct sx kld_sx; /* kernel linker lock */ /* * Load counter used by clients to determine if a linker file has been * re-loaded. This counter is incremented for each file load. 
*/ static int loadcnt; static linker_class_list_t classes; static linker_file_list_t linker_files; static int next_file_id = 1; static int linker_no_more_classes = 0; #define LINKER_GET_NEXT_FILE_ID(a) do { \ linker_file_t lftmp; \ \ KLD_LOCK_ASSERT(); \ retry: \ TAILQ_FOREACH(lftmp, &linker_files, link) { \ if (next_file_id == lftmp->id) { \ next_file_id++; \ goto retry; \ } \ } \ (a) = next_file_id; \ } while(0) /* XXX wrong name; we're looking at version provision tags here, not modules */ typedef TAILQ_HEAD(, modlist) modlisthead_t; struct modlist { TAILQ_ENTRY(modlist) link; /* chain together all modules */ linker_file_t container; const char *name; int version; }; typedef struct modlist *modlist_t; static modlisthead_t found_modules; static int linker_file_add_dependency(linker_file_t file, linker_file_t dep); static caddr_t linker_file_lookup_symbol_internal(linker_file_t file, const char* name, int deps); static int linker_load_module(const char *kldname, const char *modname, struct linker_file *parent, struct mod_depend *verinfo, struct linker_file **lfpp); static modlist_t modlist_lookup2(const char *name, struct mod_depend *verinfo); static char * linker_strdup(const char *str) { char *result; if ((result = malloc((strlen(str) + 1), M_LINKER, M_WAITOK)) != NULL) strcpy(result, str); return (result); } static void linker_init(void *arg) { sx_init(&kld_sx, "kernel linker"); TAILQ_INIT(&classes); TAILQ_INIT(&linker_files); } SYSINIT(linker, SI_SUB_KLD, SI_ORDER_FIRST, linker_init, 0); static void linker_stop_class_add(void *arg) { linker_no_more_classes = 1; } SYSINIT(linker_class, SI_SUB_KLD, SI_ORDER_ANY, linker_stop_class_add, NULL); int linker_add_class(linker_class_t lc) { /* * We disallow any class registration past SI_ORDER_ANY * of SI_SUB_KLD. We bump the reference count to keep the * ops from being freed. */ if (linker_no_more_classes == 1) return (EPERM); kobj_class_compile((kobj_class_t) lc); ((kobj_class_t)lc)->refs++; /* XXX: kobj_mtx */ TAILQ_INSERT_TAIL(&classes, lc, link); return (0); } static void linker_file_sysinit(linker_file_t lf) { struct sysinit **start, **stop, **sipp, **xipp, *save; KLD_DPF(FILE, ("linker_file_sysinit: calling SYSINITs for %s\n", lf->filename)); if (linker_file_lookup_set(lf, "sysinit_set", &start, &stop, NULL) != 0) return; /* * Perform a bubble sort of the system initialization objects by * their subsystem (primary key) and order (secondary key). * * Since some things care about execution order, this is the operation * which ensures continued function. */ for (sipp = start; sipp < stop; sipp++) { for (xipp = sipp + 1; xipp < stop; xipp++) { if ((*sipp)->subsystem < (*xipp)->subsystem || ((*sipp)->subsystem == (*xipp)->subsystem && (*sipp)->order <= (*xipp)->order)) continue; /* skip */ save = *sipp; *sipp = *xipp; *xipp = save; } } /* * Traverse the (now) ordered list of system initialization tasks. * Perform each task, and continue on to the next task. 
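 *
 * (Illustrative sketch: the entries walked here are contributed by the
 * SYSINIT() macro in the loaded file, with hypothetical names, e.g.
 *
 *	static void foo_init(void *arg);
 *	SYSINIT(foo_init_id, SI_SUB_DRIVERS, SI_ORDER_MIDDLE, foo_init, NULL);
 *
 * and each is run once, in the order established above.)
 *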
*/ mtx_lock(&Giant); for (sipp = start; sipp < stop; sipp++) { if ((*sipp)->subsystem == SI_SUB_DUMMY) continue; /* skip dummy task(s) */ /* Call function */ (*((*sipp)->func)) ((*sipp)->udata); } mtx_unlock(&Giant); } static void linker_file_sysuninit(linker_file_t lf) { struct sysinit **start, **stop, **sipp, **xipp, *save; KLD_DPF(FILE, ("linker_file_sysuninit: calling SYSUNINITs for %s\n", lf->filename)); if (linker_file_lookup_set(lf, "sysuninit_set", &start, &stop, NULL) != 0) return; /* * Perform a reverse bubble sort of the system initialization objects * by their subsystem (primary key) and order (secondary key). * * Since some things care about execution order, this is the operation * which ensures continued function. */ for (sipp = start; sipp < stop; sipp++) { for (xipp = sipp + 1; xipp < stop; xipp++) { if ((*sipp)->subsystem > (*xipp)->subsystem || ((*sipp)->subsystem == (*xipp)->subsystem && (*sipp)->order >= (*xipp)->order)) continue; /* skip */ save = *sipp; *sipp = *xipp; *xipp = save; } } /* * Traverse the (now) ordered list of system initialization tasks. * Perform each task, and continue on to the next task. */ mtx_lock(&Giant); for (sipp = start; sipp < stop; sipp++) { if ((*sipp)->subsystem == SI_SUB_DUMMY) continue; /* skip dummy task(s) */ /* Call function */ (*((*sipp)->func)) ((*sipp)->udata); } mtx_unlock(&Giant); } static void linker_file_register_sysctls(linker_file_t lf) { struct sysctl_oid **start, **stop, **oidp; KLD_DPF(FILE, ("linker_file_register_sysctls: registering SYSCTLs for %s\n", lf->filename)); if (linker_file_lookup_set(lf, "sysctl_set", &start, &stop, NULL) != 0) return; sysctl_lock(); for (oidp = start; oidp < stop; oidp++) sysctl_register_oid(*oidp); sysctl_unlock(); } static void linker_file_unregister_sysctls(linker_file_t lf) { struct sysctl_oid **start, **stop, **oidp; KLD_DPF(FILE, ("linker_file_unregister_sysctls: registering SYSCTLs" " for %s\n", lf->filename)); if (linker_file_lookup_set(lf, "sysctl_set", &start, &stop, NULL) != 0) return; sysctl_lock(); for (oidp = start; oidp < stop; oidp++) sysctl_unregister_oid(*oidp); sysctl_unlock(); } static int linker_file_register_modules(linker_file_t lf) { struct mod_metadata **start, **stop, **mdp; const moduledata_t *moddata; int first_error, error; KLD_DPF(FILE, ("linker_file_register_modules: registering modules" " in %s\n", lf->filename)); if (linker_file_lookup_set(lf, "modmetadata_set", &start, &stop, NULL) != 0) { /* * This fallback should be unnecessary, but if we get booted * from boot2 instead of loader and we are missing our * metadata then we have to try the best we can. 
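 *
 * (Aside, illustrative only: the MDT_MODULE records registered here are
 * produced by DECLARE_MODULE() in the module's source, roughly
 *
 *	static moduledata_t foo_mod = { "foo", foo_modevent, NULL };
 *	DECLARE_MODULE(foo, foo_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);
 *
 * where "foo" and foo_modevent are hypothetical names.)
 *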
*/ if (lf == linker_kernel_file) { start = SET_BEGIN(modmetadata_set); stop = SET_LIMIT(modmetadata_set); } else return (0); } first_error = 0; for (mdp = start; mdp < stop; mdp++) { if ((*mdp)->md_type != MDT_MODULE) continue; moddata = (*mdp)->md_data; KLD_DPF(FILE, ("Registering module %s in %s\n", moddata->name, lf->filename)); error = module_register(moddata, lf); if (error) { printf("Module %s failed to register: %d\n", moddata->name, error); if (first_error == 0) first_error = error; } } return (first_error); } static void linker_init_kernel_modules(void) { linker_file_register_modules(linker_kernel_file); } SYSINIT(linker_kernel, SI_SUB_KLD, SI_ORDER_ANY, linker_init_kernel_modules, 0); static int linker_load_file(const char *filename, linker_file_t *result) { linker_class_t lc; linker_file_t lf; int foundfile, error; /* Refuse to load modules if securelevel raised */ - if (securelevel > 0) + if (prison0.pr_securelevel > 0) return (EPERM); KLD_LOCK_ASSERT(); lf = linker_find_file_by_name(filename); if (lf) { KLD_DPF(FILE, ("linker_load_file: file %s is already loaded," " incrementing refs\n", filename)); *result = lf; lf->refs++; return (0); } foundfile = 0; error = 0; /* * We do not need to protect (lock) classes here because there is * no class registration past startup (SI_SUB_KLD, SI_ORDER_ANY) * and there is no class deregistration mechanism at this time. */ TAILQ_FOREACH(lc, &classes, link) { KLD_DPF(FILE, ("linker_load_file: trying to load %s\n", filename)); error = LINKER_LOAD_FILE(lc, filename, &lf); /* * If we got something other than ENOENT, then it exists but * we cannot load it for some other reason. */ if (error != ENOENT) foundfile = 1; if (lf) { error = linker_file_register_modules(lf); if (error == EEXIST) { linker_file_unload(lf, LINKER_UNLOAD_FORCE); return (error); } KLD_UNLOCK(); linker_file_register_sysctls(lf); linker_file_sysinit(lf); KLD_LOCK(); lf->flags |= LINKER_FILE_LINKED; *result = lf; return (0); } } /* * Less than ideal, but tells the user whether it failed to load or * the module was not found. */ if (foundfile) { /* * If the file type has not been recognized by the last try * printout a message before to fail. */ if (error == ENOSYS) printf("linker_load_file: Unsupported file type\n"); /* * Format not recognized or otherwise unloadable. * When loading a module that is statically built into * the kernel EEXIST percolates back up as the return * value. Preserve this so that apps like sysinstall * can recognize this special case and not post bogus * dialog boxes. 
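 *
 * (Userland sketch of that special case, assuming the standard kldload(2)
 * interface and a hypothetical module name:
 *
 *	if (kldload("foo") == -1 && errno != EEXIST)
 *		err(1, "cannot load foo");
 *
 * i.e. EEXIST can be treated as "already available" rather than failure.)
 *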
*/ if (error != EEXIST) error = ENOEXEC; } else error = ENOENT; /* Nothing found */ return (error); } int linker_reference_module(const char *modname, struct mod_depend *verinfo, linker_file_t *result) { modlist_t mod; int error; KLD_LOCK(); if ((mod = modlist_lookup2(modname, verinfo)) != NULL) { *result = mod->container; (*result)->refs++; KLD_UNLOCK(); return (0); } error = linker_load_module(NULL, modname, NULL, verinfo, result); KLD_UNLOCK(); return (error); } int linker_release_module(const char *modname, struct mod_depend *verinfo, linker_file_t lf) { modlist_t mod; int error; KLD_LOCK(); if (lf == NULL) { KASSERT(modname != NULL, ("linker_release_module: no file or name")); mod = modlist_lookup2(modname, verinfo); if (mod == NULL) { KLD_UNLOCK(); return (ESRCH); } lf = mod->container; } else KASSERT(modname == NULL && verinfo == NULL, ("linker_release_module: both file and name")); error = linker_file_unload(lf, LINKER_UNLOAD_NORMAL); KLD_UNLOCK(); return (error); } static linker_file_t linker_find_file_by_name(const char *filename) { linker_file_t lf; char *koname; koname = malloc(strlen(filename) + 4, M_LINKER, M_WAITOK); sprintf(koname, "%s.ko", filename); KLD_LOCK_ASSERT(); TAILQ_FOREACH(lf, &linker_files, link) { if (strcmp(lf->filename, koname) == 0) break; if (strcmp(lf->filename, filename) == 0) break; } free(koname, M_LINKER); return (lf); } static linker_file_t linker_find_file_by_id(int fileid) { linker_file_t lf; KLD_LOCK_ASSERT(); TAILQ_FOREACH(lf, &linker_files, link) if (lf->id == fileid && lf->flags & LINKER_FILE_LINKED) break; return (lf); } int linker_file_foreach(linker_predicate_t *predicate, void *context) { linker_file_t lf; int retval = 0; KLD_LOCK(); TAILQ_FOREACH(lf, &linker_files, link) { retval = predicate(lf, context); if (retval != 0) break; } KLD_UNLOCK(); return (retval); } linker_file_t linker_make_file(const char *pathname, linker_class_t lc) { linker_file_t lf; const char *filename; KLD_LOCK_ASSERT(); filename = linker_basename(pathname); KLD_DPF(FILE, ("linker_make_file: new file, filename='%s' for pathname='%s'\n", filename, pathname)); lf = (linker_file_t)kobj_create((kobj_class_t)lc, M_LINKER, M_WAITOK); if (lf == NULL) return (NULL); lf->refs = 1; lf->userrefs = 0; lf->flags = 0; lf->filename = linker_strdup(filename); lf->pathname = linker_strdup(pathname); LINKER_GET_NEXT_FILE_ID(lf->id); lf->ndeps = 0; lf->deps = NULL; lf->loadcnt = ++loadcnt; lf->sdt_probes = NULL; lf->sdt_nprobes = 0; STAILQ_INIT(&lf->common); TAILQ_INIT(&lf->modules); TAILQ_INSERT_TAIL(&linker_files, lf, link); return (lf); } int linker_file_unload(linker_file_t file, int flags) { module_t mod, next; modlist_t ml, nextml; struct common_symbol *cp; int error, i; /* Refuse to unload modules if securelevel raised. */ - if (securelevel > 0) + if (prison0.pr_securelevel > 0) return (EPERM); KLD_LOCK_ASSERT(); KLD_DPF(FILE, ("linker_file_unload: lf->refs=%d\n", file->refs)); /* Easy case of just dropping a reference. */ if (file->refs > 1) { file->refs--; return (0); } KLD_DPF(FILE, ("linker_file_unload: file is unloading," " informing modules\n")); /* * Quiesce all the modules to give them a chance to veto the unload. 
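 *
 * (Sketch: a module vetoes the unload by failing its MOD_QUIESCE event,
 * e.g. in a hypothetical modevent handler
 *
 *	case MOD_QUIESCE:
 *		return (foo_busy ? EBUSY : 0);
 *
 * which is honoured below unless LINKER_UNLOAD_FORCE was requested.)
 *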
*/ MOD_SLOCK; for (mod = TAILQ_FIRST(&file->modules); mod; mod = module_getfnext(mod)) { error = module_quiesce(mod); if (error != 0 && flags != LINKER_UNLOAD_FORCE) { KLD_DPF(FILE, ("linker_file_unload: module %s" " vetoed unload\n", module_getname(mod))); /* * XXX: Do we need to tell all the quiesced modules * that they can resume work now via a new module * event? */ MOD_SUNLOCK; return (error); } } MOD_SUNLOCK; /* * Inform any modules associated with this file that they are * being be unloaded. */ MOD_XLOCK; for (mod = TAILQ_FIRST(&file->modules); mod; mod = next) { next = module_getfnext(mod); MOD_XUNLOCK; /* * Give the module a chance to veto the unload. */ if ((error = module_unload(mod)) != 0) { KLD_DPF(FILE, ("linker_file_unload: module %s" " failed unload\n", mod)); return (error); } MOD_XLOCK; module_release(mod); } MOD_XUNLOCK; TAILQ_FOREACH_SAFE(ml, &found_modules, link, nextml) { if (ml->container == file) { TAILQ_REMOVE(&found_modules, ml, link); free(ml, M_LINKER); } } /* * Don't try to run SYSUNINITs if we are unloaded due to a * link error. */ if (file->flags & LINKER_FILE_LINKED) { file->flags &= ~LINKER_FILE_LINKED; KLD_UNLOCK(); linker_file_sysuninit(file); linker_file_unregister_sysctls(file); KLD_LOCK(); } TAILQ_REMOVE(&linker_files, file, link); if (file->deps) { for (i = 0; i < file->ndeps; i++) linker_file_unload(file->deps[i], flags); free(file->deps, M_LINKER); file->deps = NULL; } while ((cp = STAILQ_FIRST(&file->common)) != NULL) { STAILQ_REMOVE_HEAD(&file->common, link); free(cp, M_LINKER); } LINKER_UNLOAD(file); if (file->filename) { free(file->filename, M_LINKER); file->filename = NULL; } if (file->pathname) { free(file->pathname, M_LINKER); file->pathname = NULL; } kobj_delete((kobj_t) file, M_LINKER); return (0); } int linker_ctf_get(linker_file_t file, linker_ctf_t *lc) { return (LINKER_CTF_GET(file, lc)); } static int linker_file_add_dependency(linker_file_t file, linker_file_t dep) { linker_file_t *newdeps; KLD_LOCK_ASSERT(); newdeps = malloc((file->ndeps + 1) * sizeof(linker_file_t *), M_LINKER, M_WAITOK | M_ZERO); if (newdeps == NULL) return (ENOMEM); if (file->deps) { bcopy(file->deps, newdeps, file->ndeps * sizeof(linker_file_t *)); free(file->deps, M_LINKER); } file->deps = newdeps; file->deps[file->ndeps] = dep; file->ndeps++; return (0); } /* * Locate a linker set and its contents. This is a helper function to avoid * linker_if.h exposure elsewhere. Note: firstp and lastp are really void **. * This function is used in this file so we can avoid having lots of (void **) * casts. */ int linker_file_lookup_set(linker_file_t file, const char *name, void *firstp, void *lastp, int *countp) { int error, locked; locked = KLD_LOCKED(); if (!locked) KLD_LOCK(); error = LINKER_LOOKUP_SET(file, name, firstp, lastp, countp); if (!locked) KLD_UNLOCK(); return (error); } /* * List all functions in a file. 
*/ int linker_file_function_listall(linker_file_t lf, linker_function_nameval_callback_t callback_func, void *arg) { return (LINKER_EACH_FUNCTION_NAMEVAL(lf, callback_func, arg)); } caddr_t linker_file_lookup_symbol(linker_file_t file, const char *name, int deps) { caddr_t sym; int locked; locked = KLD_LOCKED(); if (!locked) KLD_LOCK(); sym = linker_file_lookup_symbol_internal(file, name, deps); if (!locked) KLD_UNLOCK(); return (sym); } static caddr_t linker_file_lookup_symbol_internal(linker_file_t file, const char *name, int deps) { c_linker_sym_t sym; linker_symval_t symval; caddr_t address; size_t common_size = 0; int i; KLD_LOCK_ASSERT(); KLD_DPF(SYM, ("linker_file_lookup_symbol: file=%p, name=%s, deps=%d\n", file, name, deps)); if (LINKER_LOOKUP_SYMBOL(file, name, &sym) == 0) { LINKER_SYMBOL_VALUES(file, sym, &symval); if (symval.value == 0) /* * For commons, first look them up in the * dependencies and only allocate space if not found * there. */ common_size = symval.size; else { KLD_DPF(SYM, ("linker_file_lookup_symbol: symbol" ".value=%p\n", symval.value)); return (symval.value); } } if (deps) { for (i = 0; i < file->ndeps; i++) { address = linker_file_lookup_symbol_internal( file->deps[i], name, 0); if (address) { KLD_DPF(SYM, ("linker_file_lookup_symbol:" " deps value=%p\n", address)); return (address); } } } if (common_size > 0) { /* * This is a common symbol which was not found in the * dependencies. We maintain a simple common symbol table in * the file object. */ struct common_symbol *cp; STAILQ_FOREACH(cp, &file->common, link) { if (strcmp(cp->name, name) == 0) { KLD_DPF(SYM, ("linker_file_lookup_symbol:" " old common value=%p\n", cp->address)); return (cp->address); } } /* * Round the symbol size up to align. */ common_size = (common_size + sizeof(int) - 1) & -sizeof(int); cp = malloc(sizeof(struct common_symbol) + common_size + strlen(name) + 1, M_LINKER, M_WAITOK | M_ZERO); cp->address = (caddr_t)(cp + 1); cp->name = cp->address + common_size; strcpy(cp->name, name); bzero(cp->address, common_size); STAILQ_INSERT_TAIL(&file->common, cp, link); KLD_DPF(SYM, ("linker_file_lookup_symbol: new common" " value=%p\n", cp->address)); return (cp->address); } KLD_DPF(SYM, ("linker_file_lookup_symbol: fail\n")); return (0); } /* * Both DDB and stack(9) rely on the kernel linker to provide forward and * backward lookup of symbols. However, DDB and sometimes stack(9) need to * do this in a lockfree manner. We provide a set of internal helper * routines to perform these operations without locks, and then wrappers that * optionally lock. * * linker_debug_lookup() is ifdef DDB as currently it's only used by DDB. 
*/ #ifdef DDB static int linker_debug_lookup(const char *symstr, c_linker_sym_t *sym) { linker_file_t lf; TAILQ_FOREACH(lf, &linker_files, link) { if (LINKER_LOOKUP_SYMBOL(lf, symstr, sym) == 0) return (0); } return (ENOENT); } #endif static int linker_debug_search_symbol(caddr_t value, c_linker_sym_t *sym, long *diffp) { linker_file_t lf; c_linker_sym_t best, es; u_long diff, bestdiff, off; best = 0; off = (uintptr_t)value; bestdiff = off; TAILQ_FOREACH(lf, &linker_files, link) { if (LINKER_SEARCH_SYMBOL(lf, value, &es, &diff) != 0) continue; if (es != 0 && diff < bestdiff) { best = es; bestdiff = diff; } if (bestdiff == 0) break; } if (best) { *sym = best; *diffp = bestdiff; return (0); } else { *sym = 0; *diffp = off; return (ENOENT); } } static int linker_debug_symbol_values(c_linker_sym_t sym, linker_symval_t *symval) { linker_file_t lf; TAILQ_FOREACH(lf, &linker_files, link) { if (LINKER_SYMBOL_VALUES(lf, sym, symval) == 0) return (0); } return (ENOENT); } static int linker_debug_search_symbol_name(caddr_t value, char *buf, u_int buflen, long *offset) { linker_symval_t symval; c_linker_sym_t sym; int error; *offset = 0; error = linker_debug_search_symbol(value, &sym, offset); if (error) return (error); error = linker_debug_symbol_values(sym, &symval); if (error) return (error); strlcpy(buf, symval.name, buflen); return (0); } #ifdef DDB /* * DDB Helpers. DDB has to look across multiple files with their own symbol * tables and string tables. * * Note that we do not obey list locking protocols here. We really don't need * DDB to hang because somebody's got the lock held. We'll take the chance * that the files list is inconsistant instead. */ int linker_ddb_lookup(const char *symstr, c_linker_sym_t *sym) { return (linker_debug_lookup(symstr, sym)); } int linker_ddb_search_symbol(caddr_t value, c_linker_sym_t *sym, long *diffp) { return (linker_debug_search_symbol(value, sym, diffp)); } int linker_ddb_symbol_values(c_linker_sym_t sym, linker_symval_t *symval) { return (linker_debug_symbol_values(sym, symval)); } int linker_ddb_search_symbol_name(caddr_t value, char *buf, u_int buflen, long *offset) { return (linker_debug_search_symbol_name(value, buf, buflen, offset)); } #endif /* * stack(9) helper for non-debugging environemnts. Unlike DDB helpers, we do * obey locking protocols, and offer a significantly less complex interface. */ int linker_search_symbol_name(caddr_t value, char *buf, u_int buflen, long *offset) { int error; KLD_LOCK(); error = linker_debug_search_symbol_name(value, buf, buflen, offset); KLD_UNLOCK(); return (error); } /* * Syscalls. */ int kern_kldload(struct thread *td, const char *file, int *fileid) { #ifdef HWPMC_HOOKS struct pmckern_map_in pkm; #endif const char *kldname, *modname; linker_file_t lf; int error; if ((error = securelevel_gt(td->td_ucred, 0)) != 0) return (error); if ((error = priv_check(td, PRIV_KLD_LOAD)) != 0) return (error); #ifdef VIMAGE /* Only the default vimage is permitted to kldload modules. */ if (!IS_DEFAULT_VIMAGE(TD_TO_VIMAGE(td))) return (EPERM); #endif /* * It is possible that kldloaded module will attach a new ifnet, * so vnet context must be set when this ocurs. */ CURVNET_SET(TD_TO_VNET(td)); /* * If file does not contain a qualified name or any dot in it * (kldname.ko, or kldname.ver.ko) treat it as an interface * name. 
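 *
 * (For example: a bare name such as "foo" is treated as a module name and
 * resolved through linker.hints, while "foo.ko" or "/boot/modules/foo.ko"
 * is taken as a file name and loaded directly; "foo" is hypothetical.)
 *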
*/ if (index(file, '/') || index(file, '.')) { kldname = file; modname = NULL; } else { kldname = NULL; modname = file; } KLD_LOCK(); error = linker_load_module(kldname, modname, NULL, NULL, &lf); if (error) goto unlock; #ifdef HWPMC_HOOKS pkm.pm_file = lf->filename; pkm.pm_address = (uintptr_t) lf->address; PMC_CALL_HOOK(td, PMC_FN_KLD_LOAD, (void *) &pkm); #endif lf->userrefs++; if (fileid != NULL) *fileid = lf->id; unlock: KLD_UNLOCK(); CURVNET_RESTORE(); return (error); } int kldload(struct thread *td, struct kldload_args *uap) { char *pathname = NULL; int error, fileid; td->td_retval[0] = -1; pathname = malloc(MAXPATHLEN, M_TEMP, M_WAITOK); error = copyinstr(uap->file, pathname, MAXPATHLEN, NULL); if (error == 0) { error = kern_kldload(td, pathname, &fileid); if (error == 0) td->td_retval[0] = fileid; } free(pathname, M_TEMP); return (error); } int kern_kldunload(struct thread *td, int fileid, int flags) { #ifdef HWPMC_HOOKS struct pmckern_map_out pkm; #endif linker_file_t lf; int error = 0; if ((error = securelevel_gt(td->td_ucred, 0)) != 0) return (error); if ((error = priv_check(td, PRIV_KLD_UNLOAD)) != 0) return (error); #ifdef VIMAGE /* Only the default vimage is permitted to kldunload modules. */ if (!IS_DEFAULT_VIMAGE(TD_TO_VIMAGE(td))) return (EPERM); #endif CURVNET_SET(TD_TO_VNET(td)); KLD_LOCK(); lf = linker_find_file_by_id(fileid); if (lf) { KLD_DPF(FILE, ("kldunload: lf->userrefs=%d\n", lf->userrefs)); /* Check if there are DTrace probes enabled on this file. */ if (lf->nenabled > 0) { printf("kldunload: attempt to unload file that has" " DTrace probes enabled\n"); error = EBUSY; } else if (lf->userrefs == 0) { /* * XXX: maybe LINKER_UNLOAD_FORCE should override ? */ printf("kldunload: attempt to unload file that was" " loaded by the kernel\n"); error = EBUSY; } else { #ifdef HWPMC_HOOKS /* Save data needed by hwpmc(4) before unloading. */ pkm.pm_address = (uintptr_t) lf->address; pkm.pm_size = lf->size; #endif lf->userrefs--; error = linker_file_unload(lf, flags); if (error) lf->userrefs++; } } else error = ENOENT; #ifdef HWPMC_HOOKS if (error == 0) PMC_CALL_HOOK(td, PMC_FN_KLD_UNLOAD, (void *) &pkm); #endif KLD_UNLOCK(); CURVNET_RESTORE(); return (error); } int kldunload(struct thread *td, struct kldunload_args *uap) { return (kern_kldunload(td, uap->fileid, LINKER_UNLOAD_NORMAL)); } int kldunloadf(struct thread *td, struct kldunloadf_args *uap) { if (uap->flags != LINKER_UNLOAD_NORMAL && uap->flags != LINKER_UNLOAD_FORCE) return (EINVAL); return (kern_kldunload(td, uap->fileid, uap->flags)); } int kldfind(struct thread *td, struct kldfind_args *uap) { char *pathname; const char *filename; linker_file_t lf; int error; #ifdef MAC error = mac_kld_check_stat(td->td_ucred); if (error) return (error); #endif td->td_retval[0] = -1; pathname = malloc(MAXPATHLEN, M_TEMP, M_WAITOK); if ((error = copyinstr(uap->file, pathname, MAXPATHLEN, NULL)) != 0) goto out; filename = linker_basename(pathname); KLD_LOCK(); lf = linker_find_file_by_name(filename); if (lf) td->td_retval[0] = lf->id; else error = ENOENT; KLD_UNLOCK(); out: free(pathname, M_TEMP); return (error); } int kldnext(struct thread *td, struct kldnext_args *uap) { linker_file_t lf; int error = 0; #ifdef MAC error = mac_kld_check_stat(td->td_ucred); if (error) return (error); #endif KLD_LOCK(); if (uap->fileid == 0) lf = TAILQ_FIRST(&linker_files); else { lf = linker_find_file_by_id(uap->fileid); if (lf == NULL) { error = ENOENT; goto out; } lf = TAILQ_NEXT(lf, link); } /* Skip partially loaded files. 
*/ while (lf != NULL && !(lf->flags & LINKER_FILE_LINKED)) lf = TAILQ_NEXT(lf, link); if (lf) td->td_retval[0] = lf->id; else td->td_retval[0] = 0; out: KLD_UNLOCK(); return (error); } int kldstat(struct thread *td, struct kldstat_args *uap) { struct kld_file_stat stat; linker_file_t lf; int error, namelen, version, version_num; /* * Check the version of the user's structure. */ if ((error = copyin(&uap->stat->version, &version, sizeof(version))) != 0) return (error); if (version == sizeof(struct kld_file_stat_1)) version_num = 1; else if (version == sizeof(struct kld_file_stat)) version_num = 2; else return (EINVAL); #ifdef MAC error = mac_kld_check_stat(td->td_ucred); if (error) return (error); #endif KLD_LOCK(); lf = linker_find_file_by_id(uap->fileid); if (lf == NULL) { KLD_UNLOCK(); return (ENOENT); } /* Version 1 fields: */ namelen = strlen(lf->filename) + 1; if (namelen > MAXPATHLEN) namelen = MAXPATHLEN; bcopy(lf->filename, &stat.name[0], namelen); stat.refs = lf->refs; stat.id = lf->id; stat.address = lf->address; stat.size = lf->size; if (version_num > 1) { /* Version 2 fields: */ namelen = strlen(lf->pathname) + 1; if (namelen > MAXPATHLEN) namelen = MAXPATHLEN; bcopy(lf->pathname, &stat.pathname[0], namelen); } KLD_UNLOCK(); td->td_retval[0] = 0; return (copyout(&stat, uap->stat, version)); } int kldfirstmod(struct thread *td, struct kldfirstmod_args *uap) { linker_file_t lf; module_t mp; int error = 0; #ifdef MAC error = mac_kld_check_stat(td->td_ucred); if (error) return (error); #endif KLD_LOCK(); lf = linker_find_file_by_id(uap->fileid); if (lf) { MOD_SLOCK; mp = TAILQ_FIRST(&lf->modules); if (mp != NULL) td->td_retval[0] = module_getid(mp); else td->td_retval[0] = 0; MOD_SUNLOCK; } else error = ENOENT; KLD_UNLOCK(); return (error); } int kldsym(struct thread *td, struct kldsym_args *uap) { char *symstr = NULL; c_linker_sym_t sym; linker_symval_t symval; linker_file_t lf; struct kld_sym_lookup lookup; int error = 0; #ifdef MAC error = mac_kld_check_stat(td->td_ucred); if (error) return (error); #endif if ((error = copyin(uap->data, &lookup, sizeof(lookup))) != 0) return (error); if (lookup.version != sizeof(lookup) || uap->cmd != KLDSYM_LOOKUP) return (EINVAL); symstr = malloc(MAXPATHLEN, M_TEMP, M_WAITOK); if ((error = copyinstr(lookup.symname, symstr, MAXPATHLEN, NULL)) != 0) goto out; KLD_LOCK(); if (uap->fileid != 0) { lf = linker_find_file_by_id(uap->fileid); if (lf == NULL) error = ENOENT; else if (LINKER_LOOKUP_SYMBOL(lf, symstr, &sym) == 0 && LINKER_SYMBOL_VALUES(lf, sym, &symval) == 0) { lookup.symvalue = (uintptr_t) symval.value; lookup.symsize = symval.size; error = copyout(&lookup, uap->data, sizeof(lookup)); } else error = ENOENT; } else { TAILQ_FOREACH(lf, &linker_files, link) { if (LINKER_LOOKUP_SYMBOL(lf, symstr, &sym) == 0 && LINKER_SYMBOL_VALUES(lf, sym, &symval) == 0) { lookup.symvalue = (uintptr_t)symval.value; lookup.symsize = symval.size; error = copyout(&lookup, uap->data, sizeof(lookup)); break; } } #ifndef VIMAGE_GLOBALS /* * If the symbol is not found in global namespace, * try to look it up in the current vimage namespace. 
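 *
 * (Userland sketch of the lookup serviced by this syscall, with a
 * hypothetical symbol name "foo_func":
 *
 *	struct kld_sym_lookup l;
 *	l.version = sizeof(l);
 *	l.symname = "foo_func";
 *	if (kldsym(0, KLDSYM_LOOKUP, &l) == 0)
 *		printf("%#lx\n", (u_long)l.symvalue);
 *
 * fileid 0 searching all loaded files, as in the loop below.)
 *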
*/ if (lf == NULL) { CURVNET_SET(TD_TO_VNET(td)); error = vi_symlookup(&lookup, symstr); CURVNET_RESTORE(); if (error == 0) error = copyout(&lookup, uap->data, sizeof(lookup)); } #else if (lf == NULL) error = ENOENT; #endif } KLD_UNLOCK(); out: free(symstr, M_TEMP); return (error); } /* * Preloaded module support */ static modlist_t modlist_lookup(const char *name, int ver) { modlist_t mod; TAILQ_FOREACH(mod, &found_modules, link) { if (strcmp(mod->name, name) == 0 && (ver == 0 || mod->version == ver)) return (mod); } return (NULL); } static modlist_t modlist_lookup2(const char *name, struct mod_depend *verinfo) { modlist_t mod, bestmod; int ver; if (verinfo == NULL) return (modlist_lookup(name, 0)); bestmod = NULL; TAILQ_FOREACH(mod, &found_modules, link) { if (strcmp(mod->name, name) != 0) continue; ver = mod->version; if (ver == verinfo->md_ver_preferred) return (mod); if (ver >= verinfo->md_ver_minimum && ver <= verinfo->md_ver_maximum && (bestmod == NULL || ver > bestmod->version)) bestmod = mod; } return (bestmod); } static modlist_t modlist_newmodule(const char *modname, int version, linker_file_t container) { modlist_t mod; mod = malloc(sizeof(struct modlist), M_LINKER, M_NOWAIT | M_ZERO); if (mod == NULL) panic("no memory for module list"); mod->container = container; mod->name = modname; mod->version = version; TAILQ_INSERT_TAIL(&found_modules, mod, link); return (mod); } static void linker_addmodules(linker_file_t lf, struct mod_metadata **start, struct mod_metadata **stop, int preload) { struct mod_metadata *mp, **mdp; const char *modname; int ver; for (mdp = start; mdp < stop; mdp++) { mp = *mdp; if (mp->md_type != MDT_VERSION) continue; modname = mp->md_cval; ver = ((struct mod_version *)mp->md_data)->mv_version; if (modlist_lookup(modname, ver) != NULL) { printf("module %s already present!\n", modname); /* XXX what can we do? this is a build error. :-( */ continue; } modlist_newmodule(modname, ver, lf); } } static void linker_preload(void *arg) { caddr_t modptr; const char *modname, *nmodname; char *modtype; linker_file_t lf, nlf; linker_class_t lc; int error; linker_file_list_t loaded_files; linker_file_list_t depended_files; struct mod_metadata *mp, *nmp; struct mod_metadata **start, **stop, **mdp, **nmdp; struct mod_depend *verinfo; int nver; int resolves; modlist_t mod; struct sysinit **si_start, **si_stop; TAILQ_INIT(&loaded_files); TAILQ_INIT(&depended_files); TAILQ_INIT(&found_modules); error = 0; modptr = NULL; while ((modptr = preload_search_next_name(modptr)) != NULL) { modname = (char *)preload_search_info(modptr, MODINFO_NAME); modtype = (char *)preload_search_info(modptr, MODINFO_TYPE); if (modname == NULL) { printf("Preloaded module at %p does not have a" " name!\n", modptr); continue; } if (modtype == NULL) { printf("Preloaded module at %p does not have a type!\n", modptr); continue; } if (bootverbose) printf("Preloaded %s \"%s\" at %p.\n", modtype, modname, modptr); lf = NULL; TAILQ_FOREACH(lc, &classes, link) { error = LINKER_LINK_PRELOAD(lc, modname, &lf); if (!error) break; lf = NULL; } if (lf) TAILQ_INSERT_TAIL(&loaded_files, lf, loaded); } /* * First get a list of stuff in the kernel. */ if (linker_file_lookup_set(linker_kernel_file, MDT_SETNAME, &start, &stop, NULL) == 0) linker_addmodules(linker_kernel_file, start, stop, 1); /* * This is a once-off kinky bubble sort to resolve relocation * dependency requirements. 
*/ restart: TAILQ_FOREACH(lf, &loaded_files, loaded) { error = linker_file_lookup_set(lf, MDT_SETNAME, &start, &stop, NULL); /* * First, look to see if we would successfully link with this * stuff. */ resolves = 1; /* unless we know otherwise */ if (!error) { for (mdp = start; mdp < stop; mdp++) { mp = *mdp; if (mp->md_type != MDT_DEPEND) continue; modname = mp->md_cval; verinfo = mp->md_data; for (nmdp = start; nmdp < stop; nmdp++) { nmp = *nmdp; if (nmp->md_type != MDT_VERSION) continue; nmodname = nmp->md_cval; if (strcmp(modname, nmodname) == 0) break; } if (nmdp < stop) /* it's a self reference */ continue; /* * ok, the module isn't here yet, we * are not finished */ if (modlist_lookup2(modname, verinfo) == NULL) resolves = 0; } } /* * OK, if we found our modules, we can link. So, "provide" * the modules inside and add it to the end of the link order * list. */ if (resolves) { if (!error) { for (mdp = start; mdp < stop; mdp++) { mp = *mdp; if (mp->md_type != MDT_VERSION) continue; modname = mp->md_cval; nver = ((struct mod_version *) mp->md_data)->mv_version; if (modlist_lookup(modname, nver) != NULL) { printf("module %s already" " present!\n", modname); TAILQ_REMOVE(&loaded_files, lf, loaded); linker_file_unload(lf, LINKER_UNLOAD_FORCE); /* we changed tailq next ptr */ goto restart; } modlist_newmodule(modname, nver, lf); } } TAILQ_REMOVE(&loaded_files, lf, loaded); TAILQ_INSERT_TAIL(&depended_files, lf, loaded); /* * Since we provided modules, we need to restart the * sort so that the previous files that depend on us * have a chance. Also, we've busted the tailq next * pointer with the REMOVE. */ goto restart; } } /* * At this point, we check to see what could not be resolved.. */ while ((lf = TAILQ_FIRST(&loaded_files)) != NULL) { TAILQ_REMOVE(&loaded_files, lf, loaded); printf("KLD file %s is missing dependencies\n", lf->filename); linker_file_unload(lf, LINKER_UNLOAD_FORCE); } /* * We made it. Finish off the linking in the order we determined. */ TAILQ_FOREACH_SAFE(lf, &depended_files, loaded, nlf) { if (linker_kernel_file) { linker_kernel_file->refs++; error = linker_file_add_dependency(lf, linker_kernel_file); if (error) panic("cannot add dependency"); } lf->userrefs++; /* so we can (try to) kldunload it */ error = linker_file_lookup_set(lf, MDT_SETNAME, &start, &stop, NULL); if (!error) { for (mdp = start; mdp < stop; mdp++) { mp = *mdp; if (mp->md_type != MDT_DEPEND) continue; modname = mp->md_cval; verinfo = mp->md_data; mod = modlist_lookup2(modname, verinfo); /* Don't count self-dependencies */ if (lf == mod->container) continue; mod->container->refs++; error = linker_file_add_dependency(lf, mod->container); if (error) panic("cannot add dependency"); } } /* * Now do relocation etc using the symbol search paths * established by the dependencies */ error = LINKER_LINK_PRELOAD_FINISH(lf); if (error) { TAILQ_REMOVE(&depended_files, lf, loaded); printf("KLD file %s - could not finalize loading\n", lf->filename); linker_file_unload(lf, LINKER_UNLOAD_FORCE); continue; } linker_file_register_modules(lf); if (linker_file_lookup_set(lf, "sysinit_set", &si_start, &si_stop, NULL) == 0) sysinit_add(si_start, si_stop); linker_file_register_sysctls(lf); lf->flags |= LINKER_FILE_LINKED; } /* woohoo! we made it! */ } SYSINIT(preload, SI_SUB_KLD, SI_ORDER_MIDDLE, linker_preload, 0); /* * Search for a not-loaded module by name. 
* * Modules may be found in the following locations: * * - preloaded (result is just the module name) - on disk (result is full path * to module) * * If the module name is qualified in any way (contains path, etc.) the we * simply return a copy of it. * * The search path can be manipulated via sysctl. Note that we use the ';' * character as a separator to be consistent with the bootloader. */ static char linker_hintfile[] = "linker.hints"; static char linker_path[MAXPATHLEN] = "/boot/kernel;/boot/modules"; SYSCTL_STRING(_kern, OID_AUTO, module_path, CTLFLAG_RW, linker_path, sizeof(linker_path), "module load search path"); TUNABLE_STR("module_path", linker_path, sizeof(linker_path)); static char *linker_ext_list[] = { "", ".ko", NULL }; /* * Check if file actually exists either with or without extension listed in * the linker_ext_list. (probably should be generic for the rest of the * kernel) */ static char * linker_lookup_file(const char *path, int pathlen, const char *name, int namelen, struct vattr *vap) { struct nameidata nd; struct thread *td = curthread; /* XXX */ char *result, **cpp, *sep; int error, len, extlen, reclen, flags, vfslocked; enum vtype type; extlen = 0; for (cpp = linker_ext_list; *cpp; cpp++) { len = strlen(*cpp); if (len > extlen) extlen = len; } extlen++; /* trailing '\0' */ sep = (path[pathlen - 1] != '/') ? "/" : ""; reclen = pathlen + strlen(sep) + namelen + extlen + 1; result = malloc(reclen, M_LINKER, M_WAITOK); for (cpp = linker_ext_list; *cpp; cpp++) { snprintf(result, reclen, "%.*s%s%.*s%s", pathlen, path, sep, namelen, name, *cpp); /* * Attempt to open the file, and return the path if * we succeed and it's a regular file. */ NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE, UIO_SYSSPACE, result, td); flags = FREAD; error = vn_open(&nd, &flags, 0, NULL); if (error == 0) { vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); type = nd.ni_vp->v_type; if (vap) VOP_GETATTR(nd.ni_vp, vap, td->td_ucred); VOP_UNLOCK(nd.ni_vp, 0); vn_close(nd.ni_vp, FREAD, td->td_ucred, td); VFS_UNLOCK_GIANT(vfslocked); if (type == VREG) return (result); } } free(result, M_LINKER); return (NULL); } #define INT_ALIGN(base, ptr) ptr = \ (base) + (((ptr) - (base) + sizeof(int) - 1) & ~(sizeof(int) - 1)) /* * Lookup KLD which contains requested module in the "linker.hints" file. If * version specification is available, then try to find the best KLD. * Otherwise just find the latest one. */ static char * linker_hints_lookup(const char *path, int pathlen, const char *modname, int modnamelen, struct mod_depend *verinfo) { struct thread *td = curthread; /* XXX */ struct ucred *cred = td ? td->td_ucred : NULL; struct nameidata nd; struct vattr vattr, mattr; u_char *hints = NULL; u_char *cp, *recptr, *bufend, *result, *best, *pathbuf, *sep; int error, ival, bestver, *intp, reclen, found, flags, clen, blen; int vfslocked = 0; result = NULL; bestver = found = 0; sep = (path[pathlen - 1] != '/') ? 
"/" : ""; reclen = imax(modnamelen, strlen(linker_hintfile)) + pathlen + strlen(sep) + 1; pathbuf = malloc(reclen, M_LINKER, M_WAITOK); snprintf(pathbuf, reclen, "%.*s%s%s", pathlen, path, sep, linker_hintfile); NDINIT(&nd, LOOKUP, NOFOLLOW | MPSAFE, UIO_SYSSPACE, pathbuf, td); flags = FREAD; error = vn_open(&nd, &flags, 0, NULL); if (error) goto bad; vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); if (nd.ni_vp->v_type != VREG) goto bad; best = cp = NULL; error = VOP_GETATTR(nd.ni_vp, &vattr, cred); if (error) goto bad; /* * XXX: we need to limit this number to some reasonable value */ if (vattr.va_size > 100 * 1024) { printf("hints file too large %ld\n", (long)vattr.va_size); goto bad; } hints = malloc(vattr.va_size, M_TEMP, M_WAITOK); if (hints == NULL) goto bad; error = vn_rdwr(UIO_READ, nd.ni_vp, (caddr_t)hints, vattr.va_size, 0, UIO_SYSSPACE, IO_NODELOCKED, cred, NOCRED, &reclen, td); if (error) goto bad; VOP_UNLOCK(nd.ni_vp, 0); vn_close(nd.ni_vp, FREAD, cred, td); VFS_UNLOCK_GIANT(vfslocked); nd.ni_vp = NULL; if (reclen != 0) { printf("can't read %d\n", reclen); goto bad; } intp = (int *)hints; ival = *intp++; if (ival != LINKER_HINTS_VERSION) { printf("hints file version mismatch %d\n", ival); goto bad; } bufend = hints + vattr.va_size; recptr = (u_char *)intp; clen = blen = 0; while (recptr < bufend && !found) { intp = (int *)recptr; reclen = *intp++; ival = *intp++; cp = (char *)intp; switch (ival) { case MDT_VERSION: clen = *cp++; if (clen != modnamelen || bcmp(cp, modname, clen) != 0) break; cp += clen; INT_ALIGN(hints, cp); ival = *(int *)cp; cp += sizeof(int); clen = *cp++; if (verinfo == NULL || ival == verinfo->md_ver_preferred) { found = 1; break; } if (ival >= verinfo->md_ver_minimum && ival <= verinfo->md_ver_maximum && ival > bestver) { bestver = ival; best = cp; blen = clen; } break; default: break; } recptr += reclen + sizeof(int); } /* * Finally check if KLD is in the place */ if (found) result = linker_lookup_file(path, pathlen, cp, clen, &mattr); else if (best) result = linker_lookup_file(path, pathlen, best, blen, &mattr); /* * KLD is newer than hints file. What we should do now? */ if (result && timespeccmp(&mattr.va_mtime, &vattr.va_mtime, >)) printf("warning: KLD '%s' is newer than the linker.hints" " file\n", result); bad: free(pathbuf, M_LINKER); if (hints) free(hints, M_TEMP); if (nd.ni_vp != NULL) { VOP_UNLOCK(nd.ni_vp, 0); vn_close(nd.ni_vp, FREAD, cred, td); VFS_UNLOCK_GIANT(vfslocked); } /* * If nothing found or hints is absent - fallback to the old * way by using "kldname[.ko]" as module name. */ if (!found && !bestver && result == NULL) result = linker_lookup_file(path, pathlen, modname, modnamelen, NULL); return (result); } /* * Lookup KLD which contains requested module in the all directories. */ static char * linker_search_module(const char *modname, int modnamelen, struct mod_depend *verinfo) { char *cp, *ep, *result; /* * traverse the linker path */ for (cp = linker_path; *cp; cp = ep + 1) { /* find the end of this component */ for (ep = cp; (*ep != 0) && (*ep != ';'); ep++); result = linker_hints_lookup(cp, ep - cp, modname, modnamelen, verinfo); if (result != NULL) return (result); if (*ep == 0) break; } return (NULL); } /* * Search for module in all directories listed in the linker_path. */ static char * linker_search_kld(const char *name) { char *cp, *ep, *result; int len; /* qualified at all? 
*/ if (index(name, '/')) return (linker_strdup(name)); /* traverse the linker path */ len = strlen(name); for (ep = linker_path; *ep; ep++) { cp = ep; /* find the end of this component */ for (; *ep != 0 && *ep != ';'; ep++); result = linker_lookup_file(cp, ep - cp, name, len, NULL); if (result != NULL) return (result); } return (NULL); } static const char * linker_basename(const char *path) { const char *filename; filename = rindex(path, '/'); if (filename == NULL) return path; if (filename[1]) filename++; return (filename); } #ifdef HWPMC_HOOKS struct hwpmc_context { int nobjects; int nmappings; struct pmckern_map_in *kobase; }; static int linker_hwpmc_list_object(linker_file_t lf, void *arg) { struct hwpmc_context *hc; hc = arg; /* If we run out of mappings, fail. */ if (hc->nobjects >= hc->nmappings) return (1); /* Save the info for this linker file. */ hc->kobase[hc->nobjects].pm_file = lf->filename; hc->kobase[hc->nobjects].pm_address = (uintptr_t)lf->address; hc->nobjects++; return (0); } /* * Inform hwpmc about the set of kernel modules currently loaded. */ void * linker_hwpmc_list_objects(void) { struct hwpmc_context hc; hc.nmappings = 15; /* a reasonable default */ retry: /* allocate nmappings+1 entries */ hc.kobase = malloc((hc.nmappings + 1) * sizeof(struct pmckern_map_in), M_LINKER, M_WAITOK | M_ZERO); hc.nobjects = 0; if (linker_file_foreach(linker_hwpmc_list_object, &hc) != 0) { hc.nmappings = hc.nobjects; free(hc.kobase, M_LINKER); goto retry; } KASSERT(hc.nobjects > 0, ("linker_hpwmc_list_objects: no kernel " "objects?")); /* The last entry of the malloced area comprises of all zeros. */ KASSERT(hc.kobase[hc.nobjects].pm_file == NULL, ("linker_hwpmc_list_objects: last object not NULL")); return ((void *)hc.kobase); } #endif /* * Find a file which contains given module and load it, if "parent" is not * NULL, register a reference to it. */ static int linker_load_module(const char *kldname, const char *modname, struct linker_file *parent, struct mod_depend *verinfo, struct linker_file **lfpp) { linker_file_t lfdep; const char *filename; char *pathname; int error; KLD_LOCK_ASSERT(); if (modname == NULL) { /* * We have to load KLD */ KASSERT(verinfo == NULL, ("linker_load_module: verinfo" " is not NULL")); pathname = linker_search_kld(kldname); } else { if (modlist_lookup2(modname, verinfo) != NULL) return (EEXIST); if (kldname != NULL) pathname = linker_strdup(kldname); else if (rootvnode == NULL) pathname = NULL; else /* * Need to find a KLD with required module */ pathname = linker_search_module(modname, strlen(modname), verinfo); } if (pathname == NULL) return (ENOENT); /* * Can't load more than one file with the same basename XXX: * Actually it should be possible to have multiple KLDs with * the same basename but different path because they can * provide different versions of the same modules. */ filename = linker_basename(pathname); if (linker_find_file_by_name(filename)) error = EEXIST; else do { error = linker_load_file(pathname, &lfdep); if (error) break; if (modname && verinfo && modlist_lookup2(modname, verinfo) == NULL) { linker_file_unload(lfdep, LINKER_UNLOAD_FORCE); error = ENOENT; break; } if (parent) { error = linker_file_add_dependency(parent, lfdep); if (error) break; } if (lfpp) *lfpp = lfdep; } while (0); free(pathname, M_LINKER); return (error); } /* * This routine is responsible for finding dependencies of userland initiated * kldload(2)'s of files. 
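 *
 * (The dependency records resolved here come from MODULE_DEPEND() in the
 * dependent module's source; as an illustrative sketch with hypothetical
 * names,
 *
 *	MODULE_DEPEND(foo, bar, 1, 1, 1);
 *
 * declares that "foo" needs version 1 of "bar", which modlist_lookup2()
 * then matches against md_ver_minimum/preferred/maximum.)
 *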
*/ int linker_load_dependencies(linker_file_t lf) { linker_file_t lfdep; struct mod_metadata **start, **stop, **mdp, **nmdp; struct mod_metadata *mp, *nmp; struct mod_depend *verinfo; modlist_t mod; const char *modname, *nmodname; int ver, error = 0, count; /* * All files are dependant on /kernel. */ KLD_LOCK_ASSERT(); if (linker_kernel_file) { linker_kernel_file->refs++; error = linker_file_add_dependency(lf, linker_kernel_file); if (error) return (error); } if (linker_file_lookup_set(lf, MDT_SETNAME, &start, &stop, &count) != 0) return (0); for (mdp = start; mdp < stop; mdp++) { mp = *mdp; if (mp->md_type != MDT_VERSION) continue; modname = mp->md_cval; ver = ((struct mod_version *)mp->md_data)->mv_version; mod = modlist_lookup(modname, ver); if (mod != NULL) { printf("interface %s.%d already present in the KLD" " '%s'!\n", modname, ver, mod->container->filename); return (EEXIST); } } for (mdp = start; mdp < stop; mdp++) { mp = *mdp; if (mp->md_type != MDT_DEPEND) continue; modname = mp->md_cval; verinfo = mp->md_data; nmodname = NULL; for (nmdp = start; nmdp < stop; nmdp++) { nmp = *nmdp; if (nmp->md_type != MDT_VERSION) continue; nmodname = nmp->md_cval; if (strcmp(modname, nmodname) == 0) break; } if (nmdp < stop)/* early exit, it's a self reference */ continue; mod = modlist_lookup2(modname, verinfo); if (mod) { /* woohoo, it's loaded already */ lfdep = mod->container; lfdep->refs++; error = linker_file_add_dependency(lf, lfdep); if (error) break; continue; } error = linker_load_module(NULL, modname, lf, verinfo, NULL); if (error) { printf("KLD %s: depends on %s - not available\n", lf->filename, modname); break; } } if (error) return (error); linker_addmodules(lf, start, stop, 0); return (error); } static int sysctl_kern_function_list_iterate(const char *name, void *opaque) { struct sysctl_req *req; req = opaque; return (SYSCTL_OUT(req, name, strlen(name) + 1)); } /* * Export a nul-separated, double-nul-terminated list of all function names * in the kernel. */ static int sysctl_kern_function_list(SYSCTL_HANDLER_ARGS) { linker_file_t lf; int error; #ifdef MAC error = mac_kld_check_stat(req->td->td_ucred); if (error) return (error); #endif error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); KLD_LOCK(); TAILQ_FOREACH(lf, &linker_files, link) { error = LINKER_EACH_FUNCTION_NAME(lf, sysctl_kern_function_list_iterate, req); if (error) { KLD_UNLOCK(); return (error); } } KLD_UNLOCK(); return (SYSCTL_OUT(req, "", 1)); } SYSCTL_PROC(_kern, OID_AUTO, function_list, CTLFLAG_RD, NULL, 0, sysctl_kern_function_list, "", "kernel function list"); Index: head/sys/kern/kern_mib.c =================================================================== --- head/sys/kern/kern_mib.c (revision 192894) +++ head/sys/kern/kern_mib.c (revision 192895) @@ -1,472 +1,464 @@ /*- * Copyright (c) 1982, 1986, 1989, 1993 * The Regents of the University of California. All rights reserved. * * This code is derived from software contributed to Berkeley by * Mike Karels at Berkeley Software Design, Inc. * * Quite extensively rewritten by Poul-Henning Kamp of the FreeBSD * project, to make these variables more userfriendly. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_sysctl.c 8.4 (Berkeley) 4/14/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_posix.h" #include "opt_config.h" #include #include #include #include #include #include #include #include #include #include +#include #include #include SYSCTL_NODE(, 0, sysctl, CTLFLAG_RW, 0, "Sysctl internal magic"); SYSCTL_NODE(, CTL_KERN, kern, CTLFLAG_RW, 0, "High kernel, proc, limits &c"); SYSCTL_NODE(, CTL_VM, vm, CTLFLAG_RW, 0, "Virtual memory"); SYSCTL_NODE(, CTL_VFS, vfs, CTLFLAG_RW, 0, "File system"); SYSCTL_NODE(, CTL_NET, net, CTLFLAG_RW, 0, "Network, (see socket.h)"); SYSCTL_NODE(, CTL_DEBUG, debug, CTLFLAG_RW, 0, "Debugging"); SYSCTL_NODE(_debug, OID_AUTO, sizeof, CTLFLAG_RW, 0, "Sizeof various things"); SYSCTL_NODE(, CTL_HW, hw, CTLFLAG_RW, 0, "hardware"); SYSCTL_NODE(, CTL_MACHDEP, machdep, CTLFLAG_RW, 0, "machine dependent"); SYSCTL_NODE(, CTL_USER, user, CTLFLAG_RW, 0, "user-level"); SYSCTL_NODE(, CTL_P1003_1B, p1003_1b, CTLFLAG_RW, 0, "p1003_1b, (see p1003_1b.h)"); SYSCTL_NODE(, OID_AUTO, compat, CTLFLAG_RW, 0, "Compatibility code"); SYSCTL_NODE(, OID_AUTO, security, CTLFLAG_RW, 0, "Security"); #ifdef REGRESSION SYSCTL_NODE(, OID_AUTO, regression, CTLFLAG_RW, 0, "Regression test MIB"); #endif SYSCTL_STRING(_kern, OID_AUTO, ident, CTLFLAG_RD|CTLFLAG_MPSAFE, kern_ident, 0, "Kernel identifier"); SYSCTL_STRING(_kern, KERN_OSRELEASE, osrelease, CTLFLAG_RD|CTLFLAG_MPSAFE, osrelease, 0, "Operating system release"); SYSCTL_INT(_kern, KERN_OSREV, osrevision, CTLFLAG_RD, 0, BSD, "Operating system revision"); SYSCTL_STRING(_kern, KERN_VERSION, version, CTLFLAG_RD|CTLFLAG_MPSAFE, version, 0, "Kernel version"); SYSCTL_STRING(_kern, KERN_OSTYPE, ostype, CTLFLAG_RD|CTLFLAG_MPSAFE, ostype, 0, "Operating system type"); /* * NOTICE: The *userland* release date is available in * /usr/include/osreldate.h */ SYSCTL_INT(_kern, KERN_OSRELDATE, osreldate, CTLFLAG_RD, &osreldate, 0, "Kernel release date"); SYSCTL_INT(_kern, KERN_MAXPROC, maxproc, CTLFLAG_RDTUN, &maxproc, 0, "Maximum number of processes"); SYSCTL_INT(_kern, KERN_MAXPROCPERUID, maxprocperuid, CTLFLAG_RW, &maxprocperuid, 0, "Maximum processes allowed per userid"); SYSCTL_INT(_kern, OID_AUTO, maxusers, CTLFLAG_RDTUN, &maxusers, 0, "Hint for kernel tuning"); SYSCTL_INT(_kern, KERN_ARGMAX, argmax, CTLFLAG_RD, 0, ARG_MAX, "Maximum bytes of argument to execve(2)"); SYSCTL_INT(_kern, 
KERN_POSIX1, posix1version, CTLFLAG_RD, 0, _POSIX_VERSION, "Version of POSIX attempting to comply to"); SYSCTL_INT(_kern, KERN_NGROUPS, ngroups, CTLFLAG_RD, 0, NGROUPS_MAX, "Maximum number of groups a user can belong to"); SYSCTL_INT(_kern, KERN_JOB_CONTROL, job_control, CTLFLAG_RD, 0, 1, "Whether job control is available"); #ifdef _POSIX_SAVED_IDS SYSCTL_INT(_kern, KERN_SAVED_IDS, saved_ids, CTLFLAG_RD, 0, 1, "Whether saved set-group/user ID is available"); #else SYSCTL_INT(_kern, KERN_SAVED_IDS, saved_ids, CTLFLAG_RD, 0, 0, "Whether saved set-group/user ID is available"); #endif char kernelname[MAXPATHLEN] = "/kernel"; /* XXX bloat */ SYSCTL_STRING(_kern, KERN_BOOTFILE, bootfile, CTLFLAG_RW, kernelname, sizeof kernelname, "Name of kernel file booted"); SYSCTL_INT(_hw, HW_NCPU, ncpu, CTLFLAG_RD, &mp_ncpus, 0, "Number of active CPUs"); SYSCTL_INT(_hw, HW_BYTEORDER, byteorder, CTLFLAG_RD, 0, BYTE_ORDER, "System byte order"); SYSCTL_INT(_hw, HW_PAGESIZE, pagesize, CTLFLAG_RD, 0, PAGE_SIZE, "System memory page size"); static int sysctl_kern_arnd(SYSCTL_HANDLER_ARGS) { char buf[256]; size_t len; len = req->oldlen; if (len > sizeof(buf)) len = sizeof(buf); arc4rand(buf, len, 0); return (SYSCTL_OUT(req, buf, len)); } SYSCTL_PROC(_kern, KERN_ARND, arandom, CTLTYPE_OPAQUE | CTLFLAG_RD | CTLFLAG_MPSAFE, NULL, 0, sysctl_kern_arnd, "", "arc4rand"); static int sysctl_hw_physmem(SYSCTL_HANDLER_ARGS) { u_long val; val = ctob(physmem); return (sysctl_handle_long(oidp, &val, 0, req)); } SYSCTL_PROC(_hw, HW_PHYSMEM, physmem, CTLTYPE_ULONG | CTLFLAG_RD, 0, 0, sysctl_hw_physmem, "LU", ""); static int sysctl_hw_realmem(SYSCTL_HANDLER_ARGS) { u_long val; val = ctob(realmem); return (sysctl_handle_long(oidp, &val, 0, req)); } SYSCTL_PROC(_hw, HW_REALMEM, realmem, CTLTYPE_ULONG | CTLFLAG_RD, 0, 0, sysctl_hw_realmem, "LU", ""); static int sysctl_hw_usermem(SYSCTL_HANDLER_ARGS) { u_long val; val = ctob(physmem - cnt.v_wire_count); return (sysctl_handle_long(oidp, &val, 0, req)); } SYSCTL_PROC(_hw, HW_USERMEM, usermem, CTLTYPE_ULONG | CTLFLAG_RD, 0, 0, sysctl_hw_usermem, "LU", ""); SYSCTL_ULONG(_hw, OID_AUTO, availpages, CTLFLAG_RD, &physmem, 0, ""); static char machine_arch[] = MACHINE_ARCH; SYSCTL_STRING(_hw, HW_MACHINE_ARCH, machine_arch, CTLFLAG_RD, machine_arch, 0, "System architecture"); #ifdef VIMAGE_GLOBALS char hostname[MAXHOSTNAMELEN]; #endif /* * This mutex is used to protect the hostname and domainname variables, and * perhaps in the future should also protect hostid, hostuid, and others. */ struct mtx hostname_mtx; MTX_SYSINIT(hostname_mtx, &hostname_mtx, "hostname", MTX_DEF); static int sysctl_hostname(SYSCTL_HANDLER_ARGS) { INIT_VPROCG(TD_TO_VPROCG(req->td)); struct prison *pr; char tmphostname[MAXHOSTNAMELEN]; int error; pr = req->td->td_ucred->cr_prison; - if (pr != NULL) { - if (!jail_set_hostname_allowed && req->newptr) + if (pr != &prison0) { + if (!(pr->pr_allow & PR_ALLOW_SET_HOSTNAME) && req->newptr) return (EPERM); /* * Process is in jail, so make a local copy of jail * hostname to get/set so we don't have to hold the jail * mutex during the sysctl copyin/copyout activities. */ mtx_lock(&pr->pr_mtx); bcopy(pr->pr_host, tmphostname, MAXHOSTNAMELEN); mtx_unlock(&pr->pr_mtx); error = sysctl_handle_string(oidp, tmphostname, sizeof pr->pr_host, req); if (req->newptr != NULL && error == 0) { /* * Copy the locally set hostname to the jail, if * appropriate. 
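 * ("Appropriate" here means a new value was actually supplied via
 * req->newptr and sysctl_handle_string() returned without error.)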
*/ mtx_lock(&pr->pr_mtx); bcopy(tmphostname, pr->pr_host, MAXHOSTNAMELEN); mtx_unlock(&pr->pr_mtx); } } else { mtx_lock(&hostname_mtx); bcopy(V_hostname, tmphostname, MAXHOSTNAMELEN); mtx_unlock(&hostname_mtx); error = sysctl_handle_string(oidp, tmphostname, sizeof tmphostname, req); if (req->newptr != NULL && error == 0) { + mtx_lock(&prison0.pr_mtx); mtx_lock(&hostname_mtx); + bcopy(tmphostname, prison0.pr_host, MAXHOSTNAMELEN); bcopy(tmphostname, V_hostname, MAXHOSTNAMELEN); mtx_unlock(&hostname_mtx); + mtx_unlock(&prison0.pr_mtx); } } return (error); } SYSCTL_PROC(_kern, KERN_HOSTNAME, hostname, CTLTYPE_STRING|CTLFLAG_RW|CTLFLAG_PRISON|CTLFLAG_MPSAFE, 0, 0, sysctl_hostname, "A", "Hostname"); static int regression_securelevel_nonmonotonic = 0; #ifdef REGRESSION SYSCTL_INT(_regression, OID_AUTO, securelevel_nonmonotonic, CTLFLAG_RW, ®ression_securelevel_nonmonotonic, 0, "securelevel may be lowered"); #endif -int securelevel = -1; -static struct mtx securelevel_mtx; - -MTX_SYSINIT(securelevel_lock, &securelevel_mtx, "securelevel mutex lock", - MTX_DEF); - static int sysctl_kern_securelvl(SYSCTL_HANDLER_ARGS) { - struct prison *pr; - int error, level; + struct prison *pr, *cpr; + int descend, error, level; pr = req->td->td_ucred->cr_prison; /* - * If the process is in jail, return the maximum of the global and - * local levels; otherwise, return the global level. Perform a - * lockless read since the securelevel is an integer. + * Reading the securelevel is easy, since the current jail's level + * is known to be at least as secure as any higher levels. Perform + * a lockless read since the securelevel is an integer. */ - if (pr != NULL) - level = imax(securelevel, pr->pr_securelevel); - else - level = securelevel; + level = pr->pr_securelevel; error = sysctl_handle_int(oidp, &level, 0, req); if (error || !req->newptr) return (error); + /* Permit update only if the new securelevel exceeds the old. */ + sx_slock(&allprison_lock); + mtx_lock(&pr->pr_mtx); + if (!regression_securelevel_nonmonotonic && + level < pr->pr_securelevel) { + mtx_unlock(&pr->pr_mtx); + sx_sunlock(&allprison_lock); + return (EPERM); + } + pr->pr_securelevel = level; /* - * Permit update only if the new securelevel exceeds the - * global level, and local level if any. + * Set all child jails to be at least this level, but do not lower + * them (even if regression_securelevel_nonmonotonic). */ - if (pr != NULL) { - mtx_lock(&pr->pr_mtx); - if (!regression_securelevel_nonmonotonic && - (level < imax(securelevel, pr->pr_securelevel))) { - mtx_unlock(&pr->pr_mtx); - return (EPERM); - } - pr->pr_securelevel = level; - mtx_unlock(&pr->pr_mtx); - } else { - mtx_lock(&securelevel_mtx); - if (!regression_securelevel_nonmonotonic && - (level < securelevel)) { - mtx_unlock(&securelevel_mtx); - return (EPERM); - } - securelevel = level; - mtx_unlock(&securelevel_mtx); + FOREACH_PRISON_DESCENDANT_LOCKED(pr, cpr, descend) { + if (cpr->pr_securelevel < level) + cpr->pr_securelevel = level; } + mtx_unlock(&pr->pr_mtx); + sx_sunlock(&allprison_lock); return (error); } SYSCTL_PROC(_kern, KERN_SECURELVL, securelevel, CTLTYPE_INT|CTLFLAG_RW|CTLFLAG_PRISON, 0, 0, sysctl_kern_securelvl, "I", "Current secure level"); #ifdef INCLUDE_CONFIG_FILE /* Actual kernel configuration options. 
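 * The configuration text lives in kernconfstring and is exported
 * through the kern.conftxt sysctl defined below.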
*/ extern char kernconfstring[]; static int sysctl_kern_config(SYSCTL_HANDLER_ARGS) { return (sysctl_handle_string(oidp, kernconfstring, strlen(kernconfstring), req)); } SYSCTL_PROC(_kern, OID_AUTO, conftxt, CTLTYPE_STRING|CTLFLAG_RW, 0, 0, sysctl_kern_config, "", "Kernel configuration file"); #endif #ifdef VIMAGE_GLOBALS char domainname[MAXHOSTNAMELEN]; /* Protected by hostname_mtx. */ #endif static int sysctl_domainname(SYSCTL_HANDLER_ARGS) { INIT_VPROCG(TD_TO_VPROCG(req->td)); char tmpdomainname[MAXHOSTNAMELEN]; int error; mtx_lock(&hostname_mtx); bcopy(V_domainname, tmpdomainname, MAXHOSTNAMELEN); mtx_unlock(&hostname_mtx); error = sysctl_handle_string(oidp, tmpdomainname, sizeof tmpdomainname, req); if (req->newptr != NULL && error == 0) { mtx_lock(&hostname_mtx); bcopy(tmpdomainname, V_domainname, MAXHOSTNAMELEN); mtx_unlock(&hostname_mtx); } return (error); } SYSCTL_PROC(_kern, KERN_NISDOMAINNAME, domainname, CTLTYPE_STRING|CTLFLAG_RW, 0, 0, sysctl_domainname, "A", "Name of the current YP/NIS domain"); u_long hostid; SYSCTL_ULONG(_kern, KERN_HOSTID, hostid, CTLFLAG_RW, &hostid, 0, "Host ID"); char hostuuid[64] = "00000000-0000-0000-0000-000000000000"; SYSCTL_STRING(_kern, KERN_HOSTUUID, hostuuid, CTLFLAG_RW, hostuuid, sizeof(hostuuid), "Host UUID"); SYSCTL_NODE(_kern, OID_AUTO, features, CTLFLAG_RD, 0, "Kernel Features"); #ifdef COMPAT_FREEBSD4 FEATURE(compat_freebsd4, "Compatible with FreeBSD 4"); #endif #ifdef COMPAT_FREEBSD5 FEATURE(compat_freebsd5, "Compatible with FreeBSD 5"); #endif #ifdef COMPAT_FREEBSD6 FEATURE(compat_freebsd6, "Compatible with FreeBSD 6"); #endif #ifdef COMPAT_FREEBSD7 FEATURE(compat_freebsd7, "Compatible with FreeBSD 7"); #endif /* * This is really cheating. These actually live in the libc, something * which I'm not quite sure is a good idea anyway, but in order for * getnext and friends to actually work, we define dummies here. 
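 * The user.* entries below therefore report constant placeholder
 * values (zero or the empty string); the real answers are supplied
 * by libc.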
*/ SYSCTL_STRING(_user, USER_CS_PATH, cs_path, CTLFLAG_RD, "", 0, "PATH that finds all the standard utilities"); SYSCTL_INT(_user, USER_BC_BASE_MAX, bc_base_max, CTLFLAG_RD, 0, 0, "Max ibase/obase values in bc(1)"); SYSCTL_INT(_user, USER_BC_DIM_MAX, bc_dim_max, CTLFLAG_RD, 0, 0, "Max array size in bc(1)"); SYSCTL_INT(_user, USER_BC_SCALE_MAX, bc_scale_max, CTLFLAG_RD, 0, 0, "Max scale value in bc(1)"); SYSCTL_INT(_user, USER_BC_STRING_MAX, bc_string_max, CTLFLAG_RD, 0, 0, "Max string length in bc(1)"); SYSCTL_INT(_user, USER_COLL_WEIGHTS_MAX, coll_weights_max, CTLFLAG_RD, 0, 0, "Maximum number of weights assigned to an LC_COLLATE locale entry"); SYSCTL_INT(_user, USER_EXPR_NEST_MAX, expr_nest_max, CTLFLAG_RD, 0, 0, ""); SYSCTL_INT(_user, USER_LINE_MAX, line_max, CTLFLAG_RD, 0, 0, "Max length (bytes) of a text-processing utility's input line"); SYSCTL_INT(_user, USER_RE_DUP_MAX, re_dup_max, CTLFLAG_RD, 0, 0, "Maximum number of repeats of a regexp permitted"); SYSCTL_INT(_user, USER_POSIX2_VERSION, posix2_version, CTLFLAG_RD, 0, 0, "The version of POSIX 1003.2 with which the system attempts to comply"); SYSCTL_INT(_user, USER_POSIX2_C_BIND, posix2_c_bind, CTLFLAG_RD, 0, 0, "Whether C development supports the C bindings option"); SYSCTL_INT(_user, USER_POSIX2_C_DEV, posix2_c_dev, CTLFLAG_RD, 0, 0, "Whether system supports the C development utilities option"); SYSCTL_INT(_user, USER_POSIX2_CHAR_TERM, posix2_char_term, CTLFLAG_RD, 0, 0, ""); SYSCTL_INT(_user, USER_POSIX2_FORT_DEV, posix2_fort_dev, CTLFLAG_RD, 0, 0, "Whether system supports FORTRAN development utilities"); SYSCTL_INT(_user, USER_POSIX2_FORT_RUN, posix2_fort_run, CTLFLAG_RD, 0, 0, "Whether system supports FORTRAN runtime utilities"); SYSCTL_INT(_user, USER_POSIX2_LOCALEDEF, posix2_localedef, CTLFLAG_RD, 0, 0, "Whether system supports creation of locales"); SYSCTL_INT(_user, USER_POSIX2_SW_DEV, posix2_sw_dev, CTLFLAG_RD, 0, 0, "Whether system supports software development utilities"); SYSCTL_INT(_user, USER_POSIX2_UPE, posix2_upe, CTLFLAG_RD, 0, 0, "Whether system supports the user portability utilities"); SYSCTL_INT(_user, USER_STREAM_MAX, stream_max, CTLFLAG_RD, 0, 0, "Min Maximum number of streams a process may have open at one time"); SYSCTL_INT(_user, USER_TZNAME_MAX, tzname_max, CTLFLAG_RD, 0, 0, "Min Maximum number of types supported for timezone names"); #include SYSCTL_INT(_debug_sizeof, OID_AUTO, vnode, CTLFLAG_RD, 0, sizeof(struct vnode), "sizeof(struct vnode)"); SYSCTL_INT(_debug_sizeof, OID_AUTO, proc, CTLFLAG_RD, 0, sizeof(struct proc), "sizeof(struct proc)"); #include #include SYSCTL_INT(_debug_sizeof, OID_AUTO, bio, CTLFLAG_RD, 0, sizeof(struct bio), "sizeof(struct bio)"); SYSCTL_INT(_debug_sizeof, OID_AUTO, buf, CTLFLAG_RD, 0, sizeof(struct buf), "sizeof(struct buf)"); #include SYSCTL_INT(_debug_sizeof, OID_AUTO, kinfo_proc, CTLFLAG_RD, 0, sizeof(struct kinfo_proc), "sizeof(struct kinfo_proc)"); /* XXX compatibility, remove for 6.0 */ #include #include SYSCTL_INT(_kern, OID_AUTO, fallback_elf_brand, CTLFLAG_RW, &__elfN(fallback_brand), sizeof(__elfN(fallback_brand)), "compatibility for kern.fallback_elf_brand"); Index: head/sys/kern/kern_proc.c =================================================================== --- head/sys/kern/kern_proc.c (revision 192894) +++ head/sys/kern/kern_proc.c (revision 192895) @@ -1,1901 +1,1901 @@ /*- * Copyright (c) 1982, 1986, 1989, 1991, 1993 * The Regents of the University of California. All rights reserved. 
* * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_proc.c 8.7 (Berkeley) 2/14/95 */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_ddb.h" #include "opt_kdtrace.h" #include "opt_ktrace.h" #include "opt_kstack_pages.h" #include "opt_stack.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef KTRACE #include #include #endif #ifdef DDB #include #endif #include #include #include #include #include #include SDT_PROVIDER_DEFINE(proc); SDT_PROBE_DEFINE(proc, kernel, ctor, entry); SDT_PROBE_ARGTYPE(proc, kernel, ctor, entry, 0, "struct proc *"); SDT_PROBE_ARGTYPE(proc, kernel, ctor, entry, 1, "int"); SDT_PROBE_ARGTYPE(proc, kernel, ctor, entry, 2, "void *"); SDT_PROBE_ARGTYPE(proc, kernel, ctor, entry, 3, "int"); SDT_PROBE_DEFINE(proc, kernel, ctor, return); SDT_PROBE_ARGTYPE(proc, kernel, ctor, return, 0, "struct proc *"); SDT_PROBE_ARGTYPE(proc, kernel, ctor, return, 1, "int"); SDT_PROBE_ARGTYPE(proc, kernel, ctor, return, 2, "void *"); SDT_PROBE_ARGTYPE(proc, kernel, ctor, return, 3, "int"); SDT_PROBE_DEFINE(proc, kernel, dtor, entry); SDT_PROBE_ARGTYPE(proc, kernel, dtor, entry, 0, "struct proc *"); SDT_PROBE_ARGTYPE(proc, kernel, dtor, entry, 1, "int"); SDT_PROBE_ARGTYPE(proc, kernel, dtor, entry, 2, "void *"); SDT_PROBE_ARGTYPE(proc, kernel, dtor, entry, 3, "struct thread *"); SDT_PROBE_DEFINE(proc, kernel, dtor, return); SDT_PROBE_ARGTYPE(proc, kernel, dtor, return, 0, "struct proc *"); SDT_PROBE_ARGTYPE(proc, kernel, dtor, return, 1, "int"); SDT_PROBE_ARGTYPE(proc, kernel, dtor, return, 2, "void *"); SDT_PROBE_DEFINE(proc, kernel, init, entry); SDT_PROBE_ARGTYPE(proc, kernel, init, entry, 0, "struct proc *"); SDT_PROBE_ARGTYPE(proc, kernel, init, entry, 1, "int"); SDT_PROBE_ARGTYPE(proc, kernel, init, entry, 2, "int"); SDT_PROBE_DEFINE(proc, kernel, init, return); SDT_PROBE_ARGTYPE(proc, kernel, init, return, 0, "struct proc *"); SDT_PROBE_ARGTYPE(proc, kernel, init, return, 1, "int"); SDT_PROBE_ARGTYPE(proc, kernel, 
init, return, 2, "int"); MALLOC_DEFINE(M_PGRP, "pgrp", "process group header"); MALLOC_DEFINE(M_SESSION, "session", "session header"); static MALLOC_DEFINE(M_PROC, "proc", "Proc structures"); MALLOC_DEFINE(M_SUBPROC, "subproc", "Proc sub-structures"); static void doenterpgrp(struct proc *, struct pgrp *); static void orphanpg(struct pgrp *pg); static void fill_kinfo_aggregate(struct proc *p, struct kinfo_proc *kp); static void fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp); static void fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp, int preferthread); static void pgadjustjobc(struct pgrp *pgrp, int entering); static void pgdelete(struct pgrp *); static int proc_ctor(void *mem, int size, void *arg, int flags); static void proc_dtor(void *mem, int size, void *arg); static int proc_init(void *mem, int size, int flags); static void proc_fini(void *mem, int size); static void pargs_free(struct pargs *pa); /* * Other process lists */ struct pidhashhead *pidhashtbl; u_long pidhash; struct pgrphashhead *pgrphashtbl; u_long pgrphash; struct proclist allproc; struct proclist zombproc; struct sx allproc_lock; struct sx proctree_lock; struct mtx ppeers_lock; uma_zone_t proc_zone; uma_zone_t ithread_zone; int kstack_pages = KSTACK_PAGES; SYSCTL_INT(_kern, OID_AUTO, kstack_pages, CTLFLAG_RD, &kstack_pages, 0, ""); CTASSERT(sizeof(struct kinfo_proc) == KINFO_PROC_SIZE); /* * Initialize global process hashing structures. */ void procinit() { sx_init(&allproc_lock, "allproc"); sx_init(&proctree_lock, "proctree"); mtx_init(&ppeers_lock, "p_peers", NULL, MTX_DEF); LIST_INIT(&allproc); LIST_INIT(&zombproc); pidhashtbl = hashinit(maxproc / 4, M_PROC, &pidhash); pgrphashtbl = hashinit(maxproc / 4, M_PROC, &pgrphash); proc_zone = uma_zcreate("PROC", sched_sizeof_proc(), proc_ctor, proc_dtor, proc_init, proc_fini, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); uihashinit(); } /* * Prepare a proc for use. */ static int proc_ctor(void *mem, int size, void *arg, int flags) { struct proc *p; p = (struct proc *)mem; SDT_PROBE(proc, kernel, ctor , entry, p, size, arg, flags, 0); EVENTHANDLER_INVOKE(process_ctor, p); SDT_PROBE(proc, kernel, ctor , return, p, size, arg, flags, 0); return (0); } /* * Reclaim a proc after use. */ static void proc_dtor(void *mem, int size, void *arg) { struct proc *p; struct thread *td; /* INVARIANTS checks go here */ p = (struct proc *)mem; td = FIRST_THREAD_IN_PROC(p); SDT_PROBE(proc, kernel, dtor, entry, p, size, arg, td, 0); if (td != NULL) { #ifdef INVARIANTS KASSERT((p->p_numthreads == 1), ("bad number of threads in exiting process")); KASSERT(STAILQ_EMPTY(&p->p_ktr), ("proc_dtor: non-empty p_ktr")); #endif /* Free all OSD associated to this thread. */ osd_thread_exit(td); /* Dispose of an alternate kstack, if it exists. * XXX What if there are more than one thread in the proc? * The first thread in the proc is special and not * freed, so you gotta do this here. */ if (((p->p_flag & P_KTHREAD) != 0) && (td->td_altkstack != 0)) vm_thread_dispose_altkstack(td); } EVENTHANDLER_INVOKE(process_dtor, p); if (p->p_ksi != NULL) KASSERT(! KSI_ONQ(p->p_ksi), ("SIGCHLD queue")); SDT_PROBE(proc, kernel, dtor, return, p, size, arg, 0, 0); } /* * Initialize type-stable parts of a proc (when newly created). 
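 * The locks, the p_pwait condition variable and the thread list set
 * up here persist across reuse of the structure: the zone is created
 * with UMA_ZONE_NOFREE, so proc_fini() is never expected to run.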
*/ static int proc_init(void *mem, int size, int flags) { struct proc *p; p = (struct proc *)mem; SDT_PROBE(proc, kernel, init, entry, p, size, flags, 0, 0); p->p_sched = (struct p_sched *)&p[1]; bzero(&p->p_mtx, sizeof(struct mtx)); mtx_init(&p->p_mtx, "process lock", NULL, MTX_DEF | MTX_DUPOK); mtx_init(&p->p_slock, "process slock", NULL, MTX_SPIN | MTX_RECURSE); cv_init(&p->p_pwait, "ppwait"); TAILQ_INIT(&p->p_threads); /* all threads in proc */ EVENTHANDLER_INVOKE(process_init, p); p->p_stats = pstats_alloc(); SDT_PROBE(proc, kernel, init, return, p, size, flags, 0, 0); return (0); } /* * UMA should ensure that this function is never called. * Freeing a proc structure would violate type stability. */ static void proc_fini(void *mem, int size) { #ifdef notnow struct proc *p; p = (struct proc *)mem; EVENTHANDLER_INVOKE(process_fini, p); pstats_free(p->p_stats); thread_free(FIRST_THREAD_IN_PROC(p)); mtx_destroy(&p->p_mtx); if (p->p_ksi != NULL) ksiginfo_free(p->p_ksi); #else panic("proc reclaimed"); #endif } /* * Is p an inferior of the current process? */ int inferior(p) register struct proc *p; { sx_assert(&proctree_lock, SX_LOCKED); for (; p != curproc; p = p->p_pptr) if (p->p_pid == 0) return (0); return (1); } /* * Locate a process by number; return only "live" processes -- i.e., neither * zombies nor newly born but incompletely initialized processes. By not * returning processes in the PRS_NEW state, we allow callers to avoid * testing for that condition to avoid dereferencing p_ucred, et al. */ struct proc * pfind(pid) register pid_t pid; { register struct proc *p; sx_slock(&allproc_lock); LIST_FOREACH(p, PIDHASH(pid), p_hash) if (p->p_pid == pid) { if (p->p_state == PRS_NEW) { p = NULL; break; } PROC_LOCK(p); break; } sx_sunlock(&allproc_lock); return (p); } /* * Locate a process group by number. * The caller must hold proctree_lock. */ struct pgrp * pgfind(pgid) register pid_t pgid; { register struct pgrp *pgrp; sx_assert(&proctree_lock, SX_LOCKED); LIST_FOREACH(pgrp, PGRPHASH(pgid), pg_hash) { if (pgrp->pg_id == pgid) { PGRP_LOCK(pgrp); return (pgrp); } } return (NULL); } /* * Create a new process group. * pgid must be equal to the pid of p. * Begin a new session if required. */ int enterpgrp(p, pgid, pgrp, sess) register struct proc *p; pid_t pgid; struct pgrp *pgrp; struct session *sess; { struct pgrp *pgrp2; sx_assert(&proctree_lock, SX_XLOCKED); KASSERT(pgrp != NULL, ("enterpgrp: pgrp == NULL")); KASSERT(p->p_pid == pgid, ("enterpgrp: new pgrp and pid != pgid")); pgrp2 = pgfind(pgid); KASSERT(pgrp2 == NULL, ("enterpgrp: pgrp with pgid exists")); KASSERT(!SESS_LEADER(p), ("enterpgrp: session leader attempted setpgrp")); mtx_init(&pgrp->pg_mtx, "process group", NULL, MTX_DEF | MTX_DUPOK); if (sess != NULL) { /* * new session */ mtx_init(&sess->s_mtx, "session", NULL, MTX_DEF); PROC_LOCK(p); p->p_flag &= ~P_CONTROLT; PROC_UNLOCK(p); PGRP_LOCK(pgrp); sess->s_leader = p; sess->s_sid = p->p_pid; refcount_init(&sess->s_count, 1); sess->s_ttyvp = NULL; sess->s_ttyp = NULL; bcopy(p->p_session->s_login, sess->s_login, sizeof(sess->s_login)); pgrp->pg_session = sess; KASSERT(p == curproc, ("enterpgrp: mksession and p != curproc")); } else { pgrp->pg_session = p->p_session; sess_hold(pgrp->pg_session); PGRP_LOCK(pgrp); } pgrp->pg_id = pgid; LIST_INIT(&pgrp->pg_members); /* * As we have an exclusive lock of proctree_lock, * this should not deadlock. 
*/ LIST_INSERT_HEAD(PGRPHASH(pgid), pgrp, pg_hash); pgrp->pg_jobc = 0; SLIST_INIT(&pgrp->pg_sigiolst); PGRP_UNLOCK(pgrp); doenterpgrp(p, pgrp); return (0); } /* * Move p to an existing process group */ int enterthispgrp(p, pgrp) register struct proc *p; struct pgrp *pgrp; { sx_assert(&proctree_lock, SX_XLOCKED); PROC_LOCK_ASSERT(p, MA_NOTOWNED); PGRP_LOCK_ASSERT(pgrp, MA_NOTOWNED); PGRP_LOCK_ASSERT(p->p_pgrp, MA_NOTOWNED); SESS_LOCK_ASSERT(p->p_session, MA_NOTOWNED); KASSERT(pgrp->pg_session == p->p_session, ("%s: pgrp's session %p, p->p_session %p.\n", __func__, pgrp->pg_session, p->p_session)); KASSERT(pgrp != p->p_pgrp, ("%s: p belongs to pgrp.", __func__)); doenterpgrp(p, pgrp); return (0); } /* * Move p to a process group */ static void doenterpgrp(p, pgrp) struct proc *p; struct pgrp *pgrp; { struct pgrp *savepgrp; sx_assert(&proctree_lock, SX_XLOCKED); PROC_LOCK_ASSERT(p, MA_NOTOWNED); PGRP_LOCK_ASSERT(pgrp, MA_NOTOWNED); PGRP_LOCK_ASSERT(p->p_pgrp, MA_NOTOWNED); SESS_LOCK_ASSERT(p->p_session, MA_NOTOWNED); savepgrp = p->p_pgrp; /* * Adjust eligibility of affected pgrps to participate in job control. * Increment eligibility counts before decrementing, otherwise we * could reach 0 spuriously during the first call. */ fixjobc(p, pgrp, 1); fixjobc(p, p->p_pgrp, 0); PGRP_LOCK(pgrp); PGRP_LOCK(savepgrp); PROC_LOCK(p); LIST_REMOVE(p, p_pglist); p->p_pgrp = pgrp; PROC_UNLOCK(p); LIST_INSERT_HEAD(&pgrp->pg_members, p, p_pglist); PGRP_UNLOCK(savepgrp); PGRP_UNLOCK(pgrp); if (LIST_EMPTY(&savepgrp->pg_members)) pgdelete(savepgrp); } /* * remove process from process group */ int leavepgrp(p) register struct proc *p; { struct pgrp *savepgrp; sx_assert(&proctree_lock, SX_XLOCKED); savepgrp = p->p_pgrp; PGRP_LOCK(savepgrp); PROC_LOCK(p); LIST_REMOVE(p, p_pglist); p->p_pgrp = NULL; PROC_UNLOCK(p); PGRP_UNLOCK(savepgrp); if (LIST_EMPTY(&savepgrp->pg_members)) pgdelete(savepgrp); return (0); } /* * delete a process group */ static void pgdelete(pgrp) register struct pgrp *pgrp; { struct session *savesess; struct tty *tp; sx_assert(&proctree_lock, SX_XLOCKED); PGRP_LOCK_ASSERT(pgrp, MA_NOTOWNED); SESS_LOCK_ASSERT(pgrp->pg_session, MA_NOTOWNED); /* * Reset any sigio structures pointing to us as a result of * F_SETOWN with our pgid. */ funsetownlst(&pgrp->pg_sigiolst); PGRP_LOCK(pgrp); tp = pgrp->pg_session->s_ttyp; LIST_REMOVE(pgrp, pg_hash); savesess = pgrp->pg_session; PGRP_UNLOCK(pgrp); /* Remove the reference to the pgrp before deallocating it. */ if (tp != NULL) { tty_lock(tp); tty_rel_pgrp(tp, pgrp); } mtx_destroy(&pgrp->pg_mtx); free(pgrp, M_PGRP); sess_release(savesess); } static void pgadjustjobc(pgrp, entering) struct pgrp *pgrp; int entering; { PGRP_LOCK(pgrp); if (entering) pgrp->pg_jobc++; else { --pgrp->pg_jobc; if (pgrp->pg_jobc == 0) orphanpg(pgrp); } PGRP_UNLOCK(pgrp); } /* * Adjust pgrp jobc counters when specified process changes process group. * We count the number of processes in each process group that "qualify" * the group for terminal job control (those with a parent in a different * process group of the same session). If that count reaches zero, the * process group becomes orphaned. Check both the specified process' * process group and that of its children. * entering == 0 => p is leaving specified group. * entering == 1 => p is entering specified group. 
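 * Callers such as doenterpgrp() increment the new group's count
 * (entering == 1) before decrementing the old one (entering == 0),
 * so a group that remains qualified never transiently reaches zero.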
*/ void fixjobc(p, pgrp, entering) register struct proc *p; register struct pgrp *pgrp; int entering; { register struct pgrp *hispgrp; register struct session *mysession; sx_assert(&proctree_lock, SX_LOCKED); PROC_LOCK_ASSERT(p, MA_NOTOWNED); PGRP_LOCK_ASSERT(pgrp, MA_NOTOWNED); SESS_LOCK_ASSERT(pgrp->pg_session, MA_NOTOWNED); /* * Check p's parent to see whether p qualifies its own process * group; if so, adjust count for p's process group. */ mysession = pgrp->pg_session; if ((hispgrp = p->p_pptr->p_pgrp) != pgrp && hispgrp->pg_session == mysession) pgadjustjobc(pgrp, entering); /* * Check this process' children to see whether they qualify * their process groups; if so, adjust counts for children's * process groups. */ LIST_FOREACH(p, &p->p_children, p_sibling) { hispgrp = p->p_pgrp; if (hispgrp == pgrp || hispgrp->pg_session != mysession) continue; PROC_LOCK(p); if (p->p_state == PRS_ZOMBIE) { PROC_UNLOCK(p); continue; } PROC_UNLOCK(p); pgadjustjobc(hispgrp, entering); } } /* * A process group has become orphaned; * if there are any stopped processes in the group, * hang-up all process in that group. */ static void orphanpg(pg) struct pgrp *pg; { register struct proc *p; PGRP_LOCK_ASSERT(pg, MA_OWNED); LIST_FOREACH(p, &pg->pg_members, p_pglist) { PROC_LOCK(p); if (P_SHOULDSTOP(p)) { PROC_UNLOCK(p); LIST_FOREACH(p, &pg->pg_members, p_pglist) { PROC_LOCK(p); psignal(p, SIGHUP); psignal(p, SIGCONT); PROC_UNLOCK(p); } return; } PROC_UNLOCK(p); } } void sess_hold(struct session *s) { refcount_acquire(&s->s_count); } void sess_release(struct session *s) { if (refcount_release(&s->s_count)) { if (s->s_ttyp != NULL) { tty_lock(s->s_ttyp); tty_rel_sess(s->s_ttyp, s); } mtx_destroy(&s->s_mtx); free(s, M_SESSION); } } #include "opt_ddb.h" #ifdef DDB #include DB_SHOW_COMMAND(pgrpdump, pgrpdump) { register struct pgrp *pgrp; register struct proc *p; register int i; for (i = 0; i <= pgrphash; i++) { if (!LIST_EMPTY(&pgrphashtbl[i])) { printf("\tindx %d\n", i); LIST_FOREACH(pgrp, &pgrphashtbl[i], pg_hash) { printf( "\tpgrp %p, pgid %ld, sess %p, sesscnt %d, mem %p\n", (void *)pgrp, (long)pgrp->pg_id, (void *)pgrp->pg_session, pgrp->pg_session->s_count, (void *)LIST_FIRST(&pgrp->pg_members)); LIST_FOREACH(p, &pgrp->pg_members, p_pglist) { printf("\t\tpid %ld addr %p pgrp %p\n", (long)p->p_pid, (void *)p, (void *)p->p_pgrp); } } } } } #endif /* DDB */ /* * Calculate the kinfo_proc members which contain process-wide * informations. * Must be called with the target process locked. */ static void fill_kinfo_aggregate(struct proc *p, struct kinfo_proc *kp) { struct thread *td; PROC_LOCK_ASSERT(p, MA_OWNED); kp->ki_estcpu = 0; kp->ki_pctcpu = 0; kp->ki_runtime = 0; FOREACH_THREAD_IN_PROC(p, td) { thread_lock(td); kp->ki_pctcpu += sched_pctcpu(td); kp->ki_runtime += cputick2usec(td->td_runtime); kp->ki_estcpu += td->td_estcpu; thread_unlock(td); } } /* * Clear kinfo_proc and fill in any information that is common * to all threads in the process. * Must be called with the target process locked. 
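 * Per-thread fields are filled in afterwards by fill_kinfo_thread();
 * fill_kinfo_proc() combines the two for the process' first thread.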
*/ static void fill_kinfo_proc_only(struct proc *p, struct kinfo_proc *kp) { struct thread *td0; struct tty *tp; struct session *sp; struct ucred *cred; struct sigacts *ps; PROC_LOCK_ASSERT(p, MA_OWNED); bzero(kp, sizeof(*kp)); kp->ki_structsize = sizeof(*kp); kp->ki_paddr = p; kp->ki_addr =/* p->p_addr; */0; /* XXX */ kp->ki_args = p->p_args; kp->ki_textvp = p->p_textvp; #ifdef KTRACE kp->ki_tracep = p->p_tracevp; mtx_lock(&ktrace_mtx); kp->ki_traceflag = p->p_traceflag; mtx_unlock(&ktrace_mtx); #endif kp->ki_fd = p->p_fd; kp->ki_vmspace = p->p_vmspace; kp->ki_flag = p->p_flag; cred = p->p_ucred; if (cred) { kp->ki_uid = cred->cr_uid; kp->ki_ruid = cred->cr_ruid; kp->ki_svuid = cred->cr_svuid; /* XXX bde doesn't like KI_NGROUPS */ kp->ki_ngroups = min(cred->cr_ngroups, KI_NGROUPS); bcopy(cred->cr_groups, kp->ki_groups, kp->ki_ngroups * sizeof(gid_t)); kp->ki_rgid = cred->cr_rgid; kp->ki_svgid = cred->cr_svgid; /* If jailed(cred), emulate the old P_JAILED flag. */ if (jailed(cred)) { kp->ki_flag |= P_JAILED; - /* If inside a jail, use 0 as a jail ID. */ - if (!jailed(curthread->td_ucred)) + /* If inside the jail, use 0 as a jail ID. */ + if (cred->cr_prison != curthread->td_ucred->cr_prison) kp->ki_jid = cred->cr_prison->pr_id; } } ps = p->p_sigacts; if (ps) { mtx_lock(&ps->ps_mtx); kp->ki_sigignore = ps->ps_sigignore; kp->ki_sigcatch = ps->ps_sigcatch; mtx_unlock(&ps->ps_mtx); } PROC_SLOCK(p); if (p->p_state != PRS_NEW && p->p_state != PRS_ZOMBIE && p->p_vmspace != NULL) { struct vmspace *vm = p->p_vmspace; kp->ki_size = vm->vm_map.size; kp->ki_rssize = vmspace_resident_count(vm); /*XXX*/ FOREACH_THREAD_IN_PROC(p, td0) { if (!TD_IS_SWAPPED(td0)) kp->ki_rssize += td0->td_kstack_pages; if (td0->td_altkstack_obj != NULL) kp->ki_rssize += td0->td_altkstack_pages; } kp->ki_swrss = vm->vm_swrss; kp->ki_tsize = vm->vm_tsize; kp->ki_dsize = vm->vm_dsize; kp->ki_ssize = vm->vm_ssize; } else if (p->p_state == PRS_ZOMBIE) kp->ki_stat = SZOMB; if (kp->ki_flag & P_INMEM) kp->ki_sflag = PS_INMEM; else kp->ki_sflag = 0; /* Calculate legacy swtime as seconds since 'swtick'. */ kp->ki_swtime = (ticks - p->p_swtick) / hz; kp->ki_pid = p->p_pid; kp->ki_nice = p->p_nice; rufetch(p, &kp->ki_rusage); kp->ki_runtime = cputick2usec(p->p_rux.rux_runtime); PROC_SUNLOCK(p); if ((p->p_flag & P_INMEM) && p->p_stats != NULL) { kp->ki_start = p->p_stats->p_start; timevaladd(&kp->ki_start, &boottime); PROC_SLOCK(p); calcru(p, &kp->ki_rusage.ru_utime, &kp->ki_rusage.ru_stime); PROC_SUNLOCK(p); calccru(p, &kp->ki_childutime, &kp->ki_childstime); /* Some callers want child-times in a single value */ kp->ki_childtime = kp->ki_childstime; timevaladd(&kp->ki_childtime, &kp->ki_childutime); } tp = NULL; if (p->p_pgrp) { kp->ki_pgid = p->p_pgrp->pg_id; kp->ki_jobc = p->p_pgrp->pg_jobc; sp = p->p_pgrp->pg_session; if (sp != NULL) { kp->ki_sid = sp->s_sid; SESS_LOCK(sp); strlcpy(kp->ki_login, sp->s_login, sizeof(kp->ki_login)); if (sp->s_ttyvp) kp->ki_kiflag |= KI_CTTY; if (SESS_LEADER(p)) kp->ki_kiflag |= KI_SLEADER; /* XXX proctree_lock */ tp = sp->s_ttyp; SESS_UNLOCK(sp); } } if ((p->p_flag & P_CONTROLT) && tp != NULL) { kp->ki_tdev = tty_udev(tp); kp->ki_tpgid = tp->t_pgrp ? 
tp->t_pgrp->pg_id : NO_PID; if (tp->t_session) kp->ki_tsid = tp->t_session->s_sid; } else kp->ki_tdev = NODEV; if (p->p_comm[0] != '\0') strlcpy(kp->ki_comm, p->p_comm, sizeof(kp->ki_comm)); if (p->p_sysent && p->p_sysent->sv_name != NULL && p->p_sysent->sv_name[0] != '\0') strlcpy(kp->ki_emul, p->p_sysent->sv_name, sizeof(kp->ki_emul)); kp->ki_siglist = p->p_siglist; kp->ki_xstat = p->p_xstat; kp->ki_acflag = p->p_acflag; kp->ki_lock = p->p_lock; if (p->p_pptr) kp->ki_ppid = p->p_pptr->p_pid; } /* * Fill in information that is thread specific. Must be called with p_slock * locked. If 'preferthread' is set, overwrite certain process-related * fields that are maintained for both threads and processes. */ static void fill_kinfo_thread(struct thread *td, struct kinfo_proc *kp, int preferthread) { struct proc *p; p = td->td_proc; PROC_LOCK_ASSERT(p, MA_OWNED); thread_lock(td); if (td->td_wmesg != NULL) strlcpy(kp->ki_wmesg, td->td_wmesg, sizeof(kp->ki_wmesg)); else bzero(kp->ki_wmesg, sizeof(kp->ki_wmesg)); if (td->td_name[0] != '\0') strlcpy(kp->ki_ocomm, td->td_name, sizeof(kp->ki_ocomm)); if (TD_ON_LOCK(td)) { kp->ki_kiflag |= KI_LOCKBLOCK; strlcpy(kp->ki_lockname, td->td_lockname, sizeof(kp->ki_lockname)); } else { kp->ki_kiflag &= ~KI_LOCKBLOCK; bzero(kp->ki_lockname, sizeof(kp->ki_lockname)); } if (p->p_state == PRS_NORMAL) { /* approximate. */ if (TD_ON_RUNQ(td) || TD_CAN_RUN(td) || TD_IS_RUNNING(td)) { kp->ki_stat = SRUN; } else if (P_SHOULDSTOP(p)) { kp->ki_stat = SSTOP; } else if (TD_IS_SLEEPING(td)) { kp->ki_stat = SSLEEP; } else if (TD_ON_LOCK(td)) { kp->ki_stat = SLOCK; } else { kp->ki_stat = SWAIT; } } else if (p->p_state == PRS_ZOMBIE) { kp->ki_stat = SZOMB; } else { kp->ki_stat = SIDL; } /* Things in the thread */ kp->ki_wchan = td->td_wchan; kp->ki_pri.pri_level = td->td_priority; kp->ki_pri.pri_native = td->td_base_pri; kp->ki_lastcpu = td->td_lastcpu; kp->ki_oncpu = td->td_oncpu; kp->ki_tdflags = td->td_flags; kp->ki_tid = td->td_tid; kp->ki_numthreads = p->p_numthreads; kp->ki_pcb = td->td_pcb; kp->ki_kstack = (void *)td->td_kstack; kp->ki_slptime = (ticks - td->td_slptick) / hz; kp->ki_pri.pri_class = td->td_pri_class; kp->ki_pri.pri_user = td->td_user_pri; if (preferthread) { kp->ki_runtime = cputick2usec(td->td_runtime); kp->ki_pctcpu = sched_pctcpu(td); kp->ki_estcpu = td->td_estcpu; } /* We can't get this anymore but ps etc never used it anyway. */ kp->ki_rqindex = 0; SIGSETOR(kp->ki_siglist, td->td_siglist); kp->ki_sigmask = td->td_sigmask; thread_unlock(td); } /* * Fill in a kinfo_proc structure for the specified process. * Must be called with the target process locked. */ void fill_kinfo_proc(struct proc *p, struct kinfo_proc *kp) { MPASS(FIRST_THREAD_IN_PROC(p) != NULL); fill_kinfo_proc_only(p, kp); fill_kinfo_thread(FIRST_THREAD_IN_PROC(p), kp, 0); fill_kinfo_aggregate(p, kp); } struct pstats * pstats_alloc(void) { return (malloc(sizeof(struct pstats), M_SUBPROC, M_ZERO|M_WAITOK)); } /* * Copy parts of p_stats; zero the rest of p_stats (statistics). 
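 * The two regions are delimited by the pstat_startzero/pstat_endzero
 * and pstat_startcopy/pstat_endcopy markers in struct pstats.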
*/ void pstats_fork(struct pstats *src, struct pstats *dst) { bzero(&dst->pstat_startzero, __rangeof(struct pstats, pstat_startzero, pstat_endzero)); bcopy(&src->pstat_startcopy, &dst->pstat_startcopy, __rangeof(struct pstats, pstat_startcopy, pstat_endcopy)); } void pstats_free(struct pstats *ps) { free(ps, M_SUBPROC); } /* * Locate a zombie process by number */ struct proc * zpfind(pid_t pid) { struct proc *p; sx_slock(&allproc_lock); LIST_FOREACH(p, &zombproc, p_list) if (p->p_pid == pid) { PROC_LOCK(p); break; } sx_sunlock(&allproc_lock); return (p); } #define KERN_PROC_ZOMBMASK 0x3 #define KERN_PROC_NOTHREADS 0x4 /* * Must be called with the process locked and will return with it unlocked. */ static int sysctl_out_proc(struct proc *p, struct sysctl_req *req, int flags) { struct thread *td; struct kinfo_proc kinfo_proc; int error = 0; struct proc *np; pid_t pid = p->p_pid; PROC_LOCK_ASSERT(p, MA_OWNED); MPASS(FIRST_THREAD_IN_PROC(p) != NULL); fill_kinfo_proc(p, &kinfo_proc); if (flags & KERN_PROC_NOTHREADS) error = SYSCTL_OUT(req, (caddr_t)&kinfo_proc, sizeof(kinfo_proc)); else { FOREACH_THREAD_IN_PROC(p, td) { fill_kinfo_thread(td, &kinfo_proc, 1); error = SYSCTL_OUT(req, (caddr_t)&kinfo_proc, sizeof(kinfo_proc)); if (error) break; } } PROC_UNLOCK(p); if (error) return (error); if (flags & KERN_PROC_ZOMBMASK) np = zpfind(pid); else { if (pid == 0) return (0); np = pfind(pid); } if (np == NULL) return (ESRCH); if (np != p) { PROC_UNLOCK(np); return (ESRCH); } PROC_UNLOCK(np); return (0); } static int sysctl_kern_proc(SYSCTL_HANDLER_ARGS) { int *name = (int*) arg1; u_int namelen = arg2; struct proc *p; int flags, doingzomb, oid_number; int error = 0; oid_number = oidp->oid_number; if (oid_number != KERN_PROC_ALL && (oid_number & KERN_PROC_INC_THREAD) == 0) flags = KERN_PROC_NOTHREADS; else { flags = 0; oid_number &= ~KERN_PROC_INC_THREAD; } if (oid_number == KERN_PROC_PID) { if (namelen != 1) return (EINVAL); error = sysctl_wire_old_buffer(req, 0); if (error) return (error); p = pfind((pid_t)name[0]); if (!p) return (ESRCH); if ((error = p_cansee(curthread, p))) { PROC_UNLOCK(p); return (error); } error = sysctl_out_proc(p, req, flags); return (error); } switch (oid_number) { case KERN_PROC_ALL: if (namelen != 0) return (EINVAL); break; case KERN_PROC_PROC: if (namelen != 0 && namelen != 1) return (EINVAL); break; default: if (namelen != 1) return (EINVAL); break; } if (!req->oldptr) { /* overestimate by 5 procs */ error = SYSCTL_OUT(req, 0, sizeof (struct kinfo_proc) * 5); if (error) return (error); } error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); sx_slock(&allproc_lock); for (doingzomb=0 ; doingzomb < 2 ; doingzomb++) { if (!doingzomb) p = LIST_FIRST(&allproc); else p = LIST_FIRST(&zombproc); for (; p != 0; p = LIST_NEXT(p, p_list)) { /* * Skip embryonic processes. */ PROC_SLOCK(p); if (p->p_state == PRS_NEW) { PROC_SUNLOCK(p); continue; } PROC_SUNLOCK(p); PROC_LOCK(p); KASSERT(p->p_ucred != NULL, ("process credential is NULL for non-NEW proc")); /* * Show a user only appropriate processes. */ if (p_cansee(curthread, p)) { PROC_UNLOCK(p); continue; } /* * TODO - make more efficient (see notes below). * do by session. 
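 * For now each filter below is applied during a linear scan over
 * allproc (and then zombproc), testing every candidate in turn.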
*/ switch (oid_number) { case KERN_PROC_GID: if (p->p_ucred->cr_gid != (gid_t)name[0]) { PROC_UNLOCK(p); continue; } break; case KERN_PROC_PGRP: /* could do this by traversing pgrp */ if (p->p_pgrp == NULL || p->p_pgrp->pg_id != (pid_t)name[0]) { PROC_UNLOCK(p); continue; } break; case KERN_PROC_RGID: if (p->p_ucred->cr_rgid != (gid_t)name[0]) { PROC_UNLOCK(p); continue; } break; case KERN_PROC_SESSION: if (p->p_session == NULL || p->p_session->s_sid != (pid_t)name[0]) { PROC_UNLOCK(p); continue; } break; case KERN_PROC_TTY: if ((p->p_flag & P_CONTROLT) == 0 || p->p_session == NULL) { PROC_UNLOCK(p); continue; } /* XXX proctree_lock */ SESS_LOCK(p->p_session); if (p->p_session->s_ttyp == NULL || tty_udev(p->p_session->s_ttyp) != (dev_t)name[0]) { SESS_UNLOCK(p->p_session); PROC_UNLOCK(p); continue; } SESS_UNLOCK(p->p_session); break; case KERN_PROC_UID: if (p->p_ucred->cr_uid != (uid_t)name[0]) { PROC_UNLOCK(p); continue; } break; case KERN_PROC_RUID: if (p->p_ucred->cr_ruid != (uid_t)name[0]) { PROC_UNLOCK(p); continue; } break; case KERN_PROC_PROC: break; default: break; } error = sysctl_out_proc(p, req, flags | doingzomb); if (error) { sx_sunlock(&allproc_lock); return (error); } } } sx_sunlock(&allproc_lock); return (0); } struct pargs * pargs_alloc(int len) { struct pargs *pa; pa = malloc(sizeof(struct pargs) + len, M_PARGS, M_WAITOK); refcount_init(&pa->ar_ref, 1); pa->ar_length = len; return (pa); } static void pargs_free(struct pargs *pa) { free(pa, M_PARGS); } void pargs_hold(struct pargs *pa) { if (pa == NULL) return; refcount_acquire(&pa->ar_ref); } void pargs_drop(struct pargs *pa) { if (pa == NULL) return; if (refcount_release(&pa->ar_ref)) pargs_free(pa); } /* * This sysctl allows a process to retrieve the argument list or process * title for another process without groping around in the address space * of the other process. It also allow a process to set its own "process * title to a string of its own choice. */ static int sysctl_kern_proc_args(SYSCTL_HANDLER_ARGS) { int *name = (int*) arg1; u_int namelen = arg2; struct pargs *newpa, *pa; struct proc *p; int error = 0; if (namelen != 1) return (EINVAL); p = pfind((pid_t)name[0]); if (!p) return (ESRCH); if ((error = p_cansee(curthread, p)) != 0) { PROC_UNLOCK(p); return (error); } if (req->newptr && curproc != p) { PROC_UNLOCK(p); return (EPERM); } pa = p->p_args; pargs_hold(pa); PROC_UNLOCK(p); if (req->oldptr != NULL && pa != NULL) error = SYSCTL_OUT(req, pa->ar_args, pa->ar_length); pargs_drop(pa); if (error != 0 || req->newptr == NULL) return (error); if (req->newlen + sizeof(struct pargs) > ps_arg_cache_limit) return (ENOMEM); newpa = pargs_alloc(req->newlen); error = SYSCTL_IN(req, newpa->ar_args, req->newlen); if (error != 0) { pargs_free(newpa); return (error); } PROC_LOCK(p); pa = p->p_args; p->p_args = newpa; PROC_UNLOCK(p); pargs_drop(pa); return (0); } /* * This sysctl allows a process to retrieve the path of the executable for * itself or another process. 
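 * A pid argument of -1 refers to the calling process; any other pid
 * is subject to the usual p_cansee() visibility check.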
*/ static int sysctl_kern_proc_pathname(SYSCTL_HANDLER_ARGS) { pid_t *pidp = (pid_t *)arg1; unsigned int arglen = arg2; struct proc *p; struct vnode *vp; char *retbuf, *freebuf; int error, vfslocked; if (arglen != 1) return (EINVAL); if (*pidp == -1) { /* -1 means this process */ p = req->td->td_proc; } else { p = pfind(*pidp); if (p == NULL) return (ESRCH); if ((error = p_cansee(curthread, p)) != 0) { PROC_UNLOCK(p); return (error); } } vp = p->p_textvp; if (vp == NULL) { if (*pidp != -1) PROC_UNLOCK(p); return (0); } vref(vp); if (*pidp != -1) PROC_UNLOCK(p); error = vn_fullpath(req->td, vp, &retbuf, &freebuf); vfslocked = VFS_LOCK_GIANT(vp->v_mount); vrele(vp); VFS_UNLOCK_GIANT(vfslocked); if (error) return (error); error = SYSCTL_OUT(req, retbuf, strlen(retbuf) + 1); free(freebuf, M_TEMP); return (error); } static int sysctl_kern_proc_sv_name(SYSCTL_HANDLER_ARGS) { struct proc *p; char *sv_name; int *name; int namelen; int error; namelen = arg2; if (namelen != 1) return (EINVAL); name = (int *)arg1; if ((p = pfind((pid_t)name[0])) == NULL) return (ESRCH); if ((error = p_cansee(curthread, p))) { PROC_UNLOCK(p); return (error); } sv_name = p->p_sysent->sv_name; PROC_UNLOCK(p); return (sysctl_handle_string(oidp, sv_name, 0, req)); } #ifdef KINFO_OVMENTRY_SIZE CTASSERT(sizeof(struct kinfo_ovmentry) == KINFO_OVMENTRY_SIZE); #endif #ifdef COMPAT_FREEBSD7 static int sysctl_kern_proc_ovmmap(SYSCTL_HANDLER_ARGS) { vm_map_entry_t entry, tmp_entry; unsigned int last_timestamp; char *fullpath, *freepath; struct kinfo_ovmentry *kve; struct vattr va; struct ucred *cred; int error, *name; struct vnode *vp; struct proc *p; vm_map_t map; struct vmspace *vm; name = (int *)arg1; if ((p = pfind((pid_t)name[0])) == NULL) return (ESRCH); if (p->p_flag & P_WEXIT) { PROC_UNLOCK(p); return (ESRCH); } if ((error = p_candebug(curthread, p))) { PROC_UNLOCK(p); return (error); } _PHOLD(p); PROC_UNLOCK(p); vm = vmspace_acquire_ref(p); if (vm == NULL) { PRELE(p); return (ESRCH); } kve = malloc(sizeof(*kve), M_TEMP, M_WAITOK); map = &p->p_vmspace->vm_map; /* XXXRW: More locking required? 
*/ vm_map_lock_read(map); for (entry = map->header.next; entry != &map->header; entry = entry->next) { vm_object_t obj, tobj, lobj; vm_offset_t addr; int vfslocked; if (entry->eflags & MAP_ENTRY_IS_SUB_MAP) continue; bzero(kve, sizeof(*kve)); kve->kve_structsize = sizeof(*kve); kve->kve_private_resident = 0; obj = entry->object.vm_object; if (obj != NULL) { VM_OBJECT_LOCK(obj); if (obj->shadow_count == 1) kve->kve_private_resident = obj->resident_page_count; } kve->kve_resident = 0; addr = entry->start; while (addr < entry->end) { if (pmap_extract(map->pmap, addr)) kve->kve_resident++; addr += PAGE_SIZE; } for (lobj = tobj = obj; tobj; tobj = tobj->backing_object) { if (tobj != obj) VM_OBJECT_LOCK(tobj); if (lobj != obj) VM_OBJECT_UNLOCK(lobj); lobj = tobj; } kve->kve_start = (void*)entry->start; kve->kve_end = (void*)entry->end; kve->kve_offset = (off_t)entry->offset; if (entry->protection & VM_PROT_READ) kve->kve_protection |= KVME_PROT_READ; if (entry->protection & VM_PROT_WRITE) kve->kve_protection |= KVME_PROT_WRITE; if (entry->protection & VM_PROT_EXECUTE) kve->kve_protection |= KVME_PROT_EXEC; if (entry->eflags & MAP_ENTRY_COW) kve->kve_flags |= KVME_FLAG_COW; if (entry->eflags & MAP_ENTRY_NEEDS_COPY) kve->kve_flags |= KVME_FLAG_NEEDS_COPY; last_timestamp = map->timestamp; vm_map_unlock_read(map); kve->kve_fileid = 0; kve->kve_fsid = 0; freepath = NULL; fullpath = ""; if (lobj) { vp = NULL; switch (lobj->type) { case OBJT_DEFAULT: kve->kve_type = KVME_TYPE_DEFAULT; break; case OBJT_VNODE: kve->kve_type = KVME_TYPE_VNODE; vp = lobj->handle; vref(vp); break; case OBJT_SWAP: kve->kve_type = KVME_TYPE_SWAP; break; case OBJT_DEVICE: kve->kve_type = KVME_TYPE_DEVICE; break; case OBJT_PHYS: kve->kve_type = KVME_TYPE_PHYS; break; case OBJT_DEAD: kve->kve_type = KVME_TYPE_DEAD; break; default: kve->kve_type = KVME_TYPE_UNKNOWN; break; } if (lobj != obj) VM_OBJECT_UNLOCK(lobj); kve->kve_ref_count = obj->ref_count; kve->kve_shadow_count = obj->shadow_count; VM_OBJECT_UNLOCK(obj); if (vp != NULL) { vn_fullpath(curthread, vp, &fullpath, &freepath); cred = curthread->td_ucred; vfslocked = VFS_LOCK_GIANT(vp->v_mount); vn_lock(vp, LK_SHARED | LK_RETRY); if (VOP_GETATTR(vp, &va, cred) == 0) { kve->kve_fileid = va.va_fileid; kve->kve_fsid = va.va_fsid; } vput(vp); VFS_UNLOCK_GIANT(vfslocked); } } else { kve->kve_type = KVME_TYPE_NONE; kve->kve_ref_count = 0; kve->kve_shadow_count = 0; } strlcpy(kve->kve_path, fullpath, sizeof(kve->kve_path)); if (freepath != NULL) free(freepath, M_TEMP); error = SYSCTL_OUT(req, kve, sizeof(*kve)); vm_map_lock_read(map); if (error) break; if (last_timestamp != map->timestamp) { vm_map_lookup_entry(map, addr - 1, &tmp_entry); entry = tmp_entry; } } vm_map_unlock_read(map); vmspace_free(vm); PRELE(p); free(kve, M_TEMP); return (error); } #endif /* COMPAT_FREEBSD7 */ #ifdef KINFO_VMENTRY_SIZE CTASSERT(sizeof(struct kinfo_vmentry) == KINFO_VMENTRY_SIZE); #endif static int sysctl_kern_proc_vmmap(SYSCTL_HANDLER_ARGS) { vm_map_entry_t entry, tmp_entry; unsigned int last_timestamp; char *fullpath, *freepath; struct kinfo_vmentry *kve; struct vattr va; struct ucred *cred; int error, *name; struct vnode *vp; struct proc *p; struct vmspace *vm; vm_map_t map; name = (int *)arg1; if ((p = pfind((pid_t)name[0])) == NULL) return (ESRCH); if (p->p_flag & P_WEXIT) { PROC_UNLOCK(p); return (ESRCH); } if ((error = p_candebug(curthread, p))) { PROC_UNLOCK(p); return (error); } _PHOLD(p); PROC_UNLOCK(p); vm = vmspace_acquire_ref(p); if (vm == NULL) { PRELE(p); return (ESRCH); } kve = 
malloc(sizeof(*kve), M_TEMP, M_WAITOK); map = &vm->vm_map; /* XXXRW: More locking required? */ vm_map_lock_read(map); for (entry = map->header.next; entry != &map->header; entry = entry->next) { vm_object_t obj, tobj, lobj; vm_offset_t addr; int vfslocked; if (entry->eflags & MAP_ENTRY_IS_SUB_MAP) continue; bzero(kve, sizeof(*kve)); kve->kve_private_resident = 0; obj = entry->object.vm_object; if (obj != NULL) { VM_OBJECT_LOCK(obj); if (obj->shadow_count == 1) kve->kve_private_resident = obj->resident_page_count; } kve->kve_resident = 0; addr = entry->start; while (addr < entry->end) { if (pmap_extract(map->pmap, addr)) kve->kve_resident++; addr += PAGE_SIZE; } for (lobj = tobj = obj; tobj; tobj = tobj->backing_object) { if (tobj != obj) VM_OBJECT_LOCK(tobj); if (lobj != obj) VM_OBJECT_UNLOCK(lobj); lobj = tobj; } kve->kve_start = entry->start; kve->kve_end = entry->end; kve->kve_offset = entry->offset; if (entry->protection & VM_PROT_READ) kve->kve_protection |= KVME_PROT_READ; if (entry->protection & VM_PROT_WRITE) kve->kve_protection |= KVME_PROT_WRITE; if (entry->protection & VM_PROT_EXECUTE) kve->kve_protection |= KVME_PROT_EXEC; if (entry->eflags & MAP_ENTRY_COW) kve->kve_flags |= KVME_FLAG_COW; if (entry->eflags & MAP_ENTRY_NEEDS_COPY) kve->kve_flags |= KVME_FLAG_NEEDS_COPY; last_timestamp = map->timestamp; vm_map_unlock_read(map); kve->kve_fileid = 0; kve->kve_fsid = 0; freepath = NULL; fullpath = ""; if (lobj) { vp = NULL; switch (lobj->type) { case OBJT_DEFAULT: kve->kve_type = KVME_TYPE_DEFAULT; break; case OBJT_VNODE: kve->kve_type = KVME_TYPE_VNODE; vp = lobj->handle; vref(vp); break; case OBJT_SWAP: kve->kve_type = KVME_TYPE_SWAP; break; case OBJT_DEVICE: kve->kve_type = KVME_TYPE_DEVICE; break; case OBJT_PHYS: kve->kve_type = KVME_TYPE_PHYS; break; case OBJT_DEAD: kve->kve_type = KVME_TYPE_DEAD; break; default: kve->kve_type = KVME_TYPE_UNKNOWN; break; } if (lobj != obj) VM_OBJECT_UNLOCK(lobj); kve->kve_ref_count = obj->ref_count; kve->kve_shadow_count = obj->shadow_count; VM_OBJECT_UNLOCK(obj); if (vp != NULL) { vn_fullpath(curthread, vp, &fullpath, &freepath); cred = curthread->td_ucred; vfslocked = VFS_LOCK_GIANT(vp->v_mount); vn_lock(vp, LK_SHARED | LK_RETRY); if (VOP_GETATTR(vp, &va, cred) == 0) { kve->kve_fileid = va.va_fileid; kve->kve_fsid = va.va_fsid; } vput(vp); VFS_UNLOCK_GIANT(vfslocked); } } else { kve->kve_type = KVME_TYPE_NONE; kve->kve_ref_count = 0; kve->kve_shadow_count = 0; } strlcpy(kve->kve_path, fullpath, sizeof(kve->kve_path)); if (freepath != NULL) free(freepath, M_TEMP); /* Pack record size down */ kve->kve_structsize = offsetof(struct kinfo_vmentry, kve_path) + strlen(kve->kve_path) + 1; kve->kve_structsize = roundup(kve->kve_structsize, sizeof(uint64_t)); error = SYSCTL_OUT(req, kve, kve->kve_structsize); vm_map_lock_read(map); if (error) break; if (last_timestamp != map->timestamp) { vm_map_lookup_entry(map, addr - 1, &tmp_entry); entry = tmp_entry; } } vm_map_unlock_read(map); vmspace_free(vm); PRELE(p); free(kve, M_TEMP); return (error); } #if defined(STACK) || defined(DDB) static int sysctl_kern_proc_kstack(SYSCTL_HANDLER_ARGS) { struct kinfo_kstack *kkstp; int error, i, *name, numthreads; lwpid_t *lwpidarray; struct thread *td; struct stack *st; struct sbuf sb; struct proc *p; name = (int *)arg1; if ((p = pfind((pid_t)name[0])) == NULL) return (ESRCH); /* XXXRW: Not clear ESRCH is the right error during proc execve(). 
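 * (A process that is exiting (P_WEXIT) or in the middle of an
 * execve (P_INEXEC) is simply reported as not found.)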
*/ if (p->p_flag & P_WEXIT || p->p_flag & P_INEXEC) { PROC_UNLOCK(p); return (ESRCH); } if ((error = p_candebug(curthread, p))) { PROC_UNLOCK(p); return (error); } _PHOLD(p); PROC_UNLOCK(p); kkstp = malloc(sizeof(*kkstp), M_TEMP, M_WAITOK); st = stack_create(); lwpidarray = NULL; numthreads = 0; PROC_LOCK(p); repeat: if (numthreads < p->p_numthreads) { if (lwpidarray != NULL) { free(lwpidarray, M_TEMP); lwpidarray = NULL; } numthreads = p->p_numthreads; PROC_UNLOCK(p); lwpidarray = malloc(sizeof(*lwpidarray) * numthreads, M_TEMP, M_WAITOK | M_ZERO); PROC_LOCK(p); goto repeat; } i = 0; /* * XXXRW: During the below loop, execve(2) and countless other sorts * of changes could have taken place. Should we check to see if the * vmspace has been replaced, or the like, in order to prevent * giving a snapshot that spans, say, execve(2), with some threads * before and some after? Among other things, the credentials could * have changed, in which case the right to extract debug info might * no longer be assured. */ FOREACH_THREAD_IN_PROC(p, td) { KASSERT(i < numthreads, ("sysctl_kern_proc_kstack: numthreads")); lwpidarray[i] = td->td_tid; i++; } numthreads = i; for (i = 0; i < numthreads; i++) { td = thread_find(p, lwpidarray[i]); if (td == NULL) { continue; } bzero(kkstp, sizeof(*kkstp)); (void)sbuf_new(&sb, kkstp->kkst_trace, sizeof(kkstp->kkst_trace), SBUF_FIXEDLEN); thread_lock(td); kkstp->kkst_tid = td->td_tid; if (TD_IS_SWAPPED(td)) kkstp->kkst_state = KKST_STATE_SWAPPED; else if (TD_IS_RUNNING(td)) kkstp->kkst_state = KKST_STATE_RUNNING; else { kkstp->kkst_state = KKST_STATE_STACKOK; stack_save_td(st, td); } thread_unlock(td); PROC_UNLOCK(p); stack_sbuf_print(&sb, st); sbuf_finish(&sb); sbuf_delete(&sb); error = SYSCTL_OUT(req, kkstp, sizeof(*kkstp)); PROC_LOCK(p); if (error) break; } _PRELE(p); PROC_UNLOCK(p); if (lwpidarray != NULL) free(lwpidarray, M_TEMP); stack_destroy(st); free(kkstp, M_TEMP); return (error); } #endif SYSCTL_NODE(_kern, KERN_PROC, proc, CTLFLAG_RD, 0, "Process table"); SYSCTL_PROC(_kern_proc, KERN_PROC_ALL, all, CTLFLAG_RD|CTLTYPE_STRUCT| CTLFLAG_MPSAFE, 0, 0, sysctl_kern_proc, "S,proc", "Return entire process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_GID, gid, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_PGRP, pgrp, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_RGID, rgid, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_SESSION, sid, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_TTY, tty, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_UID, uid, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_RUID, ruid, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_PID, pid, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, KERN_PROC_PROC, proc, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Return process table, no threads"); static SYSCTL_NODE(_kern_proc, KERN_PROC_ARGS, args, CTLFLAG_RW | CTLFLAG_ANYBODY | CTLFLAG_MPSAFE, sysctl_kern_proc_args, "Process argument list"); static SYSCTL_NODE(_kern_proc, KERN_PROC_PATHNAME, pathname, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc_pathname, "Process executable path"); static 
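/*
 * Editor's note (illustrative sketch, not part of this change): the
 * sysctl_kern_proc_vmmap() handler above emits variable-length records,
 * truncating each one after the NUL-terminated kve_path and rounding the
 * result up to an 8-byte boundary in kve_structsize.  A userland consumer
 * of the kern.proc.vmmap node (MIB { CTL_KERN, KERN_PROC, KERN_PROC_VMMAP,
 * pid }) is therefore expected to walk the returned buffer by
 * kve_structsize rather than by sizeof(struct kinfo_vmentry), roughly as
 * follows; error handling and the sysctl(3) call that fills buf/len are
 * omitted, and the field types shown are assumptions based on the code
 * above:
 *
 *      char *bp = buf;
 *      while (bp < buf + len) {
 *              struct kinfo_vmentry *kve = (struct kinfo_vmentry *)bp;
 *              printf("%#jx-%#jx %s\n", (uintmax_t)kve->kve_start,
 *                  (uintmax_t)kve->kve_end, kve->kve_path);
 *              bp += kve->kve_structsize;
 *      }
 */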
SYSCTL_NODE(_kern_proc, KERN_PROC_SV_NAME, sv_name, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc_sv_name, "Process syscall vector name (ABI type)"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_GID | KERN_PROC_INC_THREAD), gid_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_PGRP | KERN_PROC_INC_THREAD), pgrp_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_RGID | KERN_PROC_INC_THREAD), rgid_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_SESSION | KERN_PROC_INC_THREAD), sid_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_TTY | KERN_PROC_INC_THREAD), tty_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_UID | KERN_PROC_INC_THREAD), uid_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_RUID | KERN_PROC_INC_THREAD), ruid_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_PID | KERN_PROC_INC_THREAD), pid_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Process table"); static SYSCTL_NODE(_kern_proc, (KERN_PROC_PROC | KERN_PROC_INC_THREAD), proc_td, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc, "Return process table, no threads"); #ifdef COMPAT_FREEBSD7 static SYSCTL_NODE(_kern_proc, KERN_PROC_OVMMAP, ovmmap, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc_ovmmap, "Old Process vm map entries"); #endif static SYSCTL_NODE(_kern_proc, KERN_PROC_VMMAP, vmmap, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc_vmmap, "Process vm map entries"); #if defined(STACK) || defined(DDB) static SYSCTL_NODE(_kern_proc, KERN_PROC_KSTACK, kstack, CTLFLAG_RD | CTLFLAG_MPSAFE, sysctl_kern_proc_kstack, "Process kernel stacks"); #endif Index: head/sys/kern/kern_prot.c =================================================================== --- head/sys/kern/kern_prot.c (revision 192894) +++ head/sys/kern/kern_prot.c (revision 192895) @@ -1,2083 +1,2074 @@ /*- * Copyright (c) 1982, 1986, 1989, 1990, 1991, 1993 * The Regents of the University of California. * (c) UNIX System Laboratories, Inc. * Copyright (c) 2000-2001 Robert N. M. Watson. * All rights reserved. * * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_prot.c 8.6 (Berkeley) 1/21/94 */ /* * System calls related to processes and protection */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_inet.h" #include "opt_inet6.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #if defined(INET) || defined(INET6) #include #include #endif #include #include static MALLOC_DEFINE(M_CRED, "cred", "credentials"); SYSCTL_NODE(_security, OID_AUTO, bsd, CTLFLAG_RW, 0, "BSD security policy"); #ifndef _SYS_SYSPROTO_H_ struct getpid_args { int dummy; }; #endif /* ARGSUSED */ int getpid(struct thread *td, struct getpid_args *uap) { struct proc *p = td->td_proc; td->td_retval[0] = p->p_pid; #if defined(COMPAT_43) PROC_LOCK(p); td->td_retval[1] = p->p_pptr->p_pid; PROC_UNLOCK(p); #endif return (0); } #ifndef _SYS_SYSPROTO_H_ struct getppid_args { int dummy; }; #endif /* ARGSUSED */ int getppid(struct thread *td, struct getppid_args *uap) { struct proc *p = td->td_proc; PROC_LOCK(p); td->td_retval[0] = p->p_pptr->p_pid; PROC_UNLOCK(p); return (0); } /* * Get process group ID; note that POSIX getpgrp takes no parameter. */ #ifndef _SYS_SYSPROTO_H_ struct getpgrp_args { int dummy; }; #endif int getpgrp(struct thread *td, struct getpgrp_args *uap) { struct proc *p = td->td_proc; PROC_LOCK(p); td->td_retval[0] = p->p_pgrp->pg_id; PROC_UNLOCK(p); return (0); } /* Get an arbitary pid's process group id */ #ifndef _SYS_SYSPROTO_H_ struct getpgid_args { pid_t pid; }; #endif int getpgid(struct thread *td, struct getpgid_args *uap) { struct proc *p; int error; if (uap->pid == 0) { p = td->td_proc; PROC_LOCK(p); } else { p = pfind(uap->pid); if (p == NULL) return (ESRCH); error = p_cansee(td, p); if (error) { PROC_UNLOCK(p); return (error); } } td->td_retval[0] = p->p_pgrp->pg_id; PROC_UNLOCK(p); return (0); } /* * Get an arbitary pid's session id. 
*/ #ifndef _SYS_SYSPROTO_H_ struct getsid_args { pid_t pid; }; #endif int getsid(struct thread *td, struct getsid_args *uap) { struct proc *p; int error; if (uap->pid == 0) { p = td->td_proc; PROC_LOCK(p); } else { p = pfind(uap->pid); if (p == NULL) return (ESRCH); error = p_cansee(td, p); if (error) { PROC_UNLOCK(p); return (error); } } td->td_retval[0] = p->p_session->s_sid; PROC_UNLOCK(p); return (0); } #ifndef _SYS_SYSPROTO_H_ struct getuid_args { int dummy; }; #endif /* ARGSUSED */ int getuid(struct thread *td, struct getuid_args *uap) { td->td_retval[0] = td->td_ucred->cr_ruid; #if defined(COMPAT_43) td->td_retval[1] = td->td_ucred->cr_uid; #endif return (0); } #ifndef _SYS_SYSPROTO_H_ struct geteuid_args { int dummy; }; #endif /* ARGSUSED */ int geteuid(struct thread *td, struct geteuid_args *uap) { td->td_retval[0] = td->td_ucred->cr_uid; return (0); } #ifndef _SYS_SYSPROTO_H_ struct getgid_args { int dummy; }; #endif /* ARGSUSED */ int getgid(struct thread *td, struct getgid_args *uap) { td->td_retval[0] = td->td_ucred->cr_rgid; #if defined(COMPAT_43) td->td_retval[1] = td->td_ucred->cr_groups[0]; #endif return (0); } /* * Get effective group ID. The "egid" is groups[0], and could be obtained * via getgroups. This syscall exists because it is somewhat painful to do * correctly in a library function. */ #ifndef _SYS_SYSPROTO_H_ struct getegid_args { int dummy; }; #endif /* ARGSUSED */ int getegid(struct thread *td, struct getegid_args *uap) { td->td_retval[0] = td->td_ucred->cr_groups[0]; return (0); } #ifndef _SYS_SYSPROTO_H_ struct getgroups_args { u_int gidsetsize; gid_t *gidset; }; #endif int getgroups(struct thread *td, register struct getgroups_args *uap) { gid_t groups[NGROUPS]; u_int ngrp; int error; ngrp = MIN(uap->gidsetsize, NGROUPS); error = kern_getgroups(td, &ngrp, groups); if (error) return (error); if (uap->gidsetsize > 0) error = copyout(groups, uap->gidset, ngrp * sizeof(gid_t)); if (error == 0) td->td_retval[0] = ngrp; return (error); } int kern_getgroups(struct thread *td, u_int *ngrp, gid_t *groups) { struct ucred *cred; cred = td->td_ucred; if (*ngrp == 0) { *ngrp = cred->cr_ngroups; return (0); } if (*ngrp < cred->cr_ngroups) return (EINVAL); *ngrp = cred->cr_ngroups; bcopy(cred->cr_groups, groups, *ngrp * sizeof(gid_t)); return (0); } #ifndef _SYS_SYSPROTO_H_ struct setsid_args { int dummy; }; #endif /* ARGSUSED */ int setsid(register struct thread *td, struct setsid_args *uap) { struct pgrp *pgrp; int error; struct proc *p = td->td_proc; struct pgrp *newpgrp; struct session *newsess; error = 0; pgrp = NULL; newpgrp = malloc(sizeof(struct pgrp), M_PGRP, M_WAITOK | M_ZERO); newsess = malloc(sizeof(struct session), M_SESSION, M_WAITOK | M_ZERO); sx_xlock(&proctree_lock); if (p->p_pgid == p->p_pid || (pgrp = pgfind(p->p_pid)) != NULL) { if (pgrp != NULL) PGRP_UNLOCK(pgrp); error = EPERM; } else { (void)enterpgrp(p, p->p_pid, newpgrp, newsess); td->td_retval[0] = p->p_pid; newpgrp = NULL; newsess = NULL; } sx_xunlock(&proctree_lock); if (newpgrp != NULL) free(newpgrp, M_PGRP); if (newsess != NULL) free(newsess, M_SESSION); return (error); } /* * set process group (setpgid/old setpgrp) * * caller does setpgid(targpid, targpgid) * * pid must be caller or child of caller (ESRCH) * if a child * pid must be in same session (EPERM) * pid can't have done an exec (EACCES) * if pgid != pid * there must exist some pid in same session having pgid (EPERM) * pid must not be session leader (EPERM) */ #ifndef _SYS_SYSPROTO_H_ struct setpgid_args { int pid; /* target process 
id */ int pgid; /* target pgrp id */ }; #endif /* ARGSUSED */ int setpgid(struct thread *td, register struct setpgid_args *uap) { struct proc *curp = td->td_proc; register struct proc *targp; /* target process */ register struct pgrp *pgrp; /* target pgrp */ int error; struct pgrp *newpgrp; if (uap->pgid < 0) return (EINVAL); error = 0; newpgrp = malloc(sizeof(struct pgrp), M_PGRP, M_WAITOK | M_ZERO); sx_xlock(&proctree_lock); if (uap->pid != 0 && uap->pid != curp->p_pid) { if ((targp = pfind(uap->pid)) == NULL) { error = ESRCH; goto done; } if (!inferior(targp)) { PROC_UNLOCK(targp); error = ESRCH; goto done; } if ((error = p_cansee(td, targp))) { PROC_UNLOCK(targp); goto done; } if (targp->p_pgrp == NULL || targp->p_session != curp->p_session) { PROC_UNLOCK(targp); error = EPERM; goto done; } if (targp->p_flag & P_EXEC) { PROC_UNLOCK(targp); error = EACCES; goto done; } PROC_UNLOCK(targp); } else targp = curp; if (SESS_LEADER(targp)) { error = EPERM; goto done; } if (uap->pgid == 0) uap->pgid = targp->p_pid; if ((pgrp = pgfind(uap->pgid)) == NULL) { if (uap->pgid == targp->p_pid) { error = enterpgrp(targp, uap->pgid, newpgrp, NULL); if (error == 0) newpgrp = NULL; } else error = EPERM; } else { if (pgrp == targp->p_pgrp) { PGRP_UNLOCK(pgrp); goto done; } if (pgrp->pg_id != targp->p_pid && pgrp->pg_session != curp->p_session) { PGRP_UNLOCK(pgrp); error = EPERM; goto done; } PGRP_UNLOCK(pgrp); error = enterthispgrp(targp, pgrp); } done: sx_xunlock(&proctree_lock); KASSERT((error == 0) || (newpgrp != NULL), ("setpgid failed and newpgrp is NULL")); if (newpgrp != NULL) free(newpgrp, M_PGRP); return (error); } /* * Use the clause in B.4.2.2 that allows setuid/setgid to be 4.2/4.3BSD * compatible. It says that setting the uid/gid to euid/egid is a special * case of "appropriate privilege". Once the rules are expanded out, this * basically means that setuid(nnn) sets all three id's, in all permitted * cases unless _POSIX_SAVED_IDS is enabled. In that case, setuid(getuid()) * does not set the saved id - this is dangerous for traditional BSD * programs. For this reason, we *really* do not want to set * _POSIX_SAVED_IDS and do not want to clear POSIX_APPENDIX_B_4_2_2. */ #define POSIX_APPENDIX_B_4_2_2 #ifndef _SYS_SYSPROTO_H_ struct setuid_args { uid_t uid; }; #endif /* ARGSUSED */ int setuid(struct thread *td, struct setuid_args *uap) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; uid_t uid; struct uidinfo *uip; int error; uid = uap->uid; AUDIT_ARG(uid, uid); newcred = crget(); uip = uifind(uid); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_setuid(oldcred, uid); if (error) goto fail; #endif /* * See if we have "permission" by POSIX 1003.1 rules. * * Note that setuid(geteuid()) is a special case of * "appropriate privileges" in appendix B.4.2.2. We need * to use this clause to be compatible with traditional BSD * semantics. Basically, it means that "setuid(xx)" sets all * three id's (assuming you have privs). * * Notes on the logic. We do things in three steps. * 1: We determine if the euid is going to change, and do EPERM * right away. We unconditionally change the euid later if this * test is satisfied, simplifying that part of the logic. * 2: We determine if the real and/or saved uids are going to * change. Determined by compile options. * 3: Change euid last. 
(after tests in #2 for "appropriate privs") */ if (uid != oldcred->cr_ruid && /* allow setuid(getuid()) */ #ifdef _POSIX_SAVED_IDS uid != oldcred->cr_svuid && /* allow setuid(saved gid) */ #endif #ifdef POSIX_APPENDIX_B_4_2_2 /* Use BSD-compat clause from B.4.2.2 */ uid != oldcred->cr_uid && /* allow setuid(geteuid()) */ #endif (error = priv_check_cred(oldcred, PRIV_CRED_SETUID, 0)) != 0) goto fail; /* * Copy credentials so other references do not see our changes. */ crcopy(newcred, oldcred); #ifdef _POSIX_SAVED_IDS /* * Do we have "appropriate privileges" (are we root or uid == euid) * If so, we are changing the real uid and/or saved uid. */ if ( #ifdef POSIX_APPENDIX_B_4_2_2 /* Use the clause from B.4.2.2 */ uid == oldcred->cr_uid || #endif /* We are using privs. */ priv_check_cred(oldcred, PRIV_CRED_SETUID, 0) == 0) #endif { /* * Set the real uid and transfer proc count to new user. */ if (uid != oldcred->cr_ruid) { change_ruid(newcred, uip); setsugid(p); } /* * Set saved uid * * XXX always set saved uid even if not _POSIX_SAVED_IDS, as * the security of seteuid() depends on it. B.4.2.2 says it * is important that we should do this. */ if (uid != oldcred->cr_svuid) { change_svuid(newcred, uid); setsugid(p); } } /* * In all permitted cases, we are changing the euid. */ if (uid != oldcred->cr_uid) { change_euid(newcred, uip); setsugid(p); } p->p_ucred = newcred; PROC_UNLOCK(p); uifree(uip); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); uifree(uip); crfree(newcred); return (error); } #ifndef _SYS_SYSPROTO_H_ struct seteuid_args { uid_t euid; }; #endif /* ARGSUSED */ int seteuid(struct thread *td, struct seteuid_args *uap) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; uid_t euid; struct uidinfo *euip; int error; euid = uap->euid; AUDIT_ARG(euid, euid); newcred = crget(); euip = uifind(euid); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_seteuid(oldcred, euid); if (error) goto fail; #endif if (euid != oldcred->cr_ruid && /* allow seteuid(getuid()) */ euid != oldcred->cr_svuid && /* allow seteuid(saved uid) */ (error = priv_check_cred(oldcred, PRIV_CRED_SETEUID, 0)) != 0) goto fail; /* * Everything's okay, do it. Copy credentials so other references do * not see our changes. */ crcopy(newcred, oldcred); if (oldcred->cr_uid != euid) { change_euid(newcred, euip); setsugid(p); } p->p_ucred = newcred; PROC_UNLOCK(p); uifree(euip); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); uifree(euip); crfree(newcred); return (error); } #ifndef _SYS_SYSPROTO_H_ struct setgid_args { gid_t gid; }; #endif /* ARGSUSED */ int setgid(struct thread *td, struct setgid_args *uap) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; gid_t gid; int error; gid = uap->gid; AUDIT_ARG(gid, gid); newcred = crget(); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_setgid(oldcred, gid); if (error) goto fail; #endif /* * See if we have "permission" by POSIX 1003.1 rules. * * Note that setgid(getegid()) is a special case of * "appropriate privileges" in appendix B.4.2.2. We need * to use this clause to be compatible with traditional BSD * semantics. Basically, it means that "setgid(xx)" sets all * three id's (assuming you have privs). * * For notes on the logic here, see setuid() above. 
*/ if (gid != oldcred->cr_rgid && /* allow setgid(getgid()) */ #ifdef _POSIX_SAVED_IDS gid != oldcred->cr_svgid && /* allow setgid(saved gid) */ #endif #ifdef POSIX_APPENDIX_B_4_2_2 /* Use BSD-compat clause from B.4.2.2 */ gid != oldcred->cr_groups[0] && /* allow setgid(getegid()) */ #endif (error = priv_check_cred(oldcred, PRIV_CRED_SETGID, 0)) != 0) goto fail; crcopy(newcred, oldcred); #ifdef _POSIX_SAVED_IDS /* * Do we have "appropriate privileges" (are we root or gid == egid) * If so, we are changing the real uid and saved gid. */ if ( #ifdef POSIX_APPENDIX_B_4_2_2 /* use the clause from B.4.2.2 */ gid == oldcred->cr_groups[0] || #endif /* We are using privs. */ priv_check_cred(oldcred, PRIV_CRED_SETGID, 0) == 0) #endif { /* * Set real gid */ if (oldcred->cr_rgid != gid) { change_rgid(newcred, gid); setsugid(p); } /* * Set saved gid * * XXX always set saved gid even if not _POSIX_SAVED_IDS, as * the security of setegid() depends on it. B.4.2.2 says it * is important that we should do this. */ if (oldcred->cr_svgid != gid) { change_svgid(newcred, gid); setsugid(p); } } /* * In all cases permitted cases, we are changing the egid. * Copy credentials so other references do not see our changes. */ if (oldcred->cr_groups[0] != gid) { change_egid(newcred, gid); setsugid(p); } p->p_ucred = newcred; PROC_UNLOCK(p); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); crfree(newcred); return (error); } #ifndef _SYS_SYSPROTO_H_ struct setegid_args { gid_t egid; }; #endif /* ARGSUSED */ int setegid(struct thread *td, struct setegid_args *uap) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; gid_t egid; int error; egid = uap->egid; AUDIT_ARG(egid, egid); newcred = crget(); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_setegid(oldcred, egid); if (error) goto fail; #endif if (egid != oldcred->cr_rgid && /* allow setegid(getgid()) */ egid != oldcred->cr_svgid && /* allow setegid(saved gid) */ (error = priv_check_cred(oldcred, PRIV_CRED_SETEGID, 0)) != 0) goto fail; crcopy(newcred, oldcred); if (oldcred->cr_groups[0] != egid) { change_egid(newcred, egid); setsugid(p); } p->p_ucred = newcred; PROC_UNLOCK(p); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); crfree(newcred); return (error); } #ifndef _SYS_SYSPROTO_H_ struct setgroups_args { u_int gidsetsize; gid_t *gidset; }; #endif /* ARGSUSED */ int setgroups(struct thread *td, struct setgroups_args *uap) { gid_t groups[NGROUPS]; int error; if (uap->gidsetsize > NGROUPS) return (EINVAL); error = copyin(uap->gidset, groups, uap->gidsetsize * sizeof(gid_t)); if (error) return (error); return (kern_setgroups(td, uap->gidsetsize, groups)); } int kern_setgroups(struct thread *td, u_int ngrp, gid_t *groups) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; int error; if (ngrp > NGROUPS) return (EINVAL); AUDIT_ARG(groupset, groups, ngrp); newcred = crget(); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_setgroups(oldcred, ngrp, groups); if (error) goto fail; #endif error = priv_check_cred(oldcred, PRIV_CRED_SETGROUPS, 0); if (error) goto fail; /* * XXX A little bit lazy here. We could test if anything has * changed before crcopy() and setting P_SUGID. */ crcopy(newcred, oldcred); if (ngrp < 1) { /* * setgroups(0, NULL) is a legitimate way of clearing the * groups vector on non-BSD systems (which generally do not * have the egid in the groups[0]). We risk security holes * when running non-BSD software if we do not do the same. 
*/ newcred->cr_ngroups = 1; } else { bcopy(groups, newcred->cr_groups, ngrp * sizeof(gid_t)); newcred->cr_ngroups = ngrp; } setsugid(p); p->p_ucred = newcred; PROC_UNLOCK(p); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); crfree(newcred); return (error); } #ifndef _SYS_SYSPROTO_H_ struct setreuid_args { uid_t ruid; uid_t euid; }; #endif /* ARGSUSED */ int setreuid(register struct thread *td, struct setreuid_args *uap) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; uid_t euid, ruid; struct uidinfo *euip, *ruip; int error; euid = uap->euid; ruid = uap->ruid; AUDIT_ARG(euid, euid); AUDIT_ARG(ruid, ruid); newcred = crget(); euip = uifind(euid); ruip = uifind(ruid); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_setreuid(oldcred, ruid, euid); if (error) goto fail; #endif if (((ruid != (uid_t)-1 && ruid != oldcred->cr_ruid && ruid != oldcred->cr_svuid) || (euid != (uid_t)-1 && euid != oldcred->cr_uid && euid != oldcred->cr_ruid && euid != oldcred->cr_svuid)) && (error = priv_check_cred(oldcred, PRIV_CRED_SETREUID, 0)) != 0) goto fail; crcopy(newcred, oldcred); if (euid != (uid_t)-1 && oldcred->cr_uid != euid) { change_euid(newcred, euip); setsugid(p); } if (ruid != (uid_t)-1 && oldcred->cr_ruid != ruid) { change_ruid(newcred, ruip); setsugid(p); } if ((ruid != (uid_t)-1 || newcred->cr_uid != newcred->cr_ruid) && newcred->cr_svuid != newcred->cr_uid) { change_svuid(newcred, newcred->cr_uid); setsugid(p); } p->p_ucred = newcred; PROC_UNLOCK(p); uifree(ruip); uifree(euip); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); uifree(ruip); uifree(euip); crfree(newcred); return (error); } #ifndef _SYS_SYSPROTO_H_ struct setregid_args { gid_t rgid; gid_t egid; }; #endif /* ARGSUSED */ int setregid(register struct thread *td, struct setregid_args *uap) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; gid_t egid, rgid; int error; egid = uap->egid; rgid = uap->rgid; AUDIT_ARG(egid, egid); AUDIT_ARG(rgid, rgid); newcred = crget(); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_setregid(oldcred, rgid, egid); if (error) goto fail; #endif if (((rgid != (gid_t)-1 && rgid != oldcred->cr_rgid && rgid != oldcred->cr_svgid) || (egid != (gid_t)-1 && egid != oldcred->cr_groups[0] && egid != oldcred->cr_rgid && egid != oldcred->cr_svgid)) && (error = priv_check_cred(oldcred, PRIV_CRED_SETREGID, 0)) != 0) goto fail; crcopy(newcred, oldcred); if (egid != (gid_t)-1 && oldcred->cr_groups[0] != egid) { change_egid(newcred, egid); setsugid(p); } if (rgid != (gid_t)-1 && oldcred->cr_rgid != rgid) { change_rgid(newcred, rgid); setsugid(p); } if ((rgid != (gid_t)-1 || newcred->cr_groups[0] != newcred->cr_rgid) && newcred->cr_svgid != newcred->cr_groups[0]) { change_svgid(newcred, newcred->cr_groups[0]); setsugid(p); } p->p_ucred = newcred; PROC_UNLOCK(p); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); crfree(newcred); return (error); } /* * setresuid(ruid, euid, suid) is like setreuid except control over the saved * uid is explicit. 
*/ #ifndef _SYS_SYSPROTO_H_ struct setresuid_args { uid_t ruid; uid_t euid; uid_t suid; }; #endif /* ARGSUSED */ int setresuid(register struct thread *td, struct setresuid_args *uap) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; uid_t euid, ruid, suid; struct uidinfo *euip, *ruip; int error; euid = uap->euid; ruid = uap->ruid; suid = uap->suid; AUDIT_ARG(euid, euid); AUDIT_ARG(ruid, ruid); AUDIT_ARG(suid, suid); newcred = crget(); euip = uifind(euid); ruip = uifind(ruid); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_setresuid(oldcred, ruid, euid, suid); if (error) goto fail; #endif if (((ruid != (uid_t)-1 && ruid != oldcred->cr_ruid && ruid != oldcred->cr_svuid && ruid != oldcred->cr_uid) || (euid != (uid_t)-1 && euid != oldcred->cr_ruid && euid != oldcred->cr_svuid && euid != oldcred->cr_uid) || (suid != (uid_t)-1 && suid != oldcred->cr_ruid && suid != oldcred->cr_svuid && suid != oldcred->cr_uid)) && (error = priv_check_cred(oldcred, PRIV_CRED_SETRESUID, 0)) != 0) goto fail; crcopy(newcred, oldcred); if (euid != (uid_t)-1 && oldcred->cr_uid != euid) { change_euid(newcred, euip); setsugid(p); } if (ruid != (uid_t)-1 && oldcred->cr_ruid != ruid) { change_ruid(newcred, ruip); setsugid(p); } if (suid != (uid_t)-1 && oldcred->cr_svuid != suid) { change_svuid(newcred, suid); setsugid(p); } p->p_ucred = newcred; PROC_UNLOCK(p); uifree(ruip); uifree(euip); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); uifree(ruip); uifree(euip); crfree(newcred); return (error); } /* * setresgid(rgid, egid, sgid) is like setregid except control over the saved * gid is explicit. */ #ifndef _SYS_SYSPROTO_H_ struct setresgid_args { gid_t rgid; gid_t egid; gid_t sgid; }; #endif /* ARGSUSED */ int setresgid(register struct thread *td, struct setresgid_args *uap) { struct proc *p = td->td_proc; struct ucred *newcred, *oldcred; gid_t egid, rgid, sgid; int error; egid = uap->egid; rgid = uap->rgid; sgid = uap->sgid; AUDIT_ARG(egid, egid); AUDIT_ARG(rgid, rgid); AUDIT_ARG(sgid, sgid); newcred = crget(); PROC_LOCK(p); oldcred = p->p_ucred; #ifdef MAC error = mac_cred_check_setresgid(oldcred, rgid, egid, sgid); if (error) goto fail; #endif if (((rgid != (gid_t)-1 && rgid != oldcred->cr_rgid && rgid != oldcred->cr_svgid && rgid != oldcred->cr_groups[0]) || (egid != (gid_t)-1 && egid != oldcred->cr_rgid && egid != oldcred->cr_svgid && egid != oldcred->cr_groups[0]) || (sgid != (gid_t)-1 && sgid != oldcred->cr_rgid && sgid != oldcred->cr_svgid && sgid != oldcred->cr_groups[0])) && (error = priv_check_cred(oldcred, PRIV_CRED_SETRESGID, 0)) != 0) goto fail; crcopy(newcred, oldcred); if (egid != (gid_t)-1 && oldcred->cr_groups[0] != egid) { change_egid(newcred, egid); setsugid(p); } if (rgid != (gid_t)-1 && oldcred->cr_rgid != rgid) { change_rgid(newcred, rgid); setsugid(p); } if (sgid != (gid_t)-1 && oldcred->cr_svgid != sgid) { change_svgid(newcred, sgid); setsugid(p); } p->p_ucred = newcred; PROC_UNLOCK(p); crfree(oldcred); return (0); fail: PROC_UNLOCK(p); crfree(newcred); return (error); } #ifndef _SYS_SYSPROTO_H_ struct getresuid_args { uid_t *ruid; uid_t *euid; uid_t *suid; }; #endif /* ARGSUSED */ int getresuid(register struct thread *td, struct getresuid_args *uap) { struct ucred *cred; int error1 = 0, error2 = 0, error3 = 0; cred = td->td_ucred; if (uap->ruid) error1 = copyout(&cred->cr_ruid, uap->ruid, sizeof(cred->cr_ruid)); if (uap->euid) error2 = copyout(&cred->cr_uid, uap->euid, sizeof(cred->cr_uid)); if (uap->suid) error3 = copyout(&cred->cr_svuid, uap->suid, 
sizeof(cred->cr_svuid)); return (error1 ? error1 : error2 ? error2 : error3); } #ifndef _SYS_SYSPROTO_H_ struct getresgid_args { gid_t *rgid; gid_t *egid; gid_t *sgid; }; #endif /* ARGSUSED */ int getresgid(register struct thread *td, struct getresgid_args *uap) { struct ucred *cred; int error1 = 0, error2 = 0, error3 = 0; cred = td->td_ucred; if (uap->rgid) error1 = copyout(&cred->cr_rgid, uap->rgid, sizeof(cred->cr_rgid)); if (uap->egid) error2 = copyout(&cred->cr_groups[0], uap->egid, sizeof(cred->cr_groups[0])); if (uap->sgid) error3 = copyout(&cred->cr_svgid, uap->sgid, sizeof(cred->cr_svgid)); return (error1 ? error1 : error2 ? error2 : error3); } #ifndef _SYS_SYSPROTO_H_ struct issetugid_args { int dummy; }; #endif /* ARGSUSED */ int issetugid(register struct thread *td, struct issetugid_args *uap) { struct proc *p = td->td_proc; /* * Note: OpenBSD sets a P_SUGIDEXEC flag set at execve() time, * we use P_SUGID because we consider changing the owners as * "tainting" as well. * This is significant for procs that start as root and "become" * a user without an exec - programs cannot know *everything* * that libc *might* have put in their data segment. */ PROC_LOCK(p); td->td_retval[0] = (p->p_flag & P_SUGID) ? 1 : 0; PROC_UNLOCK(p); return (0); } int __setugid(struct thread *td, struct __setugid_args *uap) { #ifdef REGRESSION struct proc *p; p = td->td_proc; switch (uap->flag) { case 0: PROC_LOCK(p); p->p_flag &= ~P_SUGID; PROC_UNLOCK(p); return (0); case 1: PROC_LOCK(p); p->p_flag |= P_SUGID; PROC_UNLOCK(p); return (0); default: return (EINVAL); } #else /* !REGRESSION */ return (ENOSYS); #endif /* REGRESSION */ } /* * Check if gid is a member of the group set. */ int groupmember(gid_t gid, struct ucred *cred) { register gid_t *gp; gid_t *egp; egp = &(cred->cr_groups[cred->cr_ngroups]); for (gp = cred->cr_groups; gp < egp; gp++) if (*gp == gid) return (1); return (0); } /* * Test the active securelevel against a given level. securelevel_gt() * implements (securelevel > level). securelevel_ge() implements * (securelevel >= level). Note that the logic is inverted -- these * functions return EPERM on "success" and 0 on "failure". * + * Due to care taken when setting the securelevel, we know that no jail will + * be less secure that its parent (or the physical system), so it is sufficient + * to test the current jail only. + * * XXXRW: Possibly since this has to do with privilege, it should move to * kern_priv.c. */ int securelevel_gt(struct ucred *cr, int level) { - int active_securelevel; - active_securelevel = securelevel; - KASSERT(cr != NULL, ("securelevel_gt: null cr")); - if (cr->cr_prison != NULL) - active_securelevel = imax(cr->cr_prison->pr_securelevel, - active_securelevel); - return (active_securelevel > level ? EPERM : 0); + return (cr->cr_prison->pr_securelevel > level ? EPERM : 0); } int securelevel_ge(struct ucred *cr, int level) { - int active_securelevel; - active_securelevel = securelevel; - KASSERT(cr != NULL, ("securelevel_ge: null cr")); - if (cr->cr_prison != NULL) - active_securelevel = imax(cr->cr_prison->pr_securelevel, - active_securelevel); - return (active_securelevel >= level ? EPERM : 0); + return (cr->cr_prison->pr_securelevel >= level ? EPERM : 0); } /* * 'see_other_uids' determines whether or not visibility of processes * and sockets with credentials holding different real uids is possible * using a variety of system MIBs. * XXX: data declarations should be together near the beginning of the file. 
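 *
 * Editor's note (usage sketch, not part of this change): the knob below is
 * exported as security.bsd.see_other_uids and may be toggled at runtime,
 * e.g. "sysctl security.bsd.see_other_uids=0", to hide processes and
 * sockets owned by other real uids from unprivileged users.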
*/ static int see_other_uids = 1; SYSCTL_INT(_security_bsd, OID_AUTO, see_other_uids, CTLFLAG_RW, &see_other_uids, 0, "Unprivileged processes may see subjects/objects with different real uid"); /*- * Determine if u1 "can see" the subject specified by u2, according to the * 'see_other_uids' policy. * Returns: 0 for permitted, ESRCH otherwise * Locks: none * References: *u1 and *u2 must not change during the call * u1 may equal u2, in which case only one reference is required */ static int cr_seeotheruids(struct ucred *u1, struct ucred *u2) { if (!see_other_uids && u1->cr_ruid != u2->cr_ruid) { if (priv_check_cred(u1, PRIV_SEEOTHERUIDS, 0) != 0) return (ESRCH); } return (0); } /* * 'see_other_gids' determines whether or not visibility of processes * and sockets with credentials holding different real gids is possible * using a variety of system MIBs. * XXX: data declarations should be together near the beginning of the file. */ static int see_other_gids = 1; SYSCTL_INT(_security_bsd, OID_AUTO, see_other_gids, CTLFLAG_RW, &see_other_gids, 0, "Unprivileged processes may see subjects/objects with different real gid"); /* * Determine if u1 can "see" the subject specified by u2, according to the * 'see_other_gids' policy. * Returns: 0 for permitted, ESRCH otherwise * Locks: none * References: *u1 and *u2 must not change during the call * u1 may equal u2, in which case only one reference is required */ static int cr_seeothergids(struct ucred *u1, struct ucred *u2) { int i, match; if (!see_other_gids) { match = 0; for (i = 0; i < u1->cr_ngroups; i++) { if (groupmember(u1->cr_groups[i], u2)) match = 1; if (match) break; } if (!match) { if (priv_check_cred(u1, PRIV_SEEOTHERGIDS, 0) != 0) return (ESRCH); } } return (0); } /*- * Determine if u1 "can see" the subject specified by u2. * Returns: 0 for permitted, an errno value otherwise * Locks: none * References: *u1 and *u2 must not change during the call * u1 may equal u2, in which case only one reference is required */ int cr_cansee(struct ucred *u1, struct ucred *u2) { int error; if ((error = prison_check(u1, u2))) return (error); #ifdef MAC if ((error = mac_cred_check_visible(u1, u2))) return (error); #endif if ((error = cr_seeotheruids(u1, u2))) return (error); if ((error = cr_seeothergids(u1, u2))) return (error); return (0); } /*- * Determine if td "can see" the subject specified by p. * Returns: 0 for permitted, an errno value otherwise * Locks: Sufficient locks to protect p->p_ucred must be held. td really * should be curthread. * References: td and p must be valid for the lifetime of the call */ int p_cansee(struct thread *td, struct proc *p) { /* Wrap cr_cansee() for all functionality. */ KASSERT(td == curthread, ("%s: td not curthread", __func__)); PROC_LOCK_ASSERT(p, MA_OWNED); return (cr_cansee(td->td_ucred, p->p_ucred)); } /* * 'conservative_signals' prevents the delivery of a broad class of * signals by unprivileged processes to processes that have changed their * credentials since the last invocation of execve(). This can prevent * the leakage of cached information or retained privileges as a result * of a common class of signal-related vulnerabilities. However, this * may interfere with some applications that expect to be able to * deliver these signals to peer processes after having given up * privilege. 
*/ static int conservative_signals = 1; SYSCTL_INT(_security_bsd, OID_AUTO, conservative_signals, CTLFLAG_RW, &conservative_signals, 0, "Unprivileged processes prevented from " "sending certain signals to processes whose credentials have changed"); /*- * Determine whether cred may deliver the specified signal to proc. * Returns: 0 for permitted, an errno value otherwise. * Locks: A lock must be held for proc. * References: cred and proc must be valid for the lifetime of the call. */ int cr_cansignal(struct ucred *cred, struct proc *proc, int signum) { int error; PROC_LOCK_ASSERT(proc, MA_OWNED); /* * Jail semantics limit the scope of signalling to proc in the * same jail as cred, if cred is in jail. */ error = prison_check(cred, proc->p_ucred); if (error) return (error); #ifdef MAC if ((error = mac_proc_check_signal(cred, proc, signum))) return (error); #endif if ((error = cr_seeotheruids(cred, proc->p_ucred))) return (error); if ((error = cr_seeothergids(cred, proc->p_ucred))) return (error); /* * UNIX signal semantics depend on the status of the P_SUGID * bit on the target process. If the bit is set, then additional * restrictions are placed on the set of available signals. */ if (conservative_signals && (proc->p_flag & P_SUGID)) { switch (signum) { case 0: case SIGKILL: case SIGINT: case SIGTERM: case SIGALRM: case SIGSTOP: case SIGTTIN: case SIGTTOU: case SIGTSTP: case SIGHUP: case SIGUSR1: case SIGUSR2: /* * Generally, permit job and terminal control * signals. */ break; default: /* Not permitted without privilege. */ error = priv_check_cred(cred, PRIV_SIGNAL_SUGID, 0); if (error) return (error); } } /* * Generally, the target credential's ruid or svuid must match the * subject credential's ruid or euid. */ if (cred->cr_ruid != proc->p_ucred->cr_ruid && cred->cr_ruid != proc->p_ucred->cr_svuid && cred->cr_uid != proc->p_ucred->cr_ruid && cred->cr_uid != proc->p_ucred->cr_svuid) { error = priv_check_cred(cred, PRIV_SIGNAL_DIFFCRED, 0); if (error) return (error); } return (0); } /*- * Determine whether td may deliver the specified signal to p. * Returns: 0 for permitted, an errno value otherwise * Locks: Sufficient locks to protect various components of td and p * must be held. td must be curthread, and a lock must be * held for p. * References: td and p must be valid for the lifetime of the call */ int p_cansignal(struct thread *td, struct proc *p, int signum) { KASSERT(td == curthread, ("%s: td not curthread", __func__)); PROC_LOCK_ASSERT(p, MA_OWNED); if (td->td_proc == p) return (0); /* * UNIX signalling semantics require that processes in the same * session always be able to deliver SIGCONT to one another, * overriding the remaining protections. */ /* XXX: This will require an additional lock of some sort. */ if (signum == SIGCONT && td->td_proc->p_session == p->p_session) return (0); /* * Some compat layers use SIGTHR and higher signals for * communication between different kernel threads of the same * process, so that they expect that it's always possible to * deliver them, even for suid applications where cr_cansignal() can * deny such ability for security consideration. It should be * pretty safe to do since the only way to create two processes * with the same p_leader is via rfork(2). */ if (td->td_proc->p_leader != NULL && signum >= SIGTHR && signum < SIGTHR + 4 && td->td_proc->p_leader == p->p_leader) return (0); return (cr_cansignal(td->td_ucred, p, signum)); } /*- * Determine whether td may reschedule p. 
* Returns: 0 for permitted, an errno value otherwise * Locks: Sufficient locks to protect various components of td and p * must be held. td must be curthread, and a lock must * be held for p. * References: td and p must be valid for the lifetime of the call */ int p_cansched(struct thread *td, struct proc *p) { int error; KASSERT(td == curthread, ("%s: td not curthread", __func__)); PROC_LOCK_ASSERT(p, MA_OWNED); if (td->td_proc == p) return (0); if ((error = prison_check(td->td_ucred, p->p_ucred))) return (error); #ifdef MAC if ((error = mac_proc_check_sched(td->td_ucred, p))) return (error); #endif if ((error = cr_seeotheruids(td->td_ucred, p->p_ucred))) return (error); if ((error = cr_seeothergids(td->td_ucred, p->p_ucred))) return (error); if (td->td_ucred->cr_ruid != p->p_ucred->cr_ruid && td->td_ucred->cr_uid != p->p_ucred->cr_ruid) { error = priv_check(td, PRIV_SCHED_DIFFCRED); if (error) return (error); } return (0); } /* * The 'unprivileged_proc_debug' flag may be used to disable a variety of * unprivileged inter-process debugging services, including some procfs * functionality, ptrace(), and ktrace(). In the past, inter-process * debugging has been involved in a variety of security problems, and sites * not requiring the service might choose to disable it when hardening * systems. * * XXX: Should modifying and reading this variable require locking? * XXX: data declarations should be together near the beginning of the file. */ static int unprivileged_proc_debug = 1; SYSCTL_INT(_security_bsd, OID_AUTO, unprivileged_proc_debug, CTLFLAG_RW, &unprivileged_proc_debug, 0, "Unprivileged processes may use process debugging facilities"); /*- * Determine whether td may debug p. * Returns: 0 for permitted, an errno value otherwise * Locks: Sufficient locks to protect various components of td and p * must be held. td must be curthread, and a lock must * be held for p. * References: td and p must be valid for the lifetime of the call */ int p_candebug(struct thread *td, struct proc *p) { int credentialchanged, error, grpsubset, i, uidsubset; KASSERT(td == curthread, ("%s: td not curthread", __func__)); PROC_LOCK_ASSERT(p, MA_OWNED); if (!unprivileged_proc_debug) { error = priv_check(td, PRIV_DEBUG_UNPRIV); if (error) return (error); } if (td->td_proc == p) return (0); if ((error = prison_check(td->td_ucred, p->p_ucred))) return (error); #ifdef MAC if ((error = mac_proc_check_debug(td->td_ucred, p))) return (error); #endif if ((error = cr_seeotheruids(td->td_ucred, p->p_ucred))) return (error); if ((error = cr_seeothergids(td->td_ucred, p->p_ucred))) return (error); /* * Is p's group set a subset of td's effective group set? This * includes p's egid, group access list, rgid, and svgid. */ grpsubset = 1; for (i = 0; i < p->p_ucred->cr_ngroups; i++) { if (!groupmember(p->p_ucred->cr_groups[i], td->td_ucred)) { grpsubset = 0; break; } } grpsubset = grpsubset && groupmember(p->p_ucred->cr_rgid, td->td_ucred) && groupmember(p->p_ucred->cr_svgid, td->td_ucred); /* * Are the uids present in p's credential equal to td's * effective uid? This includes p's euid, svuid, and ruid. */ uidsubset = (td->td_ucred->cr_uid == p->p_ucred->cr_uid && td->td_ucred->cr_uid == p->p_ucred->cr_svuid && td->td_ucred->cr_uid == p->p_ucred->cr_ruid); /* * Has the credential of the process changed since the last exec()? */ credentialchanged = (p->p_flag & P_SUGID); /* * If p's gids aren't a subset, or the uids aren't a subset, * or the credential has changed, require appropriate privilege * for td to debug p. 
*/ if (!grpsubset || !uidsubset) { error = priv_check(td, PRIV_DEBUG_DIFFCRED); if (error) return (error); } if (credentialchanged) { error = priv_check(td, PRIV_DEBUG_SUGID); if (error) return (error); } /* Can't trace init when securelevel > 0. */ if (p == initproc) { error = securelevel_gt(td->td_ucred, 0); if (error) return (error); } /* * Can't trace a process that's currently exec'ing. * * XXX: Note, this is not a security policy decision, it's a * basic correctness/functionality decision. Therefore, this check * should be moved to the caller's of p_candebug(). */ if ((p->p_flag & P_INEXEC) != 0) return (EBUSY); return (0); } /*- * Determine whether the subject represented by cred can "see" a socket. * Returns: 0 for permitted, ENOENT otherwise. */ int cr_canseesocket(struct ucred *cred, struct socket *so) { int error; error = prison_check(cred, so->so_cred); if (error) return (ENOENT); #ifdef MAC SOCK_LOCK(so); error = mac_socket_check_visible(cred, so); SOCK_UNLOCK(so); if (error) return (error); #endif if (cr_seeotheruids(cred, so->so_cred)) return (ENOENT); if (cr_seeothergids(cred, so->so_cred)) return (ENOENT); return (0); } #if defined(INET) || defined(INET6) /*- * Determine whether the subject represented by cred can "see" a socket. * Returns: 0 for permitted, ENOENT otherwise. */ int cr_canseeinpcb(struct ucred *cred, struct inpcb *inp) { int error; error = prison_check(cred, inp->inp_cred); if (error) return (ENOENT); #ifdef MAC INP_LOCK_ASSERT(inp); error = mac_inpcb_check_visible(cred, inp); if (error) return (error); #endif if (cr_seeotheruids(cred, inp->inp_cred)) return (ENOENT); if (cr_seeothergids(cred, inp->inp_cred)) return (ENOENT); return (0); } #endif /*- * Determine whether td can wait for the exit of p. * Returns: 0 for permitted, an errno value otherwise * Locks: Sufficient locks to protect various components of td and p * must be held. td must be curthread, and a lock must * be held for p. * References: td and p must be valid for the lifetime of the call */ int p_canwait(struct thread *td, struct proc *p) { int error; KASSERT(td == curthread, ("%s: td not curthread", __func__)); PROC_LOCK_ASSERT(p, MA_OWNED); if ((error = prison_check(td->td_ucred, p->p_ucred))) return (error); #ifdef MAC if ((error = mac_proc_check_wait(td->td_ucred, p))) return (error); #endif #if 0 /* XXXMAC: This could have odd effects on some shells. */ if ((error = cr_seeotheruids(td->td_ucred, p->p_ucred))) return (error); #endif return (0); } /* * Allocate a zeroed cred structure. */ struct ucred * crget(void) { register struct ucred *cr; cr = malloc(sizeof(*cr), M_CRED, M_WAITOK | M_ZERO); refcount_init(&cr->cr_ref, 1); #ifdef AUDIT audit_cred_init(cr); #endif #ifdef MAC mac_cred_init(cr); #endif return (cr); } /* * Claim another reference to a ucred structure. */ struct ucred * crhold(struct ucred *cr) { refcount_acquire(&cr->cr_ref); return (cr); } /* * Free a cred structure. Throws away space when ref count gets to 0. */ void crfree(struct ucred *cr) { KASSERT(cr->cr_ref > 0, ("bad ucred refcount: %d", cr->cr_ref)); KASSERT(cr->cr_ref != 0xdeadc0de, ("dangling reference to ucred")); if (refcount_release(&cr->cr_ref)) { /* * Some callers of crget(), such as nfs_statfs(), * allocate a temporary credential, but don't * allocate a uidinfo structure. */ if (cr->cr_uidinfo != NULL) uifree(cr->cr_uidinfo); if (cr->cr_ruidinfo != NULL) uifree(cr->cr_ruidinfo); /* * Free a prison, if any. 
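 *
 * Editor's note: with the hierarchical-jail change in this revision,
 * crcopy() below takes a prison reference unconditionally via
 * prison_hold(), so the release here tests cr_prison directly instead of
 * the old jailed() predicate.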
*/ - if (jailed(cr)) + if (cr->cr_prison != NULL) prison_free(cr->cr_prison); #ifdef VIMAGE /* XXX TODO: find out why and when cr_vimage can be NULL here! */ if (cr->cr_vimage != NULL) refcount_release(&cr->cr_vimage->vi_ucredrefc); #endif #ifdef AUDIT audit_cred_destroy(cr); #endif #ifdef MAC mac_cred_destroy(cr); #endif free(cr, M_CRED); } } /* * Check to see if this ucred is shared. */ int crshared(struct ucred *cr) { return (cr->cr_ref > 1); } /* * Copy a ucred's contents from a template. Does not block. */ void crcopy(struct ucred *dest, struct ucred *src) { KASSERT(crshared(dest) == 0, ("crcopy of shared ucred")); bcopy(&src->cr_startcopy, &dest->cr_startcopy, (unsigned)((caddr_t)&src->cr_endcopy - (caddr_t)&src->cr_startcopy)); uihold(dest->cr_uidinfo); uihold(dest->cr_ruidinfo); - if (jailed(dest)) - prison_hold(dest->cr_prison); + prison_hold(dest->cr_prison); #ifdef VIMAGE KASSERT(src->cr_vimage != NULL, ("cr_vimage == NULL")); refcount_acquire(&dest->cr_vimage->vi_ucredrefc); #endif #ifdef AUDIT audit_cred_copy(src, dest); #endif #ifdef MAC mac_cred_copy(src, dest); #endif } /* * Dup cred struct to a new held one. */ struct ucred * crdup(struct ucred *cr) { struct ucred *newcr; newcr = crget(); crcopy(newcr, cr); return (newcr); } /* * Fill in a struct xucred based on a struct ucred. */ void cru2x(struct ucred *cr, struct xucred *xcr) { bzero(xcr, sizeof(*xcr)); xcr->cr_version = XUCRED_VERSION; xcr->cr_uid = cr->cr_uid; xcr->cr_ngroups = cr->cr_ngroups; bcopy(cr->cr_groups, xcr->cr_groups, sizeof(cr->cr_groups)); } /* * small routine to swap a thread's current ucred for the correct one taken * from the process. */ void cred_update_thread(struct thread *td) { struct proc *p; struct ucred *cred; p = td->td_proc; cred = td->td_ucred; PROC_LOCK(p); td->td_ucred = crhold(p->p_ucred); PROC_UNLOCK(p); if (cred != NULL) crfree(cred); } /* * Get login name, if available. */ #ifndef _SYS_SYSPROTO_H_ struct getlogin_args { char *namebuf; u_int namelen; }; #endif /* ARGSUSED */ int getlogin(struct thread *td, struct getlogin_args *uap) { int error; char login[MAXLOGNAME]; struct proc *p = td->td_proc; if (uap->namelen > MAXLOGNAME) uap->namelen = MAXLOGNAME; PROC_LOCK(p); SESS_LOCK(p->p_session); bcopy(p->p_session->s_login, login, uap->namelen); SESS_UNLOCK(p->p_session); PROC_UNLOCK(p); error = copyout(login, uap->namebuf, uap->namelen); return(error); } /* * Set login name. */ #ifndef _SYS_SYSPROTO_H_ struct setlogin_args { char *namebuf; }; #endif /* ARGSUSED */ int setlogin(struct thread *td, struct setlogin_args *uap) { struct proc *p = td->td_proc; int error; char logintmp[MAXLOGNAME]; error = priv_check(td, PRIV_PROC_SETLOGIN); if (error) return (error); error = copyinstr(uap->namebuf, logintmp, sizeof(logintmp), NULL); if (error == ENAMETOOLONG) error = EINVAL; else if (!error) { PROC_LOCK(p); SESS_LOCK(p->p_session); (void) memcpy(p->p_session->s_login, logintmp, sizeof(logintmp)); SESS_UNLOCK(p->p_session); PROC_UNLOCK(p); } return (error); } void setsugid(struct proc *p) { PROC_LOCK_ASSERT(p, MA_OWNED); p->p_flag |= P_SUGID; if (!(p->p_pfsflags & PF_ISUGID)) p->p_stops = 0; } /*- * Change a process's effective uid. * Side effects: newcred->cr_uid and newcred->cr_uidinfo will be modified. * References: newcred must be an exclusive credential reference for the * duration of the call. 
*/ void change_euid(struct ucred *newcred, struct uidinfo *euip) { newcred->cr_uid = euip->ui_uid; uihold(euip); uifree(newcred->cr_uidinfo); newcred->cr_uidinfo = euip; } /*- * Change a process's effective gid. * Side effects: newcred->cr_gid will be modified. * References: newcred must be an exclusive credential reference for the * duration of the call. */ void change_egid(struct ucred *newcred, gid_t egid) { newcred->cr_groups[0] = egid; } /*- * Change a process's real uid. * Side effects: newcred->cr_ruid will be updated, newcred->cr_ruidinfo * will be updated, and the old and new cr_ruidinfo proc * counts will be updated. * References: newcred must be an exclusive credential reference for the * duration of the call. */ void change_ruid(struct ucred *newcred, struct uidinfo *ruip) { (void)chgproccnt(newcred->cr_ruidinfo, -1, 0); newcred->cr_ruid = ruip->ui_uid; uihold(ruip); uifree(newcred->cr_ruidinfo); newcred->cr_ruidinfo = ruip; (void)chgproccnt(newcred->cr_ruidinfo, 1, 0); } /*- * Change a process's real gid. * Side effects: newcred->cr_rgid will be updated. * References: newcred must be an exclusive credential reference for the * duration of the call. */ void change_rgid(struct ucred *newcred, gid_t rgid) { newcred->cr_rgid = rgid; } /*- * Change a process's saved uid. * Side effects: newcred->cr_svuid will be updated. * References: newcred must be an exclusive credential reference for the * duration of the call. */ void change_svuid(struct ucred *newcred, uid_t svuid) { newcred->cr_svuid = svuid; } /*- * Change a process's saved gid. * Side effects: newcred->cr_svgid will be updated. * References: newcred must be an exclusive credential reference for the * duration of the call. */ void change_svgid(struct ucred *newcred, gid_t svgid) { newcred->cr_svgid = svgid; } Index: head/sys/kern/sysv_msg.c =================================================================== --- head/sys/kern/sysv_msg.c (revision 192894) +++ head/sys/kern/sysv_msg.c (revision 192895) @@ -1,1293 +1,1293 @@ /*- * Implementation of SVID messages * * Author: Daniel Boulet * * Copyright 1993 Daniel Boulet and RTMX Inc. * * This system call was implemented by Daniel Boulet under contract from RTMX. * * Redistribution and use in source forms, with and without modification, * are permitted provided that this entire comment appears intact. * * Redistribution in binary form may occur without any restrictions. * Obviously, it would be nice if you gave credit where credit is due * but requiring it would be too onerous. * * This software is provided ``AS IS'' without any warranties of any kind. */ /*- * Copyright (c) 2003-2005 McAfee, Inc. * All rights reserved. * * This software was developed for the FreeBSD Project in part by McAfee * Research, the Security Research Division of McAfee, Inc under DARPA/SPAWAR * contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA CHATS research * program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. 
* * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include "opt_sysvipc.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static MALLOC_DEFINE(M_MSG, "msg", "SVID compatible message queues"); static void msginit(void); static int msgunload(void); static int sysvmsg_modload(struct module *, int, void *); #ifdef MSG_DEBUG #define DPRINTF(a) printf a #else #define DPRINTF(a) #endif static void msg_freehdr(struct msg *msghdr); /* XXX casting to (sy_call_t *) is bogus, as usual. */ static sy_call_t *msgcalls[] = { (sy_call_t *)msgctl, (sy_call_t *)msgget, (sy_call_t *)msgsnd, (sy_call_t *)msgrcv }; #ifndef MSGSSZ #define MSGSSZ 8 /* Each segment must be 2^N long */ #endif #ifndef MSGSEG #define MSGSEG 2048 /* must be less than 32767 */ #endif #define MSGMAX (MSGSSZ*MSGSEG) #ifndef MSGMNB #define MSGMNB 2048 /* max # of bytes in a queue */ #endif #ifndef MSGMNI #define MSGMNI 40 #endif #ifndef MSGTQL #define MSGTQL 40 #endif /* * Based on the configuration parameters described in an SVR2 (yes, two) * config(1m) man page. * * Each message is broken up and stored in segments that are msgssz bytes * long. For efficiency reasons, this should be a power of two. Also, * it doesn't make sense if it is less than 8 or greater than about 256. * Consequently, msginit in kern/sysv_msg.c checks that msgssz is a power of * two between 8 and 1024 inclusive (and panic's if it isn't). */ struct msginfo msginfo = { MSGMAX, /* max chars in a message */ MSGMNI, /* # of message queue identifiers */ MSGMNB, /* max chars in a queue */ MSGTQL, /* max messages in system */ MSGSSZ, /* size of a message segment */ /* (must be small power of 2 greater than 4) */ MSGSEG /* number of message segments */ }; /* * macros to convert between msqid_ds's and msqid's. * (specific to this implementation) */ #define MSQID(ix,ds) ((ix) & 0xffff | (((ds).msg_perm.seq << 16) & 0xffff0000)) #define MSQID_IX(id) ((id) & 0xffff) #define MSQID_SEQ(id) (((id) >> 16) & 0xffff) /* * The rest of this file is specific to this particular implementation. */ struct msgmap { short next; /* next segment in buffer */ /* -1 -> available */ /* 0..(MSGSEG-1) -> index of next segment */ }; #define MSG_LOCKED 01000 /* Is this msqid_ds locked? 
*/ static int nfree_msgmaps; /* # of free map entries */ static short free_msgmaps; /* head of linked list of free map entries */ static struct msg *free_msghdrs;/* list of free msg headers */ static char *msgpool; /* MSGMAX byte long msg buffer pool */ static struct msgmap *msgmaps; /* MSGSEG msgmap structures */ static struct msg *msghdrs; /* MSGTQL msg headers */ static struct msqid_kernel *msqids; /* MSGMNI msqid_kernel struct's */ static struct mtx msq_mtx; /* global mutex for message queues. */ static void msginit() { register int i; TUNABLE_INT_FETCH("kern.ipc.msgseg", &msginfo.msgseg); TUNABLE_INT_FETCH("kern.ipc.msgssz", &msginfo.msgssz); msginfo.msgmax = msginfo.msgseg * msginfo.msgssz; TUNABLE_INT_FETCH("kern.ipc.msgmni", &msginfo.msgmni); TUNABLE_INT_FETCH("kern.ipc.msgmnb", &msginfo.msgmnb); TUNABLE_INT_FETCH("kern.ipc.msgtql", &msginfo.msgtql); msgpool = malloc(msginfo.msgmax, M_MSG, M_WAITOK); if (msgpool == NULL) panic("msgpool is NULL"); msgmaps = malloc(sizeof(struct msgmap) * msginfo.msgseg, M_MSG, M_WAITOK); if (msgmaps == NULL) panic("msgmaps is NULL"); msghdrs = malloc(sizeof(struct msg) * msginfo.msgtql, M_MSG, M_WAITOK); if (msghdrs == NULL) panic("msghdrs is NULL"); msqids = malloc(sizeof(struct msqid_kernel) * msginfo.msgmni, M_MSG, M_WAITOK); if (msqids == NULL) panic("msqids is NULL"); /* * msginfo.msgssz should be a power of two for efficiency reasons. * It is also pretty silly if msginfo.msgssz is less than 8 * or greater than about 256 so ... */ i = 8; while (i < 1024 && i != msginfo.msgssz) i <<= 1; if (i != msginfo.msgssz) { DPRINTF(("msginfo.msgssz=%d (0x%x)\n", msginfo.msgssz, msginfo.msgssz)); panic("msginfo.msgssz not a small power of 2"); } if (msginfo.msgseg > 32767) { DPRINTF(("msginfo.msgseg=%d\n", msginfo.msgseg)); panic("msginfo.msgseg > 32767"); } if (msgmaps == NULL) panic("msgmaps is NULL"); for (i = 0; i < msginfo.msgseg; i++) { if (i > 0) msgmaps[i-1].next = i; msgmaps[i].next = -1; /* implies entry is available */ } free_msgmaps = 0; nfree_msgmaps = msginfo.msgseg; if (msghdrs == NULL) panic("msghdrs is NULL"); for (i = 0; i < msginfo.msgtql; i++) { msghdrs[i].msg_type = 0; if (i > 0) msghdrs[i-1].msg_next = &msghdrs[i]; msghdrs[i].msg_next = NULL; #ifdef MAC mac_sysvmsg_init(&msghdrs[i]); #endif } free_msghdrs = &msghdrs[0]; if (msqids == NULL) panic("msqids is NULL"); for (i = 0; i < msginfo.msgmni; i++) { msqids[i].u.msg_qbytes = 0; /* implies entry is available */ msqids[i].u.msg_perm.seq = 0; /* reset to a known value */ msqids[i].u.msg_perm.mode = 0; #ifdef MAC mac_sysvmsq_init(&msqids[i]); #endif } mtx_init(&msq_mtx, "msq", NULL, MTX_DEF); } static int msgunload() { struct msqid_kernel *msqkptr; int msqid; #ifdef MAC int i; #endif for (msqid = 0; msqid < msginfo.msgmni; msqid++) { /* * Look for an unallocated and unlocked msqid_ds. * msqid_ds's can be locked by msgsnd or msgrcv while * they are copying the message in/out. We can't * re-use the entry until they release it. 
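 * (In this unload path the loop is effectively reused to detect whether
 *  any msqid_ds is still allocated or locked; if one is found, the module
 *  refuses to unload and EBUSY is returned just below.)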
*/ msqkptr = &msqids[msqid]; if (msqkptr->u.msg_qbytes != 0 || (msqkptr->u.msg_perm.mode & MSG_LOCKED) != 0) break; } if (msqid != msginfo.msgmni) return (EBUSY); #ifdef MAC for (i = 0; i < msginfo.msgtql; i++) mac_sysvmsg_destroy(&msghdrs[i]); for (msqid = 0; msqid < msginfo.msgmni; msqid++) mac_sysvmsq_destroy(&msqids[msqid]); #endif free(msgpool, M_MSG); free(msgmaps, M_MSG); free(msghdrs, M_MSG); free(msqids, M_MSG); mtx_destroy(&msq_mtx); return (0); } static int sysvmsg_modload(struct module *module, int cmd, void *arg) { int error = 0; switch (cmd) { case MOD_LOAD: msginit(); break; case MOD_UNLOAD: error = msgunload(); break; case MOD_SHUTDOWN: break; default: error = EINVAL; break; } return (error); } static moduledata_t sysvmsg_mod = { "sysvmsg", &sysvmsg_modload, NULL }; SYSCALL_MODULE_HELPER(msgsys); SYSCALL_MODULE_HELPER(msgctl); SYSCALL_MODULE_HELPER(msgget); SYSCALL_MODULE_HELPER(msgsnd); SYSCALL_MODULE_HELPER(msgrcv); DECLARE_MODULE(sysvmsg, sysvmsg_mod, SI_SUB_SYSV_MSG, SI_ORDER_FIRST); MODULE_VERSION(sysvmsg, 1); /* * Entry point for all MSG calls. */ int msgsys(td, uap) struct thread *td; /* XXX actually varargs. */ struct msgsys_args /* { int which; int a2; int a3; int a4; int a5; int a6; } */ *uap; { int error; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); if (uap->which < 0 || uap->which >= sizeof(msgcalls)/sizeof(msgcalls[0])) return (EINVAL); error = (*msgcalls[uap->which])(td, &uap->a2); return (error); } static void msg_freehdr(msghdr) struct msg *msghdr; { while (msghdr->msg_ts > 0) { short next; if (msghdr->msg_spot < 0 || msghdr->msg_spot >= msginfo.msgseg) panic("msghdr->msg_spot out of range"); next = msgmaps[msghdr->msg_spot].next; msgmaps[msghdr->msg_spot].next = free_msgmaps; free_msgmaps = msghdr->msg_spot; nfree_msgmaps++; msghdr->msg_spot = next; if (msghdr->msg_ts >= msginfo.msgssz) msghdr->msg_ts -= msginfo.msgssz; else msghdr->msg_ts = 0; } if (msghdr->msg_spot != -1) panic("msghdr->msg_spot != -1"); msghdr->msg_next = free_msghdrs; free_msghdrs = msghdr; #ifdef MAC mac_sysvmsg_cleanup(msghdr); #endif } #ifndef _SYS_SYSPROTO_H_ struct msgctl_args { int msqid; int cmd; struct msqid_ds *buf; }; #endif int msgctl(td, uap) struct thread *td; register struct msgctl_args *uap; { int msqid = uap->msqid; int cmd = uap->cmd; struct msqid_ds msqbuf; int error; DPRINTF(("call to msgctl(%d, %d, %p)\n", msqid, cmd, uap->buf)); if (cmd == IPC_SET && (error = copyin(uap->buf, &msqbuf, sizeof(msqbuf))) != 0) return (error); error = kern_msgctl(td, msqid, cmd, &msqbuf); if (cmd == IPC_STAT && error == 0) error = copyout(&msqbuf, uap->buf, sizeof(struct msqid_ds)); return (error); } int kern_msgctl(td, msqid, cmd, msqbuf) struct thread *td; int msqid; int cmd; struct msqid_ds *msqbuf; { int rval, error, msqix; register struct msqid_kernel *msqkptr; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); msqix = IPCID_TO_IX(msqid); if (msqix < 0 || msqix >= msginfo.msgmni) { DPRINTF(("msqid (%d) out of range (0<=msqid<%d)\n", msqix, msginfo.msgmni)); return (EINVAL); } msqkptr = &msqids[msqix]; mtx_lock(&msq_mtx); if (msqkptr->u.msg_qbytes == 0) { DPRINTF(("no such msqid\n")); error = EINVAL; goto done2; } if (msqkptr->u.msg_perm.seq != IPCID_TO_SEQ(msqid)) { DPRINTF(("wrong sequence number\n")); error = EINVAL; goto done2; } #ifdef MAC error = mac_sysvmsq_check_msqctl(td->td_ucred, msqkptr, cmd); if (error != 0) goto done2; 
#endif error = 0; rval = 0; switch (cmd) { case IPC_RMID: { struct msg *msghdr; if ((error = ipcperm(td, &msqkptr->u.msg_perm, IPC_M))) goto done2; #ifdef MAC /* * Check that the thread has MAC access permissions to * individual msghdrs. Note: We need to do this in a * separate loop because the actual loop alters the * msq/msghdr info as it progresses, and there is no going * back if half the way through we discover that the * thread cannot free a certain msghdr. The msq will get * into an inconsistent state. */ for (msghdr = msqkptr->u.msg_first; msghdr != NULL; msghdr = msghdr->msg_next) { error = mac_sysvmsq_check_msgrmid(td->td_ucred, msghdr); if (error != 0) goto done2; } #endif /* Free the message headers */ msghdr = msqkptr->u.msg_first; while (msghdr != NULL) { struct msg *msghdr_tmp; /* Free the segments of each message */ msqkptr->u.msg_cbytes -= msghdr->msg_ts; msqkptr->u.msg_qnum--; msghdr_tmp = msghdr; msghdr = msghdr->msg_next; msg_freehdr(msghdr_tmp); } if (msqkptr->u.msg_cbytes != 0) panic("msg_cbytes is screwed up"); if (msqkptr->u.msg_qnum != 0) panic("msg_qnum is screwed up"); msqkptr->u.msg_qbytes = 0; /* Mark it as free */ #ifdef MAC mac_sysvmsq_cleanup(msqkptr); #endif wakeup(msqkptr); } break; case IPC_SET: if ((error = ipcperm(td, &msqkptr->u.msg_perm, IPC_M))) goto done2; if (msqbuf->msg_qbytes > msqkptr->u.msg_qbytes) { error = priv_check(td, PRIV_IPC_MSGSIZE); if (error) goto done2; } if (msqbuf->msg_qbytes > msginfo.msgmnb) { DPRINTF(("can't increase msg_qbytes beyond %d" "(truncating)\n", msginfo.msgmnb)); msqbuf->msg_qbytes = msginfo.msgmnb; /* silently restrict qbytes to system limit */ } if (msqbuf->msg_qbytes == 0) { DPRINTF(("can't reduce msg_qbytes to 0\n")); error = EINVAL; /* non-standard errno! */ goto done2; } msqkptr->u.msg_perm.uid = msqbuf->msg_perm.uid; /* change the owner */ msqkptr->u.msg_perm.gid = msqbuf->msg_perm.gid; /* change the owner */ msqkptr->u.msg_perm.mode = (msqkptr->u.msg_perm.mode & ~0777) | (msqbuf->msg_perm.mode & 0777); msqkptr->u.msg_qbytes = msqbuf->msg_qbytes; msqkptr->u.msg_ctime = time_second; break; case IPC_STAT: if ((error = ipcperm(td, &msqkptr->u.msg_perm, IPC_R))) { DPRINTF(("requester doesn't have read access\n")); goto done2; } *msqbuf = msqkptr->u; break; default: DPRINTF(("invalid command %d\n", cmd)); error = EINVAL; goto done2; } if (error == 0) td->td_retval[0] = rval; done2: mtx_unlock(&msq_mtx); return (error); } #ifndef _SYS_SYSPROTO_H_ struct msgget_args { key_t key; int msgflg; }; #endif int msgget(td, uap) struct thread *td; register struct msgget_args *uap; { int msqid, error = 0; int key = uap->key; int msgflg = uap->msgflg; struct ucred *cred = td->td_ucred; register struct msqid_kernel *msqkptr = NULL; DPRINTF(("msgget(0x%x, 0%o)\n", key, msgflg)); - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); mtx_lock(&msq_mtx); if (key != IPC_PRIVATE) { for (msqid = 0; msqid < msginfo.msgmni; msqid++) { msqkptr = &msqids[msqid]; if (msqkptr->u.msg_qbytes != 0 && msqkptr->u.msg_perm.key == key) break; } if (msqid < msginfo.msgmni) { DPRINTF(("found public key\n")); if ((msgflg & IPC_CREAT) && (msgflg & IPC_EXCL)) { DPRINTF(("not exclusive\n")); error = EEXIST; goto done2; } if ((error = ipcperm(td, &msqkptr->u.msg_perm, msgflg & 0700))) { DPRINTF(("requester doesn't have 0%o access\n", msgflg & 0700)); goto done2; } #ifdef MAC error = mac_sysvmsq_check_msqget(cred, msqkptr); if (error != 0) goto done2; #endif goto found; } } DPRINTF(("need 
to allocate the msqid_ds\n")); if (key == IPC_PRIVATE || (msgflg & IPC_CREAT)) { for (msqid = 0; msqid < msginfo.msgmni; msqid++) { /* * Look for an unallocated and unlocked msqid_ds. * msqid_ds's can be locked by msgsnd or msgrcv while * they are copying the message in/out. We can't * re-use the entry until they release it. */ msqkptr = &msqids[msqid]; if (msqkptr->u.msg_qbytes == 0 && (msqkptr->u.msg_perm.mode & MSG_LOCKED) == 0) break; } if (msqid == msginfo.msgmni) { DPRINTF(("no more msqid_ds's available\n")); error = ENOSPC; goto done2; } DPRINTF(("msqid %d is available\n", msqid)); msqkptr->u.msg_perm.key = key; msqkptr->u.msg_perm.cuid = cred->cr_uid; msqkptr->u.msg_perm.uid = cred->cr_uid; msqkptr->u.msg_perm.cgid = cred->cr_gid; msqkptr->u.msg_perm.gid = cred->cr_gid; msqkptr->u.msg_perm.mode = (msgflg & 0777); /* Make sure that the returned msqid is unique */ msqkptr->u.msg_perm.seq = (msqkptr->u.msg_perm.seq + 1) & 0x7fff; msqkptr->u.msg_first = NULL; msqkptr->u.msg_last = NULL; msqkptr->u.msg_cbytes = 0; msqkptr->u.msg_qnum = 0; msqkptr->u.msg_qbytes = msginfo.msgmnb; msqkptr->u.msg_lspid = 0; msqkptr->u.msg_lrpid = 0; msqkptr->u.msg_stime = 0; msqkptr->u.msg_rtime = 0; msqkptr->u.msg_ctime = time_second; #ifdef MAC mac_sysvmsq_create(cred, msqkptr); #endif } else { DPRINTF(("didn't find it and wasn't asked to create it\n")); error = ENOENT; goto done2; } found: /* Construct the unique msqid */ td->td_retval[0] = IXSEQ_TO_IPCID(msqid, msqkptr->u.msg_perm); done2: mtx_unlock(&msq_mtx); return (error); } #ifndef _SYS_SYSPROTO_H_ struct msgsnd_args { int msqid; const void *msgp; size_t msgsz; int msgflg; }; #endif int kern_msgsnd(td, msqid, msgp, msgsz, msgflg, mtype) struct thread *td; int msqid; const void *msgp; /* XXX msgp is actually mtext. 
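 * An illustrative, hypothetical caller (not part of this change), showing
 * why this argument is the message text only -- msgsnd(2) further below
 * copies in the leading mtype separately and passes the address just past
 * it to kern_msgsnd():
 *
 *	struct { long mtype; char mtext[64]; } m = { 1, "hello" };
 *	(void)msgsnd(id, &m, sizeof(m.mtext), 0);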
*/ size_t msgsz; int msgflg; long mtype; { int msqix, segs_needed, error = 0; register struct msqid_kernel *msqkptr; register struct msg *msghdr; short next; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); mtx_lock(&msq_mtx); msqix = IPCID_TO_IX(msqid); if (msqix < 0 || msqix >= msginfo.msgmni) { DPRINTF(("msqid (%d) out of range (0<=msqid<%d)\n", msqix, msginfo.msgmni)); error = EINVAL; goto done2; } msqkptr = &msqids[msqix]; if (msqkptr->u.msg_qbytes == 0) { DPRINTF(("no such message queue id\n")); error = EINVAL; goto done2; } if (msqkptr->u.msg_perm.seq != IPCID_TO_SEQ(msqid)) { DPRINTF(("wrong sequence number\n")); error = EINVAL; goto done2; } if ((error = ipcperm(td, &msqkptr->u.msg_perm, IPC_W))) { DPRINTF(("requester doesn't have write access\n")); goto done2; } #ifdef MAC error = mac_sysvmsq_check_msqsnd(td->td_ucred, msqkptr); if (error != 0) goto done2; #endif segs_needed = (msgsz + msginfo.msgssz - 1) / msginfo.msgssz; DPRINTF(("msgsz=%zu, msgssz=%d, segs_needed=%d\n", msgsz, msginfo.msgssz, segs_needed)); for (;;) { int need_more_resources = 0; /* * check msgsz * (inside this loop in case msg_qbytes changes while we sleep) */ if (msgsz > msqkptr->u.msg_qbytes) { DPRINTF(("msgsz > msqkptr->u.msg_qbytes\n")); error = EINVAL; goto done2; } if (msqkptr->u.msg_perm.mode & MSG_LOCKED) { DPRINTF(("msqid is locked\n")); need_more_resources = 1; } if (msgsz + msqkptr->u.msg_cbytes > msqkptr->u.msg_qbytes) { DPRINTF(("msgsz + msg_cbytes > msg_qbytes\n")); need_more_resources = 1; } if (segs_needed > nfree_msgmaps) { DPRINTF(("segs_needed > nfree_msgmaps\n")); need_more_resources = 1; } if (free_msghdrs == NULL) { DPRINTF(("no more msghdrs\n")); need_more_resources = 1; } if (need_more_resources) { int we_own_it; if ((msgflg & IPC_NOWAIT) != 0) { DPRINTF(("need more resources but caller " "doesn't want to wait\n")); error = EAGAIN; goto done2; } if ((msqkptr->u.msg_perm.mode & MSG_LOCKED) != 0) { DPRINTF(("we don't own the msqid_ds\n")); we_own_it = 0; } else { /* Force later arrivals to wait for our request */ DPRINTF(("we own the msqid_ds\n")); msqkptr->u.msg_perm.mode |= MSG_LOCKED; we_own_it = 1; } DPRINTF(("msgsnd: goodnight\n")); error = msleep(msqkptr, &msq_mtx, (PZERO - 4) | PCATCH, "msgsnd", hz); DPRINTF(("msgsnd: good morning, error=%d\n", error)); if (we_own_it) msqkptr->u.msg_perm.mode &= ~MSG_LOCKED; if (error == EWOULDBLOCK) { DPRINTF(("msgsnd: timed out\n")); continue; } if (error != 0) { DPRINTF(("msgsnd: interrupted system call\n")); error = EINTR; goto done2; } /* * Make sure that the msq queue still exists */ if (msqkptr->u.msg_qbytes == 0) { DPRINTF(("msqid deleted\n")); error = EIDRM; goto done2; } } else { DPRINTF(("got all the resources that we need\n")); break; } } /* * We have the resources that we need. * Make sure! 
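 * (The panic() calls below simply re-assert, while still holding msq_mtx,
 *  the same conditions the resource loop above just verified; they are
 *  consistency checks rather than recoverable error paths.)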
*/ if (msqkptr->u.msg_perm.mode & MSG_LOCKED) panic("msg_perm.mode & MSG_LOCKED"); if (segs_needed > nfree_msgmaps) panic("segs_needed > nfree_msgmaps"); if (msgsz + msqkptr->u.msg_cbytes > msqkptr->u.msg_qbytes) panic("msgsz + msg_cbytes > msg_qbytes"); if (free_msghdrs == NULL) panic("no more msghdrs"); /* * Re-lock the msqid_ds in case we page-fault when copying in the * message */ if ((msqkptr->u.msg_perm.mode & MSG_LOCKED) != 0) panic("msqid_ds is already locked"); msqkptr->u.msg_perm.mode |= MSG_LOCKED; /* * Allocate a message header */ msghdr = free_msghdrs; free_msghdrs = msghdr->msg_next; msghdr->msg_spot = -1; msghdr->msg_ts = msgsz; msghdr->msg_type = mtype; #ifdef MAC /* * XXXMAC: Should the mac_sysvmsq_check_msgmsq check follow here * immediately? Or, should it be checked just before the msg is * enqueued in the msgq (as it is done now)? */ mac_sysvmsg_create(td->td_ucred, msqkptr, msghdr); #endif /* * Allocate space for the message */ while (segs_needed > 0) { if (nfree_msgmaps <= 0) panic("not enough msgmaps"); if (free_msgmaps == -1) panic("nil free_msgmaps"); next = free_msgmaps; if (next <= -1) panic("next too low #1"); if (next >= msginfo.msgseg) panic("next out of range #1"); DPRINTF(("allocating segment %d to message\n", next)); free_msgmaps = msgmaps[next].next; nfree_msgmaps--; msgmaps[next].next = msghdr->msg_spot; msghdr->msg_spot = next; segs_needed--; } /* * Validate the message type */ if (msghdr->msg_type < 1) { msg_freehdr(msghdr); msqkptr->u.msg_perm.mode &= ~MSG_LOCKED; wakeup(msqkptr); DPRINTF(("mtype (%ld) < 1\n", msghdr->msg_type)); error = EINVAL; goto done2; } /* * Copy in the message body */ next = msghdr->msg_spot; while (msgsz > 0) { size_t tlen; if (msgsz > msginfo.msgssz) tlen = msginfo.msgssz; else tlen = msgsz; if (next <= -1) panic("next too low #2"); if (next >= msginfo.msgseg) panic("next out of range #2"); mtx_unlock(&msq_mtx); if ((error = copyin(msgp, &msgpool[next * msginfo.msgssz], tlen)) != 0) { mtx_lock(&msq_mtx); DPRINTF(("error %d copying in message segment\n", error)); msg_freehdr(msghdr); msqkptr->u.msg_perm.mode &= ~MSG_LOCKED; wakeup(msqkptr); goto done2; } mtx_lock(&msq_mtx); msgsz -= tlen; msgp = (const char *)msgp + tlen; next = msgmaps[next].next; } if (next != -1) panic("didn't use all the msg segments"); /* * We've got the message. Unlock the msqid_ds. */ msqkptr->u.msg_perm.mode &= ~MSG_LOCKED; /* * Make sure that the msqid_ds is still allocated. */ if (msqkptr->u.msg_qbytes == 0) { msg_freehdr(msghdr); wakeup(msqkptr); error = EIDRM; goto done2; } #ifdef MAC /* * Note: Since the task/thread allocates the msghdr and usually * primes it with its own MAC label, for a majority of policies, it * won't be necessary to check whether the msghdr has access * permissions to the msgq. The mac_sysvmsq_check_msqsnd check would * suffice in that case. However, this hook may be required where * individual policies derive a non-identical label for the msghdr * from the current thread label and may want to check the msghdr * enqueue permissions, along with read/write permissions to the * msgq. 
*/ error = mac_sysvmsq_check_msgmsq(td->td_ucred, msghdr, msqkptr); if (error != 0) { msg_freehdr(msghdr); wakeup(msqkptr); goto done2; } #endif /* * Put the message into the queue */ if (msqkptr->u.msg_first == NULL) { msqkptr->u.msg_first = msghdr; msqkptr->u.msg_last = msghdr; } else { msqkptr->u.msg_last->msg_next = msghdr; msqkptr->u.msg_last = msghdr; } msqkptr->u.msg_last->msg_next = NULL; msqkptr->u.msg_cbytes += msghdr->msg_ts; msqkptr->u.msg_qnum++; msqkptr->u.msg_lspid = td->td_proc->p_pid; msqkptr->u.msg_stime = time_second; wakeup(msqkptr); td->td_retval[0] = 0; done2: mtx_unlock(&msq_mtx); return (error); } int msgsnd(td, uap) struct thread *td; register struct msgsnd_args *uap; { int error; long mtype; DPRINTF(("call to msgsnd(%d, %p, %zu, %d)\n", uap->msqid, uap->msgp, uap->msgsz, uap->msgflg)); if ((error = copyin(uap->msgp, &mtype, sizeof(mtype))) != 0) { DPRINTF(("error %d copying the message type\n", error)); return (error); } return (kern_msgsnd(td, uap->msqid, (const char *)uap->msgp + sizeof(mtype), uap->msgsz, uap->msgflg, mtype)); } #ifndef _SYS_SYSPROTO_H_ struct msgrcv_args { int msqid; void *msgp; size_t msgsz; long msgtyp; int msgflg; }; #endif int kern_msgrcv(td, msqid, msgp, msgsz, msgtyp, msgflg, mtype) struct thread *td; int msqid; void *msgp; /* XXX msgp is actually mtext. */ size_t msgsz; long msgtyp; int msgflg; long *mtype; { size_t len; register struct msqid_kernel *msqkptr; register struct msg *msghdr; int msqix, error = 0; short next; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); msqix = IPCID_TO_IX(msqid); if (msqix < 0 || msqix >= msginfo.msgmni) { DPRINTF(("msqid (%d) out of range (0<=msqid<%d)\n", msqix, msginfo.msgmni)); return (EINVAL); } msqkptr = &msqids[msqix]; mtx_lock(&msq_mtx); if (msqkptr->u.msg_qbytes == 0) { DPRINTF(("no such message queue id\n")); error = EINVAL; goto done2; } if (msqkptr->u.msg_perm.seq != IPCID_TO_SEQ(msqid)) { DPRINTF(("wrong sequence number\n")); error = EINVAL; goto done2; } if ((error = ipcperm(td, &msqkptr->u.msg_perm, IPC_R))) { DPRINTF(("requester doesn't have read access\n")); goto done2; } #ifdef MAC error = mac_sysvmsq_check_msqrcv(td->td_ucred, msqkptr); if (error != 0) goto done2; #endif msghdr = NULL; while (msghdr == NULL) { if (msgtyp == 0) { msghdr = msqkptr->u.msg_first; if (msghdr != NULL) { if (msgsz < msghdr->msg_ts && (msgflg & MSG_NOERROR) == 0) { DPRINTF(("first message on the queue " "is too big (want %zu, got %d)\n", msgsz, msghdr->msg_ts)); error = E2BIG; goto done2; } #ifdef MAC error = mac_sysvmsq_check_msgrcv(td->td_ucred, msghdr); if (error != 0) goto done2; #endif if (msqkptr->u.msg_first == msqkptr->u.msg_last) { msqkptr->u.msg_first = NULL; msqkptr->u.msg_last = NULL; } else { msqkptr->u.msg_first = msghdr->msg_next; if (msqkptr->u.msg_first == NULL) panic("msg_first/last screwed up #1"); } } } else { struct msg *previous; struct msg **prev; previous = NULL; prev = &(msqkptr->u.msg_first); while ((msghdr = *prev) != NULL) { /* * Is this message's type an exact match or is * this message's type less than or equal to * the absolute value of a negative msgtyp? * Note that the second half of this test can * NEVER be true if msgtyp is positive since * msg_type is always positive! 
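 * Illustrative summary of the selection rules implemented here: a msgtyp
 * of 0 (handled above) takes the first message on the queue; a positive
 * msgtyp takes the first message whose msg_type matches exactly; a
 * negative msgtyp takes the first queued message whose msg_type is
 * <= -msgtyp.  For example, with queued types 4, 2, 9 (in arrival order),
 * msgrcv(id, &m, sizeof(m.mtext), -5, 0) would return the type-4 message.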
*/ if (msgtyp == msghdr->msg_type || msghdr->msg_type <= -msgtyp) { DPRINTF(("found message type %ld, " "requested %ld\n", msghdr->msg_type, msgtyp)); if (msgsz < msghdr->msg_ts && (msgflg & MSG_NOERROR) == 0) { DPRINTF(("requested message " "on the queue is too big " "(want %zu, got %hu)\n", msgsz, msghdr->msg_ts)); error = E2BIG; goto done2; } #ifdef MAC error = mac_sysvmsq_check_msgrcv( td->td_ucred, msghdr); if (error != 0) goto done2; #endif *prev = msghdr->msg_next; if (msghdr == msqkptr->u.msg_last) { if (previous == NULL) { if (prev != &msqkptr->u.msg_first) panic("msg_first/last screwed up #2"); msqkptr->u.msg_first = NULL; msqkptr->u.msg_last = NULL; } else { if (prev == &msqkptr->u.msg_first) panic("msg_first/last screwed up #3"); msqkptr->u.msg_last = previous; } } break; } previous = msghdr; prev = &(msghdr->msg_next); } } /* * We've either extracted the msghdr for the appropriate * message or there isn't one. * If there is one then bail out of this loop. */ if (msghdr != NULL) break; /* * Hmph! No message found. Does the user want to wait? */ if ((msgflg & IPC_NOWAIT) != 0) { DPRINTF(("no appropriate message found (msgtyp=%ld)\n", msgtyp)); /* The SVID says to return ENOMSG. */ error = ENOMSG; goto done2; } /* * Wait for something to happen */ DPRINTF(("msgrcv: goodnight\n")); error = msleep(msqkptr, &msq_mtx, (PZERO - 4) | PCATCH, "msgrcv", 0); DPRINTF(("msgrcv: good morning (error=%d)\n", error)); if (error != 0) { DPRINTF(("msgrcv: interrupted system call\n")); error = EINTR; goto done2; } /* * Make sure that the msq queue still exists */ if (msqkptr->u.msg_qbytes == 0 || msqkptr->u.msg_perm.seq != IPCID_TO_SEQ(msqid)) { DPRINTF(("msqid deleted\n")); error = EIDRM; goto done2; } } /* * Return the message to the user. * * First, do the bookkeeping (before we risk being interrupted). */ msqkptr->u.msg_cbytes -= msghdr->msg_ts; msqkptr->u.msg_qnum--; msqkptr->u.msg_lrpid = td->td_proc->p_pid; msqkptr->u.msg_rtime = time_second; /* * Make msgsz the actual amount that we'll be returning. * Note that this effectively truncates the message if it is too long * (since msgsz is never increased). */ DPRINTF(("found a message, msgsz=%zu, msg_ts=%hu\n", msgsz, msghdr->msg_ts)); if (msgsz > msghdr->msg_ts) msgsz = msghdr->msg_ts; *mtype = msghdr->msg_type; /* * Return the segments to the user */ next = msghdr->msg_spot; for (len = 0; len < msgsz; len += msginfo.msgssz) { size_t tlen; if (msgsz - len > msginfo.msgssz) tlen = msginfo.msgssz; else tlen = msgsz - len; if (next <= -1) panic("next too low #3"); if (next >= msginfo.msgseg) panic("next out of range #3"); mtx_unlock(&msq_mtx); error = copyout(&msgpool[next * msginfo.msgssz], msgp, tlen); mtx_lock(&msq_mtx); if (error != 0) { DPRINTF(("error (%d) copying out message segment\n", error)); msg_freehdr(msghdr); wakeup(msqkptr); goto done2; } msgp = (char *)msgp + tlen; next = msgmaps[next].next; } /* * Done, return the actual number of bytes copied out. 
*/ msg_freehdr(msghdr); wakeup(msqkptr); td->td_retval[0] = msgsz; done2: mtx_unlock(&msq_mtx); return (error); } int msgrcv(td, uap) struct thread *td; register struct msgrcv_args *uap; { int error; long mtype; DPRINTF(("call to msgrcv(%d, %p, %zu, %ld, %d)\n", uap->msqid, uap->msgp, uap->msgsz, uap->msgtyp, uap->msgflg)); if ((error = kern_msgrcv(td, uap->msqid, (char *)uap->msgp + sizeof(mtype), uap->msgsz, uap->msgtyp, uap->msgflg, &mtype)) != 0) return (error); if ((error = copyout(&mtype, uap->msgp, sizeof(mtype))) != 0) DPRINTF(("error %d copying the message type\n", error)); return (error); } static int sysctl_msqids(SYSCTL_HANDLER_ARGS) { return (SYSCTL_OUT(req, msqids, sizeof(struct msqid_kernel) * msginfo.msgmni)); } SYSCTL_INT(_kern_ipc, OID_AUTO, msgmax, CTLFLAG_RD, &msginfo.msgmax, 0, "Maximum message size"); SYSCTL_INT(_kern_ipc, OID_AUTO, msgmni, CTLFLAG_RDTUN, &msginfo.msgmni, 0, "Number of message queue identifiers"); SYSCTL_INT(_kern_ipc, OID_AUTO, msgmnb, CTLFLAG_RDTUN, &msginfo.msgmnb, 0, "Maximum number of bytes in a queue"); SYSCTL_INT(_kern_ipc, OID_AUTO, msgtql, CTLFLAG_RDTUN, &msginfo.msgtql, 0, "Maximum number of messages in the system"); SYSCTL_INT(_kern_ipc, OID_AUTO, msgssz, CTLFLAG_RDTUN, &msginfo.msgssz, 0, "Size of a message segment"); SYSCTL_INT(_kern_ipc, OID_AUTO, msgseg, CTLFLAG_RDTUN, &msginfo.msgseg, 0, "Number of message segments"); SYSCTL_PROC(_kern_ipc, OID_AUTO, msqids, CTLFLAG_RD, NULL, 0, sysctl_msqids, "", "Message queue IDs"); Index: head/sys/kern/sysv_sem.c =================================================================== --- head/sys/kern/sysv_sem.c (revision 192894) +++ head/sys/kern/sysv_sem.c (revision 192895) @@ -1,1349 +1,1349 @@ /*- * Implementation of SVID semaphores * * Author: Daniel Boulet * * This software is provided ``AS IS'' without any warranties of any kind. */ /*- * Copyright (c) 2003-2005 McAfee, Inc. * All rights reserved. * * This software was developed for the FreeBSD Project in part by McAfee * Research, the Security Research Division of McAfee, Inc under DARPA/SPAWAR * contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA CHATS research * program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include "opt_sysvipc.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static MALLOC_DEFINE(M_SEM, "sem", "SVID compatible semaphores"); #ifdef SEM_DEBUG #define DPRINTF(a) printf a #else #define DPRINTF(a) #endif static void seminit(void); static int sysvsem_modload(struct module *, int, void *); static int semunload(void); static void semexit_myhook(void *arg, struct proc *p); static int sysctl_sema(SYSCTL_HANDLER_ARGS); static int semvalid(int semid, struct semid_kernel *semakptr); #ifndef _SYS_SYSPROTO_H_ struct __semctl_args; int __semctl(struct thread *td, struct __semctl_args *uap); struct semget_args; int semget(struct thread *td, struct semget_args *uap); struct semop_args; int semop(struct thread *td, struct semop_args *uap); #endif static struct sem_undo *semu_alloc(struct thread *td); static int semundo_adjust(struct thread *td, struct sem_undo **supptr, int semid, int semseq, int semnum, int adjval); static void semundo_clear(int semid, int semnum); /* XXX casting to (sy_call_t *) is bogus, as usual. */ static sy_call_t *semcalls[] = { (sy_call_t *)__semctl, (sy_call_t *)semget, (sy_call_t *)semop }; static struct mtx sem_mtx; /* semaphore global lock */ static struct mtx sem_undo_mtx; static int semtot = 0; static struct semid_kernel *sema; /* semaphore id pool */ static struct mtx *sema_mtx; /* semaphore id pool mutexes*/ static struct sem *sem; /* semaphore pool */ LIST_HEAD(, sem_undo) semu_list; /* list of active undo structures */ LIST_HEAD(, sem_undo) semu_free_list; /* list of free undo structures */ static int *semu; /* undo structure pool */ static eventhandler_tag semexit_tag; #define SEMUNDO_MTX sem_undo_mtx #define SEMUNDO_LOCK() mtx_lock(&SEMUNDO_MTX); #define SEMUNDO_UNLOCK() mtx_unlock(&SEMUNDO_MTX); #define SEMUNDO_LOCKASSERT(how) mtx_assert(&SEMUNDO_MTX, (how)); struct sem { u_short semval; /* semaphore value */ pid_t sempid; /* pid of last operation */ u_short semncnt; /* # awaiting semval > cval */ u_short semzcnt; /* # awaiting semval = 0 */ }; /* * Undo structure (one per process) */ struct sem_undo { LIST_ENTRY(sem_undo) un_next; /* ptr to next active undo structure */ struct proc *un_proc; /* owner of this structure */ short un_cnt; /* # of active entries */ struct undo { short un_adjval; /* adjust on exit values */ short un_num; /* semaphore # */ int un_id; /* semid */ unsigned short un_seq; } un_ent[1]; /* undo entries */ }; /* * Configuration parameters */ #ifndef SEMMNI #define SEMMNI 10 /* # of semaphore identifiers */ #endif #ifndef SEMMNS #define SEMMNS 60 /* # of semaphores in system */ #endif #ifndef SEMUME #define SEMUME 10 /* max # of undo entries per process */ #endif #ifndef SEMMNU #define SEMMNU 30 /* # of undo structures in system */ #endif /* shouldn't need tuning */ #ifndef SEMMAP #define SEMMAP 30 /* # of entries in semaphore map */ #endif #ifndef SEMMSL #define SEMMSL SEMMNS /* max # of semaphores per id */ #endif #ifndef SEMOPM #define SEMOPM 100 /* max # of operations per semop call */ #endif #define SEMVMX 32767 /* semaphore maximum value */ #define SEMAEM 16384 /* adjust on exit max value */ /* * Due to the way semaphore memory is allocated, we have to ensure that * SEMUSZ is properly aligned. 
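 * For example, on a platform where sizeof(long) == 8, the SEM_ALIGN()
 * macro below rounds 13 bytes up to 16, so each per-process undo
 * structure starts on a long-aligned boundary within the semu pool.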
*/ #define SEM_ALIGN(bytes) (((bytes) + (sizeof(long) - 1)) & ~(sizeof(long) - 1)) /* actual size of an undo structure */ #define SEMUSZ SEM_ALIGN(offsetof(struct sem_undo, un_ent[SEMUME])) /* * Macro to find a particular sem_undo vector */ #define SEMU(ix) \ ((struct sem_undo *)(((intptr_t)semu)+ix * seminfo.semusz)) /* * semaphore info struct */ struct seminfo seminfo = { SEMMAP, /* # of entries in semaphore map */ SEMMNI, /* # of semaphore identifiers */ SEMMNS, /* # of semaphores in system */ SEMMNU, /* # of undo structures in system */ SEMMSL, /* max # of semaphores per id */ SEMOPM, /* max # of operations per semop call */ SEMUME, /* max # of undo entries per process */ SEMUSZ, /* size in bytes of undo structure */ SEMVMX, /* semaphore maximum value */ SEMAEM /* adjust on exit max value */ }; SYSCTL_INT(_kern_ipc, OID_AUTO, semmap, CTLFLAG_RW, &seminfo.semmap, 0, "Number of entries in the semaphore map"); SYSCTL_INT(_kern_ipc, OID_AUTO, semmni, CTLFLAG_RDTUN, &seminfo.semmni, 0, "Number of semaphore identifiers"); SYSCTL_INT(_kern_ipc, OID_AUTO, semmns, CTLFLAG_RDTUN, &seminfo.semmns, 0, "Maximum number of semaphores in the system"); SYSCTL_INT(_kern_ipc, OID_AUTO, semmnu, CTLFLAG_RDTUN, &seminfo.semmnu, 0, "Maximum number of undo structures in the system"); SYSCTL_INT(_kern_ipc, OID_AUTO, semmsl, CTLFLAG_RW, &seminfo.semmsl, 0, "Max semaphores per id"); SYSCTL_INT(_kern_ipc, OID_AUTO, semopm, CTLFLAG_RDTUN, &seminfo.semopm, 0, "Max operations per semop call"); SYSCTL_INT(_kern_ipc, OID_AUTO, semume, CTLFLAG_RDTUN, &seminfo.semume, 0, "Max undo entries per process"); SYSCTL_INT(_kern_ipc, OID_AUTO, semusz, CTLFLAG_RDTUN, &seminfo.semusz, 0, "Size in bytes of undo structure"); SYSCTL_INT(_kern_ipc, OID_AUTO, semvmx, CTLFLAG_RW, &seminfo.semvmx, 0, "Semaphore maximum value"); SYSCTL_INT(_kern_ipc, OID_AUTO, semaem, CTLFLAG_RW, &seminfo.semaem, 0, "Adjust on exit max value"); SYSCTL_PROC(_kern_ipc, OID_AUTO, sema, CTLFLAG_RD, NULL, 0, sysctl_sema, "", ""); static void seminit(void) { int i; TUNABLE_INT_FETCH("kern.ipc.semmap", &seminfo.semmap); TUNABLE_INT_FETCH("kern.ipc.semmni", &seminfo.semmni); TUNABLE_INT_FETCH("kern.ipc.semmns", &seminfo.semmns); TUNABLE_INT_FETCH("kern.ipc.semmnu", &seminfo.semmnu); TUNABLE_INT_FETCH("kern.ipc.semmsl", &seminfo.semmsl); TUNABLE_INT_FETCH("kern.ipc.semopm", &seminfo.semopm); TUNABLE_INT_FETCH("kern.ipc.semume", &seminfo.semume); TUNABLE_INT_FETCH("kern.ipc.semusz", &seminfo.semusz); TUNABLE_INT_FETCH("kern.ipc.semvmx", &seminfo.semvmx); TUNABLE_INT_FETCH("kern.ipc.semaem", &seminfo.semaem); sem = malloc(sizeof(struct sem) * seminfo.semmns, M_SEM, M_WAITOK); sema = malloc(sizeof(struct semid_kernel) * seminfo.semmni, M_SEM, M_WAITOK); sema_mtx = malloc(sizeof(struct mtx) * seminfo.semmni, M_SEM, M_WAITOK | M_ZERO); semu = malloc(seminfo.semmnu * seminfo.semusz, M_SEM, M_WAITOK); for (i = 0; i < seminfo.semmni; i++) { sema[i].u.sem_base = 0; sema[i].u.sem_perm.mode = 0; sema[i].u.sem_perm.seq = 0; #ifdef MAC mac_sysvsem_init(&sema[i]); #endif } for (i = 0; i < seminfo.semmni; i++) mtx_init(&sema_mtx[i], "semid", NULL, MTX_DEF); LIST_INIT(&semu_free_list); for (i = 0; i < seminfo.semmnu; i++) { struct sem_undo *suptr = SEMU(i); suptr->un_proc = NULL; LIST_INSERT_HEAD(&semu_free_list, suptr, un_next); } LIST_INIT(&semu_list); mtx_init(&sem_mtx, "sem", NULL, MTX_DEF); mtx_init(&sem_undo_mtx, "semu", NULL, MTX_DEF); semexit_tag = EVENTHANDLER_REGISTER(process_exit, semexit_myhook, NULL, EVENTHANDLER_PRI_ANY); } static int semunload(void) { int i; /* 
XXXKIB */ if (semtot != 0) return (EBUSY); EVENTHANDLER_DEREGISTER(process_exit, semexit_tag); #ifdef MAC for (i = 0; i < seminfo.semmni; i++) mac_sysvsem_destroy(&sema[i]); #endif free(sem, M_SEM); free(sema, M_SEM); free(semu, M_SEM); for (i = 0; i < seminfo.semmni; i++) mtx_destroy(&sema_mtx[i]); free(sema_mtx, M_SEM); mtx_destroy(&sem_mtx); mtx_destroy(&sem_undo_mtx); return (0); } static int sysvsem_modload(struct module *module, int cmd, void *arg) { int error = 0; switch (cmd) { case MOD_LOAD: seminit(); break; case MOD_UNLOAD: error = semunload(); break; case MOD_SHUTDOWN: break; default: error = EINVAL; break; } return (error); } static moduledata_t sysvsem_mod = { "sysvsem", &sysvsem_modload, NULL }; SYSCALL_MODULE_HELPER(semsys); SYSCALL_MODULE_HELPER(__semctl); SYSCALL_MODULE_HELPER(semget); SYSCALL_MODULE_HELPER(semop); DECLARE_MODULE(sysvsem, sysvsem_mod, SI_SUB_SYSV_SEM, SI_ORDER_FIRST); MODULE_VERSION(sysvsem, 1); /* * Entry point for all SEM calls. */ int semsys(td, uap) struct thread *td; /* XXX actually varargs. */ struct semsys_args /* { int which; int a2; int a3; int a4; int a5; } */ *uap; { int error; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); if (uap->which < 0 || uap->which >= sizeof(semcalls)/sizeof(semcalls[0])) return (EINVAL); error = (*semcalls[uap->which])(td, &uap->a2); return (error); } /* * Allocate a new sem_undo structure for a process * (returns ptr to structure or NULL if no more room) */ static struct sem_undo * semu_alloc(struct thread *td) { struct sem_undo *suptr; SEMUNDO_LOCKASSERT(MA_OWNED); if ((suptr = LIST_FIRST(&semu_free_list)) == NULL) return (NULL); LIST_REMOVE(suptr, un_next); LIST_INSERT_HEAD(&semu_list, suptr, un_next); suptr->un_cnt = 0; suptr->un_proc = td->td_proc; return (suptr); } static int semu_try_free(struct sem_undo *suptr) { SEMUNDO_LOCKASSERT(MA_OWNED); if (suptr->un_cnt != 0) return (0); LIST_REMOVE(suptr, un_next); LIST_INSERT_HEAD(&semu_free_list, suptr, un_next); return (1); } /* * Adjust a particular entry for a particular proc */ static int semundo_adjust(struct thread *td, struct sem_undo **supptr, int semid, int semseq, int semnum, int adjval) { struct proc *p = td->td_proc; struct sem_undo *suptr; struct undo *sunptr; int i; SEMUNDO_LOCKASSERT(MA_OWNED); /* Look for and remember the sem_undo if the caller doesn't provide it */ suptr = *supptr; if (suptr == NULL) { LIST_FOREACH(suptr, &semu_list, un_next) { if (suptr->un_proc == p) { *supptr = suptr; break; } } if (suptr == NULL) { if (adjval == 0) return(0); suptr = semu_alloc(td); if (suptr == NULL) return (ENOSPC); *supptr = suptr; } } /* * Look for the requested entry and adjust it (delete if adjval becomes * 0). 
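 * Illustrative sequence for a single semaphore, assuming SEM_UNDO is set
 * on each operation (semop() passes the negated sem_op as adjval):
 *
 *	semop +2  ->  un_adjval becomes -2
 *	semop +2  ->  un_adjval becomes -4
 *	semop -4  ->  un_adjval returns to 0, the entry is deleted, and the
 *		      sem_undo structure is freed if this was its last entry
 *
 * An adjustment that would exceed seminfo.semaem in either direction is
 * rejected with ERANGE instead.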
*/ sunptr = &suptr->un_ent[0]; for (i = 0; i < suptr->un_cnt; i++, sunptr++) { if (sunptr->un_id != semid || sunptr->un_num != semnum) continue; if (adjval != 0) { adjval += sunptr->un_adjval; if (adjval > seminfo.semaem || adjval < -seminfo.semaem) return (ERANGE); } sunptr->un_adjval = adjval; if (sunptr->un_adjval == 0) { suptr->un_cnt--; if (i < suptr->un_cnt) suptr->un_ent[i] = suptr->un_ent[suptr->un_cnt]; if (suptr->un_cnt == 0) semu_try_free(suptr); } return (0); } /* Didn't find the right entry - create it */ if (adjval == 0) return (0); if (adjval > seminfo.semaem || adjval < -seminfo.semaem) return (ERANGE); if (suptr->un_cnt != seminfo.semume) { sunptr = &suptr->un_ent[suptr->un_cnt]; suptr->un_cnt++; sunptr->un_adjval = adjval; sunptr->un_id = semid; sunptr->un_num = semnum; sunptr->un_seq = semseq; } else return (EINVAL); return (0); } static void semundo_clear(int semid, int semnum) { struct sem_undo *suptr, *suptr1; struct undo *sunptr; int i; SEMUNDO_LOCKASSERT(MA_OWNED); LIST_FOREACH_SAFE(suptr, &semu_list, un_next, suptr1) { sunptr = &suptr->un_ent[0]; for (i = 0; i < suptr->un_cnt; i++, sunptr++) { if (sunptr->un_id != semid) continue; if (semnum == -1 || sunptr->un_num == semnum) { suptr->un_cnt--; if (i < suptr->un_cnt) { suptr->un_ent[i] = suptr->un_ent[suptr->un_cnt]; continue; } semu_try_free(suptr); } if (semnum != -1) break; } } } static int semvalid(int semid, struct semid_kernel *semakptr) { return ((semakptr->u.sem_perm.mode & SEM_ALLOC) == 0 || semakptr->u.sem_perm.seq != IPCID_TO_SEQ(semid) ? EINVAL : 0); } /* * Note that the user-mode half of this passes a union, not a pointer. */ #ifndef _SYS_SYSPROTO_H_ struct __semctl_args { int semid; int semnum; int cmd; union semun *arg; }; #endif int __semctl(struct thread *td, struct __semctl_args *uap) { struct semid_ds dsbuf; union semun arg, semun; register_t rval; int error; switch (uap->cmd) { case SEM_STAT: case IPC_SET: case IPC_STAT: case GETALL: case SETVAL: case SETALL: error = copyin(uap->arg, &arg, sizeof(arg)); if (error) return (error); break; } switch (uap->cmd) { case SEM_STAT: case IPC_STAT: semun.buf = &dsbuf; break; case IPC_SET: error = copyin(arg.buf, &dsbuf, sizeof(dsbuf)); if (error) return (error); semun.buf = &dsbuf; break; case GETALL: case SETALL: semun.array = arg.array; break; case SETVAL: semun.val = arg.val; break; } error = kern_semctl(td, uap->semid, uap->semnum, uap->cmd, &semun, &rval); if (error) return (error); switch (uap->cmd) { case SEM_STAT: case IPC_STAT: error = copyout(&dsbuf, arg.buf, sizeof(dsbuf)); break; } if (error == 0) td->td_retval[0] = rval; return (error); } int kern_semctl(struct thread *td, int semid, int semnum, int cmd, union semun *arg, register_t *rval) { u_short *array; struct ucred *cred = td->td_ucred; int i, error; struct semid_ds *sbuf; struct semid_kernel *semakptr; struct mtx *sema_mtxp; u_short usval, count; int semidx; DPRINTF(("call to semctl(%d, %d, %d, 0x%p)\n", semid, semnum, cmd, arg)); - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); array = NULL; switch(cmd) { case SEM_STAT: /* * For this command we assume semid is an array index * rather than an IPC id. 
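 * (Illustrative note, assuming IXSEQ_TO_IPCID() packs the index into the
 *  low 16 bits and the sequence number into the high 16 bits, as the
 *  MSQID()/IPCID_TO_IX() macros in sysv_msg.c do: array index 3 with a
 *  sequence number of 2 would be reported back as id 0x00020003.)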
*/ if (semid < 0 || semid >= seminfo.semmni) return (EINVAL); semakptr = &sema[semid]; sema_mtxp = &sema_mtx[semid]; mtx_lock(sema_mtxp); if ((semakptr->u.sem_perm.mode & SEM_ALLOC) == 0) { error = EINVAL; goto done2; } if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_R))) goto done2; #ifdef MAC error = mac_sysvsem_check_semctl(cred, semakptr, cmd); if (error != 0) goto done2; #endif bcopy(&semakptr->u, arg->buf, sizeof(struct semid_ds)); *rval = IXSEQ_TO_IPCID(semid, semakptr->u.sem_perm); mtx_unlock(sema_mtxp); return (0); } semidx = IPCID_TO_IX(semid); if (semidx < 0 || semidx >= seminfo.semmni) return (EINVAL); semakptr = &sema[semidx]; sema_mtxp = &sema_mtx[semidx]; if (cmd == IPC_RMID) mtx_lock(&sem_mtx); mtx_lock(sema_mtxp); #ifdef MAC error = mac_sysvsem_check_semctl(cred, semakptr, cmd); if (error != 0) goto done2; #endif error = 0; *rval = 0; switch (cmd) { case IPC_RMID: if ((error = semvalid(semid, semakptr)) != 0) goto done2; if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_M))) goto done2; semakptr->u.sem_perm.cuid = cred->cr_uid; semakptr->u.sem_perm.uid = cred->cr_uid; semakptr->u.sem_perm.mode = 0; SEMUNDO_LOCK(); semundo_clear(semidx, -1); SEMUNDO_UNLOCK(); #ifdef MAC mac_sysvsem_cleanup(semakptr); #endif wakeup(semakptr); for (i = 0; i < seminfo.semmni; i++) { if ((sema[i].u.sem_perm.mode & SEM_ALLOC) && sema[i].u.sem_base > semakptr->u.sem_base) mtx_lock_flags(&sema_mtx[i], LOP_DUPOK); } for (i = semakptr->u.sem_base - sem; i < semtot; i++) sem[i] = sem[i + semakptr->u.sem_nsems]; for (i = 0; i < seminfo.semmni; i++) { if ((sema[i].u.sem_perm.mode & SEM_ALLOC) && sema[i].u.sem_base > semakptr->u.sem_base) { sema[i].u.sem_base -= semakptr->u.sem_nsems; mtx_unlock(&sema_mtx[i]); } } semtot -= semakptr->u.sem_nsems; break; case IPC_SET: if ((error = semvalid(semid, semakptr)) != 0) goto done2; if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_M))) goto done2; sbuf = arg->buf; semakptr->u.sem_perm.uid = sbuf->sem_perm.uid; semakptr->u.sem_perm.gid = sbuf->sem_perm.gid; semakptr->u.sem_perm.mode = (semakptr->u.sem_perm.mode & ~0777) | (sbuf->sem_perm.mode & 0777); semakptr->u.sem_ctime = time_second; break; case IPC_STAT: if ((error = semvalid(semid, semakptr)) != 0) goto done2; if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_R))) goto done2; bcopy(&semakptr->u, arg->buf, sizeof(struct semid_ds)); break; case GETNCNT: if ((error = semvalid(semid, semakptr)) != 0) goto done2; if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_R))) goto done2; if (semnum < 0 || semnum >= semakptr->u.sem_nsems) { error = EINVAL; goto done2; } *rval = semakptr->u.sem_base[semnum].semncnt; break; case GETPID: if ((error = semvalid(semid, semakptr)) != 0) goto done2; if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_R))) goto done2; if (semnum < 0 || semnum >= semakptr->u.sem_nsems) { error = EINVAL; goto done2; } *rval = semakptr->u.sem_base[semnum].sempid; break; case GETVAL: if ((error = semvalid(semid, semakptr)) != 0) goto done2; if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_R))) goto done2; if (semnum < 0 || semnum >= semakptr->u.sem_nsems) { error = EINVAL; goto done2; } *rval = semakptr->u.sem_base[semnum].semval; break; case GETALL: /* * Unfortunately, callers of this function don't know * in advance how many semaphores are in this set. * While we could just allocate the maximum size array * and pass the actual size back to the caller, that * won't work for SETALL since we can't copyin() more * data than the user specified as we may return a * spurious EFAULT. 
* * Note that the number of semaphores in a set is * fixed for the life of that set. The only way that * the 'count' could change while are blocked in * malloc() is if this semaphore set were destroyed * and a new one created with the same index. * However, semvalid() will catch that due to the * sequence number unless exactly 0x8000 (or a * multiple thereof) semaphore sets for the same index * are created and destroyed while we are in malloc! * */ count = semakptr->u.sem_nsems; mtx_unlock(sema_mtxp); array = malloc(sizeof(*array) * count, M_TEMP, M_WAITOK); mtx_lock(sema_mtxp); if ((error = semvalid(semid, semakptr)) != 0) goto done2; KASSERT(count == semakptr->u.sem_nsems, ("nsems changed")); if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_R))) goto done2; for (i = 0; i < semakptr->u.sem_nsems; i++) array[i] = semakptr->u.sem_base[i].semval; mtx_unlock(sema_mtxp); error = copyout(array, arg->array, count * sizeof(*array)); mtx_lock(sema_mtxp); break; case GETZCNT: if ((error = semvalid(semid, semakptr)) != 0) goto done2; if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_R))) goto done2; if (semnum < 0 || semnum >= semakptr->u.sem_nsems) { error = EINVAL; goto done2; } *rval = semakptr->u.sem_base[semnum].semzcnt; break; case SETVAL: if ((error = semvalid(semid, semakptr)) != 0) goto done2; if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_W))) goto done2; if (semnum < 0 || semnum >= semakptr->u.sem_nsems) { error = EINVAL; goto done2; } if (arg->val < 0 || arg->val > seminfo.semvmx) { error = ERANGE; goto done2; } semakptr->u.sem_base[semnum].semval = arg->val; SEMUNDO_LOCK(); semundo_clear(semidx, semnum); SEMUNDO_UNLOCK(); wakeup(semakptr); break; case SETALL: /* * See comment on GETALL for why 'count' shouldn't change * and why we require a userland buffer. 
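 * Sketch of the pattern used below (mirroring GETALL): record 'count',
 * drop the sema mutex, malloc() the temporary array and copyin() the
 * user data, re-take the mutex, then re-run semvalid() and the KASSERT
 * before touching sem_base, since the set could have been destroyed
 * while the lock was released.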
*/ count = semakptr->u.sem_nsems; mtx_unlock(sema_mtxp); array = malloc(sizeof(*array) * count, M_TEMP, M_WAITOK); error = copyin(arg->array, array, count * sizeof(*array)); mtx_lock(sema_mtxp); if (error) break; if ((error = semvalid(semid, semakptr)) != 0) goto done2; KASSERT(count == semakptr->u.sem_nsems, ("nsems changed")); if ((error = ipcperm(td, &semakptr->u.sem_perm, IPC_W))) goto done2; for (i = 0; i < semakptr->u.sem_nsems; i++) { usval = array[i]; if (usval > seminfo.semvmx) { error = ERANGE; break; } semakptr->u.sem_base[i].semval = usval; } SEMUNDO_LOCK(); semundo_clear(semidx, -1); SEMUNDO_UNLOCK(); wakeup(semakptr); break; default: error = EINVAL; break; } done2: mtx_unlock(sema_mtxp); if (cmd == IPC_RMID) mtx_unlock(&sem_mtx); if (array != NULL) free(array, M_TEMP); return(error); } #ifndef _SYS_SYSPROTO_H_ struct semget_args { key_t key; int nsems; int semflg; }; #endif int semget(struct thread *td, struct semget_args *uap) { int semid, error = 0; int key = uap->key; int nsems = uap->nsems; int semflg = uap->semflg; struct ucred *cred = td->td_ucred; DPRINTF(("semget(0x%x, %d, 0%o)\n", key, nsems, semflg)); - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); mtx_lock(&sem_mtx); if (key != IPC_PRIVATE) { for (semid = 0; semid < seminfo.semmni; semid++) { if ((sema[semid].u.sem_perm.mode & SEM_ALLOC) && sema[semid].u.sem_perm.key == key) break; } if (semid < seminfo.semmni) { DPRINTF(("found public key\n")); if ((error = ipcperm(td, &sema[semid].u.sem_perm, semflg & 0700))) { goto done2; } if (nsems > 0 && sema[semid].u.sem_nsems < nsems) { DPRINTF(("too small\n")); error = EINVAL; goto done2; } if ((semflg & IPC_CREAT) && (semflg & IPC_EXCL)) { DPRINTF(("not exclusive\n")); error = EEXIST; goto done2; } #ifdef MAC error = mac_sysvsem_check_semget(cred, &sema[semid]); if (error != 0) goto done2; #endif goto found; } } DPRINTF(("need to allocate the semid_kernel\n")); if (key == IPC_PRIVATE || (semflg & IPC_CREAT)) { if (nsems <= 0 || nsems > seminfo.semmsl) { DPRINTF(("nsems out of range (0<%d<=%d)\n", nsems, seminfo.semmsl)); error = EINVAL; goto done2; } if (nsems > seminfo.semmns - semtot) { DPRINTF(( "not enough semaphores left (need %d, got %d)\n", nsems, seminfo.semmns - semtot)); error = ENOSPC; goto done2; } for (semid = 0; semid < seminfo.semmni; semid++) { if ((sema[semid].u.sem_perm.mode & SEM_ALLOC) == 0) break; } if (semid == seminfo.semmni) { DPRINTF(("no more semid_kernel's available\n")); error = ENOSPC; goto done2; } DPRINTF(("semid %d is available\n", semid)); mtx_lock(&sema_mtx[semid]); KASSERT((sema[semid].u.sem_perm.mode & SEM_ALLOC) == 0, ("Lost semaphore %d", semid)); sema[semid].u.sem_perm.key = key; sema[semid].u.sem_perm.cuid = cred->cr_uid; sema[semid].u.sem_perm.uid = cred->cr_uid; sema[semid].u.sem_perm.cgid = cred->cr_gid; sema[semid].u.sem_perm.gid = cred->cr_gid; sema[semid].u.sem_perm.mode = (semflg & 0777) | SEM_ALLOC; sema[semid].u.sem_perm.seq = (sema[semid].u.sem_perm.seq + 1) & 0x7fff; sema[semid].u.sem_nsems = nsems; sema[semid].u.sem_otime = 0; sema[semid].u.sem_ctime = time_second; sema[semid].u.sem_base = &sem[semtot]; semtot += nsems; bzero(sema[semid].u.sem_base, sizeof(sema[semid].u.sem_base[0])*nsems); #ifdef MAC mac_sysvsem_create(cred, &sema[semid]); #endif mtx_unlock(&sema_mtx[semid]); DPRINTF(("sembase = %p, next = %p\n", sema[semid].u.sem_base, &sem[semtot])); } else { DPRINTF(("didn't find it and wasn't asked to create it\n")); error = ENOENT; goto 
done2; } found: td->td_retval[0] = IXSEQ_TO_IPCID(semid, sema[semid].u.sem_perm); done2: mtx_unlock(&sem_mtx); return (error); } #ifndef _SYS_SYSPROTO_H_ struct semop_args { int semid; struct sembuf *sops; size_t nsops; }; #endif int semop(struct thread *td, struct semop_args *uap) { #define SMALL_SOPS 8 struct sembuf small_sops[SMALL_SOPS]; int semid = uap->semid; size_t nsops = uap->nsops; struct sembuf *sops; struct semid_kernel *semakptr; struct sembuf *sopptr = 0; struct sem *semptr = 0; struct sem_undo *suptr; struct mtx *sema_mtxp; size_t i, j, k; int error; int do_wakeup, do_undos; unsigned short seq; #ifdef SEM_DEBUG sops = NULL; #endif DPRINTF(("call to semop(%d, %p, %u)\n", semid, sops, nsops)); - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); semid = IPCID_TO_IX(semid); /* Convert back to zero origin */ if (semid < 0 || semid >= seminfo.semmni) return (EINVAL); /* Allocate memory for sem_ops */ if (nsops <= SMALL_SOPS) sops = small_sops; else if (nsops <= seminfo.semopm) sops = malloc(nsops * sizeof(*sops), M_TEMP, M_WAITOK); else { DPRINTF(("too many sops (max=%d, nsops=%d)\n", seminfo.semopm, nsops)); return (E2BIG); } if ((error = copyin(uap->sops, sops, nsops * sizeof(sops[0]))) != 0) { DPRINTF(("error = %d from copyin(%p, %p, %d)\n", error, uap->sops, sops, nsops * sizeof(sops[0]))); if (sops != small_sops) free(sops, M_SEM); return (error); } semakptr = &sema[semid]; sema_mtxp = &sema_mtx[semid]; mtx_lock(sema_mtxp); if ((semakptr->u.sem_perm.mode & SEM_ALLOC) == 0) { error = EINVAL; goto done2; } seq = semakptr->u.sem_perm.seq; if (seq != IPCID_TO_SEQ(uap->semid)) { error = EINVAL; goto done2; } /* * Initial pass thru sops to see what permissions are needed. * Also perform any checks that don't need repeating on each * attempt to satisfy the request vector. */ j = 0; /* permission needed */ do_undos = 0; for (i = 0; i < nsops; i++) { sopptr = &sops[i]; if (sopptr->sem_num >= semakptr->u.sem_nsems) { error = EFBIG; goto done2; } if (sopptr->sem_flg & SEM_UNDO && sopptr->sem_op != 0) do_undos = 1; j |= (sopptr->sem_op == 0) ? SEM_R : SEM_A; } if ((error = ipcperm(td, &semakptr->u.sem_perm, j))) { DPRINTF(("error = %d from ipaccess\n", error)); goto done2; } #ifdef MAC error = mac_sysvsem_check_semop(td->td_ucred, semakptr, j); if (error != 0) goto done2; #endif /* * Loop trying to satisfy the vector of requests. * If we reach a point where we must wait, any requests already * performed are rolled back and we go to sleep until some other * process wakes us up. At this point, we start all over again. * * This ensures that from the perspective of other tasks, a set * of requests is atomic (never partially satisfied). */ for (;;) { do_wakeup = 0; error = 0; /* error return if necessary */ for (i = 0; i < nsops; i++) { sopptr = &sops[i]; semptr = &semakptr->u.sem_base[sopptr->sem_num]; DPRINTF(( "semop: semakptr=%p, sem_base=%p, " "semptr=%p, sem[%d]=%d : op=%d, flag=%s\n", semakptr, semakptr->u.sem_base, semptr, sopptr->sem_num, semptr->semval, sopptr->sem_op, (sopptr->sem_flg & IPC_NOWAIT) ? 
"nowait" : "wait")); if (sopptr->sem_op < 0) { if (semptr->semval + sopptr->sem_op < 0) { DPRINTF(("semop: can't do it now\n")); break; } else { semptr->semval += sopptr->sem_op; if (semptr->semval == 0 && semptr->semzcnt > 0) do_wakeup = 1; } } else if (sopptr->sem_op == 0) { if (semptr->semval != 0) { DPRINTF(("semop: not zero now\n")); break; } } else if (semptr->semval + sopptr->sem_op > seminfo.semvmx) { error = ERANGE; break; } else { if (semptr->semncnt > 0) do_wakeup = 1; semptr->semval += sopptr->sem_op; } } /* * Did we get through the entire vector? */ if (i >= nsops) goto done; /* * No ... rollback anything that we've already done */ DPRINTF(("semop: rollback 0 through %d\n", i-1)); for (j = 0; j < i; j++) semakptr->u.sem_base[sops[j].sem_num].semval -= sops[j].sem_op; /* If we detected an error, return it */ if (error != 0) goto done2; /* * If the request that we couldn't satisfy has the * NOWAIT flag set then return with EAGAIN. */ if (sopptr->sem_flg & IPC_NOWAIT) { error = EAGAIN; goto done2; } if (sopptr->sem_op == 0) semptr->semzcnt++; else semptr->semncnt++; DPRINTF(("semop: good night!\n")); error = msleep(semakptr, sema_mtxp, (PZERO - 4) | PCATCH, "semwait", 0); DPRINTF(("semop: good morning (error=%d)!\n", error)); /* return code is checked below, after sem[nz]cnt-- */ /* * Make sure that the semaphore still exists */ seq = semakptr->u.sem_perm.seq; if ((semakptr->u.sem_perm.mode & SEM_ALLOC) == 0 || seq != IPCID_TO_SEQ(uap->semid)) { error = EIDRM; goto done2; } /* * Renew the semaphore's pointer after wakeup since * during msleep sem_base may have been modified and semptr * is not valid any more */ semptr = &semakptr->u.sem_base[sopptr->sem_num]; /* * The semaphore is still alive. Readjust the count of * waiting processes. */ if (sopptr->sem_op == 0) semptr->semzcnt--; else semptr->semncnt--; /* * Is it really morning, or was our sleep interrupted? * (Delayed check of msleep() return code because we * need to decrement sem[nz]cnt either way.) */ if (error != 0) { error = EINTR; goto done2; } DPRINTF(("semop: good morning!\n")); } done: /* * Process any SEM_UNDO requests. */ if (do_undos) { SEMUNDO_LOCK(); suptr = NULL; for (i = 0; i < nsops; i++) { /* * We only need to deal with SEM_UNDO's for non-zero * op's. */ int adjval; if ((sops[i].sem_flg & SEM_UNDO) == 0) continue; adjval = sops[i].sem_op; if (adjval == 0) continue; error = semundo_adjust(td, &suptr, semid, seq, sops[i].sem_num, -adjval); if (error == 0) continue; /* * Oh-Oh! We ran out of either sem_undo's or undo's. * Rollback the adjustments to this point and then * rollback the semaphore ups and down so we can return * with an error with all structures restored. We * rollback the undo's in the exact reverse order that * we applied them. This guarantees that we won't run * out of space as we roll things back out. 
*/ for (j = 0; j < i; j++) { k = i - j - 1; if ((sops[k].sem_flg & SEM_UNDO) == 0) continue; adjval = sops[k].sem_op; if (adjval == 0) continue; if (semundo_adjust(td, &suptr, semid, seq, sops[k].sem_num, adjval) != 0) panic("semop - can't undo undos"); } for (j = 0; j < nsops; j++) semakptr->u.sem_base[sops[j].sem_num].semval -= sops[j].sem_op; DPRINTF(("error = %d from semundo_adjust\n", error)); SEMUNDO_UNLOCK(); goto done2; } /* loop through the sops */ SEMUNDO_UNLOCK(); } /* if (do_undos) */ /* We're definitely done - set the sempid's and time */ for (i = 0; i < nsops; i++) { sopptr = &sops[i]; semptr = &semakptr->u.sem_base[sopptr->sem_num]; semptr->sempid = td->td_proc->p_pid; } semakptr->u.sem_otime = time_second; /* * Do a wakeup if any semaphore was up'd whilst something was * sleeping on it. */ if (do_wakeup) { DPRINTF(("semop: doing wakeup\n")); wakeup(semakptr); DPRINTF(("semop: back from wakeup\n")); } DPRINTF(("semop: done\n")); td->td_retval[0] = 0; done2: mtx_unlock(sema_mtxp); if (sops != small_sops) free(sops, M_SEM); return (error); } /* * Go through the undo structures for this process and apply the adjustments to * semaphores. */ static void semexit_myhook(void *arg, struct proc *p) { struct sem_undo *suptr; struct semid_kernel *semakptr; struct mtx *sema_mtxp; int semid, semnum, adjval, ix; unsigned short seq; /* * Go through the chain of undo vectors looking for one * associated with this process. */ SEMUNDO_LOCK(); LIST_FOREACH(suptr, &semu_list, un_next) { if (suptr->un_proc == p) break; } if (suptr == NULL) { SEMUNDO_UNLOCK(); return; } LIST_REMOVE(suptr, un_next); DPRINTF(("proc @%p has undo structure with %d entries\n", p, suptr->un_cnt)); /* * If there are any active undo elements then process them. */ if (suptr->un_cnt > 0) { SEMUNDO_UNLOCK(); for (ix = 0; ix < suptr->un_cnt; ix++) { semid = suptr->un_ent[ix].un_id; semnum = suptr->un_ent[ix].un_num; adjval = suptr->un_ent[ix].un_adjval; seq = suptr->un_ent[ix].un_seq; semakptr = &sema[semid]; sema_mtxp = &sema_mtx[semid]; mtx_lock(sema_mtxp); if ((semakptr->u.sem_perm.mode & SEM_ALLOC) == 0 || (semakptr->u.sem_perm.seq != seq)) { mtx_unlock(sema_mtxp); continue; } if (semnum >= semakptr->u.sem_nsems) panic("semexit - semnum out of range"); DPRINTF(( "semexit: %p id=%d num=%d(adj=%d) ; sem=%d\n", suptr->un_proc, suptr->un_ent[ix].un_id, suptr->un_ent[ix].un_num, suptr->un_ent[ix].un_adjval, semakptr->u.sem_base[semnum].semval)); if (adjval < 0 && semakptr->u.sem_base[semnum].semval < -adjval) semakptr->u.sem_base[semnum].semval = 0; else semakptr->u.sem_base[semnum].semval += adjval; wakeup(semakptr); DPRINTF(("semexit: back from wakeup\n")); mtx_unlock(sema_mtxp); } SEMUNDO_LOCK(); } /* * Deallocate the undo vector. */ DPRINTF(("removing vector\n")); suptr->un_proc = NULL; suptr->un_cnt = 0; LIST_INSERT_HEAD(&semu_free_list, suptr, un_next); SEMUNDO_UNLOCK(); } static int sysctl_sema(SYSCTL_HANDLER_ARGS) { return (SYSCTL_OUT(req, sema, sizeof(struct semid_kernel) * seminfo.semmni)); } Index: head/sys/kern/sysv_shm.c =================================================================== --- head/sys/kern/sysv_shm.c (revision 192894) +++ head/sys/kern/sysv_shm.c (revision 192895) @@ -1,1037 +1,1037 @@ /* $NetBSD: sysv_shm.c,v 1.23 1994/07/04 23:25:12 glass Exp $ */ /*- * Copyright (c) 1994 Adam Glass and Charles Hannum. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. All advertising materials mentioning features or use of this software * must display the following acknowledgement: * This product includes software developed by Adam Glass and Charles * Hannum. * 4. The names of the authors may not be used to endorse or promote products * derived from this software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. */ /*- * Copyright (c) 2003-2005 McAfee, Inc. * All rights reserved. * * This software was developed for the FreeBSD Project in part by McAfee * Research, the Security Research Division of McAfee, Inc under DARPA/SPAWAR * contract N66001-01-C-8035 ("CBOSS"), as part of the DARPA CHATS research * program. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_sysvipc.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static MALLOC_DEFINE(M_SHM, "shm", "SVID compatible shared memory segments"); #if defined(__i386__) && (defined(COMPAT_FREEBSD4) || defined(COMPAT_43)) struct oshmctl_args; static int oshmctl(struct thread *td, struct oshmctl_args *uap); #endif static int shmget_allocate_segment(struct thread *td, struct shmget_args *uap, int mode); static int shmget_existing(struct thread *td, struct shmget_args *uap, int mode, int segnum); #if defined(__i386__) && (defined(COMPAT_FREEBSD4) || defined(COMPAT_43)) /* XXX casting to (sy_call_t *) is bogus, as usual. */ static sy_call_t *shmcalls[] = { (sy_call_t *)shmat, (sy_call_t *)oshmctl, (sy_call_t *)shmdt, (sy_call_t *)shmget, (sy_call_t *)shmctl }; #endif #define SHMSEG_FREE 0x0200 #define SHMSEG_REMOVED 0x0400 #define SHMSEG_ALLOCATED 0x0800 #define SHMSEG_WANTED 0x1000 static int shm_last_free, shm_nused, shmalloced; vm_size_t shm_committed; static struct shmid_kernel *shmsegs; struct shmmap_state { vm_offset_t va; int shmid; }; static void shm_deallocate_segment(struct shmid_kernel *); static int shm_find_segment_by_key(key_t); static struct shmid_kernel *shm_find_segment_by_shmid(int); static struct shmid_kernel *shm_find_segment_by_shmidx(int); static int shm_delete_mapping(struct vmspace *vm, struct shmmap_state *); static void shmrealloc(void); static void shminit(void); static int sysvshm_modload(struct module *, int, void *); static int shmunload(void); static void shmexit_myhook(struct vmspace *vm); static void shmfork_myhook(struct proc *p1, struct proc *p2); static int sysctl_shmsegs(SYSCTL_HANDLER_ARGS); /* * Tuneable values. */ #ifndef SHMMAXPGS #define SHMMAXPGS 8192 /* Note: sysv shared memory is swap backed. 
*/ #endif #ifndef SHMMAX #define SHMMAX (SHMMAXPGS*PAGE_SIZE) #endif #ifndef SHMMIN #define SHMMIN 1 #endif #ifndef SHMMNI #define SHMMNI 192 #endif #ifndef SHMSEG #define SHMSEG 128 #endif #ifndef SHMALL #define SHMALL (SHMMAXPGS) #endif struct shminfo shminfo = { SHMMAX, SHMMIN, SHMMNI, SHMSEG, SHMALL }; static int shm_use_phys; static int shm_allow_removed; SYSCTL_ULONG(_kern_ipc, OID_AUTO, shmmax, CTLFLAG_RW, &shminfo.shmmax, 0, "Maximum shared memory segment size"); SYSCTL_ULONG(_kern_ipc, OID_AUTO, shmmin, CTLFLAG_RW, &shminfo.shmmin, 0, "Minimum shared memory segment size"); SYSCTL_ULONG(_kern_ipc, OID_AUTO, shmmni, CTLFLAG_RDTUN, &shminfo.shmmni, 0, "Number of shared memory identifiers"); SYSCTL_ULONG(_kern_ipc, OID_AUTO, shmseg, CTLFLAG_RDTUN, &shminfo.shmseg, 0, "Number of segments per process"); SYSCTL_ULONG(_kern_ipc, OID_AUTO, shmall, CTLFLAG_RW, &shminfo.shmall, 0, "Maximum number of pages available for shared memory"); SYSCTL_INT(_kern_ipc, OID_AUTO, shm_use_phys, CTLFLAG_RW, &shm_use_phys, 0, "Enable/Disable locking of shared memory pages in core"); SYSCTL_INT(_kern_ipc, OID_AUTO, shm_allow_removed, CTLFLAG_RW, &shm_allow_removed, 0, "Enable/Disable attachment to attached segments marked for removal"); SYSCTL_PROC(_kern_ipc, OID_AUTO, shmsegs, CTLFLAG_RD, NULL, 0, sysctl_shmsegs, "", "Current number of shared memory segments allocated"); static int shm_find_segment_by_key(key) key_t key; { int i; for (i = 0; i < shmalloced; i++) if ((shmsegs[i].u.shm_perm.mode & SHMSEG_ALLOCATED) && shmsegs[i].u.shm_perm.key == key) return (i); return (-1); } static struct shmid_kernel * shm_find_segment_by_shmid(int shmid) { int segnum; struct shmid_kernel *shmseg; segnum = IPCID_TO_IX(shmid); if (segnum < 0 || segnum >= shmalloced) return (NULL); shmseg = &shmsegs[segnum]; if ((shmseg->u.shm_perm.mode & SHMSEG_ALLOCATED) == 0 || (!shm_allow_removed && (shmseg->u.shm_perm.mode & SHMSEG_REMOVED) != 0) || shmseg->u.shm_perm.seq != IPCID_TO_SEQ(shmid)) return (NULL); return (shmseg); } static struct shmid_kernel * shm_find_segment_by_shmidx(int segnum) { struct shmid_kernel *shmseg; if (segnum < 0 || segnum >= shmalloced) return (NULL); shmseg = &shmsegs[segnum]; if ((shmseg->u.shm_perm.mode & SHMSEG_ALLOCATED) == 0 || (!shm_allow_removed && (shmseg->u.shm_perm.mode & SHMSEG_REMOVED) != 0)) return (NULL); return (shmseg); } static void shm_deallocate_segment(shmseg) struct shmid_kernel *shmseg; { vm_size_t size; GIANT_REQUIRED; vm_object_deallocate(shmseg->u.shm_internal); shmseg->u.shm_internal = NULL; size = round_page(shmseg->shm_bsegsz); shm_committed -= btoc(size); shm_nused--; shmseg->u.shm_perm.mode = SHMSEG_FREE; #ifdef MAC mac_sysvshm_cleanup(shmseg); #endif } static int shm_delete_mapping(struct vmspace *vm, struct shmmap_state *shmmap_s) { struct shmid_kernel *shmseg; int segnum, result; vm_size_t size; GIANT_REQUIRED; segnum = IPCID_TO_IX(shmmap_s->shmid); shmseg = &shmsegs[segnum]; size = round_page(shmseg->shm_bsegsz); result = vm_map_remove(&vm->vm_map, shmmap_s->va, shmmap_s->va + size); if (result != KERN_SUCCESS) return (EINVAL); shmmap_s->shmid = -1; shmseg->u.shm_dtime = time_second; if ((--shmseg->u.shm_nattch <= 0) && (shmseg->u.shm_perm.mode & SHMSEG_REMOVED)) { shm_deallocate_segment(shmseg); shm_last_free = segnum; } return (0); } #ifndef _SYS_SYSPROTO_H_ struct shmdt_args { const void *shmaddr; }; #endif int shmdt(td, uap) struct thread *td; struct shmdt_args *uap; { struct proc *p = td->td_proc; struct shmmap_state *shmmap_s; #ifdef MAC struct shmid_kernel 
*shmsegptr; #endif int i; int error = 0; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); mtx_lock(&Giant); shmmap_s = p->p_vmspace->vm_shm; if (shmmap_s == NULL) { error = EINVAL; goto done2; } for (i = 0; i < shminfo.shmseg; i++, shmmap_s++) { if (shmmap_s->shmid != -1 && shmmap_s->va == (vm_offset_t)uap->shmaddr) { break; } } if (i == shminfo.shmseg) { error = EINVAL; goto done2; } #ifdef MAC shmsegptr = &shmsegs[IPCID_TO_IX(shmmap_s->shmid)]; error = mac_sysvshm_check_shmdt(td->td_ucred, shmsegptr); if (error != 0) goto done2; #endif error = shm_delete_mapping(p->p_vmspace, shmmap_s); done2: mtx_unlock(&Giant); return (error); } #ifndef _SYS_SYSPROTO_H_ struct shmat_args { int shmid; const void *shmaddr; int shmflg; }; #endif int kern_shmat(td, shmid, shmaddr, shmflg) struct thread *td; int shmid; const void *shmaddr; int shmflg; { struct proc *p = td->td_proc; int i, flags; struct shmid_kernel *shmseg; struct shmmap_state *shmmap_s = NULL; vm_offset_t attach_va; vm_prot_t prot; vm_size_t size; int rv; int error = 0; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); mtx_lock(&Giant); shmmap_s = p->p_vmspace->vm_shm; if (shmmap_s == NULL) { shmmap_s = malloc(shminfo.shmseg * sizeof(struct shmmap_state), M_SHM, M_WAITOK); for (i = 0; i < shminfo.shmseg; i++) shmmap_s[i].shmid = -1; p->p_vmspace->vm_shm = shmmap_s; } shmseg = shm_find_segment_by_shmid(shmid); if (shmseg == NULL) { error = EINVAL; goto done2; } error = ipcperm(td, &shmseg->u.shm_perm, (shmflg & SHM_RDONLY) ? IPC_R : IPC_R|IPC_W); if (error) goto done2; #ifdef MAC error = mac_sysvshm_check_shmat(td->td_ucred, shmseg, shmflg); if (error != 0) goto done2; #endif for (i = 0; i < shminfo.shmseg; i++) { if (shmmap_s->shmid == -1) break; shmmap_s++; } if (i >= shminfo.shmseg) { error = EMFILE; goto done2; } size = round_page(shmseg->shm_bsegsz); #ifdef VM_PROT_READ_IS_EXEC prot = VM_PROT_READ | VM_PROT_EXECUTE; #else prot = VM_PROT_READ; #endif if ((shmflg & SHM_RDONLY) == 0) prot |= VM_PROT_WRITE; flags = MAP_ANON | MAP_SHARED; if (shmaddr) { flags |= MAP_FIXED; if (shmflg & SHM_RND) { attach_va = (vm_offset_t)shmaddr & ~(SHMLBA-1); } else if (((vm_offset_t)shmaddr & (SHMLBA-1)) == 0) { attach_va = (vm_offset_t)shmaddr; } else { error = EINVAL; goto done2; } } else { /* * This is just a hint to vm_map_find() about where to * put it. */ PROC_LOCK(p); attach_va = round_page((vm_offset_t)p->p_vmspace->vm_daddr + lim_max(p, RLIMIT_DATA)); PROC_UNLOCK(p); } vm_object_reference(shmseg->u.shm_internal); rv = vm_map_find(&p->p_vmspace->vm_map, shmseg->u.shm_internal, 0, &attach_va, size, (flags & MAP_FIXED) ? 
VMFS_NO_SPACE : VMFS_ANY_SPACE, prot, prot, 0); if (rv != KERN_SUCCESS) { vm_object_deallocate(shmseg->u.shm_internal); error = ENOMEM; goto done2; } vm_map_inherit(&p->p_vmspace->vm_map, attach_va, attach_va + size, VM_INHERIT_SHARE); shmmap_s->va = attach_va; shmmap_s->shmid = shmid; shmseg->u.shm_lpid = p->p_pid; shmseg->u.shm_atime = time_second; shmseg->u.shm_nattch++; td->td_retval[0] = attach_va; done2: mtx_unlock(&Giant); return (error); } int shmat(td, uap) struct thread *td; struct shmat_args *uap; { return kern_shmat(td, uap->shmid, uap->shmaddr, uap->shmflg); } #if defined(__i386__) && (defined(COMPAT_FREEBSD4) || defined(COMPAT_43)) struct oshmid_ds { struct ipc_perm shm_perm; /* operation perms */ int shm_segsz; /* size of segment (bytes) */ u_short shm_cpid; /* pid, creator */ u_short shm_lpid; /* pid, last operation */ short shm_nattch; /* no. of current attaches */ time_t shm_atime; /* last attach time */ time_t shm_dtime; /* last detach time */ time_t shm_ctime; /* last change time */ void *shm_handle; /* internal handle for shm segment */ }; struct oshmctl_args { int shmid; int cmd; struct oshmid_ds *ubuf; }; static int oshmctl(td, uap) struct thread *td; struct oshmctl_args *uap; { #ifdef COMPAT_43 int error = 0; struct shmid_kernel *shmseg; struct oshmid_ds outbuf; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); mtx_lock(&Giant); shmseg = shm_find_segment_by_shmid(uap->shmid); if (shmseg == NULL) { error = EINVAL; goto done2; } switch (uap->cmd) { case IPC_STAT: error = ipcperm(td, &shmseg->u.shm_perm, IPC_R); if (error) goto done2; #ifdef MAC error = mac_sysvshm_check_shmctl(td->td_ucred, shmseg, uap->cmd); if (error != 0) goto done2; #endif outbuf.shm_perm = shmseg->u.shm_perm; outbuf.shm_segsz = shmseg->u.shm_segsz; outbuf.shm_cpid = shmseg->u.shm_cpid; outbuf.shm_lpid = shmseg->u.shm_lpid; outbuf.shm_nattch = shmseg->u.shm_nattch; outbuf.shm_atime = shmseg->u.shm_atime; outbuf.shm_dtime = shmseg->u.shm_dtime; outbuf.shm_ctime = shmseg->u.shm_ctime; outbuf.shm_handle = shmseg->u.shm_internal; error = copyout(&outbuf, uap->ubuf, sizeof(outbuf)); if (error) goto done2; break; default: error = shmctl(td, (struct shmctl_args *)uap); break; } done2: mtx_unlock(&Giant); return (error); #else return (EINVAL); #endif } #endif #ifndef _SYS_SYSPROTO_H_ struct shmctl_args { int shmid; int cmd; struct shmid_ds *buf; }; #endif int kern_shmctl(td, shmid, cmd, buf, bufsz) struct thread *td; int shmid; int cmd; void *buf; size_t *bufsz; { int error = 0; struct shmid_kernel *shmseg; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); mtx_lock(&Giant); switch (cmd) { /* * It is possible that kern_shmctl is being called from the Linux ABI * layer, in which case, we will need to implement IPC_INFO. It should * be noted that other shmctl calls will be funneled through here for * Linix binaries as well. * * NB: The Linux ABI layer will convert this data to structure(s) more * consistent with the Linux ABI. */ case IPC_INFO: memcpy(buf, &shminfo, sizeof(shminfo)); if (bufsz) *bufsz = sizeof(shminfo); td->td_retval[0] = shmalloced; goto done2; case SHM_INFO: { struct shm_info shm_info; shm_info.used_ids = shm_nused; shm_info.shm_rss = 0; /*XXX where to get from ? */ shm_info.shm_tot = 0; /*XXX where to get from ? */ shm_info.shm_swp = 0; /*XXX where to get from ? */ shm_info.swap_attempts = 0; /*XXX where to get from ? 
*/ shm_info.swap_successes = 0; /*XXX where to get from ? */ memcpy(buf, &shm_info, sizeof(shm_info)); if (bufsz) *bufsz = sizeof(shm_info); td->td_retval[0] = shmalloced; goto done2; } } if (cmd == SHM_STAT) shmseg = shm_find_segment_by_shmidx(shmid); else shmseg = shm_find_segment_by_shmid(shmid); if (shmseg == NULL) { error = EINVAL; goto done2; } #ifdef MAC error = mac_sysvshm_check_shmctl(td->td_ucred, shmseg, cmd); if (error != 0) goto done2; #endif switch (cmd) { case SHM_STAT: case IPC_STAT: error = ipcperm(td, &shmseg->u.shm_perm, IPC_R); if (error) goto done2; memcpy(buf, &shmseg->u, sizeof(struct shmid_ds)); if (bufsz) *bufsz = sizeof(struct shmid_ds); if (cmd == SHM_STAT) td->td_retval[0] = IXSEQ_TO_IPCID(shmid, shmseg->u.shm_perm); break; case IPC_SET: { struct shmid_ds *shmid; shmid = (struct shmid_ds *)buf; error = ipcperm(td, &shmseg->u.shm_perm, IPC_M); if (error) goto done2; shmseg->u.shm_perm.uid = shmid->shm_perm.uid; shmseg->u.shm_perm.gid = shmid->shm_perm.gid; shmseg->u.shm_perm.mode = (shmseg->u.shm_perm.mode & ~ACCESSPERMS) | (shmid->shm_perm.mode & ACCESSPERMS); shmseg->u.shm_ctime = time_second; break; } case IPC_RMID: error = ipcperm(td, &shmseg->u.shm_perm, IPC_M); if (error) goto done2; shmseg->u.shm_perm.key = IPC_PRIVATE; shmseg->u.shm_perm.mode |= SHMSEG_REMOVED; if (shmseg->u.shm_nattch <= 0) { shm_deallocate_segment(shmseg); shm_last_free = IPCID_TO_IX(shmid); } break; #if 0 case SHM_LOCK: case SHM_UNLOCK: #endif default: error = EINVAL; break; } done2: mtx_unlock(&Giant); return (error); } int shmctl(td, uap) struct thread *td; struct shmctl_args *uap; { int error = 0; struct shmid_ds buf; size_t bufsz; /* * The only reason IPC_INFO, SHM_INFO, SHM_STAT exists is to support * Linux binaries. If we see the call come through the FreeBSD ABI, * return an error back to the user since we do not to support this. */ if (uap->cmd == IPC_INFO || uap->cmd == SHM_INFO || uap->cmd == SHM_STAT) return (EINVAL); /* IPC_SET needs to copyin the buffer before calling kern_shmctl */ if (uap->cmd == IPC_SET) { if ((error = copyin(uap->buf, &buf, sizeof(struct shmid_ds)))) goto done; } error = kern_shmctl(td, uap->shmid, uap->cmd, (void *)&buf, &bufsz); if (error) goto done; /* Cases in which we need to copyout */ switch (uap->cmd) { case IPC_STAT: error = copyout(&buf, uap->buf, bufsz); break; } done: if (error) { /* Invalidate the return value */ td->td_retval[0] = -1; } return (error); } #ifndef _SYS_SYSPROTO_H_ struct shmget_args { key_t key; size_t size; int shmflg; }; #endif static int shmget_existing(td, uap, mode, segnum) struct thread *td; struct shmget_args *uap; int mode; int segnum; { struct shmid_kernel *shmseg; int error; shmseg = &shmsegs[segnum]; if (shmseg->u.shm_perm.mode & SHMSEG_REMOVED) { /* * This segment is in the process of being allocated. Wait * until it's done, and look the key up again (in case the * allocation failed or it was freed). 
*/ shmseg->u.shm_perm.mode |= SHMSEG_WANTED; error = tsleep(shmseg, PLOCK | PCATCH, "shmget", 0); if (error) return (error); return (EAGAIN); } if ((uap->shmflg & (IPC_CREAT | IPC_EXCL)) == (IPC_CREAT | IPC_EXCL)) return (EEXIST); #ifdef MAC error = mac_sysvshm_check_shmget(td->td_ucred, shmseg, uap->shmflg); if (error != 0) return (error); #endif if (uap->size != 0 && uap->size > shmseg->shm_bsegsz) return (EINVAL); td->td_retval[0] = IXSEQ_TO_IPCID(segnum, shmseg->u.shm_perm); return (0); } static int shmget_allocate_segment(td, uap, mode) struct thread *td; struct shmget_args *uap; int mode; { int i, segnum, shmid; size_t size; struct ucred *cred = td->td_ucred; struct shmid_kernel *shmseg; vm_object_t shm_object; GIANT_REQUIRED; if (uap->size < shminfo.shmmin || uap->size > shminfo.shmmax) return (EINVAL); if (shm_nused >= shminfo.shmmni) /* Any shmids left? */ return (ENOSPC); size = round_page(uap->size); if (shm_committed + btoc(size) > shminfo.shmall) return (ENOMEM); if (shm_last_free < 0) { shmrealloc(); /* Maybe expand the shmsegs[] array. */ for (i = 0; i < shmalloced; i++) if (shmsegs[i].u.shm_perm.mode & SHMSEG_FREE) break; if (i == shmalloced) return (ENOSPC); segnum = i; } else { segnum = shm_last_free; shm_last_free = -1; } shmseg = &shmsegs[segnum]; /* * In case we sleep in malloc(), mark the segment present but deleted * so that noone else tries to create the same key. */ shmseg->u.shm_perm.mode = SHMSEG_ALLOCATED | SHMSEG_REMOVED; shmseg->u.shm_perm.key = uap->key; shmseg->u.shm_perm.seq = (shmseg->u.shm_perm.seq + 1) & 0x7fff; shmid = IXSEQ_TO_IPCID(segnum, shmseg->u.shm_perm); /* * We make sure that we have allocated a pager before we need * to. */ if (shm_use_phys) { shm_object = vm_pager_allocate(OBJT_PHYS, 0, size, VM_PROT_DEFAULT, 0); } else { shm_object = vm_pager_allocate(OBJT_SWAP, 0, size, VM_PROT_DEFAULT, 0); } VM_OBJECT_LOCK(shm_object); vm_object_clear_flag(shm_object, OBJ_ONEMAPPING); vm_object_set_flag(shm_object, OBJ_NOSPLIT); VM_OBJECT_UNLOCK(shm_object); shmseg->u.shm_internal = shm_object; shmseg->u.shm_perm.cuid = shmseg->u.shm_perm.uid = cred->cr_uid; shmseg->u.shm_perm.cgid = shmseg->u.shm_perm.gid = cred->cr_gid; shmseg->u.shm_perm.mode = (shmseg->u.shm_perm.mode & SHMSEG_WANTED) | (mode & ACCESSPERMS) | SHMSEG_ALLOCATED; shmseg->u.shm_segsz = uap->size; shmseg->shm_bsegsz = uap->size; shmseg->u.shm_cpid = td->td_proc->p_pid; shmseg->u.shm_lpid = shmseg->u.shm_nattch = 0; shmseg->u.shm_atime = shmseg->u.shm_dtime = 0; #ifdef MAC mac_sysvshm_create(cred, shmseg); #endif shmseg->u.shm_ctime = time_second; shm_committed += btoc(size); shm_nused++; if (shmseg->u.shm_perm.mode & SHMSEG_WANTED) { /* * Somebody else wanted this key while we were asleep. Wake * them up now. */ shmseg->u.shm_perm.mode &= ~SHMSEG_WANTED; wakeup(shmseg); } td->td_retval[0] = shmid; return (0); } int shmget(td, uap) struct thread *td; struct shmget_args *uap; { int segnum, mode; int error; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); mtx_lock(&Giant); mode = uap->shmflg & ACCESSPERMS; if (uap->key != IPC_PRIVATE) { again: segnum = shm_find_segment_by_key(uap->key); if (segnum >= 0) { error = shmget_existing(td, uap, mode, segnum); if (error == EAGAIN) goto again; goto done2; } if ((uap->shmflg & IPC_CREAT) == 0) { error = ENOENT; goto done2; } } error = shmget_allocate_segment(td, uap, mode); done2: mtx_unlock(&Giant); return (error); } int shmsys(td, uap) struct thread *td; /* XXX actually varargs. 
*/ struct shmsys_args /* { int which; int a2; int a3; int a4; } */ *uap; { #if defined(__i386__) && (defined(COMPAT_FREEBSD4) || defined(COMPAT_43)) int error; - if (!jail_sysvipc_allowed && jailed(td->td_ucred)) + if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC)) return (ENOSYS); if (uap->which < 0 || uap->which >= sizeof(shmcalls)/sizeof(shmcalls[0])) return (EINVAL); mtx_lock(&Giant); error = (*shmcalls[uap->which])(td, &uap->a2); mtx_unlock(&Giant); return (error); #else return (nosys(td, NULL)); #endif } static void shmfork_myhook(p1, p2) struct proc *p1, *p2; { struct shmmap_state *shmmap_s; size_t size; int i; mtx_lock(&Giant); size = shminfo.shmseg * sizeof(struct shmmap_state); shmmap_s = malloc(size, M_SHM, M_WAITOK); bcopy(p1->p_vmspace->vm_shm, shmmap_s, size); p2->p_vmspace->vm_shm = shmmap_s; for (i = 0; i < shminfo.shmseg; i++, shmmap_s++) if (shmmap_s->shmid != -1) shmsegs[IPCID_TO_IX(shmmap_s->shmid)].u.shm_nattch++; mtx_unlock(&Giant); } static void shmexit_myhook(struct vmspace *vm) { struct shmmap_state *base, *shm; int i; if ((base = vm->vm_shm) != NULL) { vm->vm_shm = NULL; mtx_lock(&Giant); for (i = 0, shm = base; i < shminfo.shmseg; i++, shm++) { if (shm->shmid != -1) shm_delete_mapping(vm, shm); } mtx_unlock(&Giant); free(base, M_SHM); } } static void shmrealloc(void) { int i; struct shmid_kernel *newsegs; if (shmalloced >= shminfo.shmmni) return; newsegs = malloc(shminfo.shmmni * sizeof(*newsegs), M_SHM, M_WAITOK); if (newsegs == NULL) return; for (i = 0; i < shmalloced; i++) bcopy(&shmsegs[i], &newsegs[i], sizeof(newsegs[0])); for (; i < shminfo.shmmni; i++) { shmsegs[i].u.shm_perm.mode = SHMSEG_FREE; shmsegs[i].u.shm_perm.seq = 0; #ifdef MAC mac_sysvshm_init(&shmsegs[i]); #endif } free(shmsegs, M_SHM); shmsegs = newsegs; shmalloced = shminfo.shmmni; } static void shminit() { int i; TUNABLE_ULONG_FETCH("kern.ipc.shmmaxpgs", &shminfo.shmall); for (i = PAGE_SIZE; i > 0; i--) { shminfo.shmmax = shminfo.shmall * i; if (shminfo.shmmax >= shminfo.shmall) break; } TUNABLE_ULONG_FETCH("kern.ipc.shmmin", &shminfo.shmmin); TUNABLE_ULONG_FETCH("kern.ipc.shmmni", &shminfo.shmmni); TUNABLE_ULONG_FETCH("kern.ipc.shmseg", &shminfo.shmseg); TUNABLE_INT_FETCH("kern.ipc.shm_use_phys", &shm_use_phys); shmalloced = shminfo.shmmni; shmsegs = malloc(shmalloced * sizeof(shmsegs[0]), M_SHM, M_WAITOK); if (shmsegs == NULL) panic("cannot allocate initial memory for sysvshm"); for (i = 0; i < shmalloced; i++) { shmsegs[i].u.shm_perm.mode = SHMSEG_FREE; shmsegs[i].u.shm_perm.seq = 0; #ifdef MAC mac_sysvshm_init(&shmsegs[i]); #endif } shm_last_free = 0; shm_nused = 0; shm_committed = 0; shmexit_hook = &shmexit_myhook; shmfork_hook = &shmfork_myhook; } static int shmunload() { #ifdef MAC int i; #endif if (shm_nused > 0) return (EBUSY); #ifdef MAC for (i = 0; i < shmalloced; i++) mac_sysvshm_destroy(&shmsegs[i]); #endif free(shmsegs, M_SHM); shmexit_hook = NULL; shmfork_hook = NULL; return (0); } static int sysctl_shmsegs(SYSCTL_HANDLER_ARGS) { return (SYSCTL_OUT(req, shmsegs, shmalloced * sizeof(shmsegs[0]))); } static int sysvshm_modload(struct module *module, int cmd, void *arg) { int error = 0; switch (cmd) { case MOD_LOAD: shminit(); break; case MOD_UNLOAD: error = shmunload(); break; case MOD_SHUTDOWN: break; default: error = EINVAL; break; } return (error); } static moduledata_t sysvshm_mod = { "sysvshm", &sysvshm_modload, NULL }; SYSCALL_MODULE_HELPER(shmsys); SYSCALL_MODULE_HELPER(shmat); SYSCALL_MODULE_HELPER(shmctl); SYSCALL_MODULE_HELPER(shmdt); SYSCALL_MODULE_HELPER(shmget); 
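Throughout sysv_shm.c, as in sysv_sem.c earlier in this commit, the old global test (!jail_sysvipc_allowed && jailed(td->td_ucred)) is replaced by a per-prison permission bit. A minimal sketch of the new gate, using only the prison_allow() call visible in the + lines above; the wrapper name sysvipc_jail_check() is hypothetical and shown purely for illustration:

	static int
	sysvipc_jail_check(struct thread *td)
	{
		/*
		 * With hierarchical jails each prison carries its own
		 * allow flags, so the decision comes from the caller's
		 * credential rather than a single global toggle.
		 */
		if (!prison_allow(td->td_ucred, PR_ALLOW_SYSVIPC))
			return (ENOSYS);
		return (0);
	}

Each converted entry point performs this check first and returns ENOSYS when the caller's jail does not grant PR_ALLOW_SYSVIPC, the same failure mode as the test it replaces.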
DECLARE_MODULE(sysvshm, sysvshm_mod, SI_SUB_SYSV_SHM, SI_ORDER_FIRST); MODULE_VERSION(sysvshm, 1); Index: head/sys/kern/vfs_lookup.c =================================================================== --- head/sys/kern/vfs_lookup.c (revision 192894) +++ head/sys/kern/vfs_lookup.c (revision 192895) @@ -1,1206 +1,1213 @@ /*- * Copyright (c) 1982, 1986, 1989, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)vfs_lookup.c 8.4 (Berkeley) 2/16/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_kdtrace.h" #include "opt_ktrace.h" #include "opt_mac.h" #include #include #include #include +#include #include #include #include #include #include #include #include #include #include #include #ifdef KTRACE #include #endif #include #include #include #define NAMEI_DIAGNOSTIC 1 #undef NAMEI_DIAGNOSTIC SDT_PROVIDER_DECLARE(vfs); SDT_PROBE_DEFINE3(vfs, namei, lookup, entry, "struct vnode *", "char *", "unsigned long"); SDT_PROBE_DEFINE2(vfs, namei, lookup, return, "int", "struct vnode *"); /* * Allocation zone for namei */ uma_zone_t namei_zone; /* * Placeholder vnode for mp traversal */ static struct vnode *vp_crossmp; static void nameiinit(void *dummy __unused) { int error; namei_zone = uma_zcreate("NAMEI", MAXPATHLEN, NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); error = getnewvnode("crossmp", NULL, &dead_vnodeops, &vp_crossmp); if (error != 0) panic("nameiinit: getnewvnode"); VN_LOCK_ASHARE(vp_crossmp); } SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_SECOND, nameiinit, NULL); static int lookup_shared = 1; SYSCTL_INT(_vfs, OID_AUTO, lookup_shared, CTLFLAG_RW, &lookup_shared, 0, "Enables/Disables shared locks for path name translation"); TUNABLE_INT("vfs.lookup_shared", &lookup_shared); /* * Convert a pathname into a pointer to a locked vnode. 
* * The FOLLOW flag is set when symbolic links are to be followed * when they occur at the end of the name translation process. * Symbolic links are always followed for all other pathname * components other than the last. * * The segflg defines whether the name is to be copied from user * space or kernel space. * * Overall outline of namei: * * copy in name * get starting directory * while (!done && !error) { * call lookup to search path. * if symbolic link, massage name in buffer and continue * } */ int namei(struct nameidata *ndp) { struct filedesc *fdp; /* pointer to file descriptor state */ char *cp; /* pointer into pathname argument */ struct vnode *dp; /* the directory we are searching */ struct iovec aiov; /* uio for reading symbolic links */ struct uio auio; int error, linklen; struct componentname *cnp = &ndp->ni_cnd; struct thread *td = cnp->cn_thread; struct proc *p = td->td_proc; int vfslocked; KASSERT((cnp->cn_flags & MPSAFE) != 0 || mtx_owned(&Giant) != 0, ("NOT MPSAFE and Giant not held")); ndp->ni_cnd.cn_cred = ndp->ni_cnd.cn_thread->td_ucred; KASSERT(cnp->cn_cred && p, ("namei: bad cred/proc")); KASSERT((cnp->cn_nameiop & (~OPMASK)) == 0, ("namei: nameiop contaminated with flags")); KASSERT((cnp->cn_flags & OPMASK) == 0, ("namei: flags contaminated with nameiops")); if (!lookup_shared) cnp->cn_flags &= ~LOCKSHARED; fdp = p->p_fd; /* * Get a buffer for the name to be translated, and copy the * name into the buffer. */ if ((cnp->cn_flags & HASBUF) == 0) cnp->cn_pnbuf = uma_zalloc(namei_zone, M_WAITOK); if (ndp->ni_segflg == UIO_SYSSPACE) error = copystr(ndp->ni_dirp, cnp->cn_pnbuf, MAXPATHLEN, (size_t *)&ndp->ni_pathlen); else error = copyinstr(ndp->ni_dirp, cnp->cn_pnbuf, MAXPATHLEN, (size_t *)&ndp->ni_pathlen); /* If we are auditing the kernel pathname, save the user pathname. */ if (cnp->cn_flags & AUDITVNODE1) AUDIT_ARG(upath, td, cnp->cn_pnbuf, ARG_UPATH1); if (cnp->cn_flags & AUDITVNODE2) AUDIT_ARG(upath, td, cnp->cn_pnbuf, ARG_UPATH2); /* * Don't allow empty pathnames. */ if (!error && *cnp->cn_pnbuf == '\0') error = ENOENT; if (error) { uma_zfree(namei_zone, cnp->cn_pnbuf); #ifdef DIAGNOSTIC cnp->cn_pnbuf = NULL; cnp->cn_nameptr = NULL; #endif ndp->ni_vp = NULL; return (error); } ndp->ni_loopcnt = 0; #ifdef KTRACE if (KTRPOINT(td, KTR_NAMEI)) { KASSERT(cnp->cn_thread == curthread, ("namei not using curthread")); ktrnamei(cnp->cn_pnbuf); } #endif /* * Get starting point for the translation. */ FILEDESC_SLOCK(fdp); ndp->ni_rootdir = fdp->fd_rdir; ndp->ni_topdir = fdp->fd_jdir; dp = NULL; if (cnp->cn_pnbuf[0] != '/') { if (ndp->ni_startdir != NULL) { dp = ndp->ni_startdir; error = 0; } else if (ndp->ni_dirfd != AT_FDCWD) error = fgetvp(td, ndp->ni_dirfd, &dp); if (error != 0 || dp != NULL) { FILEDESC_SUNLOCK(fdp); if (error == 0 && dp->v_type != VDIR) { vfslocked = VFS_LOCK_GIANT(dp->v_mount); vrele(dp); VFS_UNLOCK_GIANT(vfslocked); error = ENOTDIR; } } if (error) { uma_zfree(namei_zone, cnp->cn_pnbuf); #ifdef DIAGNOSTIC cnp->cn_pnbuf = NULL; cnp->cn_nameptr = NULL; #endif return (error); } } if (dp == NULL) { dp = fdp->fd_cdir; VREF(dp); FILEDESC_SUNLOCK(fdp); if (ndp->ni_startdir != NULL) { vfslocked = VFS_LOCK_GIANT(ndp->ni_startdir->v_mount); vrele(ndp->ni_startdir); VFS_UNLOCK_GIANT(vfslocked); } } SDT_PROBE(vfs, namei, lookup, entry, dp, cnp->cn_pnbuf, cnp->cn_flags, 0, 0); vfslocked = VFS_LOCK_GIANT(dp->v_mount); for (;;) { /* * Check if root directory should replace current directory. * Done at start of translation and after symbolic link. 
*/ cnp->cn_nameptr = cnp->cn_pnbuf; if (*(cnp->cn_nameptr) == '/') { vrele(dp); VFS_UNLOCK_GIANT(vfslocked); while (*(cnp->cn_nameptr) == '/') { cnp->cn_nameptr++; ndp->ni_pathlen--; } dp = ndp->ni_rootdir; vfslocked = VFS_LOCK_GIANT(dp->v_mount); VREF(dp); } if (vfslocked) ndp->ni_cnd.cn_flags |= GIANTHELD; ndp->ni_startdir = dp; error = lookup(ndp); if (error) { uma_zfree(namei_zone, cnp->cn_pnbuf); #ifdef DIAGNOSTIC cnp->cn_pnbuf = NULL; cnp->cn_nameptr = NULL; #endif SDT_PROBE(vfs, namei, lookup, return, error, NULL, 0, 0, 0); return (error); } vfslocked = (ndp->ni_cnd.cn_flags & GIANTHELD) != 0; ndp->ni_cnd.cn_flags &= ~GIANTHELD; /* * Check for symbolic link */ if ((cnp->cn_flags & ISSYMLINK) == 0) { if ((cnp->cn_flags & (SAVENAME | SAVESTART)) == 0) { uma_zfree(namei_zone, cnp->cn_pnbuf); #ifdef DIAGNOSTIC cnp->cn_pnbuf = NULL; cnp->cn_nameptr = NULL; #endif } else cnp->cn_flags |= HASBUF; if ((cnp->cn_flags & MPSAFE) == 0) { VFS_UNLOCK_GIANT(vfslocked); } else if (vfslocked) ndp->ni_cnd.cn_flags |= GIANTHELD; SDT_PROBE(vfs, namei, lookup, return, 0, ndp->ni_vp, 0, 0, 0); return (0); } if (ndp->ni_loopcnt++ >= MAXSYMLINKS) { error = ELOOP; break; } #ifdef MAC if ((cnp->cn_flags & NOMACCHECK) == 0) { error = mac_vnode_check_readlink(td->td_ucred, ndp->ni_vp); if (error) break; } #endif if (ndp->ni_pathlen > 1) cp = uma_zalloc(namei_zone, M_WAITOK); else cp = cnp->cn_pnbuf; aiov.iov_base = cp; aiov.iov_len = MAXPATHLEN; auio.uio_iov = &aiov; auio.uio_iovcnt = 1; auio.uio_offset = 0; auio.uio_rw = UIO_READ; auio.uio_segflg = UIO_SYSSPACE; auio.uio_td = (struct thread *)0; auio.uio_resid = MAXPATHLEN; error = VOP_READLINK(ndp->ni_vp, &auio, cnp->cn_cred); if (error) { if (ndp->ni_pathlen > 1) uma_zfree(namei_zone, cp); break; } linklen = MAXPATHLEN - auio.uio_resid; if (linklen == 0) { if (ndp->ni_pathlen > 1) uma_zfree(namei_zone, cp); error = ENOENT; break; } if (linklen + ndp->ni_pathlen >= MAXPATHLEN) { if (ndp->ni_pathlen > 1) uma_zfree(namei_zone, cp); error = ENAMETOOLONG; break; } if (ndp->ni_pathlen > 1) { bcopy(ndp->ni_next, cp + linklen, ndp->ni_pathlen); uma_zfree(namei_zone, cnp->cn_pnbuf); cnp->cn_pnbuf = cp; } else cnp->cn_pnbuf[linklen] = '\0'; ndp->ni_pathlen += linklen; vput(ndp->ni_vp); dp = ndp->ni_dvp; } uma_zfree(namei_zone, cnp->cn_pnbuf); #ifdef DIAGNOSTIC cnp->cn_pnbuf = NULL; cnp->cn_nameptr = NULL; #endif vput(ndp->ni_vp); ndp->ni_vp = NULL; vrele(ndp->ni_dvp); VFS_UNLOCK_GIANT(vfslocked); SDT_PROBE(vfs, namei, lookup, return, error, NULL, 0, 0, 0); return (error); } static int compute_cn_lkflags(struct mount *mp, int lkflags) { if (mp == NULL || ((lkflags & LK_SHARED) && !(mp->mnt_kern_flag & MNTK_LOOKUP_SHARED))) { lkflags &= ~LK_SHARED; lkflags |= LK_EXCLUSIVE; } return (lkflags); } static __inline int needs_exclusive_leaf(struct mount *mp, int flags) { /* * Intermediate nodes can use shared locks, we only need to * force an exclusive lock for leaf nodes. */ if ((flags & (ISLASTCN | LOCKLEAF)) != (ISLASTCN | LOCKLEAF)) return (0); /* Always use exclusive locks if LOCKSHARED isn't set. */ if (!(flags & LOCKSHARED)) return (1); /* * For lookups during open(), if the mount point supports * extended shared operations, then use a shared lock for the * leaf node, otherwise use an exclusive lock. */ if (flags & ISOPEN) { if (mp != NULL && (mp->mnt_kern_flag & MNTK_EXTENDED_SHARED)) return (0); else return (1); } /* * Lookup requests outside of open() that specify LOCKSHARED * only need a shared lock on the leaf vnode. */ return (0); } /* * Search a pathname. 
* This is a very central and rather complicated routine. * * The pathname is pointed to by ni_ptr and is of length ni_pathlen. * The starting directory is taken from ni_startdir. The pathname is * descended until done, or a symbolic link is encountered. The variable * ni_more is clear if the path is completed; it is set to one if a * symbolic link needing interpretation is encountered. * * The flag argument is LOOKUP, CREATE, RENAME, or DELETE depending on * whether the name is to be looked up, created, renamed, or deleted. * When CREATE, RENAME, or DELETE is specified, information usable in * creating, renaming, or deleting a directory entry may be calculated. * If flag has LOCKPARENT or'ed into it, the parent directory is returned * locked. If flag has WANTPARENT or'ed into it, the parent directory is * returned unlocked. Otherwise the parent directory is not returned. If * the target of the pathname exists and LOCKLEAF is or'ed into the flag * the target is returned locked, otherwise it is returned unlocked. * When creating or renaming and LOCKPARENT is specified, the target may not * be ".". When deleting and LOCKPARENT is specified, the target may be ".". * * Overall outline of lookup: * * dirloop: * identify next component of name at ndp->ni_ptr * handle degenerate case where name is null string * if .. and crossing mount points and on mounted filesys, find parent * call VOP_LOOKUP routine for next component name * directory vnode returned in ni_dvp, unlocked unless LOCKPARENT set * component vnode returned in ni_vp (if it exists), locked. * if result vnode is mounted on and crossing mount points, * find mounted on vnode * if more components of name, do next level at dirloop * return the answer in ni_vp, locked if LOCKLEAF set * if LOCKPARENT set, return locked parent in ni_dvp * if WANTPARENT set, return unlocked parent in ni_dvp */ int lookup(struct nameidata *ndp) { char *cp; /* pointer into pathname argument */ struct vnode *dp = 0; /* the directory we are searching */ struct vnode *tdp; /* saved dp */ struct mount *mp; /* mount table entry */ + struct prison *pr; int docache; /* == 0 do not cache last component */ int wantparent; /* 1 => wantparent or lockparent flag */ int rdonly; /* lookup read-only flag bit */ int trailing_slash; int error = 0; int dpunlocked = 0; /* dp has already been unlocked */ struct componentname *cnp = &ndp->ni_cnd; int vfslocked; /* VFS Giant state for child */ int dvfslocked; /* VFS Giant state for parent */ int tvfslocked; int lkflags_save; #ifdef AUDIT struct thread *td = curthread; #endif /* * Setup: break out flag bits into variables. */ dvfslocked = (ndp->ni_cnd.cn_flags & GIANTHELD) != 0; vfslocked = 0; ndp->ni_cnd.cn_flags &= ~GIANTHELD; wantparent = cnp->cn_flags & (LOCKPARENT | WANTPARENT); KASSERT(cnp->cn_nameiop == LOOKUP || wantparent, ("CREATE, DELETE, RENAME require LOCKPARENT or WANTPARENT.")); docache = (cnp->cn_flags & NOCACHE) ^ NOCACHE; if (cnp->cn_nameiop == DELETE || (wantparent && cnp->cn_nameiop != CREATE && cnp->cn_nameiop != LOOKUP)) docache = 0; rdonly = cnp->cn_flags & RDONLY; cnp->cn_flags &= ~ISSYMLINK; ndp->ni_dvp = NULL; /* * We use shared locks until we hit the parent of the last cn then * we adjust based on the requesting flags. */ if (lookup_shared) cnp->cn_lkflags = LK_SHARED; else cnp->cn_lkflags = LK_EXCLUSIVE; dp = ndp->ni_startdir; ndp->ni_startdir = NULLVP; vn_lock(dp, compute_cn_lkflags(dp->v_mount, cnp->cn_lkflags | LK_RETRY)); dirloop: /* * Search a new directory. 
* * The last component of the filename is left accessible via * cnp->cn_nameptr for callers that need the name. Callers needing * the name set the SAVENAME flag. When done, they assume * responsibility for freeing the pathname buffer. */ cnp->cn_consume = 0; for (cp = cnp->cn_nameptr; *cp != 0 && *cp != '/'; cp++) continue; cnp->cn_namelen = cp - cnp->cn_nameptr; if (cnp->cn_namelen > NAME_MAX) { error = ENAMETOOLONG; goto bad; } #ifdef NAMEI_DIAGNOSTIC { char c = *cp; *cp = '\0'; printf("{%s}: ", cnp->cn_nameptr); *cp = c; } #endif ndp->ni_pathlen -= cnp->cn_namelen; ndp->ni_next = cp; /* * Replace multiple slashes by a single slash and trailing slashes * by a null. This must be done before VOP_LOOKUP() because some * fs's don't know about trailing slashes. Remember if there were * trailing slashes to handle symlinks, existing non-directories * and non-existing files that won't be directories specially later. */ trailing_slash = 0; while (*cp == '/' && (cp[1] == '/' || cp[1] == '\0')) { cp++; ndp->ni_pathlen--; if (*cp == '\0') { trailing_slash = 1; *ndp->ni_next = '\0'; /* XXX for direnter() ... */ } } ndp->ni_next = cp; cnp->cn_flags |= MAKEENTRY; if (*cp == '\0' && docache == 0) cnp->cn_flags &= ~MAKEENTRY; if (cnp->cn_namelen == 2 && cnp->cn_nameptr[1] == '.' && cnp->cn_nameptr[0] == '.') cnp->cn_flags |= ISDOTDOT; else cnp->cn_flags &= ~ISDOTDOT; if (*ndp->ni_next == 0) cnp->cn_flags |= ISLASTCN; else cnp->cn_flags &= ~ISLASTCN; /* * Check for degenerate name (e.g. / or "") * which is a way of talking about a directory, * e.g. like "/." or ".". */ if (cnp->cn_nameptr[0] == '\0') { if (dp->v_type != VDIR) { error = ENOTDIR; goto bad; } if (cnp->cn_nameiop != LOOKUP) { error = EISDIR; goto bad; } if (wantparent) { ndp->ni_dvp = dp; VREF(dp); } ndp->ni_vp = dp; if (cnp->cn_flags & AUDITVNODE1) AUDIT_ARG(vnode, dp, ARG_VNODE1); else if (cnp->cn_flags & AUDITVNODE2) AUDIT_ARG(vnode, dp, ARG_VNODE2); if (!(cnp->cn_flags & (LOCKPARENT | LOCKLEAF))) VOP_UNLOCK(dp, 0); /* XXX This should probably move to the top of function. */ if (cnp->cn_flags & SAVESTART) panic("lookup: SAVESTART"); goto success; } /* * Handle "..": four special cases. * 1. Return an error if this is the last component of * the name and the operation is DELETE or RENAME. * 2. If at root directory (e.g. after chroot) * or at absolute root directory * then ignore it so can't get out. * 3. If this vnode is the root of a mounted * filesystem, then replace it with the * vnode which was mounted on so we take the * .. in the other filesystem. * 4. If the vnode is the top directory of * the jail or chroot, don't let them out. 
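 * With this commit the jail test in case 4 is hierarchical: the
 * added loop below walks the credential's prison chain (cr_prison,
 * then each pr_parent) and treats the root vnode of any prison on
 * that chain as a barrier that ".." may not cross.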
*/ if (cnp->cn_flags & ISDOTDOT) { if ((cnp->cn_flags & ISLASTCN) != 0 && (cnp->cn_nameiop == DELETE || cnp->cn_nameiop == RENAME)) { error = EINVAL; goto bad; } for (;;) { + for (pr = cnp->cn_cred->cr_prison; pr != NULL; + pr = pr->pr_parent) + if (dp == pr->pr_root) + break; if (dp == ndp->ni_rootdir || dp == ndp->ni_topdir || dp == rootvnode || + pr != NULL || ((dp->v_vflag & VV_ROOT) != 0 && (cnp->cn_flags & NOCROSSMOUNT) != 0)) { ndp->ni_dvp = dp; ndp->ni_vp = dp; vfslocked = VFS_LOCK_GIANT(dp->v_mount); VREF(dp); goto nextname; } if ((dp->v_vflag & VV_ROOT) == 0) break; if (dp->v_iflag & VI_DOOMED) { /* forced unmount */ error = ENOENT; goto bad; } tdp = dp; dp = dp->v_mount->mnt_vnodecovered; tvfslocked = dvfslocked; dvfslocked = VFS_LOCK_GIANT(dp->v_mount); VREF(dp); vput(tdp); VFS_UNLOCK_GIANT(tvfslocked); vn_lock(dp, compute_cn_lkflags(dp->v_mount, cnp->cn_lkflags | LK_RETRY)); } } /* * We now have a segment name to search for, and a directory to search. */ unionlookup: #ifdef MAC if ((cnp->cn_flags & NOMACCHECK) == 0) { error = mac_vnode_check_lookup(cnp->cn_thread->td_ucred, dp, cnp); if (error) goto bad; } #endif ndp->ni_dvp = dp; ndp->ni_vp = NULL; ASSERT_VOP_LOCKED(dp, "lookup"); VNASSERT(vfslocked == 0, dp, ("lookup: vfslocked %d", vfslocked)); /* * If we have a shared lock we may need to upgrade the lock for the * last operation. */ if (dp != vp_crossmp && VOP_ISLOCKED(dp) == LK_SHARED && (cnp->cn_flags & ISLASTCN) && (cnp->cn_flags & LOCKPARENT)) vn_lock(dp, LK_UPGRADE|LK_RETRY); /* * If we're looking up the last component and we need an exclusive * lock, adjust our lkflags. */ if (needs_exclusive_leaf(dp->v_mount, cnp->cn_flags)) cnp->cn_lkflags = LK_EXCLUSIVE; #ifdef NAMEI_DIAGNOSTIC vprint("lookup in", dp); #endif lkflags_save = cnp->cn_lkflags; cnp->cn_lkflags = compute_cn_lkflags(dp->v_mount, cnp->cn_lkflags); if ((error = VOP_LOOKUP(dp, &ndp->ni_vp, cnp)) != 0) { cnp->cn_lkflags = lkflags_save; KASSERT(ndp->ni_vp == NULL, ("leaf should be empty")); #ifdef NAMEI_DIAGNOSTIC printf("not found\n"); #endif if ((error == ENOENT) && (dp->v_vflag & VV_ROOT) && (dp->v_mount != NULL) && (dp->v_mount->mnt_flag & MNT_UNION)) { tdp = dp; dp = dp->v_mount->mnt_vnodecovered; tvfslocked = dvfslocked; dvfslocked = VFS_LOCK_GIANT(dp->v_mount); VREF(dp); vput(tdp); VFS_UNLOCK_GIANT(tvfslocked); vn_lock(dp, compute_cn_lkflags(dp->v_mount, cnp->cn_lkflags | LK_RETRY)); goto unionlookup; } if (error != EJUSTRETURN) goto bad; /* * If creating and at end of pathname, then can consider * allowing file to be created. */ if (rdonly) { error = EROFS; goto bad; } if (*cp == '\0' && trailing_slash && !(cnp->cn_flags & WILLBEDIR)) { error = ENOENT; goto bad; } if ((cnp->cn_flags & LOCKPARENT) == 0) VOP_UNLOCK(dp, 0); /* * This is a temporary assert to make sure I know what the * behavior here was. */ KASSERT((cnp->cn_flags & (WANTPARENT|LOCKPARENT)) != 0, ("lookup: Unhandled case.")); /* * We return with ni_vp NULL to indicate that the entry * doesn't currently exist, leaving a pointer to the * (possibly locked) directory vnode in ndp->ni_dvp. */ if (cnp->cn_flags & SAVESTART) { ndp->ni_startdir = ndp->ni_dvp; VREF(ndp->ni_startdir); } goto success; } else cnp->cn_lkflags = lkflags_save; #ifdef NAMEI_DIAGNOSTIC printf("found\n"); #endif /* * Take into account any additional components consumed by * the underlying filesystem. 
*/ if (cnp->cn_consume > 0) { cnp->cn_nameptr += cnp->cn_consume; ndp->ni_next += cnp->cn_consume; ndp->ni_pathlen -= cnp->cn_consume; cnp->cn_consume = 0; } dp = ndp->ni_vp; vfslocked = VFS_LOCK_GIANT(dp->v_mount); /* * Check to see if the vnode has been mounted on; * if so find the root of the mounted filesystem. */ while (dp->v_type == VDIR && (mp = dp->v_mountedhere) && (cnp->cn_flags & NOCROSSMOUNT) == 0) { if (vfs_busy(mp, 0)) continue; vput(dp); VFS_UNLOCK_GIANT(vfslocked); vfslocked = VFS_LOCK_GIANT(mp); if (dp != ndp->ni_dvp) vput(ndp->ni_dvp); else vrele(ndp->ni_dvp); VFS_UNLOCK_GIANT(dvfslocked); dvfslocked = 0; vref(vp_crossmp); ndp->ni_dvp = vp_crossmp; error = VFS_ROOT(mp, compute_cn_lkflags(mp, cnp->cn_lkflags), &tdp); vfs_unbusy(mp); if (vn_lock(vp_crossmp, LK_SHARED | LK_NOWAIT)) panic("vp_crossmp exclusively locked or reclaimed"); if (error) { dpunlocked = 1; goto bad2; } ndp->ni_vp = dp = tdp; } /* * Check for symbolic link */ if ((dp->v_type == VLNK) && ((cnp->cn_flags & FOLLOW) || trailing_slash || *ndp->ni_next == '/')) { cnp->cn_flags |= ISSYMLINK; if (dp->v_iflag & VI_DOOMED) { /* * We can't know whether the directory was mounted with * NOSYMFOLLOW, so we can't follow safely. */ error = ENOENT; goto bad2; } if (dp->v_mount->mnt_flag & MNT_NOSYMFOLLOW) { error = EACCES; goto bad2; } /* * Symlink code always expects an unlocked dvp. */ if (ndp->ni_dvp != ndp->ni_vp) VOP_UNLOCK(ndp->ni_dvp, 0); goto success; } /* * Check for bogus trailing slashes. */ if (trailing_slash && dp->v_type != VDIR) { error = ENOTDIR; goto bad2; } nextname: /* * Not a symbolic link. If more pathname, * continue at next component, else return. */ KASSERT((cnp->cn_flags & ISLASTCN) || *ndp->ni_next == '/', ("lookup: invalid path state.")); if (*ndp->ni_next == '/') { cnp->cn_nameptr = ndp->ni_next; while (*cnp->cn_nameptr == '/') { cnp->cn_nameptr++; ndp->ni_pathlen--; } if (ndp->ni_dvp != dp) vput(ndp->ni_dvp); else vrele(ndp->ni_dvp); VFS_UNLOCK_GIANT(dvfslocked); dvfslocked = vfslocked; /* dp becomes dvp in dirloop */ vfslocked = 0; goto dirloop; } /* * Disallow directory write attempts on read-only filesystems. */ if (rdonly && (cnp->cn_nameiop == DELETE || cnp->cn_nameiop == RENAME)) { error = EROFS; goto bad2; } if (cnp->cn_flags & SAVESTART) { ndp->ni_startdir = ndp->ni_dvp; VREF(ndp->ni_startdir); } if (!wantparent) { if (ndp->ni_dvp != dp) vput(ndp->ni_dvp); else vrele(ndp->ni_dvp); VFS_UNLOCK_GIANT(dvfslocked); dvfslocked = 0; } else if ((cnp->cn_flags & LOCKPARENT) == 0 && ndp->ni_dvp != dp) VOP_UNLOCK(ndp->ni_dvp, 0); if (cnp->cn_flags & AUDITVNODE1) AUDIT_ARG(vnode, dp, ARG_VNODE1); else if (cnp->cn_flags & AUDITVNODE2) AUDIT_ARG(vnode, dp, ARG_VNODE2); if ((cnp->cn_flags & LOCKLEAF) == 0) VOP_UNLOCK(dp, 0); success: /* * Because of lookup_shared we may have the vnode shared locked, but * the caller may want it to be exclusively locked. 
*/ if (needs_exclusive_leaf(dp->v_mount, cnp->cn_flags) && VOP_ISLOCKED(dp) != LK_EXCLUSIVE) { vn_lock(dp, LK_UPGRADE | LK_RETRY); if (dp->v_iflag & VI_DOOMED) { error = ENOENT; goto bad2; } } if (vfslocked && dvfslocked) VFS_UNLOCK_GIANT(dvfslocked); /* Only need one */ if (vfslocked || dvfslocked) ndp->ni_cnd.cn_flags |= GIANTHELD; return (0); bad2: if (dp != ndp->ni_dvp) vput(ndp->ni_dvp); else vrele(ndp->ni_dvp); bad: if (!dpunlocked) vput(dp); VFS_UNLOCK_GIANT(vfslocked); VFS_UNLOCK_GIANT(dvfslocked); ndp->ni_cnd.cn_flags &= ~GIANTHELD; ndp->ni_vp = NULL; return (error); } /* * relookup - lookup a path name component * Used by lookup to re-acquire things. */ int relookup(struct vnode *dvp, struct vnode **vpp, struct componentname *cnp) { struct vnode *dp = 0; /* the directory we are searching */ int wantparent; /* 1 => wantparent or lockparent flag */ int rdonly; /* lookup read-only flag bit */ int error = 0; KASSERT(cnp->cn_flags & ISLASTCN, ("relookup: Not given last component.")); /* * Setup: break out flag bits into variables. */ wantparent = cnp->cn_flags & (LOCKPARENT|WANTPARENT); KASSERT(wantparent, ("relookup: parent not wanted.")); rdonly = cnp->cn_flags & RDONLY; cnp->cn_flags &= ~ISSYMLINK; dp = dvp; cnp->cn_lkflags = LK_EXCLUSIVE; vn_lock(dp, LK_EXCLUSIVE | LK_RETRY); /* * Search a new directory. * * The last component of the filename is left accessible via * cnp->cn_nameptr for callers that need the name. Callers needing * the name set the SAVENAME flag. When done, they assume * responsibility for freeing the pathname buffer. */ #ifdef NAMEI_DIAGNOSTIC printf("{%s}: ", cnp->cn_nameptr); #endif /* * Check for degenerate name (e.g. / or "") * which is a way of talking about a directory, * e.g. like "/." or ".". */ if (cnp->cn_nameptr[0] == '\0') { if (cnp->cn_nameiop != LOOKUP || wantparent) { error = EISDIR; goto bad; } if (dp->v_type != VDIR) { error = ENOTDIR; goto bad; } if (!(cnp->cn_flags & LOCKLEAF)) VOP_UNLOCK(dp, 0); *vpp = dp; /* XXX This should probably move to the top of function. */ if (cnp->cn_flags & SAVESTART) panic("lookup: SAVESTART"); return (0); } if (cnp->cn_flags & ISDOTDOT) panic ("relookup: lookup on dot-dot"); /* * We now have a segment name to search for, and a directory to search. */ #ifdef NAMEI_DIAGNOSTIC vprint("search in:", dp); #endif if ((error = VOP_LOOKUP(dp, vpp, cnp)) != 0) { KASSERT(*vpp == NULL, ("leaf should be empty")); if (error != EJUSTRETURN) goto bad; /* * If creating and at end of pathname, then can consider * allowing file to be created. */ if (rdonly) { error = EROFS; goto bad; } /* ASSERT(dvp == ndp->ni_startdir) */ if (cnp->cn_flags & SAVESTART) VREF(dvp); if ((cnp->cn_flags & LOCKPARENT) == 0) VOP_UNLOCK(dp, 0); /* * This is a temporary assert to make sure I know what the * behavior here was. */ KASSERT((cnp->cn_flags & (WANTPARENT|LOCKPARENT)) != 0, ("relookup: Unhandled case.")); /* * We return with ni_vp NULL to indicate that the entry * doesn't currently exist, leaving a pointer to the * (possibly locked) directory vnode in ndp->ni_dvp. */ return (0); } dp = *vpp; /* * Disallow directory write attempts on read-only filesystems. */ if (rdonly && (cnp->cn_nameiop == DELETE || cnp->cn_nameiop == RENAME)) { if (dvp == dp) vrele(dvp); else vput(dvp); error = EROFS; goto bad; } /* * Set the parent lock/ref state to the requested state. 
*/ if ((cnp->cn_flags & LOCKPARENT) == 0 && dvp != dp) { if (wantparent) VOP_UNLOCK(dvp, 0); else vput(dvp); } else if (!wantparent) vrele(dvp); /* * Check for symbolic link */ KASSERT(dp->v_type != VLNK || !(cnp->cn_flags & FOLLOW), ("relookup: symlink found.\n")); /* ASSERT(dvp == ndp->ni_startdir) */ if (cnp->cn_flags & SAVESTART) VREF(dvp); if ((cnp->cn_flags & LOCKLEAF) == 0) VOP_UNLOCK(dp, 0); return (0); bad: vput(dp); *vpp = NULL; return (error); } /* * Free data allocated by namei(); see namei(9) for details. */ void NDFREE(struct nameidata *ndp, const u_int flags) { int unlock_dvp; int unlock_vp; unlock_dvp = 0; unlock_vp = 0; if (!(flags & NDF_NO_FREE_PNBUF) && (ndp->ni_cnd.cn_flags & HASBUF)) { uma_zfree(namei_zone, ndp->ni_cnd.cn_pnbuf); ndp->ni_cnd.cn_flags &= ~HASBUF; } if (!(flags & NDF_NO_VP_UNLOCK) && (ndp->ni_cnd.cn_flags & LOCKLEAF) && ndp->ni_vp) unlock_vp = 1; if (!(flags & NDF_NO_VP_RELE) && ndp->ni_vp) { if (unlock_vp) { vput(ndp->ni_vp); unlock_vp = 0; } else vrele(ndp->ni_vp); ndp->ni_vp = NULL; } if (unlock_vp) VOP_UNLOCK(ndp->ni_vp, 0); if (!(flags & NDF_NO_DVP_UNLOCK) && (ndp->ni_cnd.cn_flags & LOCKPARENT) && ndp->ni_dvp != ndp->ni_vp) unlock_dvp = 1; if (!(flags & NDF_NO_DVP_RELE) && (ndp->ni_cnd.cn_flags & (LOCKPARENT|WANTPARENT))) { if (unlock_dvp) { vput(ndp->ni_dvp); unlock_dvp = 0; } else vrele(ndp->ni_dvp); ndp->ni_dvp = NULL; } if (unlock_dvp) VOP_UNLOCK(ndp->ni_dvp, 0); if (!(flags & NDF_NO_STARTDIR_RELE) && (ndp->ni_cnd.cn_flags & SAVESTART)) { vrele(ndp->ni_startdir); ndp->ni_startdir = NULL; } } /* * Determine if there is a suitable alternate filename under the specified * prefix for the specified path. If the create flag is set, then the * alternate prefix will be used so long as the parent directory exists. * This is used by the various compatiblity ABIs so that Linux binaries prefer * files under /compat/linux for example. The chosen path (whether under * the prefix or under /) is returned in a kernel malloc'd buffer pointed * to by pathbuf. The caller is responsible for free'ing the buffer from * the M_TEMP bucket if one is returned. */ int kern_alternate_path(struct thread *td, const char *prefix, const char *path, enum uio_seg pathseg, char **pathbuf, int create, int dirfd) { struct nameidata nd, ndroot; char *ptr, *buf, *cp; size_t len, sz; int error; buf = (char *) malloc(MAXPATHLEN, M_TEMP, M_WAITOK); *pathbuf = buf; /* Copy the prefix into the new pathname as a starting point. */ len = strlcpy(buf, prefix, MAXPATHLEN); if (len >= MAXPATHLEN) { *pathbuf = NULL; free(buf, M_TEMP); return (EINVAL); } sz = MAXPATHLEN - len; ptr = buf + len; /* Append the filename to the prefix. */ if (pathseg == UIO_SYSSPACE) error = copystr(path, ptr, sz, &len); else error = copyinstr(path, ptr, sz, &len); if (error) { *pathbuf = NULL; free(buf, M_TEMP); return (error); } /* Only use a prefix with absolute pathnames. */ if (*ptr != '/') { error = EINVAL; goto keeporig; } if (dirfd != AT_FDCWD) { /* * We want the original because the "prefix" is * included in the already opened dirfd. */ bcopy(ptr, buf, len); return (0); } /* * We know that there is a / somewhere in this pathname. * Search backwards for it, to find the file's parent dir * to see if it exists in the alternate tree. If it does, * and we want to create a file (cflag is set). We don't * need to worry about the root comparison in this case. 
*/ if (create) { for (cp = &ptr[len] - 1; *cp != '/'; cp--); *cp = '\0'; NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE, UIO_SYSSPACE, buf, td); error = namei(&nd); *cp = '/'; if (error != 0) goto keeporig; } else { NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE, UIO_SYSSPACE, buf, td); error = namei(&nd); if (error != 0) goto keeporig; /* * We now compare the vnode of the prefix to the one * vnode asked. If they resolve to be the same, then we * ignore the match so that the real root gets used. * This avoids the problem of traversing "../.." to find the * root directory and never finding it, because "/" resolves * to the emulation root directory. This is expensive :-( */ NDINIT(&ndroot, LOOKUP, FOLLOW | MPSAFE, UIO_SYSSPACE, prefix, td); /* We shouldn't ever get an error from this namei(). */ error = namei(&ndroot); if (error == 0) { if (nd.ni_vp == ndroot.ni_vp) error = ENOENT; NDFREE(&ndroot, NDF_ONLY_PNBUF); vrele(ndroot.ni_vp); VFS_UNLOCK_GIANT(NDHASGIANT(&ndroot)); } } NDFREE(&nd, NDF_ONLY_PNBUF); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(NDHASGIANT(&nd)); keeporig: /* If there was an error, use the original path name. */ if (error) bcopy(ptr, buf, len); return (error); } Index: head/sys/kern/vfs_mount.c =================================================================== --- head/sys/kern/vfs_mount.c (revision 192894) +++ head/sys/kern/vfs_mount.c (revision 192895) @@ -1,2399 +1,2404 @@ /*- * Copyright (c) 1999-2004 Poul-Henning Kamp * Copyright (c) 1999 Michael Smith * Copyright (c) 1989, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "opt_rootdevname.h" #include "opt_mac.h" #define ROOTNAME "root_device" #define VFS_MOUNTARG_SIZE_MAX (1024 * 64) static void set_rootvnode(void); static int vfs_domount(struct thread *td, const char *fstype, char *fspath, int fsflags, void *fsdata); static int vfs_mountroot_ask(void); static int vfs_mountroot_try(const char *mountfrom); static void free_mntarg(struct mntarg *ma); static int usermount = 0; SYSCTL_INT(_vfs, OID_AUTO, usermount, CTLFLAG_RW, &usermount, 0, "Unprivileged users may mount and unmount file systems"); MALLOC_DEFINE(M_MOUNT, "mount", "vfs mount structure"); MALLOC_DEFINE(M_VNODE_MARKER, "vnodemarker", "vnode marker"); static uma_zone_t mount_zone; /* List of mounted filesystems. */ struct mntlist mountlist = TAILQ_HEAD_INITIALIZER(mountlist); /* For any iteration/modification of mountlist */ struct mtx mountlist_mtx; MTX_SYSINIT(mountlist, &mountlist_mtx, "mountlist", MTX_DEF); /* * The vnode of the system's root (/ in the filesystem, without chroot * active.) */ struct vnode *rootvnode; /* * The root filesystem is detailed in the kernel environment variable * vfs.root.mountfrom, which is expected to be in the general format * * :[] * vfsname := the name of a VFS known to the kernel and capable * of being mounted as root * path := disk device name or other data used by the filesystem * to locate its physical store */ /* * Global opts, taken by all filesystems */ static const char *global_opts[] = { "errmsg", "fstype", "fspath", "ro", "rw", "nosuid", "noexec", NULL }; /* * The root specifiers we will try if RB_CDROM is specified. */ static char *cdrom_rootdevnames[] = { "cd9660:cd0", "cd9660:acd0", NULL }; /* legacy find-root code */ char *rootdevnames[2] = {NULL, NULL}; #ifndef ROOTDEVNAME # define ROOTDEVNAME NULL #endif static const char *ctrootdevname = ROOTDEVNAME; /* * --------------------------------------------------------------------- * Functions for building and sanitizing the mount options */ /* Remove one mount option. */ static void vfs_freeopt(struct vfsoptlist *opts, struct vfsopt *opt) { TAILQ_REMOVE(opts, opt, link); free(opt->name, M_MOUNT); if (opt->value != NULL) free(opt->value, M_MOUNT); free(opt, M_MOUNT); } /* Release all resources related to the mount options. */ void vfs_freeopts(struct vfsoptlist *opts) { struct vfsopt *opt; while (!TAILQ_EMPTY(opts)) { opt = TAILQ_FIRST(opts); vfs_freeopt(opts, opt); } free(opts, M_MOUNT); } void vfs_deleteopt(struct vfsoptlist *opts, const char *name) { struct vfsopt *opt, *temp; if (opts == NULL) return; TAILQ_FOREACH_SAFE(opt, opts, link, temp) { if (strcmp(opt->name, name) == 0) vfs_freeopt(opts, opt); } } /* * Check if options are equal (with or without the "no" prefix). */ static int vfs_equalopts(const char *opt1, const char *opt2) { char *p; /* "opt" vs. "opt" or "noopt" vs. "noopt" */ if (strcmp(opt1, opt2) == 0) return (1); /* "noopt" vs. "opt" */ if (strncmp(opt1, "no", 2) == 0 && strcmp(opt1 + 2, opt2) == 0) return (1); /* "opt" vs. "noopt" */ if (strncmp(opt2, "no", 2) == 0 && strcmp(opt1, opt2 + 2) == 0) return (1); while ((p = strchr(opt1, '.')) != NULL && !strncmp(opt1, opt2, ++p - opt1)) { opt2 += p - opt1; opt1 = p; /* "foo.noopt" vs. 
"foo.opt" */ if (strncmp(opt1, "no", 2) == 0 && strcmp(opt1 + 2, opt2) == 0) return (1); /* "foo.opt" vs. "foo.noopt" */ if (strncmp(opt2, "no", 2) == 0 && strcmp(opt1, opt2 + 2) == 0) return (1); } return (0); } /* * If a mount option is specified several times, * (with or without the "no" prefix) only keep * the last occurence of it. */ static void vfs_sanitizeopts(struct vfsoptlist *opts) { struct vfsopt *opt, *opt2, *tmp; TAILQ_FOREACH_REVERSE(opt, opts, vfsoptlist, link) { opt2 = TAILQ_PREV(opt, vfsoptlist, link); while (opt2 != NULL) { if (vfs_equalopts(opt->name, opt2->name)) { tmp = TAILQ_PREV(opt2, vfsoptlist, link); vfs_freeopt(opts, opt2); opt2 = tmp; } else { opt2 = TAILQ_PREV(opt2, vfsoptlist, link); } } } } /* * Build a linked list of mount options from a struct uio. */ int vfs_buildopts(struct uio *auio, struct vfsoptlist **options) { struct vfsoptlist *opts; struct vfsopt *opt; size_t memused, namelen, optlen; unsigned int i, iovcnt; int error; opts = malloc(sizeof(struct vfsoptlist), M_MOUNT, M_WAITOK); TAILQ_INIT(opts); memused = 0; iovcnt = auio->uio_iovcnt; for (i = 0; i < iovcnt; i += 2) { namelen = auio->uio_iov[i].iov_len; optlen = auio->uio_iov[i + 1].iov_len; memused += sizeof(struct vfsopt) + optlen + namelen; /* * Avoid consuming too much memory, and attempts to overflow * memused. */ if (memused > VFS_MOUNTARG_SIZE_MAX || optlen > VFS_MOUNTARG_SIZE_MAX || namelen > VFS_MOUNTARG_SIZE_MAX) { error = EINVAL; goto bad; } opt = malloc(sizeof(struct vfsopt), M_MOUNT, M_WAITOK); opt->name = malloc(namelen, M_MOUNT, M_WAITOK); opt->value = NULL; opt->len = 0; opt->pos = i / 2; opt->seen = 0; /* * Do this early, so jumps to "bad" will free the current * option. */ TAILQ_INSERT_TAIL(opts, opt, link); if (auio->uio_segflg == UIO_SYSSPACE) { bcopy(auio->uio_iov[i].iov_base, opt->name, namelen); } else { error = copyin(auio->uio_iov[i].iov_base, opt->name, namelen); if (error) goto bad; } /* Ensure names are null-terminated strings. */ if (namelen == 0 || opt->name[namelen - 1] != '\0') { error = EINVAL; goto bad; } if (optlen != 0) { opt->len = optlen; opt->value = malloc(optlen, M_MOUNT, M_WAITOK); if (auio->uio_segflg == UIO_SYSSPACE) { bcopy(auio->uio_iov[i + 1].iov_base, opt->value, optlen); } else { error = copyin(auio->uio_iov[i + 1].iov_base, opt->value, optlen); if (error) goto bad; } } } vfs_sanitizeopts(opts); *options = opts; return (0); bad: vfs_freeopts(opts); return (error); } /* * Merge the old mount options with the new ones passed * in the MNT_UPDATE case. * * XXX This function will keep a "nofoo" option in the * new options if there is no matching "foo" option * to be cancelled in the old options. This is a bug * if the option's canonical name is "foo". E.g., "noro" * shouldn't end up in the mount point's active options, * but it can. */ static void vfs_mergeopts(struct vfsoptlist *toopts, struct vfsoptlist *opts) { struct vfsopt *opt, *opt2, *new; TAILQ_FOREACH(opt, opts, link) { /* * Check that this option hasn't been redefined * nor cancelled with a "no" mount option. */ opt2 = TAILQ_FIRST(toopts); while (opt2 != NULL) { if (strcmp(opt2->name, opt->name) == 0) goto next; if (strncmp(opt2->name, "no", 2) == 0 && strcmp(opt2->name + 2, opt->name) == 0) { vfs_freeopt(toopts, opt2); goto next; } opt2 = TAILQ_NEXT(opt2, link); } /* We want this option, duplicate it. 
*/ new = malloc(sizeof(struct vfsopt), M_MOUNT, M_WAITOK); new->name = malloc(strlen(opt->name) + 1, M_MOUNT, M_WAITOK); strcpy(new->name, opt->name); if (opt->len != 0) { new->value = malloc(opt->len, M_MOUNT, M_WAITOK); bcopy(opt->value, new->value, opt->len); } else { new->value = NULL; } new->len = opt->len; new->seen = opt->seen; TAILQ_INSERT_TAIL(toopts, new, link); next: continue; } } /* * Mount a filesystem. */ int nmount(td, uap) struct thread *td; struct nmount_args /* { struct iovec *iovp; unsigned int iovcnt; int flags; } */ *uap; { struct uio *auio; int error; u_int iovcnt; AUDIT_ARG(fflags, uap->flags); CTR4(KTR_VFS, "%s: iovp %p with iovcnt %d and flags %d", __func__, uap->iovp, uap->iovcnt, uap->flags); /* * Filter out MNT_ROOTFS. We do not want clients of nmount() in * userspace to set this flag, but we must filter it out if we want * MNT_UPDATE on the root file system to work. * MNT_ROOTFS should only be set in the kernel in vfs_mountroot_try(). */ uap->flags &= ~MNT_ROOTFS; iovcnt = uap->iovcnt; /* * Check that we have an even number of iovec's * and that we have at least two options. */ if ((iovcnt & 1) || (iovcnt < 4)) { CTR2(KTR_VFS, "%s: failed for invalid iovcnt %d", __func__, uap->iovcnt); return (EINVAL); } error = copyinuio(uap->iovp, iovcnt, &auio); if (error) { CTR2(KTR_VFS, "%s: failed for invalid uio op with %d errno", __func__, error); return (error); } error = vfs_donmount(td, uap->flags, auio); free(auio, M_IOV); return (error); } /* * --------------------------------------------------------------------- * Various utility functions */ void vfs_ref(struct mount *mp) { CTR2(KTR_VFS, "%s: mp %p", __func__, mp); MNT_ILOCK(mp); MNT_REF(mp); MNT_IUNLOCK(mp); } void vfs_rel(struct mount *mp) { CTR2(KTR_VFS, "%s: mp %p", __func__, mp); MNT_ILOCK(mp); MNT_REL(mp); MNT_IUNLOCK(mp); } static int mount_init(void *mem, int size, int flags) { struct mount *mp; mp = (struct mount *)mem; mtx_init(&mp->mnt_mtx, "struct mount mtx", NULL, MTX_DEF); lockinit(&mp->mnt_explock, PVFS, "explock", 0, 0); return (0); } static void mount_fini(void *mem, int size) { struct mount *mp; mp = (struct mount *)mem; lockdestroy(&mp->mnt_explock); mtx_destroy(&mp->mnt_mtx); } /* * Allocate and initialize the mount point struct. */ struct mount * vfs_mount_alloc(struct vnode *vp, struct vfsconf *vfsp, const char *fspath, struct ucred *cred) { struct mount *mp; mp = uma_zalloc(mount_zone, M_WAITOK); bzero(&mp->mnt_startzero, __rangeof(struct mount, mnt_startzero, mnt_endzero)); TAILQ_INIT(&mp->mnt_nvnodelist); mp->mnt_nvnodelistsize = 0; mp->mnt_ref = 0; (void) vfs_busy(mp, MBF_NOWAIT); mp->mnt_op = vfsp->vfc_vfsops; mp->mnt_vfc = vfsp; vfsp->vfc_refcount++; /* XXX Unlocked */ mp->mnt_stat.f_type = vfsp->vfc_typenum; mp->mnt_gen++; strlcpy(mp->mnt_stat.f_fstypename, vfsp->vfc_name, MFSNAMELEN); mp->mnt_vnodecovered = vp; mp->mnt_cred = crdup(cred); mp->mnt_stat.f_owner = cred->cr_uid; strlcpy(mp->mnt_stat.f_mntonname, fspath, MNAMELEN); mp->mnt_iosize_max = DFLTPHYS; #ifdef MAC mac_mount_init(mp); mac_mount_create(cred, mp); #endif arc4rand(&mp->mnt_hashseed, sizeof mp->mnt_hashseed, 0); return (mp); } /* * Destroy the mount struct previously allocated by vfs_mount_alloc(). 
*/ void vfs_mount_destroy(struct mount *mp) { MNT_ILOCK(mp); mp->mnt_kern_flag |= MNTK_REFEXPIRE; if (mp->mnt_kern_flag & MNTK_MWAIT) { mp->mnt_kern_flag &= ~MNTK_MWAIT; wakeup(mp); } while (mp->mnt_ref) msleep(mp, MNT_MTX(mp), PVFS, "mntref", 0); KASSERT(mp->mnt_ref == 0, ("%s: invalid refcount in the drain path @ %s:%d", __func__, __FILE__, __LINE__)); if (mp->mnt_writeopcount != 0) panic("vfs_mount_destroy: nonzero writeopcount"); if (mp->mnt_secondary_writes != 0) panic("vfs_mount_destroy: nonzero secondary_writes"); mp->mnt_vfc->vfc_refcount--; if (!TAILQ_EMPTY(&mp->mnt_nvnodelist)) { struct vnode *vp; TAILQ_FOREACH(vp, &mp->mnt_nvnodelist, v_nmntvnodes) vprint("", vp); panic("unmount: dangling vnode"); } if (mp->mnt_nvnodelistsize != 0) panic("vfs_mount_destroy: nonzero nvnodelistsize"); if (mp->mnt_lockref != 0) panic("vfs_mount_destroy: nonzero lock refcount"); MNT_IUNLOCK(mp); #ifdef MAC mac_mount_destroy(mp); #endif if (mp->mnt_opt != NULL) vfs_freeopts(mp->mnt_opt); crfree(mp->mnt_cred); uma_zfree(mount_zone, mp); } int vfs_donmount(struct thread *td, int fsflags, struct uio *fsoptions) { struct vfsoptlist *optlist; struct vfsopt *opt, *noro_opt, *tmp_opt; char *fstype, *fspath, *errmsg; int error, fstypelen, fspathlen, errmsg_len, errmsg_pos; int has_rw, has_noro; errmsg = fspath = NULL; errmsg_len = has_noro = has_rw = fspathlen = 0; errmsg_pos = -1; error = vfs_buildopts(fsoptions, &optlist); if (error) return (error); if (vfs_getopt(optlist, "errmsg", (void **)&errmsg, &errmsg_len) == 0) errmsg_pos = vfs_getopt_pos(optlist, "errmsg"); /* * We need these two options before the others, * and they are mandatory for any filesystem. * Ensure they are NUL terminated as well. */ fstypelen = 0; error = vfs_getopt(optlist, "fstype", (void **)&fstype, &fstypelen); if (error || fstype[fstypelen - 1] != '\0') { error = EINVAL; if (errmsg != NULL) strncpy(errmsg, "Invalid fstype", errmsg_len); goto bail; } fspathlen = 0; error = vfs_getopt(optlist, "fspath", (void **)&fspath, &fspathlen); if (error || fspath[fspathlen - 1] != '\0') { error = EINVAL; if (errmsg != NULL) strncpy(errmsg, "Invalid fspath", errmsg_len); goto bail; } /* * We need to see if we have the "update" option * before we call vfs_domount(), since vfs_domount() has special * logic based on MNT_UPDATE. This is very important * when we want to update the root filesystem. 
*/ TAILQ_FOREACH_SAFE(opt, optlist, link, tmp_opt) { if (strcmp(opt->name, "update") == 0) { fsflags |= MNT_UPDATE; vfs_freeopt(optlist, opt); } else if (strcmp(opt->name, "async") == 0) fsflags |= MNT_ASYNC; else if (strcmp(opt->name, "force") == 0) { fsflags |= MNT_FORCE; vfs_freeopt(optlist, opt); } else if (strcmp(opt->name, "reload") == 0) { fsflags |= MNT_RELOAD; vfs_freeopt(optlist, opt); } else if (strcmp(opt->name, "multilabel") == 0) fsflags |= MNT_MULTILABEL; else if (strcmp(opt->name, "noasync") == 0) fsflags &= ~MNT_ASYNC; else if (strcmp(opt->name, "noatime") == 0) fsflags |= MNT_NOATIME; else if (strcmp(opt->name, "atime") == 0) { free(opt->name, M_MOUNT); opt->name = strdup("nonoatime", M_MOUNT); } else if (strcmp(opt->name, "noclusterr") == 0) fsflags |= MNT_NOCLUSTERR; else if (strcmp(opt->name, "clusterr") == 0) { free(opt->name, M_MOUNT); opt->name = strdup("nonoclusterr", M_MOUNT); } else if (strcmp(opt->name, "noclusterw") == 0) fsflags |= MNT_NOCLUSTERW; else if (strcmp(opt->name, "clusterw") == 0) { free(opt->name, M_MOUNT); opt->name = strdup("nonoclusterw", M_MOUNT); } else if (strcmp(opt->name, "noexec") == 0) fsflags |= MNT_NOEXEC; else if (strcmp(opt->name, "exec") == 0) { free(opt->name, M_MOUNT); opt->name = strdup("nonoexec", M_MOUNT); } else if (strcmp(opt->name, "nosuid") == 0) fsflags |= MNT_NOSUID; else if (strcmp(opt->name, "suid") == 0) { free(opt->name, M_MOUNT); opt->name = strdup("nonosuid", M_MOUNT); } else if (strcmp(opt->name, "nosymfollow") == 0) fsflags |= MNT_NOSYMFOLLOW; else if (strcmp(opt->name, "symfollow") == 0) { free(opt->name, M_MOUNT); opt->name = strdup("nonosymfollow", M_MOUNT); } else if (strcmp(opt->name, "noro") == 0) { fsflags &= ~MNT_RDONLY; has_noro = 1; } else if (strcmp(opt->name, "rw") == 0) { fsflags &= ~MNT_RDONLY; has_rw = 1; } else if (strcmp(opt->name, "ro") == 0) fsflags |= MNT_RDONLY; else if (strcmp(opt->name, "rdonly") == 0) { free(opt->name, M_MOUNT); opt->name = strdup("ro", M_MOUNT); fsflags |= MNT_RDONLY; } else if (strcmp(opt->name, "suiddir") == 0) fsflags |= MNT_SUIDDIR; else if (strcmp(opt->name, "sync") == 0) fsflags |= MNT_SYNCHRONOUS; else if (strcmp(opt->name, "union") == 0) fsflags |= MNT_UNION; } /* * If "rw" was specified as a mount option, and we * are trying to update a mount-point from "ro" to "rw", * we need a mount option "noro", since in vfs_mergeopts(), * "noro" will cancel "ro", but "rw" will not do anything. */ if (has_rw && !has_noro) { noro_opt = malloc(sizeof(struct vfsopt), M_MOUNT, M_WAITOK); noro_opt->name = strdup("noro", M_MOUNT); noro_opt->value = NULL; noro_opt->len = 0; noro_opt->pos = -1; noro_opt->seen = 1; TAILQ_INSERT_TAIL(optlist, noro_opt, link); } /* * Be ultra-paranoid about making sure the type and fspath * variables will fit in our mp buffers, including the * terminating NUL. */ if (fstypelen >= MFSNAMELEN - 1 || fspathlen >= MNAMELEN - 1) { error = ENAMETOOLONG; goto bail; } mtx_lock(&Giant); error = vfs_domount(td, fstype, fspath, fsflags, optlist); mtx_unlock(&Giant); bail: /* copyout the errmsg */ if (errmsg_pos != -1 && ((2 * errmsg_pos + 1) < fsoptions->uio_iovcnt) && errmsg_len > 0 && errmsg != NULL) { if (fsoptions->uio_segflg == UIO_SYSSPACE) { bcopy(errmsg, fsoptions->uio_iov[2 * errmsg_pos + 1].iov_base, fsoptions->uio_iov[2 * errmsg_pos + 1].iov_len); } else { copyout(errmsg, fsoptions->uio_iov[2 * errmsg_pos + 1].iov_base, fsoptions->uio_iov[2 * errmsg_pos + 1].iov_len); } } if (error != 0) vfs_freeopts(optlist); return (error); } /* * Old mount API. 
*/ #ifndef _SYS_SYSPROTO_H_ struct mount_args { char *type; char *path; int flags; caddr_t data; }; #endif /* ARGSUSED */ int mount(td, uap) struct thread *td; struct mount_args /* { char *type; char *path; int flags; caddr_t data; } */ *uap; { char *fstype; struct vfsconf *vfsp = NULL; struct mntarg *ma = NULL; int error; AUDIT_ARG(fflags, uap->flags); /* * Filter out MNT_ROOTFS. We do not want clients of mount() in * userspace to set this flag, but we must filter it out if we want * MNT_UPDATE on the root file system to work. * MNT_ROOTFS should only be set in the kernel in vfs_mountroot_try(). */ uap->flags &= ~MNT_ROOTFS; fstype = malloc(MFSNAMELEN, M_TEMP, M_WAITOK); error = copyinstr(uap->type, fstype, MFSNAMELEN, NULL); if (error) { free(fstype, M_TEMP); return (error); } AUDIT_ARG(text, fstype); mtx_lock(&Giant); vfsp = vfs_byname_kld(fstype, td, &error); free(fstype, M_TEMP); if (vfsp == NULL) { mtx_unlock(&Giant); return (ENOENT); } if (vfsp->vfc_vfsops->vfs_cmount == NULL) { mtx_unlock(&Giant); return (EOPNOTSUPP); } ma = mount_argsu(ma, "fstype", uap->type, MNAMELEN); ma = mount_argsu(ma, "fspath", uap->path, MNAMELEN); ma = mount_argb(ma, uap->flags & MNT_RDONLY, "noro"); ma = mount_argb(ma, !(uap->flags & MNT_NOSUID), "nosuid"); ma = mount_argb(ma, !(uap->flags & MNT_NOEXEC), "noexec"); error = vfsp->vfc_vfsops->vfs_cmount(ma, uap->data, uap->flags); mtx_unlock(&Giant); return (error); } /* * vfs_domount(): actually attempt a filesystem mount. */ static int vfs_domount( struct thread *td, /* Calling thread. */ const char *fstype, /* Filesystem type. */ char *fspath, /* Mount path. */ int fsflags, /* Flags common to all filesystems. */ void *fsdata /* Options local to the filesystem. */ ) { struct vnode *vp; struct mount *mp; struct vfsconf *vfsp; struct oexport_args oexport; struct export_args export; int error, flag = 0; struct vattr va; struct nameidata nd; mtx_assert(&Giant, MA_OWNED); /* * Be ultra-paranoid about making sure the type and fspath * variables will fit in our mp buffers, including the * terminating NUL. */ if (strlen(fstype) >= MFSNAMELEN || strlen(fspath) >= MNAMELEN) return (ENAMETOOLONG); if (jailed(td->td_ucred) || usermount == 0) { if ((error = priv_check(td, PRIV_VFS_MOUNT)) != 0) return (error); } /* * Do not allow NFS export or MNT_SUIDDIR by unprivileged users. */ if (fsflags & MNT_EXPORTED) { error = priv_check(td, PRIV_VFS_MOUNT_EXPORTED); if (error) return (error); } if (fsflags & MNT_SUIDDIR) { error = priv_check(td, PRIV_VFS_MOUNT_SUIDDIR); if (error) return (error); } /* * Silently enforce MNT_NOSUID and MNT_USER for unprivileged users. */ if ((fsflags & (MNT_NOSUID | MNT_USER)) != (MNT_NOSUID | MNT_USER)) { if (priv_check(td, PRIV_VFS_MOUNT_NONUSER) != 0) fsflags |= MNT_NOSUID | MNT_USER; } /* Load KLDs before we lock the covered vnode to avoid reversals. */ vfsp = NULL; if ((fsflags & MNT_UPDATE) == 0) { /* Don't try to load KLDs if we're mounting the root. 
*/ if (fsflags & MNT_ROOTFS) vfsp = vfs_byname(fstype); else vfsp = vfs_byname_kld(fstype, td, &error); if (vfsp == NULL) return (ENODEV); if (jailed(td->td_ucred) && !(vfsp->vfc_flags & VFCF_JAIL)) return (EPERM); } /* * Get vnode to be covered */ NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | AUDITVNODE1, UIO_SYSSPACE, fspath, td); if ((error = namei(&nd)) != 0) return (error); NDFREE(&nd, NDF_ONLY_PNBUF); vp = nd.ni_vp; if (fsflags & MNT_UPDATE) { if ((vp->v_vflag & VV_ROOT) == 0) { vput(vp); return (EINVAL); } mp = vp->v_mount; MNT_ILOCK(mp); flag = mp->mnt_flag; /* * We only allow the filesystem to be reloaded if it * is currently mounted read-only. */ if ((fsflags & MNT_RELOAD) && ((mp->mnt_flag & MNT_RDONLY) == 0)) { MNT_IUNLOCK(mp); vput(vp); return (EOPNOTSUPP); /* Needs translation */ } MNT_IUNLOCK(mp); /* * Only privileged root, or (if MNT_USER is set) the user that * did the original mount is permitted to update it. */ error = vfs_suser(mp, td); if (error) { vput(vp); return (error); } if (vfs_busy(mp, MBF_NOWAIT)) { vput(vp); return (EBUSY); } VI_LOCK(vp); if ((vp->v_iflag & VI_MOUNT) != 0 || vp->v_mountedhere != NULL) { VI_UNLOCK(vp); vfs_unbusy(mp); vput(vp); return (EBUSY); } vp->v_iflag |= VI_MOUNT; VI_UNLOCK(vp); MNT_ILOCK(mp); mp->mnt_flag |= fsflags & (MNT_RELOAD | MNT_FORCE | MNT_UPDATE | MNT_SNAPSHOT | MNT_ROOTFS); MNT_IUNLOCK(mp); VOP_UNLOCK(vp, 0); mp->mnt_optnew = fsdata; vfs_mergeopts(mp->mnt_optnew, mp->mnt_opt); } else { /* * If the user is not root, ensure that they own the directory * onto which we are attempting to mount. */ error = VOP_GETATTR(vp, &va, td->td_ucred); if (error) { vput(vp); return (error); } if (va.va_uid != td->td_ucred->cr_uid) { error = priv_check_cred(td->td_ucred, PRIV_VFS_ADMIN, 0); if (error) { vput(vp); return (error); } } error = vinvalbuf(vp, V_SAVE, 0, 0); if (error != 0) { vput(vp); return (error); } if (vp->v_type != VDIR) { vput(vp); return (ENOTDIR); } VI_LOCK(vp); if ((vp->v_iflag & VI_MOUNT) != 0 || vp->v_mountedhere != NULL) { VI_UNLOCK(vp); vput(vp); return (EBUSY); } vp->v_iflag |= VI_MOUNT; VI_UNLOCK(vp); /* * Allocate and initialize the filesystem. */ mp = vfs_mount_alloc(vp, vfsp, fspath, td->td_ucred); VOP_UNLOCK(vp, 0); /* XXXMAC: pass to vfs_mount_alloc? */ mp->mnt_optnew = fsdata; } /* * Set the mount level flags. */ MNT_ILOCK(mp); mp->mnt_flag = (mp->mnt_flag & ~MNT_UPDATEMASK) | (fsflags & (MNT_UPDATEMASK | MNT_FORCE | MNT_ROOTFS | MNT_RDONLY)); if ((mp->mnt_flag & MNT_ASYNC) == 0) mp->mnt_kern_flag &= ~MNTK_ASYNC; MNT_IUNLOCK(mp); /* * Mount the filesystem. * XXX The final recipients of VFS_MOUNT just overwrite the ndp they * get. No freeing of cn_pnbuf. */ error = VFS_MOUNT(mp); /* * Process the export option only if we are * updating mount options. 
*/ if (!error && (fsflags & MNT_UPDATE)) { if (vfs_copyopt(mp->mnt_optnew, "export", &export, sizeof(export)) == 0) error = vfs_export(mp, &export); else if (vfs_copyopt(mp->mnt_optnew, "export", &oexport, sizeof(oexport)) == 0) { export.ex_flags = oexport.ex_flags; export.ex_root = oexport.ex_root; export.ex_anon = oexport.ex_anon; export.ex_addr = oexport.ex_addr; export.ex_addrlen = oexport.ex_addrlen; export.ex_mask = oexport.ex_mask; export.ex_masklen = oexport.ex_masklen; export.ex_indexfile = oexport.ex_indexfile; export.ex_numsecflavors = 0; error = vfs_export(mp, &export); } } if (!error) { if (mp->mnt_opt != NULL) vfs_freeopts(mp->mnt_opt); mp->mnt_opt = mp->mnt_optnew; (void)VFS_STATFS(mp, &mp->mnt_stat); } /* * Prevent external consumers of mount options from reading * mnt_optnew. */ mp->mnt_optnew = NULL; if (mp->mnt_flag & MNT_UPDATE) { MNT_ILOCK(mp); if (error) mp->mnt_flag = (mp->mnt_flag & MNT_QUOTA) | (flag & ~MNT_QUOTA); else mp->mnt_flag &= ~(MNT_UPDATE | MNT_RELOAD | MNT_FORCE | MNT_SNAPSHOT); if ((mp->mnt_flag & MNT_ASYNC) != 0 && mp->mnt_noasync == 0) mp->mnt_kern_flag |= MNTK_ASYNC; else mp->mnt_kern_flag &= ~MNTK_ASYNC; MNT_IUNLOCK(mp); if ((mp->mnt_flag & MNT_RDONLY) == 0) { if (mp->mnt_syncer == NULL) error = vfs_allocate_syncvnode(mp); } else { if (mp->mnt_syncer != NULL) vrele(mp->mnt_syncer); mp->mnt_syncer = NULL; } vfs_unbusy(mp); VI_LOCK(vp); vp->v_iflag &= ~VI_MOUNT; VI_UNLOCK(vp); vrele(vp); return (error); } MNT_ILOCK(mp); if ((mp->mnt_flag & MNT_ASYNC) != 0 && mp->mnt_noasync == 0) mp->mnt_kern_flag |= MNTK_ASYNC; else mp->mnt_kern_flag &= ~MNTK_ASYNC; MNT_IUNLOCK(mp); vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); /* * Put the new filesystem on the mount list after root. */ cache_purge(vp); if (!error) { struct vnode *newdp; VI_LOCK(vp); vp->v_iflag &= ~VI_MOUNT; VI_UNLOCK(vp); vp->v_mountedhere = mp; mtx_lock(&mountlist_mtx); TAILQ_INSERT_TAIL(&mountlist, mp, mnt_list); mtx_unlock(&mountlist_mtx); vfs_event_signal(NULL, VQ_MOUNT, 0); if (VFS_ROOT(mp, LK_EXCLUSIVE, &newdp)) panic("mount: lost mount"); mountcheckdirs(vp, newdp); vput(newdp); VOP_UNLOCK(vp, 0); if ((mp->mnt_flag & MNT_RDONLY) == 0) error = vfs_allocate_syncvnode(mp); vfs_unbusy(mp); if (error) vrele(vp); } else { VI_LOCK(vp); vp->v_iflag &= ~VI_MOUNT; VI_UNLOCK(vp); vfs_unbusy(mp); vfs_mount_destroy(mp); vput(vp); } return (error); } /* * Unmount a filesystem. * * Note: unmount takes a path to the vnode mounted on as argument, not * special file (as before). */ #ifndef _SYS_SYSPROTO_H_ struct unmount_args { char *path; int flags; }; #endif /* ARGSUSED */ int unmount(td, uap) struct thread *td; register struct unmount_args /* { char *path; int flags; } */ *uap; { struct mount *mp; char *pathbuf; int error, id0, id1; if (jailed(td->td_ucred) || usermount == 0) { error = priv_check(td, PRIV_VFS_UNMOUNT); if (error) return (error); } pathbuf = malloc(MNAMELEN, M_TEMP, M_WAITOK); error = copyinstr(uap->path, pathbuf, MNAMELEN, NULL); if (error) { free(pathbuf, M_TEMP); return (error); } AUDIT_ARG(upath, td, pathbuf, ARG_UPATH1); mtx_lock(&Giant); if (uap->flags & MNT_BYFSID) { /* Decode the filesystem ID. 
*/ if (sscanf(pathbuf, "FSID:%d:%d", &id0, &id1) != 2) { mtx_unlock(&Giant); free(pathbuf, M_TEMP); return (EINVAL); } mtx_lock(&mountlist_mtx); TAILQ_FOREACH_REVERSE(mp, &mountlist, mntlist, mnt_list) { if (mp->mnt_stat.f_fsid.val[0] == id0 && mp->mnt_stat.f_fsid.val[1] == id1) break; } mtx_unlock(&mountlist_mtx); } else { mtx_lock(&mountlist_mtx); TAILQ_FOREACH_REVERSE(mp, &mountlist, mntlist, mnt_list) { if (strcmp(mp->mnt_stat.f_mntonname, pathbuf) == 0) break; } mtx_unlock(&mountlist_mtx); } free(pathbuf, M_TEMP); if (mp == NULL) { /* * Previously we returned ENOENT for a nonexistent path and * EINVAL for a non-mountpoint. We cannot tell these apart * now, so in the !MNT_BYFSID case return the more likely * EINVAL for compatibility. */ mtx_unlock(&Giant); return ((uap->flags & MNT_BYFSID) ? ENOENT : EINVAL); } /* * Don't allow unmounting the root filesystem. */ if (mp->mnt_flag & MNT_ROOTFS) { mtx_unlock(&Giant); return (EINVAL); } error = dounmount(mp, uap->flags, td); mtx_unlock(&Giant); return (error); } /* * Do the actual filesystem unmount. */ int dounmount(mp, flags, td) struct mount *mp; int flags; struct thread *td; { struct vnode *coveredvp, *fsrootvp; int error; int async_flag; int mnt_gen_r; mtx_assert(&Giant, MA_OWNED); if ((coveredvp = mp->mnt_vnodecovered) != NULL) { mnt_gen_r = mp->mnt_gen; VI_LOCK(coveredvp); vholdl(coveredvp); vn_lock(coveredvp, LK_EXCLUSIVE | LK_INTERLOCK | LK_RETRY); vdrop(coveredvp); /* * Check for mp being unmounted while waiting for the * covered vnode lock. */ if (coveredvp->v_mountedhere != mp || coveredvp->v_mountedhere->mnt_gen != mnt_gen_r) { VOP_UNLOCK(coveredvp, 0); return (EBUSY); } } /* * Only privileged root, or (if MNT_USER is set) the user that did the * original mount is permitted to unmount this filesystem. */ error = vfs_suser(mp, td); if (error) { if (coveredvp) VOP_UNLOCK(coveredvp, 0); return (error); } MNT_ILOCK(mp); if (mp->mnt_kern_flag & MNTK_UNMOUNT) { MNT_IUNLOCK(mp); if (coveredvp) VOP_UNLOCK(coveredvp, 0); return (EBUSY); } mp->mnt_kern_flag |= MNTK_UNMOUNT | MNTK_NOINSMNTQ; /* Allow filesystems to detect that a forced unmount is in progress. */ if (flags & MNT_FORCE) mp->mnt_kern_flag |= MNTK_UNMOUNTF; error = 0; if (mp->mnt_lockref) { if ((flags & MNT_FORCE) == 0) { mp->mnt_kern_flag &= ~(MNTK_UNMOUNT | MNTK_NOINSMNTQ | MNTK_UNMOUNTF); if (mp->mnt_kern_flag & MNTK_MWAIT) { mp->mnt_kern_flag &= ~MNTK_MWAIT; wakeup(mp); } MNT_IUNLOCK(mp); if (coveredvp) VOP_UNLOCK(coveredvp, 0); return (EBUSY); } mp->mnt_kern_flag |= MNTK_DRAINING; error = msleep(&mp->mnt_lockref, MNT_MTX(mp), PVFS, "mount drain", 0); } MNT_IUNLOCK(mp); KASSERT(mp->mnt_lockref == 0, ("%s: invalid lock refcount in the drain path @ %s:%d", __func__, __FILE__, __LINE__)); KASSERT(error == 0, ("%s: invalid return value for msleep in the drain path @ %s:%d", __func__, __FILE__, __LINE__)); vn_start_write(NULL, &mp, V_WAIT); if (mp->mnt_flag & MNT_EXPUBLIC) vfs_setpublicfs(NULL, NULL, NULL); vfs_msync(mp, MNT_WAIT); MNT_ILOCK(mp); async_flag = mp->mnt_flag & MNT_ASYNC; mp->mnt_flag &= ~MNT_ASYNC; mp->mnt_kern_flag &= ~MNTK_ASYNC; MNT_IUNLOCK(mp); cache_purgevfs(mp); /* remove cache entries for this file sys */ if (mp->mnt_syncer != NULL) vrele(mp->mnt_syncer); /* * For forced unmounts, move process cdir/rdir refs on the fs root * vnode to the covered vnode. For non-forced unmounts we want * such references to cause an EBUSY error. 
*/ if ((flags & MNT_FORCE) && VFS_ROOT(mp, LK_EXCLUSIVE, &fsrootvp) == 0) { if (mp->mnt_vnodecovered != NULL) mountcheckdirs(fsrootvp, mp->mnt_vnodecovered); if (fsrootvp == rootvnode) { vrele(rootvnode); rootvnode = NULL; } vput(fsrootvp); } if (((mp->mnt_flag & MNT_RDONLY) || (error = VFS_SYNC(mp, MNT_WAIT)) == 0) || (flags & MNT_FORCE) != 0) error = VFS_UNMOUNT(mp, flags); vn_finished_write(mp); /* * If we failed to flush the dirty blocks for this mount point, * undo all the cdir/rdir and rootvnode changes we made above. * Unless we failed to do so because the device is reporting that * it doesn't exist anymore. */ if (error && error != ENXIO) { if ((flags & MNT_FORCE) && VFS_ROOT(mp, LK_EXCLUSIVE, &fsrootvp) == 0) { if (mp->mnt_vnodecovered != NULL) mountcheckdirs(mp->mnt_vnodecovered, fsrootvp); if (rootvnode == NULL) { rootvnode = fsrootvp; vref(rootvnode); } vput(fsrootvp); } MNT_ILOCK(mp); mp->mnt_kern_flag &= ~MNTK_NOINSMNTQ; if ((mp->mnt_flag & MNT_RDONLY) == 0 && mp->mnt_syncer == NULL) { MNT_IUNLOCK(mp); (void) vfs_allocate_syncvnode(mp); MNT_ILOCK(mp); } mp->mnt_kern_flag &= ~(MNTK_UNMOUNT | MNTK_UNMOUNTF); mp->mnt_flag |= async_flag; if ((mp->mnt_flag & MNT_ASYNC) != 0 && mp->mnt_noasync == 0) mp->mnt_kern_flag |= MNTK_ASYNC; if (mp->mnt_kern_flag & MNTK_MWAIT) { mp->mnt_kern_flag &= ~MNTK_MWAIT; wakeup(mp); } MNT_IUNLOCK(mp); if (coveredvp) VOP_UNLOCK(coveredvp, 0); return (error); } mtx_lock(&mountlist_mtx); TAILQ_REMOVE(&mountlist, mp, mnt_list); mtx_unlock(&mountlist_mtx); if (coveredvp != NULL) { coveredvp->v_mountedhere = NULL; vput(coveredvp); } vfs_event_signal(NULL, VQ_UNMOUNT, 0); vfs_mount_destroy(mp); return (0); } /* * --------------------------------------------------------------------- * Mounting of root filesystem * */ struct root_hold_token { const char *who; LIST_ENTRY(root_hold_token) list; }; static LIST_HEAD(, root_hold_token) root_holds = LIST_HEAD_INITIALIZER(&root_holds); static int root_mount_complete; /* * Hold root mount. */ struct root_hold_token * root_mount_hold(const char *identifier) { struct root_hold_token *h; if (root_mounted()) return (NULL); h = malloc(sizeof *h, M_DEVBUF, M_ZERO | M_WAITOK); h->who = identifier; mtx_lock(&mountlist_mtx); LIST_INSERT_HEAD(&root_holds, h, list); mtx_unlock(&mountlist_mtx); return (h); } /* * Release root mount. */ void root_mount_rel(struct root_hold_token *h) { if (h == NULL) return; mtx_lock(&mountlist_mtx); LIST_REMOVE(h, list); wakeup(&root_holds); mtx_unlock(&mountlist_mtx); free(h, M_DEVBUF); } /* * Wait for all subsystems to release root mount. */ static void root_mount_prepare(void) { struct root_hold_token *h; struct timeval lastfail; int curfail = 0; for (;;) { DROP_GIANT(); g_waitidle(); PICKUP_GIANT(); mtx_lock(&mountlist_mtx); if (LIST_EMPTY(&root_holds)) { mtx_unlock(&mountlist_mtx); break; } if (ppsratecheck(&lastfail, &curfail, 1)) { printf("Root mount waiting for:"); LIST_FOREACH(h, &root_holds, list) printf(" %s", h->who); printf("\n"); } msleep(&root_holds, &mountlist_mtx, PZERO | PDROP, "roothold", hz); } } /* * Root was mounted, share the good news. */ static void root_mount_done(void) { + /* Keep prison0's root in sync with the global rootvnode. */ + mtx_lock(&prison0.pr_mtx); + prison0.pr_root = rootvnode; + vref(prison0.pr_root); + mtx_unlock(&prison0.pr_mtx); /* * Use a mutex to prevent the wakeup being missed and waiting for * an extra 1 second sleep. 
*/ mtx_lock(&mountlist_mtx); root_mount_complete = 1; wakeup(&root_mount_complete); mtx_unlock(&mountlist_mtx); } /* * Return true if root is already mounted. */ int root_mounted(void) { /* No mutex is acquired here because int stores are atomic. */ return (root_mount_complete); } /* * Wait until root is mounted. */ void root_mount_wait(void) { /* * Panic on an obvious deadlock - the function can't be called from * a thread which is doing the whole SYSINIT stuff. */ KASSERT(curthread->td_proc->p_pid != 0, ("root_mount_wait: cannot be called from the swapper thread")); mtx_lock(&mountlist_mtx); while (!root_mount_complete) { msleep(&root_mount_complete, &mountlist_mtx, PZERO, "rootwait", hz); } mtx_unlock(&mountlist_mtx); } static void set_rootvnode() { struct proc *p; if (VFS_ROOT(TAILQ_FIRST(&mountlist), LK_EXCLUSIVE, &rootvnode)) panic("Cannot find root vnode"); p = curthread->td_proc; FILEDESC_XLOCK(p->p_fd); if (p->p_fd->fd_cdir != NULL) vrele(p->p_fd->fd_cdir); p->p_fd->fd_cdir = rootvnode; VREF(rootvnode); if (p->p_fd->fd_rdir != NULL) vrele(p->p_fd->fd_rdir); p->p_fd->fd_rdir = rootvnode; VREF(rootvnode); FILEDESC_XUNLOCK(p->p_fd); VOP_UNLOCK(rootvnode, 0); EVENTHANDLER_INVOKE(mountroot); } /* * Mount /devfs as our root filesystem, but do not put it on the mountlist * yet. Create a /dev -> / symlink so that absolute pathnames will lookup. */ static void devfs_first(void) { struct thread *td = curthread; struct vfsoptlist *opts; struct vfsconf *vfsp; struct mount *mp = NULL; int error; vfsp = vfs_byname("devfs"); KASSERT(vfsp != NULL, ("Could not find devfs by name")); if (vfsp == NULL) return; mp = vfs_mount_alloc(NULLVP, vfsp, "/dev", td->td_ucred); error = VFS_MOUNT(mp); KASSERT(error == 0, ("VFS_MOUNT(devfs) failed %d", error)); if (error) return; opts = malloc(sizeof(struct vfsoptlist), M_MOUNT, M_WAITOK); TAILQ_INIT(opts); mp->mnt_opt = opts; mtx_lock(&mountlist_mtx); TAILQ_INSERT_HEAD(&mountlist, mp, mnt_list); mtx_unlock(&mountlist_mtx); set_rootvnode(); error = kern_symlink(td, "/", "dev", UIO_SYSSPACE); if (error) printf("kern_symlink /dev -> / returns %d\n", error); } /* * Surgically move our devfs to be mounted on /dev. */ static void devfs_fixup(struct thread *td) { struct nameidata nd; int error; struct vnode *vp, *dvp; struct mount *mp; /* Remove our devfs mount from the mountlist and purge the cache */ mtx_lock(&mountlist_mtx); mp = TAILQ_FIRST(&mountlist); TAILQ_REMOVE(&mountlist, mp, mnt_list); mtx_unlock(&mountlist_mtx); cache_purgevfs(mp); VFS_ROOT(mp, LK_EXCLUSIVE, &dvp); VI_LOCK(dvp); dvp->v_iflag &= ~VI_MOUNT; VI_UNLOCK(dvp); dvp->v_mountedhere = NULL; /* Set up the real rootvnode, and purge the cache */ TAILQ_FIRST(&mountlist)->mnt_vnodecovered = NULL; set_rootvnode(); cache_purgevfs(rootvnode->v_mount); NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF, UIO_SYSSPACE, "/dev", td); error = namei(&nd); if (error) { printf("Lookup of /dev for devfs, error: %d\n", error); return; } NDFREE(&nd, NDF_ONLY_PNBUF); vp = nd.ni_vp; if (vp->v_type != VDIR) { vput(vp); } error = vinvalbuf(vp, V_SAVE, 0, 0); if (error) { vput(vp); } cache_purge(vp); mp->mnt_vnodecovered = vp; vp->v_mountedhere = mp; mtx_lock(&mountlist_mtx); TAILQ_INSERT_TAIL(&mountlist, mp, mnt_list); mtx_unlock(&mountlist_mtx); VOP_UNLOCK(vp, 0); vput(dvp); vfs_unbusy(mp); /* Unlink the no longer needed /dev/dev -> / symlink */ kern_unlink(td, "/dev/dev", UIO_SYSSPACE); } /* * Report errors during filesystem mounting. */ void vfs_mount_error(struct mount *mp, const char *fmt, ...) 
{ struct vfsoptlist *moptlist = mp->mnt_optnew; va_list ap; int error, len; char *errmsg; error = vfs_getopt(moptlist, "errmsg", (void **)&errmsg, &len); if (error || errmsg == NULL || len <= 0) return; va_start(ap, fmt); vsnprintf(errmsg, (size_t)len, fmt, ap); va_end(ap); } void vfs_opterror(struct vfsoptlist *opts, const char *fmt, ...) { va_list ap; int error, len; char *errmsg; error = vfs_getopt(opts, "errmsg", (void **)&errmsg, &len); if (error || errmsg == NULL || len <= 0) return; va_start(ap, fmt); vsnprintf(errmsg, (size_t)len, fmt, ap); va_end(ap); } /* * Find and mount the root filesystem */ void vfs_mountroot(void) { char *cp; int error, i, asked = 0; root_mount_prepare(); mount_zone = uma_zcreate("Mountpoints", sizeof(struct mount), NULL, NULL, mount_init, mount_fini, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); devfs_first(); /* * We are booted with instructions to prompt for the root filesystem. */ if (boothowto & RB_ASKNAME) { if (!vfs_mountroot_ask()) goto mounted; asked = 1; } /* * The root filesystem information is compiled in, and we are * booted with instructions to use it. */ if (ctrootdevname != NULL && (boothowto & RB_DFLTROOT)) { if (!vfs_mountroot_try(ctrootdevname)) goto mounted; ctrootdevname = NULL; } /* * We've been given the generic "use CDROM as root" flag. This is * necessary because one media may be used in many different * devices, so we need to search for them. */ if (boothowto & RB_CDROM) { for (i = 0; cdrom_rootdevnames[i] != NULL; i++) { if (!vfs_mountroot_try(cdrom_rootdevnames[i])) goto mounted; } } /* * Try to use the value read by the loader from /etc/fstab, or * supplied via some other means. This is the preferred * mechanism. */ cp = getenv("vfs.root.mountfrom"); if (cp != NULL) { error = vfs_mountroot_try(cp); freeenv(cp); if (!error) goto mounted; } /* * Try values that may have been computed by code during boot */ if (!vfs_mountroot_try(rootdevnames[0])) goto mounted; if (!vfs_mountroot_try(rootdevnames[1])) goto mounted; /* * If we (still) have a compiled-in default, try it. */ if (ctrootdevname != NULL) if (!vfs_mountroot_try(ctrootdevname)) goto mounted; /* * Everything so far has failed, prompt on the console if we haven't * already tried that. */ if (!asked) if (!vfs_mountroot_ask()) goto mounted; panic("Root mount failed, startup aborted."); mounted: root_mount_done(); } /* * Mount (mountfrom) as the root filesystem. */ static int vfs_mountroot_try(const char *mountfrom) { struct mount *mp; char *vfsname, *path; time_t timebase; int error; char patt[32]; vfsname = NULL; path = NULL; mp = NULL; error = EINVAL; if (mountfrom == NULL) return (error); /* don't complain */ printf("Trying to mount root from %s\n", mountfrom); /* parse vfs name and path */ vfsname = malloc(MFSNAMELEN, M_MOUNT, M_WAITOK); path = malloc(MNAMELEN, M_MOUNT, M_WAITOK); vfsname[0] = path[0] = 0; sprintf(patt, "%%%d[a-z0-9]:%%%ds", MFSNAMELEN, MNAMELEN); if (sscanf(mountfrom, patt, vfsname, path) < 1) goto out; if (path[0] == '\0') strcpy(path, ROOTNAME); error = kernel_vmount( MNT_RDONLY | MNT_ROOTFS, "fstype", vfsname, "fspath", "/", "from", path, NULL); if (error == 0) { /* * We mount devfs prior to mounting the / FS, so the first * entry will typically be devfs. */ mp = TAILQ_FIRST(&mountlist); KASSERT(mp != NULL, ("%s: mountlist is empty", __func__)); /* * Iterate over all currently mounted file systems and use * the time stamp found to check and/or initialize the RTC. * Typically devfs has no time stamp and the only other FS * is the actual / FS. 
* Call inittodr() only once and pass it the largest of the * timestamps we encounter. */ timebase = 0; do { if (mp->mnt_time > timebase) timebase = mp->mnt_time; mp = TAILQ_NEXT(mp, mnt_list); } while (mp != NULL); inittodr(timebase); devfs_fixup(curthread); } out: free(path, M_MOUNT); free(vfsname, M_MOUNT); return (error); } /* * --------------------------------------------------------------------- * Interactive root filesystem selection code. */ static int vfs_mountroot_ask(void) { char name[128]; for(;;) { printf("\nManual root filesystem specification:\n"); printf(" : Mount using filesystem \n"); #if defined(__amd64__) || defined(__i386__) || defined(__ia64__) printf(" eg. ufs:da0s1a\n"); #else printf(" eg. ufs:/dev/da0a\n"); #endif printf(" ? List valid disk boot devices\n"); printf(" Abort manual input\n"); printf("\nmountroot> "); gets(name, sizeof(name), 1); if (name[0] == '\0') return (1); if (name[0] == '?') { printf("\nList of GEOM managed disk devices:\n "); g_dev_print(); continue; } if (!vfs_mountroot_try(name)) return (0); } } /* * --------------------------------------------------------------------- * Functions for querying mount options/arguments from filesystems. */ /* * Check that no unknown options are given */ int vfs_filteropt(struct vfsoptlist *opts, const char **legal) { struct vfsopt *opt; char errmsg[255]; const char **t, *p, *q; int ret = 0; TAILQ_FOREACH(opt, opts, link) { p = opt->name; q = NULL; if (p[0] == 'n' && p[1] == 'o') q = p + 2; for(t = global_opts; *t != NULL; t++) { if (strcmp(*t, p) == 0) break; if (q != NULL) { if (strcmp(*t, q) == 0) break; } } if (*t != NULL) continue; for(t = legal; *t != NULL; t++) { if (strcmp(*t, p) == 0) break; if (q != NULL) { if (strcmp(*t, q) == 0) break; } } if (*t != NULL) continue; snprintf(errmsg, sizeof(errmsg), "mount option <%s> is unknown", p); printf("%s\n", errmsg); ret = EINVAL; } if (ret != 0) { TAILQ_FOREACH(opt, opts, link) { if (strcmp(opt->name, "errmsg") == 0) { strncpy((char *)opt->value, errmsg, opt->len); } } } return (ret); } /* * Get a mount option by its name. * * Return 0 if the option was found, ENOENT otherwise. * If len is non-NULL it will be filled with the length * of the option. If buf is non-NULL, it will be filled * with the address of the option. 
*/ int vfs_getopt(opts, name, buf, len) struct vfsoptlist *opts; const char *name; void **buf; int *len; { struct vfsopt *opt; KASSERT(opts != NULL, ("vfs_getopt: caller passed 'opts' as NULL")); TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) == 0) { opt->seen = 1; if (len != NULL) *len = opt->len; if (buf != NULL) *buf = opt->value; return (0); } } return (ENOENT); } int vfs_getopt_pos(struct vfsoptlist *opts, const char *name) { struct vfsopt *opt; if (opts == NULL) return (-1); TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) == 0) { opt->seen = 1; return (opt->pos); } } return (-1); } char * vfs_getopts(struct vfsoptlist *opts, const char *name, int *error) { struct vfsopt *opt; *error = 0; TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) != 0) continue; opt->seen = 1; if (opt->len == 0 || ((char *)opt->value)[opt->len - 1] != '\0') { *error = EINVAL; return (NULL); } return (opt->value); } *error = ENOENT; return (NULL); } int vfs_flagopt(struct vfsoptlist *opts, const char *name, u_int *w, u_int val) { struct vfsopt *opt; TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) == 0) { opt->seen = 1; if (w != NULL) *w |= val; return (1); } } if (w != NULL) *w &= ~val; return (0); } int vfs_scanopt(struct vfsoptlist *opts, const char *name, const char *fmt, ...) { va_list ap; struct vfsopt *opt; int ret; KASSERT(opts != NULL, ("vfs_getopt: caller passed 'opts' as NULL")); TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) != 0) continue; opt->seen = 1; if (opt->len == 0 || opt->value == NULL) return (0); if (((char *)opt->value)[opt->len - 1] != '\0') return (0); va_start(ap, fmt); ret = vsscanf(opt->value, fmt, ap); va_end(ap); return (ret); } return (0); } int vfs_setopt(struct vfsoptlist *opts, const char *name, void *value, int len) { struct vfsopt *opt; TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) != 0) continue; opt->seen = 1; if (opt->value == NULL) opt->len = len; else { if (opt->len != len) return (EINVAL); bcopy(value, opt->value, len); } return (0); } return (ENOENT); } int vfs_setopt_part(struct vfsoptlist *opts, const char *name, void *value, int len) { struct vfsopt *opt; TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) != 0) continue; opt->seen = 1; if (opt->value == NULL) opt->len = len; else { if (opt->len < len) return (EINVAL); opt->len = len; bcopy(value, opt->value, len); } return (0); } return (ENOENT); } int vfs_setopts(struct vfsoptlist *opts, const char *name, const char *value) { struct vfsopt *opt; TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) != 0) continue; opt->seen = 1; if (opt->value == NULL) opt->len = strlen(value) + 1; else if (strlcpy(opt->value, value, opt->len) >= opt->len) return (EINVAL); return (0); } return (ENOENT); } /* * Find and copy a mount option. * * The size of the buffer has to be specified * in len, if it is not the same length as the * mount option, EINVAL is returned. * Returns ENOENT if the option is not found. */ int vfs_copyopt(opts, name, dest, len) struct vfsoptlist *opts; const char *name; void *dest; int len; { struct vfsopt *opt; KASSERT(opts != NULL, ("vfs_copyopt: caller passed 'opts' as NULL")); TAILQ_FOREACH(opt, opts, link) { if (strcmp(name, opt->name) == 0) { opt->seen = 1; if (len != opt->len) return (EINVAL); bcopy(opt->value, dest, opt->len); return (0); } } return (ENOENT); } /* * This is a helper function for filesystems to traverse their * vnodes. 
See MNT_VNODE_FOREACH() in sys/mount.h */ struct vnode * __mnt_vnode_next(struct vnode **mvp, struct mount *mp) { struct vnode *vp; mtx_assert(MNT_MTX(mp), MA_OWNED); KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch")); if ((*mvp)->v_yield++ == 500) { MNT_IUNLOCK(mp); (*mvp)->v_yield = 0; uio_yield(); MNT_ILOCK(mp); } vp = TAILQ_NEXT(*mvp, v_nmntvnodes); while (vp != NULL && vp->v_type == VMARKER) vp = TAILQ_NEXT(vp, v_nmntvnodes); /* Check if we are done */ if (vp == NULL) { __mnt_vnode_markerfree(mvp, mp); return (NULL); } TAILQ_REMOVE(&mp->mnt_nvnodelist, *mvp, v_nmntvnodes); TAILQ_INSERT_AFTER(&mp->mnt_nvnodelist, vp, *mvp, v_nmntvnodes); return (vp); } struct vnode * __mnt_vnode_first(struct vnode **mvp, struct mount *mp) { struct vnode *vp; mtx_assert(MNT_MTX(mp), MA_OWNED); vp = TAILQ_FIRST(&mp->mnt_nvnodelist); while (vp != NULL && vp->v_type == VMARKER) vp = TAILQ_NEXT(vp, v_nmntvnodes); /* Check if we are done */ if (vp == NULL) { *mvp = NULL; return (NULL); } MNT_REF(mp); MNT_IUNLOCK(mp); *mvp = (struct vnode *) malloc(sizeof(struct vnode), M_VNODE_MARKER, M_WAITOK | M_ZERO); MNT_ILOCK(mp); (*mvp)->v_type = VMARKER; vp = TAILQ_FIRST(&mp->mnt_nvnodelist); while (vp != NULL && vp->v_type == VMARKER) vp = TAILQ_NEXT(vp, v_nmntvnodes); /* Check if we are done */ if (vp == NULL) { MNT_IUNLOCK(mp); free(*mvp, M_VNODE_MARKER); MNT_ILOCK(mp); *mvp = NULL; MNT_REL(mp); return (NULL); } (*mvp)->v_mount = mp; TAILQ_INSERT_AFTER(&mp->mnt_nvnodelist, vp, *mvp, v_nmntvnodes); return (vp); } void __mnt_vnode_markerfree(struct vnode **mvp, struct mount *mp) { if (*mvp == NULL) return; mtx_assert(MNT_MTX(mp), MA_OWNED); KASSERT((*mvp)->v_mount == mp, ("marker vnode mount list mismatch")); TAILQ_REMOVE(&mp->mnt_nvnodelist, *mvp, v_nmntvnodes); MNT_IUNLOCK(mp); free(*mvp, M_VNODE_MARKER); MNT_ILOCK(mp); *mvp = NULL; MNT_REL(mp); } int __vfs_statfs(struct mount *mp, struct statfs *sbp) { int error; error = mp->mnt_op->vfs_statfs(mp, &mp->mnt_stat); if (sbp != &mp->mnt_stat) *sbp = mp->mnt_stat; return (error); } void vfs_mountedfrom(struct mount *mp, const char *from) { bzero(mp->mnt_stat.f_mntfromname, sizeof mp->mnt_stat.f_mntfromname); strlcpy(mp->mnt_stat.f_mntfromname, from, sizeof mp->mnt_stat.f_mntfromname); } /* * --------------------------------------------------------------------- * This is the api for building mount args and mounting filesystems from * inside the kernel. * * The API works by accumulation of individual args. First error is * latched. * * XXX: should be documented in new manpage kernel_mount(9) */ /* A memory allocation which must be freed when we are done */ struct mntaarg { SLIST_ENTRY(mntaarg) next; }; /* The header for the mount arguments */ struct mntarg { struct iovec *v; int len; int error; SLIST_HEAD(, mntaarg) list; }; /* * Add a boolean argument. * * flag is the boolean value. * name must start with "no". */ struct mntarg * mount_argb(struct mntarg *ma, int flag, const char *name) { KASSERT(name[0] == 'n' && name[1] == 'o', ("mount_argb(...,%s): name must start with 'no'", name)); return (mount_arg(ma, name + (flag ? 2 : 0), NULL, 0)); } /* * Add an argument printf style */ struct mntarg * mount_argf(struct mntarg *ma, const char *name, const char *fmt, ...) 
{ va_list ap; struct mntaarg *maa; struct sbuf *sb; int len; if (ma == NULL) { ma = malloc(sizeof *ma, M_MOUNT, M_WAITOK | M_ZERO); SLIST_INIT(&ma->list); } if (ma->error) return (ma); ma->v = realloc(ma->v, sizeof *ma->v * (ma->len + 2), M_MOUNT, M_WAITOK); ma->v[ma->len].iov_base = (void *)(uintptr_t)name; ma->v[ma->len].iov_len = strlen(name) + 1; ma->len++; sb = sbuf_new_auto(); va_start(ap, fmt); sbuf_vprintf(sb, fmt, ap); va_end(ap); sbuf_finish(sb); len = sbuf_len(sb) + 1; maa = malloc(sizeof *maa + len, M_MOUNT, M_WAITOK | M_ZERO); SLIST_INSERT_HEAD(&ma->list, maa, next); bcopy(sbuf_data(sb), maa + 1, len); sbuf_delete(sb); ma->v[ma->len].iov_base = maa + 1; ma->v[ma->len].iov_len = len; ma->len++; return (ma); } /* * Add an argument which is a userland string. */ struct mntarg * mount_argsu(struct mntarg *ma, const char *name, const void *val, int len) { struct mntaarg *maa; char *tbuf; if (val == NULL) return (ma); if (ma == NULL) { ma = malloc(sizeof *ma, M_MOUNT, M_WAITOK | M_ZERO); SLIST_INIT(&ma->list); } if (ma->error) return (ma); maa = malloc(sizeof *maa + len, M_MOUNT, M_WAITOK | M_ZERO); SLIST_INSERT_HEAD(&ma->list, maa, next); tbuf = (void *)(maa + 1); ma->error = copyinstr(val, tbuf, len, NULL); return (mount_arg(ma, name, tbuf, -1)); } /* * Plain argument. * * If length is -1, treat value as a C string. */ struct mntarg * mount_arg(struct mntarg *ma, const char *name, const void *val, int len) { if (ma == NULL) { ma = malloc(sizeof *ma, M_MOUNT, M_WAITOK | M_ZERO); SLIST_INIT(&ma->list); } if (ma->error) return (ma); ma->v = realloc(ma->v, sizeof *ma->v * (ma->len + 2), M_MOUNT, M_WAITOK); ma->v[ma->len].iov_base = (void *)(uintptr_t)name; ma->v[ma->len].iov_len = strlen(name) + 1; ma->len++; ma->v[ma->len].iov_base = (void *)(uintptr_t)val; if (len < 0) ma->v[ma->len].iov_len = strlen(val) + 1; else ma->v[ma->len].iov_len = len; ma->len++; return (ma); } /* * Free a mntarg structure */ static void free_mntarg(struct mntarg *ma) { struct mntaarg *maa; while (!SLIST_EMPTY(&ma->list)) { maa = SLIST_FIRST(&ma->list); SLIST_REMOVE_HEAD(&ma->list, next); free(maa, M_MOUNT); } free(ma->v, M_MOUNT); free(ma, M_MOUNT); } /* * Mount a filesystem */ int kernel_mount(struct mntarg *ma, int flags) { struct uio auio; int error; KASSERT(ma != NULL, ("kernel_mount NULL ma")); KASSERT(ma->v != NULL, ("kernel_mount NULL ma->v")); KASSERT(!(ma->len & 1), ("kernel_mount odd ma->len (%d)", ma->len)); auio.uio_iov = ma->v; auio.uio_iovcnt = ma->len; auio.uio_segflg = UIO_SYSSPACE; error = ma->error; if (!error) error = vfs_donmount(curthread, flags, &auio); free_mntarg(ma); return (error); } /* * A printflike function to mount a filesystem. */ int kernel_vmount(int flags, ...) { struct mntarg *ma = NULL; va_list ap; const char *cp; const void *vp; int error; va_start(ap, flags); for (;;) { cp = va_arg(ap, const char *); if (cp == NULL) break; vp = va_arg(ap, const void *); ma = mount_arg(ma, cp, vp, (vp != NULL ? -1 : 0)); } va_end(ap); error = kernel_mount(ma, flags); return (error); } Index: head/sys/kern/vfs_subr.c =================================================================== --- head/sys/kern/vfs_subr.c (revision 192894) +++ head/sys/kern/vfs_subr.c (revision 192895) @@ -1,4263 +1,4255 @@ /*- * Copyright (c) 1989, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. 
or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)vfs_subr.c 8.31 (Berkeley) 5/26/95 */ /* * External virtual filesystem routines */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DDB #include #endif #define WI_MPSAFEQ 0 #define WI_GIANTQ 1 static MALLOC_DEFINE(M_NETADDR, "subr_export_host", "Export host address structure"); static void delmntque(struct vnode *vp); static int flushbuflist(struct bufv *bufv, int flags, struct bufobj *bo, int slpflag, int slptimeo); static void syncer_shutdown(void *arg, int howto); static int vtryrecycle(struct vnode *vp); static void vbusy(struct vnode *vp); static void vinactive(struct vnode *, struct thread *); static void v_incr_usecount(struct vnode *); static void v_decr_usecount(struct vnode *); static void v_decr_useonly(struct vnode *); static void v_upgrade_usecount(struct vnode *); static void vfree(struct vnode *); static void vnlru_free(int); static void vgonel(struct vnode *); static void vfs_knllock(void *arg); static void vfs_knlunlock(void *arg); static int vfs_knllocked(void *arg); static void destroy_vpollinfo(struct vpollinfo *vi); /* * Enable Giant pushdown based on whether or not the vm is mpsafe in this * build. Without mpsafevm the buffer cache can not run Giant free. */ int mpsafe_vfs = 1; TUNABLE_INT("debug.mpsafevfs", &mpsafe_vfs); SYSCTL_INT(_debug, OID_AUTO, mpsafevfs, CTLFLAG_RD, &mpsafe_vfs, 0, "MPSAFE VFS"); /* * Number of vnodes in existence. Increased whenever getnewvnode() * allocates a new vnode, decreased on vdestroy() called on VI_DOOMed * vnode. 
*/ static unsigned long numvnodes; SYSCTL_LONG(_vfs, OID_AUTO, numvnodes, CTLFLAG_RD, &numvnodes, 0, ""); /* * Conversion tables for conversion from vnode types to inode formats * and back. */ enum vtype iftovt_tab[16] = { VNON, VFIFO, VCHR, VNON, VDIR, VNON, VBLK, VNON, VREG, VNON, VLNK, VNON, VSOCK, VNON, VNON, VBAD, }; int vttoif_tab[10] = { 0, S_IFREG, S_IFDIR, S_IFBLK, S_IFCHR, S_IFLNK, S_IFSOCK, S_IFIFO, S_IFMT, S_IFMT }; /* * List of vnodes that are ready for recycling. */ static TAILQ_HEAD(freelst, vnode) vnode_free_list; /* * Free vnode target. Free vnodes may simply be files which have been stat'd * but not read. This is somewhat common, and a small cache of such files * should be kept to avoid recreation costs. */ static u_long wantfreevnodes; SYSCTL_LONG(_vfs, OID_AUTO, wantfreevnodes, CTLFLAG_RW, &wantfreevnodes, 0, ""); /* Number of vnodes in the free list. */ static u_long freevnodes; SYSCTL_LONG(_vfs, OID_AUTO, freevnodes, CTLFLAG_RD, &freevnodes, 0, ""); /* * Various variables used for debugging the new implementation of * reassignbuf(). * XXX these are probably of (very) limited utility now. */ static int reassignbufcalls; SYSCTL_INT(_vfs, OID_AUTO, reassignbufcalls, CTLFLAG_RW, &reassignbufcalls, 0, ""); /* * Cache for the mount type id assigned to NFS. This is used for * special checks in nfs/nfs_nqlease.c and vm/vnode_pager.c. */ int nfs_mount_type = -1; /* To keep more than one thread at a time from running vfs_getnewfsid */ static struct mtx mntid_mtx; /* * Lock for any access to the following: * vnode_free_list * numvnodes * freevnodes */ static struct mtx vnode_free_list_mtx; /* Publicly exported FS */ struct nfs_public nfs_pub; /* Zone for allocation of new vnodes - used exclusively by getnewvnode() */ static uma_zone_t vnode_zone; static uma_zone_t vnodepoll_zone; /* Set to 1 to print out reclaim of active vnodes */ int prtactive; /* * The workitem queue. * * It is useful to delay writes of file data and filesystem metadata * for tens of seconds so that quickly created and deleted files need * not waste disk bandwidth being created and removed. To realize this, * we append vnodes to a "workitem" queue. When running with a soft * updates implementation, most pending metadata dependencies should * not wait for more than a few seconds. Thus, mounted on block devices * are delayed only about a half the time that file data is delayed. * Similarly, directory updates are more critical, so are only delayed * about a third the time that file data is delayed. Thus, there are * SYNCER_MAXDELAY queues that are processed round-robin at a rate of * one each second (driven off the filesystem syncer process). The * syncer_delayno variable indicates the next queue that is to be processed. 
* Items that need to be processed soon are placed in this queue: * * syncer_workitem_pending[syncer_delayno] * * A delay of fifteen seconds is done by placing the request fifteen * entries later in the queue: * * syncer_workitem_pending[(syncer_delayno + 15) & syncer_mask] * */ static int syncer_delayno; static long syncer_mask; LIST_HEAD(synclist, bufobj); static struct synclist *syncer_workitem_pending[2]; /* * The sync_mtx protects: * bo->bo_synclist * sync_vnode_count * syncer_delayno * syncer_state * syncer_workitem_pending * syncer_worklist_len * rushjob */ static struct mtx sync_mtx; static struct cv sync_wakeup; #define SYNCER_MAXDELAY 32 static int syncer_maxdelay = SYNCER_MAXDELAY; /* maximum delay time */ static int syncdelay = 30; /* max time to delay syncing data */ static int filedelay = 30; /* time to delay syncing files */ SYSCTL_INT(_kern, OID_AUTO, filedelay, CTLFLAG_RW, &filedelay, 0, ""); static int dirdelay = 29; /* time to delay syncing directories */ SYSCTL_INT(_kern, OID_AUTO, dirdelay, CTLFLAG_RW, &dirdelay, 0, ""); static int metadelay = 28; /* time to delay syncing metadata */ SYSCTL_INT(_kern, OID_AUTO, metadelay, CTLFLAG_RW, &metadelay, 0, ""); static int rushjob; /* number of slots to run ASAP */ static int stat_rush_requests; /* number of times I/O speeded up */ SYSCTL_INT(_debug, OID_AUTO, rush_requests, CTLFLAG_RW, &stat_rush_requests, 0, ""); /* * When shutting down the syncer, run it at four times normal speed. */ #define SYNCER_SHUTDOWN_SPEEDUP 4 static int sync_vnode_count; static int syncer_worklist_len; static enum { SYNCER_RUNNING, SYNCER_SHUTTING_DOWN, SYNCER_FINAL_DELAY } syncer_state; /* * Number of vnodes we want to exist at any one time. This is mostly used * to size hash tables in vnode-related code. It is normally not used in * getnewvnode(), as wantfreevnodes is normally nonzero.) * * XXX desiredvnodes is historical cruft and should not exist. */ int desiredvnodes; SYSCTL_INT(_kern, KERN_MAXVNODES, maxvnodes, CTLFLAG_RW, &desiredvnodes, 0, "Maximum number of vnodes"); SYSCTL_INT(_kern, OID_AUTO, minvnodes, CTLFLAG_RW, &wantfreevnodes, 0, "Minimum number of vnodes (legacy)"); static int vnlru_nowhere; SYSCTL_INT(_debug, OID_AUTO, vnlru_nowhere, CTLFLAG_RW, &vnlru_nowhere, 0, "Number of times the vnlru process ran without success"); /* * Macros to control when a vnode is freed and recycled. All require * the vnode interlock. */ #define VCANRECYCLE(vp) (((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt) #define VSHOULDFREE(vp) (!((vp)->v_iflag & VI_FREE) && !(vp)->v_holdcnt) #define VSHOULDBUSY(vp) (((vp)->v_iflag & VI_FREE) && (vp)->v_holdcnt) /* * Initialize the vnode management data structures. */ #ifndef MAXVNODES_MAX #define MAXVNODES_MAX 100000 #endif static void vntblinit(void *dummy __unused) { /* * Desiredvnodes is a function of the physical memory size and * the kernel's heap size. Specifically, desiredvnodes scales * in proportion to the physical memory size until two fifths * of the kernel's heap size is consumed by vnodes and vm * objects. 
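 * The computed value is clamped to MAXVNODES_MAX (100000 unless overridden
 * at build time) and wantfreevnodes is set to a quarter of it; both remain
 * tunable at run time through the kern.maxvnodes and vfs.wantfreevnodes
 * sysctls.  Purely as an illustration: on a machine with 2^20 physical
 * pages, the first term of the min() below works out to maxproc + 262144.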
*/ desiredvnodes = min(maxproc + cnt.v_page_count / 4, 2 * vm_kmem_size / (5 * (sizeof(struct vm_object) + sizeof(struct vnode)))); if (desiredvnodes > MAXVNODES_MAX) { if (bootverbose) printf("Reducing kern.maxvnodes %d -> %d\n", desiredvnodes, MAXVNODES_MAX); desiredvnodes = MAXVNODES_MAX; } wantfreevnodes = desiredvnodes / 4; mtx_init(&mntid_mtx, "mntid", NULL, MTX_DEF); TAILQ_INIT(&vnode_free_list); mtx_init(&vnode_free_list_mtx, "vnode_free_list", NULL, MTX_DEF); vnode_zone = uma_zcreate("VNODE", sizeof (struct vnode), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); vnodepoll_zone = uma_zcreate("VNODEPOLL", sizeof (struct vpollinfo), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, 0); /* * Initialize the filesystem syncer. */ syncer_workitem_pending[WI_MPSAFEQ] = hashinit(syncer_maxdelay, M_VNODE, &syncer_mask); syncer_workitem_pending[WI_GIANTQ] = hashinit(syncer_maxdelay, M_VNODE, &syncer_mask); syncer_maxdelay = syncer_mask + 1; mtx_init(&sync_mtx, "Syncer mtx", NULL, MTX_DEF); cv_init(&sync_wakeup, "syncer"); } SYSINIT(vfs, SI_SUB_VFS, SI_ORDER_FIRST, vntblinit, NULL); /* * Mark a mount point as busy. Used to synchronize access and to delay * unmounting. Eventually, mountlist_mtx is not released on failure. */ int vfs_busy(struct mount *mp, int flags) { MPASS((flags & ~MBF_MASK) == 0); CTR3(KTR_VFS, "%s: mp %p with flags %d", __func__, mp, flags); MNT_ILOCK(mp); MNT_REF(mp); /* * If mount point is currenly being unmounted, sleep until the * mount point fate is decided. If thread doing the unmounting fails, * it will clear MNTK_UNMOUNT flag before waking us up, indicating * that this mount point has survived the unmount attempt and vfs_busy * should retry. Otherwise the unmounter thread will set MNTK_REFEXPIRE * flag in addition to MNTK_UNMOUNT, indicating that mount point is * about to be really destroyed. vfs_busy needs to release its * reference on the mount point in this case and return with ENOENT, * telling the caller that mount mount it tried to busy is no longer * valid. */ while (mp->mnt_kern_flag & MNTK_UNMOUNT) { if (flags & MBF_NOWAIT || mp->mnt_kern_flag & MNTK_REFEXPIRE) { MNT_REL(mp); MNT_IUNLOCK(mp); CTR1(KTR_VFS, "%s: failed busying before sleeping", __func__); return (ENOENT); } if (flags & MBF_MNTLSTLOCK) mtx_unlock(&mountlist_mtx); mp->mnt_kern_flag |= MNTK_MWAIT; msleep(mp, MNT_MTX(mp), PVFS, "vfs_busy", 0); if (flags & MBF_MNTLSTLOCK) mtx_lock(&mountlist_mtx); } if (flags & MBF_MNTLSTLOCK) mtx_unlock(&mountlist_mtx); mp->mnt_lockref++; MNT_IUNLOCK(mp); return (0); } /* * Free a busy filesystem. */ void vfs_unbusy(struct mount *mp) { CTR2(KTR_VFS, "%s: mp %p", __func__, mp); MNT_ILOCK(mp); MNT_REL(mp); KASSERT(mp->mnt_lockref > 0, ("negative mnt_lockref")); mp->mnt_lockref--; if (mp->mnt_lockref == 0 && (mp->mnt_kern_flag & MNTK_DRAINING) != 0) { MPASS(mp->mnt_kern_flag & MNTK_UNMOUNT); CTR1(KTR_VFS, "%s: waking up waiters", __func__); mp->mnt_kern_flag &= ~MNTK_DRAINING; wakeup(&mp->mnt_lockref); } MNT_IUNLOCK(mp); } /* * Lookup a mount point by filesystem identifier. 
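 * On success the mount point is returned with a reference taken via
 * vfs_ref(); the caller must drop it with vfs_rel(), as the fsid search
 * loop in vfs_getnewfsid() does.  NULL is returned when no mount matches.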
*/ struct mount * vfs_getvfs(fsid_t *fsid) { struct mount *mp; CTR2(KTR_VFS, "%s: fsid %p", __func__, fsid); mtx_lock(&mountlist_mtx); TAILQ_FOREACH(mp, &mountlist, mnt_list) { if (mp->mnt_stat.f_fsid.val[0] == fsid->val[0] && mp->mnt_stat.f_fsid.val[1] == fsid->val[1]) { vfs_ref(mp); mtx_unlock(&mountlist_mtx); return (mp); } } mtx_unlock(&mountlist_mtx); CTR2(KTR_VFS, "%s: lookup failed for %p id", __func__, fsid); return ((struct mount *) 0); } /* * Lookup a mount point by filesystem identifier, busying it before * returning. */ struct mount * vfs_busyfs(fsid_t *fsid) { struct mount *mp; int error; CTR2(KTR_VFS, "%s: fsid %p", __func__, fsid); mtx_lock(&mountlist_mtx); TAILQ_FOREACH(mp, &mountlist, mnt_list) { if (mp->mnt_stat.f_fsid.val[0] == fsid->val[0] && mp->mnt_stat.f_fsid.val[1] == fsid->val[1]) { error = vfs_busy(mp, MBF_MNTLSTLOCK); if (error) { mtx_unlock(&mountlist_mtx); return (NULL); } return (mp); } } CTR2(KTR_VFS, "%s: lookup failed for %p id", __func__, fsid); mtx_unlock(&mountlist_mtx); return ((struct mount *) 0); } /* * Check if a user can access privileged mount options. */ int vfs_suser(struct mount *mp, struct thread *td) { int error; /* * If the thread is jailed, but this is not a jail-friendly file * system, deny immediately. */ if (!(mp->mnt_vfc->vfc_flags & VFCF_JAIL) && jailed(td->td_ucred)) return (EPERM); /* - * If the file system was mounted outside a jail and a jailed thread - * tries to access it, deny immediately. + * If the file system was mounted outside the jail of the calling + * thread, deny immediately. */ - if (!jailed(mp->mnt_cred) && jailed(td->td_ucred)) + if (mp->mnt_cred->cr_prison != td->td_ucred->cr_prison && + !prison_ischild(td->td_ucred->cr_prison, mp->mnt_cred->cr_prison)) return (EPERM); /* - * If the file system was mounted inside different jail that the jail of - * the calling thread, deny immediately. - */ - if (jailed(mp->mnt_cred) && jailed(td->td_ucred) && - mp->mnt_cred->cr_prison != td->td_ucred->cr_prison) { - return (EPERM); - } - - /* * If file system supports delegated administration, we don't check * for the PRIV_VFS_MOUNT_OWNER privilege - it will be better verified * by the file system itself. * If this is not the user that did original mount, we check for * the PRIV_VFS_MOUNT_OWNER privilege. */ if (!(mp->mnt_vfc->vfc_flags & VFCF_DELEGADMIN) && mp->mnt_cred->cr_uid != td->td_ucred->cr_uid) { if ((error = priv_check(td, PRIV_VFS_MOUNT_OWNER)) != 0) return (error); } return (0); } /* * Get a new unique fsid. Try to make its val[0] unique, since this value * will be used to create fake device numbers for stat(). Also try (but * not so hard) make its val[0] unique mod 2^16, since some emulators only * support 16-bit device numbers. We end up with unique val[0]'s for the * first 2^16 calls and unique val[0]'s mod 2^16 for the first 2^8 calls. * * Keep in mind that several mounts may be running in parallel. Starting * the search one past where the previous search terminated is both a * micro-optimization and a defense against returning the same fsid to * different mounts. 
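 * val[1] is simply the filesystem type number.  val[0] is built with
 * makedev(255, ...) from the low byte of that type number and a rolling
 * 16-bit mntid, and the loop below retries until vfs_getvfs() reports
 * that no existing mount already uses the candidate fsid.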
*/ void vfs_getnewfsid(struct mount *mp) { static u_int16_t mntid_base; struct mount *nmp; fsid_t tfsid; int mtype; CTR2(KTR_VFS, "%s: mp %p", __func__, mp); mtx_lock(&mntid_mtx); mtype = mp->mnt_vfc->vfc_typenum; tfsid.val[1] = mtype; mtype = (mtype & 0xFF) << 24; for (;;) { tfsid.val[0] = makedev(255, mtype | ((mntid_base & 0xFF00) << 8) | (mntid_base & 0xFF)); mntid_base++; if ((nmp = vfs_getvfs(&tfsid)) == NULL) break; vfs_rel(nmp); } mp->mnt_stat.f_fsid.val[0] = tfsid.val[0]; mp->mnt_stat.f_fsid.val[1] = tfsid.val[1]; mtx_unlock(&mntid_mtx); } /* * Knob to control the precision of file timestamps: * * 0 = seconds only; nanoseconds zeroed. * 1 = seconds and nanoseconds, accurate within 1/HZ. * 2 = seconds and nanoseconds, truncated to microseconds. * >=3 = seconds and nanoseconds, maximum precision. */ enum { TSP_SEC, TSP_HZ, TSP_USEC, TSP_NSEC }; static int timestamp_precision = TSP_SEC; SYSCTL_INT(_vfs, OID_AUTO, timestamp_precision, CTLFLAG_RW, ×tamp_precision, 0, ""); /* * Get a current timestamp. */ void vfs_timestamp(struct timespec *tsp) { struct timeval tv; switch (timestamp_precision) { case TSP_SEC: tsp->tv_sec = time_second; tsp->tv_nsec = 0; break; case TSP_HZ: getnanotime(tsp); break; case TSP_USEC: microtime(&tv); TIMEVAL_TO_TIMESPEC(&tv, tsp); break; case TSP_NSEC: default: nanotime(tsp); break; } } /* * Set vnode attributes to VNOVAL */ void vattr_null(struct vattr *vap) { vap->va_type = VNON; vap->va_size = VNOVAL; vap->va_bytes = VNOVAL; vap->va_mode = VNOVAL; vap->va_nlink = VNOVAL; vap->va_uid = VNOVAL; vap->va_gid = VNOVAL; vap->va_fsid = VNOVAL; vap->va_fileid = VNOVAL; vap->va_blocksize = VNOVAL; vap->va_rdev = VNOVAL; vap->va_atime.tv_sec = VNOVAL; vap->va_atime.tv_nsec = VNOVAL; vap->va_mtime.tv_sec = VNOVAL; vap->va_mtime.tv_nsec = VNOVAL; vap->va_ctime.tv_sec = VNOVAL; vap->va_ctime.tv_nsec = VNOVAL; vap->va_birthtime.tv_sec = VNOVAL; vap->va_birthtime.tv_nsec = VNOVAL; vap->va_flags = VNOVAL; vap->va_gen = VNOVAL; vap->va_vaflags = 0; } /* * This routine is called when we have too many vnodes. It attempts * to free vnodes and will potentially free vnodes that still * have VM backing store (VM backing store is typically the cause * of a vnode blowout so we want to do this). Therefore, this operation * is not considered cheap. * * A number of conditions may prevent a vnode from being reclaimed. * the buffer cache may have references on the vnode, a directory * vnode may still have references due to the namei cache representing * underlying files, or the vnode may be in active use. It is not * desireable to reuse such vnodes. These conditions may cause the * number of vnodes to reach some minimum value regardless of what * you set kern.maxvnodes to. Do not set kern.maxvnodes too low. */ static int vlrureclaim(struct mount *mp) { struct vnode *vp; int done; int trigger; int usevnodes; int count; /* * Calculate the trigger point, don't allow user * screwups to blow us up. This prevents us from * recycling vnodes with lots of resident pages. We * aren't trying to free memory, we are trying to * free vnodes. 
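 * The trigger works out to roughly two pages per vnode
 * (cnt.v_page_count * 2 / desiredvnodes); vnodes whose VM object holds
 * more resident pages than that are skipped, and each call walks only
 * about a tenth of the mount's vnode list.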
*/ usevnodes = desiredvnodes; if (usevnodes <= 0) usevnodes = 1; trigger = cnt.v_page_count * 2 / usevnodes; done = 0; vn_start_write(NULL, &mp, V_WAIT); MNT_ILOCK(mp); count = mp->mnt_nvnodelistsize / 10 + 1; while (count != 0) { vp = TAILQ_FIRST(&mp->mnt_nvnodelist); while (vp != NULL && vp->v_type == VMARKER) vp = TAILQ_NEXT(vp, v_nmntvnodes); if (vp == NULL) break; TAILQ_REMOVE(&mp->mnt_nvnodelist, vp, v_nmntvnodes); TAILQ_INSERT_TAIL(&mp->mnt_nvnodelist, vp, v_nmntvnodes); --count; if (!VI_TRYLOCK(vp)) goto next_iter; /* * If it's been deconstructed already, it's still * referenced, or it exceeds the trigger, skip it. */ if (vp->v_usecount || !LIST_EMPTY(&(vp)->v_cache_src) || (vp->v_iflag & VI_DOOMED) != 0 || (vp->v_object != NULL && vp->v_object->resident_page_count > trigger)) { VI_UNLOCK(vp); goto next_iter; } MNT_IUNLOCK(mp); vholdl(vp); if (VOP_LOCK(vp, LK_INTERLOCK|LK_EXCLUSIVE|LK_NOWAIT)) { vdrop(vp); goto next_iter_mntunlocked; } VI_LOCK(vp); /* * v_usecount may have been bumped after VOP_LOCK() dropped * the vnode interlock and before it was locked again. * * It is not necessary to recheck VI_DOOMED because it can * only be set by another thread that holds both the vnode * lock and vnode interlock. If another thread has the * vnode lock before we get to VOP_LOCK() and obtains the * vnode interlock after VOP_LOCK() drops the vnode * interlock, the other thread will be unable to drop the * vnode lock before our VOP_LOCK() call fails. */ if (vp->v_usecount || !LIST_EMPTY(&(vp)->v_cache_src) || (vp->v_object != NULL && vp->v_object->resident_page_count > trigger)) { VOP_UNLOCK(vp, LK_INTERLOCK); goto next_iter_mntunlocked; } KASSERT((vp->v_iflag & VI_DOOMED) == 0, ("VI_DOOMED unexpectedly detected in vlrureclaim()")); vgonel(vp); VOP_UNLOCK(vp, 0); vdropl(vp); done++; next_iter_mntunlocked: if ((count % 256) != 0) goto relock_mnt; goto yield; next_iter: if ((count % 256) != 0) continue; MNT_IUNLOCK(mp); yield: uio_yield(); relock_mnt: MNT_ILOCK(mp); } MNT_IUNLOCK(mp); vn_finished_write(mp); return done; } /* * Attempt to keep the free list at wantfreevnodes length. */ static void vnlru_free(int count) { struct vnode *vp; int vfslocked; mtx_assert(&vnode_free_list_mtx, MA_OWNED); for (; count > 0; count--) { vp = TAILQ_FIRST(&vnode_free_list); /* * The list can be modified while the free_list_mtx * has been dropped and vp could be NULL here. */ if (!vp) break; VNASSERT(vp->v_op != NULL, vp, ("vnlru_free: vnode already reclaimed.")); TAILQ_REMOVE(&vnode_free_list, vp, v_freelist); /* * Don't recycle if we can't get the interlock. */ if (!VI_TRYLOCK(vp)) { TAILQ_INSERT_TAIL(&vnode_free_list, vp, v_freelist); continue; } VNASSERT(VCANRECYCLE(vp), vp, ("vp inconsistent on freelist")); freevnodes--; vp->v_iflag &= ~VI_FREE; vholdl(vp); mtx_unlock(&vnode_free_list_mtx); VI_UNLOCK(vp); vfslocked = VFS_LOCK_GIANT(vp->v_mount); vtryrecycle(vp); VFS_UNLOCK_GIANT(vfslocked); /* * If the recycled succeeded this vdrop will actually free * the vnode. If not it will simply place it back on * the free list. */ vdrop(vp); mtx_lock(&vnode_free_list_mtx); } } /* * Attempt to recycle vnodes in a context that is always safe to block. * Calling vlrurecycle() from the bowels of filesystem code has some * interesting deadlock problems. 
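 * The vnlru kernel process below sleeps until numvnodes climbs above
 * nine tenths of desiredvnodes; getnewvnode() wakes it up when the limit
 * is exceeded and then waits on vnlruproc_sig for relief.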
*/ static struct proc *vnlruproc; static int vnlruproc_sig; static void vnlru_proc(void) { struct mount *mp, *nmp; int done, vfslocked; struct proc *p = vnlruproc; EVENTHANDLER_REGISTER(shutdown_pre_sync, kproc_shutdown, p, SHUTDOWN_PRI_FIRST); for (;;) { kproc_suspend_check(p); mtx_lock(&vnode_free_list_mtx); if (freevnodes > wantfreevnodes) vnlru_free(freevnodes - wantfreevnodes); if (numvnodes <= desiredvnodes * 9 / 10) { vnlruproc_sig = 0; wakeup(&vnlruproc_sig); msleep(vnlruproc, &vnode_free_list_mtx, PVFS|PDROP, "vlruwt", hz); continue; } mtx_unlock(&vnode_free_list_mtx); done = 0; mtx_lock(&mountlist_mtx); for (mp = TAILQ_FIRST(&mountlist); mp != NULL; mp = nmp) { if (vfs_busy(mp, MBF_NOWAIT | MBF_MNTLSTLOCK)) { nmp = TAILQ_NEXT(mp, mnt_list); continue; } vfslocked = VFS_LOCK_GIANT(mp); done += vlrureclaim(mp); VFS_UNLOCK_GIANT(vfslocked); mtx_lock(&mountlist_mtx); nmp = TAILQ_NEXT(mp, mnt_list); vfs_unbusy(mp); } mtx_unlock(&mountlist_mtx); if (done == 0) { EVENTHANDLER_INVOKE(vfs_lowvnodes, desiredvnodes / 10); #if 0 /* These messages are temporary debugging aids */ if (vnlru_nowhere < 5) printf("vnlru process getting nowhere..\n"); else if (vnlru_nowhere == 5) printf("vnlru process messages stopped.\n"); #endif vnlru_nowhere++; tsleep(vnlruproc, PPAUSE, "vlrup", hz * 3); } else uio_yield(); } } static struct kproc_desc vnlru_kp = { "vnlru", vnlru_proc, &vnlruproc }; SYSINIT(vnlru, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start, &vnlru_kp); /* * Routines having to do with the management of the vnode table. */ void vdestroy(struct vnode *vp) { struct bufobj *bo; CTR2(KTR_VFS, "%s: vp %p", __func__, vp); mtx_lock(&vnode_free_list_mtx); numvnodes--; mtx_unlock(&vnode_free_list_mtx); bo = &vp->v_bufobj; VNASSERT((vp->v_iflag & VI_FREE) == 0, vp, ("cleaned vnode still on the free list.")); VNASSERT(vp->v_data == NULL, vp, ("cleaned vnode isn't")); VNASSERT(vp->v_holdcnt == 0, vp, ("Non-zero hold count")); VNASSERT(vp->v_usecount == 0, vp, ("Non-zero use count")); VNASSERT(vp->v_writecount == 0, vp, ("Non-zero write count")); VNASSERT(bo->bo_numoutput == 0, vp, ("Clean vnode has pending I/O's")); VNASSERT(bo->bo_clean.bv_cnt == 0, vp, ("cleanbufcnt not 0")); VNASSERT(bo->bo_clean.bv_root == NULL, vp, ("cleanblkroot not NULL")); VNASSERT(bo->bo_dirty.bv_cnt == 0, vp, ("dirtybufcnt not 0")); VNASSERT(bo->bo_dirty.bv_root == NULL, vp, ("dirtyblkroot not NULL")); VNASSERT(TAILQ_EMPTY(&vp->v_cache_dst), vp, ("vp has namecache dst")); VNASSERT(LIST_EMPTY(&vp->v_cache_src), vp, ("vp has namecache src")); VNASSERT(vp->v_cache_dd == NULL, vp, ("vp has namecache for ..")); VI_UNLOCK(vp); #ifdef MAC mac_vnode_destroy(vp); #endif if (vp->v_pollinfo != NULL) destroy_vpollinfo(vp->v_pollinfo); #ifdef INVARIANTS /* XXX Elsewhere we can detect an already freed vnode via NULL v_op. */ vp->v_op = NULL; #endif lockdestroy(vp->v_vnlock); mtx_destroy(&vp->v_interlock); mtx_destroy(BO_MTX(bo)); uma_zfree(vnode_zone, vp); } /* * Try to recycle a freed vnode. We abort if anyone picks up a reference * before we actually vgone(). This function must be called with the vnode * held to prevent the vnode from being returned to the free list midway * through vgone(). */ static int vtryrecycle(struct vnode *vp) { struct mount *vnmp; CTR2(KTR_VFS, "%s: vp %p", __func__, vp); VNASSERT(vp->v_holdcnt, vp, ("vtryrecycle: Recycling vp %p without a reference.", vp)); /* * This vnode may found and locked via some other list, if so we * can't recycle it yet. 
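 * Returns EWOULDBLOCK if the vnode lock cannot be obtained without
 * sleeping, EBUSY if the filesystem is suspending writes or the vnode
 * has picked up new users, and 0 once the vnode has been doomed via
 * vgonel().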
*/ if (VOP_LOCK(vp, LK_EXCLUSIVE | LK_NOWAIT) != 0) { CTR2(KTR_VFS, "%s: impossible to recycle, vp %p lock is already held", __func__, vp); return (EWOULDBLOCK); } /* * Don't recycle if its filesystem is being suspended. */ if (vn_start_write(vp, &vnmp, V_NOWAIT) != 0) { VOP_UNLOCK(vp, 0); CTR2(KTR_VFS, "%s: impossible to recycle, cannot start the write for %p", __func__, vp); return (EBUSY); } /* * If we got this far, we need to acquire the interlock and see if * anyone picked up this vnode from another list. If not, we will * mark it with DOOMED via vgonel() so that anyone who does find it * will skip over it. */ VI_LOCK(vp); if (vp->v_usecount) { VOP_UNLOCK(vp, LK_INTERLOCK); vn_finished_write(vnmp); CTR2(KTR_VFS, "%s: impossible to recycle, %p is already referenced", __func__, vp); return (EBUSY); } if ((vp->v_iflag & VI_DOOMED) == 0) vgonel(vp); VOP_UNLOCK(vp, LK_INTERLOCK); vn_finished_write(vnmp); return (0); } /* * Return the next vnode from the free list. */ int getnewvnode(const char *tag, struct mount *mp, struct vop_vector *vops, struct vnode **vpp) { struct vnode *vp = NULL; struct bufobj *bo; CTR3(KTR_VFS, "%s: mp %p with tag %s", __func__, mp, tag); mtx_lock(&vnode_free_list_mtx); /* * Lend our context to reclaim vnodes if they've exceeded the max. */ if (freevnodes > wantfreevnodes) vnlru_free(1); /* * Wait for available vnodes. */ if (numvnodes > desiredvnodes) { if (mp != NULL && (mp->mnt_kern_flag & MNTK_SUSPEND)) { /* * File system is beeing suspended, we cannot risk a * deadlock here, so allocate new vnode anyway. */ if (freevnodes > wantfreevnodes) vnlru_free(freevnodes - wantfreevnodes); goto alloc; } if (vnlruproc_sig == 0) { vnlruproc_sig = 1; /* avoid unnecessary wakeups */ wakeup(vnlruproc); } msleep(&vnlruproc_sig, &vnode_free_list_mtx, PVFS, "vlruwk", hz); #if 0 /* XXX Not all VFS_VGET/ffs_vget callers check returns. */ if (numvnodes > desiredvnodes) { mtx_unlock(&vnode_free_list_mtx); return (ENFILE); } #endif } alloc: numvnodes++; mtx_unlock(&vnode_free_list_mtx); vp = (struct vnode *) uma_zalloc(vnode_zone, M_WAITOK|M_ZERO); /* * Setup locks. */ vp->v_vnlock = &vp->v_lock; mtx_init(&vp->v_interlock, "vnode interlock", NULL, MTX_DEF); /* * By default, don't allow shared locks unless filesystems * opt-in. */ lockinit(vp->v_vnlock, PVFS, tag, VLKTIMEOUT, LK_NOSHARE); /* * Initialize bufobj. */ bo = &vp->v_bufobj; bo->__bo_vnode = vp; mtx_init(BO_MTX(bo), "bufobj interlock", NULL, MTX_DEF); bo->bo_ops = &buf_ops_bio; bo->bo_private = vp; TAILQ_INIT(&bo->bo_clean.bv_hd); TAILQ_INIT(&bo->bo_dirty.bv_hd); /* * Initialize namecache. */ LIST_INIT(&vp->v_cache_src); TAILQ_INIT(&vp->v_cache_dst); /* * Finalize various vnode identity bits. */ vp->v_type = VNON; vp->v_tag = tag; vp->v_op = vops; v_incr_usecount(vp); vp->v_data = 0; #ifdef MAC mac_vnode_init(vp); if (mp != NULL && (mp->mnt_flag & MNT_MULTILABEL) == 0) mac_vnode_associate_singlelabel(mp, vp); else if (mp == NULL && vops != &dead_vnodeops) printf("NULL mp in getnewvnode()\n"); #endif if (mp != NULL) { bo->bo_bsize = mp->mnt_stat.f_iosize; if ((mp->mnt_kern_flag & MNTK_NOKNOTE) != 0) vp->v_vflag |= VV_NOKNOTE; } *vpp = vp; return (0); } /* * Delete from old mount point vnode list, if on one. 
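 * This clears v_mount, takes the vnode off mnt_nvnodelist, decrements
 * mnt_nvnodelistsize and drops the reference the vnode held on the
 * mount point (MNT_REL).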
*/ static void delmntque(struct vnode *vp) { struct mount *mp; mp = vp->v_mount; if (mp == NULL) return; MNT_ILOCK(mp); vp->v_mount = NULL; VNASSERT(mp->mnt_nvnodelistsize > 0, vp, ("bad mount point vnode list size")); TAILQ_REMOVE(&mp->mnt_nvnodelist, vp, v_nmntvnodes); mp->mnt_nvnodelistsize--; MNT_REL(mp); MNT_IUNLOCK(mp); } static void insmntque_stddtr(struct vnode *vp, void *dtr_arg) { vp->v_data = NULL; vp->v_op = &dead_vnodeops; /* XXX non mp-safe fs may still call insmntque with vnode unlocked */ if (!VOP_ISLOCKED(vp)) vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); vgone(vp); vput(vp); } /* * Insert into list of vnodes for the new mount point, if available. */ int insmntque1(struct vnode *vp, struct mount *mp, void (*dtr)(struct vnode *, void *), void *dtr_arg) { int locked; KASSERT(vp->v_mount == NULL, ("insmntque: vnode already on per mount vnode list")); VNASSERT(mp != NULL, vp, ("Don't call insmntque(foo, NULL)")); #ifdef DEBUG_VFS_LOCKS if (!VFS_NEEDSGIANT(mp)) ASSERT_VOP_ELOCKED(vp, "insmntque: mp-safe fs and non-locked vp"); #endif MNT_ILOCK(mp); if ((mp->mnt_kern_flag & MNTK_NOINSMNTQ) != 0 && ((mp->mnt_kern_flag & MNTK_UNMOUNTF) != 0 || mp->mnt_nvnodelistsize == 0)) { locked = VOP_ISLOCKED(vp); if (!locked || (locked == LK_EXCLUSIVE && (vp->v_vflag & VV_FORCEINSMQ) == 0)) { MNT_IUNLOCK(mp); if (dtr != NULL) dtr(vp, dtr_arg); return (EBUSY); } } vp->v_mount = mp; MNT_REF(mp); TAILQ_INSERT_TAIL(&mp->mnt_nvnodelist, vp, v_nmntvnodes); VNASSERT(mp->mnt_nvnodelistsize >= 0, vp, ("neg mount point vnode list size")); mp->mnt_nvnodelistsize++; MNT_IUNLOCK(mp); return (0); } int insmntque(struct vnode *vp, struct mount *mp) { return (insmntque1(vp, mp, insmntque_stddtr, NULL)); } /* * Flush out and invalidate all buffers associated with a bufobj * Called with the underlying object locked. */ int bufobj_invalbuf(struct bufobj *bo, int flags, int slpflag, int slptimeo) { int error; BO_LOCK(bo); if (flags & V_SAVE) { error = bufobj_wwait(bo, slpflag, slptimeo); if (error) { BO_UNLOCK(bo); return (error); } if (bo->bo_dirty.bv_cnt > 0) { BO_UNLOCK(bo); if ((error = BO_SYNC(bo, MNT_WAIT)) != 0) return (error); /* * XXX We could save a lock/unlock if this was only * enabled under INVARIANTS */ BO_LOCK(bo); if (bo->bo_numoutput > 0 || bo->bo_dirty.bv_cnt > 0) panic("vinvalbuf: dirty bufs"); } } /* * If you alter this loop please notice that interlock is dropped and * reacquired in flushbuflist. Special care is needed to ensure that * no race conditions occur from this. */ do { error = flushbuflist(&bo->bo_clean, flags, bo, slpflag, slptimeo); if (error == 0) error = flushbuflist(&bo->bo_dirty, flags, bo, slpflag, slptimeo); if (error != 0 && error != EAGAIN) { BO_UNLOCK(bo); return (error); } } while (error != 0); /* * Wait for I/O to complete. XXX needs cleaning up. The vnode can * have write I/O in-progress but if there is a VM object then the * VM object can also have read-I/O in-progress. */ do { bufobj_wwait(bo, 0, 0); BO_UNLOCK(bo); if (bo->bo_object != NULL) { VM_OBJECT_LOCK(bo->bo_object); vm_object_pip_wait(bo->bo_object, "bovlbx"); VM_OBJECT_UNLOCK(bo->bo_object); } BO_LOCK(bo); } while (bo->bo_numoutput > 0); BO_UNLOCK(bo); /* * Destroy the copy in the VM cache, too. */ if (bo->bo_object != NULL && (flags & (V_ALT | V_NORMAL)) == 0) { VM_OBJECT_LOCK(bo->bo_object); vm_object_page_remove(bo->bo_object, 0, 0, (flags & V_SAVE) ? 
TRUE : FALSE); VM_OBJECT_UNLOCK(bo->bo_object); } #ifdef INVARIANTS BO_LOCK(bo); if ((flags & (V_ALT | V_NORMAL)) == 0 && (bo->bo_dirty.bv_cnt > 0 || bo->bo_clean.bv_cnt > 0)) panic("vinvalbuf: flush failed"); BO_UNLOCK(bo); #endif return (0); } /* * Flush out and invalidate all buffers associated with a vnode. * Called with the underlying object locked. */ int vinvalbuf(struct vnode *vp, int flags, int slpflag, int slptimeo) { CTR3(KTR_VFS, "%s: vp %p with flags %d", __func__, vp, flags); ASSERT_VOP_LOCKED(vp, "vinvalbuf"); return (bufobj_invalbuf(&vp->v_bufobj, flags, slpflag, slptimeo)); } /* * Flush out buffers on the specified list. * */ static int flushbuflist( struct bufv *bufv, int flags, struct bufobj *bo, int slpflag, int slptimeo) { struct buf *bp, *nbp; int retval, error; daddr_t lblkno; b_xflags_t xflags; ASSERT_BO_LOCKED(bo); retval = 0; TAILQ_FOREACH_SAFE(bp, &bufv->bv_hd, b_bobufs, nbp) { if (((flags & V_NORMAL) && (bp->b_xflags & BX_ALTDATA)) || ((flags & V_ALT) && (bp->b_xflags & BX_ALTDATA) == 0)) { continue; } lblkno = 0; xflags = 0; if (nbp != NULL) { lblkno = nbp->b_lblkno; xflags = nbp->b_xflags & (BX_BKGRDMARKER | BX_VNDIRTY | BX_VNCLEAN); } retval = EAGAIN; error = BUF_TIMELOCK(bp, LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK, BO_MTX(bo), "flushbuf", slpflag, slptimeo); if (error) { BO_LOCK(bo); return (error != ENOLCK ? error : EAGAIN); } KASSERT(bp->b_bufobj == bo, ("bp %p wrong b_bufobj %p should be %p", bp, bp->b_bufobj, bo)); if (bp->b_bufobj != bo) { /* XXX: necessary ? */ BUF_UNLOCK(bp); BO_LOCK(bo); return (EAGAIN); } /* * XXX Since there are no node locks for NFS, I * believe there is a slight chance that a delayed * write will occur while sleeping just above, so * check for it. */ if (((bp->b_flags & (B_DELWRI | B_INVAL)) == B_DELWRI) && (flags & V_SAVE)) { bremfree(bp); bp->b_flags |= B_ASYNC; bwrite(bp); BO_LOCK(bo); return (EAGAIN); /* XXX: why not loop ? */ } bremfree(bp); bp->b_flags |= (B_INVAL | B_RELBUF); bp->b_flags &= ~B_ASYNC; brelse(bp); BO_LOCK(bo); if (nbp != NULL && (nbp->b_bufobj != bo || nbp->b_lblkno != lblkno || (nbp->b_xflags & (BX_BKGRDMARKER | BX_VNDIRTY | BX_VNCLEAN)) != xflags)) break; /* nbp invalid */ } return (retval); } /* * Truncate a file's buffer and pages to a specified length. This * is in lieu of the old vinvalbuf mechanism, which performed unneeded * sync activity. */ int vtruncbuf(struct vnode *vp, struct ucred *cred, struct thread *td, off_t length, int blksize) { struct buf *bp, *nbp; int anyfreed; int trunclbn; struct bufobj *bo; CTR5(KTR_VFS, "%s: vp %p with cred %p and block %d:%ju", __func__, vp, cred, blksize, (uintmax_t)length); /* * Round up to the *next* lbn. 
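 * For example, with a blksize of 4096 a length of 4097 yields a trunclbn
 * of 2: buffers for logical blocks 0 and 1 are preserved and everything
 * from block 2 onward is invalidated.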
*/ trunclbn = (length + blksize - 1) / blksize; ASSERT_VOP_LOCKED(vp, "vtruncbuf"); restart: bo = &vp->v_bufobj; BO_LOCK(bo); anyfreed = 1; for (;anyfreed;) { anyfreed = 0; TAILQ_FOREACH_SAFE(bp, &bo->bo_clean.bv_hd, b_bobufs, nbp) { if (bp->b_lblkno < trunclbn) continue; if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK, BO_MTX(bo)) == ENOLCK) goto restart; bremfree(bp); bp->b_flags |= (B_INVAL | B_RELBUF); bp->b_flags &= ~B_ASYNC; brelse(bp); anyfreed = 1; if (nbp != NULL && (((nbp->b_xflags & BX_VNCLEAN) == 0) || (nbp->b_vp != vp) || (nbp->b_flags & B_DELWRI))) { goto restart; } BO_LOCK(bo); } TAILQ_FOREACH_SAFE(bp, &bo->bo_dirty.bv_hd, b_bobufs, nbp) { if (bp->b_lblkno < trunclbn) continue; if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK, BO_MTX(bo)) == ENOLCK) goto restart; bremfree(bp); bp->b_flags |= (B_INVAL | B_RELBUF); bp->b_flags &= ~B_ASYNC; brelse(bp); anyfreed = 1; if (nbp != NULL && (((nbp->b_xflags & BX_VNDIRTY) == 0) || (nbp->b_vp != vp) || (nbp->b_flags & B_DELWRI) == 0)) { goto restart; } BO_LOCK(bo); } } if (length > 0) { restartsync: TAILQ_FOREACH_SAFE(bp, &bo->bo_dirty.bv_hd, b_bobufs, nbp) { if (bp->b_lblkno > 0) continue; /* * Since we hold the vnode lock this should only * fail if we're racing with the buf daemon. */ if (BUF_LOCK(bp, LK_EXCLUSIVE | LK_SLEEPFAIL | LK_INTERLOCK, BO_MTX(bo)) == ENOLCK) { goto restart; } VNASSERT((bp->b_flags & B_DELWRI), vp, ("buf(%p) on dirty queue without DELWRI", bp)); bremfree(bp); bawrite(bp); BO_LOCK(bo); goto restartsync; } } bufobj_wwait(bo, 0, 0); BO_UNLOCK(bo); vnode_pager_setsize(vp, length); return (0); } /* * buf_splay() - splay tree core for the clean/dirty list of buffers in * a vnode. * * NOTE: We have to deal with the special case of a background bitmap * buffer, a situation where two buffers will have the same logical * block offset. We want (1) only the foreground buffer to be accessed * in a lookup and (2) must differentiate between the foreground and * background buffer in the splay tree algorithm because the splay * tree cannot normally handle multiple entities with the same 'index'. * We accomplish this by adding differentiating flags to the splay tree's * numerical domain. */ static struct buf * buf_splay(daddr_t lblkno, b_xflags_t xflags, struct buf *root) { struct buf dummy; struct buf *lefttreemax, *righttreemin, *y; if (root == NULL) return (NULL); lefttreemax = righttreemin = &dummy; for (;;) { if (lblkno < root->b_lblkno || (lblkno == root->b_lblkno && (xflags & BX_BKGRDMARKER) < (root->b_xflags & BX_BKGRDMARKER))) { if ((y = root->b_left) == NULL) break; if (lblkno < y->b_lblkno) { /* Rotate right. */ root->b_left = y->b_right; y->b_right = root; root = y; if ((y = root->b_left) == NULL) break; } /* Link into the new root's right tree. */ righttreemin->b_left = root; righttreemin = root; } else if (lblkno > root->b_lblkno || (lblkno == root->b_lblkno && (xflags & BX_BKGRDMARKER) > (root->b_xflags & BX_BKGRDMARKER))) { if ((y = root->b_right) == NULL) break; if (lblkno > y->b_lblkno) { /* Rotate left. */ root->b_right = y->b_left; y->b_left = root; root = y; if ((y = root->b_right) == NULL) break; } /* Link into the new root's left tree. */ lefttreemax->b_right = root; lefttreemax = root; } else { break; } root = y; } /* Assemble the new root. 
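 * The subtrees collected on the left and right auxiliary trees during the
 * descent are stitched back together here: the old root's children are
 * attached to the extremes of those trees, and the trees hanging off the
 * dummy node become the new root's left and right subtrees.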
*/ lefttreemax->b_right = root->b_left; righttreemin->b_left = root->b_right; root->b_left = dummy.b_right; root->b_right = dummy.b_left; return (root); } static void buf_vlist_remove(struct buf *bp) { struct buf *root; struct bufv *bv; KASSERT(bp->b_bufobj != NULL, ("No b_bufobj %p", bp)); ASSERT_BO_LOCKED(bp->b_bufobj); KASSERT((bp->b_xflags & (BX_VNDIRTY|BX_VNCLEAN)) != (BX_VNDIRTY|BX_VNCLEAN), ("buf_vlist_remove: Buf %p is on two lists", bp)); if (bp->b_xflags & BX_VNDIRTY) bv = &bp->b_bufobj->bo_dirty; else bv = &bp->b_bufobj->bo_clean; if (bp != bv->bv_root) { root = buf_splay(bp->b_lblkno, bp->b_xflags, bv->bv_root); KASSERT(root == bp, ("splay lookup failed in remove")); } if (bp->b_left == NULL) { root = bp->b_right; } else { root = buf_splay(bp->b_lblkno, bp->b_xflags, bp->b_left); root->b_right = bp->b_right; } bv->bv_root = root; TAILQ_REMOVE(&bv->bv_hd, bp, b_bobufs); bv->bv_cnt--; bp->b_xflags &= ~(BX_VNDIRTY | BX_VNCLEAN); } /* * Add the buffer to the sorted clean or dirty block list using a * splay tree algorithm. * * NOTE: xflags is passed as a constant, optimizing this inline function! */ static void buf_vlist_add(struct buf *bp, struct bufobj *bo, b_xflags_t xflags) { struct buf *root; struct bufv *bv; ASSERT_BO_LOCKED(bo); KASSERT((bp->b_xflags & (BX_VNDIRTY|BX_VNCLEAN)) == 0, ("buf_vlist_add: Buf %p has existing xflags %d", bp, bp->b_xflags)); bp->b_xflags |= xflags; if (xflags & BX_VNDIRTY) bv = &bo->bo_dirty; else bv = &bo->bo_clean; root = buf_splay(bp->b_lblkno, bp->b_xflags, bv->bv_root); if (root == NULL) { bp->b_left = NULL; bp->b_right = NULL; TAILQ_INSERT_TAIL(&bv->bv_hd, bp, b_bobufs); } else if (bp->b_lblkno < root->b_lblkno || (bp->b_lblkno == root->b_lblkno && (bp->b_xflags & BX_BKGRDMARKER) < (root->b_xflags & BX_BKGRDMARKER))) { bp->b_left = root->b_left; bp->b_right = root; root->b_left = NULL; TAILQ_INSERT_BEFORE(root, bp, b_bobufs); } else { bp->b_right = root->b_right; bp->b_left = root; root->b_right = NULL; TAILQ_INSERT_AFTER(&bv->bv_hd, root, bp, b_bobufs); } bv->bv_cnt++; bv->bv_root = bp; } /* * Lookup a buffer using the splay tree. Note that we specifically avoid * shadow buffers used in background bitmap writes. * * This code isn't quite efficient as it could be because we are maintaining * two sorted lists and do not know which list the block resides in. * * During a "make buildworld" the desired buffer is found at one of * the roots more than 60% of the time. Thus, checking both roots * before performing either splay eliminates unnecessary splays on the * first tree splayed. */ struct buf * gbincore(struct bufobj *bo, daddr_t lblkno) { struct buf *bp; ASSERT_BO_LOCKED(bo); if ((bp = bo->bo_clean.bv_root) != NULL && bp->b_lblkno == lblkno && !(bp->b_xflags & BX_BKGRDMARKER)) return (bp); if ((bp = bo->bo_dirty.bv_root) != NULL && bp->b_lblkno == lblkno && !(bp->b_xflags & BX_BKGRDMARKER)) return (bp); if ((bp = bo->bo_clean.bv_root) != NULL) { bo->bo_clean.bv_root = bp = buf_splay(lblkno, 0, bp); if (bp->b_lblkno == lblkno && !(bp->b_xflags & BX_BKGRDMARKER)) return (bp); } if ((bp = bo->bo_dirty.bv_root) != NULL) { bo->bo_dirty.bv_root = bp = buf_splay(lblkno, 0, bp); if (bp->b_lblkno == lblkno && !(bp->b_xflags & BX_BKGRDMARKER)) return (bp); } return (NULL); } /* * Associate a buffer with a vnode. 
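 * The caller must hold the bufobj lock.  A hold reference is taken on the
 * vnode and the buffer starts out on the clean list (BX_VNCLEAN);
 * brelvp() undoes both.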
*/ void bgetvp(struct vnode *vp, struct buf *bp) { struct bufobj *bo; bo = &vp->v_bufobj; ASSERT_BO_LOCKED(bo); VNASSERT(bp->b_vp == NULL, bp->b_vp, ("bgetvp: not free")); CTR3(KTR_BUF, "bgetvp(%p) vp %p flags %X", bp, vp, bp->b_flags); VNASSERT((bp->b_xflags & (BX_VNDIRTY|BX_VNCLEAN)) == 0, vp, ("bgetvp: bp already attached! %p", bp)); vhold(vp); if (VFS_NEEDSGIANT(vp->v_mount) || bo->bo_flag & BO_NEEDSGIANT) bp->b_flags |= B_NEEDSGIANT; bp->b_vp = vp; bp->b_bufobj = bo; /* * Insert onto list for new vnode. */ buf_vlist_add(bp, bo, BX_VNCLEAN); } /* * Disassociate a buffer from a vnode. */ void brelvp(struct buf *bp) { struct bufobj *bo; struct vnode *vp; CTR3(KTR_BUF, "brelvp(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); KASSERT(bp->b_vp != NULL, ("brelvp: NULL")); /* * Delete from old vnode list, if on one. */ vp = bp->b_vp; /* XXX */ bo = bp->b_bufobj; BO_LOCK(bo); if (bp->b_xflags & (BX_VNDIRTY | BX_VNCLEAN)) buf_vlist_remove(bp); else panic("brelvp: Buffer %p not on queue.", bp); if ((bo->bo_flag & BO_ONWORKLST) && bo->bo_dirty.bv_cnt == 0) { bo->bo_flag &= ~BO_ONWORKLST; mtx_lock(&sync_mtx); LIST_REMOVE(bo, bo_synclist); syncer_worklist_len--; mtx_unlock(&sync_mtx); } bp->b_flags &= ~B_NEEDSGIANT; bp->b_vp = NULL; bp->b_bufobj = NULL; BO_UNLOCK(bo); vdrop(vp); } /* * Add an item to the syncer work queue. */ static void vn_syncer_add_to_worklist(struct bufobj *bo, int delay) { int queue, slot; ASSERT_BO_LOCKED(bo); mtx_lock(&sync_mtx); if (bo->bo_flag & BO_ONWORKLST) LIST_REMOVE(bo, bo_synclist); else { bo->bo_flag |= BO_ONWORKLST; syncer_worklist_len++; } if (delay > syncer_maxdelay - 2) delay = syncer_maxdelay - 2; slot = (syncer_delayno + delay) & syncer_mask; queue = VFS_NEEDSGIANT(bo->__bo_vnode->v_mount) ? WI_GIANTQ : WI_MPSAFEQ; LIST_INSERT_HEAD(&syncer_workitem_pending[queue][slot], bo, bo_synclist); mtx_unlock(&sync_mtx); } static int sysctl_vfs_worklist_len(SYSCTL_HANDLER_ARGS) { int error, len; mtx_lock(&sync_mtx); len = syncer_worklist_len - sync_vnode_count; mtx_unlock(&sync_mtx); error = SYSCTL_OUT(req, &len, sizeof(len)); return (error); } SYSCTL_PROC(_vfs, OID_AUTO, worklist_len, CTLTYPE_INT | CTLFLAG_RD, NULL, 0, sysctl_vfs_worklist_len, "I", "Syncer thread worklist length"); static struct proc *updateproc; static void sched_sync(void); static struct kproc_desc up_kp = { "syncer", sched_sync, &updateproc }; SYSINIT(syncer, SI_SUB_KTHREAD_UPDATE, SI_ORDER_FIRST, kproc_start, &up_kp); static int sync_vnode(struct synclist *slp, struct bufobj **bo, struct thread *td) { struct vnode *vp; struct mount *mp; *bo = LIST_FIRST(slp); if (*bo == NULL) return (0); vp = (*bo)->__bo_vnode; /* XXX */ if (VOP_ISLOCKED(vp) != 0 || VI_TRYLOCK(vp) == 0) return (1); /* * We use vhold in case the vnode does not * successfully sync. vhold prevents the vnode from * going away when we unlock the sync_mtx so that * we can acquire the vnode interlock. */ vholdl(vp); mtx_unlock(&sync_mtx); VI_UNLOCK(vp); if (vn_start_write(vp, &mp, V_NOWAIT) != 0) { vdrop(vp); mtx_lock(&sync_mtx); return (*bo == LIST_FIRST(slp)); } vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); (void) VOP_FSYNC(vp, MNT_LAZY, td); VOP_UNLOCK(vp, 0); vn_finished_write(mp); BO_LOCK(*bo); if (((*bo)->bo_flag & BO_ONWORKLST) != 0) { /* * Put us back on the worklist. The worklist * routine will remove us from our current * position and then add us back in at a later * position. */ vn_syncer_add_to_worklist(*bo, syncdelay); } BO_UNLOCK(*bo); vdrop(vp); mtx_lock(&sync_mtx); return (0); } /* * System filesystem synchronizer daemon. 
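 * Once a second the daemon drains the two pending lists (MPSAFE and
 * Giant-locked) for the current syncer_delayno slot, re-queueing vnodes
 * that are still dirty syncdelay seconds later.  While shutting down it
 * runs SYNCER_SHUTDOWN_SPEEDUP times as fast until only syncer vnodes
 * remain on the worklist.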
*/ static void sched_sync(void) { struct synclist *gnext, *next; struct synclist *gslp, *slp; struct bufobj *bo; long starttime; struct thread *td = curthread; int last_work_seen; int net_worklist_len; int syncer_final_iter; int first_printf; int error; last_work_seen = 0; syncer_final_iter = 0; first_printf = 1; syncer_state = SYNCER_RUNNING; starttime = time_uptime; td->td_pflags |= TDP_NORUNNINGBUF; EVENTHANDLER_REGISTER(shutdown_pre_sync, syncer_shutdown, td->td_proc, SHUTDOWN_PRI_LAST); mtx_lock(&sync_mtx); for (;;) { if (syncer_state == SYNCER_FINAL_DELAY && syncer_final_iter == 0) { mtx_unlock(&sync_mtx); kproc_suspend_check(td->td_proc); mtx_lock(&sync_mtx); } net_worklist_len = syncer_worklist_len - sync_vnode_count; if (syncer_state != SYNCER_RUNNING && starttime != time_uptime) { if (first_printf) { printf("\nSyncing disks, vnodes remaining..."); first_printf = 0; } printf("%d ", net_worklist_len); } starttime = time_uptime; /* * Push files whose dirty time has expired. Be careful * of interrupt race on slp queue. * * Skip over empty worklist slots when shutting down. */ do { slp = &syncer_workitem_pending[WI_MPSAFEQ][syncer_delayno]; gslp = &syncer_workitem_pending[WI_GIANTQ][syncer_delayno]; syncer_delayno += 1; if (syncer_delayno == syncer_maxdelay) syncer_delayno = 0; next = &syncer_workitem_pending[WI_MPSAFEQ][syncer_delayno]; gnext = &syncer_workitem_pending[WI_GIANTQ][syncer_delayno]; /* * If the worklist has wrapped since the * it was emptied of all but syncer vnodes, * switch to the FINAL_DELAY state and run * for one more second. */ if (syncer_state == SYNCER_SHUTTING_DOWN && net_worklist_len == 0 && last_work_seen == syncer_delayno) { syncer_state = SYNCER_FINAL_DELAY; syncer_final_iter = SYNCER_SHUTDOWN_SPEEDUP; } } while (syncer_state != SYNCER_RUNNING && LIST_EMPTY(slp) && LIST_EMPTY(gslp) && syncer_worklist_len > 0); /* * Keep track of the last time there was anything * on the worklist other than syncer vnodes. * Return to the SHUTTING_DOWN state if any * new work appears. */ if (net_worklist_len > 0 || syncer_state == SYNCER_RUNNING) last_work_seen = syncer_delayno; if (net_worklist_len > 0 && syncer_state == SYNCER_FINAL_DELAY) syncer_state = SYNCER_SHUTTING_DOWN; while (!LIST_EMPTY(slp)) { error = sync_vnode(slp, &bo, td); if (error == 1) { LIST_REMOVE(bo, bo_synclist); LIST_INSERT_HEAD(next, bo, bo_synclist); continue; } } if (!LIST_EMPTY(gslp)) { mtx_unlock(&sync_mtx); mtx_lock(&Giant); mtx_lock(&sync_mtx); while (!LIST_EMPTY(gslp)) { error = sync_vnode(gslp, &bo, td); if (error == 1) { LIST_REMOVE(bo, bo_synclist); LIST_INSERT_HEAD(gnext, bo, bo_synclist); continue; } } mtx_unlock(&Giant); } if (syncer_state == SYNCER_FINAL_DELAY && syncer_final_iter > 0) syncer_final_iter--; /* * The variable rushjob allows the kernel to speed up the * processing of the filesystem syncer process. A rushjob * value of N tells the filesystem syncer to process the next * N seconds worth of work on its queue ASAP. Currently rushjob * is used by the soft update code to speed up the filesystem * syncer process when the incore state is getting so far * ahead of the disk that the kernel memory pool is being * threatened with exhaustion. */ if (rushjob > 0) { rushjob -= 1; continue; } /* * Just sleep for a short period of time between * iterations when shutting down to allow some I/O * to happen. * * If it has taken us less than a second to process the * current work, then wait. Otherwise start right over * again. 
We can still lose time if any single round * takes more than two seconds, but it does not really * matter as we are just trying to generally pace the * filesystem activity. */ if (syncer_state != SYNCER_RUNNING) cv_timedwait(&sync_wakeup, &sync_mtx, hz / SYNCER_SHUTDOWN_SPEEDUP); else if (time_uptime == starttime) cv_timedwait(&sync_wakeup, &sync_mtx, hz); } } /* * Request the syncer daemon to speed up its work. * We never push it to speed up more than half of its * normal turn time, otherwise it could take over the cpu. */ int speedup_syncer(void) { int ret = 0; mtx_lock(&sync_mtx); if (rushjob < syncdelay / 2) { rushjob += 1; stat_rush_requests += 1; ret = 1; } mtx_unlock(&sync_mtx); cv_broadcast(&sync_wakeup); return (ret); } /* * Tell the syncer to speed up its work and run though its work * list several times, then tell it to shut down. */ static void syncer_shutdown(void *arg, int howto) { if (howto & RB_NOSYNC) return; mtx_lock(&sync_mtx); syncer_state = SYNCER_SHUTTING_DOWN; rushjob = 0; mtx_unlock(&sync_mtx); cv_broadcast(&sync_wakeup); kproc_shutdown(arg, howto); } /* * Reassign a buffer from one vnode to another. * Used to assign file specific control information * (indirect blocks) to the vnode to which they belong. */ void reassignbuf(struct buf *bp) { struct vnode *vp; struct bufobj *bo; int delay; #ifdef INVARIANTS struct bufv *bv; #endif vp = bp->b_vp; bo = bp->b_bufobj; ++reassignbufcalls; CTR3(KTR_BUF, "reassignbuf(%p) vp %p flags %X", bp, bp->b_vp, bp->b_flags); /* * B_PAGING flagged buffers cannot be reassigned because their vp * is not fully linked in. */ if (bp->b_flags & B_PAGING) panic("cannot reassign paging buffer"); /* * Delete from old vnode list, if on one. */ BO_LOCK(bo); if (bp->b_xflags & (BX_VNDIRTY | BX_VNCLEAN)) buf_vlist_remove(bp); else panic("reassignbuf: Buffer %p not on queue.", bp); /* * If dirty, put on list of dirty buffers; otherwise insert onto list * of clean buffers. */ if (bp->b_flags & B_DELWRI) { if ((bo->bo_flag & BO_ONWORKLST) == 0) { switch (vp->v_type) { case VDIR: delay = dirdelay; break; case VCHR: delay = metadelay; break; default: delay = filedelay; } vn_syncer_add_to_worklist(bo, delay); } buf_vlist_add(bp, bo, BX_VNDIRTY); } else { buf_vlist_add(bp, bo, BX_VNCLEAN); if ((bo->bo_flag & BO_ONWORKLST) && bo->bo_dirty.bv_cnt == 0) { mtx_lock(&sync_mtx); LIST_REMOVE(bo, bo_synclist); syncer_worklist_len--; mtx_unlock(&sync_mtx); bo->bo_flag &= ~BO_ONWORKLST; } } #ifdef INVARIANTS bv = &bo->bo_clean; bp = TAILQ_FIRST(&bv->bv_hd); KASSERT(bp == NULL || bp->b_bufobj == bo, ("bp %p wrong b_bufobj %p should be %p", bp, bp->b_bufobj, bo)); bp = TAILQ_LAST(&bv->bv_hd, buflists); KASSERT(bp == NULL || bp->b_bufobj == bo, ("bp %p wrong b_bufobj %p should be %p", bp, bp->b_bufobj, bo)); bv = &bo->bo_dirty; bp = TAILQ_FIRST(&bv->bv_hd); KASSERT(bp == NULL || bp->b_bufobj == bo, ("bp %p wrong b_bufobj %p should be %p", bp, bp->b_bufobj, bo)); bp = TAILQ_LAST(&bv->bv_hd, buflists); KASSERT(bp == NULL || bp->b_bufobj == bo, ("bp %p wrong b_bufobj %p should be %p", bp, bp->b_bufobj, bo)); #endif BO_UNLOCK(bo); } /* * Increment the use and hold counts on the vnode, taking care to reference * the driver's usecount if this is a chardev. The vholdl() will remove * the vnode from the free list if it is presently free. Requires the * vnode interlock and returns with it held. 
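 * Because this also calls vholdl(), every use reference is backed by a
 * hold reference; v_upgrade_usecount() and v_decr_useonly() below adjust
 * only the use side when the hold reference is managed separately.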
*/ static void v_incr_usecount(struct vnode *vp) { CTR2(KTR_VFS, "%s: vp %p", __func__, vp); vp->v_usecount++; if (vp->v_type == VCHR && vp->v_rdev != NULL) { dev_lock(); vp->v_rdev->si_usecount++; dev_unlock(); } vholdl(vp); } /* * Turn a holdcnt into a use+holdcnt such that only one call to * v_decr_usecount is needed. */ static void v_upgrade_usecount(struct vnode *vp) { CTR2(KTR_VFS, "%s: vp %p", __func__, vp); vp->v_usecount++; if (vp->v_type == VCHR && vp->v_rdev != NULL) { dev_lock(); vp->v_rdev->si_usecount++; dev_unlock(); } } /* * Decrement the vnode use and hold count along with the driver's usecount * if this is a chardev. The vdropl() below releases the vnode interlock * as it may free the vnode. */ static void v_decr_usecount(struct vnode *vp) { ASSERT_VI_LOCKED(vp, __FUNCTION__); VNASSERT(vp->v_usecount > 0, vp, ("v_decr_usecount: negative usecount")); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); vp->v_usecount--; if (vp->v_type == VCHR && vp->v_rdev != NULL) { dev_lock(); vp->v_rdev->si_usecount--; dev_unlock(); } vdropl(vp); } /* * Decrement only the use count and driver use count. This is intended to * be paired with a follow on vdropl() to release the remaining hold count. * In this way we may vgone() a vnode with a 0 usecount without risk of * having it end up on a free list because the hold count is kept above 0. */ static void v_decr_useonly(struct vnode *vp) { ASSERT_VI_LOCKED(vp, __FUNCTION__); VNASSERT(vp->v_usecount > 0, vp, ("v_decr_useonly: negative usecount")); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); vp->v_usecount--; if (vp->v_type == VCHR && vp->v_rdev != NULL) { dev_lock(); vp->v_rdev->si_usecount--; dev_unlock(); } } /* * Grab a particular vnode from the free list, increment its * reference count and lock it. VI_DOOMED is set if the vnode * is being destroyed. Only callers who specify LK_RETRY will * see doomed vnodes. If inactive processing was delayed in * vput try to do it here. */ int vget(struct vnode *vp, int flags, struct thread *td) { int error; error = 0; VFS_ASSERT_GIANT(vp->v_mount); VNASSERT((flags & LK_TYPE_MASK) != 0, vp, ("vget: invalid lock operation")); CTR3(KTR_VFS, "%s: vp %p with flags %d", __func__, vp, flags); if ((flags & LK_INTERLOCK) == 0) VI_LOCK(vp); vholdl(vp); if ((error = vn_lock(vp, flags | LK_INTERLOCK)) != 0) { vdrop(vp); CTR2(KTR_VFS, "%s: impossible to lock vnode %p", __func__, vp); return (error); } if (vp->v_iflag & VI_DOOMED && (flags & LK_RETRY) == 0) panic("vget: vn_lock failed to return ENOENT\n"); VI_LOCK(vp); /* Upgrade our holdcnt to a usecount. */ v_upgrade_usecount(vp); /* * We don't guarantee that any particular close will * trigger inactive processing so just make a best effort * here at preventing a reference to a removed file. If * we don't succeed no harm is done. */ if (vp->v_iflag & VI_OWEINACT) { if (VOP_ISLOCKED(vp) == LK_EXCLUSIVE && (flags & LK_NOWAIT) == 0) vinactive(vp, td); vp->v_iflag &= ~VI_OWEINACT; } VI_UNLOCK(vp); return (0); } /* * Increase the reference count of a vnode. */ void vref(struct vnode *vp) { CTR2(KTR_VFS, "%s: vp %p", __func__, vp); VI_LOCK(vp); v_incr_usecount(vp); VI_UNLOCK(vp); } /* * Return reference count of a vnode. * * The results of this call are only guaranteed when some mechanism other * than the VI lock is used to stop other processes from gaining references * to the vnode. This may be the case if the caller holds the only reference. * This is also useful when stale data is acceptable as race conditions may * be accounted for by some other means. 
*/ int vrefcnt(struct vnode *vp) { int usecnt; VI_LOCK(vp); usecnt = vp->v_usecount; VI_UNLOCK(vp); return (usecnt); } /* * Vnode put/release. * If count drops to zero, call inactive routine and return to freelist. */ void vrele(struct vnode *vp) { struct thread *td = curthread; /* XXX */ KASSERT(vp != NULL, ("vrele: null vp")); VFS_ASSERT_GIANT(vp->v_mount); VI_LOCK(vp); /* Skip this v_writecount check if we're going to panic below. */ VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < 1, vp, ("vrele: missed vn_close")); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); if (vp->v_usecount > 1 || ((vp->v_iflag & VI_DOINGINACT) && vp->v_usecount == 1)) { v_decr_usecount(vp); return; } if (vp->v_usecount != 1) { #ifdef DIAGNOSTIC vprint("vrele: negative ref count", vp); #endif VI_UNLOCK(vp); panic("vrele: negative ref cnt"); } CTR2(KTR_VFS, "%s: return vnode %p to the freelist", __func__, vp); /* * We want to hold the vnode until the inactive finishes to * prevent vgone() races. We drop the use count here and the * hold count below when we're done. */ v_decr_useonly(vp); /* * We must call VOP_INACTIVE with the node locked. Mark * as VI_DOINGINACT to avoid recursion. */ vp->v_iflag |= VI_OWEINACT; if (vn_lock(vp, LK_EXCLUSIVE | LK_INTERLOCK) == 0) { VI_LOCK(vp); if (vp->v_usecount > 0) vp->v_iflag &= ~VI_OWEINACT; if (vp->v_iflag & VI_OWEINACT) vinactive(vp, td); VOP_UNLOCK(vp, 0); } else { VI_LOCK(vp); if (vp->v_usecount > 0) vp->v_iflag &= ~VI_OWEINACT; } vdropl(vp); } /* * Release an already locked vnode. This give the same effects as * unlock+vrele(), but takes less time and avoids releasing and * re-aquiring the lock (as vrele() acquires the lock internally.) */ void vput(struct vnode *vp) { struct thread *td = curthread; /* XXX */ int error; KASSERT(vp != NULL, ("vput: null vp")); ASSERT_VOP_LOCKED(vp, "vput"); VFS_ASSERT_GIANT(vp->v_mount); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); VI_LOCK(vp); /* Skip this v_writecount check if we're going to panic below. */ VNASSERT(vp->v_writecount < vp->v_usecount || vp->v_usecount < 1, vp, ("vput: missed vn_close")); error = 0; if (vp->v_usecount > 1 || ((vp->v_iflag & VI_DOINGINACT) && vp->v_usecount == 1)) { VOP_UNLOCK(vp, 0); v_decr_usecount(vp); return; } if (vp->v_usecount != 1) { #ifdef DIAGNOSTIC vprint("vput: negative ref count", vp); #endif panic("vput: negative ref cnt"); } CTR2(KTR_VFS, "%s: return to freelist the vnode %p", __func__, vp); /* * We want to hold the vnode until the inactive finishes to * prevent vgone() races. We drop the use count here and the * hold count below when we're done. */ v_decr_useonly(vp); vp->v_iflag |= VI_OWEINACT; if (VOP_ISLOCKED(vp) != LK_EXCLUSIVE) { error = VOP_LOCK(vp, LK_UPGRADE|LK_INTERLOCK|LK_NOWAIT); VI_LOCK(vp); if (error) { if (vp->v_usecount > 0) vp->v_iflag &= ~VI_OWEINACT; goto done; } } if (vp->v_usecount > 0) vp->v_iflag &= ~VI_OWEINACT; if (vp->v_iflag & VI_OWEINACT) vinactive(vp, td); VOP_UNLOCK(vp, 0); done: vdropl(vp); } /* * Somebody doesn't want the vnode recycled. */ void vhold(struct vnode *vp) { VI_LOCK(vp); vholdl(vp); VI_UNLOCK(vp); } void vholdl(struct vnode *vp) { CTR2(KTR_VFS, "%s: vp %p", __func__, vp); vp->v_holdcnt++; if (VSHOULDBUSY(vp)) vbusy(vp); } /* * Note that there is one less who cares about this vnode. vdrop() is the * opposite of vhold(). */ void vdrop(struct vnode *vp) { VI_LOCK(vp); vdropl(vp); } /* * Drop the hold count of the vnode. 
If this is the last reference to * the vnode we will free it if it has been vgone'd otherwise it is * placed on the free list. */ void vdropl(struct vnode *vp) { ASSERT_VI_LOCKED(vp, "vdropl"); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); if (vp->v_holdcnt <= 0) panic("vdrop: holdcnt %d", vp->v_holdcnt); vp->v_holdcnt--; if (vp->v_holdcnt == 0) { if (vp->v_iflag & VI_DOOMED) { CTR2(KTR_VFS, "%s: destroying the vnode %p", __func__, vp); vdestroy(vp); return; } else vfree(vp); } VI_UNLOCK(vp); } /* * Call VOP_INACTIVE on the vnode and manage the DOINGINACT and OWEINACT * flags. DOINGINACT prevents us from recursing in calls to vinactive. * OWEINACT tracks whether a vnode missed a call to inactive due to a * failed lock upgrade. */ static void vinactive(struct vnode *vp, struct thread *td) { ASSERT_VOP_ELOCKED(vp, "vinactive"); ASSERT_VI_LOCKED(vp, "vinactive"); VNASSERT((vp->v_iflag & VI_DOINGINACT) == 0, vp, ("vinactive: recursed on VI_DOINGINACT")); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); vp->v_iflag |= VI_DOINGINACT; vp->v_iflag &= ~VI_OWEINACT; VI_UNLOCK(vp); VOP_INACTIVE(vp, td); VI_LOCK(vp); VNASSERT(vp->v_iflag & VI_DOINGINACT, vp, ("vinactive: lost VI_DOINGINACT")); vp->v_iflag &= ~VI_DOINGINACT; } /* * Remove any vnodes in the vnode table belonging to mount point mp. * * If FORCECLOSE is not specified, there should not be any active ones, * return error if any are found (nb: this is a user error, not a * system error). If FORCECLOSE is specified, detach any active vnodes * that are found. * * If WRITECLOSE is set, only flush out regular file vnodes open for * writing. * * SKIPSYSTEM causes any vnodes marked VV_SYSTEM to be skipped. * * `rootrefs' specifies the base reference count for the root vnode * of this filesystem. The root vnode is considered busy if its * v_usecount exceeds this value. On a successful return, vflush(, td) * will call vrele() on the root vnode exactly rootrefs times. * If the SKIPSYSTEM or WRITECLOSE flags are specified, rootrefs must * be zero. */ #ifdef DIAGNOSTIC static int busyprt = 0; /* print out busy vnodes */ SYSCTL_INT(_debug, OID_AUTO, busyprt, CTLFLAG_RW, &busyprt, 0, ""); #endif int vflush( struct mount *mp, int rootrefs, int flags, struct thread *td) { struct vnode *vp, *mvp, *rootvp = NULL; struct vattr vattr; int busy = 0, error; CTR4(KTR_VFS, "%s: mp %p with rootrefs %d and flags %d", __func__, mp, rootrefs, flags); if (rootrefs > 0) { KASSERT((flags & (SKIPSYSTEM | WRITECLOSE)) == 0, ("vflush: bad args")); /* * Get the filesystem root vnode. We can vput() it * immediately, since with rootrefs > 0, it won't go away. */ if ((error = VFS_ROOT(mp, LK_EXCLUSIVE, &rootvp)) != 0) { CTR2(KTR_VFS, "%s: vfs_root lookup failed with %d", __func__, error); return (error); } vput(rootvp); } MNT_ILOCK(mp); loop: MNT_VNODE_FOREACH(vp, mp, mvp) { VI_LOCK(vp); vholdl(vp); MNT_IUNLOCK(mp); error = vn_lock(vp, LK_INTERLOCK | LK_EXCLUSIVE); if (error) { vdrop(vp); MNT_ILOCK(mp); MNT_VNODE_FOREACH_ABORT_ILOCKED(mp, mvp); goto loop; } /* * Skip over a vnodes marked VV_SYSTEM. */ if ((flags & SKIPSYSTEM) && (vp->v_vflag & VV_SYSTEM)) { VOP_UNLOCK(vp, 0); vdrop(vp); MNT_ILOCK(mp); continue; } /* * If WRITECLOSE is set, flush out unlinked but still open * files (even if open only for reading) and regular file * vnodes open for writing. 
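 * Vnodes that still have a positive link count and are not regular files
 * open for writing are left alone.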
*/ if (flags & WRITECLOSE) { error = VOP_GETATTR(vp, &vattr, td->td_ucred); VI_LOCK(vp); if ((vp->v_type == VNON || (error == 0 && vattr.va_nlink > 0)) && (vp->v_writecount == 0 || vp->v_type != VREG)) { VOP_UNLOCK(vp, 0); vdropl(vp); MNT_ILOCK(mp); continue; } } else VI_LOCK(vp); /* * With v_usecount == 0, all we need to do is clear out the * vnode data structures and we are done. * * If FORCECLOSE is set, forcibly close the vnode. */ if (vp->v_usecount == 0 || (flags & FORCECLOSE)) { VNASSERT(vp->v_usecount == 0 || (vp->v_type != VCHR && vp->v_type != VBLK), vp, ("device VNODE %p is FORCECLOSED", vp)); vgonel(vp); } else { busy++; #ifdef DIAGNOSTIC if (busyprt) vprint("vflush: busy vnode", vp); #endif } VOP_UNLOCK(vp, 0); vdropl(vp); MNT_ILOCK(mp); } MNT_IUNLOCK(mp); if (rootrefs > 0 && (flags & FORCECLOSE) == 0) { /* * If just the root vnode is busy, and if its refcount * is equal to `rootrefs', then go ahead and kill it. */ VI_LOCK(rootvp); KASSERT(busy > 0, ("vflush: not busy")); VNASSERT(rootvp->v_usecount >= rootrefs, rootvp, ("vflush: usecount %d < rootrefs %d", rootvp->v_usecount, rootrefs)); if (busy == 1 && rootvp->v_usecount == rootrefs) { VOP_LOCK(rootvp, LK_EXCLUSIVE|LK_INTERLOCK); vgone(rootvp); VOP_UNLOCK(rootvp, 0); busy = 0; } else VI_UNLOCK(rootvp); } if (busy) { CTR2(KTR_VFS, "%s: failing as %d vnodes are busy", __func__, busy); return (EBUSY); } for (; rootrefs > 0; rootrefs--) vrele(rootvp); return (0); } /* * Recycle an unused vnode to the front of the free list. */ int vrecycle(struct vnode *vp, struct thread *td) { int recycled; ASSERT_VOP_ELOCKED(vp, "vrecycle"); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); recycled = 0; VI_LOCK(vp); if (vp->v_usecount == 0) { recycled = 1; vgonel(vp); } VI_UNLOCK(vp); return (recycled); } /* * Eliminate all activity associated with a vnode * in preparation for reuse. */ void vgone(struct vnode *vp) { VI_LOCK(vp); vgonel(vp); VI_UNLOCK(vp); } /* * vgone, with the vp interlock held. */ void vgonel(struct vnode *vp) { struct thread *td; int oweinact; int active; struct mount *mp; ASSERT_VOP_ELOCKED(vp, "vgonel"); ASSERT_VI_LOCKED(vp, "vgonel"); VNASSERT(vp->v_holdcnt, vp, ("vgonel: vp %p has no reference.", vp)); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); td = curthread; /* * Don't vgonel if we're already doomed. */ if (vp->v_iflag & VI_DOOMED) return; vp->v_iflag |= VI_DOOMED; /* * Check to see if the vnode is in use. If so, we have to call * VOP_CLOSE() and VOP_INACTIVE(). */ active = vp->v_usecount; oweinact = (vp->v_iflag & VI_OWEINACT); VI_UNLOCK(vp); /* * Clean out any buffers associated with the vnode. * If the flush fails, just toss the buffers. */ mp = NULL; if (!TAILQ_EMPTY(&vp->v_bufobj.bo_dirty.bv_hd)) (void) vn_start_secondary_write(vp, &mp, V_WAIT); if (vinvalbuf(vp, V_SAVE, 0, 0) != 0) vinvalbuf(vp, 0, 0, 0); /* * If purging an active vnode, it must be closed and * deactivated before being reclaimed. */ if (active) VOP_CLOSE(vp, FNONBLOCK, NOCRED, td); if (oweinact || active) { VI_LOCK(vp); if ((vp->v_iflag & VI_DOINGINACT) == 0) vinactive(vp, td); VI_UNLOCK(vp); } /* * Reclaim the vnode. */ if (VOP_RECLAIM(vp, td)) panic("vgone: cannot reclaim"); if (mp != NULL) vn_finished_secondary_write(mp); VNASSERT(vp->v_object == NULL, vp, ("vop_reclaim left v_object vp=%p, tag=%s", vp, vp->v_tag)); /* * Clear the advisory locks and wake up waiting threads. */ lf_purgelocks(vp, &(vp->v_lockf)); /* * Delete from old mount point vnode list. 
*/ delmntque(vp); cache_purge(vp); /* * Done with purge, reset to the standard lock and invalidate * the vnode. */ VI_LOCK(vp); vp->v_vnlock = &vp->v_lock; vp->v_op = &dead_vnodeops; vp->v_tag = "none"; vp->v_type = VBAD; } /* * Calculate the total number of references to a special device. */ int vcount(struct vnode *vp) { int count; dev_lock(); count = vp->v_rdev->si_usecount; dev_unlock(); return (count); } /* * Same as above, but using the struct cdev *as argument */ int count_dev(struct cdev *dev) { int count; dev_lock(); count = dev->si_usecount; dev_unlock(); return(count); } /* * Print out a description of a vnode. */ static char *typename[] = {"VNON", "VREG", "VDIR", "VBLK", "VCHR", "VLNK", "VSOCK", "VFIFO", "VBAD", "VMARKER"}; void vn_printf(struct vnode *vp, const char *fmt, ...) { va_list ap; char buf[256], buf2[16]; u_long flags; va_start(ap, fmt); vprintf(fmt, ap); va_end(ap); printf("%p: ", (void *)vp); printf("tag %s, type %s\n", vp->v_tag, typename[vp->v_type]); printf(" usecount %d, writecount %d, refcount %d mountedhere %p\n", vp->v_usecount, vp->v_writecount, vp->v_holdcnt, vp->v_mountedhere); buf[0] = '\0'; buf[1] = '\0'; if (vp->v_vflag & VV_ROOT) strlcat(buf, "|VV_ROOT", sizeof(buf)); if (vp->v_vflag & VV_ISTTY) strlcat(buf, "|VV_ISTTY", sizeof(buf)); if (vp->v_vflag & VV_NOSYNC) strlcat(buf, "|VV_NOSYNC", sizeof(buf)); if (vp->v_vflag & VV_CACHEDLABEL) strlcat(buf, "|VV_CACHEDLABEL", sizeof(buf)); if (vp->v_vflag & VV_TEXT) strlcat(buf, "|VV_TEXT", sizeof(buf)); if (vp->v_vflag & VV_COPYONWRITE) strlcat(buf, "|VV_COPYONWRITE", sizeof(buf)); if (vp->v_vflag & VV_SYSTEM) strlcat(buf, "|VV_SYSTEM", sizeof(buf)); if (vp->v_vflag & VV_PROCDEP) strlcat(buf, "|VV_PROCDEP", sizeof(buf)); if (vp->v_vflag & VV_NOKNOTE) strlcat(buf, "|VV_NOKNOTE", sizeof(buf)); if (vp->v_vflag & VV_DELETED) strlcat(buf, "|VV_DELETED", sizeof(buf)); if (vp->v_vflag & VV_MD) strlcat(buf, "|VV_MD", sizeof(buf)); flags = vp->v_vflag & ~(VV_ROOT | VV_ISTTY | VV_NOSYNC | VV_CACHEDLABEL | VV_TEXT | VV_COPYONWRITE | VV_SYSTEM | VV_PROCDEP | VV_NOKNOTE | VV_DELETED | VV_MD); if (flags != 0) { snprintf(buf2, sizeof(buf2), "|VV(0x%lx)", flags); strlcat(buf, buf2, sizeof(buf)); } if (vp->v_iflag & VI_MOUNT) strlcat(buf, "|VI_MOUNT", sizeof(buf)); if (vp->v_iflag & VI_AGE) strlcat(buf, "|VI_AGE", sizeof(buf)); if (vp->v_iflag & VI_DOOMED) strlcat(buf, "|VI_DOOMED", sizeof(buf)); if (vp->v_iflag & VI_FREE) strlcat(buf, "|VI_FREE", sizeof(buf)); if (vp->v_iflag & VI_OBJDIRTY) strlcat(buf, "|VI_OBJDIRTY", sizeof(buf)); if (vp->v_iflag & VI_DOINGINACT) strlcat(buf, "|VI_DOINGINACT", sizeof(buf)); if (vp->v_iflag & VI_OWEINACT) strlcat(buf, "|VI_OWEINACT", sizeof(buf)); flags = vp->v_iflag & ~(VI_MOUNT | VI_AGE | VI_DOOMED | VI_FREE | VI_OBJDIRTY | VI_DOINGINACT | VI_OWEINACT); if (flags != 0) { snprintf(buf2, sizeof(buf2), "|VI(0x%lx)", flags); strlcat(buf, buf2, sizeof(buf)); } printf(" flags (%s)\n", buf + 1); if (mtx_owned(VI_MTX(vp))) printf(" VI_LOCKed"); if (vp->v_object != NULL) printf(" v_object %p ref %d pages %d\n", vp->v_object, vp->v_object->ref_count, vp->v_object->resident_page_count); printf(" "); lockmgr_printinfo(vp->v_vnlock); if (vp->v_data != NULL) VOP_PRINT(vp); } #ifdef DDB /* * List all of the locked vnodes in the system. * Called when debugging the kernel. 
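 *
 * For example, from the DDB prompt (a usage sketch; the addresses and
 * tags printed will of course vary):
 *
 *      db> show lockedvnods
 *      Locked vnodes
 *
 * followed by one vn_printf() block for every vnode that is currently
 * locked, which is often the quickest way to spot a stuck filesystem
 * operation.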
*/ DB_SHOW_COMMAND(lockedvnods, lockedvnodes) { struct mount *mp, *nmp; struct vnode *vp; /* * Note: because this is DDB, we can't obey the locking semantics * for these structures, which means we could catch an inconsistent * state and dereference a nasty pointer. Not much to be done * about that. */ db_printf("Locked vnodes\n"); for (mp = TAILQ_FIRST(&mountlist); mp != NULL; mp = nmp) { nmp = TAILQ_NEXT(mp, mnt_list); TAILQ_FOREACH(vp, &mp->mnt_nvnodelist, v_nmntvnodes) { if (vp->v_type != VMARKER && VOP_ISLOCKED(vp)) vprint("", vp); } nmp = TAILQ_NEXT(mp, mnt_list); } } /* * Show details about the given vnode. */ DB_SHOW_COMMAND(vnode, db_show_vnode) { struct vnode *vp; if (!have_addr) return; vp = (struct vnode *)addr; vn_printf(vp, "vnode "); } /* * Show details about the given mount point. */ DB_SHOW_COMMAND(mount, db_show_mount) { struct mount *mp; struct statfs *sp; struct vnode *vp; char buf[512]; u_int flags; if (!have_addr) { /* No address given, print short info about all mount points. */ TAILQ_FOREACH(mp, &mountlist, mnt_list) { db_printf("%p %s on %s (%s)\n", mp, mp->mnt_stat.f_mntfromname, mp->mnt_stat.f_mntonname, mp->mnt_stat.f_fstypename); if (db_pager_quit) break; } db_printf("\nMore info: show mount \n"); return; } mp = (struct mount *)addr; db_printf("%p %s on %s (%s)\n", mp, mp->mnt_stat.f_mntfromname, mp->mnt_stat.f_mntonname, mp->mnt_stat.f_fstypename); buf[0] = '\0'; flags = mp->mnt_flag; #define MNT_FLAG(flag) do { \ if (flags & (flag)) { \ if (buf[0] != '\0') \ strlcat(buf, ", ", sizeof(buf)); \ strlcat(buf, (#flag) + 4, sizeof(buf)); \ flags &= ~(flag); \ } \ } while (0) MNT_FLAG(MNT_RDONLY); MNT_FLAG(MNT_SYNCHRONOUS); MNT_FLAG(MNT_NOEXEC); MNT_FLAG(MNT_NOSUID); MNT_FLAG(MNT_UNION); MNT_FLAG(MNT_ASYNC); MNT_FLAG(MNT_SUIDDIR); MNT_FLAG(MNT_SOFTDEP); MNT_FLAG(MNT_NOSYMFOLLOW); MNT_FLAG(MNT_GJOURNAL); MNT_FLAG(MNT_MULTILABEL); MNT_FLAG(MNT_ACLS); MNT_FLAG(MNT_NOATIME); MNT_FLAG(MNT_NOCLUSTERR); MNT_FLAG(MNT_NOCLUSTERW); MNT_FLAG(MNT_EXRDONLY); MNT_FLAG(MNT_EXPORTED); MNT_FLAG(MNT_DEFEXPORTED); MNT_FLAG(MNT_EXPORTANON); MNT_FLAG(MNT_EXKERB); MNT_FLAG(MNT_EXPUBLIC); MNT_FLAG(MNT_LOCAL); MNT_FLAG(MNT_QUOTA); MNT_FLAG(MNT_ROOTFS); MNT_FLAG(MNT_USER); MNT_FLAG(MNT_IGNORE); MNT_FLAG(MNT_UPDATE); MNT_FLAG(MNT_DELEXPORT); MNT_FLAG(MNT_RELOAD); MNT_FLAG(MNT_FORCE); MNT_FLAG(MNT_SNAPSHOT); MNT_FLAG(MNT_BYFSID); #undef MNT_FLAG if (flags != 0) { if (buf[0] != '\0') strlcat(buf, ", ", sizeof(buf)); snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "0x%08x", flags); } db_printf(" mnt_flag = %s\n", buf); buf[0] = '\0'; flags = mp->mnt_kern_flag; #define MNT_KERN_FLAG(flag) do { \ if (flags & (flag)) { \ if (buf[0] != '\0') \ strlcat(buf, ", ", sizeof(buf)); \ strlcat(buf, (#flag) + 5, sizeof(buf)); \ flags &= ~(flag); \ } \ } while (0) MNT_KERN_FLAG(MNTK_UNMOUNTF); MNT_KERN_FLAG(MNTK_ASYNC); MNT_KERN_FLAG(MNTK_SOFTDEP); MNT_KERN_FLAG(MNTK_NOINSMNTQ); MNT_KERN_FLAG(MNTK_UNMOUNT); MNT_KERN_FLAG(MNTK_MWAIT); MNT_KERN_FLAG(MNTK_SUSPEND); MNT_KERN_FLAG(MNTK_SUSPEND2); MNT_KERN_FLAG(MNTK_SUSPENDED); MNT_KERN_FLAG(MNTK_MPSAFE); MNT_KERN_FLAG(MNTK_NOKNOTE); MNT_KERN_FLAG(MNTK_LOOKUP_SHARED); #undef MNT_KERN_FLAG if (flags != 0) { if (buf[0] != '\0') strlcat(buf, ", ", sizeof(buf)); snprintf(buf + strlen(buf), sizeof(buf) - strlen(buf), "0x%08x", flags); } db_printf(" mnt_kern_flag = %s\n", buf); sp = &mp->mnt_stat; db_printf(" mnt_stat = { version=%u type=%u flags=0x%016jx " "bsize=%ju iosize=%ju blocks=%ju bfree=%ju bavail=%jd files=%ju " "ffree=%jd syncwrites=%ju asyncwrites=%ju 
syncreads=%ju " "asyncreads=%ju namemax=%u owner=%u fsid=[%d, %d] }\n", (u_int)sp->f_version, (u_int)sp->f_type, (uintmax_t)sp->f_flags, (uintmax_t)sp->f_bsize, (uintmax_t)sp->f_iosize, (uintmax_t)sp->f_blocks, (uintmax_t)sp->f_bfree, (intmax_t)sp->f_bavail, (uintmax_t)sp->f_files, (intmax_t)sp->f_ffree, (uintmax_t)sp->f_syncwrites, (uintmax_t)sp->f_asyncwrites, (uintmax_t)sp->f_syncreads, (uintmax_t)sp->f_asyncreads, (u_int)sp->f_namemax, (u_int)sp->f_owner, (int)sp->f_fsid.val[0], (int)sp->f_fsid.val[1]); db_printf(" mnt_cred = { uid=%u ruid=%u", (u_int)mp->mnt_cred->cr_uid, (u_int)mp->mnt_cred->cr_ruid); - if (mp->mnt_cred->cr_prison != NULL) + if (jailed(mp->mnt_cred)) db_printf(", jail=%d", mp->mnt_cred->cr_prison->pr_id); db_printf(" }\n"); db_printf(" mnt_ref = %d\n", mp->mnt_ref); db_printf(" mnt_gen = %d\n", mp->mnt_gen); db_printf(" mnt_nvnodelistsize = %d\n", mp->mnt_nvnodelistsize); db_printf(" mnt_writeopcount = %d\n", mp->mnt_writeopcount); db_printf(" mnt_noasync = %u\n", mp->mnt_noasync); db_printf(" mnt_maxsymlinklen = %d\n", mp->mnt_maxsymlinklen); db_printf(" mnt_iosize_max = %d\n", mp->mnt_iosize_max); db_printf(" mnt_hashseed = %u\n", mp->mnt_hashseed); db_printf(" mnt_secondary_writes = %d\n", mp->mnt_secondary_writes); db_printf(" mnt_secondary_accwrites = %d\n", mp->mnt_secondary_accwrites); db_printf(" mnt_gjprovider = %s\n", mp->mnt_gjprovider != NULL ? mp->mnt_gjprovider : "NULL"); db_printf("\n"); TAILQ_FOREACH(vp, &mp->mnt_nvnodelist, v_nmntvnodes) { if (vp->v_type != VMARKER) { vn_printf(vp, "vnode "); if (db_pager_quit) break; } } } #endif /* DDB */ /* * Fill in a struct xvfsconf based on a struct vfsconf. */ static void vfsconf2x(struct vfsconf *vfsp, struct xvfsconf *xvfsp) { strcpy(xvfsp->vfc_name, vfsp->vfc_name); xvfsp->vfc_typenum = vfsp->vfc_typenum; xvfsp->vfc_refcount = vfsp->vfc_refcount; xvfsp->vfc_flags = vfsp->vfc_flags; /* * These are unused in userland, we keep them * to not break binary compatibility. */ xvfsp->vfc_vfsops = NULL; xvfsp->vfc_next = NULL; } /* * Top level filesystem related information gathering. */ static int sysctl_vfs_conflist(SYSCTL_HANDLER_ARGS) { struct vfsconf *vfsp; struct xvfsconf xvfsp; int error; error = 0; TAILQ_FOREACH(vfsp, &vfsconf, vfc_list) { bzero(&xvfsp, sizeof(xvfsp)); vfsconf2x(vfsp, &xvfsp); error = SYSCTL_OUT(req, &xvfsp, sizeof xvfsp); if (error) break; } return (error); } SYSCTL_PROC(_vfs, OID_AUTO, conflist, CTLFLAG_RD, NULL, 0, sysctl_vfs_conflist, "S,xvfsconf", "List of all configured filesystems"); #ifndef BURN_BRIDGES static int sysctl_ovfs_conf(SYSCTL_HANDLER_ARGS); static int vfs_sysctl(SYSCTL_HANDLER_ARGS) { int *name = (int *)arg1 - 1; /* XXX */ u_int namelen = arg2 + 1; /* XXX */ struct vfsconf *vfsp; struct xvfsconf xvfsp; printf("WARNING: userland calling deprecated sysctl, " "please rebuild world\n"); #if 1 || defined(COMPAT_PRELITE2) /* Resolve ambiguity between VFS_VFSCONF and VFS_GENERIC. 
*/ if (namelen == 1) return (sysctl_ovfs_conf(oidp, arg1, arg2, req)); #endif switch (name[1]) { case VFS_MAXTYPENUM: if (namelen != 2) return (ENOTDIR); return (SYSCTL_OUT(req, &maxvfsconf, sizeof(int))); case VFS_CONF: if (namelen != 3) return (ENOTDIR); /* overloaded */ TAILQ_FOREACH(vfsp, &vfsconf, vfc_list) if (vfsp->vfc_typenum == name[2]) break; if (vfsp == NULL) return (EOPNOTSUPP); bzero(&xvfsp, sizeof(xvfsp)); vfsconf2x(vfsp, &xvfsp); return (SYSCTL_OUT(req, &xvfsp, sizeof(xvfsp))); } return (EOPNOTSUPP); } static SYSCTL_NODE(_vfs, VFS_GENERIC, generic, CTLFLAG_RD | CTLFLAG_SKIP, vfs_sysctl, "Generic filesystem"); #if 1 || defined(COMPAT_PRELITE2) static int sysctl_ovfs_conf(SYSCTL_HANDLER_ARGS) { int error; struct vfsconf *vfsp; struct ovfsconf ovfs; TAILQ_FOREACH(vfsp, &vfsconf, vfc_list) { bzero(&ovfs, sizeof(ovfs)); ovfs.vfc_vfsops = vfsp->vfc_vfsops; /* XXX used as flag */ strcpy(ovfs.vfc_name, vfsp->vfc_name); ovfs.vfc_index = vfsp->vfc_typenum; ovfs.vfc_refcount = vfsp->vfc_refcount; ovfs.vfc_flags = vfsp->vfc_flags; error = SYSCTL_OUT(req, &ovfs, sizeof ovfs); if (error) return error; } return 0; } #endif /* 1 || COMPAT_PRELITE2 */ #endif /* !BURN_BRIDGES */ #define KINFO_VNODESLOP 10 #ifdef notyet /* * Dump vnode list (via sysctl). */ /* ARGSUSED */ static int sysctl_vnode(SYSCTL_HANDLER_ARGS) { struct xvnode *xvn; struct mount *mp; struct vnode *vp; int error, len, n; /* * Stale numvnodes access is not fatal here. */ req->lock = 0; len = (numvnodes + KINFO_VNODESLOP) * sizeof *xvn; if (!req->oldptr) /* Make an estimate */ return (SYSCTL_OUT(req, 0, len)); error = sysctl_wire_old_buffer(req, 0); if (error != 0) return (error); xvn = malloc(len, M_TEMP, M_ZERO | M_WAITOK); n = 0; mtx_lock(&mountlist_mtx); TAILQ_FOREACH(mp, &mountlist, mnt_list) { if (vfs_busy(mp, MBF_NOWAIT | MBF_MNTLSTLOCK)) continue; MNT_ILOCK(mp); TAILQ_FOREACH(vp, &mp->mnt_nvnodelist, v_nmntvnodes) { if (n == len) break; vref(vp); xvn[n].xv_size = sizeof *xvn; xvn[n].xv_vnode = vp; xvn[n].xv_id = 0; /* XXX compat */ #define XV_COPY(field) xvn[n].xv_##field = vp->v_##field XV_COPY(usecount); XV_COPY(writecount); XV_COPY(holdcnt); XV_COPY(mount); XV_COPY(numoutput); XV_COPY(type); #undef XV_COPY xvn[n].xv_flag = vp->v_vflag; switch (vp->v_type) { case VREG: case VDIR: case VLNK: break; case VBLK: case VCHR: if (vp->v_rdev == NULL) { vrele(vp); continue; } xvn[n].xv_dev = dev2udev(vp->v_rdev); break; case VSOCK: xvn[n].xv_socket = vp->v_socket; break; case VFIFO: xvn[n].xv_fifo = vp->v_fifoinfo; break; case VNON: case VBAD: default: /* shouldn't happen? */ vrele(vp); continue; } vrele(vp); ++n; } MNT_IUNLOCK(mp); mtx_lock(&mountlist_mtx); vfs_unbusy(mp); if (n == len) break; } mtx_unlock(&mountlist_mtx); error = SYSCTL_OUT(req, xvn, n * sizeof *xvn); free(xvn, M_TEMP); return (error); } SYSCTL_PROC(_kern, KERN_VNODE, vnode, CTLTYPE_OPAQUE|CTLFLAG_RD, 0, 0, sysctl_vnode, "S,xvnode", ""); #endif /* * Unmount all filesystems. The list is traversed in reverse order * of mounting to avoid dependencies. */ void vfs_unmountall(void) { struct mount *mp; struct thread *td; int error; KASSERT(curthread != NULL, ("vfs_unmountall: NULL curthread")); CTR1(KTR_VFS, "%s: unmounting all filesystems", __func__); td = curthread; /* * Since this only runs when rebooting, it is not interlocked. 
*/ while(!TAILQ_EMPTY(&mountlist)) { mp = TAILQ_LAST(&mountlist, mntlist); error = dounmount(mp, MNT_FORCE, td); if (error) { TAILQ_REMOVE(&mountlist, mp, mnt_list); /* * XXX: Due to the way in which we mount the root * file system off of devfs, devfs will generate a * "busy" warning when we try to unmount it before * the root. Don't print a warning as a result in * order to avoid false positive errors that may * cause needless upset. */ if (strcmp(mp->mnt_vfc->vfc_name, "devfs") != 0) { printf("unmount of %s failed (", mp->mnt_stat.f_mntonname); if (error == EBUSY) printf("BUSY)\n"); else printf("%d)\n", error); } } else { /* The unmount has removed mp from the mountlist */ } } } /* * perform msync on all vnodes under a mount point * the mount point must be locked. */ void vfs_msync(struct mount *mp, int flags) { struct vnode *vp, *mvp; struct vm_object *obj; CTR2(KTR_VFS, "%s: mp %p", __func__, mp); MNT_ILOCK(mp); MNT_VNODE_FOREACH(vp, mp, mvp) { VI_LOCK(vp); if ((vp->v_iflag & VI_OBJDIRTY) && (flags == MNT_WAIT || VOP_ISLOCKED(vp) == 0)) { MNT_IUNLOCK(mp); if (!vget(vp, LK_EXCLUSIVE | LK_RETRY | LK_INTERLOCK, curthread)) { if (vp->v_vflag & VV_NOSYNC) { /* unlinked */ vput(vp); MNT_ILOCK(mp); continue; } obj = vp->v_object; if (obj != NULL) { VM_OBJECT_LOCK(obj); vm_object_page_clean(obj, 0, 0, flags == MNT_WAIT ? OBJPC_SYNC : OBJPC_NOSYNC); VM_OBJECT_UNLOCK(obj); } vput(vp); } MNT_ILOCK(mp); } else VI_UNLOCK(vp); } MNT_IUNLOCK(mp); } /* * Mark a vnode as free, putting it up for recycling. */ static void vfree(struct vnode *vp) { ASSERT_VI_LOCKED(vp, "vfree"); mtx_lock(&vnode_free_list_mtx); VNASSERT(vp->v_op != NULL, vp, ("vfree: vnode already reclaimed.")); VNASSERT((vp->v_iflag & VI_FREE) == 0, vp, ("vnode already free")); VNASSERT(VSHOULDFREE(vp), vp, ("vfree: freeing when we shouldn't")); VNASSERT((vp->v_iflag & VI_DOOMED) == 0, vp, ("vfree: Freeing doomed vnode")); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); if (vp->v_iflag & VI_AGE) { TAILQ_INSERT_HEAD(&vnode_free_list, vp, v_freelist); } else { TAILQ_INSERT_TAIL(&vnode_free_list, vp, v_freelist); } freevnodes++; vp->v_iflag &= ~VI_AGE; vp->v_iflag |= VI_FREE; mtx_unlock(&vnode_free_list_mtx); } /* * Opposite of vfree() - mark a vnode as in use. */ static void vbusy(struct vnode *vp) { ASSERT_VI_LOCKED(vp, "vbusy"); VNASSERT((vp->v_iflag & VI_FREE) != 0, vp, ("vnode not free")); VNASSERT(vp->v_op != NULL, vp, ("vbusy: vnode already reclaimed.")); CTR2(KTR_VFS, "%s: vp %p", __func__, vp); mtx_lock(&vnode_free_list_mtx); TAILQ_REMOVE(&vnode_free_list, vp, v_freelist); freevnodes--; vp->v_iflag &= ~(VI_FREE|VI_AGE); mtx_unlock(&vnode_free_list_mtx); } static void destroy_vpollinfo(struct vpollinfo *vi) { knlist_destroy(&vi->vpi_selinfo.si_note); mtx_destroy(&vi->vpi_lock); uma_zfree(vnodepoll_zone, vi); } /* * Initalize per-vnode helper structure to hold poll-related state. */ void v_addpollinfo(struct vnode *vp) { struct vpollinfo *vi; if (vp->v_pollinfo != NULL) return; vi = uma_zalloc(vnodepoll_zone, M_WAITOK); mtx_init(&vi->vpi_lock, "vnode pollinfo", NULL, MTX_DEF); knlist_init(&vi->vpi_selinfo.si_note, vp, vfs_knllock, vfs_knlunlock, vfs_knllocked); VI_LOCK(vp); if (vp->v_pollinfo != NULL) { VI_UNLOCK(vp); destroy_vpollinfo(vi); return; } vp->v_pollinfo = vi; VI_UNLOCK(vp); } /* * Record a process's interest in events which might happen to * a vnode. 
Because poll uses the historic select-style interface * internally, this routine serves as both the ``check for any * pending events'' and the ``record my interest in future events'' * functions. (These are done together, while the lock is held, * to avoid race conditions.) */ int vn_pollrecord(struct vnode *vp, struct thread *td, int events) { v_addpollinfo(vp); mtx_lock(&vp->v_pollinfo->vpi_lock); if (vp->v_pollinfo->vpi_revents & events) { /* * This leaves events we are not interested * in available for the other process which * which presumably had requested them * (otherwise they would never have been * recorded). */ events &= vp->v_pollinfo->vpi_revents; vp->v_pollinfo->vpi_revents &= ~events; mtx_unlock(&vp->v_pollinfo->vpi_lock); return (events); } vp->v_pollinfo->vpi_events |= events; selrecord(td, &vp->v_pollinfo->vpi_selinfo); mtx_unlock(&vp->v_pollinfo->vpi_lock); return (0); } /* * Routine to create and manage a filesystem syncer vnode. */ #define sync_close ((int (*)(struct vop_close_args *))nullop) static int sync_fsync(struct vop_fsync_args *); static int sync_inactive(struct vop_inactive_args *); static int sync_reclaim(struct vop_reclaim_args *); static struct vop_vector sync_vnodeops = { .vop_bypass = VOP_EOPNOTSUPP, .vop_close = sync_close, /* close */ .vop_fsync = sync_fsync, /* fsync */ .vop_inactive = sync_inactive, /* inactive */ .vop_reclaim = sync_reclaim, /* reclaim */ .vop_lock1 = vop_stdlock, /* lock */ .vop_unlock = vop_stdunlock, /* unlock */ .vop_islocked = vop_stdislocked, /* islocked */ }; /* * Create a new filesystem syncer vnode for the specified mount point. */ int vfs_allocate_syncvnode(struct mount *mp) { struct vnode *vp; struct bufobj *bo; static long start, incr, next; int error; /* Allocate a new vnode */ if ((error = getnewvnode("syncer", mp, &sync_vnodeops, &vp)) != 0) { mp->mnt_syncer = NULL; return (error); } vp->v_type = VNON; vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); vp->v_vflag |= VV_FORCEINSMQ; error = insmntque(vp, mp); if (error != 0) panic("vfs_allocate_syncvnode: insmntque failed"); vp->v_vflag &= ~VV_FORCEINSMQ; VOP_UNLOCK(vp, 0); /* * Place the vnode onto the syncer worklist. We attempt to * scatter them about on the list so that they will go off * at evenly distributed times even if all the filesystems * are mounted at once. */ next += incr; if (next == 0 || next > syncer_maxdelay) { start /= 2; incr /= 2; if (start == 0) { start = syncer_maxdelay / 2; incr = syncer_maxdelay; } next = start; } bo = &vp->v_bufobj; BO_LOCK(bo); vn_syncer_add_to_worklist(bo, syncdelay > 0 ? next % syncdelay : 0); /* XXX - vn_syncer_add_to_worklist() also grabs and drops sync_mtx. */ mtx_lock(&sync_mtx); sync_vnode_count++; mtx_unlock(&sync_mtx); BO_UNLOCK(bo); mp->mnt_syncer = vp; return (0); } /* * Do a lazy sync of the filesystem. */ static int sync_fsync(struct vop_fsync_args *ap) { struct vnode *syncvp = ap->a_vp; struct mount *mp = syncvp->v_mount; int error; struct bufobj *bo; /* * We only need to do something if this is a lazy evaluation. */ if (ap->a_waitfor != MNT_LAZY) return (0); /* * Move ourselves to the back of the sync list. */ bo = &syncvp->v_bufobj; BO_LOCK(bo); vn_syncer_add_to_worklist(bo, syncdelay); BO_UNLOCK(bo); /* * Walk the list of vnodes pushing all that are dirty and * not already on the sync list. 
*/ mtx_lock(&mountlist_mtx); if (vfs_busy(mp, MBF_NOWAIT | MBF_MNTLSTLOCK) != 0) { mtx_unlock(&mountlist_mtx); return (0); } if (vn_start_write(NULL, &mp, V_NOWAIT) != 0) { vfs_unbusy(mp); return (0); } MNT_ILOCK(mp); mp->mnt_noasync++; mp->mnt_kern_flag &= ~MNTK_ASYNC; MNT_IUNLOCK(mp); vfs_msync(mp, MNT_NOWAIT); error = VFS_SYNC(mp, MNT_LAZY); MNT_ILOCK(mp); mp->mnt_noasync--; if ((mp->mnt_flag & MNT_ASYNC) != 0 && mp->mnt_noasync == 0) mp->mnt_kern_flag |= MNTK_ASYNC; MNT_IUNLOCK(mp); vn_finished_write(mp); vfs_unbusy(mp); return (error); } /* * The syncer vnode is no referenced. */ static int sync_inactive(struct vop_inactive_args *ap) { vgone(ap->a_vp); return (0); } /* * The syncer vnode is no longer needed and is being decommissioned. * * Modifications to the worklist must be protected by sync_mtx. */ static int sync_reclaim(struct vop_reclaim_args *ap) { struct vnode *vp = ap->a_vp; struct bufobj *bo; bo = &vp->v_bufobj; BO_LOCK(bo); vp->v_mount->mnt_syncer = NULL; if (bo->bo_flag & BO_ONWORKLST) { mtx_lock(&sync_mtx); LIST_REMOVE(bo, bo_synclist); syncer_worklist_len--; sync_vnode_count--; mtx_unlock(&sync_mtx); bo->bo_flag &= ~BO_ONWORKLST; } BO_UNLOCK(bo); return (0); } /* * Check if vnode represents a disk device */ int vn_isdisk(struct vnode *vp, int *errp) { int error; error = 0; dev_lock(); if (vp->v_type != VCHR) error = ENOTBLK; else if (vp->v_rdev == NULL) error = ENXIO; else if (vp->v_rdev->si_devsw == NULL) error = ENXIO; else if (!(vp->v_rdev->si_devsw->d_flags & D_DISK)) error = ENOTBLK; dev_unlock(); if (errp != NULL) *errp = error; return (error == 0); } /* * Common filesystem object access control check routine. Accepts a * vnode's type, "mode", uid and gid, requested access mode, credentials, * and optional call-by-reference privused argument allowing vaccess() * to indicate to the caller whether privilege was used to satisfy the * request (obsoleted). Returns 0 on success, or an errno on failure. * * The ifdef'd CAPABILITIES version is here for reference, but is not * actually used. */ int vaccess(enum vtype type, mode_t file_mode, uid_t file_uid, gid_t file_gid, accmode_t accmode, struct ucred *cred, int *privused) { accmode_t dac_granted; accmode_t priv_granted; /* * Look for a normal, non-privileged way to access the file/directory * as requested. If it exists, go with that. */ if (privused != NULL) *privused = 0; dac_granted = 0; /* Check the owner. */ if (cred->cr_uid == file_uid) { dac_granted |= VADMIN; if (file_mode & S_IXUSR) dac_granted |= VEXEC; if (file_mode & S_IRUSR) dac_granted |= VREAD; if (file_mode & S_IWUSR) dac_granted |= (VWRITE | VAPPEND); if ((accmode & dac_granted) == accmode) return (0); goto privcheck; } /* Otherwise, check the groups (first match) */ if (groupmember(file_gid, cred)) { if (file_mode & S_IXGRP) dac_granted |= VEXEC; if (file_mode & S_IRGRP) dac_granted |= VREAD; if (file_mode & S_IWGRP) dac_granted |= (VWRITE | VAPPEND); if ((accmode & dac_granted) == accmode) return (0); goto privcheck; } /* Otherwise, check everyone else. */ if (file_mode & S_IXOTH) dac_granted |= VEXEC; if (file_mode & S_IROTH) dac_granted |= VREAD; if (file_mode & S_IWOTH) dac_granted |= (VWRITE | VAPPEND); if ((accmode & dac_granted) == accmode) return (0); privcheck: /* * Build a privilege mask to determine if the set of privileges * satisfies the requirements when combined with the granted mask * from above. For each privilege, if the privilege is required, * bitwise or the request type onto the priv_granted mask. 
*/ priv_granted = 0; if (type == VDIR) { /* * For directories, use PRIV_VFS_LOOKUP to satisfy VEXEC * requests, instead of PRIV_VFS_EXEC. */ if ((accmode & VEXEC) && ((dac_granted & VEXEC) == 0) && !priv_check_cred(cred, PRIV_VFS_LOOKUP, 0)) priv_granted |= VEXEC; } else { if ((accmode & VEXEC) && ((dac_granted & VEXEC) == 0) && !priv_check_cred(cred, PRIV_VFS_EXEC, 0)) priv_granted |= VEXEC; } if ((accmode & VREAD) && ((dac_granted & VREAD) == 0) && !priv_check_cred(cred, PRIV_VFS_READ, 0)) priv_granted |= VREAD; if ((accmode & VWRITE) && ((dac_granted & VWRITE) == 0) && !priv_check_cred(cred, PRIV_VFS_WRITE, 0)) priv_granted |= (VWRITE | VAPPEND); if ((accmode & VADMIN) && ((dac_granted & VADMIN) == 0) && !priv_check_cred(cred, PRIV_VFS_ADMIN, 0)) priv_granted |= VADMIN; if ((accmode & (priv_granted | dac_granted)) == accmode) { /* XXX audit: privilege used */ if (privused != NULL) *privused = 1; return (0); } return ((accmode & VADMIN) ? EPERM : EACCES); } /* * Credential check based on process requesting service, and per-attribute * permissions. */ int extattr_check_cred(struct vnode *vp, int attrnamespace, struct ucred *cred, struct thread *td, accmode_t accmode) { /* * Kernel-invoked always succeeds. */ if (cred == NOCRED) return (0); /* * Do not allow privileged processes in jail to directly manipulate * system attributes. */ switch (attrnamespace) { case EXTATTR_NAMESPACE_SYSTEM: /* Potentially should be: return (EPERM); */ return (priv_check_cred(cred, PRIV_VFS_EXTATTR_SYSTEM, 0)); case EXTATTR_NAMESPACE_USER: return (VOP_ACCESS(vp, accmode, cred, td)); default: return (EPERM); } } #ifdef DEBUG_VFS_LOCKS /* * This only exists to supress warnings from unlocked specfs accesses. It is * no longer ok to have an unlocked VFS. */ #define IGNORE_LOCK(vp) (panicstr != NULL || (vp) == NULL || \ (vp)->v_type == VCHR || (vp)->v_type == VBAD) int vfs_badlock_ddb = 1; /* Drop into debugger on violation. */ SYSCTL_INT(_debug, OID_AUTO, vfs_badlock_ddb, CTLFLAG_RW, &vfs_badlock_ddb, 0, ""); int vfs_badlock_mutex = 1; /* Check for interlock across VOPs. */ SYSCTL_INT(_debug, OID_AUTO, vfs_badlock_mutex, CTLFLAG_RW, &vfs_badlock_mutex, 0, ""); int vfs_badlock_print = 1; /* Print lock violations. */ SYSCTL_INT(_debug, OID_AUTO, vfs_badlock_print, CTLFLAG_RW, &vfs_badlock_print, 0, ""); #ifdef KDB int vfs_badlock_backtrace = 1; /* Print backtrace at lock violations. 
*/ SYSCTL_INT(_debug, OID_AUTO, vfs_badlock_backtrace, CTLFLAG_RW, &vfs_badlock_backtrace, 0, ""); #endif static void vfs_badlock(const char *msg, const char *str, struct vnode *vp) { #ifdef KDB if (vfs_badlock_backtrace) kdb_backtrace(); #endif if (vfs_badlock_print) printf("%s: %p %s\n", str, (void *)vp, msg); if (vfs_badlock_ddb) kdb_enter(KDB_WHY_VFSLOCK, "lock violation"); } void assert_vi_locked(struct vnode *vp, const char *str) { if (vfs_badlock_mutex && !mtx_owned(VI_MTX(vp))) vfs_badlock("interlock is not locked but should be", str, vp); } void assert_vi_unlocked(struct vnode *vp, const char *str) { if (vfs_badlock_mutex && mtx_owned(VI_MTX(vp))) vfs_badlock("interlock is locked but should not be", str, vp); } void assert_vop_locked(struct vnode *vp, const char *str) { if (!IGNORE_LOCK(vp) && VOP_ISLOCKED(vp) == 0) vfs_badlock("is not locked but should be", str, vp); } void assert_vop_unlocked(struct vnode *vp, const char *str) { if (!IGNORE_LOCK(vp) && VOP_ISLOCKED(vp) == LK_EXCLUSIVE) vfs_badlock("is locked but should not be", str, vp); } void assert_vop_elocked(struct vnode *vp, const char *str) { if (!IGNORE_LOCK(vp) && VOP_ISLOCKED(vp) != LK_EXCLUSIVE) vfs_badlock("is not exclusive locked but should be", str, vp); } #if 0 void assert_vop_elocked_other(struct vnode *vp, const char *str) { if (!IGNORE_LOCK(vp) && VOP_ISLOCKED(vp) != LK_EXCLOTHER) vfs_badlock("is not exclusive locked by another thread", str, vp); } void assert_vop_slocked(struct vnode *vp, const char *str) { if (!IGNORE_LOCK(vp) && VOP_ISLOCKED(vp) != LK_SHARED) vfs_badlock("is not locked shared but should be", str, vp); } #endif /* 0 */ #endif /* DEBUG_VFS_LOCKS */ void vop_rename_pre(void *ap) { struct vop_rename_args *a = ap; #ifdef DEBUG_VFS_LOCKS if (a->a_tvp) ASSERT_VI_UNLOCKED(a->a_tvp, "VOP_RENAME"); ASSERT_VI_UNLOCKED(a->a_tdvp, "VOP_RENAME"); ASSERT_VI_UNLOCKED(a->a_fvp, "VOP_RENAME"); ASSERT_VI_UNLOCKED(a->a_fdvp, "VOP_RENAME"); /* Check the source (from). */ if (a->a_tdvp != a->a_fdvp && a->a_tvp != a->a_fdvp) ASSERT_VOP_UNLOCKED(a->a_fdvp, "vop_rename: fdvp locked"); if (a->a_tvp != a->a_fvp) ASSERT_VOP_UNLOCKED(a->a_fvp, "vop_rename: fvp locked"); /* Check the target. */ if (a->a_tvp) ASSERT_VOP_LOCKED(a->a_tvp, "vop_rename: tvp not locked"); ASSERT_VOP_LOCKED(a->a_tdvp, "vop_rename: tdvp not locked"); #endif if (a->a_tdvp != a->a_fdvp) vhold(a->a_fdvp); if (a->a_tvp != a->a_fvp) vhold(a->a_fvp); vhold(a->a_tdvp); if (a->a_tvp) vhold(a->a_tvp); } void vop_strategy_pre(void *ap) { #ifdef DEBUG_VFS_LOCKS struct vop_strategy_args *a; struct buf *bp; a = ap; bp = a->a_bp; /* * Cluster ops lock their component buffers but not the IO container. 
*/ if ((bp->b_flags & B_CLUSTER) != 0) return; if (!BUF_ISLOCKED(bp)) { if (vfs_badlock_print) printf( "VOP_STRATEGY: bp is not locked but should be\n"); if (vfs_badlock_ddb) kdb_enter(KDB_WHY_VFSLOCK, "lock violation"); } #endif } void vop_lookup_pre(void *ap) { #ifdef DEBUG_VFS_LOCKS struct vop_lookup_args *a; struct vnode *dvp; a = ap; dvp = a->a_dvp; ASSERT_VI_UNLOCKED(dvp, "VOP_LOOKUP"); ASSERT_VOP_LOCKED(dvp, "VOP_LOOKUP"); #endif } void vop_lookup_post(void *ap, int rc) { #ifdef DEBUG_VFS_LOCKS struct vop_lookup_args *a; struct vnode *dvp; struct vnode *vp; a = ap; dvp = a->a_dvp; vp = *(a->a_vpp); ASSERT_VI_UNLOCKED(dvp, "VOP_LOOKUP"); ASSERT_VOP_LOCKED(dvp, "VOP_LOOKUP"); if (!rc) ASSERT_VOP_LOCKED(vp, "VOP_LOOKUP (child)"); #endif } void vop_lock_pre(void *ap) { #ifdef DEBUG_VFS_LOCKS struct vop_lock1_args *a = ap; if ((a->a_flags & LK_INTERLOCK) == 0) ASSERT_VI_UNLOCKED(a->a_vp, "VOP_LOCK"); else ASSERT_VI_LOCKED(a->a_vp, "VOP_LOCK"); #endif } void vop_lock_post(void *ap, int rc) { #ifdef DEBUG_VFS_LOCKS struct vop_lock1_args *a = ap; ASSERT_VI_UNLOCKED(a->a_vp, "VOP_LOCK"); if (rc == 0) ASSERT_VOP_LOCKED(a->a_vp, "VOP_LOCK"); #endif } void vop_unlock_pre(void *ap) { #ifdef DEBUG_VFS_LOCKS struct vop_unlock_args *a = ap; if (a->a_flags & LK_INTERLOCK) ASSERT_VI_LOCKED(a->a_vp, "VOP_UNLOCK"); ASSERT_VOP_LOCKED(a->a_vp, "VOP_UNLOCK"); #endif } void vop_unlock_post(void *ap, int rc) { #ifdef DEBUG_VFS_LOCKS struct vop_unlock_args *a = ap; if (a->a_flags & LK_INTERLOCK) ASSERT_VI_UNLOCKED(a->a_vp, "VOP_UNLOCK"); #endif } void vop_create_post(void *ap, int rc) { struct vop_create_args *a = ap; if (!rc) VFS_KNOTE_LOCKED(a->a_dvp, NOTE_WRITE); } void vop_link_post(void *ap, int rc) { struct vop_link_args *a = ap; if (!rc) { VFS_KNOTE_LOCKED(a->a_vp, NOTE_LINK); VFS_KNOTE_LOCKED(a->a_tdvp, NOTE_WRITE); } } void vop_mkdir_post(void *ap, int rc) { struct vop_mkdir_args *a = ap; if (!rc) VFS_KNOTE_LOCKED(a->a_dvp, NOTE_WRITE | NOTE_LINK); } void vop_mknod_post(void *ap, int rc) { struct vop_mknod_args *a = ap; if (!rc) VFS_KNOTE_LOCKED(a->a_dvp, NOTE_WRITE); } void vop_remove_post(void *ap, int rc) { struct vop_remove_args *a = ap; if (!rc) { VFS_KNOTE_LOCKED(a->a_dvp, NOTE_WRITE); VFS_KNOTE_LOCKED(a->a_vp, NOTE_DELETE); } } void vop_rename_post(void *ap, int rc) { struct vop_rename_args *a = ap; if (!rc) { VFS_KNOTE_UNLOCKED(a->a_fdvp, NOTE_WRITE); VFS_KNOTE_UNLOCKED(a->a_tdvp, NOTE_WRITE); VFS_KNOTE_UNLOCKED(a->a_fvp, NOTE_RENAME); if (a->a_tvp) VFS_KNOTE_UNLOCKED(a->a_tvp, NOTE_DELETE); } if (a->a_tdvp != a->a_fdvp) vdrop(a->a_fdvp); if (a->a_tvp != a->a_fvp) vdrop(a->a_fvp); vdrop(a->a_tdvp); if (a->a_tvp) vdrop(a->a_tvp); } void vop_rmdir_post(void *ap, int rc) { struct vop_rmdir_args *a = ap; if (!rc) { VFS_KNOTE_LOCKED(a->a_dvp, NOTE_WRITE | NOTE_LINK); VFS_KNOTE_LOCKED(a->a_vp, NOTE_DELETE); } } void vop_setattr_post(void *ap, int rc) { struct vop_setattr_args *a = ap; if (!rc) VFS_KNOTE_LOCKED(a->a_vp, NOTE_ATTRIB); } void vop_symlink_post(void *ap, int rc) { struct vop_symlink_args *a = ap; if (!rc) VFS_KNOTE_LOCKED(a->a_dvp, NOTE_WRITE); } static struct knlist fs_knlist; static void vfs_event_init(void *arg) { knlist_init(&fs_knlist, NULL, NULL, NULL, NULL); } /* XXX - correct order? 
*/ SYSINIT(vfs_knlist, SI_SUB_VFS, SI_ORDER_ANY, vfs_event_init, NULL); void vfs_event_signal(fsid_t *fsid, u_int32_t event, intptr_t data __unused) { KNOTE_UNLOCKED(&fs_knlist, event); } static int filt_fsattach(struct knote *kn); static void filt_fsdetach(struct knote *kn); static int filt_fsevent(struct knote *kn, long hint); struct filterops fs_filtops = { 0, filt_fsattach, filt_fsdetach, filt_fsevent }; static int filt_fsattach(struct knote *kn) { kn->kn_flags |= EV_CLEAR; knlist_add(&fs_knlist, kn, 0); return (0); } static void filt_fsdetach(struct knote *kn) { knlist_remove(&fs_knlist, kn, 0); } static int filt_fsevent(struct knote *kn, long hint) { kn->kn_fflags |= hint; return (kn->kn_fflags != 0); } static int sysctl_vfs_ctl(SYSCTL_HANDLER_ARGS) { struct vfsidctl vc; int error; struct mount *mp; error = SYSCTL_IN(req, &vc, sizeof(vc)); if (error) return (error); if (vc.vc_vers != VFS_CTL_VERS1) return (EINVAL); mp = vfs_getvfs(&vc.vc_fsid); if (mp == NULL) return (ENOENT); /* ensure that a specific sysctl goes to the right filesystem. */ if (strcmp(vc.vc_fstypename, "*") != 0 && strcmp(vc.vc_fstypename, mp->mnt_vfc->vfc_name) != 0) { vfs_rel(mp); return (EINVAL); } VCTLTOREQ(&vc, req); error = VFS_SYSCTL(mp, vc.vc_op, req); vfs_rel(mp); return (error); } SYSCTL_PROC(_vfs, OID_AUTO, ctl, CTLFLAG_WR, NULL, 0, sysctl_vfs_ctl, "", "Sysctl by fsid"); /* * Function to initialize a va_filerev field sensibly. * XXX: Wouldn't a random number make a lot more sense ?? */ u_quad_t init_va_filerev(void) { struct bintime bt; getbinuptime(&bt); return (((u_quad_t)bt.sec << 32LL) | (bt.frac >> 32LL)); } static int filt_vfsread(struct knote *kn, long hint); static int filt_vfswrite(struct knote *kn, long hint); static int filt_vfsvnode(struct knote *kn, long hint); static void filt_vfsdetach(struct knote *kn); static struct filterops vfsread_filtops = { 1, NULL, filt_vfsdetach, filt_vfsread }; static struct filterops vfswrite_filtops = { 1, NULL, filt_vfsdetach, filt_vfswrite }; static struct filterops vfsvnode_filtops = { 1, NULL, filt_vfsdetach, filt_vfsvnode }; static void vfs_knllock(void *arg) { struct vnode *vp = arg; vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); } static void vfs_knlunlock(void *arg) { struct vnode *vp = arg; VOP_UNLOCK(vp, 0); } static int vfs_knllocked(void *arg) { struct vnode *vp = arg; return (VOP_ISLOCKED(vp) == LK_EXCLUSIVE); } int vfs_kqfilter(struct vop_kqfilter_args *ap) { struct vnode *vp = ap->a_vp; struct knote *kn = ap->a_kn; struct knlist *knl; switch (kn->kn_filter) { case EVFILT_READ: kn->kn_fop = &vfsread_filtops; break; case EVFILT_WRITE: kn->kn_fop = &vfswrite_filtops; break; case EVFILT_VNODE: kn->kn_fop = &vfsvnode_filtops; break; default: return (EINVAL); } kn->kn_hook = (caddr_t)vp; v_addpollinfo(vp); if (vp->v_pollinfo == NULL) return (ENOMEM); knl = &vp->v_pollinfo->vpi_selinfo.si_note; knlist_add(knl, kn, 0); return (0); } /* * Detach knote from vnode */ static void filt_vfsdetach(struct knote *kn) { struct vnode *vp = (struct vnode *)kn->kn_hook; KASSERT(vp->v_pollinfo != NULL, ("Missing v_pollinfo")); knlist_remove(&vp->v_pollinfo->vpi_selinfo.si_note, kn, 0); } /*ARGSUSED*/ static int filt_vfsread(struct knote *kn, long hint) { struct vnode *vp = (struct vnode *)kn->kn_hook; struct vattr va; /* * filesystem is gone, so set the EOF flag and schedule * the knote for deletion. 
*/ if (hint == NOTE_REVOKE) { kn->kn_flags |= (EV_EOF | EV_ONESHOT); return (1); } if (VOP_GETATTR(vp, &va, curthread->td_ucred)) return (0); kn->kn_data = va.va_size - kn->kn_fp->f_offset; return (kn->kn_data != 0); } /*ARGSUSED*/ static int filt_vfswrite(struct knote *kn, long hint) { /* * filesystem is gone, so set the EOF flag and schedule * the knote for deletion. */ if (hint == NOTE_REVOKE) kn->kn_flags |= (EV_EOF | EV_ONESHOT); kn->kn_data = 0; return (1); } static int filt_vfsvnode(struct knote *kn, long hint) { if (kn->kn_sfflags & hint) kn->kn_fflags |= hint; if (hint == NOTE_REVOKE) { kn->kn_flags |= EV_EOF; return (1); } return (kn->kn_fflags != 0); } int vfs_read_dirent(struct vop_readdir_args *ap, struct dirent *dp, off_t off) { int error; if (dp->d_reclen > ap->a_uio->uio_resid) return (ENAMETOOLONG); error = uiomove(dp, dp->d_reclen, ap->a_uio); if (error) { if (ap->a_ncookies != NULL) { if (ap->a_cookies != NULL) free(ap->a_cookies, M_TEMP); ap->a_cookies = NULL; *ap->a_ncookies = 0; } return (error); } if (ap->a_ncookies == NULL) return (0); KASSERT(ap->a_cookies, ("NULL ap->a_cookies value with non-NULL ap->a_ncookies!")); *ap->a_cookies = realloc(*ap->a_cookies, (*ap->a_ncookies + 1) * sizeof(u_long), M_TEMP, M_WAITOK | M_ZERO); (*ap->a_cookies)[*ap->a_ncookies] = off; return (0); } /* * Mark for update the access time of the file if the filesystem * supports VOP_MARKATIME. This functionality is used by execve and * mmap, so we want to avoid the I/O implied by directly setting * va_atime for the sake of efficiency. */ void vfs_mark_atime(struct vnode *vp, struct ucred *cred) { if ((vp->v_mount->mnt_flag & (MNT_NOATIME | MNT_RDONLY)) == 0) (void)VOP_MARKATIME(vp); } Index: head/sys/kern/vfs_syscalls.c =================================================================== --- head/sys/kern/vfs_syscalls.c (revision 192894) +++ head/sys/kern/vfs_syscalls.c (revision 192895) @@ -1,4642 +1,4636 @@ /*- * Copyright (c) 1989, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)vfs_syscalls.c 8.13 (Berkeley) 4/15/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_compat.h" #include "opt_kdtrace.h" #include "opt_ktrace.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef KTRACE #include #endif #include #include #include #include #include #include #include SDT_PROVIDER_DEFINE(vfs); SDT_PROBE_DEFINE(vfs, , stat, mode); SDT_PROBE_ARGTYPE(vfs, , stat, mode, 0, "char *"); SDT_PROBE_ARGTYPE(vfs, , stat, mode, 1, "int"); SDT_PROBE_DEFINE(vfs, , stat, reg); SDT_PROBE_ARGTYPE(vfs, , stat, reg, 0, "char *"); SDT_PROBE_ARGTYPE(vfs, , stat, reg, 1, "int"); static int chroot_refuse_vdir_fds(struct filedesc *fdp); static int getutimes(const struct timeval *, enum uio_seg, struct timespec *); static int setfown(struct thread *td, struct vnode *, uid_t, gid_t); static int setfmode(struct thread *td, struct vnode *, int); static int setfflags(struct thread *td, struct vnode *, int); static int setutimes(struct thread *td, struct vnode *, const struct timespec *, int, int); static int vn_access(struct vnode *vp, int user_flags, struct ucred *cred, struct thread *td); /* * The module initialization routine for POSIX asynchronous I/O will * set this to the version of AIO that it implements. (Zero means * that it is not implemented.) This value is used here by pathconf() * and in kern_descrip.c by fpathconf(). */ int async_io_version; #ifdef DEBUG static int syncprt = 0; SYSCTL_INT(_debug, OID_AUTO, syncprt, CTLFLAG_RW, &syncprt, 0, ""); #endif /* * Sync each mounted filesystem. */ #ifndef _SYS_SYSPROTO_H_ struct sync_args { int dummy; }; #endif /* ARGSUSED */ int sync(td, uap) struct thread *td; struct sync_args *uap; { struct mount *mp, *nmp; int vfslocked; mtx_lock(&mountlist_mtx); for (mp = TAILQ_FIRST(&mountlist); mp != NULL; mp = nmp) { if (vfs_busy(mp, MBF_NOWAIT | MBF_MNTLSTLOCK)) { nmp = TAILQ_NEXT(mp, mnt_list); continue; } vfslocked = VFS_LOCK_GIANT(mp); if ((mp->mnt_flag & MNT_RDONLY) == 0 && vn_start_write(NULL, &mp, V_NOWAIT) == 0) { MNT_ILOCK(mp); mp->mnt_noasync++; mp->mnt_kern_flag &= ~MNTK_ASYNC; MNT_IUNLOCK(mp); vfs_msync(mp, MNT_NOWAIT); VFS_SYNC(mp, MNT_NOWAIT); MNT_ILOCK(mp); mp->mnt_noasync--; if ((mp->mnt_flag & MNT_ASYNC) != 0 && mp->mnt_noasync == 0) mp->mnt_kern_flag |= MNTK_ASYNC; MNT_IUNLOCK(mp); vn_finished_write(mp); } VFS_UNLOCK_GIANT(vfslocked); mtx_lock(&mountlist_mtx); nmp = TAILQ_NEXT(mp, mnt_list); vfs_unbusy(mp); } mtx_unlock(&mountlist_mtx); return (0); } -/* XXX PRISON: could be per prison flag */ -static int prison_quotas; -#if 0 -SYSCTL_INT(_kern_prison, OID_AUTO, quotas, CTLFLAG_RW, &prison_quotas, 0, ""); -#endif - /* * Change filesystem quotas. 
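 *
 * For example, a hedged userland sketch (assuming UFS quotas and the
 * struct dqblk / QCMD definitions from ufs/ufs/quota.h):
 *
 *      struct dqblk dq;
 *      if (quotactl("/home", QCMD(Q_GETQUOTA, USRQUOTA), uid,
 *          (void *)&dq) != 0)
 *              err(1, "quotactl");
 *
 * With the change below, a jailed caller is refused with EPERM unless
 * its jail grants the PR_ALLOW_QUOTAS permission.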
*/ #ifndef _SYS_SYSPROTO_H_ struct quotactl_args { char *path; int cmd; int uid; caddr_t arg; }; #endif int quotactl(td, uap) struct thread *td; register struct quotactl_args /* { char *path; int cmd; int uid; caddr_t arg; } */ *uap; { struct mount *mp; int vfslocked; int error; struct nameidata nd; AUDIT_ARG(cmd, uap->cmd); AUDIT_ARG(uid, uap->uid); - if (jailed(td->td_ucred) && !prison_quotas) + if (!prison_allow(td->td_ucred, PR_ALLOW_QUOTAS)) return (EPERM); NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | MPSAFE | AUDITVNODE1, UIO_USERSPACE, uap->path, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); mp = nd.ni_vp->v_mount; vfs_ref(mp); vput(nd.ni_vp); error = vfs_busy(mp, 0); vfs_rel(mp); if (error) { VFS_UNLOCK_GIANT(vfslocked); return (error); } error = VFS_QUOTACTL(mp, uap->cmd, uap->uid, uap->arg); vfs_unbusy(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Used by statfs conversion routines to scale the block size up if * necessary so that all of the block counts are <= 'max_size'. Note * that 'max_size' should be a bitmask, i.e. 2^n - 1 for some non-zero * value of 'n'. */ void statfs_scale_blocks(struct statfs *sf, long max_size) { uint64_t count; int shift; KASSERT(powerof2(max_size + 1), ("%s: invalid max_size", __func__)); /* * Attempt to scale the block counts to give a more accurate * overview to userland of the ratio of free space to used * space. To do this, find the largest block count and compute * a divisor that lets it fit into a signed integer <= max_size. */ if (sf->f_bavail < 0) count = -sf->f_bavail; else count = sf->f_bavail; count = MAX(sf->f_blocks, MAX(sf->f_bfree, count)); if (count <= max_size) return; count >>= flsl(max_size); shift = 0; while (count > 0) { shift++; count >>=1; } sf->f_bsize <<= shift; sf->f_blocks >>= shift; sf->f_bfree >>= shift; sf->f_bavail >>= shift; } /* * Get filesystem statistics. */ #ifndef _SYS_SYSPROTO_H_ struct statfs_args { char *path; struct statfs *buf; }; #endif int statfs(td, uap) struct thread *td; register struct statfs_args /* { char *path; struct statfs *buf; } */ *uap; { struct statfs sf; int error; error = kern_statfs(td, uap->path, UIO_USERSPACE, &sf); if (error == 0) error = copyout(&sf, uap->buf, sizeof(sf)); return (error); } int kern_statfs(struct thread *td, char *path, enum uio_seg pathseg, struct statfs *buf) { struct mount *mp; struct statfs *sp, sb; int vfslocked; int error; struct nameidata nd; NDINIT(&nd, LOOKUP, FOLLOW | LOCKSHARED | LOCKLEAF | MPSAFE | AUDITVNODE1, pathseg, path, td); error = namei(&nd); if (error) return (error); vfslocked = NDHASGIANT(&nd); mp = nd.ni_vp->v_mount; vfs_ref(mp); NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_vp); error = vfs_busy(mp, 0); vfs_rel(mp); if (error) { VFS_UNLOCK_GIANT(vfslocked); return (error); } #ifdef MAC error = mac_mount_check_stat(td->td_ucred, mp); if (error) goto out; #endif /* * Set these in case the underlying filesystem fails to do so. */ sp = &mp->mnt_stat; sp->f_version = STATFS_VERSION; sp->f_namemax = NAME_MAX; sp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK; error = VFS_STATFS(mp, sp); if (error) goto out; if (priv_check(td, PRIV_VFS_GENERATION)) { bcopy(sp, &sb, sizeof(sb)); sb.f_fsid.val[0] = sb.f_fsid.val[1] = 0; prison_enforce_statfs(td->td_ucred, mp, &sb); sp = &sb; } *buf = *sp; out: vfs_unbusy(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Get filesystem statistics. 
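 *
 * For example, a userland sketch (illustrative; fd is any open
 * descriptor on the filesystem of interest):
 *
 *      struct statfs sf;
 *      if (fstatfs(fd, &sf) == 0)
 *              printf("%s on %s, %jd blocks free\n", sf.f_mntfromname,
 *                  sf.f_mntonname, (intmax_t)sf.f_bfree);
 *
 * Note that for callers without PRIV_VFS_GENERATION the fsid is zeroed
 * and prison_enforce_statfs() may rewrite the mount information as
 * seen from the caller's jail.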
*/ #ifndef _SYS_SYSPROTO_H_ struct fstatfs_args { int fd; struct statfs *buf; }; #endif int fstatfs(td, uap) struct thread *td; register struct fstatfs_args /* { int fd; struct statfs *buf; } */ *uap; { struct statfs sf; int error; error = kern_fstatfs(td, uap->fd, &sf); if (error == 0) error = copyout(&sf, uap->buf, sizeof(sf)); return (error); } int kern_fstatfs(struct thread *td, int fd, struct statfs *buf) { struct file *fp; struct mount *mp; struct statfs *sp, sb; int vfslocked; struct vnode *vp; int error; AUDIT_ARG(fd, fd); error = getvnode(td->td_proc->p_fd, fd, &fp); if (error) return (error); vp = fp->f_vnode; vfslocked = VFS_LOCK_GIANT(vp->v_mount); vn_lock(vp, LK_SHARED | LK_RETRY); #ifdef AUDIT AUDIT_ARG(vnode, vp, ARG_VNODE1); #endif mp = vp->v_mount; if (mp) vfs_ref(mp); VOP_UNLOCK(vp, 0); fdrop(fp, td); if (mp == NULL) { error = EBADF; goto out; } error = vfs_busy(mp, 0); vfs_rel(mp); if (error) { VFS_UNLOCK_GIANT(vfslocked); return (error); } #ifdef MAC error = mac_mount_check_stat(td->td_ucred, mp); if (error) goto out; #endif /* * Set these in case the underlying filesystem fails to do so. */ sp = &mp->mnt_stat; sp->f_version = STATFS_VERSION; sp->f_namemax = NAME_MAX; sp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK; error = VFS_STATFS(mp, sp); if (error) goto out; if (priv_check(td, PRIV_VFS_GENERATION)) { bcopy(sp, &sb, sizeof(sb)); sb.f_fsid.val[0] = sb.f_fsid.val[1] = 0; prison_enforce_statfs(td->td_ucred, mp, &sb); sp = &sb; } *buf = *sp; out: if (mp) vfs_unbusy(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Get statistics on all filesystems. */ #ifndef _SYS_SYSPROTO_H_ struct getfsstat_args { struct statfs *buf; long bufsize; int flags; }; #endif int getfsstat(td, uap) struct thread *td; register struct getfsstat_args /* { struct statfs *buf; long bufsize; int flags; } */ *uap; { return (kern_getfsstat(td, &uap->buf, uap->bufsize, UIO_USERSPACE, uap->flags)); } /* * If (bufsize > 0 && bufseg == UIO_SYSSPACE) * The caller is responsible for freeing memory which will be allocated * in '*buf'. */ int kern_getfsstat(struct thread *td, struct statfs **buf, size_t bufsize, enum uio_seg bufseg, int flags) { struct mount *mp, *nmp; struct statfs *sfsp, *sp, sb; size_t count, maxcount; int vfslocked; int error; maxcount = bufsize / sizeof(struct statfs); if (bufsize == 0) sfsp = NULL; else if (bufseg == UIO_USERSPACE) sfsp = *buf; else /* if (bufseg == UIO_SYSSPACE) */ { count = 0; mtx_lock(&mountlist_mtx); TAILQ_FOREACH(mp, &mountlist, mnt_list) { count++; } mtx_unlock(&mountlist_mtx); if (maxcount > count) maxcount = count; sfsp = *buf = malloc(maxcount * sizeof(struct statfs), M_TEMP, M_WAITOK); } count = 0; mtx_lock(&mountlist_mtx); for (mp = TAILQ_FIRST(&mountlist); mp != NULL; mp = nmp) { if (prison_canseemount(td->td_ucred, mp) != 0) { nmp = TAILQ_NEXT(mp, mnt_list); continue; } #ifdef MAC if (mac_mount_check_stat(td->td_ucred, mp) != 0) { nmp = TAILQ_NEXT(mp, mnt_list); continue; } #endif if (vfs_busy(mp, MBF_NOWAIT | MBF_MNTLSTLOCK)) { nmp = TAILQ_NEXT(mp, mnt_list); continue; } vfslocked = VFS_LOCK_GIANT(mp); if (sfsp && count < maxcount) { sp = &mp->mnt_stat; /* * Set these in case the underlying filesystem * fails to do so. */ sp->f_version = STATFS_VERSION; sp->f_namemax = NAME_MAX; sp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK; /* * If MNT_NOWAIT or MNT_LAZY is specified, do not * refresh the fsstat cache. MNT_NOWAIT or MNT_LAZY * overrides MNT_WAIT. 
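 *
 * For example, a userland sketch of the usual two-pass pattern
 * (illustrative only):
 *
 *      n = getfsstat(NULL, 0, MNT_NOWAIT);
 *      buf = calloc(n, sizeof(struct statfs));
 *      n = getfsstat(buf, n * sizeof(struct statfs), MNT_NOWAIT);
 *
 * The first call only returns the number of mounted filesystems; the
 * second fills the buffer.  MNT_NOWAIT avoids forcing a VFS_STATFS()
 * refresh of every mount, as described above.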
*/ if (((flags & (MNT_LAZY|MNT_NOWAIT)) == 0 || (flags & MNT_WAIT)) && (error = VFS_STATFS(mp, sp))) { VFS_UNLOCK_GIANT(vfslocked); mtx_lock(&mountlist_mtx); nmp = TAILQ_NEXT(mp, mnt_list); vfs_unbusy(mp); continue; } if (priv_check(td, PRIV_VFS_GENERATION)) { bcopy(sp, &sb, sizeof(sb)); sb.f_fsid.val[0] = sb.f_fsid.val[1] = 0; prison_enforce_statfs(td->td_ucred, mp, &sb); sp = &sb; } if (bufseg == UIO_SYSSPACE) bcopy(sp, sfsp, sizeof(*sp)); else /* if (bufseg == UIO_USERSPACE) */ { error = copyout(sp, sfsp, sizeof(*sp)); if (error) { vfs_unbusy(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } } sfsp++; } VFS_UNLOCK_GIANT(vfslocked); count++; mtx_lock(&mountlist_mtx); nmp = TAILQ_NEXT(mp, mnt_list); vfs_unbusy(mp); } mtx_unlock(&mountlist_mtx); if (sfsp && count > maxcount) td->td_retval[0] = maxcount; else td->td_retval[0] = count; return (0); } #ifdef COMPAT_FREEBSD4 /* * Get old format filesystem statistics. */ static void cvtstatfs(struct statfs *, struct ostatfs *); #ifndef _SYS_SYSPROTO_H_ struct freebsd4_statfs_args { char *path; struct ostatfs *buf; }; #endif int freebsd4_statfs(td, uap) struct thread *td; struct freebsd4_statfs_args /* { char *path; struct ostatfs *buf; } */ *uap; { struct ostatfs osb; struct statfs sf; int error; error = kern_statfs(td, uap->path, UIO_USERSPACE, &sf); if (error) return (error); cvtstatfs(&sf, &osb); return (copyout(&osb, uap->buf, sizeof(osb))); } /* * Get filesystem statistics. */ #ifndef _SYS_SYSPROTO_H_ struct freebsd4_fstatfs_args { int fd; struct ostatfs *buf; }; #endif int freebsd4_fstatfs(td, uap) struct thread *td; struct freebsd4_fstatfs_args /* { int fd; struct ostatfs *buf; } */ *uap; { struct ostatfs osb; struct statfs sf; int error; error = kern_fstatfs(td, uap->fd, &sf); if (error) return (error); cvtstatfs(&sf, &osb); return (copyout(&osb, uap->buf, sizeof(osb))); } /* * Get statistics on all filesystems. */ #ifndef _SYS_SYSPROTO_H_ struct freebsd4_getfsstat_args { struct ostatfs *buf; long bufsize; int flags; }; #endif int freebsd4_getfsstat(td, uap) struct thread *td; register struct freebsd4_getfsstat_args /* { struct ostatfs *buf; long bufsize; int flags; } */ *uap; { struct statfs *buf, *sp; struct ostatfs osb; size_t count, size; int error; count = uap->bufsize / sizeof(struct ostatfs); size = count * sizeof(struct statfs); error = kern_getfsstat(td, &buf, size, UIO_SYSSPACE, uap->flags); if (size > 0) { count = td->td_retval[0]; sp = buf; while (count > 0 && error == 0) { cvtstatfs(sp, &osb); error = copyout(&osb, uap->buf, sizeof(osb)); sp++; uap->buf++; count--; } free(buf, M_TEMP); } return (error); } /* * Implement fstatfs() for (NFS) file handles. */ #ifndef _SYS_SYSPROTO_H_ struct freebsd4_fhstatfs_args { struct fhandle *u_fhp; struct ostatfs *buf; }; #endif int freebsd4_fhstatfs(td, uap) struct thread *td; struct freebsd4_fhstatfs_args /* { struct fhandle *u_fhp; struct ostatfs *buf; } */ *uap; { struct ostatfs osb; struct statfs sf; fhandle_t fh; int error; error = copyin(uap->u_fhp, &fh, sizeof(fhandle_t)); if (error) return (error); error = kern_fhstatfs(td, fh, &sf); if (error) return (error); cvtstatfs(&sf, &osb); return (copyout(&osb, uap->buf, sizeof(osb))); } /* * Convert a new format statfs structure to an old format statfs structure. 
*/ static void cvtstatfs(nsp, osp) struct statfs *nsp; struct ostatfs *osp; { statfs_scale_blocks(nsp, LONG_MAX); bzero(osp, sizeof(*osp)); osp->f_bsize = nsp->f_bsize; osp->f_iosize = MIN(nsp->f_iosize, LONG_MAX); osp->f_blocks = nsp->f_blocks; osp->f_bfree = nsp->f_bfree; osp->f_bavail = nsp->f_bavail; osp->f_files = MIN(nsp->f_files, LONG_MAX); osp->f_ffree = MIN(nsp->f_ffree, LONG_MAX); osp->f_owner = nsp->f_owner; osp->f_type = nsp->f_type; osp->f_flags = nsp->f_flags; osp->f_syncwrites = MIN(nsp->f_syncwrites, LONG_MAX); osp->f_asyncwrites = MIN(nsp->f_asyncwrites, LONG_MAX); osp->f_syncreads = MIN(nsp->f_syncreads, LONG_MAX); osp->f_asyncreads = MIN(nsp->f_asyncreads, LONG_MAX); strlcpy(osp->f_fstypename, nsp->f_fstypename, MIN(MFSNAMELEN, OMFSNAMELEN)); strlcpy(osp->f_mntonname, nsp->f_mntonname, MIN(MNAMELEN, OMNAMELEN)); strlcpy(osp->f_mntfromname, nsp->f_mntfromname, MIN(MNAMELEN, OMNAMELEN)); osp->f_fsid = nsp->f_fsid; } #endif /* COMPAT_FREEBSD4 */ /* * Change current working directory to a given file descriptor. */ #ifndef _SYS_SYSPROTO_H_ struct fchdir_args { int fd; }; #endif int fchdir(td, uap) struct thread *td; struct fchdir_args /* { int fd; } */ *uap; { register struct filedesc *fdp = td->td_proc->p_fd; struct vnode *vp, *tdp, *vpold; struct mount *mp; struct file *fp; int vfslocked; int error; AUDIT_ARG(fd, uap->fd); if ((error = getvnode(fdp, uap->fd, &fp)) != 0) return (error); vp = fp->f_vnode; VREF(vp); fdrop(fp, td); vfslocked = VFS_LOCK_GIANT(vp->v_mount); vn_lock(vp, LK_SHARED | LK_RETRY); AUDIT_ARG(vnode, vp, ARG_VNODE1); error = change_dir(vp, td); while (!error && (mp = vp->v_mountedhere) != NULL) { int tvfslocked; if (vfs_busy(mp, 0)) continue; tvfslocked = VFS_LOCK_GIANT(mp); error = VFS_ROOT(mp, LK_SHARED, &tdp); vfs_unbusy(mp); if (error) { VFS_UNLOCK_GIANT(tvfslocked); break; } vput(vp); VFS_UNLOCK_GIANT(vfslocked); vp = tdp; vfslocked = tvfslocked; } if (error) { vput(vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } VOP_UNLOCK(vp, 0); VFS_UNLOCK_GIANT(vfslocked); FILEDESC_XLOCK(fdp); vpold = fdp->fd_cdir; fdp->fd_cdir = vp; FILEDESC_XUNLOCK(fdp); vfslocked = VFS_LOCK_GIANT(vpold->v_mount); vrele(vpold); VFS_UNLOCK_GIANT(vfslocked); return (0); } /* * Change current working directory (``.''). */ #ifndef _SYS_SYSPROTO_H_ struct chdir_args { char *path; }; #endif int chdir(td, uap) struct thread *td; struct chdir_args /* { char *path; } */ *uap; { return (kern_chdir(td, uap->path, UIO_USERSPACE)); } int kern_chdir(struct thread *td, char *path, enum uio_seg pathseg) { register struct filedesc *fdp = td->td_proc->p_fd; int error; struct nameidata nd; struct vnode *vp; int vfslocked; NDINIT(&nd, LOOKUP, FOLLOW | LOCKSHARED | LOCKLEAF | AUDITVNODE1 | MPSAFE, pathseg, path, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); if ((error = change_dir(nd.ni_vp, td)) != 0) { vput(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); NDFREE(&nd, NDF_ONLY_PNBUF); return (error); } VOP_UNLOCK(nd.ni_vp, 0); VFS_UNLOCK_GIANT(vfslocked); NDFREE(&nd, NDF_ONLY_PNBUF); FILEDESC_XLOCK(fdp); vp = fdp->fd_cdir; fdp->fd_cdir = nd.ni_vp; FILEDESC_XUNLOCK(fdp); vfslocked = VFS_LOCK_GIANT(vp->v_mount); vrele(vp); VFS_UNLOCK_GIANT(vfslocked); return (0); } /* * Helper function for raised chroot(2) security function: Refuse if * any filedescriptors are open directories. 
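 * Whether this refusal applies is selected by the
 * kern.chroot_allow_open_directories sysctl declared below: 0 refuses
 * for every process, 1 (the default) refuses only processes that are
 * already running chroot(2)'ed, and 2 never refuses.  For example, a
 * hardened host might set:
 *
 *      sysctl kern.chroot_allow_open_directories=0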
*/ static int chroot_refuse_vdir_fds(fdp) struct filedesc *fdp; { struct vnode *vp; struct file *fp; int fd; FILEDESC_LOCK_ASSERT(fdp); for (fd = 0; fd < fdp->fd_nfiles ; fd++) { fp = fget_locked(fdp, fd); if (fp == NULL) continue; if (fp->f_type == DTYPE_VNODE) { vp = fp->f_vnode; if (vp->v_type == VDIR) return (EPERM); } } return (0); } /* * This sysctl determines if we will allow a process to chroot(2) if it * has a directory open: * 0: disallowed for all processes. * 1: allowed for processes that were not already chroot(2)'ed. * 2: allowed for all processes. */ static int chroot_allow_open_directories = 1; SYSCTL_INT(_kern, OID_AUTO, chroot_allow_open_directories, CTLFLAG_RW, &chroot_allow_open_directories, 0, ""); /* * Change notion of root (``/'') directory. */ #ifndef _SYS_SYSPROTO_H_ struct chroot_args { char *path; }; #endif int chroot(td, uap) struct thread *td; struct chroot_args /* { char *path; } */ *uap; { int error; struct nameidata nd; int vfslocked; error = priv_check(td, PRIV_VFS_CHROOT); if (error) return (error); NDINIT(&nd, LOOKUP, FOLLOW | LOCKSHARED | LOCKLEAF | MPSAFE | AUDITVNODE1, UIO_USERSPACE, uap->path, td); error = namei(&nd); if (error) goto error; vfslocked = NDHASGIANT(&nd); if ((error = change_dir(nd.ni_vp, td)) != 0) goto e_vunlock; #ifdef MAC if ((error = mac_vnode_check_chroot(td->td_ucred, nd.ni_vp))) goto e_vunlock; #endif VOP_UNLOCK(nd.ni_vp, 0); error = change_root(nd.ni_vp, td); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); NDFREE(&nd, NDF_ONLY_PNBUF); return (error); e_vunlock: vput(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); error: NDFREE(&nd, NDF_ONLY_PNBUF); return (error); } /* * Common routine for chroot and chdir. Callers must provide a locked vnode * instance. */ int change_dir(vp, td) struct vnode *vp; struct thread *td; { int error; ASSERT_VOP_LOCKED(vp, "change_dir(): vp not locked"); if (vp->v_type != VDIR) return (ENOTDIR); #ifdef MAC error = mac_vnode_check_chdir(td->td_ucred, vp); if (error) return (error); #endif error = VOP_ACCESS(vp, VEXEC, td->td_ucred, td); return (error); } /* * Common routine for kern_chroot() and jail_attach(). The caller is * responsible for invoking priv_check() and mac_vnode_check_chroot() to * authorize this operation. */ int change_root(vp, td) struct vnode *vp; struct thread *td; { struct filedesc *fdp; struct vnode *oldvp; int vfslocked; int error; VFS_ASSERT_GIANT(vp->v_mount); fdp = td->td_proc->p_fd; FILEDESC_XLOCK(fdp); if (chroot_allow_open_directories == 0 || (chroot_allow_open_directories == 1 && fdp->fd_rdir != rootvnode)) { error = chroot_refuse_vdir_fds(fdp); if (error) { FILEDESC_XUNLOCK(fdp); return (error); } } oldvp = fdp->fd_rdir; fdp->fd_rdir = vp; VREF(fdp->fd_rdir); if (!fdp->fd_jdir) { fdp->fd_jdir = vp; VREF(fdp->fd_jdir); } FILEDESC_XUNLOCK(fdp); vfslocked = VFS_LOCK_GIANT(oldvp->v_mount); vrele(oldvp); VFS_UNLOCK_GIANT(vfslocked); return (0); } /* * Check permissions, allocate an open file structure, and call the device * open routine if any. 
*/ #ifndef _SYS_SYSPROTO_H_ struct open_args { char *path; int flags; int mode; }; #endif int open(td, uap) struct thread *td; register struct open_args /* { char *path; int flags; int mode; } */ *uap; { return (kern_open(td, uap->path, UIO_USERSPACE, uap->flags, uap->mode)); } #ifndef _SYS_SYSPROTO_H_ struct openat_args { int fd; char *path; int flag; int mode; }; #endif int openat(struct thread *td, struct openat_args *uap) { return (kern_openat(td, uap->fd, uap->path, UIO_USERSPACE, uap->flag, uap->mode)); } int kern_open(struct thread *td, char *path, enum uio_seg pathseg, int flags, int mode) { return (kern_openat(td, AT_FDCWD, path, pathseg, flags, mode)); } int kern_openat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int flags, int mode) { struct proc *p = td->td_proc; struct filedesc *fdp = p->p_fd; struct file *fp; struct vnode *vp; struct vattr vat; struct mount *mp; int cmode; struct file *nfp; int type, indx, error; struct flock lf; struct nameidata nd; int vfslocked; AUDIT_ARG(fflags, flags); AUDIT_ARG(mode, mode); /* XXX: audit dirfd */ /* * Only one of the O_EXEC, O_RDONLY, O_WRONLY and O_RDWR may * be specified. */ if (flags & O_EXEC) { if (flags & O_ACCMODE) return (EINVAL); } else if ((flags & O_ACCMODE) == O_ACCMODE) return (EINVAL); else flags = FFLAGS(flags); error = falloc(td, &nfp, &indx); if (error) return (error); /* An extra reference on `nfp' has been held for us by falloc(). */ fp = nfp; /* Set the flags early so the finit in devfs can pick them up. */ fp->f_flag = flags & FMASK; cmode = ((mode &~ fdp->fd_cmask) & ALLPERMS) &~ S_ISTXT; NDINIT_AT(&nd, LOOKUP, FOLLOW | AUDITVNODE1 | MPSAFE, pathseg, path, fd, td); td->td_dupfd = -1; /* XXX check for fdopen */ error = vn_open(&nd, &flags, cmode, fp); if (error) { /* * If the vn_open replaced the method vector, something * wonderous happened deep below and we just pass it up * pretending we know what we do. */ if (error == ENXIO && fp->f_ops != &badfileops) { fdrop(fp, td); td->td_retval[0] = indx; return (0); } /* * handle special fdopen() case. bleh. dupfdopen() is * responsible for dropping the old contents of ofiles[indx] * if it succeeds. */ if ((error == ENODEV || error == ENXIO) && td->td_dupfd >= 0 && /* XXX from fdopen */ (error = dupfdopen(td, fdp, indx, td->td_dupfd, flags, error)) == 0) { td->td_retval[0] = indx; fdrop(fp, td); return (0); } /* * Clean up the descriptor, but only if another thread hadn't * replaced or closed it. */ fdclose(fdp, fp, indx, td); fdrop(fp, td); if (error == ERESTART) error = EINTR; return (error); } td->td_dupfd = 0; vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); vp = nd.ni_vp; fp->f_vnode = vp; /* XXX Does devfs need this? */ /* * If the file wasn't claimed by devfs bind it to the normal * vnode operations here. 
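 * That is, if f_ops still points at badfileops, finit() installs the generic
 * &vnops file operations and the initial sequential-read heuristic (f_seqcount).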
*/ if (fp->f_ops == &badfileops) { KASSERT(vp->v_type != VFIFO, ("Unexpected fifo.")); fp->f_seqcount = 1; finit(fp, flags & FMASK, DTYPE_VNODE, vp, &vnops); } VOP_UNLOCK(vp, 0); if (flags & (O_EXLOCK | O_SHLOCK)) { lf.l_whence = SEEK_SET; lf.l_start = 0; lf.l_len = 0; if (flags & O_EXLOCK) lf.l_type = F_WRLCK; else lf.l_type = F_RDLCK; type = F_FLOCK; if ((flags & FNONBLOCK) == 0) type |= F_WAIT; if ((error = VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &lf, type)) != 0) goto bad; atomic_set_int(&fp->f_flag, FHASLOCK); } if (flags & O_TRUNC) { if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0) goto bad; VATTR_NULL(&vat); vat.va_size = 0; vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); #ifdef MAC error = mac_vnode_check_write(td->td_ucred, fp->f_cred, vp); if (error == 0) #endif error = VOP_SETATTR(vp, &vat, td->td_ucred); VOP_UNLOCK(vp, 0); vn_finished_write(mp); if (error) goto bad; } VFS_UNLOCK_GIANT(vfslocked); /* * Release our private reference, leaving the one associated with * the descriptor table intact. */ fdrop(fp, td); td->td_retval[0] = indx; return (0); bad: VFS_UNLOCK_GIANT(vfslocked); fdclose(fdp, fp, indx, td); fdrop(fp, td); return (error); } #ifdef COMPAT_43 /* * Create a file. */ #ifndef _SYS_SYSPROTO_H_ struct ocreat_args { char *path; int mode; }; #endif int ocreat(td, uap) struct thread *td; register struct ocreat_args /* { char *path; int mode; } */ *uap; { return (kern_open(td, uap->path, UIO_USERSPACE, O_WRONLY | O_CREAT | O_TRUNC, uap->mode)); } #endif /* COMPAT_43 */ /* * Create a special file. */ #ifndef _SYS_SYSPROTO_H_ struct mknod_args { char *path; int mode; int dev; }; #endif int mknod(td, uap) struct thread *td; register struct mknod_args /* { char *path; int mode; int dev; } */ *uap; { return (kern_mknod(td, uap->path, UIO_USERSPACE, uap->mode, uap->dev)); } #ifndef _SYS_SYSPROTO_H_ struct mknodat_args { int fd; char *path; mode_t mode; dev_t dev; }; #endif int mknodat(struct thread *td, struct mknodat_args *uap) { return (kern_mknodat(td, uap->fd, uap->path, UIO_USERSPACE, uap->mode, uap->dev)); } int kern_mknod(struct thread *td, char *path, enum uio_seg pathseg, int mode, int dev) { return (kern_mknodat(td, AT_FDCWD, path, pathseg, mode, dev)); } int kern_mknodat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int mode, int dev) { struct vnode *vp; struct mount *mp; struct vattr vattr; int error; int whiteout = 0; struct nameidata nd; int vfslocked; AUDIT_ARG(mode, mode); AUDIT_ARG(dev, dev); switch (mode & S_IFMT) { case S_IFCHR: case S_IFBLK: error = priv_check(td, PRIV_VFS_MKNOD_DEV); break; case S_IFMT: error = priv_check(td, PRIV_VFS_MKNOD_BAD); break; case S_IFWHT: error = priv_check(td, PRIV_VFS_MKNOD_WHT); break; case S_IFIFO: if (dev == 0) return (kern_mkfifoat(td, fd, path, pathseg, mode)); /* FALLTHROUGH */ default: error = EINVAL; break; } if (error) return (error); restart: bwillwrite(); NDINIT_AT(&nd, CREATE, LOCKPARENT | SAVENAME | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); vp = nd.ni_vp; if (vp != NULL) { NDFREE(&nd, NDF_ONLY_PNBUF); if (vp == nd.ni_dvp) vrele(nd.ni_dvp); else vput(nd.ni_dvp); vrele(vp); VFS_UNLOCK_GIANT(vfslocked); return (EEXIST); } else { VATTR_NULL(&vattr); FILEDESC_SLOCK(td->td_proc->p_fd); vattr.va_mode = (mode & ALLPERMS) & ~td->td_proc->p_fd->fd_cmask; FILEDESC_SUNLOCK(td->td_proc->p_fd); vattr.va_rdev = dev; whiteout = 0; switch (mode & S_IFMT) { case S_IFMT: /* used by badsect to flag bad sectors */ vattr.va_type = VBAD; 
break; case S_IFCHR: vattr.va_type = VCHR; break; case S_IFBLK: vattr.va_type = VBLK; break; case S_IFWHT: whiteout = 1; break; default: panic("kern_mknod: invalid mode"); } } if (vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); VFS_UNLOCK_GIANT(vfslocked); if ((error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH)) != 0) return (error); goto restart; } #ifdef MAC if (error == 0 && !whiteout) error = mac_vnode_check_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd, &vattr); #endif if (!error) { if (whiteout) error = VOP_WHITEOUT(nd.ni_dvp, &nd.ni_cnd, CREATE); else { error = VOP_MKNOD(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr); if (error == 0) vput(nd.ni_vp); } } NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); vn_finished_write(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Create a named pipe. */ #ifndef _SYS_SYSPROTO_H_ struct mkfifo_args { char *path; int mode; }; #endif int mkfifo(td, uap) struct thread *td; register struct mkfifo_args /* { char *path; int mode; } */ *uap; { return (kern_mkfifo(td, uap->path, UIO_USERSPACE, uap->mode)); } #ifndef _SYS_SYSPROTO_H_ struct mkfifoat_args { int fd; char *path; mode_t mode; }; #endif int mkfifoat(struct thread *td, struct mkfifoat_args *uap) { return (kern_mkfifoat(td, uap->fd, uap->path, UIO_USERSPACE, uap->mode)); } int kern_mkfifo(struct thread *td, char *path, enum uio_seg pathseg, int mode) { return (kern_mkfifoat(td, AT_FDCWD, path, pathseg, mode)); } int kern_mkfifoat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int mode) { struct mount *mp; struct vattr vattr; int error; struct nameidata nd; int vfslocked; AUDIT_ARG(mode, mode); restart: bwillwrite(); NDINIT_AT(&nd, CREATE, LOCKPARENT | SAVENAME | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); if (nd.ni_vp != NULL) { NDFREE(&nd, NDF_ONLY_PNBUF); if (nd.ni_vp == nd.ni_dvp) vrele(nd.ni_dvp); else vput(nd.ni_dvp); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (EEXIST); } if (vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); VFS_UNLOCK_GIANT(vfslocked); if ((error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH)) != 0) return (error); goto restart; } VATTR_NULL(&vattr); vattr.va_type = VFIFO; FILEDESC_SLOCK(td->td_proc->p_fd); vattr.va_mode = (mode & ALLPERMS) & ~td->td_proc->p_fd->fd_cmask; FILEDESC_SUNLOCK(td->td_proc->p_fd); #ifdef MAC error = mac_vnode_check_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd, &vattr); if (error) goto out; #endif error = VOP_MKNOD(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr); if (error == 0) vput(nd.ni_vp); #ifdef MAC out: #endif vput(nd.ni_dvp); vn_finished_write(mp); VFS_UNLOCK_GIANT(vfslocked); NDFREE(&nd, NDF_ONLY_PNBUF); return (error); } /* * Make a hard file link. */ #ifndef _SYS_SYSPROTO_H_ struct link_args { char *path; char *link; }; #endif int link(td, uap) struct thread *td; register struct link_args /* { char *path; char *link; } */ *uap; { return (kern_link(td, uap->path, uap->link, UIO_USERSPACE)); } #ifndef _SYS_SYSPROTO_H_ struct linkat_args { int fd1; char *path1; int fd2; char *path2; int flag; }; #endif int linkat(struct thread *td, struct linkat_args *uap) { int flag; flag = uap->flag; if (flag & ~AT_SYMLINK_FOLLOW) return (EINVAL); return (kern_linkat(td, uap->fd1, uap->fd2, uap->path1, uap->path2, UIO_USERSPACE, (flag & AT_SYMLINK_FOLLOW) ? 
FOLLOW : NOFOLLOW)); } static int hardlink_check_uid = 0; SYSCTL_INT(_security_bsd, OID_AUTO, hardlink_check_uid, CTLFLAG_RW, &hardlink_check_uid, 0, "Unprivileged processes cannot create hard links to files owned by other " "users"); static int hardlink_check_gid = 0; SYSCTL_INT(_security_bsd, OID_AUTO, hardlink_check_gid, CTLFLAG_RW, &hardlink_check_gid, 0, "Unprivileged processes cannot create hard links to files owned by other " "groups"); static int can_hardlink(struct vnode *vp, struct ucred *cred) { struct vattr va; int error; if (!hardlink_check_uid && !hardlink_check_gid) return (0); error = VOP_GETATTR(vp, &va, cred); if (error != 0) return (error); if (hardlink_check_uid && cred->cr_uid != va.va_uid) { error = priv_check_cred(cred, PRIV_VFS_LINK, 0); if (error) return (error); } if (hardlink_check_gid && !groupmember(va.va_gid, cred)) { error = priv_check_cred(cred, PRIV_VFS_LINK, 0); if (error) return (error); } return (0); } int kern_link(struct thread *td, char *path, char *link, enum uio_seg segflg) { return (kern_linkat(td, AT_FDCWD, AT_FDCWD, path,link, segflg, FOLLOW)); } int kern_linkat(struct thread *td, int fd1, int fd2, char *path1, char *path2, enum uio_seg segflg, int follow) { struct vnode *vp; struct mount *mp; struct nameidata nd; int vfslocked; int lvfslocked; int error; bwillwrite(); NDINIT_AT(&nd, LOOKUP, follow | MPSAFE | AUDITVNODE1, segflg, path1, fd1, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); vp = nd.ni_vp; if (vp->v_type == VDIR) { vrele(vp); VFS_UNLOCK_GIANT(vfslocked); return (EPERM); /* POSIX */ } if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0) { vrele(vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } NDINIT_AT(&nd, CREATE, LOCKPARENT | SAVENAME | MPSAFE | AUDITVNODE1, segflg, path2, fd2, td); if ((error = namei(&nd)) == 0) { lvfslocked = NDHASGIANT(&nd); if (nd.ni_vp != NULL) { if (nd.ni_dvp == nd.ni_vp) vrele(nd.ni_dvp); else vput(nd.ni_dvp); vrele(nd.ni_vp); error = EEXIST; } else if ((error = vn_lock(vp, LK_EXCLUSIVE | LK_RETRY)) == 0) { error = can_hardlink(vp, td->td_ucred); if (error == 0) #ifdef MAC error = mac_vnode_check_link(td->td_ucred, nd.ni_dvp, vp, &nd.ni_cnd); if (error == 0) #endif error = VOP_LINK(nd.ni_dvp, vp, &nd.ni_cnd); VOP_UNLOCK(vp, 0); vput(nd.ni_dvp); } NDFREE(&nd, NDF_ONLY_PNBUF); VFS_UNLOCK_GIANT(lvfslocked); } vrele(vp); vn_finished_write(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Make a symbolic link. 
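 * The link target (path1) is copied into a temporary namei-zone buffer and
 * passed to VOP_SYMLINK() verbatim; the kernel does not interpret or resolve it.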
*/ #ifndef _SYS_SYSPROTO_H_ struct symlink_args { char *path; char *link; }; #endif int symlink(td, uap) struct thread *td; register struct symlink_args /* { char *path; char *link; } */ *uap; { return (kern_symlink(td, uap->path, uap->link, UIO_USERSPACE)); } #ifndef _SYS_SYSPROTO_H_ struct symlinkat_args { char *path; int fd; char *path2; }; #endif int symlinkat(struct thread *td, struct symlinkat_args *uap) { return (kern_symlinkat(td, uap->path1, uap->fd, uap->path2, UIO_USERSPACE)); } int kern_symlink(struct thread *td, char *path, char *link, enum uio_seg segflg) { return (kern_symlinkat(td, path, AT_FDCWD, link, segflg)); } int kern_symlinkat(struct thread *td, char *path1, int fd, char *path2, enum uio_seg segflg) { struct mount *mp; struct vattr vattr; char *syspath; int error; struct nameidata nd; int vfslocked; if (segflg == UIO_SYSSPACE) { syspath = path1; } else { syspath = uma_zalloc(namei_zone, M_WAITOK); if ((error = copyinstr(path1, syspath, MAXPATHLEN, NULL)) != 0) goto out; } AUDIT_ARG(text, syspath); restart: bwillwrite(); NDINIT_AT(&nd, CREATE, LOCKPARENT | SAVENAME | MPSAFE | AUDITVNODE1, segflg, path2, fd, td); if ((error = namei(&nd)) != 0) goto out; vfslocked = NDHASGIANT(&nd); if (nd.ni_vp) { NDFREE(&nd, NDF_ONLY_PNBUF); if (nd.ni_vp == nd.ni_dvp) vrele(nd.ni_dvp); else vput(nd.ni_dvp); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); error = EEXIST; goto out; } if (vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); VFS_UNLOCK_GIANT(vfslocked); if ((error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH)) != 0) goto out; goto restart; } VATTR_NULL(&vattr); FILEDESC_SLOCK(td->td_proc->p_fd); vattr.va_mode = ACCESSPERMS &~ td->td_proc->p_fd->fd_cmask; FILEDESC_SUNLOCK(td->td_proc->p_fd); #ifdef MAC vattr.va_type = VLNK; error = mac_vnode_check_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd, &vattr); if (error) goto out2; #endif error = VOP_SYMLINK(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr, syspath); if (error == 0) vput(nd.ni_vp); #ifdef MAC out2: #endif NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); vn_finished_write(mp); VFS_UNLOCK_GIANT(vfslocked); out: if (segflg != UIO_SYSSPACE) uma_zfree(namei_zone, syspath); return (error); } /* * Delete a whiteout from the filesystem. */ int undelete(td, uap) struct thread *td; register struct undelete_args /* { char *path; } */ *uap; { int error; struct mount *mp; struct nameidata nd; int vfslocked; restart: bwillwrite(); NDINIT(&nd, DELETE, LOCKPARENT | DOWHITEOUT | MPSAFE | AUDITVNODE1, UIO_USERSPACE, uap->path, td); error = namei(&nd); if (error) return (error); vfslocked = NDHASGIANT(&nd); if (nd.ni_vp != NULLVP || !(nd.ni_cnd.cn_flags & ISWHITEOUT)) { NDFREE(&nd, NDF_ONLY_PNBUF); if (nd.ni_vp == nd.ni_dvp) vrele(nd.ni_dvp); else vput(nd.ni_dvp); if (nd.ni_vp) vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (EEXIST); } if (vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); VFS_UNLOCK_GIANT(vfslocked); if ((error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH)) != 0) return (error); goto restart; } error = VOP_WHITEOUT(nd.ni_dvp, &nd.ni_cnd, DELETE); NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); vn_finished_write(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Delete a name from the filesystem. 
*/ #ifndef _SYS_SYSPROTO_H_ struct unlink_args { char *path; }; #endif int unlink(td, uap) struct thread *td; struct unlink_args /* { char *path; } */ *uap; { return (kern_unlink(td, uap->path, UIO_USERSPACE)); } #ifndef _SYS_SYSPROTO_H_ struct unlinkat_args { int fd; char *path; int flag; }; #endif int unlinkat(struct thread *td, struct unlinkat_args *uap) { int flag = uap->flag; int fd = uap->fd; char *path = uap->path; if (flag & ~AT_REMOVEDIR) return (EINVAL); if (flag & AT_REMOVEDIR) return (kern_rmdirat(td, fd, path, UIO_USERSPACE)); else return (kern_unlinkat(td, fd, path, UIO_USERSPACE)); } int kern_unlink(struct thread *td, char *path, enum uio_seg pathseg) { return (kern_unlinkat(td, AT_FDCWD, path, pathseg)); } int kern_unlinkat(struct thread *td, int fd, char *path, enum uio_seg pathseg) { struct mount *mp; struct vnode *vp; int error; struct nameidata nd; int vfslocked; restart: bwillwrite(); NDINIT_AT(&nd, DELETE, LOCKPARENT | LOCKLEAF | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error == EINVAL ? EPERM : error); vfslocked = NDHASGIANT(&nd); vp = nd.ni_vp; if (vp->v_type == VDIR) error = EPERM; /* POSIX */ else { /* * The root of a mounted filesystem cannot be deleted. * * XXX: can this only be a VDIR case? */ if (vp->v_vflag & VV_ROOT) error = EBUSY; } if (error == 0) { if (vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); if (vp == nd.ni_dvp) vrele(vp); else vput(vp); VFS_UNLOCK_GIANT(vfslocked); if ((error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH)) != 0) return (error); goto restart; } #ifdef MAC error = mac_vnode_check_unlink(td->td_ucred, nd.ni_dvp, vp, &nd.ni_cnd); if (error) goto out; #endif error = VOP_REMOVE(nd.ni_dvp, vp, &nd.ni_cnd); #ifdef MAC out: #endif vn_finished_write(mp); } NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); if (vp == nd.ni_dvp) vrele(vp); else vput(vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Reposition read/write file offset. 
*/ #ifndef _SYS_SYSPROTO_H_ struct lseek_args { int fd; int pad; off_t offset; int whence; }; #endif int lseek(td, uap) struct thread *td; register struct lseek_args /* { int fd; int pad; off_t offset; int whence; } */ *uap; { struct ucred *cred = td->td_ucred; struct file *fp; struct vnode *vp; struct vattr vattr; off_t offset; int error, noneg; int vfslocked; if ((error = fget(td, uap->fd, &fp)) != 0) return (error); if (!(fp->f_ops->fo_flags & DFLAG_SEEKABLE)) { fdrop(fp, td); return (ESPIPE); } vp = fp->f_vnode; vfslocked = VFS_LOCK_GIANT(vp->v_mount); noneg = (vp->v_type != VCHR); offset = uap->offset; switch (uap->whence) { case L_INCR: if (noneg && (fp->f_offset < 0 || (offset > 0 && fp->f_offset > OFF_MAX - offset))) { error = EOVERFLOW; break; } offset += fp->f_offset; break; case L_XTND: vn_lock(vp, LK_SHARED | LK_RETRY); error = VOP_GETATTR(vp, &vattr, cred); VOP_UNLOCK(vp, 0); if (error) break; if (noneg && (vattr.va_size > OFF_MAX || (offset > 0 && vattr.va_size > OFF_MAX - offset))) { error = EOVERFLOW; break; } offset += vattr.va_size; break; case L_SET: break; case SEEK_DATA: error = fo_ioctl(fp, FIOSEEKDATA, &offset, cred, td); break; case SEEK_HOLE: error = fo_ioctl(fp, FIOSEEKHOLE, &offset, cred, td); break; default: error = EINVAL; } if (error == 0 && noneg && offset < 0) error = EINVAL; if (error != 0) goto drop; fp->f_offset = offset; *(off_t *)(td->td_retval) = fp->f_offset; drop: fdrop(fp, td); VFS_UNLOCK_GIANT(vfslocked); return (error); } #if defined(COMPAT_43) /* * Reposition read/write file offset. */ #ifndef _SYS_SYSPROTO_H_ struct olseek_args { int fd; long offset; int whence; }; #endif int olseek(td, uap) struct thread *td; register struct olseek_args /* { int fd; long offset; int whence; } */ *uap; { struct lseek_args /* { int fd; int pad; off_t offset; int whence; } */ nuap; nuap.fd = uap->fd; nuap.offset = uap->offset; nuap.whence = uap->whence; return (lseek(td, &nuap)); } #endif /* COMPAT_43 */ /* Version with the 'pad' argument */ int freebsd6_lseek(td, uap) struct thread *td; register struct freebsd6_lseek_args *uap; { struct lseek_args ouap; ouap.fd = uap->fd; ouap.offset = uap->offset; ouap.whence = uap->whence; return (lseek(td, &ouap)); } /* * Check access permissions using passed credentials. */ static int vn_access(vp, user_flags, cred, td) struct vnode *vp; int user_flags; struct ucred *cred; struct thread *td; { int error; accmode_t accmode; /* Flags == 0 means only check for existence. */ error = 0; if (user_flags) { accmode = 0; if (user_flags & R_OK) accmode |= VREAD; if (user_flags & W_OK) accmode |= VWRITE; if (user_flags & X_OK) accmode |= VEXEC; #ifdef MAC error = mac_vnode_check_access(cred, vp, accmode); if (error) return (error); #endif if ((accmode & VWRITE) == 0 || (error = vn_writechk(vp)) == 0) error = VOP_ACCESS(vp, accmode, cred, td); } return (error); } /* * Check access permissions using "real" credentials. 
*/ #ifndef _SYS_SYSPROTO_H_ struct access_args { char *path; int flags; }; #endif int access(td, uap) struct thread *td; register struct access_args /* { char *path; int flags; } */ *uap; { return (kern_access(td, uap->path, UIO_USERSPACE, uap->flags)); } #ifndef _SYS_SYSPROTO_H_ struct faccessat_args { int dirfd; char *path; int mode; int flag; } #endif int faccessat(struct thread *td, struct faccessat_args *uap) { if (uap->flag & ~AT_EACCESS) return (EINVAL); return (kern_accessat(td, uap->fd, uap->path, UIO_USERSPACE, uap->flag, uap->mode)); } int kern_access(struct thread *td, char *path, enum uio_seg pathseg, int mode) { return (kern_accessat(td, AT_FDCWD, path, pathseg, 0, mode)); } int kern_accessat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int flags, int mode) { struct ucred *cred, *tmpcred; struct vnode *vp; struct nameidata nd; int vfslocked; int error; /* * Create and modify a temporary credential instead of one that * is potentially shared. This could also mess up socket * buffer accounting which can run in an interrupt context. */ if (!(flags & AT_EACCESS)) { cred = td->td_ucred; tmpcred = crdup(cred); tmpcred->cr_uid = cred->cr_ruid; tmpcred->cr_groups[0] = cred->cr_rgid; td->td_ucred = tmpcred; } else cred = tmpcred = td->td_ucred; NDINIT_AT(&nd, LOOKUP, FOLLOW | LOCKSHARED | LOCKLEAF | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) goto out1; vfslocked = NDHASGIANT(&nd); vp = nd.ni_vp; error = vn_access(vp, mode, tmpcred, td); NDFREE(&nd, NDF_ONLY_PNBUF); vput(vp); VFS_UNLOCK_GIANT(vfslocked); out1: if (!(flags & AT_EACCESS)) { td->td_ucred = cred; crfree(tmpcred); } return (error); } /* * Check access permissions using "effective" credentials. */ #ifndef _SYS_SYSPROTO_H_ struct eaccess_args { char *path; int flags; }; #endif int eaccess(td, uap) struct thread *td; register struct eaccess_args /* { char *path; int flags; } */ *uap; { return (kern_eaccess(td, uap->path, UIO_USERSPACE, uap->flags)); } int kern_eaccess(struct thread *td, char *path, enum uio_seg pathseg, int flags) { return (kern_accessat(td, AT_FDCWD, path, pathseg, AT_EACCESS, flags)); } #if defined(COMPAT_43) /* * Get file status; this version follows links. */ #ifndef _SYS_SYSPROTO_H_ struct ostat_args { char *path; struct ostat *ub; }; #endif int ostat(td, uap) struct thread *td; register struct ostat_args /* { char *path; struct ostat *ub; } */ *uap; { struct stat sb; struct ostat osb; int error; error = kern_stat(td, uap->path, UIO_USERSPACE, &sb); if (error) return (error); cvtstat(&sb, &osb); error = copyout(&osb, uap->ub, sizeof (osb)); return (error); } /* * Get file status; this version does not follow links. */ #ifndef _SYS_SYSPROTO_H_ struct olstat_args { char *path; struct ostat *ub; }; #endif int olstat(td, uap) struct thread *td; register struct olstat_args /* { char *path; struct ostat *ub; } */ *uap; { struct stat sb; struct ostat osb; int error; error = kern_lstat(td, uap->path, UIO_USERSPACE, &sb); if (error) return (error); cvtstat(&sb, &osb); error = copyout(&osb, uap->ub, sizeof (osb)); return (error); } /* * Convert from an old to a new stat structure. 
*/ void cvtstat(st, ost) struct stat *st; struct ostat *ost; { ost->st_dev = st->st_dev; ost->st_ino = st->st_ino; ost->st_mode = st->st_mode; ost->st_nlink = st->st_nlink; ost->st_uid = st->st_uid; ost->st_gid = st->st_gid; ost->st_rdev = st->st_rdev; if (st->st_size < (quad_t)1 << 32) ost->st_size = st->st_size; else ost->st_size = -2; ost->st_atime = st->st_atime; ost->st_mtime = st->st_mtime; ost->st_ctime = st->st_ctime; ost->st_blksize = st->st_blksize; ost->st_blocks = st->st_blocks; ost->st_flags = st->st_flags; ost->st_gen = st->st_gen; } #endif /* COMPAT_43 */ /* * Get file status; this version follows links. */ #ifndef _SYS_SYSPROTO_H_ struct stat_args { char *path; struct stat *ub; }; #endif int stat(td, uap) struct thread *td; register struct stat_args /* { char *path; struct stat *ub; } */ *uap; { struct stat sb; int error; error = kern_stat(td, uap->path, UIO_USERSPACE, &sb); if (error == 0) error = copyout(&sb, uap->ub, sizeof (sb)); return (error); } #ifndef _SYS_SYSPROTO_H_ struct fstatat_args { int fd; char *path; struct stat *buf; int flag; } #endif int fstatat(struct thread *td, struct fstatat_args *uap) { struct stat sb; int error; error = kern_statat(td, uap->flag, uap->fd, uap->path, UIO_USERSPACE, &sb); if (error == 0) error = copyout(&sb, uap->buf, sizeof (sb)); return (error); } int kern_stat(struct thread *td, char *path, enum uio_seg pathseg, struct stat *sbp) { return (kern_statat(td, 0, AT_FDCWD, path, pathseg, sbp)); } int kern_statat(struct thread *td, int flag, int fd, char *path, enum uio_seg pathseg, struct stat *sbp) { return (kern_statat_vnhook(td, flag, fd, path, pathseg, sbp, NULL)); } int kern_statat_vnhook(struct thread *td, int flag, int fd, char *path, enum uio_seg pathseg, struct stat *sbp, void (*hook)(struct vnode *vp, struct stat *sbp)) { struct nameidata nd; struct stat sb; int error, vfslocked; if (flag & ~AT_SYMLINK_NOFOLLOW) return (EINVAL); NDINIT_AT(&nd, LOOKUP, ((flag & AT_SYMLINK_NOFOLLOW) ? NOFOLLOW : FOLLOW) | LOCKSHARED | LOCKLEAF | AUDITVNODE1 | MPSAFE, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); error = vn_stat(nd.ni_vp, &sb, td->td_ucred, NOCRED, td); if (!error) { SDT_PROBE(vfs, , stat, mode, path, sb.st_mode, 0, 0, 0); if (S_ISREG(sb.st_mode)) SDT_PROBE(vfs, , stat, reg, path, pathseg, 0, 0, 0); if (__predict_false(hook != NULL)) hook(nd.ni_vp, &sb); } NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); if (error) return (error); *sbp = sb; #ifdef KTRACE if (KTRPOINT(td, KTR_STRUCT)) ktrstat(&sb); #endif return (0); } /* * Get file status; this version does not follow links. */ #ifndef _SYS_SYSPROTO_H_ struct lstat_args { char *path; struct stat *ub; }; #endif int lstat(td, uap) struct thread *td; register struct lstat_args /* { char *path; struct stat *ub; } */ *uap; { struct stat sb; int error; error = kern_lstat(td, uap->path, UIO_USERSPACE, &sb); if (error == 0) error = copyout(&sb, uap->ub, sizeof (sb)); return (error); } int kern_lstat(struct thread *td, char *path, enum uio_seg pathseg, struct stat *sbp) { return (kern_statat(td, AT_SYMLINK_NOFOLLOW, AT_FDCWD, path, pathseg, sbp)); } /* * Implementation of the NetBSD [l]stat() functions. 
*/ void cvtnstat(sb, nsb) struct stat *sb; struct nstat *nsb; { bzero(nsb, sizeof *nsb); nsb->st_dev = sb->st_dev; nsb->st_ino = sb->st_ino; nsb->st_mode = sb->st_mode; nsb->st_nlink = sb->st_nlink; nsb->st_uid = sb->st_uid; nsb->st_gid = sb->st_gid; nsb->st_rdev = sb->st_rdev; nsb->st_atimespec = sb->st_atimespec; nsb->st_mtimespec = sb->st_mtimespec; nsb->st_ctimespec = sb->st_ctimespec; nsb->st_size = sb->st_size; nsb->st_blocks = sb->st_blocks; nsb->st_blksize = sb->st_blksize; nsb->st_flags = sb->st_flags; nsb->st_gen = sb->st_gen; nsb->st_birthtimespec = sb->st_birthtimespec; } #ifndef _SYS_SYSPROTO_H_ struct nstat_args { char *path; struct nstat *ub; }; #endif int nstat(td, uap) struct thread *td; register struct nstat_args /* { char *path; struct nstat *ub; } */ *uap; { struct stat sb; struct nstat nsb; int error; error = kern_stat(td, uap->path, UIO_USERSPACE, &sb); if (error) return (error); cvtnstat(&sb, &nsb); error = copyout(&nsb, uap->ub, sizeof (nsb)); return (error); } /* * NetBSD lstat. Get file status; this version does not follow links. */ #ifndef _SYS_SYSPROTO_H_ struct lstat_args { char *path; struct stat *ub; }; #endif int nlstat(td, uap) struct thread *td; register struct nlstat_args /* { char *path; struct nstat *ub; } */ *uap; { struct stat sb; struct nstat nsb; int error; error = kern_lstat(td, uap->path, UIO_USERSPACE, &sb); if (error) return (error); cvtnstat(&sb, &nsb); error = copyout(&nsb, uap->ub, sizeof (nsb)); return (error); } /* * Get configurable pathname variables. */ #ifndef _SYS_SYSPROTO_H_ struct pathconf_args { char *path; int name; }; #endif int pathconf(td, uap) struct thread *td; register struct pathconf_args /* { char *path; int name; } */ *uap; { return (kern_pathconf(td, uap->path, UIO_USERSPACE, uap->name)); } int kern_pathconf(struct thread *td, char *path, enum uio_seg pathseg, int name) { struct nameidata nd; int error, vfslocked; NDINIT(&nd, LOOKUP, FOLLOW | LOCKSHARED | LOCKLEAF | MPSAFE | AUDITVNODE1, pathseg, path, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); /* If asynchronous I/O is available, it works for all files. */ if (name == _PC_ASYNC_IO) td->td_retval[0] = async_io_version; else error = VOP_PATHCONF(nd.ni_vp, name, td->td_retval); vput(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Return target name of a symbolic link. 
*/ #ifndef _SYS_SYSPROTO_H_ struct readlink_args { char *path; char *buf; size_t count; }; #endif int readlink(td, uap) struct thread *td; register struct readlink_args /* { char *path; char *buf; size_t count; } */ *uap; { return (kern_readlink(td, uap->path, UIO_USERSPACE, uap->buf, UIO_USERSPACE, uap->count)); } #ifndef _SYS_SYSPROTO_H_ struct readlinkat_args { int fd; char *path; char *buf; size_t bufsize; }; #endif int readlinkat(struct thread *td, struct readlinkat_args *uap) { return (kern_readlinkat(td, uap->fd, uap->path, UIO_USERSPACE, uap->buf, UIO_USERSPACE, uap->bufsize)); } int kern_readlink(struct thread *td, char *path, enum uio_seg pathseg, char *buf, enum uio_seg bufseg, size_t count) { return (kern_readlinkat(td, AT_FDCWD, path, pathseg, buf, bufseg, count)); } int kern_readlinkat(struct thread *td, int fd, char *path, enum uio_seg pathseg, char *buf, enum uio_seg bufseg, size_t count) { struct vnode *vp; struct iovec aiov; struct uio auio; int error; struct nameidata nd; int vfslocked; if (count > INT_MAX) return (EINVAL); NDINIT_AT(&nd, LOOKUP, NOFOLLOW | LOCKSHARED | LOCKLEAF | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error); NDFREE(&nd, NDF_ONLY_PNBUF); vfslocked = NDHASGIANT(&nd); vp = nd.ni_vp; #ifdef MAC error = mac_vnode_check_readlink(td->td_ucred, vp); if (error) { vput(vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } #endif if (vp->v_type != VLNK) error = EINVAL; else { aiov.iov_base = buf; aiov.iov_len = count; auio.uio_iov = &aiov; auio.uio_iovcnt = 1; auio.uio_offset = 0; auio.uio_rw = UIO_READ; auio.uio_segflg = bufseg; auio.uio_td = td; auio.uio_resid = count; error = VOP_READLINK(vp, &auio, td->td_ucred); } vput(vp); VFS_UNLOCK_GIANT(vfslocked); td->td_retval[0] = count - auio.uio_resid; return (error); } /* * Common implementation code for chflags() and fchflags(). */ static int setfflags(td, vp, flags) struct thread *td; struct vnode *vp; int flags; { int error; struct mount *mp; struct vattr vattr; /* * Prevent non-root users from setting flags on devices. When * a device is reused, users can retain ownership of the device * if they are allowed to set flags and programs assume that * chown can't fail when done as root. */ if (vp->v_type == VCHR || vp->v_type == VBLK) { error = priv_check(td, PRIV_VFS_CHFLAGS_DEV); if (error) return (error); } if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0) return (error); vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); VATTR_NULL(&vattr); vattr.va_flags = flags; #ifdef MAC error = mac_vnode_check_setflags(td->td_ucred, vp, vattr.va_flags); if (error == 0) #endif error = VOP_SETATTR(vp, &vattr, td->td_ucred); VOP_UNLOCK(vp, 0); vn_finished_write(mp); return (error); } /* * Change flags of a file given a path name. */ #ifndef _SYS_SYSPROTO_H_ struct chflags_args { char *path; int flags; }; #endif int chflags(td, uap) struct thread *td; register struct chflags_args /* { char *path; int flags; } */ *uap; { int error; struct nameidata nd; int vfslocked; AUDIT_ARG(fflags, uap->flags); NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE | AUDITVNODE1, UIO_USERSPACE, uap->path, td); if ((error = namei(&nd)) != 0) return (error); NDFREE(&nd, NDF_ONLY_PNBUF); vfslocked = NDHASGIANT(&nd); error = setfflags(td, nd.ni_vp, uap->flags); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Same as chflags() but doesn't follow symlinks. 
*/ int lchflags(td, uap) struct thread *td; register struct lchflags_args /* { char *path; int flags; } */ *uap; { int error; struct nameidata nd; int vfslocked; AUDIT_ARG(fflags, uap->flags); NDINIT(&nd, LOOKUP, NOFOLLOW | MPSAFE | AUDITVNODE1, UIO_USERSPACE, uap->path, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); error = setfflags(td, nd.ni_vp, uap->flags); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Change flags of a file given a file descriptor. */ #ifndef _SYS_SYSPROTO_H_ struct fchflags_args { int fd; int flags; }; #endif int fchflags(td, uap) struct thread *td; register struct fchflags_args /* { int fd; int flags; } */ *uap; { struct file *fp; int vfslocked; int error; AUDIT_ARG(fd, uap->fd); AUDIT_ARG(fflags, uap->flags); if ((error = getvnode(td->td_proc->p_fd, uap->fd, &fp)) != 0) return (error); vfslocked = VFS_LOCK_GIANT(fp->f_vnode->v_mount); #ifdef AUDIT vn_lock(fp->f_vnode, LK_SHARED | LK_RETRY); AUDIT_ARG(vnode, fp->f_vnode, ARG_VNODE1); VOP_UNLOCK(fp->f_vnode, 0); #endif error = setfflags(td, fp->f_vnode, uap->flags); VFS_UNLOCK_GIANT(vfslocked); fdrop(fp, td); return (error); } /* * Common implementation code for chmod(), lchmod() and fchmod(). */ static int setfmode(td, vp, mode) struct thread *td; struct vnode *vp; int mode; { int error; struct mount *mp; struct vattr vattr; if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0) return (error); vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); VATTR_NULL(&vattr); vattr.va_mode = mode & ALLPERMS; #ifdef MAC error = mac_vnode_check_setmode(td->td_ucred, vp, vattr.va_mode); if (error == 0) #endif error = VOP_SETATTR(vp, &vattr, td->td_ucred); VOP_UNLOCK(vp, 0); vn_finished_write(mp); return (error); } /* * Change mode of a file given path name. */ #ifndef _SYS_SYSPROTO_H_ struct chmod_args { char *path; int mode; }; #endif int chmod(td, uap) struct thread *td; register struct chmod_args /* { char *path; int mode; } */ *uap; { return (kern_chmod(td, uap->path, UIO_USERSPACE, uap->mode)); } #ifndef _SYS_SYSPROTO_H_ struct fchmodat_args { int dirfd; char *path; mode_t mode; int flag; } #endif int fchmodat(struct thread *td, struct fchmodat_args *uap) { int flag = uap->flag; int fd = uap->fd; char *path = uap->path; mode_t mode = uap->mode; if (flag & ~AT_SYMLINK_NOFOLLOW) return (EINVAL); return (kern_fchmodat(td, fd, path, UIO_USERSPACE, mode, flag)); } int kern_chmod(struct thread *td, char *path, enum uio_seg pathseg, int mode) { return (kern_fchmodat(td, AT_FDCWD, path, pathseg, mode, 0)); } /* * Change mode of a file given path name (don't follow links.) */ #ifndef _SYS_SYSPROTO_H_ struct lchmod_args { char *path; int mode; }; #endif int lchmod(td, uap) struct thread *td; register struct lchmod_args /* { char *path; int mode; } */ *uap; { return (kern_fchmodat(td, AT_FDCWD, uap->path, UIO_USERSPACE, uap->mode, AT_SYMLINK_NOFOLLOW)); } int kern_fchmodat(struct thread *td, int fd, char *path, enum uio_seg pathseg, mode_t mode, int flag) { int error; struct nameidata nd; int vfslocked; int follow; AUDIT_ARG(mode, mode); follow = (flag & AT_SYMLINK_NOFOLLOW) ? NOFOLLOW : FOLLOW; NDINIT_AT(&nd, LOOKUP, follow | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); error = setfmode(td, nd.ni_vp, mode); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Change mode of a file given a file descriptor. 
*/ #ifndef _SYS_SYSPROTO_H_ struct fchmod_args { int fd; int mode; }; #endif int fchmod(td, uap) struct thread *td; register struct fchmod_args /* { int fd; int mode; } */ *uap; { struct file *fp; int vfslocked; int error; AUDIT_ARG(fd, uap->fd); AUDIT_ARG(mode, uap->mode); if ((error = getvnode(td->td_proc->p_fd, uap->fd, &fp)) != 0) return (error); vfslocked = VFS_LOCK_GIANT(fp->f_vnode->v_mount); #ifdef AUDIT vn_lock(fp->f_vnode, LK_SHARED | LK_RETRY); AUDIT_ARG(vnode, fp->f_vnode, ARG_VNODE1); VOP_UNLOCK(fp->f_vnode, 0); #endif error = setfmode(td, fp->f_vnode, uap->mode); VFS_UNLOCK_GIANT(vfslocked); fdrop(fp, td); return (error); } /* * Common implementation for chown(), lchown(), and fchown() */ static int setfown(td, vp, uid, gid) struct thread *td; struct vnode *vp; uid_t uid; gid_t gid; { int error; struct mount *mp; struct vattr vattr; if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0) return (error); vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); VATTR_NULL(&vattr); vattr.va_uid = uid; vattr.va_gid = gid; #ifdef MAC error = mac_vnode_check_setowner(td->td_ucred, vp, vattr.va_uid, vattr.va_gid); if (error == 0) #endif error = VOP_SETATTR(vp, &vattr, td->td_ucred); VOP_UNLOCK(vp, 0); vn_finished_write(mp); return (error); } /* * Set ownership given a path name. */ #ifndef _SYS_SYSPROTO_H_ struct chown_args { char *path; int uid; int gid; }; #endif int chown(td, uap) struct thread *td; register struct chown_args /* { char *path; int uid; int gid; } */ *uap; { return (kern_chown(td, uap->path, UIO_USERSPACE, uap->uid, uap->gid)); } #ifndef _SYS_SYSPROTO_H_ struct fchownat_args { int fd; const char * path; uid_t uid; gid_t gid; int flag; }; #endif int fchownat(struct thread *td, struct fchownat_args *uap) { int flag; flag = uap->flag; if (flag & ~AT_SYMLINK_NOFOLLOW) return (EINVAL); return (kern_fchownat(td, uap->fd, uap->path, UIO_USERSPACE, uap->uid, uap->gid, uap->flag)); } int kern_chown(struct thread *td, char *path, enum uio_seg pathseg, int uid, int gid) { return (kern_fchownat(td, AT_FDCWD, path, pathseg, uid, gid, 0)); } int kern_fchownat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int uid, int gid, int flag) { struct nameidata nd; int error, vfslocked, follow; AUDIT_ARG(owner, uid, gid); follow = (flag & AT_SYMLINK_NOFOLLOW) ? NOFOLLOW : FOLLOW; NDINIT_AT(&nd, LOOKUP, follow | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); error = setfown(td, nd.ni_vp, uid, gid); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Set ownership given a path name, do not cross symlinks. */ #ifndef _SYS_SYSPROTO_H_ struct lchown_args { char *path; int uid; int gid; }; #endif int lchown(td, uap) struct thread *td; register struct lchown_args /* { char *path; int uid; int gid; } */ *uap; { return (kern_lchown(td, uap->path, UIO_USERSPACE, uap->uid, uap->gid)); } int kern_lchown(struct thread *td, char *path, enum uio_seg pathseg, int uid, int gid) { return (kern_fchownat(td, AT_FDCWD, path, pathseg, uid, gid, AT_SYMLINK_NOFOLLOW)); } /* * Set ownership given a file descriptor. 
*/ #ifndef _SYS_SYSPROTO_H_ struct fchown_args { int fd; int uid; int gid; }; #endif int fchown(td, uap) struct thread *td; register struct fchown_args /* { int fd; int uid; int gid; } */ *uap; { struct file *fp; int vfslocked; int error; AUDIT_ARG(fd, uap->fd); AUDIT_ARG(owner, uap->uid, uap->gid); if ((error = getvnode(td->td_proc->p_fd, uap->fd, &fp)) != 0) return (error); vfslocked = VFS_LOCK_GIANT(fp->f_vnode->v_mount); #ifdef AUDIT vn_lock(fp->f_vnode, LK_SHARED | LK_RETRY); AUDIT_ARG(vnode, fp->f_vnode, ARG_VNODE1); VOP_UNLOCK(fp->f_vnode, 0); #endif error = setfown(td, fp->f_vnode, uap->uid, uap->gid); VFS_UNLOCK_GIANT(vfslocked); fdrop(fp, td); return (error); } /* * Common implementation code for utimes(), lutimes(), and futimes(). */ static int getutimes(usrtvp, tvpseg, tsp) const struct timeval *usrtvp; enum uio_seg tvpseg; struct timespec *tsp; { struct timeval tv[2]; const struct timeval *tvp; int error; if (usrtvp == NULL) { microtime(&tv[0]); TIMEVAL_TO_TIMESPEC(&tv[0], &tsp[0]); tsp[1] = tsp[0]; } else { if (tvpseg == UIO_SYSSPACE) { tvp = usrtvp; } else { if ((error = copyin(usrtvp, tv, sizeof(tv))) != 0) return (error); tvp = tv; } if (tvp[0].tv_usec < 0 || tvp[0].tv_usec >= 1000000 || tvp[1].tv_usec < 0 || tvp[1].tv_usec >= 1000000) return (EINVAL); TIMEVAL_TO_TIMESPEC(&tvp[0], &tsp[0]); TIMEVAL_TO_TIMESPEC(&tvp[1], &tsp[1]); } return (0); } /* * Common implementation code for utimes(), lutimes(), and futimes(). */ static int setutimes(td, vp, ts, numtimes, nullflag) struct thread *td; struct vnode *vp; const struct timespec *ts; int numtimes; int nullflag; { int error, setbirthtime; struct mount *mp; struct vattr vattr; if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0) return (error); vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); setbirthtime = 0; if (numtimes < 3 && !VOP_GETATTR(vp, &vattr, td->td_ucred) && timespeccmp(&ts[1], &vattr.va_birthtime, < )) setbirthtime = 1; VATTR_NULL(&vattr); vattr.va_atime = ts[0]; vattr.va_mtime = ts[1]; if (setbirthtime) vattr.va_birthtime = ts[1]; if (numtimes > 2) vattr.va_birthtime = ts[2]; if (nullflag) vattr.va_vaflags |= VA_UTIMES_NULL; #ifdef MAC error = mac_vnode_check_setutimes(td->td_ucred, vp, vattr.va_atime, vattr.va_mtime); #endif if (error == 0) error = VOP_SETATTR(vp, &vattr, td->td_ucred); VOP_UNLOCK(vp, 0); vn_finished_write(mp); return (error); } /* * Set the access and modification times of a file. 
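 * If `tptr' is NULL the current time is used and VA_UTIMES_NULL is set so the
 * filesystem can apply the relaxed (write-access) permission check; microsecond
 * values outside the range [0, 1000000) are rejected with EINVAL.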
*/ #ifndef _SYS_SYSPROTO_H_ struct utimes_args { char *path; struct timeval *tptr; }; #endif int utimes(td, uap) struct thread *td; register struct utimes_args /* { char *path; struct timeval *tptr; } */ *uap; { return (kern_utimes(td, uap->path, UIO_USERSPACE, uap->tptr, UIO_USERSPACE)); } #ifndef _SYS_SYSPROTO_H_ struct futimesat_args { int fd; const char * path; const struct timeval * times; }; #endif int futimesat(struct thread *td, struct futimesat_args *uap) { return (kern_utimesat(td, uap->fd, uap->path, UIO_USERSPACE, uap->times, UIO_USERSPACE)); } int kern_utimes(struct thread *td, char *path, enum uio_seg pathseg, struct timeval *tptr, enum uio_seg tptrseg) { return (kern_utimesat(td, AT_FDCWD, path, pathseg, tptr, tptrseg)); } int kern_utimesat(struct thread *td, int fd, char *path, enum uio_seg pathseg, struct timeval *tptr, enum uio_seg tptrseg) { struct nameidata nd; struct timespec ts[2]; int error, vfslocked; if ((error = getutimes(tptr, tptrseg, ts)) != 0) return (error); NDINIT_AT(&nd, LOOKUP, FOLLOW | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); error = setutimes(td, nd.ni_vp, ts, 2, tptr == NULL); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Set the access and modification times of a file. */ #ifndef _SYS_SYSPROTO_H_ struct lutimes_args { char *path; struct timeval *tptr; }; #endif int lutimes(td, uap) struct thread *td; register struct lutimes_args /* { char *path; struct timeval *tptr; } */ *uap; { return (kern_lutimes(td, uap->path, UIO_USERSPACE, uap->tptr, UIO_USERSPACE)); } int kern_lutimes(struct thread *td, char *path, enum uio_seg pathseg, struct timeval *tptr, enum uio_seg tptrseg) { struct timespec ts[2]; int error; struct nameidata nd; int vfslocked; if ((error = getutimes(tptr, tptrseg, ts)) != 0) return (error); NDINIT(&nd, LOOKUP, NOFOLLOW | MPSAFE | AUDITVNODE1, pathseg, path, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); error = setutimes(td, nd.ni_vp, ts, 2, tptr == NULL); vrele(nd.ni_vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Set the access and modification times of a file. */ #ifndef _SYS_SYSPROTO_H_ struct futimes_args { int fd; struct timeval *tptr; }; #endif int futimes(td, uap) struct thread *td; register struct futimes_args /* { int fd; struct timeval *tptr; } */ *uap; { return (kern_futimes(td, uap->fd, uap->tptr, UIO_USERSPACE)); } int kern_futimes(struct thread *td, int fd, struct timeval *tptr, enum uio_seg tptrseg) { struct timespec ts[2]; struct file *fp; int vfslocked; int error; AUDIT_ARG(fd, fd); if ((error = getutimes(tptr, tptrseg, ts)) != 0) return (error); if ((error = getvnode(td->td_proc->p_fd, fd, &fp)) != 0) return (error); vfslocked = VFS_LOCK_GIANT(fp->f_vnode->v_mount); #ifdef AUDIT vn_lock(fp->f_vnode, LK_SHARED | LK_RETRY); AUDIT_ARG(vnode, fp->f_vnode, ARG_VNODE1); VOP_UNLOCK(fp->f_vnode, 0); #endif error = setutimes(td, fp->f_vnode, ts, 2, tptr == NULL); VFS_UNLOCK_GIANT(vfslocked); fdrop(fp, td); return (error); } /* * Truncate a file given its path name. 
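 * Negative lengths are rejected with EINVAL, directories with EISDIR, and
 * write access to the vnode is required.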
*/ #ifndef _SYS_SYSPROTO_H_ struct truncate_args { char *path; int pad; off_t length; }; #endif int truncate(td, uap) struct thread *td; register struct truncate_args /* { char *path; int pad; off_t length; } */ *uap; { return (kern_truncate(td, uap->path, UIO_USERSPACE, uap->length)); } int kern_truncate(struct thread *td, char *path, enum uio_seg pathseg, off_t length) { struct mount *mp; struct vnode *vp; struct vattr vattr; int error; struct nameidata nd; int vfslocked; if (length < 0) return(EINVAL); NDINIT(&nd, LOOKUP, FOLLOW | MPSAFE | AUDITVNODE1, pathseg, path, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); vp = nd.ni_vp; if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0) { vrele(vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } NDFREE(&nd, NDF_ONLY_PNBUF); vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); if (vp->v_type == VDIR) error = EISDIR; #ifdef MAC else if ((error = mac_vnode_check_write(td->td_ucred, NOCRED, vp))) { } #endif else if ((error = vn_writechk(vp)) == 0 && (error = VOP_ACCESS(vp, VWRITE, td->td_ucred, td)) == 0) { VATTR_NULL(&vattr); vattr.va_size = length; error = VOP_SETATTR(vp, &vattr, td->td_ucred); } vput(vp); vn_finished_write(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } #if defined(COMPAT_43) /* * Truncate a file given its path name. */ #ifndef _SYS_SYSPROTO_H_ struct otruncate_args { char *path; long length; }; #endif int otruncate(td, uap) struct thread *td; register struct otruncate_args /* { char *path; long length; } */ *uap; { struct truncate_args /* { char *path; int pad; off_t length; } */ nuap; nuap.path = uap->path; nuap.length = uap->length; return (truncate(td, &nuap)); } #endif /* COMPAT_43 */ /* Versions with the pad argument */ int freebsd6_truncate(struct thread *td, struct freebsd6_truncate_args *uap) { struct truncate_args ouap; ouap.path = uap->path; ouap.length = uap->length; return (truncate(td, &ouap)); } int freebsd6_ftruncate(struct thread *td, struct freebsd6_ftruncate_args *uap) { struct ftruncate_args ouap; ouap.fd = uap->fd; ouap.length = uap->length; return (ftruncate(td, &ouap)); } /* * Sync an open file. */ #ifndef _SYS_SYSPROTO_H_ struct fsync_args { int fd; }; #endif int fsync(td, uap) struct thread *td; struct fsync_args /* { int fd; } */ *uap; { struct vnode *vp; struct mount *mp; struct file *fp; int vfslocked; int error; AUDIT_ARG(fd, uap->fd); if ((error = getvnode(td->td_proc->p_fd, uap->fd, &fp)) != 0) return (error); vp = fp->f_vnode; vfslocked = VFS_LOCK_GIANT(vp->v_mount); if ((error = vn_start_write(vp, &mp, V_WAIT | PCATCH)) != 0) goto drop; vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); AUDIT_ARG(vnode, vp, ARG_VNODE1); if (vp->v_object != NULL) { VM_OBJECT_LOCK(vp->v_object); vm_object_page_clean(vp->v_object, 0, 0, 0); VM_OBJECT_UNLOCK(vp->v_object); } error = VOP_FSYNC(vp, MNT_WAIT, td); VOP_UNLOCK(vp, 0); vn_finished_write(mp); drop: VFS_UNLOCK_GIANT(vfslocked); fdrop(fp, td); return (error); } /* * Rename files. Source and destination must either both be directories, or * both not be directories. If target is a directory, it must be empty. 
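 * If the source and target names resolve to the same vnode the rename is a
 * silent no-op; the internal error value -1 used to flag this case is
 * translated back to success on return.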
*/ #ifndef _SYS_SYSPROTO_H_ struct rename_args { char *from; char *to; }; #endif int rename(td, uap) struct thread *td; register struct rename_args /* { char *from; char *to; } */ *uap; { return (kern_rename(td, uap->from, uap->to, UIO_USERSPACE)); } #ifndef _SYS_SYSPROTO_H_ struct renameat_args { int oldfd; char *old; int newfd; char *new; }; #endif int renameat(struct thread *td, struct renameat_args *uap) { return (kern_renameat(td, uap->oldfd, uap->old, uap->newfd, uap->new, UIO_USERSPACE)); } int kern_rename(struct thread *td, char *from, char *to, enum uio_seg pathseg) { return (kern_renameat(td, AT_FDCWD, from, AT_FDCWD, to, pathseg)); } int kern_renameat(struct thread *td, int oldfd, char *old, int newfd, char *new, enum uio_seg pathseg) { struct mount *mp = NULL; struct vnode *tvp, *fvp, *tdvp; struct nameidata fromnd, tond; int tvfslocked; int fvfslocked; int error; bwillwrite(); #ifdef MAC NDINIT_AT(&fromnd, DELETE, LOCKPARENT | LOCKLEAF | SAVESTART | MPSAFE | AUDITVNODE1, pathseg, old, oldfd, td); #else NDINIT_AT(&fromnd, DELETE, WANTPARENT | SAVESTART | MPSAFE | AUDITVNODE1, pathseg, old, oldfd, td); #endif if ((error = namei(&fromnd)) != 0) return (error); fvfslocked = NDHASGIANT(&fromnd); tvfslocked = 0; #ifdef MAC error = mac_vnode_check_rename_from(td->td_ucred, fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd); VOP_UNLOCK(fromnd.ni_dvp, 0); if (fromnd.ni_dvp != fromnd.ni_vp) VOP_UNLOCK(fromnd.ni_vp, 0); #endif fvp = fromnd.ni_vp; if (error == 0) error = vn_start_write(fvp, &mp, V_WAIT | PCATCH); if (error != 0) { NDFREE(&fromnd, NDF_ONLY_PNBUF); vrele(fromnd.ni_dvp); vrele(fvp); goto out1; } NDINIT_AT(&tond, RENAME, LOCKPARENT | LOCKLEAF | NOCACHE | SAVESTART | MPSAFE | AUDITVNODE2, pathseg, new, newfd, td); if (fromnd.ni_vp->v_type == VDIR) tond.ni_cnd.cn_flags |= WILLBEDIR; if ((error = namei(&tond)) != 0) { /* Translate error code for rename("dir1", "dir2/."). */ if (error == EISDIR && fvp->v_type == VDIR) error = EINVAL; NDFREE(&fromnd, NDF_ONLY_PNBUF); vrele(fromnd.ni_dvp); vrele(fvp); vn_finished_write(mp); goto out1; } tvfslocked = NDHASGIANT(&tond); tdvp = tond.ni_dvp; tvp = tond.ni_vp; if (tvp != NULL) { if (fvp->v_type == VDIR && tvp->v_type != VDIR) { error = ENOTDIR; goto out; } else if (fvp->v_type != VDIR && tvp->v_type == VDIR) { error = EISDIR; goto out; } } if (fvp == tdvp) { error = EINVAL; goto out; } /* * If the source is the same as the destination (that is, if they * are links to the same vnode), then there is nothing to do. */ if (fvp == tvp) error = -1; #ifdef MAC else error = mac_vnode_check_rename_to(td->td_ucred, tdvp, tond.ni_vp, fromnd.ni_dvp == tdvp, &tond.ni_cnd); #endif out: if (!error) { error = VOP_RENAME(fromnd.ni_dvp, fromnd.ni_vp, &fromnd.ni_cnd, tond.ni_dvp, tond.ni_vp, &tond.ni_cnd); NDFREE(&fromnd, NDF_ONLY_PNBUF); NDFREE(&tond, NDF_ONLY_PNBUF); } else { NDFREE(&fromnd, NDF_ONLY_PNBUF); NDFREE(&tond, NDF_ONLY_PNBUF); if (tvp) vput(tvp); if (tdvp == tvp) vrele(tdvp); else vput(tdvp); vrele(fromnd.ni_dvp); vrele(fvp); } vrele(tond.ni_startdir); vn_finished_write(mp); out1: if (fromnd.ni_startdir) vrele(fromnd.ni_startdir); VFS_UNLOCK_GIANT(fvfslocked); VFS_UNLOCK_GIANT(tvfslocked); if (error == -1) return (0); return (error); } /* * Make a directory file. 
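 * The new directory's mode is the requested mode masked by ACCESSPERMS and the
 * process file-creation mask.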
*/ #ifndef _SYS_SYSPROTO_H_ struct mkdir_args { char *path; int mode; }; #endif int mkdir(td, uap) struct thread *td; register struct mkdir_args /* { char *path; int mode; } */ *uap; { return (kern_mkdir(td, uap->path, UIO_USERSPACE, uap->mode)); } #ifndef _SYS_SYSPROTO_H_ struct mkdirat_args { int fd; char *path; mode_t mode; }; #endif int mkdirat(struct thread *td, struct mkdirat_args *uap) { return (kern_mkdirat(td, uap->fd, uap->path, UIO_USERSPACE, uap->mode)); } int kern_mkdir(struct thread *td, char *path, enum uio_seg segflg, int mode) { return (kern_mkdirat(td, AT_FDCWD, path, segflg, mode)); } int kern_mkdirat(struct thread *td, int fd, char *path, enum uio_seg segflg, int mode) { struct mount *mp; struct vnode *vp; struct vattr vattr; int error; struct nameidata nd; int vfslocked; AUDIT_ARG(mode, mode); restart: bwillwrite(); NDINIT_AT(&nd, CREATE, LOCKPARENT | SAVENAME | MPSAFE | AUDITVNODE1, segflg, path, fd, td); nd.ni_cnd.cn_flags |= WILLBEDIR; if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); vp = nd.ni_vp; if (vp != NULL) { NDFREE(&nd, NDF_ONLY_PNBUF); /* * XXX namei called with LOCKPARENT but not LOCKLEAF has * the strange behaviour of leaving the vnode unlocked * if the target is the same vnode as the parent. */ if (vp == nd.ni_dvp) vrele(nd.ni_dvp); else vput(nd.ni_dvp); vrele(vp); VFS_UNLOCK_GIANT(vfslocked); return (EEXIST); } if (vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); VFS_UNLOCK_GIANT(vfslocked); if ((error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH)) != 0) return (error); goto restart; } VATTR_NULL(&vattr); vattr.va_type = VDIR; FILEDESC_SLOCK(td->td_proc->p_fd); vattr.va_mode = (mode & ACCESSPERMS) &~ td->td_proc->p_fd->fd_cmask; FILEDESC_SUNLOCK(td->td_proc->p_fd); #ifdef MAC error = mac_vnode_check_create(td->td_ucred, nd.ni_dvp, &nd.ni_cnd, &vattr); if (error) goto out; #endif error = VOP_MKDIR(nd.ni_dvp, &nd.ni_vp, &nd.ni_cnd, &vattr); #ifdef MAC out: #endif NDFREE(&nd, NDF_ONLY_PNBUF); vput(nd.ni_dvp); if (!error) vput(nd.ni_vp); vn_finished_write(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Remove a directory file. */ #ifndef _SYS_SYSPROTO_H_ struct rmdir_args { char *path; }; #endif int rmdir(td, uap) struct thread *td; struct rmdir_args /* { char *path; } */ *uap; { return (kern_rmdir(td, uap->path, UIO_USERSPACE)); } int kern_rmdir(struct thread *td, char *path, enum uio_seg pathseg) { return (kern_rmdirat(td, AT_FDCWD, path, pathseg)); } int kern_rmdirat(struct thread *td, int fd, char *path, enum uio_seg pathseg) { struct mount *mp; struct vnode *vp; int error; struct nameidata nd; int vfslocked; restart: bwillwrite(); NDINIT_AT(&nd, DELETE, LOCKPARENT | LOCKLEAF | MPSAFE | AUDITVNODE1, pathseg, path, fd, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); vp = nd.ni_vp; if (vp->v_type != VDIR) { error = ENOTDIR; goto out; } /* * No rmdir "." please. */ if (nd.ni_dvp == vp) { error = EINVAL; goto out; } /* * The root of a mounted filesystem cannot be deleted. 
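 * EBUSY is returned in that case; the filesystem has to be unmounted before
 * its root directory can be removed.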
*/ if (vp->v_vflag & VV_ROOT) { error = EBUSY; goto out; } #ifdef MAC error = mac_vnode_check_unlink(td->td_ucred, nd.ni_dvp, vp, &nd.ni_cnd); if (error) goto out; #endif if (vn_start_write(nd.ni_dvp, &mp, V_NOWAIT) != 0) { NDFREE(&nd, NDF_ONLY_PNBUF); vput(vp); if (nd.ni_dvp == vp) vrele(nd.ni_dvp); else vput(nd.ni_dvp); VFS_UNLOCK_GIANT(vfslocked); if ((error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH)) != 0) return (error); goto restart; } error = VOP_RMDIR(nd.ni_dvp, nd.ni_vp, &nd.ni_cnd); vn_finished_write(mp); out: NDFREE(&nd, NDF_ONLY_PNBUF); vput(vp); if (nd.ni_dvp == vp) vrele(nd.ni_dvp); else vput(nd.ni_dvp); VFS_UNLOCK_GIANT(vfslocked); return (error); } #ifdef COMPAT_43 /* * Read a block of directory entries in a filesystem independent format. */ #ifndef _SYS_SYSPROTO_H_ struct ogetdirentries_args { int fd; char *buf; u_int count; long *basep; }; #endif int ogetdirentries(td, uap) struct thread *td; register struct ogetdirentries_args /* { int fd; char *buf; u_int count; long *basep; } */ *uap; { struct vnode *vp; struct file *fp; struct uio auio, kuio; struct iovec aiov, kiov; struct dirent *dp, *edp; caddr_t dirbuf; int error, eofflag, readcnt, vfslocked; long loff; /* XXX arbitrary sanity limit on `count'. */ if (uap->count > 64 * 1024) return (EINVAL); if ((error = getvnode(td->td_proc->p_fd, uap->fd, &fp)) != 0) return (error); if ((fp->f_flag & FREAD) == 0) { fdrop(fp, td); return (EBADF); } vp = fp->f_vnode; unionread: vfslocked = VFS_LOCK_GIANT(vp->v_mount); if (vp->v_type != VDIR) { VFS_UNLOCK_GIANT(vfslocked); fdrop(fp, td); return (EINVAL); } aiov.iov_base = uap->buf; aiov.iov_len = uap->count; auio.uio_iov = &aiov; auio.uio_iovcnt = 1; auio.uio_rw = UIO_READ; auio.uio_segflg = UIO_USERSPACE; auio.uio_td = td; auio.uio_resid = uap->count; vn_lock(vp, LK_SHARED | LK_RETRY); loff = auio.uio_offset = fp->f_offset; #ifdef MAC error = mac_vnode_check_readdir(td->td_ucred, vp); if (error) { VOP_UNLOCK(vp, 0); VFS_UNLOCK_GIANT(vfslocked); fdrop(fp, td); return (error); } #endif # if (BYTE_ORDER != LITTLE_ENDIAN) if (vp->v_mount->mnt_maxsymlinklen <= 0) { error = VOP_READDIR(vp, &auio, fp->f_cred, &eofflag, NULL, NULL); fp->f_offset = auio.uio_offset; } else # endif { kuio = auio; kuio.uio_iov = &kiov; kuio.uio_segflg = UIO_SYSSPACE; kiov.iov_len = uap->count; dirbuf = malloc(uap->count, M_TEMP, M_WAITOK); kiov.iov_base = dirbuf; error = VOP_READDIR(vp, &kuio, fp->f_cred, &eofflag, NULL, NULL); fp->f_offset = kuio.uio_offset; if (error == 0) { readcnt = uap->count - kuio.uio_resid; edp = (struct dirent *)&dirbuf[readcnt]; for (dp = (struct dirent *)dirbuf; dp < edp; ) { # if (BYTE_ORDER == LITTLE_ENDIAN) /* * The expected low byte of * dp->d_namlen is our dp->d_type. * The high MBZ byte of dp->d_namlen * is our dp->d_namlen. */ dp->d_type = dp->d_namlen; dp->d_namlen = 0; # else /* * The dp->d_type is the high byte * of the expected dp->d_namlen, * so must be zero'ed. 
*/ dp->d_type = 0; # endif if (dp->d_reclen > 0) { dp = (struct dirent *) ((char *)dp + dp->d_reclen); } else { error = EIO; break; } } if (dp >= edp) error = uiomove(dirbuf, readcnt, &auio); } free(dirbuf, M_TEMP); } if (error) { VOP_UNLOCK(vp, 0); VFS_UNLOCK_GIANT(vfslocked); fdrop(fp, td); return (error); } if (uap->count == auio.uio_resid && (vp->v_vflag & VV_ROOT) && (vp->v_mount->mnt_flag & MNT_UNION)) { struct vnode *tvp = vp; vp = vp->v_mount->mnt_vnodecovered; VREF(vp); fp->f_vnode = vp; fp->f_data = vp; fp->f_offset = 0; vput(tvp); VFS_UNLOCK_GIANT(vfslocked); goto unionread; } VOP_UNLOCK(vp, 0); VFS_UNLOCK_GIANT(vfslocked); error = copyout(&loff, uap->basep, sizeof(long)); fdrop(fp, td); td->td_retval[0] = uap->count - auio.uio_resid; return (error); } #endif /* COMPAT_43 */ /* * Read a block of directory entries in a filesystem independent format. */ #ifndef _SYS_SYSPROTO_H_ struct getdirentries_args { int fd; char *buf; u_int count; long *basep; }; #endif int getdirentries(td, uap) struct thread *td; register struct getdirentries_args /* { int fd; char *buf; u_int count; long *basep; } */ *uap; { long base; int error; error = kern_getdirentries(td, uap->fd, uap->buf, uap->count, &base); if (error) return (error); if (uap->basep != NULL) error = copyout(&base, uap->basep, sizeof(long)); return (error); } int kern_getdirentries(struct thread *td, int fd, char *buf, u_int count, long *basep) { struct vnode *vp; struct file *fp; struct uio auio; struct iovec aiov; int vfslocked; long loff; int error, eofflag; AUDIT_ARG(fd, fd); if (count > INT_MAX) return (EINVAL); if ((error = getvnode(td->td_proc->p_fd, fd, &fp)) != 0) return (error); if ((fp->f_flag & FREAD) == 0) { fdrop(fp, td); return (EBADF); } vp = fp->f_vnode; unionread: vfslocked = VFS_LOCK_GIANT(vp->v_mount); if (vp->v_type != VDIR) { VFS_UNLOCK_GIANT(vfslocked); error = EINVAL; goto fail; } aiov.iov_base = buf; aiov.iov_len = count; auio.uio_iov = &aiov; auio.uio_iovcnt = 1; auio.uio_rw = UIO_READ; auio.uio_segflg = UIO_USERSPACE; auio.uio_td = td; auio.uio_resid = count; vn_lock(vp, LK_SHARED | LK_RETRY); AUDIT_ARG(vnode, vp, ARG_VNODE1); loff = auio.uio_offset = fp->f_offset; #ifdef MAC error = mac_vnode_check_readdir(td->td_ucred, vp); if (error == 0) #endif error = VOP_READDIR(vp, &auio, fp->f_cred, &eofflag, NULL, NULL); fp->f_offset = auio.uio_offset; if (error) { VOP_UNLOCK(vp, 0); VFS_UNLOCK_GIANT(vfslocked); goto fail; } if (count == auio.uio_resid && (vp->v_vflag & VV_ROOT) && (vp->v_mount->mnt_flag & MNT_UNION)) { struct vnode *tvp = vp; vp = vp->v_mount->mnt_vnodecovered; VREF(vp); fp->f_vnode = vp; fp->f_data = vp; fp->f_offset = 0; vput(tvp); VFS_UNLOCK_GIANT(vfslocked); goto unionread; } VOP_UNLOCK(vp, 0); VFS_UNLOCK_GIANT(vfslocked); *basep = loff; td->td_retval[0] = count - auio.uio_resid; fail: fdrop(fp, td); return (error); } #ifndef _SYS_SYSPROTO_H_ struct getdents_args { int fd; char *buf; size_t count; }; #endif int getdents(td, uap) struct thread *td; register struct getdents_args /* { int fd; char *buf; u_int count; } */ *uap; { struct getdirentries_args ap; ap.fd = uap->fd; ap.buf = uap->buf; ap.count = uap->count; ap.basep = NULL; return (getdirentries(td, &ap)); } /* * Set the mode mask for creation of filesystem nodes. 
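 *
 * Userland sketch (illustrative; assumes <sys/stat.h>): umask(2)
 * installs the new mask and returns the previous one, so a caller
 * can save and restore it:
 *	mode_t old = umask(S_IWGRP | S_IWOTH);
 *	(void)umask(old);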
*/ #ifndef _SYS_SYSPROTO_H_ struct umask_args { int newmask; }; #endif int umask(td, uap) struct thread *td; struct umask_args /* { int newmask; } */ *uap; { register struct filedesc *fdp; FILEDESC_XLOCK(td->td_proc->p_fd); fdp = td->td_proc->p_fd; td->td_retval[0] = fdp->fd_cmask; fdp->fd_cmask = uap->newmask & ALLPERMS; FILEDESC_XUNLOCK(td->td_proc->p_fd); return (0); } /* * Void all references to file by ripping underlying filesystem away from * vnode. */ #ifndef _SYS_SYSPROTO_H_ struct revoke_args { char *path; }; #endif int revoke(td, uap) struct thread *td; register struct revoke_args /* { char *path; } */ *uap; { struct vnode *vp; struct vattr vattr; int error; struct nameidata nd; int vfslocked; NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | MPSAFE | AUDITVNODE1, UIO_USERSPACE, uap->path, td); if ((error = namei(&nd)) != 0) return (error); vfslocked = NDHASGIANT(&nd); vp = nd.ni_vp; NDFREE(&nd, NDF_ONLY_PNBUF); if (vp->v_type != VCHR) { error = EINVAL; goto out; } #ifdef MAC error = mac_vnode_check_revoke(td->td_ucred, vp); if (error) goto out; #endif error = VOP_GETATTR(vp, &vattr, td->td_ucred); if (error) goto out; if (td->td_ucred->cr_uid != vattr.va_uid) { error = priv_check(td, PRIV_VFS_ADMIN); if (error) goto out; } if (vcount(vp) > 1) VOP_REVOKE(vp, REVOKEALL); out: vput(vp); VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Convert a user file descriptor to a kernel file entry. * A reference on the file entry is held upon returning. */ int getvnode(fdp, fd, fpp) struct filedesc *fdp; int fd; struct file **fpp; { int error; struct file *fp; error = 0; fp = NULL; if (fdp == NULL || (fp = fget_unlocked(fdp, fd)) == NULL) error = EBADF; else if (fp->f_vnode == NULL) { error = EINVAL; fdrop(fp, curthread); } *fpp = fp; return (error); } /* * Get an (NFS) file handle. */ #ifndef _SYS_SYSPROTO_H_ struct lgetfh_args { char *fname; fhandle_t *fhp; }; #endif int lgetfh(td, uap) struct thread *td; register struct lgetfh_args *uap; { struct nameidata nd; fhandle_t fh; register struct vnode *vp; int vfslocked; int error; error = priv_check(td, PRIV_VFS_GETFH); if (error) return (error); NDINIT(&nd, LOOKUP, NOFOLLOW | LOCKLEAF | MPSAFE | AUDITVNODE1, UIO_USERSPACE, uap->fname, td); error = namei(&nd); if (error) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); vp = nd.ni_vp; bzero(&fh, sizeof(fh)); fh.fh_fsid = vp->v_mount->mnt_stat.f_fsid; error = VOP_VPTOFH(vp, &fh.fh_fid); vput(vp); VFS_UNLOCK_GIANT(vfslocked); if (error) return (error); error = copyout(&fh, uap->fhp, sizeof (fh)); return (error); } #ifndef _SYS_SYSPROTO_H_ struct getfh_args { char *fname; fhandle_t *fhp; }; #endif int getfh(td, uap) struct thread *td; register struct getfh_args *uap; { struct nameidata nd; fhandle_t fh; register struct vnode *vp; int vfslocked; int error; error = priv_check(td, PRIV_VFS_GETFH); if (error) return (error); NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | MPSAFE | AUDITVNODE1, UIO_USERSPACE, uap->fname, td); error = namei(&nd); if (error) return (error); vfslocked = NDHASGIANT(&nd); NDFREE(&nd, NDF_ONLY_PNBUF); vp = nd.ni_vp; bzero(&fh, sizeof(fh)); fh.fh_fsid = vp->v_mount->mnt_stat.f_fsid; error = VOP_VPTOFH(vp, &fh.fh_fid); vput(vp); VFS_UNLOCK_GIANT(vfslocked); if (error) return (error); error = copyout(&fh, uap->fhp, sizeof (fh)); return (error); } /* * syscall for the rpc.lockd to use to translate a NFS file handle into an * open descriptor. * * warning: do not remove the priv_check() call or this becomes one giant * security hole. 
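 *
 * Privileged (root-only) userland sketch, illustrative, error
 * handling omitted; assumes <sys/param.h>, <sys/mount.h> and
 * <fcntl.h>, and "/export/somefile" is a placeholder path:
 *	fhandle_t fh;
 *	int fd;
 *	if (getfh("/export/somefile", &fh) == 0)
 *		fd = fhopen(&fh, O_RDWR);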
*/ #ifndef _SYS_SYSPROTO_H_ struct fhopen_args { const struct fhandle *u_fhp; int flags; }; #endif int fhopen(td, uap) struct thread *td; struct fhopen_args /* { const struct fhandle *u_fhp; int flags; } */ *uap; { struct proc *p = td->td_proc; struct mount *mp; struct vnode *vp; struct fhandle fhp; struct vattr vat; struct vattr *vap = &vat; struct flock lf; struct file *fp; register struct filedesc *fdp = p->p_fd; int fmode, error, type; accmode_t accmode; struct file *nfp; int vfslocked; int indx; error = priv_check(td, PRIV_VFS_FHOPEN); if (error) return (error); fmode = FFLAGS(uap->flags); /* why not allow a non-read/write open for our lockd? */ if (((fmode & (FREAD | FWRITE)) == 0) || (fmode & O_CREAT)) return (EINVAL); error = copyin(uap->u_fhp, &fhp, sizeof(fhp)); if (error) return(error); /* find the mount point */ mp = vfs_busyfs(&fhp.fh_fsid); if (mp == NULL) return (ESTALE); vfslocked = VFS_LOCK_GIANT(mp); /* now give me my vnode, it gets returned to me locked */ error = VFS_FHTOVP(mp, &fhp.fh_fid, &vp); vfs_unbusy(mp); if (error) goto out; /* * from now on we have to make sure not * to forget about the vnode * any error that causes an abort must vput(vp) * just set error = err and 'goto bad;'. */ /* * from vn_open */ if (vp->v_type == VLNK) { error = EMLINK; goto bad; } if (vp->v_type == VSOCK) { error = EOPNOTSUPP; goto bad; } accmode = 0; if (fmode & (FWRITE | O_TRUNC)) { if (vp->v_type == VDIR) { error = EISDIR; goto bad; } error = vn_writechk(vp); if (error) goto bad; accmode |= VWRITE; } if (fmode & FREAD) accmode |= VREAD; if (fmode & O_APPEND) accmode |= VAPPEND; #ifdef MAC error = mac_vnode_check_open(td->td_ucred, vp, accmode); if (error) goto bad; #endif if (accmode) { error = VOP_ACCESS(vp, accmode, td->td_ucred, td); if (error) goto bad; } if (fmode & O_TRUNC) { VOP_UNLOCK(vp, 0); /* XXX */ if ((error = vn_start_write(NULL, &mp, V_WAIT | PCATCH)) != 0) { vrele(vp); goto out; } vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); /* XXX */ #ifdef MAC /* * We don't yet have fp->f_cred, so use td->td_ucred, which * should be right. */ error = mac_vnode_check_write(td->td_ucred, td->td_ucred, vp); if (error == 0) { #endif VATTR_NULL(vap); vap->va_size = 0; error = VOP_SETATTR(vp, vap, td->td_ucred); #ifdef MAC } #endif vn_finished_write(mp); if (error) goto bad; } error = VOP_OPEN(vp, fmode, td->td_ucred, td, NULL); if (error) goto bad; if (fmode & FWRITE) vp->v_writecount++; /* * end of vn_open code */ if ((error = falloc(td, &nfp, &indx)) != 0) { if (fmode & FWRITE) vp->v_writecount--; goto bad; } /* An extra reference on `nfp' has been held for us by falloc(). */ fp = nfp; nfp->f_vnode = vp; finit(nfp, fmode & FMASK, DTYPE_VNODE, vp, &vnops); if (fmode & (O_EXLOCK | O_SHLOCK)) { lf.l_whence = SEEK_SET; lf.l_start = 0; lf.l_len = 0; if (fmode & O_EXLOCK) lf.l_type = F_WRLCK; else lf.l_type = F_RDLCK; type = F_FLOCK; if ((fmode & FNONBLOCK) == 0) type |= F_WAIT; VOP_UNLOCK(vp, 0); if ((error = VOP_ADVLOCK(vp, (caddr_t)fp, F_SETLK, &lf, type)) != 0) { /* * The lock request failed. Normally close the * descriptor but handle the case where someone might * have dup()d or close()d it when we weren't looking. */ fdclose(fdp, fp, indx, td); /* * release our private reference */ fdrop(fp, td); goto out; } vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); atomic_set_int(&fp->f_flag, FHASLOCK); } VOP_UNLOCK(vp, 0); fdrop(fp, td); vfs_rel(mp); VFS_UNLOCK_GIANT(vfslocked); td->td_retval[0] = indx; return (0); bad: vput(vp); out: VFS_UNLOCK_GIANT(vfslocked); return (error); } /* * Stat an (NFS) file handle. 
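 *
 * Sketch (illustrative): given a handle obtained with getfh(2),
 *	struct stat sb;
 *	error = fhstat(&fh, &sb);
 * is subject to the PRIV_VFS_FHSTAT check below.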
*/ #ifndef _SYS_SYSPROTO_H_ struct fhstat_args { struct fhandle *u_fhp; struct stat *sb; }; #endif int fhstat(td, uap) struct thread *td; register struct fhstat_args /* { struct fhandle *u_fhp; struct stat *sb; } */ *uap; { struct stat sb; fhandle_t fh; struct mount *mp; struct vnode *vp; int vfslocked; int error; error = priv_check(td, PRIV_VFS_FHSTAT); if (error) return (error); error = copyin(uap->u_fhp, &fh, sizeof(fhandle_t)); if (error) return (error); if ((mp = vfs_busyfs(&fh.fh_fsid)) == NULL) return (ESTALE); vfslocked = VFS_LOCK_GIANT(mp); error = VFS_FHTOVP(mp, &fh.fh_fid, &vp); vfs_unbusy(mp); if (error) { VFS_UNLOCK_GIANT(vfslocked); return (error); } error = vn_stat(vp, &sb, td->td_ucred, NOCRED, td); vput(vp); VFS_UNLOCK_GIANT(vfslocked); if (error) return (error); error = copyout(&sb, uap->sb, sizeof(sb)); return (error); } /* * Implement fstatfs() for (NFS) file handles. */ #ifndef _SYS_SYSPROTO_H_ struct fhstatfs_args { struct fhandle *u_fhp; struct statfs *buf; }; #endif int fhstatfs(td, uap) struct thread *td; struct fhstatfs_args /* { struct fhandle *u_fhp; struct statfs *buf; } */ *uap; { struct statfs sf; fhandle_t fh; int error; error = copyin(uap->u_fhp, &fh, sizeof(fhandle_t)); if (error) return (error); error = kern_fhstatfs(td, fh, &sf); if (error) return (error); return (copyout(&sf, uap->buf, sizeof(sf))); } int kern_fhstatfs(struct thread *td, fhandle_t fh, struct statfs *buf) { struct statfs *sp; struct mount *mp; struct vnode *vp; int vfslocked; int error; error = priv_check(td, PRIV_VFS_FHSTATFS); if (error) return (error); if ((mp = vfs_busyfs(&fh.fh_fsid)) == NULL) return (ESTALE); vfslocked = VFS_LOCK_GIANT(mp); error = VFS_FHTOVP(mp, &fh.fh_fid, &vp); if (error) { vfs_unbusy(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } vput(vp); error = prison_canseemount(td->td_ucred, mp); if (error) goto out; #ifdef MAC error = mac_mount_check_stat(td->td_ucred, mp); if (error) goto out; #endif /* * Set these in case the underlying filesystem fails to do so. */ sp = &mp->mnt_stat; sp->f_version = STATFS_VERSION; sp->f_namemax = NAME_MAX; sp->f_flags = mp->mnt_flag & MNT_VISFLAGMASK; error = VFS_STATFS(mp, sp); if (error == 0) *buf = *sp; out: vfs_unbusy(mp); VFS_UNLOCK_GIANT(vfslocked); return (error); } Index: head/sys/net/rtsock.c =================================================================== --- head/sys/net/rtsock.c (revision 192894) +++ head/sys/net/rtsock.c (revision 192895) @@ -1,1499 +1,1503 @@ /*- * Copyright (c) 1988, 1991, 1993 * The Regents of the University of California. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)rtsock.c 8.7 (Berkeley) 10/12/95 * $FreeBSD$ */ #include "opt_sctp.h" #include "opt_mpath.h" #include "opt_route.h" #include "opt_inet.h" #include "opt_inet6.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #endif #ifdef SCTP extern void sctp_addr_change(struct ifaddr *ifa, int cmd); #endif /* SCTP */ MALLOC_DEFINE(M_RTABLE, "routetbl", "routing tables"); /* NB: these are not modified */ static struct sockaddr route_src = { 2, PF_ROUTE, }; static struct sockaddr sa_zero = { sizeof(sa_zero), AF_INET, }; static struct { int ip_count; /* attached w/ AF_INET */ int ip6_count; /* attached w/ AF_INET6 */ int ipx_count; /* attached w/ AF_IPX */ int any_count; /* total attached */ } route_cb; struct mtx rtsock_mtx; MTX_SYSINIT(rtsock, &rtsock_mtx, "rtsock route_cb lock", MTX_DEF); #define RTSOCK_LOCK() mtx_lock(&rtsock_mtx) #define RTSOCK_UNLOCK() mtx_unlock(&rtsock_mtx) #define RTSOCK_LOCK_ASSERT() mtx_assert(&rtsock_mtx, MA_OWNED) static struct ifqueue rtsintrq; SYSCTL_NODE(_net, OID_AUTO, route, CTLFLAG_RD, 0, ""); SYSCTL_INT(_net_route, OID_AUTO, netisr_maxqlen, CTLFLAG_RW, &rtsintrq.ifq_maxlen, 0, "maximum routing socket dispatch queue length"); struct walkarg { int w_tmemsize; int w_op, w_arg; caddr_t w_tmem; struct sysctl_req *w_req; }; static void rts_input(struct mbuf *m); static struct mbuf *rt_msg1(int type, struct rt_addrinfo *rtinfo); static int rt_msg2(int type, struct rt_addrinfo *rtinfo, caddr_t cp, struct walkarg *w); static int rt_xaddrs(caddr_t cp, caddr_t cplim, struct rt_addrinfo *rtinfo); static int sysctl_dumpentry(struct radix_node *rn, void *vw); static int sysctl_iflist(int af, struct walkarg *w); static int sysctl_ifmalist(int af, struct walkarg *w); static int route_output(struct mbuf *m, struct socket *so); static void rt_setmetrics(u_long which, const struct rt_metrics *in, struct rt_metrics_lite *out); static void rt_getmetrics(const struct rt_metrics_lite *in, struct rt_metrics *out); static void rt_dispatch(struct mbuf *, const struct sockaddr *); static void rts_init(void) { int tmp; rtsintrq.ifq_maxlen = 256; if (TUNABLE_INT_FETCH("net.route.netisr_maxqlen", &tmp)) rtsintrq.ifq_maxlen = tmp; mtx_init(&rtsintrq.ifq_mtx, "rts_inq", NULL, MTX_DEF); netisr_register(NETISR_ROUTE, rts_input, &rtsintrq, 0); } SYSINIT(rtsock, SI_SUB_PROTO_DOMAIN, SI_ORDER_THIRD, rts_init, 0); static void rts_input(struct mbuf *m) { struct sockproto route_proto; unsigned short *family; struct m_tag *tag; route_proto.sp_family = PF_ROUTE; tag = m_tag_find(m, PACKET_TAG_RTSOCKFAM, NULL); if (tag != NULL) { family = (unsigned short *)(tag + 1); route_proto.sp_protocol = *family; m_tag_delete(m, tag); } else route_proto.sp_protocol = 0; raw_input(m, &route_proto, &route_src); } /* * It really doesn't make any sense at all for this code to share much * with raw_usrreq.c, since 
its functionality is so restricted. XXX */ static void rts_abort(struct socket *so) { raw_usrreqs.pru_abort(so); } static void rts_close(struct socket *so) { raw_usrreqs.pru_close(so); } /* pru_accept is EOPNOTSUPP */ static int rts_attach(struct socket *so, int proto, struct thread *td) { struct rawcb *rp; int s, error; KASSERT(so->so_pcb == NULL, ("rts_attach: so_pcb != NULL")); /* XXX */ rp = malloc(sizeof *rp, M_PCB, M_WAITOK | M_ZERO); if (rp == NULL) return ENOBUFS; /* * The splnet() is necessary to block protocols from sending * error notifications (like RTM_REDIRECT or RTM_LOSING) while * this PCB is extant but incompletely initialized. * Probably we should try to do more of this work beforehand and * eliminate the spl. */ s = splnet(); so->so_pcb = (caddr_t)rp; so->so_fibnum = td->td_proc->p_fibnum; error = raw_attach(so, proto); rp = sotorawcb(so); if (error) { splx(s); so->so_pcb = NULL; free(rp, M_PCB); return error; } RTSOCK_LOCK(); switch(rp->rcb_proto.sp_protocol) { case AF_INET: route_cb.ip_count++; break; case AF_INET6: route_cb.ip6_count++; break; case AF_IPX: route_cb.ipx_count++; break; } route_cb.any_count++; RTSOCK_UNLOCK(); soisconnected(so); so->so_options |= SO_USELOOPBACK; splx(s); return 0; } static int rts_bind(struct socket *so, struct sockaddr *nam, struct thread *td) { return (raw_usrreqs.pru_bind(so, nam, td)); /* xxx just EINVAL */ } static int rts_connect(struct socket *so, struct sockaddr *nam, struct thread *td) { return (raw_usrreqs.pru_connect(so, nam, td)); /* XXX just EINVAL */ } /* pru_connect2 is EOPNOTSUPP */ /* pru_control is EOPNOTSUPP */ static void rts_detach(struct socket *so) { struct rawcb *rp = sotorawcb(so); KASSERT(rp != NULL, ("rts_detach: rp == NULL")); RTSOCK_LOCK(); switch(rp->rcb_proto.sp_protocol) { case AF_INET: route_cb.ip_count--; break; case AF_INET6: route_cb.ip6_count--; break; case AF_IPX: route_cb.ipx_count--; break; } route_cb.any_count--; RTSOCK_UNLOCK(); raw_usrreqs.pru_detach(so); } static int rts_disconnect(struct socket *so) { return (raw_usrreqs.pru_disconnect(so)); } /* pru_listen is EOPNOTSUPP */ static int rts_peeraddr(struct socket *so, struct sockaddr **nam) { return (raw_usrreqs.pru_peeraddr(so, nam)); } /* pru_rcvd is EOPNOTSUPP */ /* pru_rcvoob is EOPNOTSUPP */ static int rts_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *nam, struct mbuf *control, struct thread *td) { return (raw_usrreqs.pru_send(so, flags, m, nam, control, td)); } /* pru_sense is null */ static int rts_shutdown(struct socket *so) { return (raw_usrreqs.pru_shutdown(so)); } static int rts_sockaddr(struct socket *so, struct sockaddr **nam) { return (raw_usrreqs.pru_sockaddr(so, nam)); } static struct pr_usrreqs route_usrreqs = { .pru_abort = rts_abort, .pru_attach = rts_attach, .pru_bind = rts_bind, .pru_connect = rts_connect, .pru_detach = rts_detach, .pru_disconnect = rts_disconnect, .pru_peeraddr = rts_peeraddr, .pru_send = rts_send, .pru_shutdown = rts_shutdown, .pru_sockaddr = rts_sockaddr, .pru_close = rts_close, }; #ifndef _SOCKADDR_UNION_DEFINED #define _SOCKADDR_UNION_DEFINED /* * The union of all possible address formats we handle. */ union sockaddr_union { struct sockaddr sa; struct sockaddr_in sin; struct sockaddr_in6 sin6; }; #endif /* _SOCKADDR_UNION_DEFINED */ static int rtm_get_jailed(struct rt_addrinfo *info, struct ifnet *ifp, struct rtentry *rt, union sockaddr_union *saun, struct ucred *cred) { /* First, see if the returned address is part of the jail. 
*/ if (prison_if(cred, rt->rt_ifa->ifa_addr) == 0) { info->rti_info[RTAX_IFA] = rt->rt_ifa->ifa_addr; return (0); } switch (info->rti_info[RTAX_DST]->sa_family) { #ifdef INET case AF_INET: { struct in_addr ia; struct ifaddr *ifa; int found; found = 0; /* * Try to find an address on the given outgoing interface * that belongs to the jail. */ IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { struct sockaddr *sa; sa = ifa->ifa_addr; if (sa->sa_family != AF_INET) continue; ia = ((struct sockaddr_in *)sa)->sin_addr; if (prison_check_ip4(cred, &ia) == 0) { found = 1; break; } } IF_ADDR_UNLOCK(ifp); if (!found) { /* * As a last resort return the 'default' jail address. */ + ia = ((struct sockaddr_in *)rt->rt_ifa->ifa_addr)-> + sin_addr; if (prison_get_ip4(cred, &ia) != 0) return (ESRCH); } bzero(&saun->sin, sizeof(struct sockaddr_in)); saun->sin.sin_len = sizeof(struct sockaddr_in); saun->sin.sin_family = AF_INET; saun->sin.sin_addr.s_addr = ia.s_addr; info->rti_info[RTAX_IFA] = (struct sockaddr *)&saun->sin; break; } #endif #ifdef INET6 case AF_INET6: { struct in6_addr ia6; struct ifaddr *ifa; int found; found = 0; /* * Try to find an address on the given outgoing interface * that belongs to the jail. */ IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { struct sockaddr *sa; sa = ifa->ifa_addr; if (sa->sa_family != AF_INET6) continue; bcopy(&((struct sockaddr_in6 *)sa)->sin6_addr, &ia6, sizeof(struct in6_addr)); if (prison_check_ip6(cred, &ia6) == 0) { found = 1; break; } } IF_ADDR_UNLOCK(ifp); if (!found) { /* * As a last resort return the 'default' jail address. */ + ia6 = ((struct sockaddr_in6 *)rt->rt_ifa->ifa_addr)-> + sin6_addr; if (prison_get_ip6(cred, &ia6) != 0) return (ESRCH); } bzero(&saun->sin6, sizeof(struct sockaddr_in6)); saun->sin6.sin6_len = sizeof(struct sockaddr_in6); saun->sin6.sin6_family = AF_INET6; bcopy(&ia6, &saun->sin6.sin6_addr, sizeof(struct in6_addr)); if (sa6_recoverscope(&saun->sin6) != 0) return (ESRCH); info->rti_info[RTAX_IFA] = (struct sockaddr *)&saun->sin6; break; } #endif default: return (ESRCH); } return (0); } /*ARGSUSED*/ static int route_output(struct mbuf *m, struct socket *so) { #define sa_equal(a1, a2) (bcmp((a1), (a2), (a1)->sa_len) == 0) INIT_VNET_NET(so->so_vnet); struct rt_msghdr *rtm = NULL; struct rtentry *rt = NULL; struct radix_node_head *rnh; struct rt_addrinfo info; int len, error = 0; struct ifnet *ifp = NULL; union sockaddr_union saun; #define senderr(e) { error = e; goto flush;} if (m == NULL || ((m->m_len < sizeof(long)) && (m = m_pullup(m, sizeof(long))) == NULL)) return (ENOBUFS); if ((m->m_flags & M_PKTHDR) == 0) panic("route_output"); len = m->m_pkthdr.len; if (len < sizeof(*rtm) || len != mtod(m, struct rt_msghdr *)->rtm_msglen) { info.rti_info[RTAX_DST] = NULL; senderr(EINVAL); } R_Malloc(rtm, struct rt_msghdr *, len); if (rtm == NULL) { info.rti_info[RTAX_DST] = NULL; senderr(ENOBUFS); } m_copydata(m, 0, len, (caddr_t)rtm); if (rtm->rtm_version != RTM_VERSION) { info.rti_info[RTAX_DST] = NULL; senderr(EPROTONOSUPPORT); } rtm->rtm_pid = curproc->p_pid; bzero(&info, sizeof(info)); info.rti_addrs = rtm->rtm_addrs; if (rt_xaddrs((caddr_t)(rtm + 1), len + (caddr_t)rtm, &info)) { info.rti_info[RTAX_DST] = NULL; senderr(EINVAL); } info.rti_flags = rtm->rtm_flags; if (info.rti_info[RTAX_DST] == NULL || info.rti_info[RTAX_DST]->sa_family >= AF_MAX || (info.rti_info[RTAX_GATEWAY] != NULL && info.rti_info[RTAX_GATEWAY]->sa_family >= AF_MAX)) senderr(EINVAL); /* * Verify that the caller has the appropriate 
privilege; RTM_GET * is the only operation the non-superuser is allowed. */ if (rtm->rtm_type != RTM_GET) { error = priv_check(curthread, PRIV_NET_ROUTE); if (error) senderr(error); } switch (rtm->rtm_type) { struct rtentry *saved_nrt; case RTM_ADD: if (info.rti_info[RTAX_GATEWAY] == NULL) senderr(EINVAL); saved_nrt = NULL; /* support for new ARP code */ if (info.rti_info[RTAX_GATEWAY]->sa_family == AF_LINK && (rtm->rtm_flags & RTF_LLDATA) != 0) { error = lla_rt_output(rtm, &info); break; } error = rtrequest1_fib(RTM_ADD, &info, &saved_nrt, so->so_fibnum); if (error == 0 && saved_nrt) { RT_LOCK(saved_nrt); rt_setmetrics(rtm->rtm_inits, &rtm->rtm_rmx, &saved_nrt->rt_rmx); rtm->rtm_index = saved_nrt->rt_ifp->if_index; RT_REMREF(saved_nrt); RT_UNLOCK(saved_nrt); } break; case RTM_DELETE: saved_nrt = NULL; /* support for new ARP code */ if (info.rti_info[RTAX_GATEWAY] && (info.rti_info[RTAX_GATEWAY]->sa_family == AF_LINK) && (rtm->rtm_flags & RTF_LLDATA) != 0) { error = lla_rt_output(rtm, &info); break; } error = rtrequest1_fib(RTM_DELETE, &info, &saved_nrt, so->so_fibnum); if (error == 0) { RT_LOCK(saved_nrt); rt = saved_nrt; goto report; } break; case RTM_GET: case RTM_CHANGE: case RTM_LOCK: rnh = V_rt_tables[so->so_fibnum][info.rti_info[RTAX_DST]->sa_family]; if (rnh == NULL) senderr(EAFNOSUPPORT); RADIX_NODE_HEAD_RLOCK(rnh); rt = (struct rtentry *) rnh->rnh_lookup(info.rti_info[RTAX_DST], info.rti_info[RTAX_NETMASK], rnh); if (rt == NULL) { /* XXX looks bogus */ RADIX_NODE_HEAD_RUNLOCK(rnh); senderr(ESRCH); } #ifdef RADIX_MPATH /* * for RTM_CHANGE/LOCK, if we got multipath routes, * we require users to specify a matching RTAX_GATEWAY. * * for RTM_GET, gate is optional even with multipath. * if gate == NULL the first match is returned. * (no need to call rt_mpath_matchgate if gate == NULL) */ if (rn_mpath_capable(rnh) && (rtm->rtm_type != RTM_GET || info.rti_info[RTAX_GATEWAY])) { rt = rt_mpath_matchgate(rt, info.rti_info[RTAX_GATEWAY]); if (!rt) { RADIX_NODE_HEAD_RUNLOCK(rnh); senderr(ESRCH); } } #endif RT_LOCK(rt); RT_ADDREF(rt); RADIX_NODE_HEAD_RUNLOCK(rnh); /* * Fix for PR: 82974 * * RTM_CHANGE/LOCK need a perfect match, rn_lookup() * returns a perfect match in case a netmask is * specified. For host routes only a longest prefix * match is returned so it is necessary to compare the * existence of the netmask. If both have a netmask * rnh_lookup() did a perfect match and if none of them * have a netmask both are host routes which is also a * perfect match. */ if (rtm->rtm_type != RTM_GET && (!rt_mask(rt) != !info.rti_info[RTAX_NETMASK])) { RT_UNLOCK(rt); senderr(ESRCH); } switch(rtm->rtm_type) { case RTM_GET: report: RT_LOCK_ASSERT(rt); if ((rt->rt_flags & RTF_HOST) == 0 ? 
jailed(curthread->td_ucred) : prison_if(curthread->td_ucred, rt_key(rt)) != 0) { RT_UNLOCK(rt); senderr(ESRCH); } info.rti_info[RTAX_DST] = rt_key(rt); info.rti_info[RTAX_GATEWAY] = rt->rt_gateway; info.rti_info[RTAX_NETMASK] = rt_mask(rt); info.rti_info[RTAX_GENMASK] = 0; if (rtm->rtm_addrs & (RTA_IFP | RTA_IFA)) { ifp = rt->rt_ifp; if (ifp) { info.rti_info[RTAX_IFP] = ifp->if_addr->ifa_addr; error = rtm_get_jailed(&info, ifp, rt, &saun, curthread->td_ucred); if (error != 0) { RT_UNLOCK(rt); senderr(error); } if (ifp->if_flags & IFF_POINTOPOINT) info.rti_info[RTAX_BRD] = rt->rt_ifa->ifa_dstaddr; rtm->rtm_index = ifp->if_index; } else { info.rti_info[RTAX_IFP] = NULL; info.rti_info[RTAX_IFA] = NULL; } } else if ((ifp = rt->rt_ifp) != NULL) { rtm->rtm_index = ifp->if_index; } len = rt_msg2(rtm->rtm_type, &info, NULL, NULL); if (len > rtm->rtm_msglen) { struct rt_msghdr *new_rtm; R_Malloc(new_rtm, struct rt_msghdr *, len); if (new_rtm == NULL) { RT_UNLOCK(rt); senderr(ENOBUFS); } bcopy(rtm, new_rtm, rtm->rtm_msglen); Free(rtm); rtm = new_rtm; } (void)rt_msg2(rtm->rtm_type, &info, (caddr_t)rtm, NULL); rtm->rtm_flags = rt->rt_flags; rt_getmetrics(&rt->rt_rmx, &rtm->rtm_rmx); rtm->rtm_addrs = info.rti_addrs; break; case RTM_CHANGE: /* * New gateway could require new ifaddr, ifp; * flags may also be different; ifp may be specified * by ll sockaddr when protocol address is ambiguous */ if (((rt->rt_flags & RTF_GATEWAY) && info.rti_info[RTAX_GATEWAY] != NULL) || info.rti_info[RTAX_IFP] != NULL || (info.rti_info[RTAX_IFA] != NULL && !sa_equal(info.rti_info[RTAX_IFA], rt->rt_ifa->ifa_addr))) { RT_UNLOCK(rt); RADIX_NODE_HEAD_LOCK(rnh); error = rt_getifa_fib(&info, rt->rt_fibnum); RADIX_NODE_HEAD_UNLOCK(rnh); if (error != 0) senderr(error); RT_LOCK(rt); } if (info.rti_ifa != NULL && info.rti_ifa != rt->rt_ifa && rt->rt_ifa != NULL && rt->rt_ifa->ifa_rtrequest != NULL) { rt->rt_ifa->ifa_rtrequest(RTM_DELETE, rt, &info); IFAFREE(rt->rt_ifa); } if (info.rti_info[RTAX_GATEWAY] != NULL) { RT_UNLOCK(rt); RADIX_NODE_HEAD_LOCK(rnh); RT_LOCK(rt); error = rt_setgate(rt, rt_key(rt), info.rti_info[RTAX_GATEWAY]); RADIX_NODE_HEAD_UNLOCK(rnh); if (error != 0) { RT_UNLOCK(rt); senderr(error); } rt->rt_flags |= RTF_GATEWAY; } if (info.rti_ifa != NULL && info.rti_ifa != rt->rt_ifa) { IFAREF(info.rti_ifa); rt->rt_ifa = info.rti_ifa; rt->rt_ifp = info.rti_ifp; } /* Allow some flags to be toggled on change. */ rt->rt_flags = (rt->rt_flags & ~RTF_FMASK) | (rtm->rtm_flags & RTF_FMASK); rt_setmetrics(rtm->rtm_inits, &rtm->rtm_rmx, &rt->rt_rmx); rtm->rtm_index = rt->rt_ifp->if_index; if (rt->rt_ifa && rt->rt_ifa->ifa_rtrequest) rt->rt_ifa->ifa_rtrequest(RTM_ADD, rt, &info); /* FALLTHROUGH */ case RTM_LOCK: /* We don't support locks anymore */ break; } RT_UNLOCK(rt); break; default: senderr(EOPNOTSUPP); } flush: if (rtm) { if (error) rtm->rtm_errno = error; else rtm->rtm_flags |= RTF_DONE; } if (rt) /* XXX can this be true? */ RTFREE(rt); { struct rawcb *rp = NULL; /* * Check to see if we don't want our own messages. 
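 *
 * A userland process that does not want to read back its own
 * requests can clear SO_USELOOPBACK before writing (illustrative
 * sketch):
 *	int s = socket(PF_ROUTE, SOCK_RAW, 0);
 *	int off = 0;
 *	setsockopt(s, SOL_SOCKET, SO_USELOOPBACK, &off, sizeof(off));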
*/ if ((so->so_options & SO_USELOOPBACK) == 0) { if (route_cb.any_count <= 1) { if (rtm) Free(rtm); m_freem(m); return (error); } /* There is another listener, so construct message */ rp = sotorawcb(so); } if (rtm) { m_copyback(m, 0, rtm->rtm_msglen, (caddr_t)rtm); if (m->m_pkthdr.len < rtm->rtm_msglen) { m_freem(m); m = NULL; } else if (m->m_pkthdr.len > rtm->rtm_msglen) m_adj(m, rtm->rtm_msglen - m->m_pkthdr.len); Free(rtm); } if (m) { if (rp) { /* * XXX insure we don't get a copy by * invalidating our protocol */ unsigned short family = rp->rcb_proto.sp_family; rp->rcb_proto.sp_family = 0; rt_dispatch(m, info.rti_info[RTAX_DST]); rp->rcb_proto.sp_family = family; } else rt_dispatch(m, info.rti_info[RTAX_DST]); } } return (error); #undef sa_equal } static void rt_setmetrics(u_long which, const struct rt_metrics *in, struct rt_metrics_lite *out) { #define metric(f, e) if (which & (f)) out->e = in->e; /* * Only these are stored in the routing entry since introduction * of tcp hostcache. The rest is ignored. */ metric(RTV_MTU, rmx_mtu); metric(RTV_WEIGHT, rmx_weight); /* Userland -> kernel timebase conversion. */ if (which & RTV_EXPIRE) out->rmx_expire = in->rmx_expire ? in->rmx_expire - time_second + time_uptime : 0; #undef metric } static void rt_getmetrics(const struct rt_metrics_lite *in, struct rt_metrics *out) { #define metric(e) out->e = in->e; bzero(out, sizeof(*out)); metric(rmx_mtu); metric(rmx_weight); /* Kernel -> userland timebase conversion. */ out->rmx_expire = in->rmx_expire ? in->rmx_expire - time_uptime + time_second : 0; #undef metric } /* * Extract the addresses of the passed sockaddrs. * Do a little sanity checking so as to avoid bad memory references. * This data is derived straight from userland. */ static int rt_xaddrs(caddr_t cp, caddr_t cplim, struct rt_addrinfo *rtinfo) { struct sockaddr *sa; int i; for (i = 0; i < RTAX_MAX && cp < cplim; i++) { if ((rtinfo->rti_addrs & (1 << i)) == 0) continue; sa = (struct sockaddr *)cp; /* * It won't fit. */ if (cp + sa->sa_len > cplim) return (EINVAL); /* * there are no more.. quit now * If there are more bits, they are in error. * I've seen this. route(1) can evidently generate these. * This causes kernel to core dump. * for compatibility, If we see this, point to a safe address. 
*/ if (sa->sa_len == 0) { rtinfo->rti_info[i] = &sa_zero; return (0); /* should be EINVAL but for compat */ } /* accept it */ rtinfo->rti_info[i] = sa; cp += SA_SIZE(sa); } return (0); } static struct mbuf * rt_msg1(int type, struct rt_addrinfo *rtinfo) { struct rt_msghdr *rtm; struct mbuf *m; int i; struct sockaddr *sa; int len, dlen; switch (type) { case RTM_DELADDR: case RTM_NEWADDR: len = sizeof(struct ifa_msghdr); break; case RTM_DELMADDR: case RTM_NEWMADDR: len = sizeof(struct ifma_msghdr); break; case RTM_IFINFO: len = sizeof(struct if_msghdr); break; case RTM_IFANNOUNCE: case RTM_IEEE80211: len = sizeof(struct if_announcemsghdr); break; default: len = sizeof(struct rt_msghdr); } if (len > MCLBYTES) panic("rt_msg1"); m = m_gethdr(M_DONTWAIT, MT_DATA); if (m && len > MHLEN) { MCLGET(m, M_DONTWAIT); if ((m->m_flags & M_EXT) == 0) { m_free(m); m = NULL; } } if (m == NULL) return (m); m->m_pkthdr.len = m->m_len = len; m->m_pkthdr.rcvif = NULL; rtm = mtod(m, struct rt_msghdr *); bzero((caddr_t)rtm, len); for (i = 0; i < RTAX_MAX; i++) { if ((sa = rtinfo->rti_info[i]) == NULL) continue; rtinfo->rti_addrs |= (1 << i); dlen = SA_SIZE(sa); m_copyback(m, len, dlen, (caddr_t)sa); len += dlen; } if (m->m_pkthdr.len != len) { m_freem(m); return (NULL); } rtm->rtm_msglen = len; rtm->rtm_version = RTM_VERSION; rtm->rtm_type = type; return (m); } static int rt_msg2(int type, struct rt_addrinfo *rtinfo, caddr_t cp, struct walkarg *w) { int i; int len, dlen, second_time = 0; caddr_t cp0; rtinfo->rti_addrs = 0; again: switch (type) { case RTM_DELADDR: case RTM_NEWADDR: len = sizeof(struct ifa_msghdr); break; case RTM_IFINFO: len = sizeof(struct if_msghdr); break; case RTM_NEWMADDR: len = sizeof(struct ifma_msghdr); break; default: len = sizeof(struct rt_msghdr); } cp0 = cp; if (cp0) cp += len; for (i = 0; i < RTAX_MAX; i++) { struct sockaddr *sa; if ((sa = rtinfo->rti_info[i]) == NULL) continue; rtinfo->rti_addrs |= (1 << i); dlen = SA_SIZE(sa); if (cp) { bcopy((caddr_t)sa, cp, (unsigned)dlen); cp += dlen; } len += dlen; } len = ALIGN(len); if (cp == NULL && w != NULL && !second_time) { struct walkarg *rw = w; if (rw->w_req) { if (rw->w_tmemsize < len) { if (rw->w_tmem) free(rw->w_tmem, M_RTABLE); rw->w_tmem = (caddr_t) malloc(len, M_RTABLE, M_NOWAIT); if (rw->w_tmem) rw->w_tmemsize = len; } if (rw->w_tmem) { cp = rw->w_tmem; second_time = 1; goto again; } } } if (cp) { struct rt_msghdr *rtm = (struct rt_msghdr *)cp0; rtm->rtm_version = RTM_VERSION; rtm->rtm_type = type; rtm->rtm_msglen = len; } return (len); } /* * This routine is called to generate a message from the routing * socket indicating that a redirect has occured, a routing lookup * has failed, or that a protocol has detected timeouts to a particular * destination. */ void rt_missmsg(int type, struct rt_addrinfo *rtinfo, int flags, int error) { struct rt_msghdr *rtm; struct mbuf *m; struct sockaddr *sa = rtinfo->rti_info[RTAX_DST]; if (route_cb.any_count == 0) return; m = rt_msg1(type, rtinfo); if (m == NULL) return; rtm = mtod(m, struct rt_msghdr *); rtm->rtm_flags = RTF_DONE | flags; rtm->rtm_errno = error; rtm->rtm_addrs = rtinfo->rti_addrs; rt_dispatch(m, sa); } /* * This routine is called to generate a message from the routing * socket indicating that the status of a network interface has changed. 
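 *
 * Userland listeners see these as RTM_IFINFO messages on a PF_ROUTE
 * socket (illustrative sketch; rtsock_fd is assumed to be such a
 * socket, error handling omitted):
 *	char buf[2048];
 *	struct if_msghdr *ifm = (struct if_msghdr *)buf;
 *	if (read(rtsock_fd, buf, sizeof(buf)) > 0 &&
 *	    ifm->ifm_type == RTM_IFINFO)
 *		printf("index %u flags %#x\n",
 *		    ifm->ifm_index, ifm->ifm_flags);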
*/ void rt_ifmsg(struct ifnet *ifp) { struct if_msghdr *ifm; struct mbuf *m; struct rt_addrinfo info; if (route_cb.any_count == 0) return; bzero((caddr_t)&info, sizeof(info)); m = rt_msg1(RTM_IFINFO, &info); if (m == NULL) return; ifm = mtod(m, struct if_msghdr *); ifm->ifm_index = ifp->if_index; ifm->ifm_flags = ifp->if_flags | ifp->if_drv_flags; ifm->ifm_data = ifp->if_data; ifm->ifm_addrs = 0; rt_dispatch(m, NULL); } /* * This is called to generate messages from the routing socket * indicating a network interface has had addresses associated with it. * if we ever reverse the logic and replace messages TO the routing * socket indicate a request to configure interfaces, then it will * be unnecessary as the routing socket will automatically generate * copies of it. */ void rt_newaddrmsg(int cmd, struct ifaddr *ifa, int error, struct rtentry *rt) { struct rt_addrinfo info; struct sockaddr *sa = NULL; int pass; struct mbuf *m = NULL; struct ifnet *ifp = ifa->ifa_ifp; KASSERT(cmd == RTM_ADD || cmd == RTM_DELETE, ("unexpected cmd %u", cmd)); #ifdef SCTP /* * notify the SCTP stack * this will only get called when an address is added/deleted * XXX pass the ifaddr struct instead if ifa->ifa_addr... */ sctp_addr_change(ifa, cmd); #endif /* SCTP */ if (route_cb.any_count == 0) return; for (pass = 1; pass < 3; pass++) { bzero((caddr_t)&info, sizeof(info)); if ((cmd == RTM_ADD && pass == 1) || (cmd == RTM_DELETE && pass == 2)) { struct ifa_msghdr *ifam; int ncmd = cmd == RTM_ADD ? RTM_NEWADDR : RTM_DELADDR; info.rti_info[RTAX_IFA] = sa = ifa->ifa_addr; info.rti_info[RTAX_IFP] = ifp->if_addr->ifa_addr; info.rti_info[RTAX_NETMASK] = ifa->ifa_netmask; info.rti_info[RTAX_BRD] = ifa->ifa_dstaddr; if ((m = rt_msg1(ncmd, &info)) == NULL) continue; ifam = mtod(m, struct ifa_msghdr *); ifam->ifam_index = ifp->if_index; ifam->ifam_metric = ifa->ifa_metric; ifam->ifam_flags = ifa->ifa_flags; ifam->ifam_addrs = info.rti_addrs; } if ((cmd == RTM_ADD && pass == 2) || (cmd == RTM_DELETE && pass == 1)) { struct rt_msghdr *rtm; if (rt == NULL) continue; info.rti_info[RTAX_NETMASK] = rt_mask(rt); info.rti_info[RTAX_DST] = sa = rt_key(rt); info.rti_info[RTAX_GATEWAY] = rt->rt_gateway; if ((m = rt_msg1(cmd, &info)) == NULL) continue; rtm = mtod(m, struct rt_msghdr *); rtm->rtm_index = ifp->if_index; rtm->rtm_flags |= rt->rt_flags; rtm->rtm_errno = error; rtm->rtm_addrs = info.rti_addrs; } rt_dispatch(m, sa); } } /* * This is the analogue to the rt_newaddrmsg which performs the same * function but for multicast group memberhips. This is easier since * there is no route state to worry about. */ void rt_newmaddrmsg(int cmd, struct ifmultiaddr *ifma) { struct rt_addrinfo info; struct mbuf *m = NULL; struct ifnet *ifp = ifma->ifma_ifp; struct ifma_msghdr *ifmam; if (route_cb.any_count == 0) return; bzero((caddr_t)&info, sizeof(info)); info.rti_info[RTAX_IFA] = ifma->ifma_addr; info.rti_info[RTAX_IFP] = ifp ? ifp->if_addr->ifa_addr : NULL; /* * If a link-layer address is present, present it as a ``gateway'' * (similarly to how ARP entries, e.g., are presented). 
*/ info.rti_info[RTAX_GATEWAY] = ifma->ifma_lladdr; m = rt_msg1(cmd, &info); if (m == NULL) return; ifmam = mtod(m, struct ifma_msghdr *); KASSERT(ifp != NULL, ("%s: link-layer multicast address w/o ifp\n", __func__)); ifmam->ifmam_index = ifp->if_index; ifmam->ifmam_addrs = info.rti_addrs; rt_dispatch(m, ifma->ifma_addr); } static struct mbuf * rt_makeifannouncemsg(struct ifnet *ifp, int type, int what, struct rt_addrinfo *info) { struct if_announcemsghdr *ifan; struct mbuf *m; if (route_cb.any_count == 0) return NULL; bzero((caddr_t)info, sizeof(*info)); m = rt_msg1(type, info); if (m != NULL) { ifan = mtod(m, struct if_announcemsghdr *); ifan->ifan_index = ifp->if_index; strlcpy(ifan->ifan_name, ifp->if_xname, sizeof(ifan->ifan_name)); ifan->ifan_what = what; } return m; } /* * This is called to generate routing socket messages indicating * IEEE80211 wireless events. * XXX we piggyback on the RTM_IFANNOUNCE msg format in a clumsy way. */ void rt_ieee80211msg(struct ifnet *ifp, int what, void *data, size_t data_len) { struct mbuf *m; struct rt_addrinfo info; m = rt_makeifannouncemsg(ifp, RTM_IEEE80211, what, &info); if (m != NULL) { /* * Append the ieee80211 data. Try to stick it in the * mbuf containing the ifannounce msg; otherwise allocate * a new mbuf and append. * * NB: we assume m is a single mbuf. */ if (data_len > M_TRAILINGSPACE(m)) { struct mbuf *n = m_get(M_NOWAIT, MT_DATA); if (n == NULL) { m_freem(m); return; } bcopy(data, mtod(n, void *), data_len); n->m_len = data_len; m->m_next = n; } else if (data_len > 0) { bcopy(data, mtod(m, u_int8_t *) + m->m_len, data_len); m->m_len += data_len; } if (m->m_flags & M_PKTHDR) m->m_pkthdr.len += data_len; mtod(m, struct if_announcemsghdr *)->ifan_msglen += data_len; rt_dispatch(m, NULL); } } /* * This is called to generate routing socket messages indicating * network interface arrival and departure. */ void rt_ifannouncemsg(struct ifnet *ifp, int what) { struct mbuf *m; struct rt_addrinfo info; m = rt_makeifannouncemsg(ifp, RTM_IFANNOUNCE, what, &info); if (m != NULL) rt_dispatch(m, NULL); } static void rt_dispatch(struct mbuf *m, const struct sockaddr *sa) { INIT_VNET_NET(curvnet); struct m_tag *tag; /* * Preserve the family from the sockaddr, if any, in an m_tag for * use when injecting the mbuf into the routing socket buffer from * the netisr. */ if (sa != NULL) { tag = m_tag_get(PACKET_TAG_RTSOCKFAM, sizeof(unsigned short), M_NOWAIT); if (tag == NULL) { m_freem(m); return; } *(unsigned short *)(tag + 1) = sa->sa_family; m_tag_prepend(m, tag); } #ifdef VIMAGE if (V_loif) m->m_pkthdr.rcvif = V_loif; else { m_freem(m); return; } #endif netisr_queue(NETISR_ROUTE, m); /* mbuf is free'd on failure. */ } /* * This is used in dumping the kernel table via sysctl(). */ static int sysctl_dumpentry(struct radix_node *rn, void *vw) { struct walkarg *w = vw; struct rtentry *rt = (struct rtentry *)rn; int error = 0, size; struct rt_addrinfo info; if (w->w_op == NET_RT_FLAGS && !(rt->rt_flags & w->w_arg)) return 0; if ((rt->rt_flags & RTF_HOST) == 0 ? 
jailed(w->w_req->td->td_ucred) : prison_if(w->w_req->td->td_ucred, rt_key(rt)) != 0) return (0); bzero((caddr_t)&info, sizeof(info)); info.rti_info[RTAX_DST] = rt_key(rt); info.rti_info[RTAX_GATEWAY] = rt->rt_gateway; info.rti_info[RTAX_NETMASK] = rt_mask(rt); info.rti_info[RTAX_GENMASK] = 0; if (rt->rt_ifp) { info.rti_info[RTAX_IFP] = rt->rt_ifp->if_addr->ifa_addr; info.rti_info[RTAX_IFA] = rt->rt_ifa->ifa_addr; if (rt->rt_ifp->if_flags & IFF_POINTOPOINT) info.rti_info[RTAX_BRD] = rt->rt_ifa->ifa_dstaddr; } size = rt_msg2(RTM_GET, &info, NULL, w); if (w->w_req && w->w_tmem) { struct rt_msghdr *rtm = (struct rt_msghdr *)w->w_tmem; rtm->rtm_flags = rt->rt_flags; /* * let's be honest about this being a retarded hack */ rtm->rtm_fmask = rt->rt_rmx.rmx_pksent; rt_getmetrics(&rt->rt_rmx, &rtm->rtm_rmx); rtm->rtm_index = rt->rt_ifp->if_index; rtm->rtm_errno = rtm->rtm_pid = rtm->rtm_seq = 0; rtm->rtm_addrs = info.rti_addrs; error = SYSCTL_OUT(w->w_req, (caddr_t)rtm, size); return (error); } return (error); } static int sysctl_iflist(int af, struct walkarg *w) { INIT_VNET_NET(curvnet); struct ifnet *ifp; struct ifaddr *ifa; struct rt_addrinfo info; int len, error = 0; bzero((caddr_t)&info, sizeof(info)); IFNET_RLOCK(); TAILQ_FOREACH(ifp, &V_ifnet, if_link) { if (w->w_arg && w->w_arg != ifp->if_index) continue; ifa = ifp->if_addr; info.rti_info[RTAX_IFP] = ifa->ifa_addr; len = rt_msg2(RTM_IFINFO, &info, NULL, w); info.rti_info[RTAX_IFP] = NULL; if (w->w_req && w->w_tmem) { struct if_msghdr *ifm; ifm = (struct if_msghdr *)w->w_tmem; ifm->ifm_index = ifp->if_index; ifm->ifm_flags = ifp->if_flags | ifp->if_drv_flags; ifm->ifm_data = ifp->if_data; ifm->ifm_addrs = info.rti_addrs; error = SYSCTL_OUT(w->w_req,(caddr_t)ifm, len); if (error) goto done; } while ((ifa = TAILQ_NEXT(ifa, ifa_link)) != NULL) { if (af && af != ifa->ifa_addr->sa_family) continue; if (prison_if(w->w_req->td->td_ucred, ifa->ifa_addr) != 0) continue; info.rti_info[RTAX_IFA] = ifa->ifa_addr; info.rti_info[RTAX_NETMASK] = ifa->ifa_netmask; info.rti_info[RTAX_BRD] = ifa->ifa_dstaddr; len = rt_msg2(RTM_NEWADDR, &info, NULL, w); if (w->w_req && w->w_tmem) { struct ifa_msghdr *ifam; ifam = (struct ifa_msghdr *)w->w_tmem; ifam->ifam_index = ifa->ifa_ifp->if_index; ifam->ifam_flags = ifa->ifa_flags; ifam->ifam_metric = ifa->ifa_metric; ifam->ifam_addrs = info.rti_addrs; error = SYSCTL_OUT(w->w_req, w->w_tmem, len); if (error) goto done; } } info.rti_info[RTAX_IFA] = info.rti_info[RTAX_NETMASK] = info.rti_info[RTAX_BRD] = NULL; } done: IFNET_RUNLOCK(); return (error); } static int sysctl_ifmalist(int af, struct walkarg *w) { INIT_VNET_NET(curvnet); struct ifnet *ifp; struct ifmultiaddr *ifma; struct rt_addrinfo info; int len, error = 0; struct ifaddr *ifa; bzero((caddr_t)&info, sizeof(info)); IFNET_RLOCK(); TAILQ_FOREACH(ifp, &V_ifnet, if_link) { if (w->w_arg && w->w_arg != ifp->if_index) continue; ifa = ifp->if_addr; info.rti_info[RTAX_IFP] = ifa ? ifa->ifa_addr : NULL; IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifma, &ifp->if_multiaddrs, ifma_link) { if (af && af != ifma->ifma_addr->sa_family) continue; if (prison_if(w->w_req->td->td_ucred, ifma->ifma_addr) != 0) continue; info.rti_info[RTAX_IFA] = ifma->ifma_addr; info.rti_info[RTAX_GATEWAY] = (ifma->ifma_addr->sa_family != AF_LINK) ? 
ifma->ifma_lladdr : NULL; len = rt_msg2(RTM_NEWMADDR, &info, NULL, w); if (w->w_req && w->w_tmem) { struct ifma_msghdr *ifmam; ifmam = (struct ifma_msghdr *)w->w_tmem; ifmam->ifmam_index = ifma->ifma_ifp->if_index; ifmam->ifmam_flags = 0; ifmam->ifmam_addrs = info.rti_addrs; error = SYSCTL_OUT(w->w_req, w->w_tmem, len); if (error) { IF_ADDR_UNLOCK(ifp); goto done; } } } IF_ADDR_UNLOCK(ifp); } done: IFNET_RUNLOCK(); return (error); } static int sysctl_rtsock(SYSCTL_HANDLER_ARGS) { INIT_VNET_NET(curvnet); int *name = (int *)arg1; u_int namelen = arg2; struct radix_node_head *rnh; int i, lim, error = EINVAL; u_char af; struct walkarg w; name ++; namelen--; if (req->newptr) return (EPERM); if (namelen != 3) return ((namelen < 3) ? EISDIR : ENOTDIR); af = name[0]; if (af > AF_MAX) return (EINVAL); bzero(&w, sizeof(w)); w.w_op = name[1]; w.w_arg = name[2]; w.w_req = req; error = sysctl_wire_old_buffer(req, 0); if (error) return (error); switch (w.w_op) { case NET_RT_DUMP: case NET_RT_FLAGS: if (af == 0) { /* dump all tables */ i = 1; lim = AF_MAX; } else /* dump only one table */ i = lim = af; /* * take care of llinfo entries, the caller must * specify an AF */ if (w.w_op == NET_RT_FLAGS && (w.w_arg == 0 || w.w_arg & RTF_LLINFO)) { if (af != 0) error = lltable_sysctl_dumparp(af, w.w_req); else error = EINVAL; break; } /* * take care of routing entries */ for (error = 0; error == 0 && i <= lim; i++) if ((rnh = V_rt_tables[req->td->td_proc->p_fibnum][i]) != NULL) { RADIX_NODE_HEAD_LOCK(rnh); error = rnh->rnh_walktree(rnh, sysctl_dumpentry, &w); RADIX_NODE_HEAD_UNLOCK(rnh); } else if (af != 0) error = EAFNOSUPPORT; break; case NET_RT_IFLIST: error = sysctl_iflist(af, &w); break; case NET_RT_IFMALIST: error = sysctl_ifmalist(af, &w); break; } if (w.w_tmem) free(w.w_tmem, M_RTABLE); return (error); } SYSCTL_NODE(_net, PF_ROUTE, routetable, CTLFLAG_RD, sysctl_rtsock, ""); /* * Definitions of protocols supported in the ROUTE domain. */ static struct domain routedomain; /* or at least forward */ static struct protosw routesw[] = { { .pr_type = SOCK_RAW, .pr_domain = &routedomain, .pr_flags = PR_ATOMIC|PR_ADDR, .pr_output = route_output, .pr_ctlinput = raw_ctlinput, .pr_init = raw_init, .pr_usrreqs = &route_usrreqs } }; static struct domain routedomain = { .dom_family = PF_ROUTE, .dom_name = "route", .dom_protosw = routesw, .dom_protoswNPROTOSW = &routesw[sizeof(routesw)/sizeof(routesw[0])] }; DOMAIN_SET(route); Index: head/sys/netinet/in_pcb.c =================================================================== --- head/sys/netinet/in_pcb.c (revision 192894) +++ head/sys/netinet/in_pcb.c (revision 192895) @@ -1,1951 +1,1953 @@ /*- * Copyright (c) 1982, 1986, 1991, 1993, 1995 * The Regents of the University of California. * Copyright (c) 2007-2009 Robert N. M. Watson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)in_pcb.c 8.4 (Berkeley) 5/24/95 */ #include __FBSDID("$FreeBSD$"); #include "opt_ddb.h" #include "opt_inet.h" #include "opt_ipsec.h" #include "opt_inet6.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef DDB #include #endif #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #include #include #endif /* INET6 */ #ifdef IPSEC #include #include #endif /* IPSEC */ #include #ifdef VIMAGE_GLOBALS /* * These configure the range of local port addresses assigned to * "unspecified" outgoing connections/packets/whatever. */ int ipport_lowfirstauto; int ipport_lowlastauto; int ipport_firstauto; int ipport_lastauto; int ipport_hifirstauto; int ipport_hilastauto; /* * Reserved ports accessible only to root. There are significant * security considerations that must be accounted for when changing these, * but the security benefits can be great. Please be careful. */ int ipport_reservedhigh; int ipport_reservedlow; /* Variables dealing with random ephemeral port allocation. 
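 *
 * These back the net.inet.ip.portrange.* sysctls defined below; for
 * example (illustrative), "sysctl net.inet.ip.portrange.randomized=0"
 * disables randomized ephemeral port selection system-wide.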
*/ int ipport_randomized; int ipport_randomcps; int ipport_randomtime; int ipport_stoprandom; int ipport_tcpallocs; int ipport_tcplastcount; #endif #define RANGECHK(var, min, max) \ if ((var) < (min)) { (var) = (min); } \ else if ((var) > (max)) { (var) = (max); } static void in_pcbremlists(struct inpcb *inp); static int sysctl_net_ipport_check(SYSCTL_HANDLER_ARGS) { INIT_VNET_INET(curvnet); int error; SYSCTL_RESOLVE_V_ARG1(); error = sysctl_handle_int(oidp, arg1, arg2, req); if (error == 0) { RANGECHK(V_ipport_lowfirstauto, 1, IPPORT_RESERVED - 1); RANGECHK(V_ipport_lowlastauto, 1, IPPORT_RESERVED - 1); RANGECHK(V_ipport_firstauto, IPPORT_RESERVED, IPPORT_MAX); RANGECHK(V_ipport_lastauto, IPPORT_RESERVED, IPPORT_MAX); RANGECHK(V_ipport_hifirstauto, IPPORT_RESERVED, IPPORT_MAX); RANGECHK(V_ipport_hilastauto, IPPORT_RESERVED, IPPORT_MAX); } return (error); } #undef RANGECHK SYSCTL_NODE(_net_inet_ip, IPPROTO_IP, portrange, CTLFLAG_RW, 0, "IP Ports"); SYSCTL_V_PROC(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, lowfirst, CTLTYPE_INT|CTLFLAG_RW, ipport_lowfirstauto, 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_V_PROC(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, lowlast, CTLTYPE_INT|CTLFLAG_RW, ipport_lowlastauto, 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_V_PROC(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, first, CTLTYPE_INT|CTLFLAG_RW, ipport_firstauto, 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_V_PROC(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, last, CTLTYPE_INT|CTLFLAG_RW, ipport_lastauto, 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_V_PROC(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, hifirst, CTLTYPE_INT|CTLFLAG_RW, ipport_hifirstauto, 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_V_PROC(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, hilast, CTLTYPE_INT|CTLFLAG_RW, ipport_hilastauto, 0, &sysctl_net_ipport_check, "I", ""); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, reservedhigh, CTLFLAG_RW|CTLFLAG_SECURE, ipport_reservedhigh, 0, ""); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, reservedlow, CTLFLAG_RW|CTLFLAG_SECURE, ipport_reservedlow, 0, ""); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, randomized, CTLFLAG_RW, ipport_randomized, 0, "Enable random port allocation"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, randomcps, CTLFLAG_RW, ipport_randomcps, 0, "Maximum number of random port " "allocations before switching to a sequental one"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_ip_portrange, OID_AUTO, randomtime, CTLFLAG_RW, ipport_randomtime, 0, "Minimum time to keep sequental port " "allocation before switching to a random one"); /* * in_pcb.c: manage the Protocol Control Blocks. * * NOTE: It is assumed that most of these functions will be called with * the pcbinfo lock held, and often, the inpcb lock held, as these utility * functions often modify hash chains or addresses in pcbs. */ /* * Allocate a PCB and associate it with the socket. * On success return with the PCB locked. 
*/ int in_pcballoc(struct socket *so, struct inpcbinfo *pcbinfo) { #ifdef INET6 INIT_VNET_INET6(curvnet); #endif struct inpcb *inp; int error; INP_INFO_WLOCK_ASSERT(pcbinfo); error = 0; inp = uma_zalloc(pcbinfo->ipi_zone, M_NOWAIT); if (inp == NULL) return (ENOBUFS); bzero(inp, inp_zero_size); inp->inp_pcbinfo = pcbinfo; inp->inp_socket = so; inp->inp_cred = crhold(so->so_cred); inp->inp_inc.inc_fibnum = so->so_fibnum; #ifdef MAC error = mac_inpcb_init(inp, M_NOWAIT); if (error != 0) goto out; SOCK_LOCK(so); mac_inpcb_create(so, inp); SOCK_UNLOCK(so); #endif #ifdef IPSEC error = ipsec_init_policy(so, &inp->inp_sp); if (error != 0) { #ifdef MAC mac_inpcb_destroy(inp); #endif goto out; } #endif /*IPSEC*/ #ifdef INET6 if (INP_SOCKAF(so) == AF_INET6) { inp->inp_vflag |= INP_IPV6PROTO; if (V_ip6_v6only) inp->inp_flags |= IN6P_IPV6_V6ONLY; } #endif LIST_INSERT_HEAD(pcbinfo->ipi_listhead, inp, inp_list); pcbinfo->ipi_count++; so->so_pcb = (caddr_t)inp; #ifdef INET6 if (V_ip6_auto_flowlabel) inp->inp_flags |= IN6P_AUTOFLOWLABEL; #endif INP_WLOCK(inp); inp->inp_gencnt = ++pcbinfo->ipi_gencnt; inp->inp_refcount = 1; /* Reference from the inpcbinfo */ #if defined(IPSEC) || defined(MAC) out: if (error != 0) { crfree(inp->inp_cred); uma_zfree(pcbinfo->ipi_zone, inp); } #endif return (error); } int in_pcbbind(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred) { int anonport, error; INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo); INP_WLOCK_ASSERT(inp); if (inp->inp_lport != 0 || inp->inp_laddr.s_addr != INADDR_ANY) return (EINVAL); anonport = inp->inp_lport == 0 && (nam == NULL || ((struct sockaddr_in *)nam)->sin_port == 0); error = in_pcbbind_setup(inp, nam, &inp->inp_laddr.s_addr, &inp->inp_lport, cred); if (error) return (error); if (in_pcbinshash(inp) != 0) { inp->inp_laddr.s_addr = INADDR_ANY; inp->inp_lport = 0; return (EAGAIN); } if (anonport) inp->inp_flags |= INP_ANONPORT; return (0); } /* * Set up a bind operation on a PCB, performing port allocation * as required, but do not actually modify the PCB. Callers can * either complete the bind by setting inp_laddr/inp_lport and * calling in_pcbinshash(), or they can just use the resulting * port and address to authorise the sending of a once-off packet. * * On error, the values of *laddrp and *lportp are not changed. */ int in_pcbbind_setup(struct inpcb *inp, struct sockaddr *nam, in_addr_t *laddrp, u_short *lportp, struct ucred *cred) { INIT_VNET_INET(inp->inp_vnet); struct socket *so = inp->inp_socket; unsigned short *lastport; struct sockaddr_in *sin; struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; struct in_addr laddr; u_short lport = 0; int wild = 0, reuseport = (so->so_options & SO_REUSEPORT); int error; int dorandom; /* * Because no actual state changes occur here, a global write lock on * the pcbinfo isn't required. */ INP_INFO_LOCK_ASSERT(pcbinfo); INP_LOCK_ASSERT(inp); if (TAILQ_EMPTY(&V_in_ifaddrhead)) /* XXX broken! */ return (EADDRNOTAVAIL); laddr.s_addr = *laddrp; if (nam != NULL && laddr.s_addr != INADDR_ANY) return (EINVAL); if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) == 0) wild = INPLOOKUP_WILDCARD; if (nam == NULL) { if ((error = prison_local_ip4(cred, &laddr)) != 0) return (error); } else { sin = (struct sockaddr_in *)nam; if (nam->sa_len != sizeof (*sin)) return (EINVAL); #ifdef notdef /* * We should check the family, but old programs * incorrectly fail to initialize it. 
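 *
 * A well-behaved caller initializes the address fully before
 * bind(2) (illustrative sketch; s is an AF_INET socket descriptor
 * and the port is arbitrary):
 *	struct sockaddr_in local;
 *	bzero(&local, sizeof(local));
 *	local.sin_len = sizeof(local);
 *	local.sin_family = AF_INET;
 *	local.sin_port = htons(8080);
 *	local.sin_addr.s_addr = htonl(INADDR_ANY);
 *	bind(s, (struct sockaddr *)&local, sizeof(local));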
*/ if (sin->sin_family != AF_INET) return (EAFNOSUPPORT); #endif error = prison_local_ip4(cred, &sin->sin_addr); if (error) return (error); if (sin->sin_port != *lportp) { /* Don't allow the port to change. */ if (*lportp != 0) return (EINVAL); lport = sin->sin_port; } /* NB: lport is left as 0 if the port isn't being changed. */ if (IN_MULTICAST(ntohl(sin->sin_addr.s_addr))) { /* * Treat SO_REUSEADDR as SO_REUSEPORT for multicast; * allow complete duplication of binding if * SO_REUSEPORT is set, or if SO_REUSEADDR is set * and a multicast address is bound on both * new and duplicated sockets. */ if (so->so_options & SO_REUSEADDR) reuseport = SO_REUSEADDR|SO_REUSEPORT; } else if (sin->sin_addr.s_addr != INADDR_ANY) { sin->sin_port = 0; /* yech... */ bzero(&sin->sin_zero, sizeof(sin->sin_zero)); /* * Is the address a local IP address? * If INP_NONLOCALOK is set, then the socket may be bound * to any endpoint address, local or not. */ if ( #if defined(IP_NONLOCALBIND) ((inp->inp_flags & INP_NONLOCALOK) == 0) && #endif (ifa_ifwithaddr((struct sockaddr *)sin) == 0)) return (EADDRNOTAVAIL); } laddr = sin->sin_addr; if (lport) { struct inpcb *t; struct tcptw *tw; /* GROSS */ if (ntohs(lport) <= V_ipport_reservedhigh && ntohs(lport) >= V_ipport_reservedlow && priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT, 0)) return (EACCES); if (!IN_MULTICAST(ntohl(sin->sin_addr.s_addr)) && priv_check_cred(inp->inp_cred, PRIV_NETINET_REUSEPORT, 0) != 0) { t = in_pcblookup_local(pcbinfo, sin->sin_addr, lport, INPLOOKUP_WILDCARD, cred); /* * XXX * This entire block sorely needs a rewrite. */ if (t && ((t->inp_flags & INP_TIMEWAIT) == 0) && (so->so_type != SOCK_STREAM || ntohl(t->inp_faddr.s_addr) == INADDR_ANY) && (ntohl(sin->sin_addr.s_addr) != INADDR_ANY || ntohl(t->inp_laddr.s_addr) != INADDR_ANY || (t->inp_socket->so_options & SO_REUSEPORT) == 0) && (inp->inp_cred->cr_uid != t->inp_cred->cr_uid)) return (EADDRINUSE); } t = in_pcblookup_local(pcbinfo, sin->sin_addr, lport, wild, cred); if (t && (t->inp_flags & INP_TIMEWAIT)) { /* * XXXRW: If an incpb has had its timewait * state recycled, we treat the address as * being in use (for now). This is better * than a panic, but not desirable. */ tw = intotw(inp); if (tw == NULL || (reuseport & tw->tw_so_options) == 0) return (EADDRINUSE); } else if (t && (reuseport & t->inp_socket->so_options) == 0) { #ifdef INET6 if (ntohl(sin->sin_addr.s_addr) != INADDR_ANY || ntohl(t->inp_laddr.s_addr) != INADDR_ANY || INP_SOCKAF(so) == INP_SOCKAF(t->inp_socket)) #endif return (EADDRINUSE); } } } if (*lportp != 0) lport = *lportp; if (lport == 0) { u_short first, last, aux; int count; if (inp->inp_flags & INP_HIGHPORT) { first = V_ipport_hifirstauto; /* sysctl */ last = V_ipport_hilastauto; lastport = &pcbinfo->ipi_lasthi; } else if (inp->inp_flags & INP_LOWPORT) { error = priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT, 0); if (error) return error; first = V_ipport_lowfirstauto; /* 1023 */ last = V_ipport_lowlastauto; /* 600 */ lastport = &pcbinfo->ipi_lastlow; } else { first = V_ipport_firstauto; /* sysctl */ last = V_ipport_lastauto; lastport = &pcbinfo->ipi_lastport; } /* * For UDP, use random port allocation as long as the user * allows it. For TCP (and as of yet unknown) connections, * use random port allocation only if the user allows it AND * ipport_tick() allows it. */ if (V_ipport_randomized && (!V_ipport_stoprandom || pcbinfo == &V_udbinfo)) dorandom = 1; else dorandom = 0; /* * It makes no sense to do random port allocation if * we have the only port available. 
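 * When randomization is enabled the search below starts at a random
 * offset within [first, last]; otherwise it resumes from the last port
 * handed out.  In either case the loop wraps around the range and gives
 * up with EADDRNOTAVAIL once every candidate port has been probed.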
*/ if (first == last) dorandom = 0; /* Make sure to not include UDP packets in the count. */ if (pcbinfo != &V_udbinfo) V_ipport_tcpallocs++; /* * Instead of having two loops further down counting up or down * make sure that first is always <= last and go with only one * code path implementing all logic. */ if (first > last) { aux = first; first = last; last = aux; } if (dorandom) *lastport = first + (arc4random() % (last - first)); count = last - first; do { if (count-- < 0) /* completely used? */ return (EADDRNOTAVAIL); ++*lastport; if (*lastport < first || *lastport > last) *lastport = first; lport = htons(*lastport); } while (in_pcblookup_local(pcbinfo, laddr, lport, wild, cred)); } *laddrp = laddr.s_addr; *lportp = lport; return (0); } /* * Connect from a socket to a specified address. * Both address and port must be specified in argument sin. * If don't have a local address for this socket yet, * then pick one. */ int in_pcbconnect(struct inpcb *inp, struct sockaddr *nam, struct ucred *cred) { u_short lport, fport; in_addr_t laddr, faddr; int anonport, error; INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo); INP_WLOCK_ASSERT(inp); lport = inp->inp_lport; laddr = inp->inp_laddr.s_addr; anonport = (lport == 0); error = in_pcbconnect_setup(inp, nam, &laddr, &lport, &faddr, &fport, NULL, cred); if (error) return (error); /* Do the initial binding of the local address if required. */ if (inp->inp_laddr.s_addr == INADDR_ANY && inp->inp_lport == 0) { inp->inp_lport = lport; inp->inp_laddr.s_addr = laddr; if (in_pcbinshash(inp) != 0) { inp->inp_laddr.s_addr = INADDR_ANY; inp->inp_lport = 0; return (EAGAIN); } } /* Commit the remaining changes. */ inp->inp_lport = lport; inp->inp_laddr.s_addr = laddr; inp->inp_faddr.s_addr = faddr; inp->inp_fport = fport; in_pcbrehash(inp); if (anonport) inp->inp_flags |= INP_ANONPORT; return (0); } /* * Do proper source address selection on an unbound socket in case * of connect. Take jails into account as well. */ static int in_pcbladdr(struct inpcb *inp, struct in_addr *faddr, struct in_addr *laddr, struct ucred *cred) { struct in_ifaddr *ia; struct ifaddr *ifa; struct sockaddr *sa; struct sockaddr_in *sin; struct route sro; int error; KASSERT(laddr != NULL, ("%s: laddr NULL", __func__)); error = 0; ia = NULL; bzero(&sro, sizeof(sro)); sin = (struct sockaddr_in *)&sro.ro_dst; sin->sin_family = AF_INET; sin->sin_len = sizeof(struct sockaddr_in); sin->sin_addr.s_addr = faddr->s_addr; /* * If route is known our src addr is taken from the i/f, * else punt. * * Find out route to destination. */ if ((inp->inp_socket->so_options & SO_DONTROUTE) == 0) in_rtalloc_ign(&sro, 0, inp->inp_inc.inc_fibnum); /* * If we found a route, use the address corresponding to * the outgoing interface. * * Otherwise assume faddr is reachable on a directly connected * network and try to find a corresponding interface to take * the source address from. 
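 * For a jailed socket the chosen interface address is only used if the
 * prison is allowed to use it; otherwise the interface's address list
 * is scanned for one belonging to the prison and, failing that, the
 * jail's 'default' address is returned via prison_get_ip4().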
*/ if (sro.ro_rt == NULL || sro.ro_rt->rt_ifp == NULL) { struct ifnet *ifp; ia = ifatoia(ifa_ifwithdstaddr((struct sockaddr *)sin)); if (ia == NULL) ia = ifatoia(ifa_ifwithnet((struct sockaddr *)sin)); if (ia == NULL) { error = ENETUNREACH; goto done; } - if (cred == NULL || !jailed(cred)) { + if (cred == NULL || !prison_flag(cred, PR_IP4)) { laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } ifp = ia->ia_ifp; ia = NULL; IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { sa = ifa->ifa_addr; if (sa->sa_family != AF_INET) continue; sin = (struct sockaddr_in *)sa; if (prison_check_ip4(cred, &sin->sin_addr) == 0) { ia = (struct in_ifaddr *)ifa; break; } } if (ia != NULL) { laddr->s_addr = ia->ia_addr.sin_addr.s_addr; IF_ADDR_UNLOCK(ifp); goto done; } IF_ADDR_UNLOCK(ifp); /* 3. As a last resort return the 'default' jail address. */ error = prison_get_ip4(cred, laddr); goto done; } /* * If the outgoing interface on the route found is not * a loopback interface, use the address from that interface. * In case of jails do those three steps: * 1. check if the interface address belongs to the jail. If so use it. * 2. check if we have any address on the outgoing interface * belonging to this jail. If so use it. * 3. as a last resort return the 'default' jail address. */ if ((sro.ro_rt->rt_ifp->if_flags & IFF_LOOPBACK) == 0) { struct ifnet *ifp; /* If not jailed, use the default returned. */ - if (cred == NULL || !jailed(cred)) { + if (cred == NULL || !prison_flag(cred, PR_IP4)) { ia = (struct in_ifaddr *)sro.ro_rt->rt_ifa; laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } /* Jailed. */ /* 1. Check if the iface address belongs to the jail. */ sin = (struct sockaddr_in *)sro.ro_rt->rt_ifa->ifa_addr; if (prison_check_ip4(cred, &sin->sin_addr) == 0) { ia = (struct in_ifaddr *)sro.ro_rt->rt_ifa; laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } /* * 2. Check if we have any address on the outgoing interface * belonging to this jail. */ ifp = sro.ro_rt->rt_ifp; IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { sa = ifa->ifa_addr; if (sa->sa_family != AF_INET) continue; sin = (struct sockaddr_in *)sa; if (prison_check_ip4(cred, &sin->sin_addr) == 0) { ia = (struct in_ifaddr *)ifa; break; } } if (ia != NULL) { laddr->s_addr = ia->ia_addr.sin_addr.s_addr; IF_ADDR_UNLOCK(ifp); goto done; } IF_ADDR_UNLOCK(ifp); /* 3. As a last resort return the 'default' jail address. */ error = prison_get_ip4(cred, laddr); goto done; } /* * The outgoing interface is marked with 'loopback net', so a route * to ourselves is here. * Try to find the interface of the destination address and then * take the address from there. That interface is not necessarily * a loopback interface. * In case of jails, check that it is an address of the jail * and if we cannot find, fall back to the 'default' jail address. */ if ((sro.ro_rt->rt_ifp->if_flags & IFF_LOOPBACK) != 0) { struct sockaddr_in sain; bzero(&sain, sizeof(struct sockaddr_in)); sain.sin_family = AF_INET; sain.sin_len = sizeof(struct sockaddr_in); sain.sin_addr.s_addr = faddr->s_addr; ia = ifatoia(ifa_ifwithdstaddr(sintosa(&sain))); if (ia == NULL) ia = ifatoia(ifa_ifwithnet(sintosa(&sain))); - if (cred == NULL || !jailed(cred)) { + if (cred == NULL || !prison_flag(cred, PR_IP4)) { #if __FreeBSD_version < 800000 if (ia == NULL) ia = (struct in_ifaddr *)sro.ro_rt->rt_ifa; #endif if (ia == NULL) { error = ENETUNREACH; goto done; } laddr->s_addr = ia->ia_addr.sin_addr.s_addr; goto done; } /* Jailed. 
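 * Walk the addresses of the interface that reaches the destination and
 * pick the first one this prison may use; if none is found, fall back
 * to the jail's 'default' address below.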
*/ if (ia != NULL) { struct ifnet *ifp; ifp = ia->ia_ifp; ia = NULL; IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { sa = ifa->ifa_addr; if (sa->sa_family != AF_INET) continue; sin = (struct sockaddr_in *)sa; if (prison_check_ip4(cred, &sin->sin_addr) == 0) { ia = (struct in_ifaddr *)ifa; break; } } if (ia != NULL) { laddr->s_addr = ia->ia_addr.sin_addr.s_addr; IF_ADDR_UNLOCK(ifp); goto done; } IF_ADDR_UNLOCK(ifp); } /* 3. As a last resort return the 'default' jail address. */ error = prison_get_ip4(cred, laddr); goto done; } done: if (sro.ro_rt != NULL) RTFREE(sro.ro_rt); return (error); } /* * Set up for a connect from a socket to the specified address. * On entry, *laddrp and *lportp should contain the current local * address and port for the PCB; these are updated to the values * that should be placed in inp_laddr and inp_lport to complete * the connect. * * On success, *faddrp and *fportp will be set to the remote address * and port. These are not updated in the error case. * * If the operation fails because the connection already exists, * *oinpp will be set to the PCB of that connection so that the * caller can decide to override it. In all other cases, *oinpp * is set to NULL. */ int in_pcbconnect_setup(struct inpcb *inp, struct sockaddr *nam, in_addr_t *laddrp, u_short *lportp, in_addr_t *faddrp, u_short *fportp, struct inpcb **oinpp, struct ucred *cred) { INIT_VNET_INET(inp->inp_vnet); struct sockaddr_in *sin = (struct sockaddr_in *)nam; struct in_ifaddr *ia; struct inpcb *oinp; struct in_addr laddr, faddr; u_short lport, fport; int error; /* * Because a global state change doesn't actually occur here, a read * lock is sufficient. */ INP_INFO_LOCK_ASSERT(inp->inp_pcbinfo); INP_LOCK_ASSERT(inp); if (oinpp != NULL) *oinpp = NULL; if (nam->sa_len != sizeof (*sin)) return (EINVAL); if (sin->sin_family != AF_INET) return (EAFNOSUPPORT); if (sin->sin_port == 0) return (EADDRNOTAVAIL); laddr.s_addr = *laddrp; lport = *lportp; faddr = sin->sin_addr; fport = sin->sin_port; if (!TAILQ_EMPTY(&V_in_ifaddrhead)) { /* * If the destination address is INADDR_ANY, * use the primary local address. * If the supplied address is INADDR_BROADCAST, * and the primary interface supports broadcast, * choose the broadcast address for that interface. */ if (faddr.s_addr == INADDR_ANY) { faddr = IA_SIN(TAILQ_FIRST(&V_in_ifaddrhead))->sin_addr; if (cred != NULL && (error = prison_get_ip4(cred, &faddr)) != 0) return (error); } else if (faddr.s_addr == (u_long)INADDR_BROADCAST && (TAILQ_FIRST(&V_in_ifaddrhead)->ia_ifp->if_flags & IFF_BROADCAST)) faddr = satosin(&TAILQ_FIRST( &V_in_ifaddrhead)->ia_broadaddr)->sin_addr; } if (laddr.s_addr == INADDR_ANY) { error = in_pcbladdr(inp, &faddr, &laddr, cred); if (error) return (error); /* * If the destination address is multicast and an outgoing * interface has been set as a multicast option, use the * address of that interface as our source address. 
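 * The interface comes from the IP_MULTICAST_IF socket option; if it has
 * no IPv4 address configured, the connect fails with EADDRNOTAVAIL.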
*/ if (IN_MULTICAST(ntohl(faddr.s_addr)) && inp->inp_moptions != NULL) { struct ip_moptions *imo; struct ifnet *ifp; imo = inp->inp_moptions; if (imo->imo_multicast_ifp != NULL) { ifp = imo->imo_multicast_ifp; TAILQ_FOREACH(ia, &V_in_ifaddrhead, ia_link) if (ia->ia_ifp == ifp) break; if (ia == NULL) return (EADDRNOTAVAIL); laddr = ia->ia_addr.sin_addr; } } } oinp = in_pcblookup_hash(inp->inp_pcbinfo, faddr, fport, laddr, lport, 0, NULL); if (oinp != NULL) { if (oinpp != NULL) *oinpp = oinp; return (EADDRINUSE); } if (lport == 0) { error = in_pcbbind_setup(inp, NULL, &laddr.s_addr, &lport, cred); if (error) return (error); } *laddrp = laddr.s_addr; *lportp = lport; *faddrp = faddr.s_addr; *fportp = fport; return (0); } void in_pcbdisconnect(struct inpcb *inp) { INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo); INP_WLOCK_ASSERT(inp); inp->inp_faddr.s_addr = INADDR_ANY; inp->inp_fport = 0; in_pcbrehash(inp); } /* * in_pcbdetach() is responsibe for disassociating a socket from an inpcb. * For most protocols, this will be invoked immediately prior to calling * in_pcbfree(). However, with TCP the inpcb may significantly outlive the * socket, in which case in_pcbfree() is deferred. */ void in_pcbdetach(struct inpcb *inp) { KASSERT(inp->inp_socket != NULL, ("%s: inp_socket == NULL", __func__)); inp->inp_socket->so_pcb = NULL; inp->inp_socket = NULL; } /* * in_pcbfree_internal() frees an inpcb that has been detached from its * socket, and whose reference count has reached 0. It will also remove the * inpcb from any global lists it might remain on. */ static void in_pcbfree_internal(struct inpcb *inp) { struct inpcbinfo *ipi = inp->inp_pcbinfo; KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__)); KASSERT(inp->inp_refcount == 0, ("%s: refcount !0", __func__)); INP_INFO_WLOCK_ASSERT(ipi); INP_WLOCK_ASSERT(inp); #ifdef IPSEC if (inp->inp_sp != NULL) ipsec_delete_pcbpolicy(inp); #endif /* IPSEC */ inp->inp_gencnt = ++ipi->ipi_gencnt; in_pcbremlists(inp); #ifdef INET6 if (inp->inp_vflag & INP_IPV6PROTO) { ip6_freepcbopts(inp->in6p_outputopts); if (inp->in6p_moptions != NULL) ip6_freemoptions(inp->in6p_moptions); } #endif if (inp->inp_options) (void)m_free(inp->inp_options); if (inp->inp_moptions != NULL) inp_freemoptions(inp->inp_moptions); inp->inp_vflag = 0; crfree(inp->inp_cred); #ifdef MAC mac_inpcb_destroy(inp); #endif INP_WUNLOCK(inp); uma_zfree(ipi->ipi_zone, inp); } /* * in_pcbref() bumps the reference count on an inpcb in order to maintain * stability of an inpcb pointer despite the inpcb lock being released. This * is used in TCP when the inpcbinfo lock needs to be acquired or upgraded, * but where the inpcb lock is already held. * * While the inpcb will not be freed, releasing the inpcb lock means that the * connection's state may change, so the caller should be careful to * revalidate any cached state on reacquiring the lock. Drop the reference * using in_pcbrele(). */ void in_pcbref(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); KASSERT(inp->inp_refcount > 0, ("%s: refcount 0", __func__)); inp->inp_refcount++; } /* * Drop a refcount on an inpcb elevated using in_pcbref(); because a call to * in_pcbfree() may have been made between in_pcbref() and in_pcbrele(), we * return a flag indicating whether or not the inpcb remains valid. If it is * valid, we return with the inpcb lock held. 
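 *
 * A typical consumer therefore looks roughly like this (an illustrative
 * sketch only, not copied from any particular caller):
 *
 *	pcbinfo = inp->inp_pcbinfo;
 *	in_pcbref(inp);
 *	INP_WUNLOCK(inp);
 *	INP_INFO_WLOCK(pcbinfo);
 *	INP_WLOCK(inp);
 *	if (in_pcbrele(inp)) {
 *		INP_INFO_WUNLOCK(pcbinfo);	the inpcb has been freed
 *		return;
 *	}
 *	... inpcb still valid, revalidate any cached state ...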
*/ int in_pcbrele(struct inpcb *inp) { #ifdef INVARIANTS struct inpcbinfo *ipi = inp->inp_pcbinfo; #endif KASSERT(inp->inp_refcount > 0, ("%s: refcount 0", __func__)); INP_INFO_WLOCK_ASSERT(ipi); INP_WLOCK_ASSERT(inp); inp->inp_refcount--; if (inp->inp_refcount > 0) return (0); in_pcbfree_internal(inp); return (1); } /* * Unconditionally schedule an inpcb to be freed by decrementing its * reference count, which should occur only after the inpcb has been detached * from its socket. If another thread holds a temporary reference (acquired * using in_pcbref()) then the free is deferred until that reference is * released using in_pcbrele(), but the inpcb is still unlocked. */ void in_pcbfree(struct inpcb *inp) { #ifdef INVARIANTS struct inpcbinfo *ipi = inp->inp_pcbinfo; #endif KASSERT(inp->inp_socket == NULL, ("%s: inp_socket != NULL", __func__)); INP_INFO_WLOCK_ASSERT(ipi); INP_WLOCK_ASSERT(inp); if (!in_pcbrele(inp)) INP_WUNLOCK(inp); } /* * in_pcbdrop() removes an inpcb from hashed lists, releasing its address and * port reservation, and preventing it from being returned by inpcb lookups. * * It is used by TCP to mark an inpcb as unused and avoid future packet * delivery or event notification when a socket remains open but TCP has * closed. This might occur as a result of a shutdown()-initiated TCP close * or a RST on the wire, and allows the port binding to be reused while still * maintaining the invariant that so_pcb always points to a valid inpcb until * in_pcbdetach(). * * XXXRW: An inp_lport of 0 is used to indicate that the inpcb is not on hash * lists, but can lead to confusing netstat output, as open sockets with * closed TCP connections will no longer appear to have their bound port * number. An explicit flag would be better, as it would allow us to leave * the port number intact after the connection is dropped. * * XXXRW: Possibly in_pcbdrop() should also prevent future notifications by * in_pcbnotifyall() and in_pcbpurgeif0()? */ void in_pcbdrop(struct inpcb *inp) { INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo); INP_WLOCK_ASSERT(inp); inp->inp_flags |= INP_DROPPED; if (inp->inp_flags & INP_INHASHLIST) { struct inpcbport *phd = inp->inp_phd; LIST_REMOVE(inp, inp_hash); LIST_REMOVE(inp, inp_portlist); if (LIST_FIRST(&phd->phd_pcblist) == NULL) { LIST_REMOVE(phd, phd_hash); free(phd, M_PCB); } inp->inp_flags &= ~INP_INHASHLIST; } } /* * Common routines to return the socket addresses associated with inpcbs. 
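 * in_sockaddr() allocates its sockaddr_in from the M_SONAME malloc type
 * with M_WAITOK, so it never returns NULL; in_getsockaddr() and
 * in_getpeeraddr() snapshot the address and port under the inpcb read
 * lock before formatting the result.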
*/ struct sockaddr * in_sockaddr(in_port_t port, struct in_addr *addr_p) { struct sockaddr_in *sin; sin = malloc(sizeof *sin, M_SONAME, M_WAITOK | M_ZERO); sin->sin_family = AF_INET; sin->sin_len = sizeof(*sin); sin->sin_addr = *addr_p; sin->sin_port = port; return (struct sockaddr *)sin; } int in_getsockaddr(struct socket *so, struct sockaddr **nam) { struct inpcb *inp; struct in_addr addr; in_port_t port; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in_getsockaddr: inp == NULL")); INP_RLOCK(inp); port = inp->inp_lport; addr = inp->inp_laddr; INP_RUNLOCK(inp); *nam = in_sockaddr(port, &addr); return 0; } int in_getpeeraddr(struct socket *so, struct sockaddr **nam) { struct inpcb *inp; struct in_addr addr; in_port_t port; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in_getpeeraddr: inp == NULL")); INP_RLOCK(inp); port = inp->inp_fport; addr = inp->inp_faddr; INP_RUNLOCK(inp); *nam = in_sockaddr(port, &addr); return 0; } void in_pcbnotifyall(struct inpcbinfo *pcbinfo, struct in_addr faddr, int errno, struct inpcb *(*notify)(struct inpcb *, int)) { struct inpcb *inp, *inp_temp; INP_INFO_WLOCK(pcbinfo); LIST_FOREACH_SAFE(inp, pcbinfo->ipi_listhead, inp_list, inp_temp) { INP_WLOCK(inp); #ifdef INET6 if ((inp->inp_vflag & INP_IPV4) == 0) { INP_WUNLOCK(inp); continue; } #endif if (inp->inp_faddr.s_addr != faddr.s_addr || inp->inp_socket == NULL) { INP_WUNLOCK(inp); continue; } if ((*notify)(inp, errno)) INP_WUNLOCK(inp); } INP_INFO_WUNLOCK(pcbinfo); } void in_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp) { struct inpcb *inp; struct ip_moptions *imo; int i, gap; INP_INFO_RLOCK(pcbinfo); LIST_FOREACH(inp, pcbinfo->ipi_listhead, inp_list) { INP_WLOCK(inp); imo = inp->inp_moptions; if ((inp->inp_vflag & INP_IPV4) && imo != NULL) { /* * Unselect the outgoing interface if it is being * detached. */ if (imo->imo_multicast_ifp == ifp) imo->imo_multicast_ifp = NULL; /* * Drop multicast group membership if we joined * through the interface being detached. */ for (i = 0, gap = 0; i < imo->imo_num_memberships; i++) { if (imo->imo_membership[i]->inm_ifp == ifp) { in_delmulti(imo->imo_membership[i]); gap++; } else if (gap != 0) imo->imo_membership[i - gap] = imo->imo_membership[i]; } imo->imo_num_memberships -= gap; } INP_WUNLOCK(inp); } INP_INFO_RUNLOCK(pcbinfo); } /* * Lookup a PCB based on the local address and port. */ #define INP_LOOKUP_MAPPED_PCB_COST 3 struct inpcb * in_pcblookup_local(struct inpcbinfo *pcbinfo, struct in_addr laddr, u_short lport, int wild_okay, struct ucred *cred) { struct inpcb *inp; #ifdef INET6 int matchwild = 3 + INP_LOOKUP_MAPPED_PCB_COST; #else int matchwild = 3; #endif int wildcard; INP_INFO_LOCK_ASSERT(pcbinfo); if (!wild_okay) { struct inpcbhead *head; /* * Look for an unconnected (wildcard foreign addr) PCB that * matches the local address and port we're looking for. */ head = &pcbinfo->ipi_hashbase[INP_PCBHASH(INADDR_ANY, lport, 0, pcbinfo->ipi_hashmask)]; LIST_FOREACH(inp, head, inp_hash) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr == INADDR_ANY && inp->inp_laddr.s_addr == laddr.s_addr && inp->inp_lport == lport) { /* * Found? */ if (cred == NULL || - inp->inp_cred->cr_prison == cred->cr_prison) + prison_equal_ip4(cred->cr_prison, + inp->inp_cred->cr_prison)) return (inp); } } /* * Not found. */ return (NULL); } else { struct inpcbporthead *porthash; struct inpcbport *phd; struct inpcb *match = NULL; /* * Best fit PCB lookup. 
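 * Each wildcard component of a candidate PCB adds one to its cost
 * (mapped IPv6 sockets pay an extra INP_LOOKUP_MAPPED_PCB_COST) and the
 * candidate with the lowest cost wins.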
* * First see if this local port is in use by looking on the * port hash list. */ porthash = &pcbinfo->ipi_porthashbase[INP_PCBPORTHASH(lport, pcbinfo->ipi_porthashmask)]; LIST_FOREACH(phd, porthash, phd_hash) { if (phd->phd_port == lport) break; } if (phd != NULL) { /* * Port is in use by one or more PCBs. Look for best * fit. */ LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) { wildcard = 0; if (cred != NULL && - inp->inp_cred->cr_prison != cred->cr_prison) + !prison_equal_ip4(inp->inp_cred->cr_prison, + cred->cr_prison)) continue; #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; /* * We never select the PCB that has * INP_IPV6 flag and is bound to :: if * we have another PCB which is bound * to 0.0.0.0. If a PCB has the * INP_IPV6 flag, then we set its cost * higher than IPv4 only PCBs. * * Note that the case only happens * when a socket is bound to ::, under * the condition that the use of the * mapped address is allowed. */ if ((inp->inp_vflag & INP_IPV6) != 0) wildcard += INP_LOOKUP_MAPPED_PCB_COST; #endif if (inp->inp_faddr.s_addr != INADDR_ANY) wildcard++; if (inp->inp_laddr.s_addr != INADDR_ANY) { if (laddr.s_addr == INADDR_ANY) wildcard++; else if (inp->inp_laddr.s_addr != laddr.s_addr) continue; } else { if (laddr.s_addr != INADDR_ANY) wildcard++; } if (wildcard < matchwild) { match = inp; matchwild = wildcard; if (matchwild == 0) break; } } } return (match); } } #undef INP_LOOKUP_MAPPED_PCB_COST /* * Lookup PCB in hash list. */ struct inpcb * in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr, u_int fport_arg, struct in_addr laddr, u_int lport_arg, int wildcard, struct ifnet *ifp) { struct inpcbhead *head; struct inpcb *inp, *tmpinp; u_short fport = fport_arg, lport = lport_arg; INP_INFO_LOCK_ASSERT(pcbinfo); /* * First look for an exact match. */ tmpinp = NULL; head = &pcbinfo->ipi_hashbase[INP_PCBHASH(faddr.s_addr, lport, fport, pcbinfo->ipi_hashmask)]; LIST_FOREACH(inp, head, inp_hash) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr == faddr.s_addr && inp->inp_laddr.s_addr == laddr.s_addr && inp->inp_fport == fport && inp->inp_lport == lport) { /* * XXX We should be able to directly return * the inp here, without any checks. * Well unless both bound with SO_REUSEPORT? */ - if (jailed(inp->inp_cred)) + if (prison_flag(inp->inp_cred, PR_IP4)) return (inp); if (tmpinp == NULL) tmpinp = inp; } } if (tmpinp != NULL) return (tmpinp); /* * Then look for a wildcard match, if requested. */ if (wildcard == INPLOOKUP_WILDCARD) { struct inpcb *local_wild = NULL, *local_exact = NULL; #ifdef INET6 struct inpcb *local_wild_mapped = NULL; #endif struct inpcb *jail_wild = NULL; int injail; /* * Order of socket selection - we always prefer jails. * 1. jailed, non-wild. * 2. jailed, wild. * 3. non-jailed, non-wild. * 4. non-jailed, wild. 
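 * An exact jailed match is returned as soon as it is found; the other
 * candidates are remembered while the hash chain is walked and the best
 * survivor is picked in the order above afterwards.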
*/ head = &pcbinfo->ipi_hashbase[INP_PCBHASH(INADDR_ANY, lport, 0, pcbinfo->ipi_hashmask)]; LIST_FOREACH(inp, head, inp_hash) { #ifdef INET6 /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_faddr.s_addr != INADDR_ANY || inp->inp_lport != lport) continue; /* XXX inp locking */ if (ifp && ifp->if_type == IFT_FAITH && (inp->inp_flags & INP_FAITH) == 0) continue; - injail = jailed(inp->inp_cred); + injail = prison_flag(inp->inp_cred, PR_IP4); if (injail) { if (prison_check_ip4(inp->inp_cred, &laddr) != 0) continue; } else { if (local_exact != NULL) continue; } if (inp->inp_laddr.s_addr == laddr.s_addr) { if (injail) return (inp); else local_exact = inp; } else if (inp->inp_laddr.s_addr == INADDR_ANY) { #ifdef INET6 /* XXX inp locking, NULL check */ if (inp->inp_vflag & INP_IPV6PROTO) local_wild_mapped = inp; else #endif /* INET6 */ if (injail) jail_wild = inp; else local_wild = inp; } } /* LIST_FOREACH */ if (jail_wild != NULL) return (jail_wild); if (local_exact != NULL) return (local_exact); if (local_wild != NULL) return (local_wild); #ifdef INET6 if (local_wild_mapped != NULL) return (local_wild_mapped); #endif /* defined(INET6) */ } /* if (wildcard == INPLOOKUP_WILDCARD) */ return (NULL); } /* * Insert PCB onto various hash lists. */ int in_pcbinshash(struct inpcb *inp) { struct inpcbhead *pcbhash; struct inpcbporthead *pcbporthash; struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; struct inpcbport *phd; u_int32_t hashkey_faddr; INP_INFO_WLOCK_ASSERT(pcbinfo); INP_WLOCK_ASSERT(inp); KASSERT((inp->inp_flags & INP_INHASHLIST) == 0, ("in_pcbinshash: INP_INHASHLIST")); #ifdef INET6 if (inp->inp_vflag & INP_IPV6) hashkey_faddr = inp->in6p_faddr.s6_addr32[3] /* XXX */; else #endif /* INET6 */ hashkey_faddr = inp->inp_faddr.s_addr; pcbhash = &pcbinfo->ipi_hashbase[INP_PCBHASH(hashkey_faddr, inp->inp_lport, inp->inp_fport, pcbinfo->ipi_hashmask)]; pcbporthash = &pcbinfo->ipi_porthashbase[ INP_PCBPORTHASH(inp->inp_lport, pcbinfo->ipi_porthashmask)]; /* * Go through port list and look for a head for this lport. */ LIST_FOREACH(phd, pcbporthash, phd_hash) { if (phd->phd_port == inp->inp_lport) break; } /* * If none exists, malloc one and tack it on. */ if (phd == NULL) { phd = malloc(sizeof(struct inpcbport), M_PCB, M_NOWAIT); if (phd == NULL) { return (ENOBUFS); /* XXX */ } phd->phd_port = inp->inp_lport; LIST_INIT(&phd->phd_pcblist); LIST_INSERT_HEAD(pcbporthash, phd, phd_hash); } inp->inp_phd = phd; LIST_INSERT_HEAD(&phd->phd_pcblist, inp, inp_portlist); LIST_INSERT_HEAD(pcbhash, inp, inp_hash); inp->inp_flags |= INP_INHASHLIST; return (0); } /* * Move PCB to the proper hash bucket when { faddr, fport } have been * changed. NOTE: This does not handle the case of the lport changing (the * hashed port list would have to be updated as well), so the lport must * not change after in_pcbinshash() has been called. */ void in_pcbrehash(struct inpcb *inp) { struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; struct inpcbhead *head; u_int32_t hashkey_faddr; INP_INFO_WLOCK_ASSERT(pcbinfo); INP_WLOCK_ASSERT(inp); KASSERT(inp->inp_flags & INP_INHASHLIST, ("in_pcbrehash: !INP_INHASHLIST")); #ifdef INET6 if (inp->inp_vflag & INP_IPV6) hashkey_faddr = inp->in6p_faddr.s6_addr32[3] /* XXX */; else #endif /* INET6 */ hashkey_faddr = inp->inp_faddr.s_addr; head = &pcbinfo->ipi_hashbase[INP_PCBHASH(hashkey_faddr, inp->inp_lport, inp->inp_fport, pcbinfo->ipi_hashmask)]; LIST_REMOVE(inp, inp_hash); LIST_INSERT_HEAD(head, inp, inp_hash); } /* * Remove PCB from various lists. 
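 * Besides unlinking the PCB from the global inpcb list, this drops it
 * from its hash and port lists and frees the inpcbport head once no
 * other PCB shares the local port.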
*/ static void in_pcbremlists(struct inpcb *inp) { struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; INP_INFO_WLOCK_ASSERT(pcbinfo); INP_WLOCK_ASSERT(inp); inp->inp_gencnt = ++pcbinfo->ipi_gencnt; if (inp->inp_flags & INP_INHASHLIST) { struct inpcbport *phd = inp->inp_phd; LIST_REMOVE(inp, inp_hash); LIST_REMOVE(inp, inp_portlist); if (LIST_FIRST(&phd->phd_pcblist) == NULL) { LIST_REMOVE(phd, phd_hash); free(phd, M_PCB); } inp->inp_flags &= ~INP_INHASHLIST; } LIST_REMOVE(inp, inp_list); pcbinfo->ipi_count--; } /* * A set label operation has occurred at the socket layer, propagate the * label change into the in_pcb for the socket. */ void in_pcbsosetlabel(struct socket *so) { #ifdef MAC struct inpcb *inp; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in_pcbsosetlabel: so->so_pcb == NULL")); INP_WLOCK(inp); SOCK_LOCK(so); mac_inpcb_sosetlabel(so, inp); SOCK_UNLOCK(so); INP_WUNLOCK(inp); #endif } /* * ipport_tick runs once per second, determining if random port allocation * should be continued. If more than ipport_randomcps ports have been * allocated in the last second, then we return to sequential port * allocation. We return to random allocation only once we drop below * ipport_randomcps for at least ipport_randomtime seconds. */ void ipport_tick(void *xtp) { VNET_ITERATOR_DECL(vnet_iter); VNET_LIST_RLOCK(); VNET_FOREACH(vnet_iter) { CURVNET_SET(vnet_iter); /* XXX appease INVARIANTS here */ INIT_VNET_INET(vnet_iter); if (V_ipport_tcpallocs <= V_ipport_tcplastcount + V_ipport_randomcps) { if (V_ipport_stoprandom > 0) V_ipport_stoprandom--; } else V_ipport_stoprandom = V_ipport_randomtime; V_ipport_tcplastcount = V_ipport_tcpallocs; CURVNET_RESTORE(); } VNET_LIST_RUNLOCK(); callout_reset(&ipport_tick_callout, hz, ipport_tick, NULL); } void inp_wlock(struct inpcb *inp) { INP_WLOCK(inp); } void inp_wunlock(struct inpcb *inp) { INP_WUNLOCK(inp); } void inp_rlock(struct inpcb *inp) { INP_RLOCK(inp); } void inp_runlock(struct inpcb *inp) { INP_RUNLOCK(inp); } #ifdef INVARIANTS void inp_lock_assert(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); } void inp_unlock_assert(struct inpcb *inp) { INP_UNLOCK_ASSERT(inp); } #endif void inp_apply_all(void (*func)(struct inpcb *, void *), void *arg) { INIT_VNET_INET(curvnet); struct inpcb *inp; INP_INFO_RLOCK(&V_tcbinfo); LIST_FOREACH(inp, V_tcbinfo.ipi_listhead, inp_list) { INP_WLOCK(inp); func(inp, arg); INP_WUNLOCK(inp); } INP_INFO_RUNLOCK(&V_tcbinfo); } struct socket * inp_inpcbtosocket(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); return (inp->inp_socket); } struct tcpcb * inp_inpcbtotcpcb(struct inpcb *inp) { INP_WLOCK_ASSERT(inp); return ((struct tcpcb *)inp->inp_ppcb); } int inp_ip_tos_get(const struct inpcb *inp) { return (inp->inp_ip_tos); } void inp_ip_tos_set(struct inpcb *inp, int val) { inp->inp_ip_tos = val; } void inp_4tuple_get(struct inpcb *inp, uint32_t *laddr, uint16_t *lp, uint32_t *faddr, uint16_t *fp) { INP_LOCK_ASSERT(inp); *laddr = inp->inp_laddr.s_addr; *faddr = inp->inp_faddr.s_addr; *lp = inp->inp_lport; *fp = inp->inp_fport; } struct inpcb * so_sotoinpcb(struct socket *so) { return (sotoinpcb(so)); } struct tcpcb * so_sototcpcb(struct socket *so) { return (sototcpcb(so)); } #ifdef DDB static void db_print_indent(int indent) { int i; for (i = 0; i < indent; i++) db_printf(" "); } static void db_print_inconninfo(struct in_conninfo *inc, const char *name, int indent) { char faddr_str[48], laddr_str[48]; db_print_indent(indent); db_printf("%s at %p\n", name, inc); indent += 2; #ifdef INET6 if (inc->inc_flags & INC_ISIPV6) { /* IPv6. 
*/ ip6_sprintf(laddr_str, &inc->inc6_laddr); ip6_sprintf(faddr_str, &inc->inc6_faddr); } else { #endif /* IPv4. */ inet_ntoa_r(inc->inc_laddr, laddr_str); inet_ntoa_r(inc->inc_faddr, faddr_str); #ifdef INET6 } #endif db_print_indent(indent); db_printf("inc_laddr %s inc_lport %u\n", laddr_str, ntohs(inc->inc_lport)); db_print_indent(indent); db_printf("inc_faddr %s inc_fport %u\n", faddr_str, ntohs(inc->inc_fport)); } static void db_print_inpflags(int inp_flags) { int comma; comma = 0; if (inp_flags & INP_RECVOPTS) { db_printf("%sINP_RECVOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVRETOPTS) { db_printf("%sINP_RECVRETOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVDSTADDR) { db_printf("%sINP_RECVDSTADDR", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_HDRINCL) { db_printf("%sINP_HDRINCL", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_HIGHPORT) { db_printf("%sINP_HIGHPORT", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_LOWPORT) { db_printf("%sINP_LOWPORT", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_ANONPORT) { db_printf("%sINP_ANONPORT", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVIF) { db_printf("%sINP_RECVIF", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_MTUDISC) { db_printf("%sINP_MTUDISC", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_FAITH) { db_printf("%sINP_FAITH", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_RECVTTL) { db_printf("%sINP_RECVTTL", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_DONTFRAG) { db_printf("%sINP_DONTFRAG", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_IPV6_V6ONLY) { db_printf("%sIN6P_IPV6_V6ONLY", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_PKTINFO) { db_printf("%sIN6P_PKTINFO", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_HOPLIMIT) { db_printf("%sIN6P_HOPLIMIT", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_HOPOPTS) { db_printf("%sIN6P_HOPOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_DSTOPTS) { db_printf("%sIN6P_DSTOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_RTHDR) { db_printf("%sIN6P_RTHDR", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_RTHDRDSTOPTS) { db_printf("%sIN6P_RTHDRDSTOPTS", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_TCLASS) { db_printf("%sIN6P_TCLASS", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_AUTOFLOWLABEL) { db_printf("%sIN6P_AUTOFLOWLABEL", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_TIMEWAIT) { db_printf("%sINP_TIMEWAIT", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_ONESBCAST) { db_printf("%sINP_ONESBCAST", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_DROPPED) { db_printf("%sINP_DROPPED", comma ? ", " : ""); comma = 1; } if (inp_flags & INP_SOCKREF) { db_printf("%sINP_SOCKREF", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_RFC2292) { db_printf("%sIN6P_RFC2292", comma ? ", " : ""); comma = 1; } if (inp_flags & IN6P_MTU) { db_printf("IN6P_MTU%s", comma ? ", " : ""); comma = 1; } } static void db_print_inpvflag(u_char inp_vflag) { int comma; comma = 0; if (inp_vflag & INP_IPV4) { db_printf("%sINP_IPV4", comma ? ", " : ""); comma = 1; } if (inp_vflag & INP_IPV6) { db_printf("%sINP_IPV6", comma ? ", " : ""); comma = 1; } if (inp_vflag & INP_IPV6PROTO) { db_printf("%sINP_IPV6PROTO", comma ? 
", " : ""); comma = 1; } } static void db_print_inpcb(struct inpcb *inp, const char *name, int indent) { db_print_indent(indent); db_printf("%s at %p\n", name, inp); indent += 2; db_print_indent(indent); db_printf("inp_flow: 0x%x\n", inp->inp_flow); db_print_inconninfo(&inp->inp_inc, "inp_conninfo", indent); db_print_indent(indent); db_printf("inp_ppcb: %p inp_pcbinfo: %p inp_socket: %p\n", inp->inp_ppcb, inp->inp_pcbinfo, inp->inp_socket); db_print_indent(indent); db_printf("inp_label: %p inp_flags: 0x%x (", inp->inp_label, inp->inp_flags); db_print_inpflags(inp->inp_flags); db_printf(")\n"); db_print_indent(indent); db_printf("inp_sp: %p inp_vflag: 0x%x (", inp->inp_sp, inp->inp_vflag); db_print_inpvflag(inp->inp_vflag); db_printf(")\n"); db_print_indent(indent); db_printf("inp_ip_ttl: %d inp_ip_p: %d inp_ip_minttl: %d\n", inp->inp_ip_ttl, inp->inp_ip_p, inp->inp_ip_minttl); db_print_indent(indent); #ifdef INET6 if (inp->inp_vflag & INP_IPV6) { db_printf("in6p_options: %p in6p_outputopts: %p " "in6p_moptions: %p\n", inp->in6p_options, inp->in6p_outputopts, inp->in6p_moptions); db_printf("in6p_icmp6filt: %p in6p_cksum %d " "in6p_hops %u\n", inp->in6p_icmp6filt, inp->in6p_cksum, inp->in6p_hops); } else #endif { db_printf("inp_ip_tos: %d inp_ip_options: %p " "inp_ip_moptions: %p\n", inp->inp_ip_tos, inp->inp_options, inp->inp_moptions); } db_print_indent(indent); db_printf("inp_phd: %p inp_gencnt: %ju\n", inp->inp_phd, (uintmax_t)inp->inp_gencnt); } DB_SHOW_COMMAND(inpcb, db_show_inpcb) { struct inpcb *inp; if (!have_addr) { db_printf("usage: show inpcb \n"); return; } inp = (struct inpcb *)addr; db_print_inpcb(inp, "inpcb", 0); } #endif Index: head/sys/netinet/udp_usrreq.c =================================================================== --- head/sys/netinet/udp_usrreq.c (revision 192894) +++ head/sys/netinet/udp_usrreq.c (revision 192895) @@ -1,1365 +1,1365 @@ /*- * Copyright (c) 1982, 1986, 1988, 1990, 1993, 1995 * The Regents of the University of California. * Copyright (c) 2008 Robert N. M. Watson * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * @(#)udp_usrreq.c 8.6 (Berkeley) 5/23/95 */ #include __FBSDID("$FreeBSD$"); #include "opt_ipfw.h" #include "opt_inet6.h" #include "opt_ipsec.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef INET6 #include #endif #include #include #include #include #ifdef INET6 #include #endif #include #include #include #ifdef IPSEC #include #endif #include #include /* * UDP protocol implementation. * Per RFC 768, August, 1980. */ #ifdef VIMAGE_GLOBALS int udp_blackhole; #endif /* * BSD 4.2 defaulted the udp checksum to be off. Turning off udp checksums * removes the only data integrity mechanism for packets and malformed * packets that would otherwise be discarded due to bad checksums, and may * cause problems (especially for NFS data blocks). */ static int udp_cksum = 1; SYSCTL_INT(_net_inet_udp, UDPCTL_CHECKSUM, checksum, CTLFLAG_RW, &udp_cksum, 0, "compute udp checksum"); int udp_log_in_vain = 0; SYSCTL_INT(_net_inet_udp, OID_AUTO, log_in_vain, CTLFLAG_RW, &udp_log_in_vain, 0, "Log all incoming UDP packets"); SYSCTL_V_INT(V_NET, vnet_inet, _net_inet_udp, OID_AUTO, blackhole, CTLFLAG_RW, udp_blackhole, 0, "Do not send port unreachables for refused connects"); u_long udp_sendspace = 9216; /* really max datagram size */ /* 40 1K datagrams */ SYSCTL_ULONG(_net_inet_udp, UDPCTL_MAXDGRAM, maxdgram, CTLFLAG_RW, &udp_sendspace, 0, "Maximum outgoing UDP datagram size"); u_long udp_recvspace = 40 * (1024 + #ifdef INET6 sizeof(struct sockaddr_in6) #else sizeof(struct sockaddr_in) #endif ); SYSCTL_ULONG(_net_inet_udp, UDPCTL_RECVSPACE, recvspace, CTLFLAG_RW, &udp_recvspace, 0, "Maximum space for incoming UDP datagrams"); #ifdef VIMAGE_GLOBALS struct inpcbhead udb; /* from udp_var.h */ struct inpcbinfo udbinfo; static uma_zone_t udpcb_zone; struct udpstat udpstat; /* from udp_var.h */ #endif #ifndef UDBHASHSIZE #define UDBHASHSIZE 128 #endif SYSCTL_V_STRUCT(V_NET, vnet_inet, _net_inet_udp, UDPCTL_STATS, stats, CTLFLAG_RW, udpstat, udpstat, "UDP statistics (struct udpstat, netinet/udp_var.h)"); static void udp_detach(struct socket *so); static int udp_output(struct inpcb *, struct mbuf *, struct sockaddr *, struct mbuf *, struct thread *); static void udp_zone_change(void *tag) { INIT_VNET_INET(curvnet); uma_zone_set_max(V_udbinfo.ipi_zone, maxsockets); uma_zone_set_max(V_udpcb_zone, maxsockets); } static int udp_inpcb_init(void *mem, int size, int flags) { struct inpcb *inp; inp = mem; INP_LOCK_INIT(inp, "inp", "udpinp"); return (0); } void udp_init(void) { INIT_VNET_INET(curvnet); V_udp_blackhole = 0; INP_INFO_LOCK_INIT(&V_udbinfo, "udp"); LIST_INIT(&V_udb); #ifdef VIMAGE V_udbinfo.ipi_vnet = curvnet; #endif V_udbinfo.ipi_listhead = &V_udb; V_udbinfo.ipi_hashbase = hashinit(UDBHASHSIZE, M_PCB, &V_udbinfo.ipi_hashmask); V_udbinfo.ipi_porthashbase = hashinit(UDBHASHSIZE, M_PCB, &V_udbinfo.ipi_porthashmask); V_udbinfo.ipi_zone = uma_zcreate("udp_inpcb", sizeof(struct inpcb), NULL, NULL, udp_inpcb_init, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); uma_zone_set_max(V_udbinfo.ipi_zone, maxsockets); V_udpcb_zone = uma_zcreate("udpcb", sizeof(struct udpcb), NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE); uma_zone_set_max(V_udpcb_zone, maxsockets); EVENTHANDLER_REGISTER(maxsockets_change, udp_zone_change, NULL, EVENTHANDLER_PRI_ANY); } int udp_newudpcb(struct inpcb *inp) { INIT_VNET_INET(curvnet); 
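	/*
	 * Allocate UDP's per-connection control block from V_udpcb_zone and
	 * hang it off inp_ppcb; udp_input() uses it to find any tunneling
	 * callback (u_tun_func) registered for the socket.
	 */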
struct udpcb *up; up = uma_zalloc(V_udpcb_zone, M_NOWAIT | M_ZERO); if (up == NULL) return (ENOBUFS); inp->inp_ppcb = up; return (0); } void udp_discardcb(struct udpcb *up) { INIT_VNET_INET(curvnet); uma_zfree(V_udpcb_zone, up); } /* * Subroutine of udp_input(), which appends the provided mbuf chain to the * passed pcb/socket. The caller must provide a sockaddr_in via udp_in that * contains the source address. If the socket ends up being an IPv6 socket, * udp_append() will convert to a sockaddr_in6 before passing the address * into the socket code. */ static void udp_append(struct inpcb *inp, struct ip *ip, struct mbuf *n, int off, struct sockaddr_in *udp_in) { struct sockaddr *append_sa; struct socket *so; struct mbuf *opts = 0; #ifdef INET6 struct sockaddr_in6 udp_in6; #endif INP_RLOCK_ASSERT(inp); #ifdef IPSEC /* Check AH/ESP integrity. */ if (ipsec4_in_reject(n, inp)) { INIT_VNET_IPSEC(curvnet); m_freem(n); V_ipsec4stat.in_polvio++; return; } #endif /* IPSEC */ #ifdef MAC if (mac_inpcb_check_deliver(inp, n) != 0) { m_freem(n); return; } #endif if (inp->inp_flags & INP_CONTROLOPTS || inp->inp_socket->so_options & (SO_TIMESTAMP | SO_BINTIME)) { #ifdef INET6 if (inp->inp_vflag & INP_IPV6) (void)ip6_savecontrol_v4(inp, n, &opts, NULL); else #endif ip_savecontrol(inp, &opts, ip, n); } #ifdef INET6 if (inp->inp_vflag & INP_IPV6) { bzero(&udp_in6, sizeof(udp_in6)); udp_in6.sin6_len = sizeof(udp_in6); udp_in6.sin6_family = AF_INET6; in6_sin_2_v4mapsin6(udp_in, &udp_in6); append_sa = (struct sockaddr *)&udp_in6; } else #endif append_sa = (struct sockaddr *)udp_in; m_adj(n, off); so = inp->inp_socket; SOCKBUF_LOCK(&so->so_rcv); if (sbappendaddr_locked(&so->so_rcv, append_sa, n, opts) == 0) { INIT_VNET_INET(so->so_vnet); SOCKBUF_UNLOCK(&so->so_rcv); m_freem(n); if (opts) m_freem(opts); UDPSTAT_INC(udps_fullsock); } else sorwakeup_locked(so); } void udp_input(struct mbuf *m, int off) { INIT_VNET_INET(curvnet); int iphlen = off; struct ip *ip; struct udphdr *uh; struct ifnet *ifp; struct inpcb *inp; struct udpcb *up; int len; struct ip save_ip; struct sockaddr_in udp_in; #ifdef IPFIREWALL_FORWARD struct m_tag *fwd_tag; #endif ifp = m->m_pkthdr.rcvif; UDPSTAT_INC(udps_ipackets); /* * Strip IP options, if any; should skip this, make available to * user, and use on returned packets, but we don't yet have a way to * check the checksum with options still present. */ if (iphlen > sizeof (struct ip)) { ip_stripoptions(m, (struct mbuf *)0); iphlen = sizeof(struct ip); } /* * Get IP and UDP header together in first mbuf. */ ip = mtod(m, struct ip *); if (m->m_len < iphlen + sizeof(struct udphdr)) { if ((m = m_pullup(m, iphlen + sizeof(struct udphdr))) == 0) { UDPSTAT_INC(udps_hdrops); return; } ip = mtod(m, struct ip *); } uh = (struct udphdr *)((caddr_t)ip + iphlen); /* * Destination port of 0 is illegal, based on RFC768. */ if (uh->uh_dport == 0) goto badunlocked; /* * Construct sockaddr format source address. Stuff source address * and datagram in user buffer. */ bzero(&udp_in, sizeof(udp_in)); udp_in.sin_len = sizeof(udp_in); udp_in.sin_family = AF_INET; udp_in.sin_port = uh->uh_sport; udp_in.sin_addr = ip->ip_src; /* * Make mbuf data length reflect UDP length. If not enough data to * reflect UDP length, drop. 
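 * uh_ulen covers the UDP header as well as the payload, so a value
 * shorter than sizeof(struct udphdr) or longer than the IP payload is
 * counted as udps_badlen and the packet is dropped.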
*/ len = ntohs((u_short)uh->uh_ulen); if (ip->ip_len != len) { if (len > ip->ip_len || len < sizeof(struct udphdr)) { UDPSTAT_INC(udps_badlen); goto badunlocked; } m_adj(m, len - ip->ip_len); /* ip->ip_len = len; */ } /* * Save a copy of the IP header in case we want restore it for * sending an ICMP error message in response. */ if (!V_udp_blackhole) save_ip = *ip; else memset(&save_ip, 0, sizeof(save_ip)); /* * Checksum extended UDP header and data. */ if (uh->uh_sum) { u_short uh_sum; if (m->m_pkthdr.csum_flags & CSUM_DATA_VALID) { if (m->m_pkthdr.csum_flags & CSUM_PSEUDO_HDR) uh_sum = m->m_pkthdr.csum_data; else uh_sum = in_pseudo(ip->ip_src.s_addr, ip->ip_dst.s_addr, htonl((u_short)len + m->m_pkthdr.csum_data + IPPROTO_UDP)); uh_sum ^= 0xffff; } else { char b[9]; bcopy(((struct ipovly *)ip)->ih_x1, b, 9); bzero(((struct ipovly *)ip)->ih_x1, 9); ((struct ipovly *)ip)->ih_len = uh->uh_ulen; uh_sum = in_cksum(m, len + sizeof (struct ip)); bcopy(b, ((struct ipovly *)ip)->ih_x1, 9); } if (uh_sum) { UDPSTAT_INC(udps_badsum); m_freem(m); return; } } else UDPSTAT_INC(udps_nosum); #ifdef IPFIREWALL_FORWARD /* * Grab info from PACKET_TAG_IPFORWARD tag prepended to the chain. */ fwd_tag = m_tag_find(m, PACKET_TAG_IPFORWARD, NULL); if (fwd_tag != NULL) { struct sockaddr_in *next_hop; /* * Do the hack. */ next_hop = (struct sockaddr_in *)(fwd_tag + 1); ip->ip_dst = next_hop->sin_addr; uh->uh_dport = ntohs(next_hop->sin_port); /* * Remove the tag from the packet. We don't need it anymore. */ m_tag_delete(m, fwd_tag); } #endif INP_INFO_RLOCK(&V_udbinfo); if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr)) || in_broadcast(ip->ip_dst, ifp)) { struct inpcb *last; struct ip_moptions *imo; last = NULL; LIST_FOREACH(inp, &V_udb, inp_list) { if (inp->inp_lport != uh->uh_dport) continue; #ifdef INET6 if ((inp->inp_vflag & INP_IPV4) == 0) continue; #endif if (inp->inp_laddr.s_addr != INADDR_ANY && inp->inp_laddr.s_addr != ip->ip_dst.s_addr) continue; if (inp->inp_faddr.s_addr != INADDR_ANY && inp->inp_faddr.s_addr != ip->ip_src.s_addr) continue; if (inp->inp_fport != 0 && inp->inp_fport != uh->uh_sport) continue; INP_RLOCK(inp); /* * Handle socket delivery policy for any-source * and source-specific multicast. [RFC3678] */ imo = inp->inp_moptions; if (IN_MULTICAST(ntohl(ip->ip_dst.s_addr)) && imo != NULL) { struct sockaddr_in group; int blocked; bzero(&group, sizeof(struct sockaddr_in)); group.sin_len = sizeof(struct sockaddr_in); group.sin_family = AF_INET; group.sin_addr = ip->ip_dst; blocked = imo_multi_filter(imo, ifp, (struct sockaddr *)&group, (struct sockaddr *)&udp_in); if (blocked != MCAST_PASS) { if (blocked == MCAST_NOTGMEMBER) IPSTAT_INC(ips_notmember); if (blocked == MCAST_NOTSMEMBER || blocked == MCAST_MUTED) UDPSTAT_INC(udps_filtermcast); INP_RUNLOCK(inp); continue; } } if (last != NULL) { struct mbuf *n; n = m_copy(m, 0, M_COPYALL); up = intoudpcb(last); if (up->u_tun_func == NULL) { if (n != NULL) udp_append(last, ip, n, iphlen + sizeof(struct udphdr), &udp_in); } else { /* * Engage the tunneling protocol we * will have to leave the info_lock * up, since we are hunting through * multiple UDP's. */ (*up->u_tun_func)(n, iphlen, last); } INP_RUNLOCK(last); } last = inp; /* * Don't look for additional matches if this one does * not have either the SO_REUSEPORT or SO_REUSEADDR * socket options set. This heuristic avoids * searching through all pcbs in the common case of a * non-shared port. It assumes that an application * will never clear these options after setting them. 
*/ if ((last->inp_socket->so_options & (SO_REUSEPORT|SO_REUSEADDR)) == 0) break; } if (last == NULL) { /* * No matching pcb found; discard datagram. (No need * to send an ICMP Port Unreachable for a broadcast * or multicast datgram.) */ UDPSTAT_INC(udps_noportbcast); goto badheadlocked; } up = intoudpcb(last); if (up->u_tun_func == NULL) { udp_append(last, ip, m, iphlen + sizeof(struct udphdr), &udp_in); } else { /* * Engage the tunneling protocol. */ (*up->u_tun_func)(m, iphlen, last); } INP_RUNLOCK(last); INP_INFO_RUNLOCK(&V_udbinfo); return; } /* * Locate pcb for datagram. */ inp = in_pcblookup_hash(&V_udbinfo, ip->ip_src, uh->uh_sport, ip->ip_dst, uh->uh_dport, 1, ifp); if (inp == NULL) { if (udp_log_in_vain) { char buf[4*sizeof "123"]; strcpy(buf, inet_ntoa(ip->ip_dst)); log(LOG_INFO, "Connection attempt to UDP %s:%d from %s:%d\n", buf, ntohs(uh->uh_dport), inet_ntoa(ip->ip_src), ntohs(uh->uh_sport)); } UDPSTAT_INC(udps_noport); if (m->m_flags & (M_BCAST | M_MCAST)) { UDPSTAT_INC(udps_noportbcast); goto badheadlocked; } if (V_udp_blackhole) goto badheadlocked; if (badport_bandlim(BANDLIM_ICMP_UNREACH) < 0) goto badheadlocked; *ip = save_ip; ip->ip_len += iphlen; icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PORT, 0, 0); INP_INFO_RUNLOCK(&V_udbinfo); return; } /* * Check the minimum TTL for socket. */ INP_RLOCK(inp); INP_INFO_RUNLOCK(&V_udbinfo); if (inp->inp_ip_minttl && inp->inp_ip_minttl > ip->ip_ttl) { INP_RUNLOCK(inp); goto badunlocked; } up = intoudpcb(inp); if (up->u_tun_func == NULL) { udp_append(inp, ip, m, iphlen + sizeof(struct udphdr), &udp_in); } else { /* * Engage the tunneling protocol. */ (*up->u_tun_func)(m, iphlen, inp); } INP_RUNLOCK(inp); return; badheadlocked: if (inp) INP_RUNLOCK(inp); INP_INFO_RUNLOCK(&V_udbinfo); badunlocked: m_freem(m); } /* * Notify a udp user of an asynchronous error; just wake up so that they can * collect error status. */ struct inpcb * udp_notify(struct inpcb *inp, int errno) { /* * While udp_ctlinput() always calls udp_notify() with a read lock * when invoking it directly, in_pcbnotifyall() currently uses write * locks due to sharing code with TCP. For now, accept either a read * or a write lock, but a read lock is sufficient. */ INP_LOCK_ASSERT(inp); inp->inp_socket->so_error = errno; sorwakeup(inp->inp_socket); sowwakeup(inp->inp_socket); return (inp); } void udp_ctlinput(int cmd, struct sockaddr *sa, void *vip) { INIT_VNET_INET(curvnet); struct ip *ip = vip; struct udphdr *uh; struct in_addr faddr; struct inpcb *inp; faddr = ((struct sockaddr_in *)sa)->sin_addr; if (sa->sa_family != AF_INET || faddr.s_addr == INADDR_ANY) return; /* * Redirects don't need to be handled up here. */ if (PRC_IS_REDIRECT(cmd)) return; /* * Hostdead is ugly because it goes linearly through all PCBs. * * XXX: We never get this from ICMP, otherwise it makes an excellent * DoS attack on machines with many connections. 
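 * PRC_HOSTDEAD is therefore handled by passing a NULL ip to
 * in_pcbnotifyall(), which notifies every PCB connected to the faulting
 * address; the other commands look up the single matching PCB instead.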
*/ if (cmd == PRC_HOSTDEAD) ip = NULL; else if ((unsigned)cmd >= PRC_NCMDS || inetctlerrmap[cmd] == 0) return; if (ip != NULL) { uh = (struct udphdr *)((caddr_t)ip + (ip->ip_hl << 2)); INP_INFO_RLOCK(&V_udbinfo); inp = in_pcblookup_hash(&V_udbinfo, faddr, uh->uh_dport, ip->ip_src, uh->uh_sport, 0, NULL); if (inp != NULL) { INP_RLOCK(inp); if (inp->inp_socket != NULL) { udp_notify(inp, inetctlerrmap[cmd]); } INP_RUNLOCK(inp); } INP_INFO_RUNLOCK(&V_udbinfo); } else in_pcbnotifyall(&V_udbinfo, faddr, inetctlerrmap[cmd], udp_notify); } static int udp_pcblist(SYSCTL_HANDLER_ARGS) { INIT_VNET_INET(curvnet); int error, i, n; struct inpcb *inp, **inp_list; inp_gen_t gencnt; struct xinpgen xig; /* * The process of preparing the PCB list is too time-consuming and * resource-intensive to repeat twice on every request. */ if (req->oldptr == 0) { n = V_udbinfo.ipi_count; req->oldidx = 2 * (sizeof xig) + (n + n/8) * sizeof(struct xinpcb); return (0); } if (req->newptr != 0) return (EPERM); /* * OK, now we're committed to doing something. */ INP_INFO_RLOCK(&V_udbinfo); gencnt = V_udbinfo.ipi_gencnt; n = V_udbinfo.ipi_count; INP_INFO_RUNLOCK(&V_udbinfo); error = sysctl_wire_old_buffer(req, 2 * (sizeof xig) + n * sizeof(struct xinpcb)); if (error != 0) return (error); xig.xig_len = sizeof xig; xig.xig_count = n; xig.xig_gen = gencnt; xig.xig_sogen = so_gencnt; error = SYSCTL_OUT(req, &xig, sizeof xig); if (error) return (error); inp_list = malloc(n * sizeof *inp_list, M_TEMP, M_WAITOK); if (inp_list == 0) return (ENOMEM); INP_INFO_RLOCK(&V_udbinfo); for (inp = LIST_FIRST(V_udbinfo.ipi_listhead), i = 0; inp && i < n; inp = LIST_NEXT(inp, inp_list)) { INP_RLOCK(inp); if (inp->inp_gencnt <= gencnt && cr_canseeinpcb(req->td->td_ucred, inp) == 0) inp_list[i++] = inp; INP_RUNLOCK(inp); } INP_INFO_RUNLOCK(&V_udbinfo); n = i; error = 0; for (i = 0; i < n; i++) { inp = inp_list[i]; INP_RLOCK(inp); if (inp->inp_gencnt <= gencnt) { struct xinpcb xi; bzero(&xi, sizeof(xi)); xi.xi_len = sizeof xi; /* XXX should avoid extra copy */ bcopy(inp, &xi.xi_inp, sizeof *inp); if (inp->inp_socket) sotoxsocket(inp->inp_socket, &xi.xi_socket); xi.xi_inp.inp_gencnt = inp->inp_gencnt; INP_RUNLOCK(inp); error = SYSCTL_OUT(req, &xi, sizeof xi); } else INP_RUNLOCK(inp); } if (!error) { /* * Give the user an updated idea of our state. If the * generation differs from what we told her before, she knows * that something happened while we were processing this * request, and it might be necessary to retry. 
*/ INP_INFO_RLOCK(&V_udbinfo); xig.xig_gen = V_udbinfo.ipi_gencnt; xig.xig_sogen = so_gencnt; xig.xig_count = V_udbinfo.ipi_count; INP_INFO_RUNLOCK(&V_udbinfo); error = SYSCTL_OUT(req, &xig, sizeof xig); } free(inp_list, M_TEMP); return (error); } SYSCTL_PROC(_net_inet_udp, UDPCTL_PCBLIST, pcblist, CTLFLAG_RD, 0, 0, udp_pcblist, "S,xinpcb", "List of active UDP sockets"); static int udp_getcred(SYSCTL_HANDLER_ARGS) { INIT_VNET_INET(curvnet); struct xucred xuc; struct sockaddr_in addrs[2]; struct inpcb *inp; int error; error = priv_check(req->td, PRIV_NETINET_GETCRED); if (error) return (error); error = SYSCTL_IN(req, addrs, sizeof(addrs)); if (error) return (error); INP_INFO_RLOCK(&V_udbinfo); inp = in_pcblookup_hash(&V_udbinfo, addrs[1].sin_addr, addrs[1].sin_port, addrs[0].sin_addr, addrs[0].sin_port, 1, NULL); if (inp != NULL) { INP_RLOCK(inp); INP_INFO_RUNLOCK(&V_udbinfo); if (inp->inp_socket == NULL) error = ENOENT; if (error == 0) error = cr_canseeinpcb(req->td->td_ucred, inp); if (error == 0) cru2x(inp->inp_cred, &xuc); INP_RUNLOCK(inp); } else { INP_INFO_RUNLOCK(&V_udbinfo); error = ENOENT; } if (error == 0) error = SYSCTL_OUT(req, &xuc, sizeof(struct xucred)); return (error); } SYSCTL_PROC(_net_inet_udp, OID_AUTO, getcred, CTLTYPE_OPAQUE|CTLFLAG_RW|CTLFLAG_PRISON, 0, 0, udp_getcred, "S,xucred", "Get the xucred of a UDP connection"); static int udp_output(struct inpcb *inp, struct mbuf *m, struct sockaddr *addr, struct mbuf *control, struct thread *td) { INIT_VNET_INET(inp->inp_vnet); struct udpiphdr *ui; int len = m->m_pkthdr.len; struct in_addr faddr, laddr; struct cmsghdr *cm; struct sockaddr_in *sin, src; int error = 0; int ipflags; u_short fport, lport; int unlock_udbinfo; /* * udp_output() may need to temporarily bind or connect the current * inpcb. As such, we don't know up front whether we will need the * pcbinfo lock or not. Do any work to decide what is needed up * front before acquiring any locks. */ if (len + sizeof(struct udpiphdr) > IP_MAXPACKET) { if (control) m_freem(control); m_freem(m); return (EMSGSIZE); } src.sin_family = 0; if (control != NULL) { /* * XXX: Currently, we assume all the optional information is * stored in a single mbuf. */ if (control->m_next) { m_freem(control); m_freem(m); return (EINVAL); } for (; control->m_len > 0; control->m_data += CMSG_ALIGN(cm->cmsg_len), control->m_len -= CMSG_ALIGN(cm->cmsg_len)) { cm = mtod(control, struct cmsghdr *); if (control->m_len < sizeof(*cm) || cm->cmsg_len == 0 || cm->cmsg_len > control->m_len) { error = EINVAL; break; } if (cm->cmsg_level != IPPROTO_IP) continue; switch (cm->cmsg_type) { case IP_SENDSRCADDR: if (cm->cmsg_len != CMSG_LEN(sizeof(struct in_addr))) { error = EINVAL; break; } bzero(&src, sizeof(src)); src.sin_family = AF_INET; src.sin_len = sizeof(src); src.sin_port = inp->inp_lport; src.sin_addr = *(struct in_addr *)CMSG_DATA(cm); break; default: error = ENOPROTOOPT; break; } if (error) break; } m_freem(control); } if (error) { m_freem(m); return (error); } /* * Depending on whether or not the application has bound or connected * the socket, we may have to do varying levels of work. The optimal * case is for a connected UDP socket, as a global lock isn't * required at all. * * In order to decide which we need, we require stability of the * inpcb binding, which we ensure by acquiring a read lock on the * inpcb. 
This doesn't strictly follow the lock order, so we play * the trylock and retry game; note that we may end up with more * conservative locks than required the second time around, so later * assertions have to accept that. Further analysis of the number of * misses under contention is required. */ sin = (struct sockaddr_in *)addr; INP_RLOCK(inp); if (sin != NULL && (inp->inp_laddr.s_addr == INADDR_ANY && inp->inp_lport == 0)) { INP_RUNLOCK(inp); INP_INFO_WLOCK(&V_udbinfo); INP_WLOCK(inp); unlock_udbinfo = 2; } else if ((sin != NULL && ( (sin->sin_addr.s_addr == INADDR_ANY) || (sin->sin_addr.s_addr == INADDR_BROADCAST) || (inp->inp_laddr.s_addr == INADDR_ANY) || (inp->inp_lport == 0))) || (src.sin_family == AF_INET)) { if (!INP_INFO_TRY_RLOCK(&V_udbinfo)) { INP_RUNLOCK(inp); INP_INFO_RLOCK(&V_udbinfo); INP_RLOCK(inp); } unlock_udbinfo = 1; } else unlock_udbinfo = 0; /* * If the IP_SENDSRCADDR control message was specified, override the * source address for this datagram. Its use is invalidated if the * address thus specified is incomplete or clobbers other inpcbs. */ laddr = inp->inp_laddr; lport = inp->inp_lport; if (src.sin_family == AF_INET) { INP_INFO_LOCK_ASSERT(&V_udbinfo); if ((lport == 0) || (laddr.s_addr == INADDR_ANY && src.sin_addr.s_addr == INADDR_ANY)) { error = EINVAL; goto release; } error = in_pcbbind_setup(inp, (struct sockaddr *)&src, &laddr.s_addr, &lport, td->td_ucred); if (error) goto release; } /* * If a UDP socket has been connected, then a local address/port will * have been selected and bound. * * If a UDP socket has not been connected to, then an explicit * destination address must be used, in which case a local * address/port may not have been selected and bound. */ if (sin != NULL) { INP_LOCK_ASSERT(inp); if (inp->inp_faddr.s_addr != INADDR_ANY) { error = EISCONN; goto release; } /* * Jail may rewrite the destination address, so let it do * that before we use it. */ error = prison_remote_ip4(td->td_ucred, &sin->sin_addr); if (error) goto release; /* * If a local address or port hasn't yet been selected, or if * the destination address needs to be rewritten due to using * a special INADDR_ constant, invoke in_pcbconnect_setup() * to do the heavy lifting. Once a port is selected, we * commit the binding back to the socket; we also commit the * binding of the address if in jail. * * If we already have a valid binding and we're not * requesting a destination address rewrite, use a fast path. */ if (inp->inp_laddr.s_addr == INADDR_ANY || inp->inp_lport == 0 || sin->sin_addr.s_addr == INADDR_ANY || sin->sin_addr.s_addr == INADDR_BROADCAST) { INP_INFO_LOCK_ASSERT(&V_udbinfo); error = in_pcbconnect_setup(inp, addr, &laddr.s_addr, &lport, &faddr.s_addr, &fport, NULL, td->td_ucred); if (error) goto release; /* * XXXRW: Why not commit the port if the address is * !INADDR_ANY? */ /* Commit the local port if newly assigned. */ if (inp->inp_laddr.s_addr == INADDR_ANY && inp->inp_lport == 0) { INP_INFO_WLOCK_ASSERT(&V_udbinfo); INP_WLOCK_ASSERT(inp); /* * Remember addr if jailed, to prevent * rebinding. 
*/ - if (jailed(td->td_ucred)) + if (prison_flag(td->td_ucred, PR_IP4)) inp->inp_laddr = laddr; inp->inp_lport = lport; if (in_pcbinshash(inp) != 0) { inp->inp_lport = 0; error = EAGAIN; goto release; } inp->inp_flags |= INP_ANONPORT; } } else { faddr = sin->sin_addr; fport = sin->sin_port; } } else { INP_LOCK_ASSERT(inp); faddr = inp->inp_faddr; fport = inp->inp_fport; if (faddr.s_addr == INADDR_ANY) { error = ENOTCONN; goto release; } } /* * Calculate data length and get a mbuf for UDP, IP, and possible * link-layer headers. Immediate slide the data pointer back forward * since we won't use that space at this layer. */ M_PREPEND(m, sizeof(struct udpiphdr) + max_linkhdr, M_DONTWAIT); if (m == NULL) { error = ENOBUFS; goto release; } m->m_data += max_linkhdr; m->m_len -= max_linkhdr; m->m_pkthdr.len -= max_linkhdr; /* * Fill in mbuf with extended UDP header and addresses and length put * into network format. */ ui = mtod(m, struct udpiphdr *); bzero(ui->ui_x1, sizeof(ui->ui_x1)); /* XXX still needed? */ ui->ui_pr = IPPROTO_UDP; ui->ui_src = laddr; ui->ui_dst = faddr; ui->ui_sport = lport; ui->ui_dport = fport; ui->ui_ulen = htons((u_short)len + sizeof(struct udphdr)); /* * Set the Don't Fragment bit in the IP header. */ if (inp->inp_flags & INP_DONTFRAG) { struct ip *ip; ip = (struct ip *)&ui->ui_i; ip->ip_off |= IP_DF; } ipflags = 0; if (inp->inp_socket->so_options & SO_DONTROUTE) ipflags |= IP_ROUTETOIF; if (inp->inp_socket->so_options & SO_BROADCAST) ipflags |= IP_ALLOWBROADCAST; if (inp->inp_flags & INP_ONESBCAST) ipflags |= IP_SENDONES; #ifdef MAC mac_inpcb_create_mbuf(inp, m); #endif /* * Set up checksum and output datagram. */ if (udp_cksum) { if (inp->inp_flags & INP_ONESBCAST) faddr.s_addr = INADDR_BROADCAST; ui->ui_sum = in_pseudo(ui->ui_src.s_addr, faddr.s_addr, htons((u_short)len + sizeof(struct udphdr) + IPPROTO_UDP)); m->m_pkthdr.csum_flags = CSUM_UDP; m->m_pkthdr.csum_data = offsetof(struct udphdr, uh_sum); } else ui->ui_sum = 0; ((struct ip *)ui)->ip_len = sizeof (struct udpiphdr) + len; ((struct ip *)ui)->ip_ttl = inp->inp_ip_ttl; /* XXX */ ((struct ip *)ui)->ip_tos = inp->inp_ip_tos; /* XXX */ UDPSTAT_INC(udps_opackets); if (unlock_udbinfo == 2) INP_INFO_WUNLOCK(&V_udbinfo); else if (unlock_udbinfo == 1) INP_INFO_RUNLOCK(&V_udbinfo); error = ip_output(m, inp->inp_options, NULL, ipflags, inp->inp_moptions, inp); if (unlock_udbinfo == 2) INP_WUNLOCK(inp); else INP_RUNLOCK(inp); return (error); release: if (unlock_udbinfo == 2) { INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); } else if (unlock_udbinfo == 1) { INP_RUNLOCK(inp); INP_INFO_RUNLOCK(&V_udbinfo); } else INP_RUNLOCK(inp); m_freem(m); return (error); } static void udp_abort(struct socket *so) { INIT_VNET_INET(so->so_vnet); struct inpcb *inp; inp = sotoinpcb(so); KASSERT(inp != NULL, ("udp_abort: inp == NULL")); INP_INFO_WLOCK(&V_udbinfo); INP_WLOCK(inp); if (inp->inp_faddr.s_addr != INADDR_ANY) { in_pcbdisconnect(inp); inp->inp_laddr.s_addr = INADDR_ANY; soisdisconnected(so); } INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); } static int udp_attach(struct socket *so, int proto, struct thread *td) { INIT_VNET_INET(so->so_vnet); struct inpcb *inp; int error; inp = sotoinpcb(so); KASSERT(inp == NULL, ("udp_attach: inp != NULL")); error = soreserve(so, udp_sendspace, udp_recvspace); if (error) return (error); INP_INFO_WLOCK(&V_udbinfo); error = in_pcballoc(so, &V_udbinfo); if (error) { INP_INFO_WUNLOCK(&V_udbinfo); return (error); } inp = (struct inpcb *)so->so_pcb; inp->inp_vflag |= INP_IPV4; inp->inp_ip_ttl = 
V_ip_defttl; error = udp_newudpcb(inp); if (error) { in_pcbdetach(inp); in_pcbfree(inp); INP_INFO_WUNLOCK(&V_udbinfo); return (error); } INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); return (0); } int udp_set_kernel_tunneling(struct socket *so, udp_tun_func_t f) { struct inpcb *inp; struct udpcb *up; KASSERT(so->so_type == SOCK_DGRAM, ("udp_set_kernel_tunneling: !dgram")); KASSERT(so->so_pcb != NULL, ("udp_set_kernel_tunneling: NULL inp")); if (so->so_type != SOCK_DGRAM) { /* Not UDP socket... sorry! */ return (ENOTSUP); } inp = (struct inpcb *)so->so_pcb; if (inp == NULL) { /* NULL INP? */ return (EINVAL); } INP_WLOCK(inp); up = intoudpcb(inp); if (up->u_tun_func != NULL) { INP_WUNLOCK(inp); return (EBUSY); } up->u_tun_func = f; INP_WUNLOCK(inp); return (0); } static int udp_bind(struct socket *so, struct sockaddr *nam, struct thread *td) { INIT_VNET_INET(so->so_vnet); struct inpcb *inp; int error; inp = sotoinpcb(so); KASSERT(inp != NULL, ("udp_bind: inp == NULL")); INP_INFO_WLOCK(&V_udbinfo); INP_WLOCK(inp); error = in_pcbbind(inp, nam, td->td_ucred); INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); return (error); } static void udp_close(struct socket *so) { INIT_VNET_INET(so->so_vnet); struct inpcb *inp; inp = sotoinpcb(so); KASSERT(inp != NULL, ("udp_close: inp == NULL")); INP_INFO_WLOCK(&V_udbinfo); INP_WLOCK(inp); if (inp->inp_faddr.s_addr != INADDR_ANY) { in_pcbdisconnect(inp); inp->inp_laddr.s_addr = INADDR_ANY; soisdisconnected(so); } INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); } static int udp_connect(struct socket *so, struct sockaddr *nam, struct thread *td) { INIT_VNET_INET(so->so_vnet); struct inpcb *inp; int error; struct sockaddr_in *sin; inp = sotoinpcb(so); KASSERT(inp != NULL, ("udp_connect: inp == NULL")); INP_INFO_WLOCK(&V_udbinfo); INP_WLOCK(inp); if (inp->inp_faddr.s_addr != INADDR_ANY) { INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); return (EISCONN); } sin = (struct sockaddr_in *)nam; error = prison_remote_ip4(td->td_ucred, &sin->sin_addr); if (error != 0) { INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); return (error); } error = in_pcbconnect(inp, nam, td->td_ucred); if (error == 0) soisconnected(so); INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); return (error); } static void udp_detach(struct socket *so) { INIT_VNET_INET(so->so_vnet); struct inpcb *inp; struct udpcb *up; inp = sotoinpcb(so); KASSERT(inp != NULL, ("udp_detach: inp == NULL")); KASSERT(inp->inp_faddr.s_addr == INADDR_ANY, ("udp_detach: not disconnected")); INP_INFO_WLOCK(&V_udbinfo); INP_WLOCK(inp); up = intoudpcb(inp); KASSERT(up != NULL, ("%s: up == NULL", __func__)); inp->inp_ppcb = NULL; in_pcbdetach(inp); in_pcbfree(inp); INP_INFO_WUNLOCK(&V_udbinfo); udp_discardcb(up); } static int udp_disconnect(struct socket *so) { INIT_VNET_INET(so->so_vnet); struct inpcb *inp; inp = sotoinpcb(so); KASSERT(inp != NULL, ("udp_disconnect: inp == NULL")); INP_INFO_WLOCK(&V_udbinfo); INP_WLOCK(inp); if (inp->inp_faddr.s_addr == INADDR_ANY) { INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); return (ENOTCONN); } in_pcbdisconnect(inp); inp->inp_laddr.s_addr = INADDR_ANY; SOCK_LOCK(so); so->so_state &= ~SS_ISCONNECTED; /* XXX */ SOCK_UNLOCK(so); INP_WUNLOCK(inp); INP_INFO_WUNLOCK(&V_udbinfo); return (0); } static int udp_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *addr, struct mbuf *control, struct thread *td) { struct inpcb *inp; inp = sotoinpcb(so); KASSERT(inp != NULL, ("udp_send: inp == NULL")); return (udp_output(inp, m, addr, control, td)); } int udp_shutdown(struct socket 
*so)
{
        struct inpcb *inp;

        inp = sotoinpcb(so);
        KASSERT(inp != NULL, ("udp_shutdown: inp == NULL"));

        INP_WLOCK(inp);
        socantsendmore(so);
        INP_WUNLOCK(inp);
        return (0);
}

struct pr_usrreqs udp_usrreqs = {
        .pru_abort = udp_abort,
        .pru_attach = udp_attach,
        .pru_bind = udp_bind,
        .pru_connect = udp_connect,
        .pru_control = in_control,
        .pru_detach = udp_detach,
        .pru_disconnect = udp_disconnect,
        .pru_peeraddr = in_getpeeraddr,
        .pru_send = udp_send,
        .pru_soreceive = soreceive_dgram,
        .pru_sosend = sosend_dgram,
        .pru_shutdown = udp_shutdown,
        .pru_sockaddr = in_getsockaddr,
        .pru_sosetlabel = in_pcbsosetlabel,
        .pru_close = udp_close,
};

Index: head/sys/netinet6/in6.c
===================================================================
--- head/sys/netinet6/in6.c     (revision 192894)
+++ head/sys/netinet6/in6.c     (revision 192895)
@@ -1,2600 +1,2593 @@
/*-
 * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 3. Neither the name of the project nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 * SUCH DAMAGE.
 *
 * $KAME: in6.c,v 1.259 2002/01/21 11:37:50 keiichi Exp $
 */

/*-
 * Copyright (c) 1982, 1986, 1991, 1993
 *      The Regents of the University of California. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 * 4. Neither the name of the University nor the names of its contributors
 *    may be used to endorse or promote products derived from this software
 *    without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
 * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED.
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)in.c 8.2 (Berkeley) 11/15/93 */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_route.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* * Definitions of some costant IP6 addresses. */ const struct in6_addr in6addr_any = IN6ADDR_ANY_INIT; const struct in6_addr in6addr_loopback = IN6ADDR_LOOPBACK_INIT; const struct in6_addr in6addr_nodelocal_allnodes = IN6ADDR_NODELOCAL_ALLNODES_INIT; const struct in6_addr in6addr_linklocal_allnodes = IN6ADDR_LINKLOCAL_ALLNODES_INIT; const struct in6_addr in6addr_linklocal_allrouters = IN6ADDR_LINKLOCAL_ALLROUTERS_INIT; const struct in6_addr in6addr_linklocal_allv2routers = IN6ADDR_LINKLOCAL_ALLV2ROUTERS_INIT; const struct in6_addr in6mask0 = IN6MASK0; const struct in6_addr in6mask32 = IN6MASK32; const struct in6_addr in6mask64 = IN6MASK64; const struct in6_addr in6mask96 = IN6MASK96; const struct in6_addr in6mask128 = IN6MASK128; const struct sockaddr_in6 sa6_any = { sizeof(sa6_any), AF_INET6, 0, 0, IN6ADDR_ANY_INIT, 0 }; static int in6_lifaddr_ioctl __P((struct socket *, u_long, caddr_t, struct ifnet *, struct thread *)); static int in6_ifinit __P((struct ifnet *, struct in6_ifaddr *, struct sockaddr_in6 *, int)); static void in6_unlink_ifa(struct in6_ifaddr *, struct ifnet *); int (*faithprefix_p)(struct in6_addr *); int in6_mask2len(struct in6_addr *mask, u_char *lim0) { int x = 0, y; u_char *lim = lim0, *p; /* ignore the scope_id part */ if (lim0 == NULL || lim0 - (u_char *)mask > sizeof(*mask)) lim = (u_char *)mask + sizeof(*mask); for (p = (u_char *)mask; p < lim; x++, p++) { if (*p != 0xff) break; } y = 0; if (p < lim) { for (y = 0; y < 8; y++) { if ((*p & (0x80 >> y)) == 0) break; } } /* * when the limit pointer is given, do a stricter check on the * remaining bits. */ if (p < lim) { if (y != 0 && (*p & (0x00ff >> y)) != 0) return (-1); for (p = p + 1; p < lim; p++) if (*p != 0) return (-1); } return x * 8 + y; } #define ifa2ia6(ifa) ((struct in6_ifaddr *)(ifa)) #define ia62ifa(ia6) (&((ia6)->ia_ifa)) int in6_control(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, struct thread *td) { INIT_VNET_INET6(curvnet); struct in6_ifreq *ifr = (struct in6_ifreq *)data; struct in6_ifaddr *ia = NULL; struct in6_aliasreq *ifra = (struct in6_aliasreq *)data; struct sockaddr_in6 *sa6; int error; switch (cmd) { case SIOCGETSGCNT_IN6: case SIOCGETMIFCNT_IN6: return (mrt6_ioctl ? 
mrt6_ioctl(cmd, data) : EOPNOTSUPP); } switch(cmd) { case SIOCAADDRCTL_POLICY: case SIOCDADDRCTL_POLICY: if (td != NULL) { error = priv_check(td, PRIV_NETINET_ADDRCTRL6); if (error) return (error); } return (in6_src_ioctl(cmd, data)); } if (ifp == NULL) return (EOPNOTSUPP); switch (cmd) { case SIOCSNDFLUSH_IN6: case SIOCSPFXFLUSH_IN6: case SIOCSRTRFLUSH_IN6: case SIOCSDEFIFACE_IN6: case SIOCSIFINFO_FLAGS: if (td != NULL) { error = priv_check(td, PRIV_NETINET_ND6); if (error) return (error); } /* FALLTHROUGH */ case OSIOCGIFINFO_IN6: case SIOCGIFINFO_IN6: case SIOCSIFINFO_IN6: case SIOCGDRLST_IN6: case SIOCGPRLST_IN6: case SIOCGNBRINFO_IN6: case SIOCGDEFIFACE_IN6: return (nd6_ioctl(cmd, data, ifp)); } switch (cmd) { case SIOCSIFPREFIX_IN6: case SIOCDIFPREFIX_IN6: case SIOCAIFPREFIX_IN6: case SIOCCIFPREFIX_IN6: case SIOCSGIFPREFIX_IN6: case SIOCGIFPREFIX_IN6: log(LOG_NOTICE, "prefix ioctls are now invalidated. " "please use ifconfig.\n"); return (EOPNOTSUPP); } switch (cmd) { case SIOCSSCOPE6: if (td != NULL) { error = priv_check(td, PRIV_NETINET_SCOPE6); if (error) return (error); } return (scope6_set(ifp, (struct scope6_id *)ifr->ifr_ifru.ifru_scope_id)); case SIOCGSCOPE6: return (scope6_get(ifp, (struct scope6_id *)ifr->ifr_ifru.ifru_scope_id)); case SIOCGSCOPE6DEF: return (scope6_get_default((struct scope6_id *) ifr->ifr_ifru.ifru_scope_id)); } switch (cmd) { case SIOCALIFADDR: if (td != NULL) { error = priv_check(td, PRIV_NET_ADDIFADDR); if (error) return (error); } return in6_lifaddr_ioctl(so, cmd, data, ifp, td); case SIOCDLIFADDR: if (td != NULL) { error = priv_check(td, PRIV_NET_DELIFADDR); if (error) return (error); } /* FALLTHROUGH */ case SIOCGLIFADDR: return in6_lifaddr_ioctl(so, cmd, data, ifp, td); } /* * Find address for this interface, if it exists. * * In netinet code, we have checked ifra_addr in SIOCSIF*ADDR operation * only, and used the first interface address as the target of other * operations (without checking ifra_addr). This was because netinet * code/API assumed at most 1 interface address per interface. * Since IPv6 allows a node to assign multiple addresses * on a single interface, we almost always look and check the * presence of ifra_addr, and reject invalid ones here. * It also decreases duplicated code among SIOC*_IN6 operations. */ switch (cmd) { case SIOCAIFADDR_IN6: case SIOCSIFPHYADDR_IN6: sa6 = &ifra->ifra_addr; break; case SIOCSIFADDR_IN6: case SIOCGIFADDR_IN6: case SIOCSIFDSTADDR_IN6: case SIOCSIFNETMASK_IN6: case SIOCGIFDSTADDR_IN6: case SIOCGIFNETMASK_IN6: case SIOCDIFADDR_IN6: case SIOCGIFPSRCADDR_IN6: case SIOCGIFPDSTADDR_IN6: case SIOCGIFAFLAG_IN6: case SIOCSNDFLUSH_IN6: case SIOCSPFXFLUSH_IN6: case SIOCSRTRFLUSH_IN6: case SIOCGIFALIFETIME_IN6: case SIOCSIFALIFETIME_IN6: case SIOCGIFSTAT_IN6: case SIOCGIFSTAT_ICMP6: sa6 = &ifr->ifr_addr; break; default: sa6 = NULL; break; } if (sa6 && sa6->sin6_family == AF_INET6) { int error = 0; if (sa6->sin6_scope_id != 0) error = sa6_embedscope(sa6, 0); else error = in6_setscope(&sa6->sin6_addr, ifp, NULL); if (error != 0) return (error); if (td != NULL && (error = prison_check_ip6(td->td_ucred, &sa6->sin6_addr)) != 0) return (error); ia = in6ifa_ifpwithaddr(ifp, &sa6->sin6_addr); } else ia = NULL; switch (cmd) { case SIOCSIFADDR_IN6: case SIOCSIFDSTADDR_IN6: case SIOCSIFNETMASK_IN6: /* * Since IPv6 allows a node to assign multiple addresses * on a single interface, SIOCSIFxxx ioctls are deprecated. 
*/ /* we decided to obsolete this command (20000704) */ return (EINVAL); case SIOCDIFADDR_IN6: /* * for IPv4, we look for existing in_ifaddr here to allow * "ifconfig if0 delete" to remove the first IPv4 address on * the interface. For IPv6, as the spec allows multiple * interface address from the day one, we consider "remove the * first one" semantics to be not preferable. */ if (ia == NULL) return (EADDRNOTAVAIL); /* FALLTHROUGH */ case SIOCAIFADDR_IN6: /* * We always require users to specify a valid IPv6 address for * the corresponding operation. */ if (ifra->ifra_addr.sin6_family != AF_INET6 || ifra->ifra_addr.sin6_len != sizeof(struct sockaddr_in6)) return (EAFNOSUPPORT); if (td != NULL) { error = priv_check(td, (cmd == SIOCDIFADDR_IN6) ? PRIV_NET_DELIFADDR : PRIV_NET_ADDIFADDR); if (error) return (error); } break; case SIOCGIFADDR_IN6: /* This interface is basically deprecated. use SIOCGIFCONF. */ /* FALLTHROUGH */ case SIOCGIFAFLAG_IN6: case SIOCGIFNETMASK_IN6: case SIOCGIFDSTADDR_IN6: case SIOCGIFALIFETIME_IN6: /* must think again about its semantics */ if (ia == NULL) return (EADDRNOTAVAIL); break; case SIOCSIFALIFETIME_IN6: { struct in6_addrlifetime *lt; if (td != NULL) { error = priv_check(td, PRIV_NETINET_ALIFETIME6); if (error) return (error); } if (ia == NULL) return (EADDRNOTAVAIL); /* sanity for overflow - beware unsigned */ lt = &ifr->ifr_ifru.ifru_lifetime; if (lt->ia6t_vltime != ND6_INFINITE_LIFETIME && lt->ia6t_vltime + time_second < time_second) { return EINVAL; } if (lt->ia6t_pltime != ND6_INFINITE_LIFETIME && lt->ia6t_pltime + time_second < time_second) { return EINVAL; } break; } } switch (cmd) { case SIOCGIFADDR_IN6: ifr->ifr_addr = ia->ia_addr; if ((error = sa6_recoverscope(&ifr->ifr_addr)) != 0) return (error); break; case SIOCGIFDSTADDR_IN6: if ((ifp->if_flags & IFF_POINTOPOINT) == 0) return (EINVAL); /* * XXX: should we check if ifa_dstaddr is NULL and return * an error? */ ifr->ifr_dstaddr = ia->ia_dstaddr; if ((error = sa6_recoverscope(&ifr->ifr_dstaddr)) != 0) return (error); break; case SIOCGIFNETMASK_IN6: ifr->ifr_addr = ia->ia_prefixmask; break; case SIOCGIFAFLAG_IN6: ifr->ifr_ifru.ifru_flags6 = ia->ia6_flags; break; case SIOCGIFSTAT_IN6: if (ifp == NULL) return EINVAL; bzero(&ifr->ifr_ifru.ifru_stat, sizeof(ifr->ifr_ifru.ifru_stat)); ifr->ifr_ifru.ifru_stat = *((struct in6_ifextra *)ifp->if_afdata[AF_INET6])->in6_ifstat; break; case SIOCGIFSTAT_ICMP6: if (ifp == NULL) return EINVAL; bzero(&ifr->ifr_ifru.ifru_icmp6stat, sizeof(ifr->ifr_ifru.ifru_icmp6stat)); ifr->ifr_ifru.ifru_icmp6stat = *((struct in6_ifextra *)ifp->if_afdata[AF_INET6])->icmp6_ifstat; break; case SIOCGIFALIFETIME_IN6: ifr->ifr_ifru.ifru_lifetime = ia->ia6_lifetime; if (ia->ia6_lifetime.ia6t_vltime != ND6_INFINITE_LIFETIME) { time_t maxexpire; struct in6_addrlifetime *retlt = &ifr->ifr_ifru.ifru_lifetime; /* * XXX: adjust expiration time assuming time_t is * signed. */ maxexpire = (-1) & ~((time_t)1 << ((sizeof(maxexpire) * 8) - 1)); if (ia->ia6_lifetime.ia6t_vltime < maxexpire - ia->ia6_updatetime) { retlt->ia6t_expire = ia->ia6_updatetime + ia->ia6_lifetime.ia6t_vltime; } else retlt->ia6t_expire = maxexpire; } if (ia->ia6_lifetime.ia6t_pltime != ND6_INFINITE_LIFETIME) { time_t maxexpire; struct in6_addrlifetime *retlt = &ifr->ifr_ifru.ifru_lifetime; /* * XXX: adjust expiration time assuming time_t is * signed. 
*/ maxexpire = (-1) & ~((time_t)1 << ((sizeof(maxexpire) * 8) - 1)); if (ia->ia6_lifetime.ia6t_pltime < maxexpire - ia->ia6_updatetime) { retlt->ia6t_preferred = ia->ia6_updatetime + ia->ia6_lifetime.ia6t_pltime; } else retlt->ia6t_preferred = maxexpire; } break; case SIOCSIFALIFETIME_IN6: ia->ia6_lifetime = ifr->ifr_ifru.ifru_lifetime; /* for sanity */ if (ia->ia6_lifetime.ia6t_vltime != ND6_INFINITE_LIFETIME) { ia->ia6_lifetime.ia6t_expire = time_second + ia->ia6_lifetime.ia6t_vltime; } else ia->ia6_lifetime.ia6t_expire = 0; if (ia->ia6_lifetime.ia6t_pltime != ND6_INFINITE_LIFETIME) { ia->ia6_lifetime.ia6t_preferred = time_second + ia->ia6_lifetime.ia6t_pltime; } else ia->ia6_lifetime.ia6t_preferred = 0; break; case SIOCAIFADDR_IN6: { int i, error = 0; struct nd_prefixctl pr0; struct nd_prefix *pr; /* * first, make or update the interface address structure, * and link it to the list. */ if ((error = in6_update_ifa(ifp, ifra, ia, 0)) != 0) return (error); if ((ia = in6ifa_ifpwithaddr(ifp, &ifra->ifra_addr.sin6_addr)) == NULL) { /* * this can happen when the user specify the 0 valid * lifetime. */ break; } /* * then, make the prefix on-link on the interface. * XXX: we'd rather create the prefix before the address, but * we need at least one address to install the corresponding * interface route, so we configure the address first. */ /* * convert mask to prefix length (prefixmask has already * been validated in in6_update_ifa(). */ bzero(&pr0, sizeof(pr0)); pr0.ndpr_ifp = ifp; pr0.ndpr_plen = in6_mask2len(&ifra->ifra_prefixmask.sin6_addr, NULL); if (pr0.ndpr_plen == 128) { break; /* we don't need to install a host route. */ } pr0.ndpr_prefix = ifra->ifra_addr; /* apply the mask for safety. */ for (i = 0; i < 4; i++) { pr0.ndpr_prefix.sin6_addr.s6_addr32[i] &= ifra->ifra_prefixmask.sin6_addr.s6_addr32[i]; } /* * XXX: since we don't have an API to set prefix (not address) * lifetimes, we just use the same lifetimes as addresses. * The (temporarily) installed lifetimes can be overridden by * later advertised RAs (when accept_rtadv is non 0), which is * an intended behavior. */ pr0.ndpr_raf_onlink = 1; /* should be configurable? */ pr0.ndpr_raf_auto = ((ifra->ifra_flags & IN6_IFF_AUTOCONF) != 0); pr0.ndpr_vltime = ifra->ifra_lifetime.ia6t_vltime; pr0.ndpr_pltime = ifra->ifra_lifetime.ia6t_pltime; /* add the prefix if not yet. */ if ((pr = nd6_prefix_lookup(&pr0)) == NULL) { /* * nd6_prelist_add will install the corresponding * interface route. */ if ((error = nd6_prelist_add(&pr0, NULL, &pr)) != 0) return (error); if (pr == NULL) { log(LOG_ERR, "nd6_prelist_add succeeded but " "no prefix\n"); return (EINVAL); /* XXX panic here? */ } } /* relate the address to the prefix */ if (ia->ia6_ndpr == NULL) { ia->ia6_ndpr = pr; pr->ndpr_refcnt++; /* * If this is the first autoconf address from the * prefix, create a temporary address as well * (when required). */ if ((ia->ia6_flags & IN6_IFF_AUTOCONF) && V_ip6_use_tempaddr && pr->ndpr_refcnt == 1) { int e; if ((e = in6_tmpifadd(ia, 1, 0)) != 0) { log(LOG_NOTICE, "in6_control: failed " "to create a temporary address, " "errno=%d\n", e); } } } /* * this might affect the status of autoconfigured addresses, * that is, this address might make other addresses detached. */ pfxlist_onlink_check(); if (error == 0 && ia) EVENTHANDLER_INVOKE(ifaddr_event, ifp); break; } case SIOCDIFADDR_IN6: { struct nd_prefix *pr; /* * If the address being deleted is the only one that owns * the corresponding prefix, expire the prefix as well. 
* XXX: theoretically, we don't have to worry about such * relationship, since we separate the address management * and the prefix management. We do this, however, to provide * as much backward compatibility as possible in terms of * the ioctl operation. * Note that in6_purgeaddr() will decrement ndpr_refcnt. */ pr = ia->ia6_ndpr; in6_purgeaddr(&ia->ia_ifa); if (pr && pr->ndpr_refcnt == 0) prelist_remove(pr); EVENTHANDLER_INVOKE(ifaddr_event, ifp); break; } default: if (ifp == NULL || ifp->if_ioctl == 0) return (EOPNOTSUPP); return ((*ifp->if_ioctl)(ifp, cmd, data)); } return (0); } /* * Update parameters of an IPv6 interface address. * If necessary, a new entry is created and linked into address chains. * This function is separated from in6_control(). * XXX: should this be performed under splnet()? */ int in6_update_ifa(struct ifnet *ifp, struct in6_aliasreq *ifra, struct in6_ifaddr *ia, int flags) { INIT_VNET_INET6(ifp->if_vnet); - INIT_VPROCG(TD_TO_VPROCG(curthread)); /* XXX V_hostname needs this */ int error = 0, hostIsNew = 0, plen = -1; struct in6_ifaddr *oia; struct sockaddr_in6 dst6; struct in6_addrlifetime *lt; struct in6_multi_mship *imm; struct in6_multi *in6m_sol; struct rtentry *rt; int delay; char ip6buf[INET6_ADDRSTRLEN]; /* Validate parameters */ if (ifp == NULL || ifra == NULL) /* this maybe redundant */ return (EINVAL); /* * The destination address for a p2p link must have a family * of AF_UNSPEC or AF_INET6. */ if ((ifp->if_flags & IFF_POINTOPOINT) != 0 && ifra->ifra_dstaddr.sin6_family != AF_INET6 && ifra->ifra_dstaddr.sin6_family != AF_UNSPEC) return (EAFNOSUPPORT); /* * validate ifra_prefixmask. don't check sin6_family, netmask * does not carry fields other than sin6_len. */ if (ifra->ifra_prefixmask.sin6_len > sizeof(struct sockaddr_in6)) return (EINVAL); /* * Because the IPv6 address architecture is classless, we require * users to specify a (non 0) prefix length (mask) for a new address. * We also require the prefix (when specified) mask is valid, and thus * reject a non-consecutive mask. */ if (ia == NULL && ifra->ifra_prefixmask.sin6_len == 0) return (EINVAL); if (ifra->ifra_prefixmask.sin6_len != 0) { plen = in6_mask2len(&ifra->ifra_prefixmask.sin6_addr, (u_char *)&ifra->ifra_prefixmask + ifra->ifra_prefixmask.sin6_len); if (plen <= 0) return (EINVAL); } else { /* * In this case, ia must not be NULL. We just use its prefix * length. */ plen = in6_mask2len(&ia->ia_prefixmask.sin6_addr, NULL); } /* * If the destination address on a p2p interface is specified, * and the address is a scoped one, validate/set the scope * zone identifier. */ dst6 = ifra->ifra_dstaddr; if ((ifp->if_flags & (IFF_POINTOPOINT|IFF_LOOPBACK)) != 0 && (dst6.sin6_family == AF_INET6)) { struct in6_addr in6_tmp; u_int32_t zoneid; in6_tmp = dst6.sin6_addr; if (in6_setscope(&in6_tmp, ifp, &zoneid)) return (EINVAL); /* XXX: should be impossible */ if (dst6.sin6_scope_id != 0) { if (dst6.sin6_scope_id != zoneid) return (EINVAL); } else /* user omit to specify the ID. */ dst6.sin6_scope_id = zoneid; /* convert into the internal form */ if (sa6_embedscope(&dst6, 0)) return (EINVAL); /* XXX: should be impossible */ } /* * The destination address can be specified only for a p2p or a * loopback interface. If specified, the corresponding prefix length * must be 128. 
*/ if (ifra->ifra_dstaddr.sin6_family == AF_INET6) { if ((ifp->if_flags & (IFF_POINTOPOINT|IFF_LOOPBACK)) == 0) { /* XXX: noisy message */ nd6log((LOG_INFO, "in6_update_ifa: a destination can " "be specified for a p2p or a loopback IF only\n")); return (EINVAL); } if (plen != 128) { nd6log((LOG_INFO, "in6_update_ifa: prefixlen should " "be 128 when dstaddr is specified\n")); return (EINVAL); } } /* lifetime consistency check */ lt = &ifra->ifra_lifetime; if (lt->ia6t_pltime > lt->ia6t_vltime) return (EINVAL); if (lt->ia6t_vltime == 0) { /* * the following log might be noisy, but this is a typical * configuration mistake or a tool's bug. */ nd6log((LOG_INFO, "in6_update_ifa: valid lifetime is 0 for %s\n", ip6_sprintf(ip6buf, &ifra->ifra_addr.sin6_addr))); if (ia == NULL) return (0); /* there's nothing to do */ } /* * If this is a new address, allocate a new ifaddr and link it * into chains. */ if (ia == NULL) { hostIsNew = 1; /* * When in6_update_ifa() is called in a process of a received * RA, it is called under an interrupt context. So, we should * call malloc with M_NOWAIT. */ ia = (struct in6_ifaddr *) malloc(sizeof(*ia), M_IFADDR, M_NOWAIT); if (ia == NULL) return (ENOBUFS); bzero((caddr_t)ia, sizeof(*ia)); LIST_INIT(&ia->ia6_memberships); /* Initialize the address and masks, and put time stamp */ IFA_LOCK_INIT(&ia->ia_ifa); ia->ia_ifa.ifa_addr = (struct sockaddr *)&ia->ia_addr; ia->ia_addr.sin6_family = AF_INET6; ia->ia_addr.sin6_len = sizeof(ia->ia_addr); ia->ia6_createtime = time_second; if ((ifp->if_flags & (IFF_POINTOPOINT | IFF_LOOPBACK)) != 0) { /* * XXX: some functions expect that ifa_dstaddr is not * NULL for p2p interfaces. */ ia->ia_ifa.ifa_dstaddr = (struct sockaddr *)&ia->ia_dstaddr; } else { ia->ia_ifa.ifa_dstaddr = NULL; } ia->ia_ifa.ifa_netmask = (struct sockaddr *)&ia->ia_prefixmask; ia->ia_ifp = ifp; if ((oia = V_in6_ifaddr) != NULL) { for ( ; oia->ia_next; oia = oia->ia_next) continue; oia->ia_next = ia; } else V_in6_ifaddr = ia; ia->ia_ifa.ifa_refcnt = 1; IF_ADDR_LOCK(ifp); TAILQ_INSERT_TAIL(&ifp->if_addrhead, &ia->ia_ifa, ifa_link); IF_ADDR_UNLOCK(ifp); } /* update timestamp */ ia->ia6_updatetime = time_second; /* set prefix mask */ if (ifra->ifra_prefixmask.sin6_len) { /* * We prohibit changing the prefix length of an existing * address, because * + such an operation should be rare in IPv6, and * + the operation would confuse prefix management. */ if (ia->ia_prefixmask.sin6_len && in6_mask2len(&ia->ia_prefixmask.sin6_addr, NULL) != plen) { nd6log((LOG_INFO, "in6_update_ifa: the prefix length of an" " existing (%s) address should not be changed\n", ip6_sprintf(ip6buf, &ia->ia_addr.sin6_addr))); error = EINVAL; goto unlink; } ia->ia_prefixmask = ifra->ifra_prefixmask; } /* * If a new destination address is specified, scrub the old one and * install the new destination. Note that the interface must be * p2p or loopback (see the check above.) */ if (dst6.sin6_family == AF_INET6 && !IN6_ARE_ADDR_EQUAL(&dst6.sin6_addr, &ia->ia_dstaddr.sin6_addr)) { int e; if ((ia->ia_flags & IFA_ROUTE) != 0 && (e = rtinit(&(ia->ia_ifa), (int)RTM_DELETE, RTF_HOST)) != 0) { nd6log((LOG_ERR, "in6_update_ifa: failed to remove " "a route to the old destination: %s\n", ip6_sprintf(ip6buf, &ia->ia_addr.sin6_addr))); /* proceed anyway... */ } else ia->ia_flags &= ~IFA_ROUTE; ia->ia_dstaddr = dst6; } /* * Set lifetimes. We do not refer to ia6t_expire and ia6t_preferred * to see if the address is deprecated or invalidated, but initialize * these members for applications. 
*/ ia->ia6_lifetime = ifra->ifra_lifetime; if (ia->ia6_lifetime.ia6t_vltime != ND6_INFINITE_LIFETIME) { ia->ia6_lifetime.ia6t_expire = time_second + ia->ia6_lifetime.ia6t_vltime; } else ia->ia6_lifetime.ia6t_expire = 0; if (ia->ia6_lifetime.ia6t_pltime != ND6_INFINITE_LIFETIME) { ia->ia6_lifetime.ia6t_preferred = time_second + ia->ia6_lifetime.ia6t_pltime; } else ia->ia6_lifetime.ia6t_preferred = 0; /* reset the interface and routing table appropriately. */ if ((error = in6_ifinit(ifp, ia, &ifra->ifra_addr, hostIsNew)) != 0) goto unlink; /* * configure address flags. */ ia->ia6_flags = ifra->ifra_flags; /* * backward compatibility - if IN6_IFF_DEPRECATED is set from the * userland, make it deprecated. */ if ((ifra->ifra_flags & IN6_IFF_DEPRECATED) != 0) { ia->ia6_lifetime.ia6t_pltime = 0; ia->ia6_lifetime.ia6t_preferred = time_second; } /* * Make the address tentative before joining multicast addresses, * so that corresponding MLD responses would not have a tentative * source address. */ ia->ia6_flags &= ~IN6_IFF_DUPLICATED; /* safety */ if (hostIsNew && in6if_do_dad(ifp)) ia->ia6_flags |= IN6_IFF_TENTATIVE; /* * We are done if we have simply modified an existing address. */ if (!hostIsNew) return (error); /* * Beyond this point, we should call in6_purgeaddr upon an error, * not just go to unlink. */ /* Join necessary multicast groups */ in6m_sol = NULL; if ((ifp->if_flags & IFF_MULTICAST) != 0) { struct sockaddr_in6 mltaddr, mltmask; struct in6_addr llsol; /* join solicited multicast addr for new host id */ bzero(&llsol, sizeof(struct in6_addr)); llsol.s6_addr32[0] = IPV6_ADDR_INT32_MLL; llsol.s6_addr32[1] = 0; llsol.s6_addr32[2] = htonl(1); llsol.s6_addr32[3] = ifra->ifra_addr.sin6_addr.s6_addr32[3]; llsol.s6_addr8[12] = 0xff; if ((error = in6_setscope(&llsol, ifp, NULL)) != 0) { /* XXX: should not happen */ log(LOG_ERR, "in6_update_ifa: " "in6_setscope failed\n"); goto cleanup; } delay = 0; if ((flags & IN6_IFAUPDATE_DADDELAY)) { /* * We need a random delay for DAD on the address * being configured. It also means delaying * transmission of the corresponding MLD report to * avoid report collision. * [draft-ietf-ipv6-rfc2462bis-02.txt] */ delay = arc4random() % (MAX_RTR_SOLICITATION_DELAY * hz); } imm = in6_joingroup(ifp, &llsol, &error, delay); if (imm == NULL) { nd6log((LOG_WARNING, "in6_update_ifa: addmulti failed for " "%s on %s (errno=%d)\n", ip6_sprintf(ip6buf, &llsol), if_name(ifp), error)); in6_purgeaddr((struct ifaddr *)ia); return (error); } LIST_INSERT_HEAD(&ia->ia6_memberships, imm, i6mm_chain); in6m_sol = imm->i6mm_maddr; bzero(&mltmask, sizeof(mltmask)); mltmask.sin6_len = sizeof(struct sockaddr_in6); mltmask.sin6_family = AF_INET6; mltmask.sin6_addr = in6mask32; #define MLTMASK_LEN 4 /* mltmask's masklen (=32bit=4octet) */ /* * join link-local all-nodes address */ bzero(&mltaddr, sizeof(mltaddr)); mltaddr.sin6_len = sizeof(struct sockaddr_in6); mltaddr.sin6_family = AF_INET6; mltaddr.sin6_addr = in6addr_linklocal_allnodes; if ((error = in6_setscope(&mltaddr.sin6_addr, ifp, NULL)) != 0) goto cleanup; /* XXX: should not fail */ /* * XXX: do we really need this automatic routes? * We should probably reconsider this stuff. Most applications * actually do not need the routes, since they usually specify * the outgoing interface. */ rt = rtalloc1((struct sockaddr *)&mltaddr, 0, 0UL); if (rt) { /* XXX: only works in !SCOPEDROUTING case. 
                         */
                        if (memcmp(&mltaddr.sin6_addr,
                            &((struct sockaddr_in6 *)rt_key(rt))->sin6_addr,
                            MLTMASK_LEN)) {
                                RTFREE_LOCKED(rt);
                                rt = NULL;
                        }
                }
                if (!rt) {
                        error = rtrequest(RTM_ADD, (struct sockaddr *)&mltaddr,
                            (struct sockaddr *)&ia->ia_addr,
                            (struct sockaddr *)&mltmask, RTF_UP,
                            (struct rtentry **)0);
                        if (error)
                                goto cleanup;
                } else {
                        RTFREE_LOCKED(rt);
                }

                imm = in6_joingroup(ifp, &mltaddr.sin6_addr, &error, 0);
                if (!imm) {
                        nd6log((LOG_WARNING,
                            "in6_update_ifa: addmulti failed for "
                            "%s on %s (errno=%d)\n",
                            ip6_sprintf(ip6buf, &mltaddr.sin6_addr),
                            if_name(ifp), error));
                        goto cleanup;
                }
                LIST_INSERT_HEAD(&ia->ia6_memberships, imm, i6mm_chain);

                /*
                 * join node information group address
                 */
-#define hostnamelen     strlen(V_hostname)
                delay = 0;
                if ((flags & IN6_IFAUPDATE_DADDELAY)) {
                        /*
                         * The spec doesn't say anything about delay for this
                         * group, but the same logic should apply.
                         */
                        delay = arc4random() %
                            (MAX_RTR_SOLICITATION_DELAY * hz);
                }
-               mtx_lock(&hostname_mtx);
-               if (in6_nigroup(ifp, V_hostname, hostnamelen,
-                   &mltaddr.sin6_addr) == 0) {
-                       mtx_unlock(&hostname_mtx);
+               if (in6_nigroup(ifp, NULL, -1, &mltaddr.sin6_addr) == 0) {
                        imm = in6_joingroup(ifp, &mltaddr.sin6_addr, &error,
                            delay); /* XXX jinmei */
                        if (!imm) {
                                nd6log((LOG_WARNING, "in6_update_ifa: "
                                    "addmulti failed for %s on %s "
                                    "(errno=%d)\n",
                                    ip6_sprintf(ip6buf, &mltaddr.sin6_addr),
                                    if_name(ifp), error));
                                /* XXX not very fatal, go on... */
                        } else {
                                LIST_INSERT_HEAD(&ia->ia6_memberships,
                                    imm, i6mm_chain);
                        }
-               } else
-                       mtx_unlock(&hostname_mtx);
-#undef hostnamelen
+               }

                /*
                 * join interface-local all-nodes address.
                 * (ff01::1%ifN, and ff01::%ifN/32)
                 */
                mltaddr.sin6_addr = in6addr_nodelocal_allnodes;
                if ((error = in6_setscope(&mltaddr.sin6_addr, ifp, NULL)) != 0)
                        goto cleanup; /* XXX: should not fail */
                /* XXX: again, do we really need the route? */
                rt = rtalloc1((struct sockaddr *)&mltaddr, 0, 0UL);
                if (rt) {
                        if (memcmp(&mltaddr.sin6_addr,
                            &((struct sockaddr_in6 *)rt_key(rt))->sin6_addr,
                            MLTMASK_LEN)) {
                                RTFREE_LOCKED(rt);
                                rt = NULL;
                        }
                }
                if (!rt) {
                        error = rtrequest(RTM_ADD, (struct sockaddr *)&mltaddr,
                            (struct sockaddr *)&ia->ia_addr,
                            (struct sockaddr *)&mltmask, RTF_UP,
                            (struct rtentry **)0);
                        if (error)
                                goto cleanup;
                } else
                        RTFREE_LOCKED(rt);

                imm = in6_joingroup(ifp, &mltaddr.sin6_addr, &error, 0);
                if (!imm) {
                        nd6log((LOG_WARNING, "in6_update_ifa: "
                            "addmulti failed for %s on %s "
                            "(errno=%d)\n",
                            ip6_sprintf(ip6buf, &mltaddr.sin6_addr),
                            if_name(ifp), error));
                        goto cleanup;
                }
                LIST_INSERT_HEAD(&ia->ia6_memberships, imm, i6mm_chain);
#undef MLTMASK_LEN
        }

        /*
         * Perform DAD, if needed.
         * XXX It may be of use, if we can administratively
         * disable DAD.
         */
        if (hostIsNew && in6if_do_dad(ifp) &&
            ((ifra->ifra_flags & IN6_IFF_NODAD) == 0) &&
            (ia->ia6_flags & IN6_IFF_TENTATIVE)) {
                int mindelay, maxdelay;

                delay = 0;
                if ((flags & IN6_IFAUPDATE_DADDELAY)) {
                        /*
                         * We need to impose a delay before sending an NS
                         * for DAD.  Check if we also needed a delay for the
                         * corresponding MLD message.  If we did, the delay
                         * should be larger than the MLD delay (this could be
                         * relaxed a bit, but this simple logic is at least
                         * safe).
                         * XXX: Break data hiding guidelines and look at
                         * state for the solicited multicast group.
*/ mindelay = 0; if (in6m_sol != NULL && in6m_sol->in6m_state == MLD_REPORTING_MEMBER) { mindelay = in6m_sol->in6m_timer; } maxdelay = MAX_RTR_SOLICITATION_DELAY * hz; if (maxdelay - mindelay == 0) delay = 0; else { delay = (arc4random() % (maxdelay - mindelay)) + mindelay; } } nd6_dad_start((struct ifaddr *)ia, delay); } return (error); unlink: /* * XXX: if a change of an existing address failed, keep the entry * anyway. */ if (hostIsNew) in6_unlink_ifa(ia, ifp); return (error); cleanup: in6_purgeaddr(&ia->ia_ifa); return error; } void in6_purgeaddr(struct ifaddr *ifa) { struct ifnet *ifp = ifa->ifa_ifp; struct in6_ifaddr *ia = (struct in6_ifaddr *) ifa; struct in6_multi_mship *imm; struct sockaddr_in6 mltaddr, mltmask; struct rtentry rt0; struct sockaddr_dl gateway; struct sockaddr_in6 mask, addr; int plen, error; struct rtentry *rt; struct ifaddr *ifa0, *nifa; /* * find another IPv6 address as the gateway for the * link-local and node-local all-nodes multicast * address routes */ TAILQ_FOREACH_SAFE(ifa0, &ifp->if_addrhead, ifa_link, nifa) { if ((ifa0->ifa_addr->sa_family != AF_INET6) || memcmp(&satosin6(ifa0->ifa_addr)->sin6_addr, &ia->ia_addr.sin6_addr, sizeof(struct in6_addr)) == 0) continue; else break; } /* stop DAD processing */ nd6_dad_stop(ifa); IF_AFDATA_LOCK(ifp); lla_lookup(LLTABLE6(ifp), (LLE_DELETE | LLE_IFADDR), (struct sockaddr *)&ia->ia_addr); IF_AFDATA_UNLOCK(ifp); /* * initialize for rtmsg generation */ bzero(&gateway, sizeof(gateway)); gateway.sdl_len = sizeof(gateway); gateway.sdl_family = AF_LINK; gateway.sdl_nlen = 0; gateway.sdl_alen = ifp->if_addrlen; /* */ bzero(&rt0, sizeof(rt0)); rt0.rt_gateway = (struct sockaddr *)&gateway; memcpy(&mask, &ia->ia_prefixmask, sizeof(ia->ia_prefixmask)); memcpy(&addr, &ia->ia_addr, sizeof(ia->ia_addr)); rt_mask(&rt0) = (struct sockaddr *)&mask; rt_key(&rt0) = (struct sockaddr *)&addr; rt0.rt_flags = RTF_HOST | RTF_STATIC; rt_newaddrmsg(RTM_DELETE, ifa, 0, &rt0); /* * leave from multicast groups we have joined for the interface */ while ((imm = ia->ia6_memberships.lh_first) != NULL) { LIST_REMOVE(imm, i6mm_chain); in6_leavegroup(imm); } /* * remove the link-local all-nodes address */ bzero(&mltmask, sizeof(mltmask)); mltmask.sin6_len = sizeof(struct sockaddr_in6); mltmask.sin6_family = AF_INET6; mltmask.sin6_addr = in6mask32; bzero(&mltaddr, sizeof(mltaddr)); mltaddr.sin6_len = sizeof(struct sockaddr_in6); mltaddr.sin6_family = AF_INET6; mltaddr.sin6_addr = in6addr_linklocal_allnodes; if ((error = in6_setscope(&mltaddr.sin6_addr, ifp, NULL)) != 0) goto cleanup; rt = rtalloc1((struct sockaddr *)&mltaddr, 0, 0UL); if (rt != NULL && rt->rt_gateway != NULL && (memcmp(&satosin6(rt->rt_gateway)->sin6_addr, &ia->ia_addr.sin6_addr, sizeof(ia->ia_addr.sin6_addr)) == 0)) { /* * if no more IPv6 address exists on this interface * then remove the multicast address route */ if (ifa0 == NULL) { memcpy(&mltaddr.sin6_addr, &satosin6(rt_key(rt))->sin6_addr, sizeof(mltaddr.sin6_addr)); RTFREE_LOCKED(rt); error = rtrequest(RTM_DELETE, (struct sockaddr *)&mltaddr, (struct sockaddr *)&ia->ia_addr, (struct sockaddr *)&mltmask, RTF_UP, (struct rtentry **)0); if (error) log(LOG_INFO, "in6_purgeaddr: link-local all-nodes" "multicast address deletion error\n"); } else { /* * replace the gateway of the route */ struct sockaddr_in6 sa; bzero(&sa, sizeof(sa)); sa.sin6_len = sizeof(struct sockaddr_in6); sa.sin6_family = AF_INET6; memcpy(&sa.sin6_addr, &satosin6(ifa0->ifa_addr)->sin6_addr, sizeof(sa.sin6_addr)); in6_setscope(&sa.sin6_addr, ifa0->ifa_ifp, NULL); 
memcpy(rt->rt_gateway, &sa, sizeof(sa)); RTFREE_LOCKED(rt); } } else { if (rt != NULL) RTFREE_LOCKED(rt); } /* * remove the node-local all-nodes address */ mltaddr.sin6_addr = in6addr_nodelocal_allnodes; if ((error = in6_setscope(&mltaddr.sin6_addr, ifp, NULL)) != 0) goto cleanup; rt = rtalloc1((struct sockaddr *)&mltaddr, 0, 0UL); if (rt != NULL && rt->rt_gateway != NULL && (memcmp(&satosin6(rt->rt_gateway)->sin6_addr, &ia->ia_addr.sin6_addr, sizeof(ia->ia_addr.sin6_addr)) == 0)) { /* * if no more IPv6 address exists on this interface * then remove the multicast address route */ if (ifa0 == NULL) { memcpy(&mltaddr.sin6_addr, &satosin6(rt_key(rt))->sin6_addr, sizeof(mltaddr.sin6_addr)); RTFREE_LOCKED(rt); error = rtrequest(RTM_DELETE, (struct sockaddr *)&mltaddr, (struct sockaddr *)&ia->ia_addr, (struct sockaddr *)&mltmask, RTF_UP, (struct rtentry **)0); if (error) log(LOG_INFO, "in6_purgeaddr: node-local all-nodes" "multicast address deletion error\n"); } else { /* * replace the gateway of the route */ struct sockaddr_in6 sa; bzero(&sa, sizeof(sa)); sa.sin6_len = sizeof(struct sockaddr_in6); sa.sin6_family = AF_INET6; memcpy(&sa.sin6_addr, &satosin6(ifa0->ifa_addr)->sin6_addr, sizeof(sa.sin6_addr)); in6_setscope(&sa.sin6_addr, ifa0->ifa_ifp, NULL); memcpy(rt->rt_gateway, &sa, sizeof(sa)); RTFREE_LOCKED(rt); } } else { if (rt != NULL) RTFREE_LOCKED(rt); } cleanup: plen = in6_mask2len(&ia->ia_prefixmask.sin6_addr, NULL); /* XXX */ if ((ia->ia_flags & IFA_ROUTE) && plen == 128) { int error; struct sockaddr *dstaddr; /* * use the interface address if configuring an * interface address with a /128 prefix len */ if (ia->ia_dstaddr.sin6_family == AF_INET6) dstaddr = (struct sockaddr *)&ia->ia_dstaddr; else dstaddr = (struct sockaddr *)&ia->ia_addr; error = rtrequest(RTM_DELETE, (struct sockaddr *)dstaddr, (struct sockaddr *)&ia->ia_addr, (struct sockaddr *)&ia->ia_prefixmask, ia->ia_flags | RTF_HOST, NULL); if (error != 0) return; ia->ia_flags &= ~IFA_ROUTE; } in6_unlink_ifa(ia, ifp); } static void in6_unlink_ifa(struct in6_ifaddr *ia, struct ifnet *ifp) { INIT_VNET_INET6(ifp->if_vnet); struct in6_ifaddr *oia; int s = splnet(); IF_ADDR_LOCK(ifp); TAILQ_REMOVE(&ifp->if_addrhead, &ia->ia_ifa, ifa_link); IF_ADDR_UNLOCK(ifp); oia = ia; if (oia == (ia = V_in6_ifaddr)) V_in6_ifaddr = ia->ia_next; else { while (ia->ia_next && (ia->ia_next != oia)) ia = ia->ia_next; if (ia->ia_next) ia->ia_next = oia->ia_next; else { /* search failed */ printf("Couldn't unlink in6_ifaddr from in6_ifaddr\n"); } } /* * Release the reference to the base prefix. There should be a * positive reference. */ if (oia->ia6_ndpr == NULL) { nd6log((LOG_NOTICE, "in6_unlink_ifa: autoconf'ed address " "%p has no prefix\n", oia)); } else { oia->ia6_ndpr->ndpr_refcnt--; oia->ia6_ndpr = NULL; } /* * Also, if the address being removed is autoconf'ed, call * pfxlist_onlink_check() since the release might affect the status of * other (detached) addresses. */ if ((oia->ia6_flags & IN6_IFF_AUTOCONF)) { pfxlist_onlink_check(); } /* * release another refcnt for the link from in6_ifaddr. * Note that we should decrement the refcnt at least once for all *BSD. */ IFAFREE(&oia->ia_ifa); splx(s); } void in6_purgeif(struct ifnet *ifp) { struct ifaddr *ifa, *nifa; TAILQ_FOREACH_SAFE(ifa, &ifp->if_addrhead, ifa_link, nifa) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; in6_purgeaddr(ifa); } in6_ifdetach(ifp); } /* * SIOC[GAD]LIFADDR. * SIOCGLIFADDR: get first address. (?) 
* SIOCGLIFADDR with IFLR_PREFIX: * get first address that matches the specified prefix. * SIOCALIFADDR: add the specified address. * SIOCALIFADDR with IFLR_PREFIX: * add the specified prefix, filling hostid part from * the first link-local address. prefixlen must be <= 64. * SIOCDLIFADDR: delete the specified address. * SIOCDLIFADDR with IFLR_PREFIX: * delete the first address that matches the specified prefix. * return values: * EINVAL on invalid parameters * EADDRNOTAVAIL on prefix match failed/specified address not found * other values may be returned from in6_ioctl() * * NOTE: SIOCALIFADDR(with IFLR_PREFIX set) allows prefixlen less than 64. * this is to accomodate address naming scheme other than RFC2374, * in the future. * RFC2373 defines interface id to be 64bit, but it allows non-RFC2374 * address encoding scheme. (see figure on page 8) */ static int in6_lifaddr_ioctl(struct socket *so, u_long cmd, caddr_t data, struct ifnet *ifp, struct thread *td) { struct if_laddrreq *iflr = (struct if_laddrreq *)data; struct ifaddr *ifa; struct sockaddr *sa; /* sanity checks */ if (!data || !ifp) { panic("invalid argument to in6_lifaddr_ioctl"); /* NOTREACHED */ } switch (cmd) { case SIOCGLIFADDR: /* address must be specified on GET with IFLR_PREFIX */ if ((iflr->flags & IFLR_PREFIX) == 0) break; /* FALLTHROUGH */ case SIOCALIFADDR: case SIOCDLIFADDR: /* address must be specified on ADD and DELETE */ sa = (struct sockaddr *)&iflr->addr; if (sa->sa_family != AF_INET6) return EINVAL; if (sa->sa_len != sizeof(struct sockaddr_in6)) return EINVAL; /* XXX need improvement */ sa = (struct sockaddr *)&iflr->dstaddr; if (sa->sa_family && sa->sa_family != AF_INET6) return EINVAL; if (sa->sa_len && sa->sa_len != sizeof(struct sockaddr_in6)) return EINVAL; break; default: /* shouldn't happen */ #if 0 panic("invalid cmd to in6_lifaddr_ioctl"); /* NOTREACHED */ #else return EOPNOTSUPP; #endif } if (sizeof(struct in6_addr) * 8 < iflr->prefixlen) return EINVAL; switch (cmd) { case SIOCALIFADDR: { struct in6_aliasreq ifra; struct in6_addr *hostid = NULL; int prefixlen; if ((iflr->flags & IFLR_PREFIX) != 0) { struct sockaddr_in6 *sin6; /* * hostid is to fill in the hostid part of the * address. hostid points to the first link-local * address attached to the interface. */ ifa = (struct ifaddr *)in6ifa_ifpforlinklocal(ifp, 0); if (!ifa) return EADDRNOTAVAIL; hostid = IFA_IN6(ifa); /* prefixlen must be <= 64. */ if (64 < iflr->prefixlen) return EINVAL; prefixlen = iflr->prefixlen; /* hostid part must be zero. */ sin6 = (struct sockaddr_in6 *)&iflr->addr; if (sin6->sin6_addr.s6_addr32[2] != 0 || sin6->sin6_addr.s6_addr32[3] != 0) { return EINVAL; } } else prefixlen = iflr->prefixlen; /* copy args to in6_aliasreq, perform ioctl(SIOCAIFADDR_IN6). 
*/ bzero(&ifra, sizeof(ifra)); bcopy(iflr->iflr_name, ifra.ifra_name, sizeof(ifra.ifra_name)); bcopy(&iflr->addr, &ifra.ifra_addr, ((struct sockaddr *)&iflr->addr)->sa_len); if (hostid) { /* fill in hostid part */ ifra.ifra_addr.sin6_addr.s6_addr32[2] = hostid->s6_addr32[2]; ifra.ifra_addr.sin6_addr.s6_addr32[3] = hostid->s6_addr32[3]; } if (((struct sockaddr *)&iflr->dstaddr)->sa_family) { /* XXX */ bcopy(&iflr->dstaddr, &ifra.ifra_dstaddr, ((struct sockaddr *)&iflr->dstaddr)->sa_len); if (hostid) { ifra.ifra_dstaddr.sin6_addr.s6_addr32[2] = hostid->s6_addr32[2]; ifra.ifra_dstaddr.sin6_addr.s6_addr32[3] = hostid->s6_addr32[3]; } } ifra.ifra_prefixmask.sin6_len = sizeof(struct sockaddr_in6); in6_prefixlen2mask(&ifra.ifra_prefixmask.sin6_addr, prefixlen); ifra.ifra_flags = iflr->flags & ~IFLR_PREFIX; return in6_control(so, SIOCAIFADDR_IN6, (caddr_t)&ifra, ifp, td); } case SIOCGLIFADDR: case SIOCDLIFADDR: { struct in6_ifaddr *ia; struct in6_addr mask, candidate, match; struct sockaddr_in6 *sin6; int cmp; bzero(&mask, sizeof(mask)); if (iflr->flags & IFLR_PREFIX) { /* lookup a prefix rather than address. */ in6_prefixlen2mask(&mask, iflr->prefixlen); sin6 = (struct sockaddr_in6 *)&iflr->addr; bcopy(&sin6->sin6_addr, &match, sizeof(match)); match.s6_addr32[0] &= mask.s6_addr32[0]; match.s6_addr32[1] &= mask.s6_addr32[1]; match.s6_addr32[2] &= mask.s6_addr32[2]; match.s6_addr32[3] &= mask.s6_addr32[3]; /* if you set extra bits, that's wrong */ if (bcmp(&match, &sin6->sin6_addr, sizeof(match))) return EINVAL; cmp = 1; } else { if (cmd == SIOCGLIFADDR) { /* on getting an address, take the 1st match */ cmp = 0; /* XXX */ } else { /* on deleting an address, do exact match */ in6_prefixlen2mask(&mask, 128); sin6 = (struct sockaddr_in6 *)&iflr->addr; bcopy(&sin6->sin6_addr, &match, sizeof(match)); cmp = 1; } } IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; if (!cmp) break; /* * XXX: this is adhoc, but is necessary to allow * a user to specify fe80::/64 (not /10) for a * link-local address. 
*/ bcopy(IFA_IN6(ifa), &candidate, sizeof(candidate)); in6_clearscope(&candidate); candidate.s6_addr32[0] &= mask.s6_addr32[0]; candidate.s6_addr32[1] &= mask.s6_addr32[1]; candidate.s6_addr32[2] &= mask.s6_addr32[2]; candidate.s6_addr32[3] &= mask.s6_addr32[3]; if (IN6_ARE_ADDR_EQUAL(&candidate, &match)) break; } IF_ADDR_UNLOCK(ifp); if (!ifa) return EADDRNOTAVAIL; ia = ifa2ia6(ifa); if (cmd == SIOCGLIFADDR) { int error; /* fill in the if_laddrreq structure */ bcopy(&ia->ia_addr, &iflr->addr, ia->ia_addr.sin6_len); error = sa6_recoverscope( (struct sockaddr_in6 *)&iflr->addr); if (error != 0) return (error); if ((ifp->if_flags & IFF_POINTOPOINT) != 0) { bcopy(&ia->ia_dstaddr, &iflr->dstaddr, ia->ia_dstaddr.sin6_len); error = sa6_recoverscope( (struct sockaddr_in6 *)&iflr->dstaddr); if (error != 0) return (error); } else bzero(&iflr->dstaddr, sizeof(iflr->dstaddr)); iflr->prefixlen = in6_mask2len(&ia->ia_prefixmask.sin6_addr, NULL); iflr->flags = ia->ia6_flags; /* XXX */ return 0; } else { struct in6_aliasreq ifra; /* fill in6_aliasreq and do ioctl(SIOCDIFADDR_IN6) */ bzero(&ifra, sizeof(ifra)); bcopy(iflr->iflr_name, ifra.ifra_name, sizeof(ifra.ifra_name)); bcopy(&ia->ia_addr, &ifra.ifra_addr, ia->ia_addr.sin6_len); if ((ifp->if_flags & IFF_POINTOPOINT) != 0) { bcopy(&ia->ia_dstaddr, &ifra.ifra_dstaddr, ia->ia_dstaddr.sin6_len); } else { bzero(&ifra.ifra_dstaddr, sizeof(ifra.ifra_dstaddr)); } bcopy(&ia->ia_prefixmask, &ifra.ifra_dstaddr, ia->ia_prefixmask.sin6_len); ifra.ifra_flags = ia->ia6_flags; return in6_control(so, SIOCDIFADDR_IN6, (caddr_t)&ifra, ifp, td); } } } return EOPNOTSUPP; /* just for safety */ } /* * Initialize an interface's intetnet6 address * and routing table entry. */ static int in6_ifinit(struct ifnet *ifp, struct in6_ifaddr *ia, struct sockaddr_in6 *sin6, int newhost) { int error = 0, plen, ifacount = 0; int s = splimp(); struct ifaddr *ifa; /* * Give the interface a chance to initialize * if this is its first address, * and to validate the address if necessary. */ IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; ifacount++; } IF_ADDR_UNLOCK(ifp); ia->ia_addr = *sin6; if (ifacount <= 1 && ifp->if_ioctl) { error = (*ifp->if_ioctl)(ifp, SIOCSIFADDR, (caddr_t)ia); if (error) { splx(s); return (error); } } splx(s); ia->ia_ifa.ifa_metric = ifp->if_metric; /* we could do in(6)_socktrim here, but just omit it at this moment. */ /* * Special case: * If a new destination address is specified for a point-to-point * interface, install a route to the destination as an interface * direct route. * XXX: the logic below rejects assigning multiple addresses on a p2p * interface that share the same destination. */ plen = in6_mask2len(&ia->ia_prefixmask.sin6_addr, NULL); /* XXX */ if (!(ia->ia_flags & IFA_ROUTE) && plen == 128) { struct sockaddr *dstaddr; int rtflags = RTF_UP | RTF_HOST; /* * use the interface address if configuring an * interface address with a /128 prefix len */ if (ia->ia_dstaddr.sin6_family == AF_INET6) dstaddr = (struct sockaddr *)&ia->ia_dstaddr; else dstaddr = (struct sockaddr *)&ia->ia_addr; error = rtrequest(RTM_ADD, (struct sockaddr *)dstaddr, (struct sockaddr *)&ia->ia_addr, (struct sockaddr *)&ia->ia_prefixmask, ia->ia_flags | rtflags, NULL); if (error != 0) return (error); ia->ia_flags |= IFA_ROUTE; } /* Add ownaddr as loopback rtentry, if necessary (ex. on p2p link). 
*/ if (newhost) { struct llentry *ln; struct rtentry rt; struct sockaddr_dl gateway; struct sockaddr_in6 mask, addr; IF_AFDATA_LOCK(ifp); ia->ia_ifa.ifa_rtrequest = NULL; /* XXX QL * we need to report rt_newaddrmsg */ ln = lla_lookup(LLTABLE6(ifp), (LLE_CREATE | LLE_IFADDR | LLE_EXCLUSIVE), (struct sockaddr *)&ia->ia_addr); IF_AFDATA_UNLOCK(ifp); if (ln != NULL) { ln->la_expire = 0; /* for IPv6 this means permanent */ ln->ln_state = ND6_LLINFO_REACHABLE; /* * initialize for rtmsg generation */ bzero(&gateway, sizeof(gateway)); gateway.sdl_len = sizeof(gateway); gateway.sdl_family = AF_LINK; gateway.sdl_nlen = 0; gateway.sdl_alen = 6; memcpy(gateway.sdl_data, &ln->ll_addr.mac_aligned, sizeof(ln->ll_addr)); /* */ LLE_WUNLOCK(ln); } bzero(&rt, sizeof(rt)); rt.rt_gateway = (struct sockaddr *)&gateway; memcpy(&mask, &ia->ia_prefixmask, sizeof(ia->ia_prefixmask)); memcpy(&addr, &ia->ia_addr, sizeof(ia->ia_addr)); rt_mask(&rt) = (struct sockaddr *)&mask; rt_key(&rt) = (struct sockaddr *)&addr; rt.rt_flags = RTF_UP | RTF_HOST | RTF_STATIC; rt_newaddrmsg(RTM_ADD, &ia->ia_ifa, 0, &rt); } return (error); } /* * Find an IPv6 interface link-local address specific to an interface. */ struct in6_ifaddr * in6ifa_ifpforlinklocal(struct ifnet *ifp, int ignoreflags) { struct ifaddr *ifa; IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; if (IN6_IS_ADDR_LINKLOCAL(IFA_IN6(ifa))) { if ((((struct in6_ifaddr *)ifa)->ia6_flags & ignoreflags) != 0) continue; break; } } IF_ADDR_UNLOCK(ifp); return ((struct in6_ifaddr *)ifa); } /* * find the internet address corresponding to a given interface and address. */ struct in6_ifaddr * in6ifa_ifpwithaddr(struct ifnet *ifp, struct in6_addr *addr) { struct ifaddr *ifa; IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; if (IN6_ARE_ADDR_EQUAL(addr, IFA_IN6(ifa))) break; } IF_ADDR_UNLOCK(ifp); return ((struct in6_ifaddr *)ifa); } /* * Convert IP6 address to printable (loggable) representation. Caller * has to make sure that ip6buf is at least INET6_ADDRSTRLEN long. */ static char digits[] = "0123456789abcdef"; char * ip6_sprintf(char *ip6buf, const struct in6_addr *addr) { int i; char *cp; const u_int16_t *a = (const u_int16_t *)addr; const u_int8_t *d; int dcolon = 0, zero = 0; cp = ip6buf; for (i = 0; i < 8; i++) { if (dcolon == 1) { if (*a == 0) { if (i == 7) *cp++ = ':'; a++; continue; } else dcolon = 2; } if (*a == 0) { if (dcolon == 0 && *(a + 1) == 0) { if (i == 0) *cp++ = ':'; *cp++ = ':'; dcolon = 1; } else { *cp++ = '0'; *cp++ = ':'; } a++; continue; } d = (const u_char *)a; /* Try to eliminate leading zeros in printout like in :0001. 
*/ zero = 1; *cp = digits[*d >> 4]; if (*cp != '0') { zero = 0; cp++; } *cp = digits[*d++ & 0xf]; if (zero == 0 || (*cp != '0')) { zero = 0; cp++; } *cp = digits[*d >> 4]; if (zero == 0 || (*cp != '0')) { zero = 0; cp++; } *cp++ = digits[*d & 0xf]; *cp++ = ':'; a++; } *--cp = '\0'; return (ip6buf); } int in6_localaddr(struct in6_addr *in6) { INIT_VNET_INET6(curvnet); struct in6_ifaddr *ia; if (IN6_IS_ADDR_LOOPBACK(in6) || IN6_IS_ADDR_LINKLOCAL(in6)) return 1; for (ia = V_in6_ifaddr; ia; ia = ia->ia_next) { if (IN6_ARE_MASKED_ADDR_EQUAL(in6, &ia->ia_addr.sin6_addr, &ia->ia_prefixmask.sin6_addr)) { return 1; } } return (0); } int in6_is_addr_deprecated(struct sockaddr_in6 *sa6) { INIT_VNET_INET6(curvnet); struct in6_ifaddr *ia; for (ia = V_in6_ifaddr; ia; ia = ia->ia_next) { if (IN6_ARE_ADDR_EQUAL(&ia->ia_addr.sin6_addr, &sa6->sin6_addr) && (ia->ia6_flags & IN6_IFF_DEPRECATED) != 0) return (1); /* true */ /* XXX: do we still have to go thru the rest of the list? */ } return (0); /* false */ } /* * return length of part which dst and src are equal * hard coding... */ int in6_matchlen(struct in6_addr *src, struct in6_addr *dst) { int match = 0; u_char *s = (u_char *)src, *d = (u_char *)dst; u_char *lim = s + 16, r; while (s < lim) if ((r = (*d++ ^ *s++)) != 0) { while (r < 128) { match++; r <<= 1; } break; } else match += 8; return match; } /* XXX: to be scope conscious */ int in6_are_prefix_equal(struct in6_addr *p1, struct in6_addr *p2, int len) { int bytelen, bitlen; /* sanity check */ if (0 > len || len > 128) { log(LOG_ERR, "in6_are_prefix_equal: invalid prefix length(%d)\n", len); return (0); } bytelen = len / 8; bitlen = len % 8; if (bcmp(&p1->s6_addr, &p2->s6_addr, bytelen)) return (0); if (bitlen != 0 && p1->s6_addr[bytelen] >> (8 - bitlen) != p2->s6_addr[bytelen] >> (8 - bitlen)) return (0); return (1); } void in6_prefixlen2mask(struct in6_addr *maskp, int len) { u_char maskarray[8] = {0x80, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc, 0xfe, 0xff}; int bytelen, bitlen, i; /* sanity check */ if (0 > len || len > 128) { log(LOG_ERR, "in6_prefixlen2mask: invalid prefix length(%d)\n", len); return; } bzero(maskp, sizeof(*maskp)); bytelen = len / 8; bitlen = len % 8; for (i = 0; i < bytelen; i++) maskp->s6_addr[i] = 0xff; if (bitlen) maskp->s6_addr[bytelen] = maskarray[bitlen - 1]; } /* * return the best address out of the same scope. if no address was * found, return the first valid address from designated IF. */ struct in6_ifaddr * in6_ifawithifp(struct ifnet *ifp, struct in6_addr *dst) { INIT_VNET_INET6(curvnet); int dst_scope = in6_addrscope(dst), blen = -1, tlen; struct ifaddr *ifa; struct in6_ifaddr *besta = 0; struct in6_ifaddr *dep[2]; /* last-resort: deprecated */ dep[0] = dep[1] = NULL; /* * We first look for addresses in the same scope. * If there is one, return it. * If two or more, return one which matches the dst longest. * If none, return one of global addresses assigned other ifs. */ IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; if (((struct in6_ifaddr *)ifa)->ia6_flags & IN6_IFF_ANYCAST) continue; /* XXX: is there any case to allow anycast? 
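ip6_sprintf() above is the kernel-internal formatter; as its comment notes, the caller must supply a buffer of at least INET6_ADDRSTRLEN bytes. Userland code would use inet_ntop(3) for the same job, with the same caller-supplied-buffer convention. A small self-contained illustration (the address literal is arbitrary):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct in6_addr a;
        char buf[INET6_ADDRSTRLEN];

        if (inet_pton(AF_INET6, "fe80:0:0:0:0:0:0:1", &a) != 1)
            return (1);
        /* prints "fe80::1" */
        printf("%s\n", inet_ntop(AF_INET6, &a, buf, sizeof(buf)));
        return (0);
    }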
*/ if (((struct in6_ifaddr *)ifa)->ia6_flags & IN6_IFF_NOTREADY) continue; /* don't use this interface */ if (((struct in6_ifaddr *)ifa)->ia6_flags & IN6_IFF_DETACHED) continue; if (((struct in6_ifaddr *)ifa)->ia6_flags & IN6_IFF_DEPRECATED) { if (V_ip6_use_deprecated) dep[0] = (struct in6_ifaddr *)ifa; continue; } if (dst_scope == in6_addrscope(IFA_IN6(ifa))) { /* * call in6_matchlen() as few as possible */ if (besta) { if (blen == -1) blen = in6_matchlen(&besta->ia_addr.sin6_addr, dst); tlen = in6_matchlen(IFA_IN6(ifa), dst); if (tlen > blen) { blen = tlen; besta = (struct in6_ifaddr *)ifa; } } else besta = (struct in6_ifaddr *)ifa; } } if (besta) { IF_ADDR_UNLOCK(ifp); return (besta); } TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; if (((struct in6_ifaddr *)ifa)->ia6_flags & IN6_IFF_ANYCAST) continue; /* XXX: is there any case to allow anycast? */ if (((struct in6_ifaddr *)ifa)->ia6_flags & IN6_IFF_NOTREADY) continue; /* don't use this interface */ if (((struct in6_ifaddr *)ifa)->ia6_flags & IN6_IFF_DETACHED) continue; if (((struct in6_ifaddr *)ifa)->ia6_flags & IN6_IFF_DEPRECATED) { if (V_ip6_use_deprecated) dep[1] = (struct in6_ifaddr *)ifa; continue; } IF_ADDR_UNLOCK(ifp); return (struct in6_ifaddr *)ifa; } IF_ADDR_UNLOCK(ifp); /* use the last-resort values, that are, deprecated addresses */ if (dep[0]) return dep[0]; if (dep[1]) return dep[1]; return NULL; } /* * perform DAD when interface becomes IFF_UP. */ void in6_if_up(struct ifnet *ifp) { struct ifaddr *ifa; struct in6_ifaddr *ia; IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; ia = (struct in6_ifaddr *)ifa; if (ia->ia6_flags & IN6_IFF_TENTATIVE) { /* * The TENTATIVE flag was likely set by hand * beforehand, implicitly indicating the need for DAD. * We may be able to skip the random delay in this * case, but we impose delays just in case. */ nd6_dad_start(ifa, arc4random() % (MAX_RTR_SOLICITATION_DELAY * hz)); } } IF_ADDR_UNLOCK(ifp); /* * special cases, like 6to4, are handled in in6_ifattach */ in6_ifattach(ifp, NULL); } int in6if_do_dad(struct ifnet *ifp) { if ((ifp->if_flags & IFF_LOOPBACK) != 0) return (0); switch (ifp->if_type) { #ifdef IFT_DUMMY case IFT_DUMMY: #endif case IFT_FAITH: /* * These interfaces do not have the IFF_LOOPBACK flag, * but loop packets back. We do not have to do DAD on such * interfaces. We should even omit it, because loop-backed * NS would confuse the DAD procedure. */ return (0); default: /* * Our DAD routine requires the interface up and running. * However, some interfaces can be up before the RUNNING * status. Additionaly, users may try to assign addresses * before the interface becomes up (or running). * We simply skip DAD in such a case as a work around. * XXX: we should rather mark "tentative" on such addresses, * and do DAD after the interface becomes ready. */ if (!((ifp->if_flags & IFF_UP) && (ifp->if_drv_flags & IFF_DRV_RUNNING))) return (0); return (1); } } /* * Calculate max IPv6 MTU through all the interfaces and store it * to in6_maxmtu. 
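in6_ifawithifp() above prefers an address in the destination's scope and, among those, the one sharing the longest prefix with the destination (the blen/tlen bookkeeping around in6_matchlen()), keeping deprecated addresses only as a last resort. The core "longest match wins" rule, stripped of the scope and flag checks, looks like this as a standalone userland sketch (helper names are illustrative):

    #include <netinet/in.h>
    #include <stddef.h>

    /* Number of leading bits two IPv6 addresses have in common. */
    static int
    common_prefix_len(const struct in6_addr *a, const struct in6_addr *b)
    {
        int bits = 0, i;
        unsigned char x;

        for (i = 0; i < 16; i++) {
            x = a->s6_addr[i] ^ b->s6_addr[i];
            if (x == 0) {
                bits += 8;
                continue;
            }
            while ((x & 0x80) == 0) {   /* count equal leading bits */
                bits++;
                x <<= 1;
            }
            break;
        }
        return (bits);
    }

    /* Pick the candidate source address that best matches dst. */
    static const struct in6_addr *
    pick_longest_match(const struct in6_addr *cand, size_t n,
        const struct in6_addr *dst)
    {
        const struct in6_addr *best = NULL;
        int blen = -1, tlen;
        size_t i;

        for (i = 0; i < n; i++) {
            tlen = common_prefix_len(&cand[i], dst);
            if (tlen > blen) {
                blen = tlen;
                best = &cand[i];
            }
        }
        return (best);
    }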
*/ void in6_setmaxmtu(void) { INIT_VNET_NET(curvnet); INIT_VNET_INET6(curvnet); unsigned long maxmtu = 0; struct ifnet *ifp; IFNET_RLOCK(); for (ifp = TAILQ_FIRST(&V_ifnet); ifp; ifp = TAILQ_NEXT(ifp, if_list)) { /* this function can be called during ifnet initialization */ if (!ifp->if_afdata[AF_INET6]) continue; if ((ifp->if_flags & IFF_LOOPBACK) == 0 && IN6_LINKMTU(ifp) > maxmtu) maxmtu = IN6_LINKMTU(ifp); } IFNET_RUNLOCK(); if (maxmtu) /* update only when maxmtu is positive */ V_in6_maxmtu = maxmtu; } /* * Provide the length of interface identifiers to be used for the link attached * to the given interface. The length should be defined in "IPv6 over * xxx-link" document. Note that address architecture might also define * the length for a particular set of address prefixes, regardless of the * link type. As clarified in rfc2462bis, those two definitions should be * consistent, and those really are as of August 2004. */ int in6_if2idlen(struct ifnet *ifp) { switch (ifp->if_type) { case IFT_ETHER: /* RFC2464 */ #ifdef IFT_PROPVIRTUAL case IFT_PROPVIRTUAL: /* XXX: no RFC. treat it as ether */ #endif #ifdef IFT_L2VLAN case IFT_L2VLAN: /* ditto */ #endif #ifdef IFT_IEEE80211 case IFT_IEEE80211: /* ditto */ #endif #ifdef IFT_MIP case IFT_MIP: /* ditto */ #endif return (64); case IFT_FDDI: /* RFC2467 */ return (64); case IFT_ISO88025: /* RFC2470 (IPv6 over Token Ring) */ return (64); case IFT_PPP: /* RFC2472 */ return (64); case IFT_ARCNET: /* RFC2497 */ return (64); case IFT_FRELAY: /* RFC2590 */ return (64); case IFT_IEEE1394: /* RFC3146 */ return (64); case IFT_GIF: return (64); /* draft-ietf-v6ops-mech-v2-07 */ case IFT_LOOP: return (64); /* XXX: is this really correct? */ default: /* * Unknown link type: * It might be controversial to use the today's common constant * of 64 for these cases unconditionally. For full compliance, * we should return an error in this case. On the other hand, * if we simply miss the standard for the link type or a new * standard is defined for a new link type, the IFID length * is very likely to be the common constant. As a compromise, * we always use the constant, but make an explicit notice * indicating the "unknown" case. */ printf("in6_if2idlen: unknown link type (%d)\n", ifp->if_type); return (64); } } #include struct in6_llentry { struct llentry base; struct sockaddr_in6 l3_addr6; }; static struct llentry * in6_lltable_new(const struct sockaddr *l3addr, u_int flags) { struct in6_llentry *lle; lle = malloc(sizeof(struct in6_llentry), M_LLTABLE, M_DONTWAIT | M_ZERO); if (lle == NULL) /* NB: caller generates msg */ return NULL; callout_init(&lle->base.ln_timer_ch, CALLOUT_MPSAFE); lle->l3_addr6 = *(const struct sockaddr_in6 *)l3addr; lle->base.lle_refcnt = 1; LLE_LOCK_INIT(&lle->base); return &lle->base; } /* * Deletes an address from the address table. * This function is called by the timer functions * such as arptimer() and nd6_llinfo_timer(), and * the caller does the locking. 
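in6_setmaxmtu() above walks every attached interface and records the largest IPv6 link MTU in in6_maxmtu, skipping loopback. A rough userland counterpart, shown only for comparison, queries each interface's MTU with the SIOCGIFMTU ioctl; this is an approximation (it reports the driver MTU rather than IN6_LINKMTU, and it does not skip loopback), and the headers assume a BSD-like system:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>
    #include <net/if.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct if_nameindex *ifs, *p;
        struct ifreq ifr;
        unsigned long maxmtu = 0;
        int s;

        if ((s = socket(AF_INET6, SOCK_DGRAM, 0)) == -1)
            return (1);
        if ((ifs = if_nameindex()) == NULL)
            return (1);
        for (p = ifs; p->if_index != 0 && p->if_name != NULL; p++) {
            memset(&ifr, 0, sizeof(ifr));
            strncpy(ifr.ifr_name, p->if_name, sizeof(ifr.ifr_name) - 1);
            if (ioctl(s, SIOCGIFMTU, &ifr) == -1)
                continue;
            if ((unsigned long)ifr.ifr_mtu > maxmtu)
                maxmtu = ifr.ifr_mtu;
        }
        if_freenameindex(ifs);
        close(s);
        printf("largest interface MTU: %lu\n", maxmtu);
        return (0);
    }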
*/ static void in6_lltable_free(struct lltable *llt, struct llentry *lle) { LLE_WUNLOCK(lle); LLE_LOCK_DESTROY(lle); free(lle, M_LLTABLE); } static void in6_lltable_prefix_free(struct lltable *llt, const struct sockaddr *prefix, const struct sockaddr *mask) { const struct sockaddr_in6 *pfx = (const struct sockaddr_in6 *)prefix; const struct sockaddr_in6 *msk = (const struct sockaddr_in6 *)mask; struct llentry *lle, *next; register int i; for (i=0; i < LLTBL_HASHTBL_SIZE; i++) { LIST_FOREACH_SAFE(lle, &llt->lle_head[i], lle_next, next) { if (IN6_ARE_MASKED_ADDR_EQUAL( &((struct sockaddr_in6 *)L3_ADDR(lle))->sin6_addr, &pfx->sin6_addr, &msk->sin6_addr)) { callout_drain(&lle->la_timer); LLE_WLOCK(lle); llentry_free(lle); } } } } static int in6_lltable_rtcheck(struct ifnet *ifp, const struct sockaddr *l3addr) { struct rtentry *rt; char ip6buf[INET6_ADDRSTRLEN]; KASSERT(l3addr->sa_family == AF_INET6, ("sin_family %d", l3addr->sa_family)); /* XXX rtalloc1 should take a const param */ rt = rtalloc1(__DECONST(struct sockaddr *, l3addr), 0, 0); if (rt == NULL || (rt->rt_flags & RTF_GATEWAY) || rt->rt_ifp != ifp) { struct ifaddr *ifa; /* * Create an ND6 cache for an IPv6 neighbor * that is not covered by our own prefix. */ /* XXX ifaof_ifpforaddr should take a const param */ ifa = ifaof_ifpforaddr(__DECONST(struct sockaddr *, l3addr), ifp); if (ifa != NULL) { if (rt != NULL) RTFREE_LOCKED(rt); return 0; } log(LOG_INFO, "IPv6 address: \"%s\" is not on the network\n", ip6_sprintf(ip6buf, &((const struct sockaddr_in6 *)l3addr)->sin6_addr)); if (rt != NULL) RTFREE_LOCKED(rt); return EINVAL; } RTFREE_LOCKED(rt); return 0; } static struct llentry * in6_lltable_lookup(struct lltable *llt, u_int flags, const struct sockaddr *l3addr) { const struct sockaddr_in6 *sin6 = (const struct sockaddr_in6 *)l3addr; struct ifnet *ifp = llt->llt_ifp; struct llentry *lle; struct llentries *lleh; u_int hashkey; IF_AFDATA_LOCK_ASSERT(ifp); KASSERT(l3addr->sa_family == AF_INET6, ("sin_family %d", l3addr->sa_family)); hashkey = sin6->sin6_addr.s6_addr32[3]; lleh = &llt->lle_head[LLATBL_HASH(hashkey, LLTBL_HASHMASK)]; LIST_FOREACH(lle, lleh, lle_next) { struct sockaddr_in6 *sa6 = (struct sockaddr_in6 *)L3_ADDR(lle); if (lle->la_flags & LLE_DELETED) continue; if (bcmp(&sa6->sin6_addr, &sin6->sin6_addr, sizeof(struct in6_addr)) == 0) break; } if (lle == NULL) { if (!(flags & LLE_CREATE)) return (NULL); /* * A route that covers the given address must have * been installed 1st because we are doing a resolution, * verify this. 
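The lookup above hashes a neighbor solely on the low 32 bits of its IPv6 address (hashkey = sin6->sin6_addr.s6_addr32[3]), which is normally the most variable part of an on-link address. A toy illustration of that bucket choice, with a hypothetical power-of-two table size and portable byte access:

    #include <netinet/in.h>

    #define ND_BUCKETS  64  /* hypothetical; the kernel uses LLTBL_HASHTBL_SIZE */

    static unsigned int
    nd_bucket(const struct in6_addr *a)
    {
        unsigned int key;

        /* low 32 bits of the address, as in hashkey = s6_addr32[3] */
        key = ((unsigned int)a->s6_addr[12] << 24) |
            ((unsigned int)a->s6_addr[13] << 16) |
            ((unsigned int)a->s6_addr[14] << 8) |
            (unsigned int)a->s6_addr[15];
        return (key & (ND_BUCKETS - 1));
    }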
*/ if (!(flags & LLE_IFADDR) && in6_lltable_rtcheck(ifp, l3addr) != 0) return NULL; lle = in6_lltable_new(l3addr, flags); if (lle == NULL) { log(LOG_INFO, "lla_lookup: new lle malloc failed\n"); return NULL; } lle->la_flags = flags & ~LLE_CREATE; if ((flags & (LLE_CREATE | LLE_IFADDR)) == (LLE_CREATE | LLE_IFADDR)) { bcopy(IF_LLADDR(ifp), &lle->ll_addr, ifp->if_addrlen); lle->la_flags |= (LLE_VALID | LLE_STATIC); } lle->lle_tbl = llt; lle->lle_head = lleh; LIST_INSERT_HEAD(lleh, lle, lle_next); } else if (flags & LLE_DELETE) { if (!(lle->la_flags & LLE_IFADDR) || (flags & LLE_IFADDR)) { LLE_WLOCK(lle); lle->la_flags = LLE_DELETED; LLE_WUNLOCK(lle); #ifdef DIAGNOSTICS log(LOG_INFO, "ifaddr cache = %p is deleted\n", lle); #endif } lle = (void *)-1; } if (LLE_IS_VALID(lle)) { if (flags & LLE_EXCLUSIVE) LLE_WLOCK(lle); else LLE_RLOCK(lle); } return (lle); } static int in6_lltable_dump(struct lltable *llt, struct sysctl_req *wr) { struct ifnet *ifp = llt->llt_ifp; struct llentry *lle; /* XXX stack use */ struct { struct rt_msghdr rtm; struct sockaddr_in6 sin6; /* * ndp.c assumes that sdl is word aligned */ #ifdef __LP64__ uint32_t pad; #endif struct sockaddr_dl sdl; } ndpc; int i, error; /* XXXXX * current IFNET_RLOCK() is mapped to IFNET_WLOCK() * so it is okay to use this ASSERT, change it when * IFNET lock is finalized */ IFNET_WLOCK_ASSERT(); error = 0; for (i = 0; i < LLTBL_HASHTBL_SIZE; i++) { LIST_FOREACH(lle, &llt->lle_head[i], lle_next) { struct sockaddr_dl *sdl; /* skip deleted or invalid entries */ if ((lle->la_flags & (LLE_DELETED|LLE_VALID)) != LLE_VALID) continue; /* Skip if jailed and not a valid IP of the prison. */ if (prison_if(wr->td->td_ucred, L3_ADDR(lle)) != 0) continue; /* * produce a msg made of: * struct rt_msghdr; * struct sockaddr_in6 (IPv6) * struct sockaddr_dl; */ bzero(&ndpc, sizeof(ndpc)); ndpc.rtm.rtm_msglen = sizeof(ndpc); ndpc.rtm.rtm_version = RTM_VERSION; ndpc.rtm.rtm_type = RTM_GET; ndpc.rtm.rtm_flags = RTF_UP; ndpc.rtm.rtm_addrs = RTA_DST | RTA_GATEWAY; ndpc.sin6.sin6_family = AF_INET6; ndpc.sin6.sin6_len = sizeof(ndpc.sin6); bcopy(L3_ADDR(lle), &ndpc.sin6, L3_ADDR_LEN(lle)); /* publish */ if (lle->la_flags & LLE_PUB) ndpc.rtm.rtm_flags |= RTF_ANNOUNCE; sdl = &ndpc.sdl; sdl->sdl_family = AF_LINK; sdl->sdl_len = sizeof(*sdl); sdl->sdl_alen = ifp->if_addrlen; sdl->sdl_index = ifp->if_index; sdl->sdl_type = ifp->if_type; bcopy(&lle->ll_addr, LLADDR(sdl), ifp->if_addrlen); ndpc.rtm.rtm_rmx.rmx_expire = lle->la_flags & LLE_STATIC ? 
0 : lle->la_expire; ndpc.rtm.rtm_flags |= (RTF_HOST | RTF_LLDATA); if (lle->la_flags & LLE_STATIC) ndpc.rtm.rtm_flags |= RTF_STATIC; ndpc.rtm.rtm_index = ifp->if_index; error = SYSCTL_OUT(wr, &ndpc, sizeof(ndpc)); if (error) break; } } return error; } void * in6_domifattach(struct ifnet *ifp) { struct in6_ifextra *ext; ext = (struct in6_ifextra *)malloc(sizeof(*ext), M_IFADDR, M_WAITOK); bzero(ext, sizeof(*ext)); ext->in6_ifstat = (struct in6_ifstat *)malloc(sizeof(struct in6_ifstat), M_IFADDR, M_WAITOK); bzero(ext->in6_ifstat, sizeof(*ext->in6_ifstat)); ext->icmp6_ifstat = (struct icmp6_ifstat *)malloc(sizeof(struct icmp6_ifstat), M_IFADDR, M_WAITOK); bzero(ext->icmp6_ifstat, sizeof(*ext->icmp6_ifstat)); ext->nd_ifinfo = nd6_ifattach(ifp); ext->scope6_id = scope6_ifattach(ifp); ext->lltable = lltable_init(ifp, AF_INET6); if (ext->lltable != NULL) { ext->lltable->llt_new = in6_lltable_new; ext->lltable->llt_free = in6_lltable_free; ext->lltable->llt_prefix_free = in6_lltable_prefix_free; ext->lltable->llt_rtcheck = in6_lltable_rtcheck; ext->lltable->llt_lookup = in6_lltable_lookup; ext->lltable->llt_dump = in6_lltable_dump; } ext->mld_ifinfo = mld_domifattach(ifp); return ext; } void in6_domifdetach(struct ifnet *ifp, void *aux) { struct in6_ifextra *ext = (struct in6_ifextra *)aux; mld_domifdetach(ifp); scope6_ifdetach(ext->scope6_id); nd6_ifdetach(ext->nd_ifinfo); lltable_free(ext->lltable); free(ext->in6_ifstat, M_IFADDR); free(ext->icmp6_ifstat, M_IFADDR); free(ext, M_IFADDR); } /* * Convert sockaddr_in6 to sockaddr_in. Original sockaddr_in6 must be * v4 mapped addr or v4 compat addr */ void in6_sin6_2_sin(struct sockaddr_in *sin, struct sockaddr_in6 *sin6) { bzero(sin, sizeof(*sin)); sin->sin_len = sizeof(struct sockaddr_in); sin->sin_family = AF_INET; sin->sin_port = sin6->sin6_port; sin->sin_addr.s_addr = sin6->sin6_addr.s6_addr32[3]; } /* Convert sockaddr_in to sockaddr_in6 in v4 mapped addr format. */ void in6_sin_2_v4mapsin6(struct sockaddr_in *sin, struct sockaddr_in6 *sin6) { bzero(sin6, sizeof(*sin6)); sin6->sin6_len = sizeof(struct sockaddr_in6); sin6->sin6_family = AF_INET6; sin6->sin6_port = sin->sin_port; sin6->sin6_addr.s6_addr32[0] = 0; sin6->sin6_addr.s6_addr32[1] = 0; sin6->sin6_addr.s6_addr32[2] = IPV6_ADDR_INT32_SMP; sin6->sin6_addr.s6_addr32[3] = sin->sin_addr.s_addr; } /* Convert sockaddr_in6 into sockaddr_in. */ void in6_sin6_2_sin_in_sock(struct sockaddr *nam) { struct sockaddr_in *sin_p; struct sockaddr_in6 sin6; /* * Save original sockaddr_in6 addr and convert it * to sockaddr_in. */ sin6 = *(struct sockaddr_in6 *)nam; sin_p = (struct sockaddr_in *)nam; in6_sin6_2_sin(sin_p, &sin6); } /* Convert sockaddr_in into sockaddr_in6 in v4 mapped addr format. */ void in6_sin_2_v4mapsin6_in_sock(struct sockaddr **nam) { struct sockaddr_in *sin_p; struct sockaddr_in6 *sin6_p; sin6_p = malloc(sizeof *sin6_p, M_SONAME, M_WAITOK); sin_p = (struct sockaddr_in *)*nam; in6_sin_2_v4mapsin6(sin_p, sin6_p); free(*nam, M_SONAME); *nam = (struct sockaddr *)sin6_p; } Index: head/sys/netinet6/in6_ifattach.c =================================================================== --- head/sys/netinet6/in6_ifattach.c (revision 192894) +++ head/sys/netinet6/in6_ifattach.c (revision 192895) @@ -1,954 +1,975 @@ /*- * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the project nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $KAME: in6_ifattach.c,v 1.118 2001/05/24 07:44:00 itojun Exp $ */ #include __FBSDID("$FreeBSD$"); #include "opt_route.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef VIMAGE_GLOBALS unsigned long in6_maxmtu; int ip6_auto_linklocal; struct callout in6_tmpaddrtimer_ch; extern struct inpcbinfo ripcbinfo; #endif static int get_rand_ifid(struct ifnet *, struct in6_addr *); static int generate_tmp_ifid(u_int8_t *, const u_int8_t *, u_int8_t *); static int get_ifid(struct ifnet *, struct ifnet *, struct in6_addr *); static int in6_ifattach_linklocal(struct ifnet *, struct ifnet *); static int in6_ifattach_loopback(struct ifnet *); static void in6_purgemaddrs(struct ifnet *); #define EUI64_GBIT 0x01 #define EUI64_UBIT 0x02 #define EUI64_TO_IFID(in6) do {(in6)->s6_addr[8] ^= EUI64_UBIT; } while (0) #define EUI64_GROUP(in6) ((in6)->s6_addr[8] & EUI64_GBIT) #define EUI64_INDIVIDUAL(in6) (!EUI64_GROUP(in6)) #define EUI64_LOCAL(in6) ((in6)->s6_addr[8] & EUI64_UBIT) #define EUI64_UNIVERSAL(in6) (!EUI64_LOCAL(in6)) #define IFID_LOCAL(in6) (!EUI64_LOCAL(in6)) #define IFID_UNIVERSAL(in6) (!EUI64_UNIVERSAL(in6)) /* * Generate a last-resort interface identifier, when the machine has no * IEEE802/EUI64 address sources. * The goal here is to get an interface identifier that is * (1) random enough and (2) does not change across reboot. * We currently use MD5(hostname) for it. * * in6 - upper 64bits are preserved */ static int get_rand_ifid(struct ifnet *ifp, struct in6_addr *in6) { INIT_VPROCG(TD_TO_VPROCG(curthread)); /* XXX V_hostname needs this */ MD5_CTX ctxt; u_int8_t digest[16]; int hostnamelen; mtx_lock(&hostname_mtx); hostnamelen = strlen(V_hostname); #if 0 /* we need at least several letters as seed for ifid */ if (hostnamelen < 3) return -1; #endif /* generate 8 bytes of pseudo-random value. 
*/ bzero(&ctxt, sizeof(ctxt)); MD5Init(&ctxt); MD5Update(&ctxt, V_hostname, hostnamelen); mtx_unlock(&hostname_mtx); MD5Final(digest, &ctxt); /* assumes sizeof(digest) > sizeof(ifid) */ bcopy(digest, &in6->s6_addr[8], 8); /* make sure to set "u" bit to local, and "g" bit to individual. */ in6->s6_addr[8] &= ~EUI64_GBIT; /* g bit to "individual" */ in6->s6_addr[8] |= EUI64_UBIT; /* u bit to "local" */ /* convert EUI64 into IPv6 interface identifier */ EUI64_TO_IFID(in6); return 0; } static int generate_tmp_ifid(u_int8_t *seed0, const u_int8_t *seed1, u_int8_t *ret) { INIT_VNET_INET6(curvnet); MD5_CTX ctxt; u_int8_t seed[16], digest[16], nullbuf[8]; u_int32_t val32; /* If there's no history, start with a random seed. */ bzero(nullbuf, sizeof(nullbuf)); if (bcmp(nullbuf, seed0, sizeof(nullbuf)) == 0) { int i; for (i = 0; i < 2; i++) { val32 = arc4random(); bcopy(&val32, seed + sizeof(val32) * i, sizeof(val32)); } } else bcopy(seed0, seed, 8); /* copy the right-most 64-bits of the given address */ /* XXX assumption on the size of IFID */ bcopy(seed1, &seed[8], 8); if (0) { /* for debugging purposes only */ int i; printf("generate_tmp_ifid: new randomized ID from: "); for (i = 0; i < 16; i++) printf("%02x", seed[i]); printf(" "); } /* generate 16 bytes of pseudo-random value. */ bzero(&ctxt, sizeof(ctxt)); MD5Init(&ctxt); MD5Update(&ctxt, seed, sizeof(seed)); MD5Final(digest, &ctxt); /* * RFC 3041 3.2.1. (3) * Take the left-most 64-bits of the MD5 digest and set bit 6 (the * left-most bit is numbered 0) to zero. */ bcopy(digest, ret, 8); ret[0] &= ~EUI64_UBIT; /* * XXX: we'd like to ensure that the generated value is not zero * for simplicity. If the caclculated digest happens to be zero, * use a random non-zero value as the last resort. */ if (bcmp(nullbuf, ret, sizeof(nullbuf)) == 0) { nd6log((LOG_INFO, "generate_tmp_ifid: computed MD5 value is zero.\n")); val32 = arc4random(); val32 = 1 + (val32 % (0xffffffff - 1)); } /* * RFC 3041 3.2.1. (4) * Take the rightmost 64-bits of the MD5 digest and save them in * stable storage as the history value to be used in the next * iteration of the algorithm. */ bcopy(&digest[8], seed0, 8); if (0) { /* for debugging purposes only */ int i; printf("to: "); for (i = 0; i < 16; i++) printf("%02x", digest[i]); printf("\n"); } return 0; } /* * Get interface identifier for the specified interface. * XXX assumes single sockaddr_dl (AF_LINK address) per an interface * * in6 - upper 64bits are preserved */ int in6_get_hw_ifid(struct ifnet *ifp, struct in6_addr *in6) { struct ifaddr *ifa; struct sockaddr_dl *sdl; u_int8_t *addr; size_t addrlen; static u_int8_t allzero[8] = { 0, 0, 0, 0, 0, 0, 0, 0 }; static u_int8_t allone[8] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff }; IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifa, &ifp->if_addrhead, ifa_link) { if (ifa->ifa_addr->sa_family != AF_LINK) continue; sdl = (struct sockaddr_dl *)ifa->ifa_addr; if (sdl == NULL) continue; if (sdl->sdl_alen == 0) continue; goto found; } IF_ADDR_UNLOCK(ifp); return -1; found: addr = LLADDR(sdl); addrlen = sdl->sdl_alen; /* get EUI64 */ switch (ifp->if_type) { case IFT_ETHER: case IFT_FDDI: case IFT_ISO88025: case IFT_ATM: case IFT_IEEE1394: #ifdef IFT_IEEE80211 case IFT_IEEE80211: #endif /* IEEE802/EUI64 cases - what others? 
*/ /* IEEE1394 uses 16byte length address starting with EUI64 */ if (addrlen > 8) addrlen = 8; /* look at IEEE802/EUI64 only */ if (addrlen != 8 && addrlen != 6) { IF_ADDR_UNLOCK(ifp); return -1; } /* * check for invalid MAC address - on bsdi, we see it a lot * since wildboar configures all-zero MAC on pccard before * card insertion. */ if (bcmp(addr, allzero, addrlen) == 0) { IF_ADDR_UNLOCK(ifp); return -1; } if (bcmp(addr, allone, addrlen) == 0) { IF_ADDR_UNLOCK(ifp); return -1; } /* make EUI64 address */ if (addrlen == 8) bcopy(addr, &in6->s6_addr[8], 8); else if (addrlen == 6) { in6->s6_addr[8] = addr[0]; in6->s6_addr[9] = addr[1]; in6->s6_addr[10] = addr[2]; in6->s6_addr[11] = 0xff; in6->s6_addr[12] = 0xfe; in6->s6_addr[13] = addr[3]; in6->s6_addr[14] = addr[4]; in6->s6_addr[15] = addr[5]; } break; case IFT_ARCNET: if (addrlen != 1) { IF_ADDR_UNLOCK(ifp); return -1; } if (!addr[0]) { IF_ADDR_UNLOCK(ifp); return -1; } bzero(&in6->s6_addr[8], 8); in6->s6_addr[15] = addr[0]; /* * due to insufficient bitwidth, we mark it local. */ in6->s6_addr[8] &= ~EUI64_GBIT; /* g bit to "individual" */ in6->s6_addr[8] |= EUI64_UBIT; /* u bit to "local" */ break; case IFT_GIF: #ifdef IFT_STF case IFT_STF: #endif /* * RFC2893 says: "SHOULD use IPv4 address as ifid source". * however, IPv4 address is not very suitable as unique * identifier source (can be renumbered). * we don't do this. */ IF_ADDR_UNLOCK(ifp); return -1; default: IF_ADDR_UNLOCK(ifp); return -1; } /* sanity check: g bit must not indicate "group" */ if (EUI64_GROUP(in6)) { IF_ADDR_UNLOCK(ifp); return -1; } /* convert EUI64 into IPv6 interface identifier */ EUI64_TO_IFID(in6); /* * sanity check: ifid must not be all zero, avoid conflict with * subnet router anycast */ if ((in6->s6_addr[8] & ~(EUI64_GBIT | EUI64_UBIT)) == 0x00 && bcmp(&in6->s6_addr[9], allzero, 7) == 0) { IF_ADDR_UNLOCK(ifp); return -1; } IF_ADDR_UNLOCK(ifp); return 0; } /* * Get interface identifier for the specified interface. If it is not * available on ifp0, borrow interface identifier from other information * sources. * * altifp - secondary EUI64 source */ static int get_ifid(struct ifnet *ifp0, struct ifnet *altifp, struct in6_addr *in6) { INIT_VNET_NET(ifp0->if_vnet); INIT_VNET_INET6(ifp0->if_vnet); struct ifnet *ifp; /* first, try to get it from the interface itself */ if (in6_get_hw_ifid(ifp0, in6) == 0) { nd6log((LOG_DEBUG, "%s: got interface identifier from itself\n", if_name(ifp0))); goto success; } /* try secondary EUI64 source. 
this basically is for ATM PVC */ if (altifp && in6_get_hw_ifid(altifp, in6) == 0) { nd6log((LOG_DEBUG, "%s: got interface identifier from %s\n", if_name(ifp0), if_name(altifp))); goto success; } /* next, try to get it from some other hardware interface */ IFNET_RLOCK(); for (ifp = V_ifnet.tqh_first; ifp; ifp = ifp->if_list.tqe_next) { if (ifp == ifp0) continue; if (in6_get_hw_ifid(ifp, in6) != 0) continue; /* * to borrow ifid from other interface, ifid needs to be * globally unique */ if (IFID_UNIVERSAL(in6)) { nd6log((LOG_DEBUG, "%s: borrow interface identifier from %s\n", if_name(ifp0), if_name(ifp))); IFNET_RUNLOCK(); goto success; } } IFNET_RUNLOCK(); /* last resort: get from random number source */ if (get_rand_ifid(ifp, in6) == 0) { nd6log((LOG_DEBUG, "%s: interface identifier generated by random number\n", if_name(ifp0))); goto success; } printf("%s: failed to get interface identifier\n", if_name(ifp0)); return -1; success: nd6log((LOG_INFO, "%s: ifid: %02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x\n", if_name(ifp0), in6->s6_addr[8], in6->s6_addr[9], in6->s6_addr[10], in6->s6_addr[11], in6->s6_addr[12], in6->s6_addr[13], in6->s6_addr[14], in6->s6_addr[15])); return 0; } /* * altifp - secondary EUI64 source */ static int in6_ifattach_linklocal(struct ifnet *ifp, struct ifnet *altifp) { INIT_VNET_INET6(curvnet); struct in6_ifaddr *ia; struct in6_aliasreq ifra; struct nd_prefixctl pr0; int i, error; /* * configure link-local address. */ bzero(&ifra, sizeof(ifra)); /* * in6_update_ifa() does not use ifra_name, but we accurately set it * for safety. */ strncpy(ifra.ifra_name, if_name(ifp), sizeof(ifra.ifra_name)); ifra.ifra_addr.sin6_family = AF_INET6; ifra.ifra_addr.sin6_len = sizeof(struct sockaddr_in6); ifra.ifra_addr.sin6_addr.s6_addr32[0] = htonl(0xfe800000); ifra.ifra_addr.sin6_addr.s6_addr32[1] = 0; if ((ifp->if_flags & IFF_LOOPBACK) != 0) { ifra.ifra_addr.sin6_addr.s6_addr32[2] = 0; ifra.ifra_addr.sin6_addr.s6_addr32[3] = htonl(1); } else { if (get_ifid(ifp, altifp, &ifra.ifra_addr.sin6_addr) != 0) { nd6log((LOG_ERR, "%s: no ifid available\n", if_name(ifp))); return (-1); } } if (in6_setscope(&ifra.ifra_addr.sin6_addr, ifp, NULL)) return (-1); ifra.ifra_prefixmask.sin6_len = sizeof(struct sockaddr_in6); ifra.ifra_prefixmask.sin6_family = AF_INET6; ifra.ifra_prefixmask.sin6_addr = in6mask64; /* link-local addresses should NEVER expire. */ ifra.ifra_lifetime.ia6t_vltime = ND6_INFINITE_LIFETIME; ifra.ifra_lifetime.ia6t_pltime = ND6_INFINITE_LIFETIME; /* * Now call in6_update_ifa() to do a bunch of procedures to configure * a link-local address. We can set the 3rd argument to NULL, because * we know there's no other link-local address on the interface * and therefore we are adding one (instead of updating one). */ if ((error = in6_update_ifa(ifp, &ifra, NULL, IN6_IFAUPDATE_DADDELAY)) != 0) { /* * XXX: When the interface does not support IPv6, this call * would fail in the SIOCSIFADDR ioctl. I believe the * notification is rather confusing in this case, so just * suppress it. (jinmei@kame.net 20010130) */ if (error != EAFNOSUPPORT) nd6log((LOG_NOTICE, "in6_ifattach_linklocal: failed to " "configure a link-local address on %s " "(errno=%d)\n", if_name(ifp), error)); return (-1); } ia = in6ifa_ifpforlinklocal(ifp, 0); /* ia must not be NULL */ #ifdef DIAGNOSTIC if (!ia) { panic("ia == NULL in in6_ifattach_linklocal"); /* NOTREACHED */ } #endif /* * Make the link-local prefix (fe80::%link/64) as on-link. 
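For the common Ethernet case, in6_get_hw_ifid() above builds the interface identifier from a 48-bit MAC by splitting it around ff:fe, and EUI64_TO_IFID() inverts the universal/local bit; in6_ifattach_linklocal() then prepends the fe80::/64 prefix. A standalone worked example of the same derivation, with a MAC chosen purely for illustration:

    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        unsigned char mac[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };
        struct in6_addr ll;
        char buf[INET6_ADDRSTRLEN];

        memset(&ll, 0, sizeof(ll));
        ll.s6_addr[0] = 0xfe;       /* fe80::/64 link-local prefix */
        ll.s6_addr[1] = 0x80;

        /* modified EUI-64: insert ff:fe, invert the universal/local bit */
        ll.s6_addr[8] = mac[0] ^ 0x02;
        ll.s6_addr[9] = mac[1];
        ll.s6_addr[10] = mac[2];
        ll.s6_addr[11] = 0xff;
        ll.s6_addr[12] = 0xfe;
        ll.s6_addr[13] = mac[3];
        ll.s6_addr[14] = mac[4];
        ll.s6_addr[15] = mac[5];

        /* prints fe80::211:22ff:fe33:4455 for 00:11:22:33:44:55 */
        printf("%s\n", inet_ntop(AF_INET6, &ll, buf, sizeof(buf)));
        return (0);
    }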
* Since we'd like to manage prefixes separately from addresses, * we make an ND6 prefix structure for the link-local prefix, * and add it to the prefix list as a never-expire prefix. * XXX: this change might affect some existing code base... */ bzero(&pr0, sizeof(pr0)); pr0.ndpr_ifp = ifp; /* this should be 64 at this moment. */ pr0.ndpr_plen = in6_mask2len(&ifra.ifra_prefixmask.sin6_addr, NULL); pr0.ndpr_prefix = ifra.ifra_addr; /* apply the mask for safety. (nd6_prelist_add will apply it again) */ for (i = 0; i < 4; i++) { pr0.ndpr_prefix.sin6_addr.s6_addr32[i] &= in6mask64.s6_addr32[i]; } /* * Initialize parameters. The link-local prefix must always be * on-link, and its lifetimes never expire. */ pr0.ndpr_raf_onlink = 1; pr0.ndpr_raf_auto = 1; /* probably meaningless */ pr0.ndpr_vltime = ND6_INFINITE_LIFETIME; pr0.ndpr_pltime = ND6_INFINITE_LIFETIME; /* * Since there is no other link-local addresses, nd6_prefix_lookup() * probably returns NULL. However, we cannot always expect the result. * For example, if we first remove the (only) existing link-local * address, and then reconfigure another one, the prefix is still * valid with referring to the old link-local address. */ if (nd6_prefix_lookup(&pr0) == NULL) { if ((error = nd6_prelist_add(&pr0, NULL, NULL)) != 0) return (error); } return 0; } /* * ifp - must be IFT_LOOP */ static int in6_ifattach_loopback(struct ifnet *ifp) { INIT_VNET_INET6(curvnet); struct in6_aliasreq ifra; int error; bzero(&ifra, sizeof(ifra)); /* * in6_update_ifa() does not use ifra_name, but we accurately set it * for safety. */ strncpy(ifra.ifra_name, if_name(ifp), sizeof(ifra.ifra_name)); ifra.ifra_prefixmask.sin6_len = sizeof(struct sockaddr_in6); ifra.ifra_prefixmask.sin6_family = AF_INET6; ifra.ifra_prefixmask.sin6_addr = in6mask128; /* * Always initialize ia_dstaddr (= broadcast address) to loopback * address. Follows IPv4 practice - see in_ifinit(). */ ifra.ifra_dstaddr.sin6_len = sizeof(struct sockaddr_in6); ifra.ifra_dstaddr.sin6_family = AF_INET6; ifra.ifra_dstaddr.sin6_addr = in6addr_loopback; ifra.ifra_addr.sin6_len = sizeof(struct sockaddr_in6); ifra.ifra_addr.sin6_family = AF_INET6; ifra.ifra_addr.sin6_addr = in6addr_loopback; /* the loopback address should NEVER expire. */ ifra.ifra_lifetime.ia6t_vltime = ND6_INFINITE_LIFETIME; ifra.ifra_lifetime.ia6t_pltime = ND6_INFINITE_LIFETIME; /* we don't need to perform DAD on loopback interfaces. */ ifra.ifra_flags |= IN6_IFF_NODAD; /* skip registration to the prefix list. XXX should be temporary. */ ifra.ifra_flags |= IN6_IFF_NOPFX; /* * We are sure that this is a newly assigned address, so we can set * NULL to the 3rd arg. */ if ((error = in6_update_ifa(ifp, &ifra, NULL, 0)) != 0) { nd6log((LOG_ERR, "in6_ifattach_loopback: failed to configure " "the loopback address on %s (errno=%d)\n", if_name(ifp), error)); return (-1); } return 0; } /* * compute NI group address, based on the current hostname setting. * see draft-ietf-ipngwg-icmp-name-lookup-* (04 and later). * * when ifp == NULL, the caller is responsible for filling scopeid. */ int in6_nigroup(struct ifnet *ifp, const char *name, int namelen, struct in6_addr *in6) { + INIT_VPROCG(TD_TO_VPROCG(curthread)); /* XXX V_hostname needs this */ const char *p; u_char *q; MD5_CTX ctxt; + int use_hostname; u_int8_t digest[16]; char l; char n[64]; /* a single label must not exceed 63 chars */ - if (!namelen || !name) + /* + * If no name is given and namelen is -1, + * we try to do the hostname lookup ourselves. 
+ */ + if (!name && namelen == -1) { + use_hostname = 1; + mtx_lock(&hostname_mtx); + name = V_hostname; + namelen = strlen(name); + } else + use_hostname = 0; + if (!name || !namelen) { + if (use_hostname) + mtx_unlock(&hostname_mtx); return -1; + } p = name; while (p && *p && *p != '.' && p - name < namelen) p++; - if (p - name > sizeof(n) - 1) + if (p == name || p - name > sizeof(n) - 1) { + if (use_hostname) + mtx_unlock(&hostname_mtx); return -1; /* label too long */ + } l = p - name; strncpy(n, name, l); + if (use_hostname) + mtx_unlock(&hostname_mtx); n[(int)l] = '\0'; for (q = n; *q; q++) { if ('A' <= *q && *q <= 'Z') *q = *q - 'A' + 'a'; } /* generate 8 bytes of pseudo-random value. */ bzero(&ctxt, sizeof(ctxt)); MD5Init(&ctxt); MD5Update(&ctxt, &l, sizeof(l)); MD5Update(&ctxt, n, l); MD5Final(digest, &ctxt); bzero(in6, sizeof(*in6)); in6->s6_addr16[0] = IPV6_ADDR_INT16_MLL; in6->s6_addr8[11] = 2; bcopy(digest, &in6->s6_addr32[3], sizeof(in6->s6_addr32[3])); if (in6_setscope(in6, ifp, NULL)) return (-1); /* XXX: should not fail */ return 0; } /* * XXX multiple loopback interface needs more care. for instance, * nodelocal address needs to be configured onto only one of them. * XXX multiple link-local address case * * altifp - secondary EUI64 source */ void in6_ifattach(struct ifnet *ifp, struct ifnet *altifp) { INIT_VNET_INET6(ifp->if_vnet); struct in6_ifaddr *ia; struct in6_addr in6; /* some of the interfaces are inherently not IPv6 capable */ switch (ifp->if_type) { case IFT_PFLOG: case IFT_PFSYNC: case IFT_CARP: return; } /* * quirks based on interface type */ switch (ifp->if_type) { #ifdef IFT_STF case IFT_STF: /* * 6to4 interface is a very special kind of beast. * no multicast, no linklocal. RFC2529 specifies how to make * linklocals for 6to4 interface, but there's no use and * it is rather harmful to have one. */ goto statinit; #endif default: break; } /* * usually, we require multicast capability to the interface */ if ((ifp->if_flags & IFF_MULTICAST) == 0) { nd6log((LOG_INFO, "in6_ifattach: " "%s is not multicast capable, IPv6 not enabled\n", if_name(ifp))); return; } /* * assign loopback address for loopback interface. * XXX multiple loopback interface case. */ if ((ifp->if_flags & IFF_LOOPBACK) != 0) { in6 = in6addr_loopback; if (in6ifa_ifpwithaddr(ifp, &in6) == NULL) { if (in6_ifattach_loopback(ifp) != 0) return; } } /* * assign a link-local address, if there's none. */ if (V_ip6_auto_linklocal && ifp->if_type != IFT_BRIDGE) { ia = in6ifa_ifpforlinklocal(ifp, 0); if (ia == NULL) { if (in6_ifattach_linklocal(ifp, altifp) == 0) { /* linklocal address assigned */ } else { /* failed to assign linklocal address. bark? */ } } } #ifdef IFT_STF /* XXX */ statinit: #endif /* update dynamically. */ if (V_in6_maxmtu < ifp->if_mtu) V_in6_maxmtu = ifp->if_mtu; } /* * NOTE: in6_ifdetach() does not support loopback if at this moment. * We don't need this function in bsdi, because interfaces are never removed * from the ifnet list in bsdi. 
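in6_nigroup() above computes the node-information group address: it takes the first label of the given name, lowercases it, hashes a length byte followed by the label with MD5, and places the first 32 bits of the digest into ff02:0:0:0:0:2::/96 (the new code in this commit also lets it fall back to the hostname when called with name == NULL and namelen == -1). A hedged userland sketch of the same construction, assuming FreeBSD's libmd MD5 interface (<md5.h>, link with -lmd); the hostname is an arbitrary example:

    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <md5.h>
    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        const char *name = "host.example.com";  /* illustrative */
        char label[64], buf[INET6_ADDRSTRLEN];
        unsigned char digest[16], l;
        struct in6_addr grp;
        MD5_CTX ctx;
        size_t i, n;

        /* first label only, lowercased, at most 63 characters */
        n = strcspn(name, ".");
        if (n == 0 || n > sizeof(label) - 1)
            return (1);
        for (i = 0; i < n; i++)
            label[i] = tolower((unsigned char)name[i]);
        label[n] = '\0';
        l = (unsigned char)n;

        MD5Init(&ctx);
        MD5Update(&ctx, &l, sizeof(l));     /* length byte, as above */
        MD5Update(&ctx, label, l);
        MD5Final(digest, &ctx);

        memset(&grp, 0, sizeof(grp));
        grp.s6_addr[0] = 0xff;              /* link-local multicast ff02:: */
        grp.s6_addr[1] = 0x02;
        grp.s6_addr[11] = 2;                /* NI group prefix ..:2:xxxx:xxxx */
        memcpy(&grp.s6_addr[12], digest, 4);

        printf("%s\n", inet_ntop(AF_INET6, &grp, buf, sizeof(buf)));
        return (0);
    }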
*/ void in6_ifdetach(struct ifnet *ifp) { INIT_VNET_NET(ifp->if_vnet); INIT_VNET_INET(ifp->if_vnet); INIT_VNET_INET6(ifp->if_vnet); struct in6_ifaddr *ia, *oia; struct ifaddr *ifa, *next; struct rtentry *rt; short rtflags; struct sockaddr_in6 sin6; struct in6_multi_mship *imm; /* remove neighbor management table */ nd6_purge(ifp); /* nuke any of IPv6 addresses we have */ TAILQ_FOREACH_SAFE(ifa, &ifp->if_addrhead, ifa_link, next) { if (ifa->ifa_addr->sa_family != AF_INET6) continue; in6_purgeaddr(ifa); } /* undo everything done by in6_ifattach(), just in case */ TAILQ_FOREACH_SAFE(ifa, &ifp->if_addrhead, ifa_link, next) { if (ifa->ifa_addr->sa_family != AF_INET6 || !IN6_IS_ADDR_LINKLOCAL(&satosin6(&ifa->ifa_addr)->sin6_addr)) { continue; } ia = (struct in6_ifaddr *)ifa; /* * leave from multicast groups we have joined for the interface */ while ((imm = ia->ia6_memberships.lh_first) != NULL) { LIST_REMOVE(imm, i6mm_chain); in6_leavegroup(imm); } /* remove from the routing table */ if ((ia->ia_flags & IFA_ROUTE) && (rt = rtalloc1((struct sockaddr *)&ia->ia_addr, 0, 0UL))) { rtflags = rt->rt_flags; RTFREE_LOCKED(rt); rtrequest(RTM_DELETE, (struct sockaddr *)&ia->ia_addr, (struct sockaddr *)&ia->ia_addr, (struct sockaddr *)&ia->ia_prefixmask, rtflags, (struct rtentry **)0); } /* remove from the linked list */ IF_ADDR_LOCK(ifp); TAILQ_REMOVE(&ifp->if_addrhead, (struct ifaddr *)ia, ifa_link); IF_ADDR_UNLOCK(ifp); IFAFREE(&ia->ia_ifa); /* also remove from the IPv6 address chain(itojun&jinmei) */ oia = ia; if (oia == (ia = V_in6_ifaddr)) V_in6_ifaddr = ia->ia_next; else { while (ia->ia_next && (ia->ia_next != oia)) ia = ia->ia_next; if (ia->ia_next) ia->ia_next = oia->ia_next; else { nd6log((LOG_ERR, "%s: didn't unlink in6ifaddr from list\n", if_name(ifp))); } } IFAFREE(&oia->ia_ifa); } in6_pcbpurgeif0(&V_udbinfo, ifp); in6_pcbpurgeif0(&V_ripcbinfo, ifp); /* leave from all multicast groups joined */ in6_purgemaddrs(ifp); /* * remove neighbor management table. we call it twice just to make * sure we nuke everything. maybe we need just one call. * XXX: since the first call did not release addresses, some prefixes * might remain. We should call nd6_purge() again to release the * prefixes after removing all addresses above. * (Or can we just delay calling nd6_purge until at this point?) */ nd6_purge(ifp); /* remove route to link-local allnodes multicast (ff02::1) */ bzero(&sin6, sizeof(sin6)); sin6.sin6_len = sizeof(struct sockaddr_in6); sin6.sin6_family = AF_INET6; sin6.sin6_addr = in6addr_linklocal_allnodes; if (in6_setscope(&sin6.sin6_addr, ifp, NULL)) /* XXX: should not fail */ return; /* XXX grab lock first to avoid LOR */ if (V_rt_tables[0][AF_INET6] != NULL) { RADIX_NODE_HEAD_LOCK(V_rt_tables[0][AF_INET6]); rt = rtalloc1((struct sockaddr *)&sin6, 0, RTF_RNH_LOCKED); if (rt) { if (rt->rt_ifp == ifp) rtexpunge(rt); RTFREE_LOCKED(rt); } RADIX_NODE_HEAD_UNLOCK(V_rt_tables[0][AF_INET6]); } } int in6_get_tmpifid(struct ifnet *ifp, u_int8_t *retbuf, const u_int8_t *baseid, int generate) { u_int8_t nullbuf[8]; struct nd_ifinfo *ndi = ND_IFINFO(ifp); bzero(nullbuf, sizeof(nullbuf)); if (bcmp(ndi->randomid, nullbuf, sizeof(nullbuf)) == 0) { /* we've never created a random ID. Create a new one. 
*/ generate = 1; } if (generate) { bcopy(baseid, ndi->randomseed1, sizeof(ndi->randomseed1)); /* generate_tmp_ifid will update seedn and buf */ (void)generate_tmp_ifid(ndi->randomseed0, ndi->randomseed1, ndi->randomid); } bcopy(ndi->randomid, retbuf, 8); return (0); } void in6_tmpaddrtimer(void *arg) { CURVNET_SET((struct vnet *) arg); INIT_VNET_NET(curvnet); INIT_VNET_INET6(curvnet); struct nd_ifinfo *ndi; u_int8_t nullbuf[8]; struct ifnet *ifp; callout_reset(&V_in6_tmpaddrtimer_ch, (V_ip6_temp_preferred_lifetime - V_ip6_desync_factor - V_ip6_temp_regen_advance) * hz, in6_tmpaddrtimer, curvnet); bzero(nullbuf, sizeof(nullbuf)); for (ifp = TAILQ_FIRST(&V_ifnet); ifp; ifp = TAILQ_NEXT(ifp, if_list)) { ndi = ND_IFINFO(ifp); if (bcmp(ndi->randomid, nullbuf, sizeof(nullbuf)) != 0) { /* * We've been generating a random ID on this interface. * Create a new one. */ (void)generate_tmp_ifid(ndi->randomseed0, ndi->randomseed1, ndi->randomid); } } CURVNET_RESTORE(); } static void in6_purgemaddrs(struct ifnet *ifp) { LIST_HEAD(,in6_multi) purgeinms; struct in6_multi *inm, *tinm; struct ifmultiaddr *ifma; LIST_INIT(&purgeinms); IN6_MULTI_LOCK(); /* * Extract list of in6_multi associated with the detaching ifp * which the PF_INET6 layer is about to release. * We need to do this as IF_ADDR_LOCK() may be re-acquired * by code further down. */ IF_ADDR_LOCK(ifp); TAILQ_FOREACH(ifma, &ifp->if_multiaddrs, ifma_link) { if (ifma->ifma_addr->sa_family != AF_INET6 || ifma->ifma_protospec == NULL) continue; inm = (struct in6_multi *)ifma->ifma_protospec; LIST_INSERT_HEAD(&purgeinms, inm, in6m_entry); } IF_ADDR_UNLOCK(ifp); LIST_FOREACH_SAFE(inm, &purgeinms, in6m_entry, tinm) { LIST_REMOVE(inm, in6m_entry); in6m_release_locked(inm); } mld_ifdetach(ifp); IN6_MULTI_UNLOCK(); } Index: head/sys/netinet6/in6_pcb.c =================================================================== --- head/sys/netinet6/in6_pcb.c (revision 192894) +++ head/sys/netinet6/in6_pcb.c (revision 192895) @@ -1,932 +1,934 @@ /*- * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the project nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE PROJECT OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $KAME: in6_pcb.c,v 1.31 2001/05/21 05:45:10 jinmei Exp $ */ /*- * Copyright (c) 1982, 1986, 1991, 1993 * The Regents of the University of California. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)in_pcb.c 8.2 (Berkeley) 1/4/94 */ #include __FBSDID("$FreeBSD$"); #include "opt_inet.h" #include "opt_inet6.h" #include "opt_ipsec.h" #include "opt_mac.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct in6_addr zeroin6_addr; int in6_pcbbind(register struct inpcb *inp, struct sockaddr *nam, struct ucred *cred) { INIT_VNET_INET6(inp->inp_vnet); INIT_VNET_INET(inp->inp_vnet); struct socket *so = inp->inp_socket; struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)NULL; struct inpcbinfo *pcbinfo = inp->inp_pcbinfo; u_short lport = 0; int error, wild = 0, reuseport = (so->so_options & SO_REUSEPORT); INP_INFO_WLOCK_ASSERT(pcbinfo); INP_WLOCK_ASSERT(inp); if (!V_in6_ifaddr) /* XXX broken! */ return (EADDRNOTAVAIL); if (inp->inp_lport || !IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) return (EINVAL); if ((so->so_options & (SO_REUSEADDR|SO_REUSEPORT)) == 0) wild = INPLOOKUP_WILDCARD; if (nam == NULL) { if ((error = prison_local_ip6(cred, &inp->in6p_laddr, ((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0))) != 0) return (error); } else { sin6 = (struct sockaddr_in6 *)nam; if (nam->sa_len != sizeof(*sin6)) return (EINVAL); /* * family check. 
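The bind path that follows repeatedly consults IN6P_IPV6_V6ONLY: when the flag is clear, binding the unspecified IPv6 address also has to check for conflicting IPv4 bindings (the in6_sin6_2_sin() fallback lookups further down in this function), and the jail checks are told whether v4-mapped use is allowed. Applications control that flag per socket with the IPV6_V6ONLY option; a minimal, self-contained illustration:

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int
    make_v6only_socket(void)
    {
        int s, on = 1;

        if ((s = socket(AF_INET6, SOCK_STREAM, 0)) == -1)
            return (-1);
        /* refuse IPv4-mapped traffic; bind conflicts stay IPv6-only */
        if (setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &on, sizeof(on)) == -1) {
            close(s);
            return (-1);
        }
        return (s);
    }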
*/ if (nam->sa_family != AF_INET6) return (EAFNOSUPPORT); if ((error = sa6_embedscope(sin6, V_ip6_use_defzone)) != 0) return(error); if ((error = prison_local_ip6(cred, &sin6->sin6_addr, ((inp->inp_flags & IN6P_IPV6_V6ONLY) != 0))) != 0) return (error); lport = sin6->sin6_port; if (IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr)) { /* * Treat SO_REUSEADDR as SO_REUSEPORT for multicast; * allow compepte duplication of binding if * SO_REUSEPORT is set, or if SO_REUSEADDR is set * and a multicast address is bound on both * new and duplicated sockets. */ if (so->so_options & SO_REUSEADDR) reuseport = SO_REUSEADDR|SO_REUSEPORT; } else if (!IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) { struct ifaddr *ia = NULL; sin6->sin6_port = 0; /* yech... */ if ((ia = ifa_ifwithaddr((struct sockaddr *)sin6)) == 0) return (EADDRNOTAVAIL); /* * XXX: bind to an anycast address might accidentally * cause sending a packet with anycast source address. * We should allow to bind to a deprecated address, since * the application dares to use it. */ if (ia && ((struct in6_ifaddr *)ia)->ia6_flags & (IN6_IFF_ANYCAST|IN6_IFF_NOTREADY|IN6_IFF_DETACHED)) { return (EADDRNOTAVAIL); } } if (lport) { struct inpcb *t; /* GROSS */ if (ntohs(lport) <= V_ipport_reservedhigh && ntohs(lport) >= V_ipport_reservedlow && priv_check_cred(cred, PRIV_NETINET_RESERVEDPORT, 0)) return (EACCES); if (!IN6_IS_ADDR_MULTICAST(&sin6->sin6_addr) && priv_check_cred(inp->inp_cred, PRIV_NETINET_REUSEPORT, 0) != 0) { t = in6_pcblookup_local(pcbinfo, &sin6->sin6_addr, lport, INPLOOKUP_WILDCARD, cred); if (t && ((t->inp_flags & INP_TIMEWAIT) == 0) && (so->so_type != SOCK_STREAM || IN6_IS_ADDR_UNSPECIFIED(&t->in6p_faddr)) && (!IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr) || !IN6_IS_ADDR_UNSPECIFIED(&t->in6p_laddr) || (t->inp_socket->so_options & SO_REUSEPORT) == 0) && (inp->inp_cred->cr_uid != t->inp_cred->cr_uid)) return (EADDRINUSE); if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0 && IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) { struct sockaddr_in sin; in6_sin6_2_sin(&sin, sin6); t = in_pcblookup_local(pcbinfo, sin.sin_addr, lport, INPLOOKUP_WILDCARD, cred); if (t && ((t->inp_flags & INP_TIMEWAIT) == 0) && (so->so_type != SOCK_STREAM || ntohl(t->inp_faddr.s_addr) == INADDR_ANY) && (inp->inp_cred->cr_uid != t->inp_cred->cr_uid)) return (EADDRINUSE); } } t = in6_pcblookup_local(pcbinfo, &sin6->sin6_addr, lport, wild, cred); if (t && (reuseport & ((t->inp_flags & INP_TIMEWAIT) ? 
intotw(t)->tw_so_options : t->inp_socket->so_options)) == 0) return (EADDRINUSE); if ((inp->inp_flags & IN6P_IPV6_V6ONLY) == 0 && IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) { struct sockaddr_in sin; in6_sin6_2_sin(&sin, sin6); t = in_pcblookup_local(pcbinfo, sin.sin_addr, lport, wild, cred); if (t && t->inp_flags & INP_TIMEWAIT) { if ((reuseport & intotw(t)->tw_so_options) == 0 && (ntohl(t->inp_laddr.s_addr) != INADDR_ANY || ((inp->inp_vflag & INP_IPV6PROTO) == (t->inp_vflag & INP_IPV6PROTO)))) return (EADDRINUSE); } else if (t && (reuseport & t->inp_socket->so_options) == 0 && (ntohl(t->inp_laddr.s_addr) != INADDR_ANY || INP_SOCKAF(so) == INP_SOCKAF(t->inp_socket))) return (EADDRINUSE); } } inp->in6p_laddr = sin6->sin6_addr; } if (lport == 0) { if ((error = in6_pcbsetport(&inp->in6p_laddr, inp, cred)) != 0) return (error); } else { inp->inp_lport = lport; if (in_pcbinshash(inp) != 0) { inp->in6p_laddr = in6addr_any; inp->inp_lport = 0; return (EAGAIN); } } return (0); } /* * Transform old in6_pcbconnect() into an inner subroutine for new * in6_pcbconnect(): Do some validity-checking on the remote * address (in mbuf 'nam') and then determine local host address * (i.e., which interface) to use to access that remote host. * * This preserves definition of in6_pcbconnect(), while supporting a * slightly different version for T/TCP. (This is more than * a bit of a kludge, but cleaning up the internal interfaces would * have forced minor changes in every protocol). */ int in6_pcbladdr(register struct inpcb *inp, struct sockaddr *nam, struct in6_addr **plocal_addr6) { INIT_VNET_INET6(inp->inp_vnet); register struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)nam; int error = 0; struct ifnet *ifp = NULL; int scope_ambiguous = 0; INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo); INP_WLOCK_ASSERT(inp); if (nam->sa_len != sizeof (*sin6)) return (EINVAL); if (sin6->sin6_family != AF_INET6) return (EAFNOSUPPORT); if (sin6->sin6_port == 0) return (EADDRNOTAVAIL); if (sin6->sin6_scope_id == 0 && !V_ip6_use_defzone) scope_ambiguous = 1; if ((error = sa6_embedscope(sin6, V_ip6_use_defzone)) != 0) return(error); if (V_in6_ifaddr) { /* * If the destination address is UNSPECIFIED addr, * use the loopback addr, e.g ::1. */ if (IN6_IS_ADDR_UNSPECIFIED(&sin6->sin6_addr)) sin6->sin6_addr = in6addr_loopback; } if ((error = prison_remote_ip6(inp->inp_cred, &sin6->sin6_addr)) != 0) return (error); /* * XXX: in6_selectsrc might replace the bound local address * with the address specified by setsockopt(IPV6_PKTINFO). * Is it the intended behavior? */ *plocal_addr6 = in6_selectsrc(sin6, inp->in6p_outputopts, inp, NULL, inp->inp_cred, &ifp, &error); if (ifp && scope_ambiguous && (error = in6_setscope(&sin6->sin6_addr, ifp, NULL)) != 0) { return(error); } if (*plocal_addr6 == NULL) { if (error == 0) error = EADDRNOTAVAIL; return (error); } /* * Don't do pcblookup call here; return interface in * plocal_addr6 * and exit to caller, that will do the lookup. */ return (0); } /* * Outer subroutine: * Connect from a socket to a specified address. * Both address and port must be specified in argument sin. * If don't have a local address for this socket yet, * then pick one. */ int in6_pcbconnect(register struct inpcb *inp, struct sockaddr *nam, struct ucred *cred) { struct in6_addr *addr6; register struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)nam; int error; INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo); INP_WLOCK_ASSERT(inp); /* * Call inner routine, to assign local interface address. 
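As the comment in in6_pcbbind() above explains, SO_REUSEADDR is treated like SO_REUSEPORT when the address being bound is multicast, so several sockets may share the same group and port. A typical receiver therefore sets the option before bind(); a compact sketch (group and port are arbitrary examples, actually joining the group with IPV6_JOIN_GROUP is a separate step omitted here, and link-local groups in ff02::/16 would additionally need sin6_scope_id):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <string.h>

    int
    bind_mcast_listener(void)
    {
        struct sockaddr_in6 sin6;
        int s, on = 1;

        if ((s = socket(AF_INET6, SOCK_DGRAM, 0)) == -1)
            return (-1);
        setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        memset(&sin6, 0, sizeof(sin6));
        sin6.sin6_len = sizeof(sin6);           /* BSD convention */
        sin6.sin6_family = AF_INET6;
        sin6.sin6_port = htons(5353);           /* example port */
        inet_pton(AF_INET6, "ff1e::1234", &sin6.sin6_addr); /* example group */

        /* several processes may bind the same group/port this way */
        if (bind(s, (struct sockaddr *)&sin6, sizeof(sin6)) == -1)
            return (-1);
        return (s);
    }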
* in6_pcbladdr() may automatically fill in sin6_scope_id. */ if ((error = in6_pcbladdr(inp, nam, &addr6)) != 0) return (error); if (in6_pcblookup_hash(inp->inp_pcbinfo, &sin6->sin6_addr, sin6->sin6_port, IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr) ? addr6 : &inp->in6p_laddr, inp->inp_lport, 0, NULL) != NULL) { return (EADDRINUSE); } if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) { if (inp->inp_lport == 0) { error = in6_pcbbind(inp, (struct sockaddr *)0, cred); if (error) return (error); } inp->in6p_laddr = *addr6; } inp->in6p_faddr = sin6->sin6_addr; inp->inp_fport = sin6->sin6_port; /* update flowinfo - draft-itojun-ipv6-flowlabel-api-00 */ inp->inp_flow &= ~IPV6_FLOWLABEL_MASK; if (inp->inp_flags & IN6P_AUTOFLOWLABEL) inp->inp_flow |= (htonl(ip6_randomflowlabel()) & IPV6_FLOWLABEL_MASK); in_pcbrehash(inp); return (0); } void in6_pcbdisconnect(struct inpcb *inp) { INP_INFO_WLOCK_ASSERT(inp->inp_pcbinfo); INP_WLOCK_ASSERT(inp); bzero((caddr_t)&inp->in6p_faddr, sizeof(inp->in6p_faddr)); inp->inp_fport = 0; /* clear flowinfo - draft-itojun-ipv6-flowlabel-api-00 */ inp->inp_flow &= ~IPV6_FLOWLABEL_MASK; in_pcbrehash(inp); } struct sockaddr * in6_sockaddr(in_port_t port, struct in6_addr *addr_p) { struct sockaddr_in6 *sin6; sin6 = malloc(sizeof *sin6, M_SONAME, M_WAITOK); bzero(sin6, sizeof *sin6); sin6->sin6_family = AF_INET6; sin6->sin6_len = sizeof(*sin6); sin6->sin6_port = port; sin6->sin6_addr = *addr_p; (void)sa6_recoverscope(sin6); /* XXX: should catch errors */ return (struct sockaddr *)sin6; } struct sockaddr * in6_v4mapsin6_sockaddr(in_port_t port, struct in_addr *addr_p) { struct sockaddr_in sin; struct sockaddr_in6 *sin6_p; bzero(&sin, sizeof sin); sin.sin_family = AF_INET; sin.sin_len = sizeof(sin); sin.sin_port = port; sin.sin_addr = *addr_p; sin6_p = malloc(sizeof *sin6_p, M_SONAME, M_WAITOK); in6_sin_2_v4mapsin6(&sin, sin6_p); return (struct sockaddr *)sin6_p; } int in6_getsockaddr(struct socket *so, struct sockaddr **nam) { register struct inpcb *inp; struct in6_addr addr; in_port_t port; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in6_getsockaddr: inp == NULL")); INP_RLOCK(inp); port = inp->inp_lport; addr = inp->in6p_laddr; INP_RUNLOCK(inp); *nam = in6_sockaddr(port, &addr); return 0; } int in6_getpeeraddr(struct socket *so, struct sockaddr **nam) { struct inpcb *inp; struct in6_addr addr; in_port_t port; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in6_getpeeraddr: inp == NULL")); INP_RLOCK(inp); port = inp->inp_fport; addr = inp->in6p_faddr; INP_RUNLOCK(inp); *nam = in6_sockaddr(port, &addr); return 0; } int in6_mapped_sockaddr(struct socket *so, struct sockaddr **nam) { struct inpcb *inp; int error; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in6_mapped_sockaddr: inp == NULL")); if ((inp->inp_vflag & (INP_IPV4 | INP_IPV6)) == INP_IPV4) { error = in_getsockaddr(so, nam); if (error == 0) in6_sin_2_v4mapsin6_in_sock(nam); } else { /* scope issues will be handled in in6_getsockaddr(). */ error = in6_getsockaddr(so, nam); } return error; } int in6_mapped_peeraddr(struct socket *so, struct sockaddr **nam) { struct inpcb *inp; int error; inp = sotoinpcb(so); KASSERT(inp != NULL, ("in6_mapped_peeraddr: inp == NULL")); if ((inp->inp_vflag & (INP_IPV4 | INP_IPV6)) == INP_IPV4) { error = in_getpeeraddr(so, nam); if (error == 0) in6_sin_2_v4mapsin6_in_sock(nam); } else /* scope issues will be handled in in6_getpeeraddr(). */ error = in6_getpeeraddr(so, nam); return error; } /* * Pass some notification to all connections of a protocol * associated with address dst. 
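in6_v4mapsin6_sockaddr() and the in6_mapped_*addr() wrappers above convert between IPv4 and v4-mapped IPv6 forms so that an AF_INET6 socket can report IPv4 endpoints. The mapping itself just places ::ffff: in front of the 32-bit IPv4 address; a standalone worked example:

    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        struct in_addr v4;
        struct in6_addr mapped;
        char buf[INET6_ADDRSTRLEN];

        inet_pton(AF_INET, "192.0.2.1", &v4);   /* example address */

        memset(&mapped, 0, sizeof(mapped));
        mapped.s6_addr[10] = 0xff;              /* ::ffff:0:0/96 prefix */
        mapped.s6_addr[11] = 0xff;
        memcpy(&mapped.s6_addr[12], &v4.s_addr, 4);

        /* prints ::ffff:192.0.2.1 */
        printf("%s\n", inet_ntop(AF_INET6, &mapped, buf, sizeof(buf)));
        return (0);
    }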
The local address and/or port numbers * may be specified to limit the search. The "usual action" will be * taken, depending on the ctlinput cmd. The caller must filter any * cmds that are uninteresting (e.g., no error in the map). * Call the protocol specific routine (if any) to report * any errors for each matching socket. */ void in6_pcbnotify(struct inpcbinfo *pcbinfo, struct sockaddr *dst, u_int fport_arg, const struct sockaddr *src, u_int lport_arg, int cmd, void *cmdarg, struct inpcb *(*notify)(struct inpcb *, int)) { struct inpcb *inp, *inp_temp; struct sockaddr_in6 sa6_src, *sa6_dst; u_short fport = fport_arg, lport = lport_arg; u_int32_t flowinfo; int errno; if ((unsigned)cmd >= PRC_NCMDS || dst->sa_family != AF_INET6) return; sa6_dst = (struct sockaddr_in6 *)dst; if (IN6_IS_ADDR_UNSPECIFIED(&sa6_dst->sin6_addr)) return; /* * note that src can be NULL when we get notify by local fragmentation. */ sa6_src = (src == NULL) ? sa6_any : *(const struct sockaddr_in6 *)src; flowinfo = sa6_src.sin6_flowinfo; /* * Redirects go to all references to the destination, * and use in6_rtchange to invalidate the route cache. * Dead host indications: also use in6_rtchange to invalidate * the cache, and deliver the error to all the sockets. * Otherwise, if we have knowledge of the local port and address, * deliver only to that socket. */ if (PRC_IS_REDIRECT(cmd) || cmd == PRC_HOSTDEAD) { fport = 0; lport = 0; bzero((caddr_t)&sa6_src.sin6_addr, sizeof(sa6_src.sin6_addr)); if (cmd != PRC_HOSTDEAD) notify = in6_rtchange; } errno = inet6ctlerrmap[cmd]; INP_INFO_WLOCK(pcbinfo); LIST_FOREACH_SAFE(inp, pcbinfo->ipi_listhead, inp_list, inp_temp) { INP_WLOCK(inp); if ((inp->inp_vflag & INP_IPV6) == 0) { INP_WUNLOCK(inp); continue; } /* * If the error designates a new path MTU for a destination * and the application (associated with this socket) wanted to * know the value, notify. Note that we notify for all * disconnected sockets if the corresponding application * wanted. This is because some UDP applications keep sending * sockets disconnected. * XXX: should we avoid to notify the value to TCP sockets? */ if (cmd == PRC_MSGSIZE && (inp->inp_flags & IN6P_MTU) != 0 && (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) || IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr, &sa6_dst->sin6_addr))) { ip6_notify_pmtu(inp, (struct sockaddr_in6 *)dst, (u_int32_t *)cmdarg); } /* * Detect if we should notify the error. If no source and * destination ports are specifed, but non-zero flowinfo and * local address match, notify the error. This is the case * when the error is delivered with an encrypted buffer * by ESP. Otherwise, just compare addresses and ports * as usual. */ if (lport == 0 && fport == 0 && flowinfo && inp->inp_socket != NULL && flowinfo == (inp->inp_flow & IPV6_FLOWLABEL_MASK) && IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, &sa6_src.sin6_addr)) goto do_notify; else if (!IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr, &sa6_dst->sin6_addr) || inp->inp_socket == 0 || (lport && inp->inp_lport != lport) || (!IN6_IS_ADDR_UNSPECIFIED(&sa6_src.sin6_addr) && !IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, &sa6_src.sin6_addr)) || (fport && inp->inp_fport != fport)) { INP_WUNLOCK(inp); continue; } do_notify: if (notify) { if ((*notify)(inp, errno)) INP_WUNLOCK(inp); } else INP_WUNLOCK(inp); } INP_INFO_WUNLOCK(pcbinfo); } /* * Lookup a PCB based on the local address and port. 
*/ struct inpcb * in6_pcblookup_local(struct inpcbinfo *pcbinfo, struct in6_addr *laddr, u_short lport, int wild_okay, struct ucred *cred) { register struct inpcb *inp; int matchwild = 3, wildcard; INP_INFO_WLOCK_ASSERT(pcbinfo); if (!wild_okay) { struct inpcbhead *head; /* * Look for an unconnected (wildcard foreign addr) PCB that * matches the local address and port we're looking for. */ head = &pcbinfo->ipi_hashbase[INP_PCBHASH(INADDR_ANY, lport, 0, pcbinfo->ipi_hashmask)]; LIST_FOREACH(inp, head, inp_hash) { /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV6) == 0) continue; if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) && IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr) && inp->inp_lport == lport) { /* Found. */ if (cred == NULL || - inp->inp_cred->cr_prison == cred->cr_prison) + prison_equal_ip6(cred->cr_prison, + inp->inp_cred->cr_prison)) return (inp); } } /* * Not found. */ return (NULL); } else { struct inpcbporthead *porthash; struct inpcbport *phd; struct inpcb *match = NULL; /* * Best fit PCB lookup. * * First see if this local port is in use by looking on the * port hash list. */ porthash = &pcbinfo->ipi_porthashbase[INP_PCBPORTHASH(lport, pcbinfo->ipi_porthashmask)]; LIST_FOREACH(phd, porthash, phd_hash) { if (phd->phd_port == lport) break; } if (phd != NULL) { /* * Port is in use by one or more PCBs. Look for best * fit. */ LIST_FOREACH(inp, &phd->phd_pcblist, inp_portlist) { wildcard = 0; if (cred != NULL && - inp->inp_cred->cr_prison != cred->cr_prison) + !prison_equal_ip6(cred->cr_prison, + inp->inp_cred->cr_prison)) continue; /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV6) == 0) continue; if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr)) wildcard++; if (!IN6_IS_ADDR_UNSPECIFIED( &inp->in6p_laddr)) { if (IN6_IS_ADDR_UNSPECIFIED(laddr)) wildcard++; else if (!IN6_ARE_ADDR_EQUAL( &inp->in6p_laddr, laddr)) continue; } else { if (!IN6_IS_ADDR_UNSPECIFIED(laddr)) wildcard++; } if (wildcard < matchwild) { match = inp; matchwild = wildcard; if (matchwild == 0) break; } } } return (match); } } void in6_pcbpurgeif0(struct inpcbinfo *pcbinfo, struct ifnet *ifp) { struct inpcb *in6p; struct ip6_moptions *im6o; int i, gap; INP_INFO_RLOCK(pcbinfo); LIST_FOREACH(in6p, pcbinfo->ipi_listhead, inp_list) { INP_WLOCK(in6p); im6o = in6p->in6p_moptions; if ((in6p->inp_vflag & INP_IPV6) && im6o != NULL) { /* * Unselect the outgoing ifp for multicast if it * is being detached. */ if (im6o->im6o_multicast_ifp == ifp) im6o->im6o_multicast_ifp = NULL; /* * Drop multicast group membership if we joined * through the interface being detached. */ gap = 0; for (i = 0; i < im6o->im6o_num_memberships; i++) { if (im6o->im6o_membership[i]->in6m_ifp == ifp) { in6_mc_leave(im6o->im6o_membership[i], NULL); gap++; } else if (gap != 0) { im6o->im6o_membership[i - gap] = im6o->im6o_membership[i]; } } im6o->im6o_num_memberships -= gap; } INP_WUNLOCK(in6p); } INP_INFO_RUNLOCK(pcbinfo); } /* * Check for alternatives when higher level complains * about service problems. For now, invalidate cached * routing information. If the route was created dynamically * (by a redirect), time to try a default gateway again. */ void in6_losing(struct inpcb *in6p) { /* * We don't store route pointers in the routing table anymore */ return; } /* * After a routing change, flush old routing * and allocate a (hopefully) better one. */ struct inpcb * in6_rtchange(struct inpcb *inp, int errno) { /* * We don't store route pointers in the routing table anymore */ return inp; } /* * Lookup PCB in hash list. 
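 * The jail-aware preference below is keyed on prison_flag(cred, PR_IP6)
 * instead of jailed(), so only sockets whose prison (or an ancestor)
 * actually virtualizes IPv6 get the "jailed sockets first" treatment in
 * the exact-match and wildcard passes.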
*/ struct inpcb * in6_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in6_addr *faddr, u_int fport_arg, struct in6_addr *laddr, u_int lport_arg, int wildcard, struct ifnet *ifp) { struct inpcbhead *head; struct inpcb *inp, *tmpinp; u_short fport = fport_arg, lport = lport_arg; int faith; INP_INFO_LOCK_ASSERT(pcbinfo); if (faithprefix_p != NULL) faith = (*faithprefix_p)(laddr); else faith = 0; /* * First look for an exact match. */ tmpinp = NULL; head = &pcbinfo->ipi_hashbase[ INP_PCBHASH(faddr->s6_addr32[3] /* XXX */, lport, fport, pcbinfo->ipi_hashmask)]; LIST_FOREACH(inp, head, inp_hash) { /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV6) == 0) continue; if (IN6_ARE_ADDR_EQUAL(&inp->in6p_faddr, faddr) && IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr) && inp->inp_fport == fport && inp->inp_lport == lport) { /* * XXX We should be able to directly return * the inp here, without any checks. * Well unless both bound with SO_REUSEPORT? */ - if (jailed(inp->inp_cred)) + if (prison_flag(inp->inp_cred, PR_IP6)) return (inp); if (tmpinp == NULL) tmpinp = inp; } } if (tmpinp != NULL) return (tmpinp); /* * Then look for a wildcard match, if requested. */ if (wildcard == INPLOOKUP_WILDCARD) { struct inpcb *local_wild = NULL, *local_exact = NULL; struct inpcb *jail_wild = NULL; int injail; /* * Order of socket selection - we always prefer jails. * 1. jailed, non-wild. * 2. jailed, wild. * 3. non-jailed, non-wild. * 4. non-jailed, wild. */ head = &pcbinfo->ipi_hashbase[INP_PCBHASH(INADDR_ANY, lport, 0, pcbinfo->ipi_hashmask)]; LIST_FOREACH(inp, head, inp_hash) { /* XXX inp locking */ if ((inp->inp_vflag & INP_IPV6) == 0) continue; if (!IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_faddr) || inp->inp_lport != lport) { continue; } /* XXX inp locking */ if (faith && (inp->inp_flags & INP_FAITH) == 0) continue; - injail = jailed(inp->inp_cred); + injail = prison_flag(inp->inp_cred, PR_IP6); if (injail) { if (prison_check_ip6(inp->inp_cred, laddr) != 0) continue; } else { if (local_exact != NULL) continue; } if (IN6_ARE_ADDR_EQUAL(&inp->in6p_laddr, laddr)) { if (injail) return (inp); else local_exact = inp; } else if (IN6_IS_ADDR_UNSPECIFIED(&inp->in6p_laddr)) { if (injail) jail_wild = inp; else local_wild = inp; } } /* LIST_FOREACH */ if (jail_wild != NULL) return (jail_wild); if (local_exact != NULL) return (local_exact); if (local_wild != NULL) return (local_wild); } /* if (wildcard == INPLOOKUP_WILDCARD) */ /* * Not found. */ return (NULL); } void init_sin6(struct sockaddr_in6 *sin6, struct mbuf *m) { struct ip6_hdr *ip; ip = mtod(m, struct ip6_hdr *); bzero(sin6, sizeof(*sin6)); sin6->sin6_len = sizeof(*sin6); sin6->sin6_family = AF_INET6; sin6->sin6_addr = ip->ip6_src; (void)sa6_recoverscope(sin6); /* XXX: should catch errors... */ return; } Index: head/sys/nfsserver/nfs_srvsock.c =================================================================== --- head/sys/nfsserver/nfs_srvsock.c (revision 192894) +++ head/sys/nfsserver/nfs_srvsock.c (revision 192895) @@ -1,816 +1,819 @@ /*- * Copyright (c) 1989, 1991, 1993, 1995 * The Regents of the University of California. All rights reserved. * * This code is derived from software contributed to Berkeley by * Rick Macklem at The University of Guelph. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)nfs_socket.c 8.5 (Berkeley) 3/30/95 */ #include __FBSDID("$FreeBSD$"); /* * Socket operations for use by nfs */ #include "opt_mac.h" #include #include +#include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef NFS_LEGACYRPC #define TRUE 1 #define FALSE 0 static int nfs_realign_test; static int nfs_realign_count; SYSCTL_DECL(_vfs_nfsrv); SYSCTL_INT(_vfs_nfsrv, OID_AUTO, realign_test, CTLFLAG_RW, &nfs_realign_test, 0, ""); SYSCTL_INT(_vfs_nfsrv, OID_AUTO, realign_count, CTLFLAG_RW, &nfs_realign_count, 0, ""); /* * There is a congestion window for outstanding rpcs maintained per mount * point. The cwnd size is adjusted in roughly the way that: * Van Jacobson, Congestion avoidance and Control, In "Proceedings of * SIGCOMM '88". ACM, August 1988. * describes for TCP. The cwnd size is chopped in half on a retransmit timeout * and incremented by 1/cwnd when each rpc reply is received and a full cwnd * of rpcs is in progress. * (The sent count and cwnd are scaled for integer arith.) * Variants of "slow start" were tried and were found to be too much of a * performance hit (ave. rtt 3 times larger), * I suspect due to the large rtt that nfs rpcs have. */ #define NFS_CWNDSCALE 256 #define NFS_MAXCWND (NFS_CWNDSCALE * 32) struct callout nfsrv_callout; static void nfs_realign(struct mbuf **pm, int hsiz); /* XXX SHARED */ static int nfsrv_getstream(struct nfssvc_sock *, int); int32_t (*nfsrv3_procs[NFS_NPROCS])(struct nfsrv_descript *nd, struct nfssvc_sock *slp, struct mbuf **mreqp) = { nfsrv_null, nfsrv_getattr, nfsrv_setattr, nfsrv_lookup, nfsrv3_access, nfsrv_readlink, nfsrv_read, nfsrv_write, nfsrv_create, nfsrv_mkdir, nfsrv_symlink, nfsrv_mknod, nfsrv_remove, nfsrv_rmdir, nfsrv_rename, nfsrv_link, nfsrv_readdir, nfsrv_readdirplus, nfsrv_statfs, nfsrv_fsinfo, nfsrv_pathconf, nfsrv_commit, nfsrv_noop }; /* * Generate the rpc reply header * siz arg. 
is used to decide if adding a cluster is worthwhile */ struct mbuf * nfs_rephead(int siz, struct nfsrv_descript *nd, int err, struct mbuf **mbp, caddr_t *bposp) { u_int32_t *tl; struct mbuf *mreq; caddr_t bpos; struct mbuf *mb; nd->nd_repstat = err; if (err && (nd->nd_flag & ND_NFSV3) == 0) /* XXX recheck */ siz = 0; MGETHDR(mreq, M_WAIT, MT_DATA); mb = mreq; /* * If this is a big reply, use a cluster else * try and leave leading space for the lower level headers. */ mreq->m_len = 6 * NFSX_UNSIGNED; siz += RPC_REPLYSIZ; if ((max_hdr + siz) >= MINCLSIZE) { MCLGET(mreq, M_WAIT); } else mreq->m_data += min(max_hdr, M_TRAILINGSPACE(mreq)); tl = mtod(mreq, u_int32_t *); bpos = ((caddr_t)tl) + mreq->m_len; *tl++ = txdr_unsigned(nd->nd_retxid); *tl++ = nfsrv_rpc_reply; if (err == ERPCMISMATCH || (err & NFSERR_AUTHERR)) { *tl++ = nfsrv_rpc_msgdenied; if (err & NFSERR_AUTHERR) { *tl++ = nfsrv_rpc_autherr; *tl = txdr_unsigned(err & ~NFSERR_AUTHERR); mreq->m_len -= NFSX_UNSIGNED; bpos -= NFSX_UNSIGNED; } else { *tl++ = nfsrv_rpc_mismatch; *tl++ = txdr_unsigned(RPC_VER2); *tl = txdr_unsigned(RPC_VER2); } } else { *tl++ = nfsrv_rpc_msgaccepted; /* * Send a RPCAUTH_NULL verifier - no Kerberos. */ *tl++ = 0; *tl++ = 0; switch (err) { case EPROGUNAVAIL: *tl = txdr_unsigned(RPC_PROGUNAVAIL); break; case EPROGMISMATCH: *tl = txdr_unsigned(RPC_PROGMISMATCH); tl = nfsm_build(u_int32_t *, 2 * NFSX_UNSIGNED); *tl++ = txdr_unsigned(2); *tl = txdr_unsigned(3); break; case EPROCUNAVAIL: *tl = txdr_unsigned(RPC_PROCUNAVAIL); break; case EBADRPC: *tl = txdr_unsigned(RPC_GARBAGE); break; default: *tl = 0; if (err != NFSERR_RETVOID) { tl = nfsm_build(u_int32_t *, NFSX_UNSIGNED); if (err) *tl = txdr_unsigned(nfsrv_errmap(nd, err)); else *tl = 0; } break; } } *mbp = mb; *bposp = bpos; if (err != 0 && err != NFSERR_RETVOID) nfsrvstats.srvrpc_errs++; return mreq; } /* * nfs_realign: * * Check for badly aligned mbuf data and realign by copying the unaligned * portion of the data into a new mbuf chain and freeing the portions * of the old chain that were replaced. * * We cannot simply realign the data within the existing mbuf chain * because the underlying buffers may contain other rpc commands and * we cannot afford to overwrite them. * * We would prefer to avoid this situation entirely. The situation does * not occur with NFS/UDP and is supposed to only occassionally occur * with TCP. Use vfs.nfs.realign_count and realign_test to check this. */ static void nfs_realign(struct mbuf **pm, int hsiz) /* XXX COMMON */ { struct mbuf *m; struct mbuf *n = NULL; int off = 0; ++nfs_realign_test; while ((m = *pm) != NULL) { if ((m->m_len & 0x3) || (mtod(m, intptr_t) & 0x3)) { MGET(n, M_WAIT, MT_DATA); if (m->m_len >= MINCLSIZE) { MCLGET(n, M_WAIT); } n->m_len = 0; break; } pm = &m->m_next; } /* * If n is non-NULL, loop on m copying data, then replace the * portion of the chain that had to be realigned. */ if (n != NULL) { ++nfs_realign_count; while (m) { m_copyback(n, off, m->m_len, mtod(m, caddr_t)); off += m->m_len; m = m->m_next; } m_freem(*pm); *pm = n; } } /* * Parse an RPC request * - verify it * - fill in the cred struct. 
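 * Only AUTH_UNIX credentials are accepted: the uid, gid and supplementary
 * gids from the call header are copied into nd_cr (capped at NGROUPS
 * entries and sorted); any other flavor is rejected with
 * NFSERR_AUTHERR | AUTH_REJECTCRED.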
*/ int nfs_getreq(struct nfsrv_descript *nd, struct nfsd *nfsd, int has_header) { int len, i; u_int32_t *tl; caddr_t dpos; u_int32_t nfsvers, auth_type; int error = 0; struct mbuf *mrep, *md; NFSD_LOCK_ASSERT(); mrep = nd->nd_mrep; md = nd->nd_md; dpos = nd->nd_dpos; if (has_header) { tl = nfsm_dissect_nonblock(u_int32_t *, 10 * NFSX_UNSIGNED); nd->nd_retxid = fxdr_unsigned(u_int32_t, *tl++); if (*tl++ != nfsrv_rpc_call) { m_freem(mrep); return (EBADRPC); } } else tl = nfsm_dissect_nonblock(u_int32_t *, 8 * NFSX_UNSIGNED); nd->nd_repstat = 0; nd->nd_flag = 0; if (*tl++ != nfsrv_rpc_vers) { nd->nd_repstat = ERPCMISMATCH; nd->nd_procnum = NFSPROC_NOOP; return (0); } if (*tl != nfsrv_nfs_prog) { nd->nd_repstat = EPROGUNAVAIL; nd->nd_procnum = NFSPROC_NOOP; return (0); } tl++; nfsvers = fxdr_unsigned(u_int32_t, *tl++); if (nfsvers < NFS_VER2 || nfsvers > NFS_VER3) { nd->nd_repstat = EPROGMISMATCH; nd->nd_procnum = NFSPROC_NOOP; return (0); } nd->nd_procnum = fxdr_unsigned(u_int32_t, *tl++); if (nd->nd_procnum == NFSPROC_NULL) return (0); if (nfsvers == NFS_VER3) { nd->nd_flag = ND_NFSV3; if (nd->nd_procnum >= NFS_NPROCS) { nd->nd_repstat = EPROCUNAVAIL; nd->nd_procnum = NFSPROC_NOOP; return (0); } } else { if (nd->nd_procnum > NFSV2PROC_STATFS) { nd->nd_repstat = EPROCUNAVAIL; nd->nd_procnum = NFSPROC_NOOP; return (0); } /* Map the v2 procedure numbers into v3 ones */ nd->nd_procnum = nfsrv_nfsv3_procid[nd->nd_procnum]; } auth_type = *tl++; len = fxdr_unsigned(int, *tl++); if (len < 0 || len > RPCAUTH_MAXSIZ) { m_freem(mrep); return (EBADRPC); } /* * Handle auth_unix; */ if (auth_type == nfsrv_rpc_auth_unix) { len = fxdr_unsigned(int, *++tl); if (len < 0 || len > NFS_MAXNAMLEN) { m_freem(mrep); return (EBADRPC); } nfsm_adv(nfsm_rndup(len)); tl = nfsm_dissect_nonblock(u_int32_t *, 3 * NFSX_UNSIGNED); nd->nd_cr->cr_uid = nd->nd_cr->cr_ruid = nd->nd_cr->cr_svuid = fxdr_unsigned(uid_t, *tl++); nd->nd_cr->cr_groups[0] = nd->nd_cr->cr_rgid = nd->nd_cr->cr_svgid = fxdr_unsigned(gid_t, *tl++); #ifdef MAC mac_cred_associate_nfsd(nd->nd_cr); #endif len = fxdr_unsigned(int, *tl); if (len < 0 || len > RPCAUTH_UNIXGIDS) { m_freem(mrep); return (EBADRPC); } tl = nfsm_dissect_nonblock(u_int32_t *, (len + 2) * NFSX_UNSIGNED); for (i = 1; i <= len; i++) if (i < NGROUPS) nd->nd_cr->cr_groups[i] = fxdr_unsigned(gid_t, *tl++); else tl++; nd->nd_cr->cr_ngroups = (len >= NGROUPS) ? NGROUPS : (len + 1); if (nd->nd_cr->cr_ngroups > 1) nfsrvw_sort(nd->nd_cr->cr_groups, nd->nd_cr->cr_ngroups); len = fxdr_unsigned(int, *++tl); if (len < 0 || len > RPCAUTH_MAXSIZ) { m_freem(mrep); return (EBADRPC); } if (len > 0) nfsm_adv(nfsm_rndup(len)); nd->nd_credflavor = RPCAUTH_UNIX; } else { nd->nd_repstat = (NFSERR_AUTHERR | AUTH_REJECTCRED); nd->nd_procnum = NFSPROC_NOOP; return (0); } nd->nd_md = md; nd->nd_dpos = dpos; return (0); nfsmout: return (error); } /* * Socket upcall routine for the nfsd sockets. * The caddr_t arg is a pointer to the "struct nfssvc_sock". * Essentially do as much as possible non-blocking, else punt and it will * be called with M_WAIT from an nfsd. */ void nfsrv_rcv(struct socket *so, void *arg, int waitflag) { struct nfssvc_sock *slp = (struct nfssvc_sock *)arg; struct mbuf *m; struct mbuf *mp; struct sockaddr *nam; struct uio auio; int flags, error; NFSD_UNLOCK_ASSERT(); /* XXXRW: Unlocked read. */ if ((slp->ns_flag & SLP_VALID) == 0) return; /* * We can't do this in the context of a socket callback * because we're called with locks held. 
* XXX: SMP */ if (waitflag == M_DONTWAIT) { NFSD_LOCK(); slp->ns_flag |= SLP_NEEDQ; goto dorecs; } NFSD_LOCK(); auio.uio_td = NULL; if (so->so_type == SOCK_STREAM) { /* * If there are already records on the queue, defer soreceive() * to an nfsd so that there is feedback to the TCP layer that * the nfs servers are heavily loaded. */ if (STAILQ_FIRST(&slp->ns_rec) != NULL && waitflag == M_DONTWAIT) { slp->ns_flag |= SLP_NEEDQ; goto dorecs; } /* * Do soreceive(). */ auio.uio_resid = 1000000000; flags = MSG_DONTWAIT; NFSD_UNLOCK(); error = soreceive(so, &nam, &auio, &mp, NULL, &flags); NFSD_LOCK(); if (error || mp == NULL) { if (error == EWOULDBLOCK) slp->ns_flag |= SLP_NEEDQ; else slp->ns_flag |= SLP_DISCONN; goto dorecs; } m = mp; if (slp->ns_rawend) { slp->ns_rawend->m_next = m; slp->ns_cc += 1000000000 - auio.uio_resid; } else { slp->ns_raw = m; slp->ns_cc = 1000000000 - auio.uio_resid; } while (m->m_next) m = m->m_next; slp->ns_rawend = m; /* * Now try and parse record(s) out of the raw stream data. */ error = nfsrv_getstream(slp, waitflag); if (error) { if (error == EPERM) slp->ns_flag |= SLP_DISCONN; else slp->ns_flag |= SLP_NEEDQ; } } else { do { auio.uio_resid = 1000000000; flags = MSG_DONTWAIT; NFSD_UNLOCK(); error = soreceive(so, &nam, &auio, &mp, NULL, &flags); if (mp) { struct nfsrv_rec *rec; rec = malloc(sizeof(struct nfsrv_rec), M_NFSRVDESC, waitflag == M_DONTWAIT ? M_NOWAIT : M_WAITOK); if (!rec) { if (nam) free(nam, M_SONAME); m_freem(mp); NFSD_LOCK(); continue; } nfs_realign(&mp, 10 * NFSX_UNSIGNED); NFSD_LOCK(); rec->nr_address = nam; rec->nr_packet = mp; STAILQ_INSERT_TAIL(&slp->ns_rec, rec, nr_link); } else NFSD_LOCK(); if (error) { if ((so->so_proto->pr_flags & PR_CONNREQUIRED) && error != EWOULDBLOCK) { slp->ns_flag |= SLP_DISCONN; goto dorecs; } } } while (mp); } /* * Now try and process the request records, non-blocking. */ dorecs: if (waitflag == M_DONTWAIT && (STAILQ_FIRST(&slp->ns_rec) != NULL || (slp->ns_flag & (SLP_NEEDQ | SLP_DISCONN)))) nfsrv_wakenfsd(slp); NFSD_UNLOCK(); } /* * Try and extract an RPC request from the mbuf data list received on a * stream socket. The "waitflag" argument indicates whether or not it * can sleep. */ static int nfsrv_getstream(struct nfssvc_sock *slp, int waitflag) { struct mbuf *m, **mpp; char *cp1, *cp2; int len; struct mbuf *om, *m2, *recm; u_int32_t recmark; NFSD_LOCK_ASSERT(); if (slp->ns_flag & SLP_GETSTREAM) panic("nfs getstream"); slp->ns_flag |= SLP_GETSTREAM; for (;;) { if (slp->ns_reclen == 0) { if (slp->ns_cc < NFSX_UNSIGNED) { slp->ns_flag &= ~SLP_GETSTREAM; return (0); } m = slp->ns_raw; if (m->m_len >= NFSX_UNSIGNED) { bcopy(mtod(m, caddr_t), (caddr_t)&recmark, NFSX_UNSIGNED); m->m_data += NFSX_UNSIGNED; m->m_len -= NFSX_UNSIGNED; } else { cp1 = (caddr_t)&recmark; cp2 = mtod(m, caddr_t); while (cp1 < ((caddr_t)&recmark) + NFSX_UNSIGNED) { while (m->m_len == 0) { m = m->m_next; cp2 = mtod(m, caddr_t); } *cp1++ = *cp2++; m->m_data++; m->m_len--; } } slp->ns_cc -= NFSX_UNSIGNED; recmark = ntohl(recmark); slp->ns_reclen = recmark & ~0x80000000; if (recmark & 0x80000000) slp->ns_flag |= SLP_LASTFRAG; else slp->ns_flag &= ~SLP_LASTFRAG; if (slp->ns_reclen > NFS_MAXPACKET || slp->ns_reclen <= 0) { slp->ns_flag &= ~SLP_GETSTREAM; return (EPERM); } } /* * Now get the record part. * * Note that slp->ns_reclen may be 0. Linux sometimes * generates 0-length RPCs. 
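 * The record mark decoded above follows the usual ONC RPC record-marking
 * scheme for stream transports: the high bit flags the last fragment and
 * the low 31 bits give the fragment length, so a mark of 0x8000012c, for
 * example, announces a final 300-byte fragment.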
*/ recm = NULL; if (slp->ns_cc == slp->ns_reclen) { recm = slp->ns_raw; slp->ns_raw = slp->ns_rawend = NULL; slp->ns_cc = slp->ns_reclen = 0; } else if (slp->ns_cc > slp->ns_reclen) { len = 0; m = slp->ns_raw; om = NULL; while (len < slp->ns_reclen) { if ((len + m->m_len) > slp->ns_reclen) { NFSD_UNLOCK(); m2 = m_copym(m, 0, slp->ns_reclen - len, waitflag); NFSD_LOCK(); if (m2) { if (om) { om->m_next = m2; recm = slp->ns_raw; } else recm = m2; m->m_data += slp->ns_reclen - len; m->m_len -= slp->ns_reclen - len; len = slp->ns_reclen; } else { slp->ns_flag &= ~SLP_GETSTREAM; return (EWOULDBLOCK); } } else if ((len + m->m_len) == slp->ns_reclen) { om = m; len += m->m_len; m = m->m_next; recm = slp->ns_raw; om->m_next = NULL; } else { om = m; len += m->m_len; m = m->m_next; } } slp->ns_raw = m; slp->ns_cc -= len; slp->ns_reclen = 0; } else { slp->ns_flag &= ~SLP_GETSTREAM; return (0); } /* * Accumulate the fragments into a record. */ mpp = &slp->ns_frag; while (*mpp) mpp = &((*mpp)->m_next); *mpp = recm; if (slp->ns_flag & SLP_LASTFRAG) { struct nfsrv_rec *rec; NFSD_UNLOCK(); rec = malloc(sizeof(struct nfsrv_rec), M_NFSRVDESC, waitflag == M_DONTWAIT ? M_NOWAIT : M_WAITOK); if (rec) { nfs_realign(&slp->ns_frag, 10 * NFSX_UNSIGNED); rec->nr_address = NULL; rec->nr_packet = slp->ns_frag; NFSD_LOCK(); STAILQ_INSERT_TAIL(&slp->ns_rec, rec, nr_link); } else { NFSD_LOCK(); } if (!rec) { m_freem(slp->ns_frag); } slp->ns_frag = NULL; } } } /* * Parse an RPC header. */ int nfsrv_dorec(struct nfssvc_sock *slp, struct nfsd *nfsd, struct nfsrv_descript **ndp) { struct nfsrv_rec *rec; struct mbuf *m; struct sockaddr *nam; struct nfsrv_descript *nd; int error; NFSD_LOCK_ASSERT(); *ndp = NULL; if ((slp->ns_flag & SLP_VALID) == 0 || STAILQ_FIRST(&slp->ns_rec) == NULL) return (ENOBUFS); rec = STAILQ_FIRST(&slp->ns_rec); KASSERT(rec->nr_packet != NULL, ("nfsrv_dorec: missing mbuf")); STAILQ_REMOVE_HEAD(&slp->ns_rec, nr_link); nam = rec->nr_address; m = rec->nr_packet; free(rec, M_NFSRVDESC); NFSD_UNLOCK(); nd = malloc(sizeof (struct nfsrv_descript), M_NFSRVDESC, M_WAITOK); nd->nd_cr = crget(); + prison_hold(&prison0); + nd->nd_cr->cr_prison = &prison0; NFSD_LOCK(); nd->nd_md = nd->nd_mrep = m; nd->nd_nam2 = nam; nd->nd_dpos = mtod(m, caddr_t); error = nfs_getreq(nd, nfsd, TRUE); if (error) { if (nam) { free(nam, M_SONAME); } if (nd->nd_cr != NULL) crfree(nd->nd_cr); free((caddr_t)nd, M_NFSRVDESC); return (error); } *ndp = nd; nfsd->nfsd_nd = nd; return (0); } /* * Search for a sleeping nfsd and wake it up. * SIDE EFFECT: If none found, set NFSD_CHECKSLP flag, so that one of the * running nfsds will go look for the work in the nfssvc_sock list. */ void nfsrv_wakenfsd(struct nfssvc_sock *slp) { struct nfsd *nd; NFSD_LOCK_ASSERT(); if ((slp->ns_flag & SLP_VALID) == 0) return; TAILQ_FOREACH(nd, &nfsd_head, nfsd_chain) { if (nd->nfsd_flag & NFSD_WAITING) { nd->nfsd_flag &= ~NFSD_WAITING; if (nd->nfsd_slp) panic("nfsd wakeup"); slp->ns_sref++; nd->nfsd_slp = slp; wakeup(nd); return; } } slp->ns_flag |= SLP_DOREC; nfsd_head_flag |= NFSD_CHECKSLP; } /* * This is the nfs send routine. * For the server side: * - return EINTR or ERESTART if interrupted by a signal * - return EPIPE if a connection is lost for connection based sockets (TCP...) * - do any cleanup required by recoverable socket errors (?) 
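 * In practice "recoverable" means ENOBUFS on a datagram socket is ignored
 * outright, and every other error except EINTR, ERESTART, EWOULDBLOCK and
 * EPIPE is logged and then squashed to 0.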
*/ int nfsrv_send(struct socket *so, struct sockaddr *nam, struct mbuf *top) { struct sockaddr *sendnam; int error, soflags, flags; NFSD_UNLOCK_ASSERT(); soflags = so->so_proto->pr_flags; if ((soflags & PR_CONNREQUIRED) || (so->so_state & SS_ISCONNECTED)) sendnam = NULL; else sendnam = nam; if (so->so_type == SOCK_SEQPACKET) flags = MSG_EOR; else flags = 0; error = sosend(so, sendnam, 0, top, 0, flags, curthread/*XXX*/); if (error == ENOBUFS && so->so_type == SOCK_DGRAM) error = 0; if (error) { log(LOG_INFO, "nfsd send error %d\n", error); /* * Handle any recoverable (soft) socket errors here. (?) */ if (error != EINTR && error != ERESTART && error != EWOULDBLOCK && error != EPIPE) error = 0; } return (error); } /* * NFS server timer routine. */ void nfsrv_timer(void *arg) { struct nfssvc_sock *slp; u_quad_t cur_usec; NFSD_LOCK(); /* * Scan the write gathering queues for writes that need to be * completed now. */ cur_usec = nfs_curusec(); TAILQ_FOREACH(slp, &nfssvc_sockhead, ns_chain) { if (LIST_FIRST(&slp->ns_tq) && LIST_FIRST(&slp->ns_tq)->nd_time <= cur_usec) nfsrv_wakenfsd(slp); } NFSD_UNLOCK(); callout_reset(&nfsrv_callout, nfsrv_ticks, nfsrv_timer, NULL); } #endif /* NFS_LEGACYRPC */ Index: head/sys/security/mac_bsdextended/mac_bsdextended.c =================================================================== --- head/sys/security/mac_bsdextended/mac_bsdextended.c (revision 192894) +++ head/sys/security/mac_bsdextended/mac_bsdextended.c (revision 192895) @@ -1,526 +1,526 @@ /*- * Copyright (c) 1999-2002, 2007-2008 Robert N. M. Watson * Copyright (c) 2001-2005 Networks Associates Technology, Inc. * Copyright (c) 2005 Tom Rhodes * Copyright (c) 2006 SPARTA, Inc. * All rights reserved. * * This software was developed by Robert Watson for the TrustedBSD Project. * It was later enhanced by Tom Rhodes for the TrustedBSD Project. * * This software was developed for the FreeBSD Project in part by Network * Associates Laboratories, the Security Research Division of Network * Associates, Inc. under DARPA/SPAWAR contract N66001-01-C-8035 ("CBOSS"), * as part of the DARPA CHATS research program. * * This software was enhanced by SPARTA ISSO under SPAWAR contract * N66001-04-C-6019 ("SEFOS"). * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ /* * Developed by the TrustedBSD Project. 
* * "BSD Extended" MAC policy, allowing the administrator to impose mandatory * firewall-like rules regarding users and file system objects. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include static struct mtx ugidfw_mtx; SYSCTL_DECL(_security_mac); SYSCTL_NODE(_security_mac, OID_AUTO, bsdextended, CTLFLAG_RW, 0, "TrustedBSD extended BSD MAC policy controls"); static int ugidfw_enabled = 1; SYSCTL_INT(_security_mac_bsdextended, OID_AUTO, enabled, CTLFLAG_RW, &ugidfw_enabled, 0, "Enforce extended BSD policy"); TUNABLE_INT("security.mac.bsdextended.enabled", &ugidfw_enabled); MALLOC_DEFINE(M_MACBSDEXTENDED, "mac_bsdextended", "BSD Extended MAC rule"); #define MAC_BSDEXTENDED_MAXRULES 250 static struct mac_bsdextended_rule *rules[MAC_BSDEXTENDED_MAXRULES]; static int rule_count = 0; static int rule_slots = 0; static int rule_version = MB_VERSION; SYSCTL_INT(_security_mac_bsdextended, OID_AUTO, rule_count, CTLFLAG_RD, &rule_count, 0, "Number of defined rules\n"); SYSCTL_INT(_security_mac_bsdextended, OID_AUTO, rule_slots, CTLFLAG_RD, &rule_slots, 0, "Number of used rule slots\n"); SYSCTL_INT(_security_mac_bsdextended, OID_AUTO, rule_version, CTLFLAG_RD, &rule_version, 0, "Version number for API\n"); /* * This is just used for logging purposes, eventually we would like to log * much more then failed requests. */ static int ugidfw_logging; SYSCTL_INT(_security_mac_bsdextended, OID_AUTO, logging, CTLFLAG_RW, &ugidfw_logging, 0, "Log failed authorization requests"); /* * This tunable is here for compatibility. It will allow the user to switch * between the new mode (first rule matches) and the old functionality (all * rules match). */ static int ugidfw_firstmatch_enabled; SYSCTL_INT(_security_mac_bsdextended, OID_AUTO, firstmatch_enabled, CTLFLAG_RW, &ugidfw_firstmatch_enabled, 1, "Disable/enable match first rule functionality"); static int ugidfw_rule_valid(struct mac_bsdextended_rule *rule) { if ((rule->mbr_subject.mbs_flags | MBS_ALL_FLAGS) != MBS_ALL_FLAGS) return (EINVAL); if ((rule->mbr_subject.mbs_neg | MBS_ALL_FLAGS) != MBS_ALL_FLAGS) return (EINVAL); if ((rule->mbr_object.mbo_flags | MBO_ALL_FLAGS) != MBO_ALL_FLAGS) return (EINVAL); if ((rule->mbr_object.mbo_neg | MBO_ALL_FLAGS) != MBO_ALL_FLAGS) return (EINVAL); if ((rule->mbr_object.mbo_neg | MBO_TYPE_DEFINED) && (rule->mbr_object.mbo_type | MBO_ALL_TYPE) != MBO_ALL_TYPE) return (EINVAL); if ((rule->mbr_mode | MBI_ALLPERM) != MBI_ALLPERM) return (EINVAL); return (0); } static int sysctl_rule(SYSCTL_HANDLER_ARGS) { struct mac_bsdextended_rule temprule, *ruleptr; u_int namelen; int error, index, *name; error = 0; name = (int *)arg1; namelen = arg2; if (namelen != 1) return (EINVAL); index = name[0]; if (index >= MAC_BSDEXTENDED_MAXRULES) return (ENOENT); ruleptr = NULL; if (req->newptr && req->newlen != 0) { error = SYSCTL_IN(req, &temprule, sizeof(temprule)); if (error) return (error); ruleptr = malloc(sizeof(*ruleptr), M_MACBSDEXTENDED, M_WAITOK | M_ZERO); } mtx_lock(&ugidfw_mtx); if (req->oldptr) { if (index < 0 || index > rule_slots + 1) { error = ENOENT; goto out; } if (rules[index] == NULL) { error = ENOENT; goto out; } temprule = *rules[index]; } if (req->newptr && req->newlen == 0) { KASSERT(ruleptr == NULL, ("sysctl_rule: ruleptr != NULL")); ruleptr = rules[index]; if (ruleptr == NULL) { error = ENOENT; goto out; } rule_count--; rules[index] = NULL; } else if (req->newptr) { error = 
ugidfw_rule_valid(&temprule); if (error) goto out; if (rules[index] == NULL) { *ruleptr = temprule; rules[index] = ruleptr; ruleptr = NULL; if (index + 1 > rule_slots) rule_slots = index + 1; rule_count++; } else *rules[index] = temprule; } out: mtx_unlock(&ugidfw_mtx); if (ruleptr != NULL) free(ruleptr, M_MACBSDEXTENDED); if (req->oldptr && error == 0) error = SYSCTL_OUT(req, &temprule, sizeof(temprule)); return (error); } SYSCTL_NODE(_security_mac_bsdextended, OID_AUTO, rules, CTLFLAG_MPSAFE | CTLFLAG_RW, sysctl_rule, "BSD extended MAC rules"); static void ugidfw_init(struct mac_policy_conf *mpc) { mtx_init(&ugidfw_mtx, "mac_bsdextended lock", NULL, MTX_DEF); } static void ugidfw_destroy(struct mac_policy_conf *mpc) { int i; for (i = 0; i < MAC_BSDEXTENDED_MAXRULES; i++) { if (rules[i] != NULL) free(rules[i], M_MACBSDEXTENDED); } mtx_destroy(&ugidfw_mtx); } static int ugidfw_rulecheck(struct mac_bsdextended_rule *rule, struct ucred *cred, struct vnode *vp, struct vattr *vap, int acc_mode) { int mac_granted, match, priv_granted; int i; /* * Is there a subject match? */ mtx_assert(&ugidfw_mtx, MA_OWNED); if (rule->mbr_subject.mbs_flags & MBS_UID_DEFINED) { match = ((cred->cr_uid <= rule->mbr_subject.mbs_uid_max && cred->cr_uid >= rule->mbr_subject.mbs_uid_min) || (cred->cr_ruid <= rule->mbr_subject.mbs_uid_max && cred->cr_ruid >= rule->mbr_subject.mbs_uid_min) || (cred->cr_svuid <= rule->mbr_subject.mbs_uid_max && cred->cr_svuid >= rule->mbr_subject.mbs_uid_min)); if (rule->mbr_subject.mbs_neg & MBS_UID_DEFINED) match = !match; if (!match) return (0); } if (rule->mbr_subject.mbs_flags & MBS_GID_DEFINED) { match = ((cred->cr_rgid <= rule->mbr_subject.mbs_gid_max && cred->cr_rgid >= rule->mbr_subject.mbs_gid_min) || (cred->cr_svgid <= rule->mbr_subject.mbs_gid_max && cred->cr_svgid >= rule->mbr_subject.mbs_gid_min)); if (!match) { for (i = 0; i < cred->cr_ngroups; i++) { if (cred->cr_groups[i] <= rule->mbr_subject.mbs_gid_max && cred->cr_groups[i] >= rule->mbr_subject.mbs_gid_min) { match = 1; break; } } } if (rule->mbr_subject.mbs_neg & MBS_GID_DEFINED) match = !match; if (!match) return (0); } if (rule->mbr_subject.mbs_flags & MBS_PRISON_DEFINED) { - match = (cred->cr_prison != NULL && - cred->cr_prison->pr_id == rule->mbr_subject.mbs_prison); + match = + (cred->cr_prison->pr_id == rule->mbr_subject.mbs_prison); if (rule->mbr_subject.mbs_neg & MBS_PRISON_DEFINED) match = !match; if (!match) return (0); } /* * Is there an object match? 
*/ if (rule->mbr_object.mbo_flags & MBO_UID_DEFINED) { match = (vap->va_uid <= rule->mbr_object.mbo_uid_max && vap->va_uid >= rule->mbr_object.mbo_uid_min); if (rule->mbr_object.mbo_neg & MBO_UID_DEFINED) match = !match; if (!match) return (0); } if (rule->mbr_object.mbo_flags & MBO_GID_DEFINED) { match = (vap->va_gid <= rule->mbr_object.mbo_gid_max && vap->va_gid >= rule->mbr_object.mbo_gid_min); if (rule->mbr_object.mbo_neg & MBO_GID_DEFINED) match = !match; if (!match) return (0); } if (rule->mbr_object.mbo_flags & MBO_FSID_DEFINED) { match = (bcmp(&(vp->v_mount->mnt_stat.f_fsid), &(rule->mbr_object.mbo_fsid), sizeof(rule->mbr_object.mbo_fsid)) == 0); if (rule->mbr_object.mbo_neg & MBO_FSID_DEFINED) match = !match; if (!match) return (0); } if (rule->mbr_object.mbo_flags & MBO_SUID) { match = (vap->va_mode & S_ISUID); if (rule->mbr_object.mbo_neg & MBO_SUID) match = !match; if (!match) return (0); } if (rule->mbr_object.mbo_flags & MBO_SGID) { match = (vap->va_mode & S_ISGID); if (rule->mbr_object.mbo_neg & MBO_SGID) match = !match; if (!match) return (0); } if (rule->mbr_object.mbo_flags & MBO_UID_SUBJECT) { match = (vap->va_uid == cred->cr_uid || vap->va_uid == cred->cr_ruid || vap->va_uid == cred->cr_svuid); if (rule->mbr_object.mbo_neg & MBO_UID_SUBJECT) match = !match; if (!match) return (0); } if (rule->mbr_object.mbo_flags & MBO_GID_SUBJECT) { match = (groupmember(vap->va_gid, cred) || vap->va_gid == cred->cr_rgid || vap->va_gid == cred->cr_svgid); if (rule->mbr_object.mbo_neg & MBO_GID_SUBJECT) match = !match; if (!match) return (0); } if (rule->mbr_object.mbo_flags & MBO_TYPE_DEFINED) { switch (vap->va_type) { case VREG: match = (rule->mbr_object.mbo_type & MBO_TYPE_REG); break; case VDIR: match = (rule->mbr_object.mbo_type & MBO_TYPE_DIR); break; case VBLK: match = (rule->mbr_object.mbo_type & MBO_TYPE_BLK); break; case VCHR: match = (rule->mbr_object.mbo_type & MBO_TYPE_CHR); break; case VLNK: match = (rule->mbr_object.mbo_type & MBO_TYPE_LNK); break; case VSOCK: match = (rule->mbr_object.mbo_type & MBO_TYPE_SOCK); break; case VFIFO: match = (rule->mbr_object.mbo_type & MBO_TYPE_FIFO); break; default: match = 0; } if (rule->mbr_object.mbo_neg & MBO_TYPE_DEFINED) match = !match; if (!match) return (0); } /* * MBI_APPEND should not be here as it should get converted to * MBI_WRITE. */ priv_granted = 0; mac_granted = rule->mbr_mode; if ((acc_mode & MBI_ADMIN) && (mac_granted & MBI_ADMIN) == 0 && priv_check_cred(cred, PRIV_VFS_ADMIN, 0) == 0) priv_granted |= MBI_ADMIN; if ((acc_mode & MBI_EXEC) && (mac_granted & MBI_EXEC) == 0 && priv_check_cred(cred, (vap->va_type == VDIR) ? PRIV_VFS_LOOKUP : PRIV_VFS_EXEC, 0) == 0) priv_granted |= MBI_EXEC; if ((acc_mode & MBI_READ) && (mac_granted & MBI_READ) == 0 && priv_check_cred(cred, PRIV_VFS_READ, 0) == 0) priv_granted |= MBI_READ; if ((acc_mode & MBI_STAT) && (mac_granted & MBI_STAT) == 0 && priv_check_cred(cred, PRIV_VFS_STAT, 0) == 0) priv_granted |= MBI_STAT; if ((acc_mode & MBI_WRITE) && (mac_granted & MBI_WRITE) == 0 && priv_check_cred(cred, PRIV_VFS_WRITE, 0) == 0) priv_granted |= MBI_WRITE; /* * Is the access permitted? */ if (((mac_granted | priv_granted) & acc_mode) != acc_mode) { if (ugidfw_logging) log(LOG_AUTHPRIV, "mac_bsdextended: %d:%d request %d" " on %d:%d failed. \n", cred->cr_ruid, cred->cr_rgid, acc_mode, vap->va_uid, vap->va_gid); return (EACCES); } /* * If the rule matched, permits access, and first match is enabled, * return success. 
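 * EJUSTRETURN doubles as the in-band "stop scanning" value here:
 * ugidfw_check() below breaks out of its rule loop on seeing it and then
 * reports success, while any real error is returned to the caller as-is.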
*/ if (ugidfw_firstmatch_enabled) return (EJUSTRETURN); else return (0); } int ugidfw_check(struct ucred *cred, struct vnode *vp, struct vattr *vap, int acc_mode) { int error, i; /* * Since we do not separately handle append, map append to write. */ if (acc_mode & MBI_APPEND) { acc_mode &= ~MBI_APPEND; acc_mode |= MBI_WRITE; } mtx_lock(&ugidfw_mtx); for (i = 0; i < rule_slots; i++) { if (rules[i] == NULL) continue; error = ugidfw_rulecheck(rules[i], cred, vp, vap, acc_mode); if (error == EJUSTRETURN) break; if (error) { mtx_unlock(&ugidfw_mtx); return (error); } } mtx_unlock(&ugidfw_mtx); return (0); } int ugidfw_check_vp(struct ucred *cred, struct vnode *vp, int acc_mode) { int error; struct vattr vap; if (!ugidfw_enabled) return (0); error = VOP_GETATTR(vp, &vap, cred); if (error) return (error); return (ugidfw_check(cred, vp, &vap, acc_mode)); } int ugidfw_accmode2mbi(accmode_t accmode) { int mbi; mbi = 0; if (accmode & VEXEC) mbi |= MBI_EXEC; if (accmode & VWRITE) mbi |= MBI_WRITE; if (accmode & VREAD) mbi |= MBI_READ; if (accmode & VADMIN_PERMS) mbi |= MBI_ADMIN; if (accmode & VSTAT_PERMS) mbi |= MBI_STAT; if (accmode & VAPPEND) mbi |= MBI_APPEND; return (mbi); } static struct mac_policy_ops ugidfw_ops = { .mpo_destroy = ugidfw_destroy, .mpo_init = ugidfw_init, .mpo_system_check_acct = ugidfw_system_check_acct, .mpo_system_check_auditctl = ugidfw_system_check_auditctl, .mpo_system_check_swapon = ugidfw_system_check_swapon, .mpo_vnode_check_access = ugidfw_vnode_check_access, .mpo_vnode_check_chdir = ugidfw_vnode_check_chdir, .mpo_vnode_check_chroot = ugidfw_vnode_check_chroot, .mpo_vnode_check_create = ugidfw_check_create_vnode, .mpo_vnode_check_deleteacl = ugidfw_vnode_check_deleteacl, .mpo_vnode_check_deleteextattr = ugidfw_vnode_check_deleteextattr, .mpo_vnode_check_exec = ugidfw_vnode_check_exec, .mpo_vnode_check_getacl = ugidfw_vnode_check_getacl, .mpo_vnode_check_getextattr = ugidfw_vnode_check_getextattr, .mpo_vnode_check_link = ugidfw_vnode_check_link, .mpo_vnode_check_listextattr = ugidfw_vnode_check_listextattr, .mpo_vnode_check_lookup = ugidfw_vnode_check_lookup, .mpo_vnode_check_open = ugidfw_vnode_check_open, .mpo_vnode_check_readdir = ugidfw_vnode_check_readdir, .mpo_vnode_check_readlink = ugidfw_vnode_check_readdlink, .mpo_vnode_check_rename_from = ugidfw_vnode_check_rename_from, .mpo_vnode_check_rename_to = ugidfw_vnode_check_rename_to, .mpo_vnode_check_revoke = ugidfw_vnode_check_revoke, .mpo_vnode_check_setacl = ugidfw_check_setacl_vnode, .mpo_vnode_check_setextattr = ugidfw_vnode_check_setextattr, .mpo_vnode_check_setflags = ugidfw_vnode_check_setflags, .mpo_vnode_check_setmode = ugidfw_vnode_check_setmode, .mpo_vnode_check_setowner = ugidfw_vnode_check_setowner, .mpo_vnode_check_setutimes = ugidfw_vnode_check_setutimes, .mpo_vnode_check_stat = ugidfw_vnode_check_stat, .mpo_vnode_check_unlink = ugidfw_vnode_check_unlink, }; MAC_POLICY_SET(&ugidfw_ops, mac_bsdextended, "TrustedBSD MAC/BSD Extended", MPC_LOADTIME_FLAG_UNLOADOK, NULL); Index: head/sys/sys/cpuset.h =================================================================== --- head/sys/sys/cpuset.h (revision 192894) +++ head/sys/sys/cpuset.h (revision 192895) @@ -1,191 +1,191 @@ /*- * Copyright (c) 2008, Jeffrey Roberson * All rights reserved. * * Copyright (c) 2008 Nokia Corporation * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. 
Redistributions of source code must retain the above copyright * notice unmodified, this list of conditions, and the following * disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR * IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. * IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, * INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF * THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _SYS_CPUSET_H_ #define _SYS_CPUSET_H_ #ifdef _KERNEL #define CPU_SETSIZE MAXCPU #endif #define CPU_MAXSIZE 128 #ifndef CPU_SETSIZE #define CPU_SETSIZE CPU_MAXSIZE #endif #define _NCPUBITS (sizeof(long) * NBBY) /* bits per mask */ #define _NCPUWORDS howmany(CPU_SETSIZE, _NCPUBITS) typedef struct _cpuset { long __bits[howmany(CPU_SETSIZE, _NCPUBITS)]; } cpuset_t; #define __cpuset_mask(n) ((long)1 << ((n) % _NCPUBITS)) #define CPU_CLR(n, p) ((p)->__bits[(n)/_NCPUBITS] &= ~__cpuset_mask(n)) #define CPU_COPY(f, t) (void)(*(t) = *(f)) #define CPU_ISSET(n, p) (((p)->__bits[(n)/_NCPUBITS] & __cpuset_mask(n)) != 0) #define CPU_SET(n, p) ((p)->__bits[(n)/_NCPUBITS] |= __cpuset_mask(n)) #define CPU_ZERO(p) do { \ __size_t __i; \ for (__i = 0; __i < _NCPUWORDS; __i++) \ (p)->__bits[__i] = 0; \ } while (0) /* Is p empty. */ #define CPU_EMPTY(p) __extension__ ({ \ __size_t __i; \ for (__i = 0; __i < _NCPUWORDS; __i++) \ if ((p)->__bits[__i]) \ break; \ __i == _NCPUWORDS; \ }) /* Is c a subset of p. */ #define CPU_SUBSET(p, c) __extension__ ({ \ __size_t __i; \ for (__i = 0; __i < _NCPUWORDS; __i++) \ if (((c)->__bits[__i] & \ (p)->__bits[__i]) != \ (c)->__bits[__i]) \ break; \ __i == _NCPUWORDS; \ }) /* Are there any common bits between b & c? */ #define CPU_OVERLAP(p, c) __extension__ ({ \ __size_t __i; \ for (__i = 0; __i < _NCPUWORDS; __i++) \ if (((c)->__bits[__i] & \ (p)->__bits[__i]) != 0) \ break; \ __i != _NCPUWORDS; \ }) /* Compare two sets, returns 0 if equal 1 otherwise. */ #define CPU_CMP(p, c) __extension__ ({ \ __size_t __i; \ for (__i = 0; __i < _NCPUWORDS; __i++) \ if (((c)->__bits[__i] != \ (p)->__bits[__i])) \ break; \ __i != _NCPUWORDS; \ }) #define CPU_OR(d, s) do { \ __size_t __i; \ for (__i = 0; __i < _NCPUWORDS; __i++) \ (d)->__bits[__i] |= (s)->__bits[__i]; \ } while (0) #define CPU_AND(d, s) do { \ __size_t __i; \ for (__i = 0; __i < _NCPUWORDS; __i++) \ (d)->__bits[__i] &= (s)->__bits[__i]; \ } while (0) #define CPU_NAND(d, s) do { \ __size_t __i; \ for (__i = 0; __i < _NCPUWORDS; __i++) \ (d)->__bits[__i] &= ~(s)->__bits[__i]; \ } while (0) /* * Valid cpulevel_t values. */ #define CPU_LEVEL_ROOT 1 /* All system cpus. */ #define CPU_LEVEL_CPUSET 2 /* Available cpus for which. */ #define CPU_LEVEL_WHICH 3 /* Actual mask/id for which. */ /* * Valid cpuwhich_t values. */ #define CPU_WHICH_TID 1 /* Specifies a thread id. */ #define CPU_WHICH_PID 2 /* Specifies a process id. 
*/ #define CPU_WHICH_CPUSET 3 /* Specifies a set id. */ #define CPU_WHICH_IRQ 4 /* Specifies an irq #. */ #define CPU_WHICH_JAIL 5 /* Specifies a jail id. */ /* * Reserved cpuset identifiers. */ #define CPUSET_INVALID -1 #define CPUSET_DEFAULT 0 #ifdef _KERNEL LIST_HEAD(setlist, cpuset); /* * cpusets encapsulate cpu binding information for one or more threads. * * a - Accessed with atomics. * s - Set at creation, never modified. Only a ref required to read. * c - Locked internally by a cpuset lock. * * The bitmask is only modified while holding the cpuset lock. It may be * read while only a reference is held but the consumer must be prepared * to deal with inconsistent results. */ struct cpuset { cpuset_t cs_mask; /* bitmask of valid cpus. */ volatile u_int cs_ref; /* (a) Reference count. */ int cs_flags; /* (s) Flags from below. */ cpusetid_t cs_id; /* (s) Id or INVALID. */ struct cpuset *cs_parent; /* (s) Pointer to our parent. */ LIST_ENTRY(cpuset) cs_link; /* (c) All identified sets. */ LIST_ENTRY(cpuset) cs_siblings; /* (c) Sibling set link. */ struct setlist cs_children; /* (c) List of children. */ }; #define CPU_SET_ROOT 0x0001 /* Set is a root set. */ #define CPU_SET_RDONLY 0x0002 /* No modification allowed. */ extern cpuset_t *cpuset_root; +struct prison; struct proc; -struct thread; struct cpuset *cpuset_thread0(void); struct cpuset *cpuset_ref(struct cpuset *); void cpuset_rel(struct cpuset *); int cpuset_setthread(lwpid_t id, cpuset_t *); -int cpuset_create_root(struct thread *, struct cpuset **); +int cpuset_create_root(struct prison *, struct cpuset **); int cpuset_setproc_update_set(struct proc *, struct cpuset *); #else __BEGIN_DECLS int cpuset(cpusetid_t *); int cpuset_setid(cpuwhich_t, id_t, cpusetid_t); int cpuset_getid(cpulevel_t, cpuwhich_t, id_t, cpusetid_t *); int cpuset_getaffinity(cpulevel_t, cpuwhich_t, id_t, size_t, cpuset_t *); int cpuset_setaffinity(cpulevel_t, cpuwhich_t, id_t, size_t, const cpuset_t *); __END_DECLS #endif #endif /* !_SYS_CPUSET_H_ */ Index: head/sys/sys/jail.h =================================================================== --- head/sys/sys/jail.h (revision 192894) +++ head/sys/sys/jail.h (revision 192895) @@ -1,266 +1,345 @@ /*- * Copyright (c) 1999 Poul-Henning Kamp. * Copyright (c) 2009 James Gritton. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #ifndef _SYS_JAIL_H_ #define _SYS_JAIL_H_ #ifdef _KERNEL struct jail_v0 { u_int32_t version; char *path; char *hostname; u_int32_t ip_number; }; #endif struct jail { uint32_t version; char *path; char *hostname; char *jailname; uint32_t ip4s; uint32_t ip6s; struct in_addr *ip4; struct in6_addr *ip6; }; #define JAIL_API_VERSION 2 /* * For all xprison structs, always keep the pr_version an int and * the first variable so userspace can easily distinguish them. */ #ifndef _KERNEL struct xprison_v1 { int pr_version; int pr_id; char pr_path[MAXPATHLEN]; char pr_host[MAXHOSTNAMELEN]; u_int32_t pr_ip; }; #endif struct xprison { int pr_version; int pr_id; int pr_state; cpusetid_t pr_cpusetid; char pr_path[MAXPATHLEN]; char pr_host[MAXHOSTNAMELEN]; char pr_name[MAXHOSTNAMELEN]; uint32_t pr_ip4s; uint32_t pr_ip6s; #if 0 /* * sizeof(xprison) will be malloced + size needed for all * IPv4 and IPv6 addesses. Offsets are based numbers of addresses. */ struct in_addr pr_ip4[]; struct in6_addr pr_ip6[]; #endif }; #define XPRISON_VERSION 3 static const struct prison_state { int pr_state; const char * state_name; } prison_states[] = { #define PRISON_STATE_INVALID 0 { PRISON_STATE_INVALID, "INVALID" }, #define PRISON_STATE_ALIVE 1 { PRISON_STATE_ALIVE, "ALIVE" }, #define PRISON_STATE_DYING 2 { PRISON_STATE_DYING, "DYING" }, }; /* * Flags for jail_set and jail_get. */ #define JAIL_CREATE 0x01 /* Create jail if it doesn't exist */ #define JAIL_UPDATE 0x02 /* Update parameters of existing jail */ #define JAIL_ATTACH 0x04 /* Attach to jail upon creation */ #define JAIL_DYING 0x08 /* Allow getting a dying jail */ #define JAIL_SET_MASK 0x0f #define JAIL_GET_MASK 0x08 #ifndef _KERNEL struct iovec; int jail(struct jail *); int jail_set(struct iovec *, unsigned int, int); int jail_get(struct iovec *, unsigned int, int); int jail_attach(int); int jail_remove(int); #else /* _KERNEL */ #include #include -#include -#include +#include +#include #include #define JAIL_MAX 999999 #ifdef MALLOC_DECLARE MALLOC_DECLARE(M_PRISON); #endif #endif /* _KERNEL */ #if defined(_KERNEL) || defined(_WANT_PRISON) #include -struct cpuset; - /* * This structure describes a prison. It is pointed to by all struct * ucreds's of the inmates. pr_ref keeps track of them and is used to * delete the struture when the last inmate is dead. 
* * Lock key: * (a) allprison_lock * (p) locked by pr_mtx * (c) set only during creation before the structure is shared, no mutex * required to read * (d) set only during destruction of jail, no mutex needed */ struct prison { TAILQ_ENTRY(prison) pr_list; /* (a) all prisons */ int pr_id; /* (c) prison id */ int pr_ref; /* (p) refcount */ int pr_uref; /* (p) user (alive) refcount */ unsigned pr_flags; /* (p) PR_* flags */ char pr_path[MAXPATHLEN]; /* (c) chroot path */ struct cpuset *pr_cpuset; /* (p) cpuset */ struct vnode *pr_root; /* (c) vnode to rdir */ char pr_host[MAXHOSTNAMELEN]; /* (p) jail hostname */ char pr_name[MAXHOSTNAMELEN]; /* (p) admin jail name */ - void *pr_spare; /* was pr_linux */ + struct prison *pr_parent; /* (c) containing jail */ int pr_securelevel; /* (p) securelevel */ struct task pr_task; /* (d) destroy task */ struct mtx pr_mtx; struct osd pr_osd; /* (p) additional data */ int pr_ip4s; /* (p) number of v4 IPs */ struct in_addr *pr_ip4; /* (p) v4 IPs of jail */ int pr_ip6s; /* (p) number of v6 IPs */ struct in6_addr *pr_ip6; /* (p) v6 IPs of jail */ + LIST_HEAD(, prison) pr_children; /* (a) list of child jails */ + LIST_ENTRY(prison) pr_sibling; /* (a) next in parent's list */ + int pr_prisoncount; /* (a) number of child jails */ + unsigned pr_allow; /* (p) PR_ALLOW_* flags */ + int pr_enforce_statfs; /* (p) statfs permission */ }; #endif /* _KERNEL || _WANT_PRISON */ #ifdef _KERNEL -/* - * Flag bits set via options or internally - */ +/* Flag bits set via options */ #define PR_PERSIST 0x00000001 /* Can exist without processes */ +#define PR_IP4_USER 0x00000004 /* Virtualize IPv4 addresses */ +#define PR_IP6_USER 0x00000008 /* Virtualize IPv6 addresses */ + +/* Internal flag bits */ #define PR_REMOVE 0x01000000 /* In process of being removed */ +#define PR_IP4 0x02000000 /* IPv4 virtualized by this jail or */ + /* an ancestor */ +#define PR_IP6 0x04000000 /* IPv6 virtualized by this jail or */ + /* an ancestor */ +/* Flags for pr_allow */ +#define PR_ALLOW_SET_HOSTNAME 0x0001 +#define PR_ALLOW_SYSVIPC 0x0002 +#define PR_ALLOW_RAW_SOCKETS 0x0004 +#define PR_ALLOW_CHFLAGS 0x0008 +#define PR_ALLOW_MOUNT 0x0010 +#define PR_ALLOW_QUOTAS 0x0020 +#define PR_ALLOW_JAILS 0x0040 +#define PR_ALLOW_SOCKET_AF 0x0080 +#define PR_ALLOW_ALL 0x00ff + /* * OSD methods */ #define PR_METHOD_CREATE 0 #define PR_METHOD_GET 1 #define PR_METHOD_SET 2 #define PR_METHOD_CHECK 3 #define PR_METHOD_ATTACH 4 #define PR_MAXMETHOD 5 /* - * Sysctl-set variables that determine global jail policy - * - * XXX MIB entries will need to be protected by a mutex. + * Lock/unlock a prison. + * XXX These exist not so much for general convenience, but to be useable in + * the FOREACH_PRISON_DESCENDANT_LOCKED macro which can't handle them in + * non-function form as currently defined. */ -extern int jail_set_hostname_allowed; -extern int jail_socket_unixiproute_only; -extern int jail_sysvipc_allowed; -extern int jail_getfsstat_jailrootonly; -extern int jail_allow_raw_sockets; -extern int jail_chflags_allowed; +static __inline void +prison_lock(struct prison *pr) +{ + mtx_lock(&pr->pr_mtx); +} + +static __inline void +prison_unlock(struct prison *pr) +{ + + mtx_unlock(&pr->pr_mtx); +} + +/* Traverse a prison's immediate children. */ +#define FOREACH_PRISON_CHILD(ppr, cpr) \ + LIST_FOREACH(cpr, &(ppr)->pr_children, pr_sibling) + +/* + * Preorder traversal of all of a prison's descendants. + * This ugly loop allows the macro to be followed by a single block + * as expected in a looping primitive. 
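 * A minimal usage sketch (assuming the caller already holds allprison_lock,
 * since the child lists are protected by it; the counter is hypothetical):
 *
 *	count = 0;
 *	FOREACH_PRISON_DESCENDANT(pr, cpr, descend)
 *		count++;
 *
 * visits every descendant of pr exactly once, in preorder, without ever
 * running the body for pr itself.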
+ */ +#define FOREACH_PRISON_DESCENDANT(ppr, cpr, descend) \ + for ((cpr) = (ppr), (descend) = 1; \ + ((cpr) = (((descend) && !LIST_EMPTY(&(cpr)->pr_children)) \ + ? LIST_FIRST(&(cpr)->pr_children) \ + : ((cpr) == (ppr) \ + ? NULL \ + : (((descend) = LIST_NEXT(cpr, pr_sibling) != NULL) \ + ? LIST_NEXT(cpr, pr_sibling) \ + : (cpr)->pr_parent))));) \ + if (!(descend)) \ + ; \ + else + +/* + * As above, but lock descendants on the way down and unlock on the way up. + */ +#define FOREACH_PRISON_DESCENDANT_LOCKED(ppr, cpr, descend) \ + for ((cpr) = (ppr), (descend) = 1; \ + ((cpr) = (((descend) && !LIST_EMPTY(&(cpr)->pr_children)) \ + ? LIST_FIRST(&(cpr)->pr_children) \ + : ((cpr) == (ppr) \ + ? NULL \ + : ((prison_unlock(cpr), \ + (descend) = LIST_NEXT(cpr, pr_sibling) != NULL) \ + ? LIST_NEXT(cpr, pr_sibling) \ + : (cpr)->pr_parent))));) \ + if ((descend) ? (prison_lock(cpr), 0) : 1) \ + ; \ + else + +/* + * Attributes of the physical system, and the root of the jail tree. + */ +extern struct prison prison0; + TAILQ_HEAD(prisonlist, prison); extern struct prisonlist allprison; extern struct sx allprison_lock; /* * Sysctls to describe jail parameters. */ SYSCTL_DECL(_security_jail_param); #define SYSCTL_JAIL_PARAM(module, param, type, fmt, descr) \ SYSCTL_PROC(_security_jail_param ## module, OID_AUTO, param, \ (type) | CTLFLAG_MPSAFE, NULL, 0, sysctl_jail_param, fmt, descr) #define SYSCTL_JAIL_PARAM_STRING(module, param, access, len, descr) \ SYSCTL_PROC(_security_jail_param ## module, OID_AUTO, param, \ CTLTYPE_STRING | CTLFLAG_MPSAFE | (access), NULL, len, \ sysctl_jail_param, "A", descr) #define SYSCTL_JAIL_PARAM_STRUCT(module, param, access, len, fmt, descr)\ SYSCTL_PROC(_security_jail_param ## module, OID_AUTO, param, \ CTLTYPE_STRUCT | CTLFLAG_MPSAFE | (access), NULL, len, \ sysctl_jail_param, fmt, descr) #define SYSCTL_JAIL_PARAM_NODE(module, descr) \ SYSCTL_NODE(_security_jail_param, OID_AUTO, module, CTLFLAG_RW, 0, descr) /* * Kernel support functions for jail(). 
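 * Several of these are new with hierarchical jails: prison_flag() tests a
 * PR_* flag on a cred's prison (used earlier in this patch in place of
 * jailed() in the PCB lookups), prison_equal_ip4()/prison_equal_ip6()
 * replace direct cr_prison pointer comparisons, prison_allow() presumably
 * checks the new PR_ALLOW_* bits, and, by their names, prison_find_child(),
 * prison_ischild() and prison_name() deal with the new parent/child
 * relationships.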
*/ struct ucred; struct mount; struct sockaddr; struct statfs; int jailed(struct ucred *cred); void getcredhostname(struct ucred *cred, char *, size_t); +int prison_allow(struct ucred *, unsigned); int prison_check(struct ucred *cred1, struct ucred *cred2); int prison_canseemount(struct ucred *cred, struct mount *mp); void prison_enforce_statfs(struct ucred *cred, struct mount *mp, struct statfs *sp); struct prison *prison_find(int prid); -struct prison *prison_find_name(const char *name); +struct prison *prison_find_child(struct prison *, int); +struct prison *prison_find_name(struct prison *, const char *); +int prison_flag(struct ucred *, unsigned); void prison_free(struct prison *pr); void prison_free_locked(struct prison *pr); void prison_hold(struct prison *pr); void prison_hold_locked(struct prison *pr); void prison_proc_hold(struct prison *); void prison_proc_free(struct prison *); +int prison_ischild(struct prison *, struct prison *); +int prison_equal_ip4(struct prison *, struct prison *); int prison_get_ip4(struct ucred *cred, struct in_addr *ia); int prison_local_ip4(struct ucred *cred, struct in_addr *ia); int prison_remote_ip4(struct ucred *cred, struct in_addr *ia); int prison_check_ip4(struct ucred *cred, struct in_addr *ia); #ifdef INET6 +int prison_equal_ip6(struct prison *, struct prison *); int prison_get_ip6(struct ucred *, struct in6_addr *); int prison_local_ip6(struct ucred *, struct in6_addr *, int); int prison_remote_ip6(struct ucred *, struct in6_addr *); int prison_check_ip6(struct ucred *, struct in6_addr *); #endif int prison_check_af(struct ucred *cred, int af); int prison_if(struct ucred *cred, struct sockaddr *sa); +char *prison_name(struct prison *, struct prison *); int prison_priv_check(struct ucred *cred, int priv); int sysctl_jail_param(struct sysctl_oid *, void *, int , struct sysctl_req *); #endif /* _KERNEL */ #endif /* !_SYS_JAIL_H_ */ Index: head/sys/sys/param.h =================================================================== --- head/sys/sys/param.h (revision 192894) +++ head/sys/sys/param.h (revision 192895) @@ -1,318 +1,318 @@ /*- * Copyright (c) 1982, 1986, 1989, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)param.h 8.3 (Berkeley) 4/4/95 * $FreeBSD$ */ #ifndef _SYS_PARAM_H_ #define _SYS_PARAM_H_ #include #define BSD 199506 /* System version (year & month). */ #define BSD4_3 1 #define BSD4_4 1 /* * __FreeBSD_version numbers are documented in the Porter's Handbook. * If you bump the version for any reason, you should update the documentation * there. * Currently this lives here: * * doc/en_US.ISO8859-1/books/porters-handbook/book.sgml * * scheme is: Rxx * 'R' is 0 if release branch or x.0-CURRENT before RELENG_*_0 * is created, otherwise 1. */ #undef __FreeBSD_version -#define __FreeBSD_version 800090 /* Master, propagated to newvers */ +#define __FreeBSD_version 800091 /* Master, propagated to newvers */ #ifndef LOCORE #include #endif /* * Machine-independent constants (some used in following include files). * Redefined constants are from POSIX 1003.1 limits file. * * MAXCOMLEN should be >= sizeof(ac_comm) (see ) * MAXLOGNAME should be == UT_NAMESIZE+1 (see ) */ #include #define MAXCOMLEN 19 /* max command name remembered */ #define MAXINTERP 32 /* max interpreter file name length */ #define MAXLOGNAME 17 /* max login name length (incl. NUL) */ #define MAXUPRC CHILD_MAX /* max simultaneous processes */ #define NCARGS ARG_MAX /* max bytes for an exec function */ #define NGROUPS NGROUPS_MAX /* max number groups */ #define NOFILE OPEN_MAX /* max open files per process */ #define NOGROUP 65535 /* marker for empty group set member */ #define MAXHOSTNAMELEN 256 /* max hostname size */ #define SPECNAMELEN 63 /* max length of devicename */ /* More types and definitions used throughout the kernel. */ #ifdef _KERNEL #include #include #ifndef LOCORE #include #include #endif #ifndef FALSE #define FALSE 0 #endif #ifndef TRUE #define TRUE 1 #endif #endif #ifndef _KERNEL /* Signals. */ #include #endif /* Machine type dependent parameters. */ #include #ifndef _KERNEL #include #endif #ifndef _NO_NAMESPACE_POLLUTION #ifndef DEV_BSHIFT #define DEV_BSHIFT 9 /* log2(DEV_BSIZE) */ #endif #define DEV_BSIZE (1<>PAGE_SHIFT) #endif /* * btodb() is messy and perhaps slow because `bytes' may be an off_t. We * want to shift an unsigned type to avoid sign extension and we don't * want to widen `bytes' unnecessarily. Assume that the result fits in * a daddr_t. */ #ifndef btodb #define btodb(bytes) /* calculates (bytes / DEV_BSIZE) */ \ (sizeof (bytes) > sizeof(long) \ ? 
(daddr_t)((unsigned long long)(bytes) >> DEV_BSHIFT) \ : (daddr_t)((unsigned long)(bytes) >> DEV_BSHIFT)) #endif #ifndef dbtob #define dbtob(db) /* calculates (db * DEV_BSIZE) */ \ ((off_t)(db) << DEV_BSHIFT) #endif #endif /* _NO_NAMESPACE_POLLUTION */ #define PRIMASK 0x0ff #define PCATCH 0x100 /* OR'd with pri for tsleep to check signals */ #define PDROP 0x200 /* OR'd with pri to stop re-entry of interlock mutex */ #define NZERO 0 /* default "nice" */ #define NBBY 8 /* number of bits in a byte */ #define NBPW sizeof(int) /* number of bytes per word (integer) */ #define CMASK 022 /* default file mask: S_IWGRP|S_IWOTH */ #define NODEV (dev_t)(-1) /* non-existent device */ #define CBLOCK 128 /* Clist block size, must be a power of 2. */ /* Data chars/clist. */ #define CBSIZE (CBLOCK - sizeof(struct cblock *)) #define CROUND (CBLOCK - 1) /* Clist rounding. */ /* * File system parameters and macros. * * MAXBSIZE - Filesystems are made out of blocks of at most MAXBSIZE bytes * per block. MAXBSIZE may be made larger without effecting * any existing filesystems as long as it does not exceed MAXPHYS, * and may be made smaller at the risk of not being able to use * filesystems which require a block size exceeding MAXBSIZE. * * BKVASIZE - Nominal buffer space per buffer, in bytes. BKVASIZE is the * minimum KVM memory reservation the kernel is willing to make. * Filesystems can of course request smaller chunks. Actual * backing memory uses a chunk size of a page (PAGE_SIZE). * * If you make BKVASIZE too small you risk seriously fragmenting * the buffer KVM map which may slow things down a bit. If you * make it too big the kernel will not be able to optimally use * the KVM memory reserved for the buffer cache and will wind * up with too-few buffers. * * The default is 16384, roughly 2x the block size used by a * normal UFS filesystem. */ #define MAXBSIZE 65536 /* must be power of 2 */ #define BKVASIZE 16384 /* must be power of 2 */ #define BKVAMASK (BKVASIZE-1) /* * MAXPATHLEN defines the longest permissible path length after expanding * symbolic links. It is used to allocate a temporary buffer from the buffer * pool in which to do the name expansion, hence should be a power of two, * and must be less than or equal to MAXBSIZE. MAXSYMLINKS defines the * maximum number of symbolic links that may be expanded in a path name. * It should be set high enough to allow all legitimate uses, but halt * infinite loops reasonably quickly. */ #define MAXPATHLEN PATH_MAX #define MAXSYMLINKS 32 /* Bit map related macros. */ #define setbit(a,i) (((unsigned char *)(a))[(i)/NBBY] |= 1<<((i)%NBBY)) #define clrbit(a,i) (((unsigned char *)(a))[(i)/NBBY] &= ~(1<<((i)%NBBY))) #define isset(a,i) \ (((const unsigned char *)(a))[(i)/NBBY] & (1<<((i)%NBBY))) #define isclr(a,i) \ ((((const unsigned char *)(a))[(i)/NBBY] & (1<<((i)%NBBY))) == 0) /* Macros for counting and rounding. */ #ifndef howmany #define howmany(x, y) (((x)+((y)-1))/(y)) #endif #define rounddown(x, y) (((x)/(y))*(y)) #define roundup(x, y) ((((x)+((y)-1))/(y))*(y)) /* to any y */ #define roundup2(x, y) (((x)+((y)-1))&(~((y)-1))) /* if y is powers of two */ #define powerof2(x) ((((x)-1)&(x))==0) /* Macros for min/max. */ #define MIN(a,b) (((a)<(b))?(a):(b)) #define MAX(a,b) (((a)>(b))?(a):(b)) #ifdef _KERNEL /* * Basic byte order function prototypes for non-inline functions. 
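 *
 * For reference, htonl()/htons() convert 32- and 16-bit values from host to
 * network (big-endian) byte order and ntohl()/ntohs() convert back.  On a
 * little-endian machine, for example, htonl(0x11223344) yields 0x44332211;
 * on a big-endian machine all four functions are identity operations.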
*/ #ifndef LOCORE #ifndef _BYTEORDER_PROTOTYPED #define _BYTEORDER_PROTOTYPED __BEGIN_DECLS __uint32_t htonl(__uint32_t); __uint16_t htons(__uint16_t); __uint32_t ntohl(__uint32_t); __uint16_t ntohs(__uint16_t); __END_DECLS #endif #endif #ifndef lint #ifndef _BYTEORDER_FUNC_DEFINED #define _BYTEORDER_FUNC_DEFINED #define htonl(x) __htonl(x) #define htons(x) __htons(x) #define ntohl(x) __ntohl(x) #define ntohs(x) __ntohs(x) #endif /* !_BYTEORDER_FUNC_DEFINED */ #endif /* lint */ #endif /* _KERNEL */ /* * Scale factor for scaled integers used to count %cpu time and load avgs. * * The number of CPU `tick's that map to a unique `%age' can be expressed * by the formula (1 / (2 ^ (FSHIFT - 11))). The maximum load average that * can be calculated (assuming 32 bits) can be closely approximated using * the formula (2 ^ (2 * (16 - FSHIFT))) for (FSHIFT < 15). * * For the scheduler to maintain a 1:1 mapping of CPU `tick' to `%age', * FSHIFT must be at least 11; this gives us a maximum load avg of ~1024. */ #define FSHIFT 11 /* bits to right of fixed binary point */ #define FSCALE (1<> (PAGE_SHIFT - DEV_BSHIFT)) #define ctodb(db) /* calculates pages to devblks */ \ ((db) << (PAGE_SHIFT - DEV_BSHIFT)) /* * Given the pointer x to the member m of the struct s, return * a pointer to the containing structure. */ #define member2struct(s, m, x) \ ((struct s *)(void *)((char *)(x) - offsetof(struct s, m))) #endif /* _SYS_PARAM_H_ */ Index: head/sys/sys/syscallsubr.h =================================================================== --- head/sys/sys/syscallsubr.h (revision 192894) +++ head/sys/sys/syscallsubr.h (revision 192895) @@ -1,226 +1,228 @@ /*- * Copyright (c) 2002 Ian Dowse. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * $FreeBSD$ */ #ifndef _SYS_SYSCALLSUBR_H_ #define _SYS_SYSCALLSUBR_H_ #include #include #include #include #include struct file; struct itimerval; struct image_args; +struct jail; struct mbuf; struct msghdr; struct msqid_ds; struct rlimit; struct rusage; union semun; struct sockaddr; struct stat; struct kevent; struct kevent_copyops; struct sendfile_args; struct thr_param; int kern___getcwd(struct thread *td, u_char *buf, enum uio_seg bufseg, u_int buflen); int kern_accept(struct thread *td, int s, struct sockaddr **name, socklen_t *namelen, struct file **fp); int kern_access(struct thread *td, char *path, enum uio_seg pathseg, int flags); int kern_accessat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int flags, int mode); int kern_adjtime(struct thread *td, struct timeval *delta, struct timeval *olddelta); int kern_alternate_path(struct thread *td, const char *prefix, const char *path, enum uio_seg pathseg, char **pathbuf, int create, int dirfd); int kern_bind(struct thread *td, int fd, struct sockaddr *sa); int kern_chdir(struct thread *td, char *path, enum uio_seg pathseg); int kern_chmod(struct thread *td, char *path, enum uio_seg pathseg, int mode); int kern_chown(struct thread *td, char *path, enum uio_seg pathseg, int uid, int gid); int kern_clock_getres(struct thread *td, clockid_t clock_id, struct timespec *ts); int kern_clock_gettime(struct thread *td, clockid_t clock_id, struct timespec *ats); int kern_clock_settime(struct thread *td, clockid_t clock_id, struct timespec *ats); int kern_close(struct thread *td, int fd); int kern_connect(struct thread *td, int fd, struct sockaddr *sa); int kern_eaccess(struct thread *td, char *path, enum uio_seg pathseg, int flags); int kern_execve(struct thread *td, struct image_args *args, struct mac *mac_p); int kern_fchmodat(struct thread *td, int fd, char *path, enum uio_seg pathseg, mode_t mode, int flag); int kern_fchownat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int uid, int gid, int flag); int kern_fcntl(struct thread *td, int fd, int cmd, intptr_t arg); int kern_fhstatfs(struct thread *td, fhandle_t fh, struct statfs *buf); int kern_fstat(struct thread *td, int fd, struct stat *sbp); int kern_fstatfs(struct thread *td, int fd, struct statfs *buf); int kern_ftruncate(struct thread *td, int fd, off_t length); int kern_futimes(struct thread *td, int fd, struct timeval *tptr, enum uio_seg tptrseg); int kern_getdirentries(struct thread *td, int fd, char *buf, u_int count, long *basep); int kern_getfsstat(struct thread *td, struct statfs **buf, size_t bufsize, enum uio_seg bufseg, int flags); int kern_getgroups(struct thread *td, u_int *ngrp, gid_t *groups); int kern_getitimer(struct thread *, u_int, struct itimerval *); int kern_getpeername(struct thread *td, int fd, struct sockaddr **sa, socklen_t *alen); int kern_getrusage(struct thread *td, int who, struct rusage *rup); int kern_getsockname(struct thread *td, int fd, struct sockaddr **sa, socklen_t *alen); int kern_getsockopt(struct thread *td, int s, int level, int name, void *optval, enum uio_seg valseg, socklen_t *valsize); int kern_ioctl(struct thread *td, int fd, u_long com, caddr_t data); +int kern_jail(struct thread *td, struct jail *j); int kern_jail_get(struct thread *td, struct uio *options, int flags); int kern_jail_set(struct thread *td, struct uio *options, int flags); int kern_kevent(struct thread *td, int fd, int nchanges, int nevents, struct kevent_copyops *k_ops, const struct timespec *timeout); int kern_kldload(struct thread *td, 
const char *file, int *fileid); int kern_kldunload(struct thread *td, int fileid, int flags); int kern_lchown(struct thread *td, char *path, enum uio_seg pathseg, int uid, int gid); int kern_link(struct thread *td, char *path, char *link, enum uio_seg segflg); int kern_linkat(struct thread *td, int fd1, int fd2, char *path1, char *path2, enum uio_seg segflg, int follow); int kern_lstat(struct thread *td, char *path, enum uio_seg pathseg, struct stat *sbp); int kern_lutimes(struct thread *td, char *path, enum uio_seg pathseg, struct timeval *tptr, enum uio_seg tptrseg); int kern_mkdir(struct thread *td, char *path, enum uio_seg segflg, int mode); int kern_mkdirat(struct thread *td, int fd, char *path, enum uio_seg segflg, int mode); int kern_mkfifo(struct thread *td, char *path, enum uio_seg pathseg, int mode); int kern_mkfifoat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int mode); int kern_mknod(struct thread *td, char *path, enum uio_seg pathseg, int mode, int dev); int kern_mknodat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int mode, int dev); int kern_msgctl(struct thread *, int, int, struct msqid_ds *); int kern_msgsnd(struct thread *, int, const void *, size_t, int, long); int kern_msgrcv(struct thread *, int, void *, size_t, long, int, long *); int kern_nanosleep(struct thread *td, struct timespec *rqt, struct timespec *rmt); int kern_open(struct thread *td, char *path, enum uio_seg pathseg, int flags, int mode); int kern_openat(struct thread *td, int fd, char *path, enum uio_seg pathseg, int flags, int mode); int kern_pathconf(struct thread *td, char *path, enum uio_seg pathseg, int name); int kern_pipe(struct thread *td, int fildes[2]); int kern_preadv(struct thread *td, int fd, struct uio *auio, off_t offset); int kern_ptrace(struct thread *td, int req, pid_t pid, void *addr, int data); int kern_pwritev(struct thread *td, int fd, struct uio *auio, off_t offset); int kern_readlink(struct thread *td, char *path, enum uio_seg pathseg, char *buf, enum uio_seg bufseg, size_t count); int kern_readlinkat(struct thread *td, int fd, char *path, enum uio_seg pathseg, char *buf, enum uio_seg bufseg, size_t count); int kern_readv(struct thread *td, int fd, struct uio *auio); int kern_recvit(struct thread *td, int s, struct msghdr *mp, enum uio_seg fromseg, struct mbuf **controlp); int kern_rename(struct thread *td, char *from, char *to, enum uio_seg pathseg); int kern_renameat(struct thread *td, int oldfd, char *old, int newfd, char *new, enum uio_seg pathseg); int kern_rmdir(struct thread *td, char *path, enum uio_seg pathseg); int kern_rmdirat(struct thread *td, int fd, char *path, enum uio_seg pathseg); int kern_sched_rr_get_interval(struct thread *td, pid_t pid, struct timespec *ts); int kern_semctl(struct thread *td, int semid, int semnum, int cmd, union semun *arg, register_t *rval); int kern_select(struct thread *td, int nd, fd_set *fd_in, fd_set *fd_ou, fd_set *fd_ex, struct timeval *tvp); int kern_sendfile(struct thread *td, struct sendfile_args *uap, struct uio *hdr_uio, struct uio *trl_uio, int compat); int kern_sendit(struct thread *td, int s, struct msghdr *mp, int flags, struct mbuf *control, enum uio_seg segflg); int kern_setgroups(struct thread *td, u_int ngrp, gid_t *groups); int kern_setitimer(struct thread *, u_int, struct itimerval *, struct itimerval *); int kern_setrlimit(struct thread *, u_int, struct rlimit *); int kern_setsockopt(struct thread *td, int s, int level, int name, void *optval, enum uio_seg valseg, socklen_t valsize); 
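/*
 * The kern_*() functions declared in this header are the in-kernel back ends
 * of their corresponding system calls: they take arguments that have already
 * been brought into kernel space, so ABI compatibility layers can reuse them
 * without duplicating the syscall logic.  As a rough, hypothetical sketch of
 * how the newly added kern_jail() might be reused (the wrapper and its
 * argument structure are made-up names, shown for illustration only):
 *
 *	static int
 *	compat_jail(struct thread *td, struct compat_jail_args *uap)
 *	{
 *		struct jail j;
 *		int error;
 *
 *		error = copyin(uap->jailp, &j, sizeof(j));
 *		if (error != 0)
 *			return (error);
 *		return (kern_jail(td, &j));
 *	}
 */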
int kern_settimeofday(struct thread *td, struct timeval *tv, struct timezone *tzp); int kern_shmat(struct thread *td, int shmid, const void *shmaddr, int shmflg); int kern_shmctl(struct thread *td, int shmid, int cmd, void *buf, size_t *bufsz); int kern_sigaction(struct thread *td, int sig, struct sigaction *act, struct sigaction *oact, int flags); int kern_sigaltstack(struct thread *td, stack_t *ss, stack_t *oss); int kern_sigprocmask(struct thread *td, int how, sigset_t *set, sigset_t *oset, int old); int kern_sigsuspend(struct thread *td, sigset_t mask); int kern_stat(struct thread *td, char *path, enum uio_seg pathseg, struct stat *sbp); int kern_statat(struct thread *td, int flag, int fd, char *path, enum uio_seg pathseg, struct stat *sbp); int kern_statat_vnhook(struct thread *td, int flag, int fd, char *path, enum uio_seg pathseg, struct stat *sbp, void (*hook)(struct vnode *vp, struct stat *sbp)); int kern_statfs(struct thread *td, char *path, enum uio_seg pathseg, struct statfs *buf); int kern_symlink(struct thread *td, char *path, char *link, enum uio_seg segflg); int kern_symlinkat(struct thread *td, char *path1, int fd, char *path2, enum uio_seg segflg); int kern_thr_new(struct thread *td, struct thr_param *param); int kern_thr_suspend(struct thread *td, struct timespec *tsp); int kern_truncate(struct thread *td, char *path, enum uio_seg pathseg, off_t length); int kern_unlink(struct thread *td, char *path, enum uio_seg pathseg); int kern_unlinkat(struct thread *td, int fd, char *path, enum uio_seg pathseg); int kern_utimes(struct thread *td, char *path, enum uio_seg pathseg, struct timeval *tptr, enum uio_seg tptrseg); int kern_utimesat(struct thread *td, int fd, char *path, enum uio_seg pathseg, struct timeval *tptr, enum uio_seg tptrseg); int kern_wait(struct thread *td, pid_t pid, int *status, int options, struct rusage *rup); int kern_writev(struct thread *td, int fd, struct uio *auio); /* flags for kern_sigaction */ #define KSA_OSIGSET 0x0001 /* uses osigact_t */ #define KSA_FREEBSD4 0x0002 /* uses ucontext4 */ #endif /* !_SYS_SYSCALLSUBR_H_ */ Index: head/sys/sys/systm.h =================================================================== --- head/sys/sys/systm.h (revision 192894) +++ head/sys/sys/systm.h (revision 192895) @@ -1,399 +1,397 @@ /*- * Copyright (c) 1982, 1988, 1991, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. 
* * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)systm.h 8.7 (Berkeley) 3/29/95 * $FreeBSD$ */ #ifndef _SYS_SYSTM_H_ #define _SYS_SYSTM_H_ #include #include #include #include #include #include /* for people using printf mainly */ -extern int securelevel; /* system security level (see init(8)) */ - extern int cold; /* nonzero if we are doing a cold boot */ extern int rebooting; /* boot() has been called. */ extern const char *panicstr; /* panic message */ extern char version[]; /* system version */ extern char copyright[]; /* system copyright */ extern int kstack_pages; /* number of kernel stack pages */ extern int nswap; /* size of swap space */ extern long physmem; /* physical memory */ extern long realmem; /* 'real' memory */ extern char *rootdevnames[2]; /* names of possible root devices */ extern int boothowto; /* reboot flags, from console subsystem */ extern int bootverbose; /* nonzero to print verbose messages */ extern int maxusers; /* system tune hint */ #ifdef INVARIANTS /* The option is always available */ #define KASSERT(exp,msg) do { \ if (__predict_false(!(exp))) \ panic msg; \ } while (0) #define VNASSERT(exp, vp, msg) do { \ if (__predict_false(!(exp))) { \ vn_printf(vp, "VNASSERT failed\n"); \ panic msg; \ } \ } while (0) #else #define KASSERT(exp,msg) do { \ } while (0) #define VNASSERT(exp, vp, msg) do { \ } while (0) #endif #ifndef CTASSERT /* Allow lint to override */ #define CTASSERT(x) _CTASSERT(x, __LINE__) #define _CTASSERT(x, y) __CTASSERT(x, y) #define __CTASSERT(x, y) typedef char __assert ## y[(x) ? 1 : -1] #endif /* * XXX the hints declarations are even more misplaced than most declarations * in this file, since they are needed in one file (per arch) and only used * in two files. * XXX most of these variables should be const. */ extern int osreldate; extern int envmode; extern int hintmode; /* 0 = off. 1 = config, 2 = fallback */ extern int dynamic_kenv; extern struct mtx kenv_lock; extern char *kern_envp; extern char static_env[]; extern char static_hints[]; /* by config for now */ extern char **kenvp; /* * General function declarations. 
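 *
 * (The CTASSERT() macro defined above provides a compile-time assertion by
 * typedef'ing a char array whose size becomes -1, and therefore invalid,
 * when the asserted expression is false; e.g. CTASSERT(sizeof(uint32_t) == 4)
 * is harmless when true and breaks the build if the assumption is violated.)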
*/ struct inpcb; struct lock_object; struct malloc_type; struct mtx; struct proc; struct socket; struct thread; struct tty; struct ucred; struct uio; struct _jmp_buf; int setjmp(struct _jmp_buf *); void longjmp(struct _jmp_buf *, int) __dead2; int dumpstatus(vm_offset_t addr, off_t count); int nullop(void); int eopnotsupp(void); int ureadc(int, struct uio *); void hashdestroy(void *, struct malloc_type *, u_long); void *hashinit(int count, struct malloc_type *type, u_long *hashmark); void *hashinit_flags(int count, struct malloc_type *type, u_long *hashmask, int flags); #define HASH_NOWAIT 0x00000001 #define HASH_WAITOK 0x00000002 void *phashinit(int count, struct malloc_type *type, u_long *nentries); void g_waitidle(void); #ifdef RESTARTABLE_PANICS void panic(const char *, ...) __printflike(1, 2); #else void panic(const char *, ...) __dead2 __printflike(1, 2); #endif void cpu_boot(int); void cpu_flush_dcache(void *, size_t); void cpu_rootconf(void); void critical_enter(void); void critical_exit(void); void init_param1(void); void init_param2(long physpages); void init_param3(long kmempages); void tablefull(const char *); int kvprintf(char const *, void (*)(int, void*), void *, int, __va_list) __printflike(1, 0); void log(int, const char *, ...) __printflike(2, 3); void log_console(struct uio *); int printf(const char *, ...) __printflike(1, 2); int snprintf(char *, size_t, const char *, ...) __printflike(3, 4); int sprintf(char *buf, const char *, ...) __printflike(2, 3); int uprintf(const char *, ...) __printflike(1, 2); int vprintf(const char *, __va_list) __printflike(1, 0); int vsnprintf(char *, size_t, const char *, __va_list) __printflike(3, 0); int vsnrprintf(char *, size_t, int, const char *, __va_list) __printflike(4, 0); int vsprintf(char *buf, const char *, __va_list) __printflike(2, 0); int ttyprintf(struct tty *, const char *, ...) __printflike(2, 3); int sscanf(const char *, char const *, ...) __nonnull(1) __nonnull(2); int vsscanf(const char *, char const *, __va_list) __nonnull(1) __nonnull(2); long strtol(const char *, char **, int) __nonnull(1); u_long strtoul(const char *, char **, int) __nonnull(1); quad_t strtoq(const char *, char **, int) __nonnull(1); u_quad_t strtouq(const char *, char **, int) __nonnull(1); void tprintf(struct proc *p, int pri, const char *, ...) 
__printflike(3, 4); void hexdump(const void *ptr, int length, const char *hdr, int flags); #define HD_COLUMN_MASK 0xff #define HD_DELIM_MASK 0xff00 #define HD_OMIT_COUNT (1 << 16) #define HD_OMIT_HEX (1 << 17) #define HD_OMIT_CHARS (1 << 18) #define ovbcopy(f, t, l) bcopy((f), (t), (l)) void bcopy(const void *from, void *to, size_t len) __nonnull(1) __nonnull(2); void bzero(void *buf, size_t len) __nonnull(1); void *memcpy(void *to, const void *from, size_t len) __nonnull(1) __nonnull(2); void *memmove(void *dest, const void *src, size_t n) __nonnull(1) __nonnull(2); int copystr(const void * __restrict kfaddr, void * __restrict kdaddr, size_t len, size_t * __restrict lencopied) __nonnull(1) __nonnull(2); int copyinstr(const void * __restrict udaddr, void * __restrict kaddr, size_t len, size_t * __restrict lencopied) __nonnull(1) __nonnull(2); int copyin(const void * __restrict udaddr, void * __restrict kaddr, size_t len) __nonnull(1) __nonnull(2); int copyout(const void * __restrict kaddr, void * __restrict udaddr, size_t len) __nonnull(1) __nonnull(2); int fubyte(const void *base); long fuword(const void *base); int fuword16(void *base); int32_t fuword32(const void *base); int64_t fuword64(const void *base); int subyte(void *base, int byte); int suword(void *base, long word); int suword16(void *base, int word); int suword32(void *base, int32_t word); int suword64(void *base, int64_t word); uint32_t casuword32(volatile uint32_t *base, uint32_t oldval, uint32_t newval); u_long casuword(volatile u_long *p, u_long oldval, u_long newval); void realitexpire(void *); int sysbeep(int hertz, int period); void hardclock(int usermode, uintfptr_t pc); void hardclock_cpu(int usermode); void softclock(void *); void statclock(int usermode); void profclock(int usermode, uintfptr_t pc); void startprofclock(struct proc *); void stopprofclock(struct proc *); void cpu_startprofclock(void); void cpu_stopprofclock(void); int cr_cansee(struct ucred *u1, struct ucred *u2); int cr_canseesocket(struct ucred *cred, struct socket *so); int cr_canseeinpcb(struct ucred *cred, struct inpcb *inp); char *getenv(const char *name); void freeenv(char *env); int getenv_int(const char *name, int *data); int getenv_uint(const char *name, unsigned int *data); int getenv_long(const char *name, long *data); int getenv_ulong(const char *name, unsigned long *data); int getenv_string(const char *name, char *data, int size); int getenv_quad(const char *name, quad_t *data); int setenv(const char *name, const char *value); int unsetenv(const char *name); int testenv(const char *name); typedef uint64_t (cpu_tick_f)(void); void set_cputicker(cpu_tick_f *func, uint64_t freq, unsigned var); extern cpu_tick_f *cpu_ticks; uint64_t cpu_tickrate(void); uint64_t cputick2usec(uint64_t tick); #ifdef APM_FIXUP_CALLTODO struct timeval; void adjust_timeout_calltodo(struct timeval *time_change); #endif /* APM_FIXUP_CALLTODO */ #include /* Initialize the world */ void consinit(void); void cpu_initclocks(void); void usrinfoinit(void); /* Finalize the world */ void shutdown_nice(int); /* Timeouts */ typedef void timeout_t(void *); /* timeout function type */ #define CALLOUT_HANDLE_INITIALIZER(handle) \ { NULL } void callout_handle_init(struct callout_handle *); struct callout_handle timeout(timeout_t *, void *, int); void untimeout(timeout_t *, void *, struct callout_handle); caddr_t kern_timeout_callwheel_alloc(caddr_t v); void kern_timeout_callwheel_init(void); /* Stubs for obsolete functions that used to be for interrupt management */ static __inline 
void spl0(void) { return; } static __inline intrmask_t splbio(void) { return 0; } static __inline intrmask_t splcam(void) { return 0; } static __inline intrmask_t splclock(void) { return 0; } static __inline intrmask_t splhigh(void) { return 0; } static __inline intrmask_t splimp(void) { return 0; } static __inline intrmask_t splnet(void) { return 0; } static __inline intrmask_t splsoftcam(void) { return 0; } static __inline intrmask_t splsoftclock(void) { return 0; } static __inline intrmask_t splsofttty(void) { return 0; } static __inline intrmask_t splsoftvm(void) { return 0; } static __inline intrmask_t splsofttq(void) { return 0; } static __inline intrmask_t splstatclock(void) { return 0; } static __inline intrmask_t spltty(void) { return 0; } static __inline intrmask_t splvm(void) { return 0; } static __inline void splx(intrmask_t ipl __unused) { return; } /* * Common `proc' functions are declared here so that proc.h can be included * less often. */ int _sleep(void *chan, struct lock_object *lock, int pri, const char *wmesg, int timo) __nonnull(1); #define msleep(chan, mtx, pri, wmesg, timo) \ _sleep((chan), &(mtx)->lock_object, (pri), (wmesg), (timo)) int msleep_spin(void *chan, struct mtx *mtx, const char *wmesg, int timo) __nonnull(1); int pause(const char *wmesg, int timo); #define tsleep(chan, pri, wmesg, timo) \ _sleep((chan), NULL, (pri), (wmesg), (timo)) void wakeup(void *chan) __nonnull(1); void wakeup_one(void *chan) __nonnull(1); /* * Common `struct cdev *' stuff are declared here to avoid #include poisoning */ struct cdev; dev_t dev2udev(struct cdev *x); const char *devtoname(struct cdev *cdev); int poll_no_poll(int events); /* XXX: Should be void nanodelay(u_int nsec); */ void DELAY(int usec); /* Root mount holdback API */ struct root_hold_token; struct root_hold_token *root_mount_hold(const char *identifier); void root_mount_rel(struct root_hold_token *h); void root_mount_wait(void); int root_mounted(void); /* * Unit number allocation API. (kern/subr_unit.c) */ struct unrhdr; struct unrhdr *new_unrhdr(int low, int high, struct mtx *mutex); void delete_unrhdr(struct unrhdr *uh); void clean_unrhdr(struct unrhdr *uh); void clean_unrhdrl(struct unrhdr *uh); int alloc_unr(struct unrhdr *uh); int alloc_unrl(struct unrhdr *uh); void free_unr(struct unrhdr *uh, u_int item); /* * This is about as magic as it gets. fortune(1) has got similar code * for reversing bits in a word. Who thinks up this stuff?? * * Yes, it does appear to be consistently faster than: * while (i = ffs(m)) { * m >>= i; * bits++; * } * and * while (lsb = (m & -m)) { // This is magic too * m &= ~lsb; // or: m ^= lsb * bits++; * } * Both of these latter forms do some very strange things on gcc-3.1 with * -mcpu=pentiumpro and/or -march=pentiumpro and/or -O or -O2. * There is probably an SSE or MMX popcnt instruction. * * I wonder if this should be in libkern? * * XXX Stop the presses! Another one: * static __inline u_int32_t * popcnt1(u_int32_t v) * { * v -= ((v >> 1) & 0x55555555); * v = (v & 0x33333333) + ((v >> 2) & 0x33333333); * v = (v + (v >> 4)) & 0x0F0F0F0F; * return (v * 0x01010101) >> 24; * } * The downside is that it has a multiply. With a pentium3 with * -mcpu=pentiumpro and -march=pentiumpro then gcc-3.1 will use * an imull, and in that case it is faster. In most other cases * it appears slightly slower. 
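 *
 * A quick hand evaluation of popcnt1() above for v = 7 (binary 0111):
 * the first step gives 7 - 1 = 6 (per-pair bit counts), the second gives
 * 2 + 1 = 3, the masked add leaves 3, and the final multiply-and-shift
 * (3 * 0x01010101) >> 24 sums the per-byte counts to 3, which is indeed
 * the number of set bits in 7.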
* * Another variant (also from fortune): * #define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255) * #define BX_(x) ((x) - (((x)>>1)&0x77777777) \ * - (((x)>>2)&0x33333333) \ * - (((x)>>3)&0x11111111)) */ static __inline uint32_t bitcount32(uint32_t x) { x = (x & 0x55555555) + ((x & 0xaaaaaaaa) >> 1); x = (x & 0x33333333) + ((x & 0xcccccccc) >> 2); x = (x + (x >> 4)) & 0x0f0f0f0f; x = (x + (x >> 8)); x = (x + (x >> 16)) & 0x000000ff; return (x); } #endif /* !_SYS_SYSTM_H_ */ Index: head/sys/ufs/ufs/ufs_vnops.c =================================================================== --- head/sys/ufs/ufs/ufs_vnops.c (revision 192894) +++ head/sys/ufs/ufs/ufs_vnops.c (revision 192895) @@ -1,2541 +1,2540 @@ /*- * Copyright (c) 1982, 1986, 1989, 1993, 1995 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 4. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
* * @(#)ufs_vnops.c 8.27 (Berkeley) 5/27/95 */ #include __FBSDID("$FreeBSD$"); #include "opt_mac.h" #include "opt_quota.h" #include "opt_suiddir.h" #include "opt_ufs.h" #include "opt_ffs.h" #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include -#include #include #include #include /* XXX */ #include #include #include #include #include #include #include #include #include #include #ifdef UFS_DIRHASH #include #endif #ifdef UFS_GJOURNAL #include #endif #include static vop_access_t ufs_access; static int ufs_chmod(struct vnode *, int, struct ucred *, struct thread *); static int ufs_chown(struct vnode *, uid_t, gid_t, struct ucred *, struct thread *); static vop_close_t ufs_close; static vop_create_t ufs_create; static vop_getattr_t ufs_getattr; static vop_link_t ufs_link; static int ufs_makeinode(int mode, struct vnode *, struct vnode **, struct componentname *); static vop_markatime_t ufs_markatime; static vop_mkdir_t ufs_mkdir; static vop_mknod_t ufs_mknod; static vop_open_t ufs_open; static vop_pathconf_t ufs_pathconf; static vop_print_t ufs_print; static vop_readlink_t ufs_readlink; static vop_remove_t ufs_remove; static vop_rename_t ufs_rename; static vop_rmdir_t ufs_rmdir; static vop_setattr_t ufs_setattr; static vop_strategy_t ufs_strategy; static vop_symlink_t ufs_symlink; static vop_whiteout_t ufs_whiteout; static vop_close_t ufsfifo_close; static vop_kqfilter_t ufsfifo_kqfilter; /* * A virgin directory (no blushing please). */ static struct dirtemplate mastertemplate = { 0, 12, DT_DIR, 1, ".", 0, DIRBLKSIZ - 12, DT_DIR, 2, ".." }; static struct odirtemplate omastertemplate = { 0, 12, 1, ".", 0, DIRBLKSIZ - 12, 2, ".." }; static void ufs_itimes_locked(struct vnode *vp) { struct inode *ip; struct timespec ts; ASSERT_VI_LOCKED(vp, __func__); ip = VTOI(vp); if (UFS_RDONLY(ip)) goto out; if ((ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_UPDATE)) == 0) return; if ((vp->v_type == VBLK || vp->v_type == VCHR) && !DOINGSOFTDEP(vp)) ip->i_flag |= IN_LAZYMOD; else if (((vp->v_mount->mnt_kern_flag & (MNTK_SUSPENDED | MNTK_SUSPEND)) == 0) || (ip->i_flag & (IN_CHANGE | IN_UPDATE))) ip->i_flag |= IN_MODIFIED; else if (ip->i_flag & IN_ACCESS) ip->i_flag |= IN_LAZYACCESS; vfs_timestamp(&ts); if (ip->i_flag & IN_ACCESS) { DIP_SET(ip, i_atime, ts.tv_sec); DIP_SET(ip, i_atimensec, ts.tv_nsec); } if (ip->i_flag & IN_UPDATE) { DIP_SET(ip, i_mtime, ts.tv_sec); DIP_SET(ip, i_mtimensec, ts.tv_nsec); } if (ip->i_flag & IN_CHANGE) { DIP_SET(ip, i_ctime, ts.tv_sec); DIP_SET(ip, i_ctimensec, ts.tv_nsec); DIP_SET(ip, i_modrev, DIP(ip, i_modrev) + 1); } out: ip->i_flag &= ~(IN_ACCESS | IN_CHANGE | IN_UPDATE); } void ufs_itimes(struct vnode *vp) { VI_LOCK(vp); ufs_itimes_locked(vp); VI_UNLOCK(vp); } /* * Create a regular file */ static int ufs_create(ap) struct vop_create_args /* { struct vnode *a_dvp; struct vnode **a_vpp; struct componentname *a_cnp; struct vattr *a_vap; } */ *ap; { int error; error = ufs_makeinode(MAKEIMODE(ap->a_vap->va_type, ap->a_vap->va_mode), ap->a_dvp, ap->a_vpp, ap->a_cnp); if (error) return (error); return (0); } /* * Mknod vnode call */ /* ARGSUSED */ static int ufs_mknod(ap) struct vop_mknod_args /* { struct vnode *a_dvp; struct vnode **a_vpp; struct componentname *a_cnp; struct vattr *a_vap; } */ *ap; { struct vattr *vap = ap->a_vap; struct vnode **vpp = ap->a_vpp; struct inode *ip; ino_t ino; int error; error = ufs_makeinode(MAKEIMODE(vap->va_type, vap->va_mode), ap->a_dvp, vpp, 
ap->a_cnp); if (error) return (error); ip = VTOI(*vpp); ip->i_flag |= IN_ACCESS | IN_CHANGE | IN_UPDATE; if (vap->va_rdev != VNOVAL) { /* * Want to be able to use this to make badblock * inodes, so don't truncate the dev number. */ DIP_SET(ip, i_rdev, vap->va_rdev); } /* * Remove inode, then reload it through VFS_VGET so it is * checked to see if it is an alias of an existing entry in * the inode cache. XXX I don't believe this is necessary now. */ (*vpp)->v_type = VNON; ino = ip->i_number; /* Save this before vgone() invalidates ip. */ vgone(*vpp); vput(*vpp); error = VFS_VGET(ap->a_dvp->v_mount, ino, LK_EXCLUSIVE, vpp); if (error) { *vpp = NULL; return (error); } return (0); } /* * Open called. */ /* ARGSUSED */ static int ufs_open(struct vop_open_args *ap) { struct vnode *vp = ap->a_vp; struct inode *ip; if (vp->v_type == VCHR || vp->v_type == VBLK) return (EOPNOTSUPP); ip = VTOI(vp); /* * Files marked append-only must be opened for appending. */ if ((ip->i_flags & APPEND) && (ap->a_mode & (FWRITE | O_APPEND)) == FWRITE) return (EPERM); vnode_create_vobject(vp, DIP(ip, i_size), ap->a_td); return (0); } /* * Close called. * * Update the times on the inode. */ /* ARGSUSED */ static int ufs_close(ap) struct vop_close_args /* { struct vnode *a_vp; int a_fflag; struct ucred *a_cred; struct thread *a_td; } */ *ap; { struct vnode *vp = ap->a_vp; int usecount; VI_LOCK(vp); usecount = vp->v_usecount; if (usecount > 1) ufs_itimes_locked(vp); VI_UNLOCK(vp); return (0); } static int ufs_access(ap) struct vop_access_args /* { struct vnode *a_vp; accmode_t a_accmode; struct ucred *a_cred; struct thread *a_td; } */ *ap; { struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); accmode_t accmode = ap->a_accmode; int error; #ifdef QUOTA int relocked; #endif #ifdef UFS_ACL struct acl *acl; #endif /* * Disallow write attempts on read-only filesystems; * unless the file is a socket, fifo, or a block or * character device resident on the filesystem. */ if (accmode & VWRITE) { switch (vp->v_type) { case VDIR: case VLNK: case VREG: if (vp->v_mount->mnt_flag & MNT_RDONLY) return (EROFS); #ifdef QUOTA /* * Inode is accounted in the quotas only if struct * dquot is attached to it. VOP_ACCESS() is called * from vn_open_cred() and provides a convenient * point to call getinoquota(). */ if (VOP_ISLOCKED(vp) != LK_EXCLUSIVE) { /* * Upgrade vnode lock, since getinoquota() * requires exclusive lock to modify inode. */ relocked = 1; vhold(vp); vn_lock(vp, LK_UPGRADE | LK_RETRY); VI_LOCK(vp); if (vp->v_iflag & VI_DOOMED) { vdropl(vp); error = ENOENT; goto relock; } vdropl(vp); } else relocked = 0; error = getinoquota(ip); relock: if (relocked) vn_lock(vp, LK_DOWNGRADE | LK_RETRY); if (error != 0) return (error); #endif break; default: break; } } /* If immutable bit set, nobody gets to write it. */ if ((accmode & VWRITE) && (ip->i_flags & (IMMUTABLE | SF_SNAPSHOT))) return (EPERM); #ifdef UFS_ACL if ((vp->v_mount->mnt_flag & MNT_ACLS) != 0) { acl = acl_alloc(M_WAITOK); error = VOP_GETACL(vp, ACL_TYPE_ACCESS, acl, ap->a_cred, ap->a_td); switch (error) { case EOPNOTSUPP: error = vaccess(vp->v_type, ip->i_mode, ip->i_uid, ip->i_gid, ap->a_accmode, ap->a_cred, NULL); break; case 0: error = vaccess_acl_posix1e(vp->v_type, ip->i_uid, ip->i_gid, acl, ap->a_accmode, ap->a_cred, NULL); break; default: printf( "ufs_access(): Error retrieving ACL on object (%d).\n", error); /* * XXX: Fall back until debugged. Should * eventually possibly log an error, and return * EPERM for safety. 
*/ error = vaccess(vp->v_type, ip->i_mode, ip->i_uid, ip->i_gid, ap->a_accmode, ap->a_cred, NULL); } acl_free(acl); } else #endif /* !UFS_ACL */ error = vaccess(vp->v_type, ip->i_mode, ip->i_uid, ip->i_gid, ap->a_accmode, ap->a_cred, NULL); return (error); } /* ARGSUSED */ static int ufs_getattr(ap) struct vop_getattr_args /* { struct vnode *a_vp; struct vattr *a_vap; struct ucred *a_cred; } */ *ap; { struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); struct vattr *vap = ap->a_vap; VI_LOCK(vp); ufs_itimes_locked(vp); if (ip->i_ump->um_fstype == UFS1) { vap->va_atime.tv_sec = ip->i_din1->di_atime; vap->va_atime.tv_nsec = ip->i_din1->di_atimensec; } else { vap->va_atime.tv_sec = ip->i_din2->di_atime; vap->va_atime.tv_nsec = ip->i_din2->di_atimensec; } VI_UNLOCK(vp); /* * Copy from inode table */ vap->va_fsid = dev2udev(ip->i_dev); vap->va_fileid = ip->i_number; vap->va_mode = ip->i_mode & ~IFMT; vap->va_nlink = ip->i_effnlink; vap->va_uid = ip->i_uid; vap->va_gid = ip->i_gid; if (ip->i_ump->um_fstype == UFS1) { vap->va_rdev = ip->i_din1->di_rdev; vap->va_size = ip->i_din1->di_size; vap->va_mtime.tv_sec = ip->i_din1->di_mtime; vap->va_mtime.tv_nsec = ip->i_din1->di_mtimensec; vap->va_ctime.tv_sec = ip->i_din1->di_ctime; vap->va_ctime.tv_nsec = ip->i_din1->di_ctimensec; vap->va_bytes = dbtob((u_quad_t)ip->i_din1->di_blocks); vap->va_filerev = ip->i_din1->di_modrev; } else { vap->va_rdev = ip->i_din2->di_rdev; vap->va_size = ip->i_din2->di_size; vap->va_mtime.tv_sec = ip->i_din2->di_mtime; vap->va_mtime.tv_nsec = ip->i_din2->di_mtimensec; vap->va_ctime.tv_sec = ip->i_din2->di_ctime; vap->va_ctime.tv_nsec = ip->i_din2->di_ctimensec; vap->va_birthtime.tv_sec = ip->i_din2->di_birthtime; vap->va_birthtime.tv_nsec = ip->i_din2->di_birthnsec; vap->va_bytes = dbtob((u_quad_t)ip->i_din2->di_blocks); vap->va_filerev = ip->i_din2->di_modrev; } vap->va_flags = ip->i_flags; vap->va_gen = ip->i_gen; vap->va_blocksize = vp->v_mount->mnt_stat.f_iosize; vap->va_type = IFTOVT(ip->i_mode); return (0); } /* * Set attribute vnode op. called from several syscalls */ static int ufs_setattr(ap) struct vop_setattr_args /* { struct vnode *a_vp; struct vattr *a_vap; struct ucred *a_cred; } */ *ap; { struct vattr *vap = ap->a_vap; struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); struct ucred *cred = ap->a_cred; struct thread *td = curthread; int error; /* * Check for unsettable attributes. */ if ((vap->va_type != VNON) || (vap->va_nlink != VNOVAL) || (vap->va_fsid != VNOVAL) || (vap->va_fileid != VNOVAL) || (vap->va_blocksize != VNOVAL) || (vap->va_rdev != VNOVAL) || ((int)vap->va_bytes != VNOVAL) || (vap->va_gen != VNOVAL)) { return (EINVAL); } if (vap->va_flags != VNOVAL) { if (vp->v_mount->mnt_flag & MNT_RDONLY) return (EROFS); /* * Callers may only modify the file flags on objects they * have VADMIN rights for. */ if ((error = VOP_ACCESS(vp, VADMIN, cred, td))) return (error); /* * Unprivileged processes are not permitted to unset system * flags, or modify flags if any system flags are set. * Privileged non-jail processes may not modify system flags * if securelevel > 0 and any existing system flags are set. * Privileged jail processes behave like privileged non-jail * processes if the security.jail.chflags_allowed sysctl is * is non-zero; otherwise, they behave like unprivileged * processes. 
*/ if (!priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0)) { if (ip->i_flags & (SF_NOUNLINK | SF_IMMUTABLE | SF_APPEND)) { error = securelevel_gt(cred, 0); if (error) return (error); } /* Snapshot flag cannot be set or cleared */ if (((vap->va_flags & SF_SNAPSHOT) != 0 && (ip->i_flags & SF_SNAPSHOT) == 0) || ((vap->va_flags & SF_SNAPSHOT) == 0 && (ip->i_flags & SF_SNAPSHOT) != 0)) return (EPERM); ip->i_flags = vap->va_flags; DIP_SET(ip, i_flags, vap->va_flags); } else { if (ip->i_flags & (SF_NOUNLINK | SF_IMMUTABLE | SF_APPEND) || (vap->va_flags & UF_SETTABLE) != vap->va_flags) return (EPERM); ip->i_flags &= SF_SETTABLE; ip->i_flags |= (vap->va_flags & UF_SETTABLE); DIP_SET(ip, i_flags, ip->i_flags); } ip->i_flag |= IN_CHANGE; if (vap->va_flags & (IMMUTABLE | APPEND)) return (0); } if (ip->i_flags & (IMMUTABLE | APPEND)) return (EPERM); /* * Go through the fields and update iff not VNOVAL. */ if (vap->va_uid != (uid_t)VNOVAL || vap->va_gid != (gid_t)VNOVAL) { if (vp->v_mount->mnt_flag & MNT_RDONLY) return (EROFS); if ((error = ufs_chown(vp, vap->va_uid, vap->va_gid, cred, td)) != 0) return (error); } if (vap->va_size != VNOVAL) { /* * XXX most of the following special cases should be in * callers instead of in N filesystems. The VDIR check * mostly already is. */ switch (vp->v_type) { case VDIR: return (EISDIR); case VLNK: case VREG: /* * Truncation should have an effect in these cases. * Disallow it if the filesystem is read-only or * the file is being snapshotted. */ if (vp->v_mount->mnt_flag & MNT_RDONLY) return (EROFS); if ((ip->i_flags & SF_SNAPSHOT) != 0) return (EPERM); break; default: /* * According to POSIX, the result is unspecified * for file types other than regular files, * directories and shared memory objects. We * don't support shared memory objects in the file * system, and have dubious support for truncating * symlinks. Just ignore the request in other cases. */ return (0); } if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL, cred, td)) != 0) return (error); } if (vap->va_atime.tv_sec != VNOVAL || vap->va_mtime.tv_sec != VNOVAL || vap->va_birthtime.tv_sec != VNOVAL) { if (vp->v_mount->mnt_flag & MNT_RDONLY) return (EROFS); if ((ip->i_flags & SF_SNAPSHOT) != 0) return (EPERM); /* * From utimes(2): * If times is NULL, ... The caller must be the owner of * the file, have permission to write the file, or be the * super-user. * If times is non-NULL, ... The caller must be the owner of * the file or be the super-user. * * Possibly for historical reasons, try to use VADMIN in * preference to VWRITE for a NULL timestamp. This means we * will return EACCES in preference to EPERM if neither * check succeeds. 
*/ if (vap->va_vaflags & VA_UTIMES_NULL) { error = VOP_ACCESS(vp, VADMIN, cred, td); if (error) error = VOP_ACCESS(vp, VWRITE, cred, td); } else error = VOP_ACCESS(vp, VADMIN, cred, td); if (error) return (error); if (vap->va_atime.tv_sec != VNOVAL) ip->i_flag |= IN_ACCESS; if (vap->va_mtime.tv_sec != VNOVAL) ip->i_flag |= IN_CHANGE | IN_UPDATE; if (vap->va_birthtime.tv_sec != VNOVAL && ip->i_ump->um_fstype == UFS2) ip->i_flag |= IN_MODIFIED; ufs_itimes(vp); if (vap->va_atime.tv_sec != VNOVAL) { DIP_SET(ip, i_atime, vap->va_atime.tv_sec); DIP_SET(ip, i_atimensec, vap->va_atime.tv_nsec); } if (vap->va_mtime.tv_sec != VNOVAL) { DIP_SET(ip, i_mtime, vap->va_mtime.tv_sec); DIP_SET(ip, i_mtimensec, vap->va_mtime.tv_nsec); } if (vap->va_birthtime.tv_sec != VNOVAL && ip->i_ump->um_fstype == UFS2) { ip->i_din2->di_birthtime = vap->va_birthtime.tv_sec; ip->i_din2->di_birthnsec = vap->va_birthtime.tv_nsec; } error = UFS_UPDATE(vp, 0); if (error) return (error); } error = 0; if (vap->va_mode != (mode_t)VNOVAL) { if (vp->v_mount->mnt_flag & MNT_RDONLY) return (EROFS); if ((ip->i_flags & SF_SNAPSHOT) != 0 && (vap->va_mode & (S_IXUSR | S_IWUSR | S_IXGRP | S_IWGRP | S_IXOTH | S_IWOTH))) return (EPERM); error = ufs_chmod(vp, (int)vap->va_mode, cred, td); } return (error); } /* * Mark this file's access time for update for vfs_mark_atime(). This * is called from execve() and mmap(). */ static int ufs_markatime(ap) struct vop_markatime_args /* { struct vnode *a_vp; } */ *ap; { struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); VI_LOCK(vp); ip->i_flag |= IN_ACCESS; VI_UNLOCK(vp); return (0); } /* * Change the mode on a file. * Inode must be locked before calling. */ static int ufs_chmod(vp, mode, cred, td) struct vnode *vp; int mode; struct ucred *cred; struct thread *td; { struct inode *ip = VTOI(vp); int error; /* * To modify the permissions on a file, must possess VADMIN * for that file. */ if ((error = VOP_ACCESS(vp, VADMIN, cred, td))) return (error); /* * Privileged processes may set the sticky bit on non-directories, * as well as set the setgid bit on a file with a group that the * process is not a member of. Both of these are allowed in * jail(8). */ if (vp->v_type != VDIR && (mode & S_ISTXT)) { if (priv_check_cred(cred, PRIV_VFS_STICKYFILE, 0)) return (EFTYPE); } if (!groupmember(ip->i_gid, cred) && (mode & ISGID)) { error = priv_check_cred(cred, PRIV_VFS_SETGID, 0); if (error) return (error); } ip->i_mode &= ~ALLPERMS; ip->i_mode |= (mode & ALLPERMS); DIP_SET(ip, i_mode, ip->i_mode); ip->i_flag |= IN_CHANGE; return (0); } /* * Perform chown operation on inode ip; * inode must be locked prior to call. */ static int ufs_chown(vp, uid, gid, cred, td) struct vnode *vp; uid_t uid; gid_t gid; struct ucred *cred; struct thread *td; { struct inode *ip = VTOI(vp); uid_t ouid; gid_t ogid; int error = 0; #ifdef QUOTA int i; ufs2_daddr_t change; #endif if (uid == (uid_t)VNOVAL) uid = ip->i_uid; if (gid == (gid_t)VNOVAL) gid = ip->i_gid; /* * To modify the ownership of a file, must possess VADMIN for that * file. */ if ((error = VOP_ACCESS(vp, VADMIN, cred, td))) return (error); /* * To change the owner of a file, or change the group of a file to a * group of which we are not a member, the caller must have * privilege. 
*/ if ((uid != ip->i_uid || (gid != ip->i_gid && !groupmember(gid, cred))) && (error = priv_check_cred(cred, PRIV_VFS_CHOWN, 0))) return (error); ogid = ip->i_gid; ouid = ip->i_uid; #ifdef QUOTA if ((error = getinoquota(ip)) != 0) return (error); if (ouid == uid) { dqrele(vp, ip->i_dquot[USRQUOTA]); ip->i_dquot[USRQUOTA] = NODQUOT; } if (ogid == gid) { dqrele(vp, ip->i_dquot[GRPQUOTA]); ip->i_dquot[GRPQUOTA] = NODQUOT; } change = DIP(ip, i_blocks); (void) chkdq(ip, -change, cred, CHOWN); (void) chkiq(ip, -1, cred, CHOWN); for (i = 0; i < MAXQUOTAS; i++) { dqrele(vp, ip->i_dquot[i]); ip->i_dquot[i] = NODQUOT; } #endif ip->i_gid = gid; DIP_SET(ip, i_gid, gid); ip->i_uid = uid; DIP_SET(ip, i_uid, uid); #ifdef QUOTA if ((error = getinoquota(ip)) == 0) { if (ouid == uid) { dqrele(vp, ip->i_dquot[USRQUOTA]); ip->i_dquot[USRQUOTA] = NODQUOT; } if (ogid == gid) { dqrele(vp, ip->i_dquot[GRPQUOTA]); ip->i_dquot[GRPQUOTA] = NODQUOT; } if ((error = chkdq(ip, change, cred, CHOWN)) == 0) { if ((error = chkiq(ip, 1, cred, CHOWN)) == 0) goto good; else (void) chkdq(ip, -change, cred, CHOWN|FORCE); } for (i = 0; i < MAXQUOTAS; i++) { dqrele(vp, ip->i_dquot[i]); ip->i_dquot[i] = NODQUOT; } } ip->i_gid = ogid; DIP_SET(ip, i_gid, ogid); ip->i_uid = ouid; DIP_SET(ip, i_uid, ouid); if (getinoquota(ip) == 0) { if (ouid == uid) { dqrele(vp, ip->i_dquot[USRQUOTA]); ip->i_dquot[USRQUOTA] = NODQUOT; } if (ogid == gid) { dqrele(vp, ip->i_dquot[GRPQUOTA]); ip->i_dquot[GRPQUOTA] = NODQUOT; } (void) chkdq(ip, change, cred, FORCE|CHOWN); (void) chkiq(ip, 1, cred, FORCE|CHOWN); (void) getinoquota(ip); } return (error); good: if (getinoquota(ip)) panic("ufs_chown: lost quota"); #endif /* QUOTA */ ip->i_flag |= IN_CHANGE; if ((ip->i_mode & (ISUID | ISGID)) && (ouid != uid || ogid != gid)) { if (priv_check_cred(cred, PRIV_VFS_RETAINSUGID, 0)) { ip->i_mode &= ~(ISUID | ISGID); DIP_SET(ip, i_mode, ip->i_mode); } } return (0); } static int ufs_remove(ap) struct vop_remove_args /* { struct vnode *a_dvp; struct vnode *a_vp; struct componentname *a_cnp; } */ *ap; { struct inode *ip; struct vnode *vp = ap->a_vp; struct vnode *dvp = ap->a_dvp; int error; struct thread *td; td = curthread; ip = VTOI(vp); if ((ip->i_flags & (NOUNLINK | IMMUTABLE | APPEND)) || (VTOI(dvp)->i_flags & APPEND)) { error = EPERM; goto out; } #ifdef UFS_GJOURNAL ufs_gjournal_orphan(vp); #endif error = ufs_dirremove(dvp, ip, ap->a_cnp->cn_flags, 0); if (ip->i_nlink <= 0) vp->v_vflag |= VV_NOSYNC; if ((ip->i_flags & SF_SNAPSHOT) != 0) { /* * Avoid deadlock where another thread is trying to * update the inodeblock for dvp and is waiting on * snaplk. Temporary unlock the vnode lock for the * unlinked file and sync the directory. This should * allow vput() of the directory to not block later on * while holding the snapshot vnode locked, assuming * that the directory hasn't been unlinked too. 
*/ VOP_UNLOCK(vp, 0); (void) VOP_FSYNC(dvp, MNT_WAIT, td); vn_lock(vp, LK_EXCLUSIVE | LK_RETRY); } out: return (error); } /* * link vnode call */ static int ufs_link(ap) struct vop_link_args /* { struct vnode *a_tdvp; struct vnode *a_vp; struct componentname *a_cnp; } */ *ap; { struct vnode *vp = ap->a_vp; struct vnode *tdvp = ap->a_tdvp; struct componentname *cnp = ap->a_cnp; struct inode *ip; struct direct newdir; int error; #ifdef INVARIANTS if ((cnp->cn_flags & HASBUF) == 0) panic("ufs_link: no name"); #endif if (tdvp->v_mount != vp->v_mount) { error = EXDEV; goto out; } ip = VTOI(vp); if ((nlink_t)ip->i_nlink >= LINK_MAX) { error = EMLINK; goto out; } if (ip->i_flags & (IMMUTABLE | APPEND)) { error = EPERM; goto out; } ip->i_effnlink++; ip->i_nlink++; DIP_SET(ip, i_nlink, ip->i_nlink); ip->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(vp)) softdep_change_linkcnt(ip); error = UFS_UPDATE(vp, !(DOINGSOFTDEP(vp) | DOINGASYNC(vp))); if (!error) { ufs_makedirentry(ip, cnp, &newdir); error = ufs_direnter(tdvp, vp, &newdir, cnp, NULL); } if (error) { ip->i_effnlink--; ip->i_nlink--; DIP_SET(ip, i_nlink, ip->i_nlink); ip->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(vp)) softdep_change_linkcnt(ip); } out: return (error); } /* * whiteout vnode call */ static int ufs_whiteout(ap) struct vop_whiteout_args /* { struct vnode *a_dvp; struct componentname *a_cnp; int a_flags; } */ *ap; { struct vnode *dvp = ap->a_dvp; struct componentname *cnp = ap->a_cnp; struct direct newdir; int error = 0; switch (ap->a_flags) { case LOOKUP: /* 4.4 format directories support whiteout operations */ if (dvp->v_mount->mnt_maxsymlinklen > 0) return (0); return (EOPNOTSUPP); case CREATE: /* create a new directory whiteout */ #ifdef INVARIANTS if ((cnp->cn_flags & SAVENAME) == 0) panic("ufs_whiteout: missing name"); if (dvp->v_mount->mnt_maxsymlinklen <= 0) panic("ufs_whiteout: old format filesystem"); #endif newdir.d_ino = WINO; newdir.d_namlen = cnp->cn_namelen; bcopy(cnp->cn_nameptr, newdir.d_name, (unsigned)cnp->cn_namelen + 1); newdir.d_type = DT_WHT; error = ufs_direnter(dvp, NULL, &newdir, cnp, NULL); break; case DELETE: /* remove an existing directory whiteout */ #ifdef INVARIANTS if (dvp->v_mount->mnt_maxsymlinklen <= 0) panic("ufs_whiteout: old format filesystem"); #endif cnp->cn_flags &= ~DOWHITEOUT; error = ufs_dirremove(dvp, NULL, cnp->cn_flags, 0); break; default: panic("ufs_whiteout: unknown op"); } return (error); } /* * Rename system call. * rename("foo", "bar"); * is essentially * unlink("bar"); * link("foo", "bar"); * unlink("foo"); * but ``atomically''. Can't do full commit without saving state in the * inode on disk which isn't feasible at this time. Best we can do is * always guarantee the target exists. * * Basic algorithm is: * * 1) Bump link count on source while we're linking it to the * target. This also ensure the inode won't be deleted out * from underneath us while we work (it may be truncated by * a concurrent `trunc' or `open' for creation). * 2) Link source to destination. If destination already exists, * delete it first. * 3) Unlink source reference to inode if still around. If a * directory was moved and the parent of the destination * is different from the source, patch the ".." entry in the * directory. 
*/ static int ufs_rename(ap) struct vop_rename_args /* { struct vnode *a_fdvp; struct vnode *a_fvp; struct componentname *a_fcnp; struct vnode *a_tdvp; struct vnode *a_tvp; struct componentname *a_tcnp; } */ *ap; { struct vnode *tvp = ap->a_tvp; struct vnode *tdvp = ap->a_tdvp; struct vnode *fvp = ap->a_fvp; struct vnode *fdvp = ap->a_fdvp; struct componentname *tcnp = ap->a_tcnp; struct componentname *fcnp = ap->a_fcnp; struct thread *td = fcnp->cn_thread; struct inode *ip, *xp, *dp; struct direct newdir; int doingdirectory = 0, oldparent = 0, newparent = 0; int error = 0, ioflag; ino_t fvp_ino; #ifdef INVARIANTS if ((tcnp->cn_flags & HASBUF) == 0 || (fcnp->cn_flags & HASBUF) == 0) panic("ufs_rename: no name"); #endif /* * Check for cross-device rename. */ if ((fvp->v_mount != tdvp->v_mount) || (tvp && (fvp->v_mount != tvp->v_mount))) { error = EXDEV; abortit: if (tdvp == tvp) vrele(tdvp); else vput(tdvp); if (tvp) vput(tvp); vrele(fdvp); vrele(fvp); return (error); } if (tvp && ((VTOI(tvp)->i_flags & (NOUNLINK | IMMUTABLE | APPEND)) || (VTOI(tdvp)->i_flags & APPEND))) { error = EPERM; goto abortit; } /* * Renaming a file to itself has no effect. The upper layers should * not call us in that case. Temporarily just warn if they do. */ if (fvp == tvp) { printf("ufs_rename: fvp == tvp (can't happen)\n"); error = 0; goto abortit; } if ((error = vn_lock(fvp, LK_EXCLUSIVE)) != 0) goto abortit; dp = VTOI(fdvp); ip = VTOI(fvp); if (ip->i_nlink >= LINK_MAX) { VOP_UNLOCK(fvp, 0); error = EMLINK; goto abortit; } if ((ip->i_flags & (NOUNLINK | IMMUTABLE | APPEND)) || (dp->i_flags & APPEND)) { VOP_UNLOCK(fvp, 0); error = EPERM; goto abortit; } if ((ip->i_mode & IFMT) == IFDIR) { /* * Avoid ".", "..", and aliases of "." for obvious reasons. */ if ((fcnp->cn_namelen == 1 && fcnp->cn_nameptr[0] == '.') || dp == ip || (fcnp->cn_flags | tcnp->cn_flags) & ISDOTDOT || (ip->i_flag & IN_RENAME)) { VOP_UNLOCK(fvp, 0); error = EINVAL; goto abortit; } ip->i_flag |= IN_RENAME; oldparent = dp->i_number; doingdirectory = 1; } vrele(fdvp); /* * When the target exists, both the directory * and target vnodes are returned locked. */ dp = VTOI(tdvp); xp = NULL; if (tvp) xp = VTOI(tvp); /* * 1) Bump link count while we're moving stuff * around. If we crash somewhere before * completing our work, the link count * may be wrong, but correctable. */ ip->i_effnlink++; ip->i_nlink++; DIP_SET(ip, i_nlink, ip->i_nlink); ip->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(fvp)) softdep_change_linkcnt(ip); if ((error = UFS_UPDATE(fvp, !(DOINGSOFTDEP(fvp) | DOINGASYNC(fvp)))) != 0) { VOP_UNLOCK(fvp, 0); goto bad; } /* * If ".." must be changed (ie the directory gets a new * parent) then the source directory must not be in the * directory hierarchy above the target, as this would * orphan everything below the source directory. Also * the user must have write permission in the source so * as to be able to change "..". We must repeat the call * to namei, as the parent directory is unlocked by the * call to checkpath(). 
*/ error = VOP_ACCESS(fvp, VWRITE, tcnp->cn_cred, tcnp->cn_thread); fvp_ino = ip->i_number; VOP_UNLOCK(fvp, 0); if (oldparent != dp->i_number) newparent = dp->i_number; if (doingdirectory && newparent) { if (error) /* write access check above */ goto bad; if (xp != NULL) vput(tvp); error = ufs_checkpath(fvp_ino, dp, tcnp->cn_cred); if (error) goto out; if ((tcnp->cn_flags & SAVESTART) == 0) panic("ufs_rename: lost to startdir"); VREF(tdvp); error = relookup(tdvp, &tvp, tcnp); if (error) goto out; vrele(tdvp); dp = VTOI(tdvp); xp = NULL; if (tvp) xp = VTOI(tvp); } /* * 2) If target doesn't exist, link the target * to the source and unlink the source. * Otherwise, rewrite the target directory * entry to reference the source inode and * expunge the original entry's existence. */ if (xp == NULL) { if (dp->i_dev != ip->i_dev) panic("ufs_rename: EXDEV"); /* * Account for ".." in new directory. * When source and destination have the same * parent we don't fool with the link count. */ if (doingdirectory && newparent) { if ((nlink_t)dp->i_nlink >= LINK_MAX) { error = EMLINK; goto bad; } dp->i_effnlink++; dp->i_nlink++; DIP_SET(dp, i_nlink, dp->i_nlink); dp->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(tdvp)) softdep_change_linkcnt(dp); error = UFS_UPDATE(tdvp, !(DOINGSOFTDEP(tdvp) | DOINGASYNC(tdvp))); if (error) goto bad; } ufs_makedirentry(ip, tcnp, &newdir); error = ufs_direnter(tdvp, NULL, &newdir, tcnp, NULL); if (error) { if (doingdirectory && newparent) { dp->i_effnlink--; dp->i_nlink--; DIP_SET(dp, i_nlink, dp->i_nlink); dp->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(tdvp)) softdep_change_linkcnt(dp); (void)UFS_UPDATE(tdvp, 1); } goto bad; } vput(tdvp); } else { if (xp->i_dev != dp->i_dev || xp->i_dev != ip->i_dev) panic("ufs_rename: EXDEV"); /* * Short circuit rename(foo, foo). */ if (xp->i_number == ip->i_number) panic("ufs_rename: same file"); /* * If the parent directory is "sticky", then the caller * must possess VADMIN for the parent directory, or the * destination of the rename. This implements append-only * directories. */ if ((dp->i_mode & S_ISTXT) && VOP_ACCESS(tdvp, VADMIN, tcnp->cn_cred, td) && VOP_ACCESS(tvp, VADMIN, tcnp->cn_cred, td)) { error = EPERM; goto bad; } /* * Target must be empty if a directory and have no links * to it. Also, ensure source and target are compatible * (both directories, or both not directories). */ if ((xp->i_mode&IFMT) == IFDIR) { if ((xp->i_effnlink > 2) || !ufs_dirempty(xp, dp->i_number, tcnp->cn_cred)) { error = ENOTEMPTY; goto bad; } if (!doingdirectory) { error = ENOTDIR; goto bad; } cache_purge(tdvp); } else if (doingdirectory) { error = EISDIR; goto bad; } error = ufs_dirrewrite(dp, xp, ip->i_number, IFTODT(ip->i_mode), (doingdirectory && newparent) ? newparent : doingdirectory); if (error) goto bad; if (doingdirectory) { if (!newparent) { dp->i_effnlink--; if (DOINGSOFTDEP(tdvp)) softdep_change_linkcnt(dp); } xp->i_effnlink--; if (DOINGSOFTDEP(tvp)) softdep_change_linkcnt(xp); } if (doingdirectory && !DOINGSOFTDEP(tvp)) { /* * Truncate inode. The only stuff left in the directory * is "." and "..". The "." reference is inconsequential * since we are quashing it. We have removed the "." * reference and the reference in the parent directory, * but there may be other hard links. The soft * dependency code will arrange to do these operations * after the parent directory entry has been deleted on * disk, so when running with that code we avoid doing * them now. 
*/ if (!newparent) { dp->i_nlink--; DIP_SET(dp, i_nlink, dp->i_nlink); dp->i_flag |= IN_CHANGE; } xp->i_nlink--; DIP_SET(xp, i_nlink, xp->i_nlink); xp->i_flag |= IN_CHANGE; ioflag = IO_NORMAL; if (!DOINGASYNC(tvp)) ioflag |= IO_SYNC; if ((error = UFS_TRUNCATE(tvp, (off_t)0, ioflag, tcnp->cn_cred, tcnp->cn_thread)) != 0) goto bad; } vput(tdvp); vput(tvp); xp = NULL; } /* * 3) Unlink the source. */ fcnp->cn_flags &= ~MODMASK; fcnp->cn_flags |= LOCKPARENT | LOCKLEAF; if ((fcnp->cn_flags & SAVESTART) == 0) panic("ufs_rename: lost from startdir"); VREF(fdvp); error = relookup(fdvp, &fvp, fcnp); if (error == 0) vrele(fdvp); if (fvp != NULL) { xp = VTOI(fvp); dp = VTOI(fdvp); } else { /* * From name has disappeared. IN_RENAME is not sufficient * to protect against directory races due to timing windows, * so we have to remove the panic. XXX the only real way * to solve this issue is at a much higher level. By the * time we hit ufs_rename() it's too late. */ #if 0 if (doingdirectory) panic("ufs_rename: lost dir entry"); #endif vrele(ap->a_fvp); return (0); } /* * Ensure that the directory entry still exists and has not * changed while the new name has been entered. If the source is * a file then the entry may have been unlinked or renamed. In * either case there is no further work to be done. If the source * is a directory then it cannot have been rmdir'ed; the IN_RENAME * flag ensures that it cannot be moved by another rename or removed * by a rmdir. */ if (xp != ip) { /* * From name resolves to a different inode. IN_RENAME is * not sufficient protection against timing window races * so we can't panic here. XXX the only real way * to solve this issue is at a much higher level. By the * time we hit ufs_rename() it's too late. */ #if 0 if (doingdirectory) panic("ufs_rename: lost dir entry"); #endif } else { /* * If the source is a directory with a * new parent, the link count of the old * parent directory must be decremented * and ".." set to point to the new parent. */ if (doingdirectory && newparent) { xp->i_offset = mastertemplate.dot_reclen; ufs_dirrewrite(xp, dp, newparent, DT_DIR, 0); cache_purge(fdvp); } error = ufs_dirremove(fdvp, xp, fcnp->cn_flags, 0); xp->i_flag &= ~IN_RENAME; } if (dp) vput(fdvp); if (xp) vput(fvp); vrele(ap->a_fvp); return (error); bad: if (xp) vput(ITOV(xp)); vput(ITOV(dp)); out: if (doingdirectory) ip->i_flag &= ~IN_RENAME; if (vn_lock(fvp, LK_EXCLUSIVE) == 0) { ip->i_effnlink--; ip->i_nlink--; DIP_SET(ip, i_nlink, ip->i_nlink); ip->i_flag |= IN_CHANGE; ip->i_flag &= ~IN_RENAME; if (DOINGSOFTDEP(fvp)) softdep_change_linkcnt(ip); vput(fvp); } else vrele(fvp); return (error); } /* * Mkdir system call */ static int ufs_mkdir(ap) struct vop_mkdir_args /* { struct vnode *a_dvp; struct vnode **a_vpp; struct componentname *a_cnp; struct vattr *a_vap; } */ *ap; { struct vnode *dvp = ap->a_dvp; struct vattr *vap = ap->a_vap; struct componentname *cnp = ap->a_cnp; struct inode *ip, *dp; struct vnode *tvp; struct buf *bp; struct dirtemplate dirtemplate, *dtp; struct direct newdir; #ifdef UFS_ACL struct acl *acl, *dacl; #endif int error, dmode; long blkoff; #ifdef INVARIANTS if ((cnp->cn_flags & HASBUF) == 0) panic("ufs_mkdir: no name"); #endif dp = VTOI(dvp); if ((nlink_t)dp->i_nlink >= LINK_MAX) { error = EMLINK; goto out; } dmode = vap->va_mode & 0777; dmode |= IFDIR; /* * Must simulate part of ufs_makeinode here to acquire the inode, * but not have it entered in the parent directory. The entry is * made later after writing "." and ".." entries. 
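 *
 * Condensed sketch of the ordering implemented below (orientation
 * only; arguments abbreviated):
 *
 *	UFS_VALLOC(dvp, dmode, ...);		allocate the new inode
 *	...set mode/uid/gid and link counts...
 *	UFS_BALLOC(tvp, ...);			allocate the first block
 *	bcopy(&dirtemplate, bp->b_data, ...);	write "." and ".."
 *	UFS_UPDATE(tvp, ...);			push the inode to disk
 *	ufs_direnter(dvp, tvp, &newdir, ...);	only then enter the name
 *						in the parent directory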
*/ error = UFS_VALLOC(dvp, dmode, cnp->cn_cred, &tvp); if (error) goto out; ip = VTOI(tvp); ip->i_gid = dp->i_gid; DIP_SET(ip, i_gid, dp->i_gid); #ifdef SUIDDIR { #ifdef QUOTA struct ucred ucred, *ucp; ucp = cnp->cn_cred; #endif /* * If we are hacking owners here, (only do this where told to) * and we are not giving it TO root, (would subvert quotas) * then go ahead and give it to the other user. * The new directory also inherits the SUID bit. * If user's UID and dir UID are the same, * 'give it away' so that the SUID is still forced on. */ if ((dvp->v_mount->mnt_flag & MNT_SUIDDIR) && (dp->i_mode & ISUID) && dp->i_uid) { dmode |= ISUID; ip->i_uid = dp->i_uid; DIP_SET(ip, i_uid, dp->i_uid); #ifdef QUOTA if (dp->i_uid != cnp->cn_cred->cr_uid) { /* * Make sure the correct user gets charged * for the space. * Make a dummy credential for the victim. * XXX This seems to never be accessed out of * our context so a stack variable is ok. */ refcount_init(&ucred.cr_ref, 1); ucred.cr_uid = ip->i_uid; ucred.cr_ngroups = 1; ucred.cr_groups[0] = dp->i_gid; ucp = &ucred; } #endif } else { ip->i_uid = cnp->cn_cred->cr_uid; DIP_SET(ip, i_uid, ip->i_uid); } #ifdef QUOTA if ((error = getinoquota(ip)) || (error = chkiq(ip, 1, ucp, 0))) { UFS_VFREE(tvp, ip->i_number, dmode); vput(tvp); return (error); } #endif } #else /* !SUIDDIR */ ip->i_uid = cnp->cn_cred->cr_uid; DIP_SET(ip, i_uid, ip->i_uid); #ifdef QUOTA if ((error = getinoquota(ip)) || (error = chkiq(ip, 1, cnp->cn_cred, 0))) { UFS_VFREE(tvp, ip->i_number, dmode); vput(tvp); return (error); } #endif #endif /* !SUIDDIR */ ip->i_flag |= IN_ACCESS | IN_CHANGE | IN_UPDATE; #ifdef UFS_ACL acl = dacl = NULL; if ((dvp->v_mount->mnt_flag & MNT_ACLS) != 0) { acl = acl_alloc(M_WAITOK); dacl = acl_alloc(M_WAITOK); /* * Retrieve default ACL from parent, if any. */ error = VOP_GETACL(dvp, ACL_TYPE_DEFAULT, acl, cnp->cn_cred, cnp->cn_thread); switch (error) { case 0: /* * Retrieved a default ACL, so merge mode and ACL if * necessary. If the ACL is empty, fall through to * the "not defined or available" case. */ if (acl->acl_cnt != 0) { dmode = acl_posix1e_newfilemode(dmode, acl); ip->i_mode = dmode; DIP_SET(ip, i_mode, dmode); *dacl = *acl; ufs_sync_acl_from_inode(ip, acl); break; } /* FALLTHROUGH */ case EOPNOTSUPP: /* * Just use the mode as-is. */ ip->i_mode = dmode; DIP_SET(ip, i_mode, dmode); acl_free(acl); acl_free(dacl); dacl = acl = NULL; break; default: UFS_VFREE(tvp, ip->i_number, dmode); vput(tvp); acl_free(acl); acl_free(dacl); return (error); } } else { #endif /* !UFS_ACL */ ip->i_mode = dmode; DIP_SET(ip, i_mode, dmode); #ifdef UFS_ACL } #endif tvp->v_type = VDIR; /* Rest init'd in getnewvnode(). */ ip->i_effnlink = 2; ip->i_nlink = 2; DIP_SET(ip, i_nlink, 2); if (DOINGSOFTDEP(tvp)) softdep_change_linkcnt(ip); if (cnp->cn_flags & ISWHITEOUT) { ip->i_flags |= UF_OPAQUE; DIP_SET(ip, i_flags, ip->i_flags); } /* * Bump link count in parent directory to reflect work done below. * Should be done before reference is created so cleanup is * possible if we crash. 
*/ dp->i_effnlink++; dp->i_nlink++; DIP_SET(dp, i_nlink, dp->i_nlink); dp->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(dvp)) softdep_change_linkcnt(dp); error = UFS_UPDATE(tvp, !(DOINGSOFTDEP(dvp) | DOINGASYNC(dvp))); if (error) goto bad; #ifdef MAC if (dvp->v_mount->mnt_flag & MNT_MULTILABEL) { error = mac_vnode_create_extattr(cnp->cn_cred, dvp->v_mount, dvp, tvp, cnp); if (error) goto bad; } #endif #ifdef UFS_ACL if (acl != NULL) { /* * XXX: If we abort now, will Soft Updates notify the extattr * code that the EAs for the file need to be released? */ error = VOP_SETACL(tvp, ACL_TYPE_ACCESS, acl, cnp->cn_cred, cnp->cn_thread); if (error == 0) error = VOP_SETACL(tvp, ACL_TYPE_DEFAULT, dacl, cnp->cn_cred, cnp->cn_thread); switch (error) { case 0: break; case EOPNOTSUPP: /* * XXX: This should not happen, as EOPNOTSUPP above * was supposed to free acl. */ printf("ufs_mkdir: VOP_GETACL() but no VOP_SETACL()\n"); /* panic("ufs_mkdir: VOP_GETACL() but no VOP_SETACL()"); */ break; default: acl_free(acl); acl_free(dacl); dacl = acl = NULL; goto bad; } acl_free(acl); acl_free(dacl); dacl = acl = NULL; } #endif /* !UFS_ACL */ /* * Initialize directory with "." and ".." from static template. */ if (dvp->v_mount->mnt_maxsymlinklen > 0) dtp = &mastertemplate; else dtp = (struct dirtemplate *)&omastertemplate; dirtemplate = *dtp; dirtemplate.dot_ino = ip->i_number; dirtemplate.dotdot_ino = dp->i_number; if ((error = UFS_BALLOC(tvp, (off_t)0, DIRBLKSIZ, cnp->cn_cred, BA_CLRBUF, &bp)) != 0) goto bad; ip->i_size = DIRBLKSIZ; DIP_SET(ip, i_size, DIRBLKSIZ); ip->i_flag |= IN_CHANGE | IN_UPDATE; vnode_pager_setsize(tvp, (u_long)ip->i_size); bcopy((caddr_t)&dirtemplate, (caddr_t)bp->b_data, sizeof dirtemplate); if (DOINGSOFTDEP(tvp)) { /* * Ensure that the entire newly allocated block is a * valid directory so that future growth within the * block does not have to ensure that the block is * written before the inode. */ blkoff = DIRBLKSIZ; while (blkoff < bp->b_bcount) { ((struct direct *) (bp->b_data + blkoff))->d_reclen = DIRBLKSIZ; blkoff += DIRBLKSIZ; } } if ((error = UFS_UPDATE(tvp, !(DOINGSOFTDEP(tvp) | DOINGASYNC(tvp)))) != 0) { (void)bwrite(bp); goto bad; } /* * Directory set up, now install its entry in the parent directory. * * If we are not doing soft dependencies, then we must write out the * buffer containing the new directory body before entering the new * name in the parent. If we are doing soft dependencies, then the * buffer containing the new directory body will be passed to and * released in the soft dependency code after the code has attached * an appropriate ordering dependency to the buffer which ensures that * the buffer is written before the new name is written in the parent. */ if (DOINGASYNC(dvp)) bdwrite(bp); else if (!DOINGSOFTDEP(dvp) && ((error = bwrite(bp)))) goto bad; ufs_makedirentry(ip, cnp, &newdir); error = ufs_direnter(dvp, tvp, &newdir, cnp, bp); bad: if (error == 0) { *ap->a_vpp = tvp; } else { #ifdef UFS_ACL if (acl != NULL) acl_free(acl); if (dacl != NULL) acl_free(dacl); #endif dp->i_effnlink--; dp->i_nlink--; DIP_SET(dp, i_nlink, dp->i_nlink); dp->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(dvp)) softdep_change_linkcnt(dp); /* * No need to do an explicit VOP_TRUNCATE here, vrele will * do this for us because we set the link count to 0. */ ip->i_effnlink = 0; ip->i_nlink = 0; DIP_SET(ip, i_nlink, 0); ip->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(tvp)) softdep_change_linkcnt(ip); vput(tvp); } out: return (error); } /* * Rmdir system call. 
 */
static int
ufs_rmdir(ap)
	struct vop_rmdir_args /* {
		struct vnode *a_dvp;
		struct vnode *a_vp;
		struct componentname *a_cnp;
	} */ *ap;
{
	struct vnode *vp = ap->a_vp;
	struct vnode *dvp = ap->a_dvp;
	struct componentname *cnp = ap->a_cnp;
	struct inode *ip, *dp;
	int error, ioflag;

	ip = VTOI(vp);
	dp = VTOI(dvp);

	/*
	 * Do not remove a directory that is in the process of being renamed.
	 * Verify the directory is empty (and valid). Rmdir ".." will not be
	 * valid since ".." will contain a reference to the current directory
	 * and thus be non-empty. Do not allow the removal of mounted on
	 * directories (this can happen when an NFS exported filesystem
	 * tries to remove a locally mounted on directory).
	 */
	error = 0;
	if ((ip->i_flag & IN_RENAME) || ip->i_effnlink < 2) {
		error = EINVAL;
		goto out;
	}
	if (!ufs_dirempty(ip, dp->i_number, cnp->cn_cred)) {
		error = ENOTEMPTY;
		goto out;
	}
	if ((dp->i_flags & APPEND)
	    || (ip->i_flags & (NOUNLINK | IMMUTABLE | APPEND))) {
		error = EPERM;
		goto out;
	}
	if (vp->v_mountedhere != 0) {
		error = EINVAL;
		goto out;
	}
#ifdef UFS_GJOURNAL
	ufs_gjournal_orphan(vp);
#endif
	/*
	 * Delete reference to directory before purging
	 * inode.  If we crash in between, the directory
	 * will be reattached to lost+found.
	 */
	dp->i_effnlink--;
	ip->i_effnlink--;
	if (DOINGSOFTDEP(vp)) {
		softdep_change_linkcnt(dp);
		softdep_change_linkcnt(ip);
	}
	error = ufs_dirremove(dvp, ip, cnp->cn_flags, 1);
	if (error) {
		dp->i_effnlink++;
		ip->i_effnlink++;
		if (DOINGSOFTDEP(vp)) {
			softdep_change_linkcnt(dp);
			softdep_change_linkcnt(ip);
		}
		goto out;
	}
	cache_purge(dvp);
	/*
	 * Truncate inode. The only stuff left in the directory is "." and
	 * "..". The "." reference is inconsequential since we are quashing
	 * it. The soft dependency code will arrange to do these operations
	 * after the parent directory entry has been deleted on disk, so
	 * when running with that code we avoid doing them now.
	 */
	if (!DOINGSOFTDEP(vp)) {
		dp->i_nlink--;
		DIP_SET(dp, i_nlink, dp->i_nlink);
		dp->i_flag |= IN_CHANGE;
		ip->i_nlink--;
		DIP_SET(ip, i_nlink, ip->i_nlink);
		ip->i_flag |= IN_CHANGE;
		ioflag = IO_NORMAL;
		if (!DOINGASYNC(vp))
			ioflag |= IO_SYNC;
		error = UFS_TRUNCATE(vp, (off_t)0, ioflag, cnp->cn_cred,
		    cnp->cn_thread);
	}
	cache_purge(vp);
#ifdef UFS_DIRHASH
	/* Kill any active hash; i_effnlink == 0, so it will not come back. */
	if (ip->i_dirhash != NULL)
		ufsdirhash_free(ip);
#endif
out:
	return (error);
}

/*
 * symlink -- make a symbolic link
 */
static int
ufs_symlink(ap)
	struct vop_symlink_args /* {
		struct vnode *a_dvp;
		struct vnode **a_vpp;
		struct componentname *a_cnp;
		struct vattr *a_vap;
		char *a_target;
	} */ *ap;
{
	struct vnode *vp, **vpp = ap->a_vpp;
	struct inode *ip;
	int len, error;

	error = ufs_makeinode(IFLNK | ap->a_vap->va_mode, ap->a_dvp,
	    vpp, ap->a_cnp);
	if (error)
		return (error);
	vp = *vpp;
	len = strlen(ap->a_target);
	if (len < vp->v_mount->mnt_maxsymlinklen) {
		ip = VTOI(vp);
		bcopy(ap->a_target, SHORTLINK(ip), len);
		ip->i_size = len;
		DIP_SET(ip, i_size, len);
		ip->i_flag |= IN_CHANGE | IN_UPDATE;
	} else
		error = vn_rdwr(UIO_WRITE, vp, ap->a_target, len, (off_t)0,
		    UIO_SYSSPACE, IO_NODELOCKED | IO_NOMACCHECK,
		    ap->a_cnp->cn_cred, NOCRED, (int *)0, (struct thread *)0);
	if (error)
		vput(vp);
	return (error);
}

/*
 * Vnode op for reading directories.
 *
 * The routine below assumes that the on-disk format of a directory
 * is the same as that defined by <sys/dirent.h>. If the on-disk
 * format changes, then it will be necessary to do a conversion
 * from the on-disk format that read returns to the format defined
 * by <sys/dirent.h>.
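 *
 * Note on the conversion below: old format (pre-4.4BSD) directories
 * store a 16-bit name length in the bytes where the current format
 * keeps the d_type/d_namlen pair, so on little-endian machines the two
 * bytes appear transposed; the old-format read path below swaps them
 * before copying entries out to the caller.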
*/ int ufs_readdir(ap) struct vop_readdir_args /* { struct vnode *a_vp; struct uio *a_uio; struct ucred *a_cred; int *a_eofflag; int *a_ncookies; u_long **a_cookies; } */ *ap; { struct uio *uio = ap->a_uio; int error; size_t count, lost; off_t off; if (ap->a_ncookies != NULL) /* * Ensure that the block is aligned. The caller can use * the cookies to determine where in the block to start. */ uio->uio_offset &= ~(DIRBLKSIZ - 1); off = uio->uio_offset; count = uio->uio_resid; /* Make sure we don't return partial entries. */ if (count <= ((uio->uio_offset + count) & (DIRBLKSIZ -1))) return (EINVAL); count -= (uio->uio_offset + count) & (DIRBLKSIZ -1); lost = uio->uio_resid - count; uio->uio_resid = count; uio->uio_iov->iov_len = count; # if (BYTE_ORDER == LITTLE_ENDIAN) if (ap->a_vp->v_mount->mnt_maxsymlinklen > 0) { error = VOP_READ(ap->a_vp, uio, 0, ap->a_cred); } else { struct dirent *dp, *edp; struct uio auio; struct iovec aiov; caddr_t dirbuf; int readcnt; u_char tmp; auio = *uio; auio.uio_iov = &aiov; auio.uio_iovcnt = 1; auio.uio_segflg = UIO_SYSSPACE; aiov.iov_len = count; dirbuf = malloc(count, M_TEMP, M_WAITOK); aiov.iov_base = dirbuf; error = VOP_READ(ap->a_vp, &auio, 0, ap->a_cred); if (error == 0) { readcnt = count - auio.uio_resid; edp = (struct dirent *)&dirbuf[readcnt]; for (dp = (struct dirent *)dirbuf; dp < edp; ) { tmp = dp->d_namlen; dp->d_namlen = dp->d_type; dp->d_type = tmp; if (dp->d_reclen > 0) { dp = (struct dirent *) ((char *)dp + dp->d_reclen); } else { error = EIO; break; } } if (dp >= edp) error = uiomove(dirbuf, readcnt, uio); } free(dirbuf, M_TEMP); } # else error = VOP_READ(ap->a_vp, uio, 0, ap->a_cred); # endif if (!error && ap->a_ncookies != NULL) { struct dirent* dpStart; struct dirent* dpEnd; struct dirent* dp; int ncookies; u_long *cookies; u_long *cookiep; if (uio->uio_segflg != UIO_SYSSPACE || uio->uio_iovcnt != 1) panic("ufs_readdir: unexpected uio from NFS server"); dpStart = (struct dirent *) ((char *)uio->uio_iov->iov_base - (uio->uio_offset - off)); dpEnd = (struct dirent *) uio->uio_iov->iov_base; for (dp = dpStart, ncookies = 0; dp < dpEnd; dp = (struct dirent *)((caddr_t) dp + dp->d_reclen)) ncookies++; cookies = malloc(ncookies * sizeof(u_long), M_TEMP, M_WAITOK); for (dp = dpStart, cookiep = cookies; dp < dpEnd; dp = (struct dirent *)((caddr_t) dp + dp->d_reclen)) { off += dp->d_reclen; *cookiep++ = (u_long) off; } *ap->a_ncookies = ncookies; *ap->a_cookies = cookies; } uio->uio_resid += lost; if (ap->a_eofflag) *ap->a_eofflag = VTOI(ap->a_vp)->i_size <= uio->uio_offset; return (error); } /* * Return target name of a symbolic link */ static int ufs_readlink(ap) struct vop_readlink_args /* { struct vnode *a_vp; struct uio *a_uio; struct ucred *a_cred; } */ *ap; { struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); doff_t isize; isize = ip->i_size; if ((isize < vp->v_mount->mnt_maxsymlinklen) || DIP(ip, i_blocks) == 0) { /* XXX - for old fastlink support */ return (uiomove(SHORTLINK(ip), isize, ap->a_uio)); } return (VOP_READ(vp, ap->a_uio, 0, ap->a_cred)); } /* * Calculate the logical to physical mapping if not done already, * then call the device strategy routine. * * In order to be able to swap to a file, the ufs_bmaparray() operation may not * deadlock on memory. See ufs_bmap() for details. 
*/ static int ufs_strategy(ap) struct vop_strategy_args /* { struct vnode *a_vp; struct buf *a_bp; } */ *ap; { struct buf *bp = ap->a_bp; struct vnode *vp = ap->a_vp; struct bufobj *bo; struct inode *ip; ufs2_daddr_t blkno; int error; ip = VTOI(vp); if (bp->b_blkno == bp->b_lblkno) { error = ufs_bmaparray(vp, bp->b_lblkno, &blkno, bp, NULL, NULL); bp->b_blkno = blkno; if (error) { bp->b_error = error; bp->b_ioflags |= BIO_ERROR; bufdone(bp); return (0); } if ((long)bp->b_blkno == -1) vfs_bio_clrbuf(bp); } if ((long)bp->b_blkno == -1) { bufdone(bp); return (0); } bp->b_iooffset = dbtob(bp->b_blkno); bo = ip->i_umbufobj; BO_STRATEGY(bo, bp); return (0); } /* * Print out the contents of an inode. */ static int ufs_print(ap) struct vop_print_args /* { struct vnode *a_vp; } */ *ap; { struct vnode *vp = ap->a_vp; struct inode *ip = VTOI(vp); printf("\tino %lu, on dev %s", (u_long)ip->i_number, devtoname(ip->i_dev)); if (vp->v_type == VFIFO) fifo_printinfo(vp); printf("\n"); return (0); } /* * Close wrapper for fifos. * * Update the times on the inode then do device close. */ static int ufsfifo_close(ap) struct vop_close_args /* { struct vnode *a_vp; int a_fflag; struct ucred *a_cred; struct thread *a_td; } */ *ap; { struct vnode *vp = ap->a_vp; int usecount; VI_LOCK(vp); usecount = vp->v_usecount; if (usecount > 1) ufs_itimes_locked(vp); VI_UNLOCK(vp); return (fifo_specops.vop_close(ap)); } /* * Kqfilter wrapper for fifos. * * Fall through to ufs kqfilter routines if needed */ static int ufsfifo_kqfilter(ap) struct vop_kqfilter_args *ap; { int error; error = fifo_specops.vop_kqfilter(ap); if (error) error = vfs_kqfilter(ap); return (error); } /* * Return POSIX pathconf information applicable to ufs filesystems. */ static int ufs_pathconf(ap) struct vop_pathconf_args /* { struct vnode *a_vp; int a_name; int *a_retval; } */ *ap; { int error; error = 0; switch (ap->a_name) { case _PC_LINK_MAX: *ap->a_retval = LINK_MAX; break; case _PC_NAME_MAX: *ap->a_retval = NAME_MAX; break; case _PC_PATH_MAX: *ap->a_retval = PATH_MAX; break; case _PC_PIPE_BUF: *ap->a_retval = PIPE_BUF; break; case _PC_CHOWN_RESTRICTED: *ap->a_retval = 1; break; case _PC_NO_TRUNC: *ap->a_retval = 1; break; case _PC_ACL_EXTENDED: #ifdef UFS_ACL if (ap->a_vp->v_mount->mnt_flag & MNT_ACLS) *ap->a_retval = 1; else *ap->a_retval = 0; #else *ap->a_retval = 0; #endif break; case _PC_ACL_PATH_MAX: #ifdef UFS_ACL if (ap->a_vp->v_mount->mnt_flag & MNT_ACLS) *ap->a_retval = ACL_MAX_ENTRIES; else *ap->a_retval = 3; #else *ap->a_retval = 3; #endif break; case _PC_MAC_PRESENT: #ifdef MAC if (ap->a_vp->v_mount->mnt_flag & MNT_MULTILABEL) *ap->a_retval = 1; else *ap->a_retval = 0; #else *ap->a_retval = 0; #endif break; case _PC_ASYNC_IO: /* _PC_ASYNC_IO should have been handled by upper layers. 
*/ KASSERT(0, ("_PC_ASYNC_IO should not get here")); error = EINVAL; break; case _PC_PRIO_IO: *ap->a_retval = 0; break; case _PC_SYNC_IO: *ap->a_retval = 0; break; case _PC_ALLOC_SIZE_MIN: *ap->a_retval = ap->a_vp->v_mount->mnt_stat.f_bsize; break; case _PC_FILESIZEBITS: *ap->a_retval = 64; break; case _PC_REC_INCR_XFER_SIZE: *ap->a_retval = ap->a_vp->v_mount->mnt_stat.f_iosize; break; case _PC_REC_MAX_XFER_SIZE: *ap->a_retval = -1; /* means ``unlimited'' */ break; case _PC_REC_MIN_XFER_SIZE: *ap->a_retval = ap->a_vp->v_mount->mnt_stat.f_iosize; break; case _PC_REC_XFER_ALIGN: *ap->a_retval = PAGE_SIZE; break; case _PC_SYMLINK_MAX: *ap->a_retval = MAXPATHLEN; break; default: error = EINVAL; break; } return (error); } /* * Initialize the vnode associated with a new inode, handle aliased * vnodes. */ int ufs_vinit(mntp, fifoops, vpp) struct mount *mntp; struct vop_vector *fifoops; struct vnode **vpp; { struct inode *ip; struct vnode *vp; vp = *vpp; ip = VTOI(vp); vp->v_type = IFTOVT(ip->i_mode); if (vp->v_type == VFIFO) vp->v_op = fifoops; ASSERT_VOP_LOCKED(vp, "ufs_vinit"); if (ip->i_number == ROOTINO) vp->v_vflag |= VV_ROOT; *vpp = vp; return (0); } /* * Allocate a new inode. * Vnode dvp must be locked. */ static int ufs_makeinode(mode, dvp, vpp, cnp) int mode; struct vnode *dvp; struct vnode **vpp; struct componentname *cnp; { struct inode *ip, *pdir; struct direct newdir; struct vnode *tvp; #ifdef UFS_ACL struct acl *acl; #endif int error; pdir = VTOI(dvp); #ifdef INVARIANTS if ((cnp->cn_flags & HASBUF) == 0) panic("ufs_makeinode: no name"); #endif *vpp = NULL; if ((mode & IFMT) == 0) mode |= IFREG; error = UFS_VALLOC(dvp, mode, cnp->cn_cred, &tvp); if (error) return (error); ip = VTOI(tvp); ip->i_gid = pdir->i_gid; DIP_SET(ip, i_gid, pdir->i_gid); #ifdef SUIDDIR { #ifdef QUOTA struct ucred ucred, *ucp; ucp = cnp->cn_cred; #endif /* * If we are not the owner of the directory, * and we are hacking owners here, (only do this where told to) * and we are not giving it TO root, (would subvert quotas) * then go ahead and give it to the other user. * Note that this drops off the execute bits for security. */ if ((dvp->v_mount->mnt_flag & MNT_SUIDDIR) && (pdir->i_mode & ISUID) && (pdir->i_uid != cnp->cn_cred->cr_uid) && pdir->i_uid) { ip->i_uid = pdir->i_uid; DIP_SET(ip, i_uid, ip->i_uid); mode &= ~07111; #ifdef QUOTA /* * Make sure the correct user gets charged * for the space. * Quickly knock up a dummy credential for the victim. * XXX This seems to never be accessed out of our * context so a stack variable is ok. */ refcount_init(&ucred.cr_ref, 1); ucred.cr_uid = ip->i_uid; ucred.cr_ngroups = 1; ucred.cr_groups[0] = pdir->i_gid; ucp = &ucred; #endif } else { ip->i_uid = cnp->cn_cred->cr_uid; DIP_SET(ip, i_uid, ip->i_uid); } #ifdef QUOTA if ((error = getinoquota(ip)) || (error = chkiq(ip, 1, ucp, 0))) { UFS_VFREE(tvp, ip->i_number, mode); vput(tvp); return (error); } #endif } #else /* !SUIDDIR */ ip->i_uid = cnp->cn_cred->cr_uid; DIP_SET(ip, i_uid, ip->i_uid); #ifdef QUOTA if ((error = getinoquota(ip)) || (error = chkiq(ip, 1, cnp->cn_cred, 0))) { UFS_VFREE(tvp, ip->i_number, mode); vput(tvp); return (error); } #endif #endif /* !SUIDDIR */ ip->i_flag |= IN_ACCESS | IN_CHANGE | IN_UPDATE; #ifdef UFS_ACL acl = NULL; if ((dvp->v_mount->mnt_flag & MNT_ACLS) != 0) { acl = acl_alloc(M_WAITOK); /* * Retrieve default ACL for parent, if any. 
*/ error = VOP_GETACL(dvp, ACL_TYPE_DEFAULT, acl, cnp->cn_cred, cnp->cn_thread); switch (error) { case 0: /* * Retrieved a default ACL, so merge mode and ACL if * necessary. */ if (acl->acl_cnt != 0) { /* * Two possible ways for default ACL to not * be present. First, the EA can be * undefined, or second, the default ACL can * be blank. If it's blank, fall through to * the it's not defined case. */ mode = acl_posix1e_newfilemode(mode, acl); ip->i_mode = mode; DIP_SET(ip, i_mode, mode); ufs_sync_acl_from_inode(ip, acl); break; } /* FALLTHROUGH */ case EOPNOTSUPP: /* * Just use the mode as-is. */ ip->i_mode = mode; DIP_SET(ip, i_mode, mode); acl_free(acl); acl = NULL; break; default: UFS_VFREE(tvp, ip->i_number, mode); vput(tvp); acl_free(acl); acl = NULL; return (error); } } else { #endif ip->i_mode = mode; DIP_SET(ip, i_mode, mode); #ifdef UFS_ACL } #endif tvp->v_type = IFTOVT(mode); /* Rest init'd in getnewvnode(). */ ip->i_effnlink = 1; ip->i_nlink = 1; DIP_SET(ip, i_nlink, 1); if (DOINGSOFTDEP(tvp)) softdep_change_linkcnt(ip); if ((ip->i_mode & ISGID) && !groupmember(ip->i_gid, cnp->cn_cred) && priv_check_cred(cnp->cn_cred, PRIV_VFS_SETGID, 0)) { ip->i_mode &= ~ISGID; DIP_SET(ip, i_mode, ip->i_mode); } if (cnp->cn_flags & ISWHITEOUT) { ip->i_flags |= UF_OPAQUE; DIP_SET(ip, i_flags, ip->i_flags); } /* * Make sure inode goes to disk before directory entry. */ error = UFS_UPDATE(tvp, !(DOINGSOFTDEP(tvp) | DOINGASYNC(tvp))); if (error) goto bad; #ifdef MAC if (dvp->v_mount->mnt_flag & MNT_MULTILABEL) { error = mac_vnode_create_extattr(cnp->cn_cred, dvp->v_mount, dvp, tvp, cnp); if (error) goto bad; } #endif #ifdef UFS_ACL if (acl != NULL) { /* * XXX: If we abort now, will Soft Updates notify the extattr * code that the EAs for the file need to be released? */ error = VOP_SETACL(tvp, ACL_TYPE_ACCESS, acl, cnp->cn_cred, cnp->cn_thread); switch (error) { case 0: break; case EOPNOTSUPP: /* * XXX: This should not happen, as EOPNOTSUPP above was * supposed to free acl. */ printf("ufs_makeinode: VOP_GETACL() but no " "VOP_SETACL()\n"); /* panic("ufs_makeinode: VOP_GETACL() but no " "VOP_SETACL()"); */ break; default: acl_free(acl); goto bad; } acl_free(acl); } #endif /* !UFS_ACL */ ufs_makedirentry(ip, cnp, &newdir); error = ufs_direnter(dvp, tvp, &newdir, cnp, NULL); if (error) goto bad; *vpp = tvp; return (0); bad: /* * Write error occurred trying to update the inode * or the directory so must deallocate the inode. */ ip->i_effnlink = 0; ip->i_nlink = 0; DIP_SET(ip, i_nlink, 0); ip->i_flag |= IN_CHANGE; if (DOINGSOFTDEP(tvp)) softdep_change_linkcnt(ip); vput(tvp); return (error); } /* Global vfs data structures for ufs. 
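 *
 * The fsync/read/reallocblks/write slots below are wired to VOP_PANIC
 * on purpose: the filesystem layered above (FFS) provides those
 * operations and reaches this table through vop_default.  Roughly (a
 * hedged sketch; the actual declarations live in ffs_vnops.c, not
 * here):
 *
 *	struct vop_vector ffs_vnodeops1 = {
 *		.vop_default = &ufs_vnodeops,
 *		.vop_fsync = ffs_fsync,
 *		.vop_read = ffs_read,
 *		.vop_write = ffs_write,
 *		...
 *	};
 *
 * so falling through to the UFS-level entries for those operations
 * would indicate a misconfigured vector.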
 */
struct vop_vector ufs_vnodeops = {
	.vop_default = &default_vnodeops,
	.vop_fsync = VOP_PANIC,
	.vop_read = VOP_PANIC,
	.vop_reallocblks = VOP_PANIC,
	.vop_write = VOP_PANIC,
	.vop_access = ufs_access,
	.vop_bmap = ufs_bmap,
	.vop_cachedlookup = ufs_lookup,
	.vop_close = ufs_close,
	.vop_create = ufs_create,
	.vop_getattr = ufs_getattr,
	.vop_inactive = ufs_inactive,
	.vop_link = ufs_link,
	.vop_lookup = vfs_cache_lookup,
	.vop_markatime = ufs_markatime,
	.vop_mkdir = ufs_mkdir,
	.vop_mknod = ufs_mknod,
	.vop_open = ufs_open,
	.vop_pathconf = ufs_pathconf,
	.vop_poll = vop_stdpoll,
	.vop_print = ufs_print,
	.vop_readdir = ufs_readdir,
	.vop_readlink = ufs_readlink,
	.vop_reclaim = ufs_reclaim,
	.vop_remove = ufs_remove,
	.vop_rename = ufs_rename,
	.vop_rmdir = ufs_rmdir,
	.vop_setattr = ufs_setattr,
#ifdef MAC
	.vop_setlabel = vop_stdsetlabel_ea,
#endif
	.vop_strategy = ufs_strategy,
	.vop_symlink = ufs_symlink,
	.vop_whiteout = ufs_whiteout,
#ifdef UFS_EXTATTR
	.vop_getextattr = ufs_getextattr,
	.vop_deleteextattr = ufs_deleteextattr,
	.vop_setextattr = ufs_setextattr,
#endif
#ifdef UFS_ACL
	.vop_getacl = ufs_getacl,
	.vop_setacl = ufs_setacl,
	.vop_aclcheck = ufs_aclcheck,
#endif
};

struct vop_vector ufs_fifoops = {
	.vop_default = &fifo_specops,
	.vop_fsync = VOP_PANIC,
	.vop_access = ufs_access,
	.vop_close = ufsfifo_close,
	.vop_getattr = ufs_getattr,
	.vop_inactive = ufs_inactive,
	.vop_kqfilter = ufsfifo_kqfilter,
	.vop_markatime = ufs_markatime,
	.vop_print = ufs_print,
	.vop_read = VOP_PANIC,
	.vop_reclaim = ufs_reclaim,
	.vop_setattr = ufs_setattr,
#ifdef MAC
	.vop_setlabel = vop_stdsetlabel_ea,
#endif
	.vop_write = VOP_PANIC,
#ifdef UFS_EXTATTR
	.vop_getextattr = ufs_getextattr,
	.vop_deleteextattr = ufs_deleteextattr,
	.vop_setextattr = ufs_setextattr,
#endif
#ifdef UFS_ACL
	.vop_getacl = ufs_getacl,
	.vop_setacl = ufs_setacl,
	.vop_aclcheck = ufs_aclcheck,
#endif
};
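
/*
 * Illustrative note: callers do not invoke the functions in these
 * tables directly.  The VOP_*() wrappers dispatch through a vnode's
 * v_op vector, so for a directory vnode on a UFS mount a call such as
 *
 *	error = VOP_MKDIR(dvp, &vp, cnp, &vattr);
 *
 * is routed to ufs_mkdir() above, with the arguments packaged into a
 * struct vop_mkdir_args.  The names dvp/vp/cnp/vattr are placeholders,
 * not variables defined in this file.
 */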