Index: stable/12/share/man/man4/netmap.4 =================================================================== --- stable/12/share/man/man4/netmap.4 (revision 354470) +++ stable/12/share/man/man4/netmap.4 (revision 354471) @@ -1,1172 +1,1172 @@ .\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" This document is derived in part from the enet man page (enet.4) .\" distributed with 4.3BSD Unix. 
.\" .\" $FreeBSD$ .\" -.Dd November 20, 2018 +.Dd October 26, 2019 .Dt NETMAP 4 .Os .Sh NAME .Nm netmap .Nd a framework for fast packet I/O .Sh SYNOPSIS .Cd device netmap .Sh DESCRIPTION .Nm is a framework for extremely fast and efficient packet I/O for userspace and kernel clients, and for Virtual Machines. It runs on .Fx Linux and some versions of Windows, and supports a variety of .Nm netmap ports , including .Bl -tag -width XXXX .It Nm physical NIC ports to access individual queues of network interfaces; .It Nm host ports to inject packets into the host stack; .It Nm VALE ports implementing a very fast and modular in-kernel software switch/dataplane; .It Nm netmap pipes a shared memory packet transport channel; .It Nm netmap monitors a mechanism similar to .Xr bpf 4 to capture traffic .El .Pp All these .Nm netmap ports are accessed interchangeably with the same API, and are at least one order of magnitude faster than standard OS mechanisms (sockets, bpf, tun/tap interfaces, native switches, pipes). With suitably fast hardware (NICs, PCIe buses, CPUs), packet I/O using .Nm on supported NICs reaches 14.88 million packets per second (Mpps) with much less than one core on 10 Gbit/s NICs; 35-40 Mpps on 40 Gbit/s NICs (limited by the hardware); about 20 Mpps per core for VALE ports; and over 100 Mpps for .Nm netmap pipes . NICs without native .Nm support can still use the API in emulated mode, which uses unmodified device drivers and is 3-5 times faster than .Xr bpf 4 or raw sockets. .Pp Userspace clients can dynamically switch NICs into .Nm mode and send and receive raw packets through memory mapped buffers. Similarly, .Nm VALE switch instances and ports, .Nm netmap pipes and .Nm netmap monitors can be created dynamically, providing high speed packet I/O between processes, virtual machines, NICs and the host stack. 
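The 14.88 Mpps figure quoted above for 10 Gbit/s NICs is the theoretical line rate for minimum-size Ethernet frames. The arithmetic can be sketched as follows, assuming a 64-byte minimum frame plus 20 bytes of per-frame wire overhead (8 bytes of preamble and start-of-frame delimiter, 12 bytes of inter-frame gap). The helper name line_rate_pps and the macro ETH_WIRE_OVERHEAD are hypothetical and are not part of netmap.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-frame wire overhead: 8 bytes preamble+SFD, 12 bytes inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20u

/*
 * Hypothetical helper (not a netmap function): upper bound on the number
 * of frames per second that fit on a link of link_bps bits per second
 * when every frame is frame_bytes long (FCS included).
 */
static uint64_t
line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	uint64_t bits_per_frame;

	assert(frame_bytes > 0);
	bits_per_frame = ((uint64_t)frame_bytes + ETH_WIRE_OVERHEAD) * 8;
	return link_bps / bits_per_frame;	/* truncates toward zero */
}
```

For example, line_rate_pps(10000000000ULL, 64) yields 14880952, i.e. about 14.88 Mpps. The same formula gives roughly 59.5 Mpps for 40 Gbit/s, which is consistent with the 35-40 Mpps figure above being described as limited by the hardware rather than by the link.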
.Pp .Nm supports both non-blocking I/O through .Xr ioctl 2 , synchronization and blocking I/O through a file descriptor and standard OS mechanisms such as .Xr select 2 , .Xr poll 2 , .Xr kqueue 2 and .Xr epoll 7 . All types of .Nm netmap ports and the .Nm VALE switch are implemented by a single kernel module, which also emulates the .Nm API over standard drivers. For best performance, .Nm requires native support in device drivers. A list of such devices is at the end of this document. .Pp In the rest of this (long) manual page we document various aspects of the .Nm and .Nm VALE architecture, features and usage. .Sh ARCHITECTURE .Nm supports raw packet I/O through a .Em port , which can be connected to a physical interface .Em ( NIC ) , to the host stack, or to a .Nm VALE switch. Ports use preallocated circular queues of buffers .Em ( rings ) residing in an mmapped region. There is one ring for each transmit/receive queue of a NIC or virtual port. An additional ring pair connects to the host stack. .Pp After binding a file descriptor to a port, a .Nm client can send or receive packets in batches through the rings, and possibly implement zero-copy forwarding between ports. .Pp All NICs operating in .Nm mode use the same memory region, accessible to all processes who own .Pa /dev/netmap file descriptors bound to NICs. Independent .Nm VALE and .Nm netmap pipe ports by default use separate memory regions, but can be independently configured to share memory. .Sh ENTERING AND EXITING NETMAP MODE The following section describes the system calls to create and control .Nm netmap ports (including .Nm VALE and .Nm netmap pipe ports). Simpler, higher level functions are described in the .Sx LIBRARIES section. 
.Pp Ports and rings are created and controlled through a file descriptor, created by opening a special device .Dl fd = open("/dev/netmap"); and then bound to a specific port with an .Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg); .Pp .Nm has multiple modes of operation controlled by the .Vt struct nmreq argument. .Va arg.nr_name specifies the netmap port name, as follows: .Bl -tag -width XXXX .It Dv OS network interface name (e.g., 'em0', 'eth1', ... ) the data path of the NIC is disconnected from the host stack, and the file descriptor is bound to the NIC (one or all queues), or to the host stack; .It Dv valeSSS:PPP the file descriptor is bound to port PPP of VALE switch SSS. Switch instances and ports are dynamically created if necessary. .Pp Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string cannot exceed IFNAMSIZ characters, and PPP cannot be the name of any existing OS network interface. .El .Pp On return, .Va arg indicates the size of the shared memory region, and the number, size and location of all the .Nm data structures, which can be accessed by mmapping the memory .Dl char *mem = mmap(0, arg.nr_memsize, fd); .Pp Non-blocking I/O is done with special .Xr ioctl 2 .Xr select 2 and .Xr poll 2 on the file descriptor permit blocking I/O. .Pp While a NIC is in .Nm mode, the OS will still believe the interface is up and running. OS-generated packets for that NIC end up into a .Nm ring, and another ring is used to send packets into the OS network stack. A .Xr close 2 on the file descriptor removes the binding, and returns the NIC to normal mode (reconnecting the data path to the host stack), or destroys the virtual port. .Sh DATA STRUCTURES The data structures in the mmapped memory region are detailed in .In sys/net/netmap.h , which is the ultimate reference for the .Nm API. The main structures and fields are indicated below: .Bl -tag -width XXX .It Dv struct netmap_if (one per interface ) .Bd -literal struct netmap_if { ... 
const uint32_t ni_flags; /* properties */ ... const uint32_t ni_tx_rings; /* NIC tx rings */ const uint32_t ni_rx_rings; /* NIC rx rings */ uint32_t ni_bufs_head; /* head of extra bufs list */ ... }; .Ed .Pp Indicates the number of available rings .Pa ( struct netmap_rings ) and their position in the mmapped region. The number of tx and rx rings .Pa ( ni_tx_rings , ni_rx_rings ) normally depends on the hardware. NICs also have an extra tx/rx ring pair connected to the host stack. .Em NIOCREGIF can also request additional unbound buffers in the same memory space, to be used as temporary storage for packets. The number of extra buffers is specified in the .Va arg.nr_arg3 field. On success, the kernel writes back to .Va arg.nr_arg3 the number of extra buffers actually allocated (they may be less than the amount requested if the memory space ran out of buffers). .Pa ni_bufs_head contains the index of the first of these extra buffers, which are connected in a list (the first uint32_t of each buffer being the index of the next buffer in the list). A .Dv 0 indicates the end of the list. The application is free to modify this list and use the buffers (i.e., binding them to the slots of a netmap ring). When closing the netmap file descriptor, the kernel frees the buffers contained in the list pointed by .Pa ni_bufs_head , irrespectively of the buffers originally provided by the kernel on .Em NIOCREGIF . .It Dv struct netmap_ring (one per ring ) .Bd -literal struct netmap_ring { ... const uint32_t num_slots; /* slots in each ring */ const uint32_t nr_buf_size; /* size of each buffer */ ... uint32_t head; /* (u) first buf owned by user */ uint32_t cur; /* (u) wakeup position */ const uint32_t tail; /* (k) first buf owned by kernel */ ... uint32_t flags; struct timeval ts; /* (k) time of last rxsync() */ ... 
struct netmap_slot slot[0]; /* array of slots */ } .Ed .Pp Implements transmit and receive rings, with read/write pointers, metadata and an array of .Em slots describing the buffers. .It Dv struct netmap_slot (one per buffer ) .Bd -literal struct netmap_slot { uint32_t buf_idx; /* buffer index */ uint16_t len; /* packet length */ uint16_t flags; /* buf changed, etc. */ uint64_t ptr; /* address for indirect buffers */ }; .Ed .Pp Describes a packet buffer, which normally is identified by an index and resides in the mmapped region. .It Dv packet buffers Fixed size (normally 2 KB) packet buffers allocated by the kernel. .El .Pp The offset of the .Pa struct netmap_if in the mmapped region is indicated by the .Pa nr_offset field in the structure returned by .Dv NIOCREGIF . From there, all other objects are reachable through relative references (offsets or indexes). Macros and functions in .In net/netmap_user.h help converting them into actual pointers: .Pp .Dl struct netmap_if *nifp = NETMAP_IF(mem, arg.nr_offset); .Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index); .Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index); .Pp .Dl char *buf = NETMAP_BUF(ring, buffer_index); .Sh RINGS, BUFFERS AND DATA I/O .Va Rings are circular queues of packets with three indexes/pointers .Va ( head , cur , tail ) ; one slot is always kept empty. The ring size .Va ( num_slots ) should not be assumed to be a power of two. .Pp .Va head is the first slot available to userspace; .Pp .Va cur is the wakeup point: select/poll will unblock when .Va tail passes .Va cur ; .Pp .Va tail is the first slot reserved to the kernel. .Pp Slot indexes .Em must only move forward; for convenience, the function .Dl nm_ring_next(ring, index) returns the next index modulo the ring size. .Pp .Va head and .Va cur are only modified by the user program; .Va tail is only modified by the kernel. 
The kernel only reads/writes the .Vt struct netmap_ring slots and buffers during the execution of a netmap-related system call. The only exception are slots (and buffers) in the range .Va tail\ . . . head-1 , that are explicitly assigned to the kernel. .Ss TRANSMIT RINGS On transmit rings, after a .Nm system call, slots in the range .Va head\ . . . tail-1 are available for transmission. User code should fill the slots sequentially and advance .Va head and .Va cur past slots ready to transmit. .Va cur may be moved further ahead if the user code needs more slots before further transmissions (see .Sx SCATTER GATHER I/O ) . .Pp At the next NIOCTXSYNC/select()/poll(), slots up to .Va head-1 are pushed to the port, and .Va tail may advance if further slots have become available. Below is an example of the evolution of a TX ring: .Bd -literal after the syscall, slots between cur and tail are (a)vailable head=cur tail | | v v TX [.....aaaaaaaaaaa.............] user creates new packets to (T)ransmit head=cur tail | | v v TX [.....TTTTTaaaaaa.............] NIOCTXSYNC/poll()/select() sends packets and reports new slots head=cur tail | | v v TX [..........aaaaaaaaaaa........] .Ed .Pp .Fn select and .Fn poll will block if there is no space in the ring, i.e., .Dl ring->cur == ring->tail and return when new slots have become available. .Pp High speed applications may want to amortize the cost of system calls by preparing as many packets as possible before issuing them. .Pp A transmit ring with pending transmissions has .Dl ring->head != ring->tail + 1 (modulo the ring size). The function .Va int nm_tx_pending(ring) implements this test. .Ss RECEIVE RINGS On receive rings, after a .Nm system call, the slots in the range .Va head\& . . . tail-1 contain received packets. User code should process them and advance .Va head and .Va cur past slots it wants to return to the kernel. 
 .Va cur
 may be moved further ahead if the user code wants to
 wait for more packets
 without returning all the previous slots to the kernel.
 .Pp
 At the next NIOCRXSYNC/select()/poll(),
 slots up to
 .Va head-1
 are returned to the kernel for further receives, and
 .Va tail
 may advance to report new incoming packets.
 .Pp
 Below is an example of the evolution of an RX ring:
 .Bd -literal
 after the syscall, there are some (h)eld and some (R)eceived slots

        head  cur     tail
        |     |       |
        v     v       v
 RX  [..hhhhhhRRRRRRRR..........]

 user advances head and cur, releasing some slots and holding others

          head cur    tail
             |  |     |
             v  v     v
 RX  [..*****hhhRRRRRR...........]

 NIOCRXSYNC/poll()/select() recovers slots and reports new packets

          head cur          tail
             |  |           |
             v  v           v
 RX  [.......hhhRRRRRRRRRRRR....]
 .Ed
 .Sh SLOTS AND PACKET BUFFERS
 Normally, packets should be stored in the netmap-allocated buffers
 assigned to slots when ports are bound to a file descriptor.
 One packet is fully contained in a single buffer.
 .Pp
 The following flags affect slot and buffer processing:
 .Bl -tag -width XXX
 .It NS_BUF_CHANGED
 .Em must
 be used when the
 .Va buf_idx
 in the slot is changed.
 This can be used to implement
 zero-copy forwarding, see
 .Sx ZERO-COPY FORWARDING .
 .It NS_REPORT
 reports when this buffer has been transmitted.
 Normally,
 .Nm
 notifies transmit completions in batches, hence signals
 can be delayed indefinitely.
 This flag helps detect
 when packets have been sent and a file descriptor can be closed.
 .It NS_FORWARD
 When a ring is in 'transparent' mode,
 packets marked with this flag by the user application are forwarded to the
 other endpoint at the next system call, thus restoring (in a selective way)
 the connection between a NIC and the host stack.
 .It NS_NO_LEARN
 tells the forwarding code that the source MAC address for this
 packet must not be used in the learning bridge code.
 .It NS_INDIRECT
 indicates that the packet's payload is in a user-supplied buffer
 whose user virtual address is in the 'ptr' field of the slot.
The size can reach 65535 bytes. .Pp This is only supported on the transmit ring of .Nm VALE ports, and it helps reducing data copies in the interconnection of virtual machines. .It NS_MOREFRAG indicates that the packet continues with subsequent buffers; the last buffer in a packet must have the flag clear. .El .Sh SCATTER GATHER I/O Packets can span multiple slots if the .Va NS_MOREFRAG flag is set in all but the last slot. The maximum length of a chain is 64 buffers. This is normally used with .Nm VALE ports when connecting virtual machines, as they generate large TSO segments that are not split unless they reach a physical device. .Pp NOTE: The length field always refers to the individual fragment; there is no place with the total length of a packet. .Pp On receive rings the macro .Va NS_RFRAGS(slot) indicates the remaining number of slots for this packet, including the current one. Slots with a value greater than 1 also have NS_MOREFRAG set. .Sh IOCTLS .Nm uses two ioctls (NIOCTXSYNC, NIOCRXSYNC) for non-blocking I/O. They take no argument. Two more ioctls (NIOCGINFO, NIOCREGIF) are used to query and configure ports, with the following argument: .Bd -literal struct nmreq { char nr_name[IFNAMSIZ]; /* (i) port name */ uint32_t nr_version; /* (i) API version */ uint32_t nr_offset; /* (o) nifp offset in mmap region */ uint32_t nr_memsize; /* (o) size of the mmap region */ uint32_t nr_tx_slots; /* (i/o) slots in tx rings */ uint32_t nr_rx_slots; /* (i/o) slots in rx rings */ uint16_t nr_tx_rings; /* (i/o) number of tx rings */ uint16_t nr_rx_rings; /* (i/o) number of rx rings */ uint16_t nr_ringid; /* (i/o) ring(s) we care about */ uint16_t nr_cmd; /* (i) special command */ uint16_t nr_arg1; /* (i/o) extra arguments */ uint16_t nr_arg2; /* (i/o) extra arguments */ uint32_t nr_arg3; /* (i/o) extra arguments */ uint32_t nr_flags /* (i/o) open mode */ ... 
 };
 .Ed
 .Pp
 A file descriptor obtained through
 .Pa /dev/netmap
 also supports the ioctls supported by network devices, see
 .Xr netintro 4 .
 .Bl -tag -width XXXX
 .It Dv NIOCGINFO
 returns EINVAL if the named port does not support netmap.
 Otherwise, it returns 0 and (advisory) information
 about the port.
 Note that all the information below can change before the
 interface is actually put in netmap mode.
 .Bl -tag -width XX
 .It Pa nr_memsize
 indicates the size of the
 .Nm
 memory region.
 NICs in
 .Nm
 mode all share the same memory region,
 whereas
 .Nm VALE
 ports have independent regions for each port.
 .It Pa nr_tx_slots , nr_rx_slots
 indicate the size of transmit and receive rings.
 .It Pa nr_tx_rings , nr_rx_rings
 indicate the number of transmit
 and receive rings.
 Both ring number and sizes may be configured at runtime
 using interface-specific functions (e.g.,
 .Xr ethtool 8 ) .
 .El
 .It Dv NIOCREGIF
 binds the port named in
 .Va nr_name
 to the file descriptor.
 For a physical device this also switches it into
 .Nm
 mode, disconnecting
 it from the host stack.
 Multiple file descriptors can be bound to the same port,
 with proper synchronization left to the user.
 .Pp
 The recommended way to bind a file descriptor to a port is
 to use function
 .Va nm_open(..)
 (see
 .Sx LIBRARIES )
 which parses names to access specific port types and
 enable features.
 In the following we document the main features.
 .Pp
 .Dv NIOCREGIF
 can also bind a file descriptor to one endpoint of a
 .Em netmap pipe ,
 consisting of two netmap ports with a crossover connection.
 A netmap pipe shares the same memory space as the parent port,
 and is meant to enable configurations in which a master process acts
 as a dispatcher towards slave processes.
 .Pp
 To enable this function, the
 .Pa nr_arg1
 field of the structure can be used as a hint to the kernel to
 indicate how many pipes we expect to use, and reserve extra space
 in the memory region.
.Pp On return, it gives the same info as NIOCGINFO, with .Pa nr_ringid and .Pa nr_flags indicating the identity of the rings controlled through the file descriptor. .Pp .Va nr_flags .Va nr_ringid selects which rings are controlled through this file descriptor. Possible values of .Pa nr_flags are indicated below, together with the naming schemes that application libraries (such as the .Nm nm_open indicated below) can use to indicate the specific set of rings. In the example below, "netmap:foo" is any valid netmap port name. .Bl -tag -width XXXXX .It NR_REG_ALL_NIC "netmap:foo" (default) all hardware ring pairs .It NR_REG_SW "netmap:foo^" the ``host rings'', connecting to the host stack. .It NR_REG_NIC_SW "netmap:foo+" all hardware rings and the host rings .It NR_REG_ONE_NIC "netmap:foo-i" only the i-th hardware ring pair, where the number is in .Pa nr_ringid ; .It NR_REG_PIPE_MASTER "netmap:foo{i" the master side of the netmap pipe whose identifier (i) is in .Pa nr_ringid ; .It NR_REG_PIPE_SLAVE "netmap:foo}i" the slave side of the netmap pipe whose identifier (i) is in .Pa nr_ringid . .Pp The identifier of a pipe must be thought as part of the pipe name, and does not need to be sequential. On return the pipe will only have a single ring pair with index 0, irrespective of the value of .Va i . .El .Pp By default, a .Xr poll 2 or .Xr select 2 call pushes out any pending packets on the transmit ring, even if no write events are specified. The feature can be disabled by or-ing .Va NETMAP_NO_TX_POLL to the value written to .Va nr_ringid . When this feature is used, packets are transmitted only on .Va ioctl(NIOCTXSYNC) or .Va select() / .Va poll() are called with a write event (POLLOUT/wfdset) or a full ring. .Pp When registering a virtual interface that is dynamically created to a .Xr vale 4 switch, we can specify the desired number of rings (1 by default, and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields. 
.It Dv NIOCTXSYNC tells the hardware of new packets to transmit, and updates the number of slots available for transmission. .It Dv NIOCRXSYNC tells the hardware of consumed packets, and asks for newly available packets. .El .Sh SELECT, POLL, EPOLL, KQUEUE .Xr select 2 and .Xr poll 2 on a .Nm file descriptor process rings as indicated in .Sx TRANSMIT RINGS and .Sx RECEIVE RINGS , respectively when write (POLLOUT) and read (POLLIN) events are requested. Both block if no slots are available in the ring .Va ( ring->cur == ring->tail ) . Depending on the platform, .Xr epoll 7 and .Xr kqueue 2 are supported too. .Pp Packets in transmit rings are normally pushed out (and buffers reclaimed) even without requesting write events. Passing the .Dv NETMAP_NO_TX_POLL flag to .Em NIOCREGIF disables this feature. By default, receive rings are processed only if read events are requested. Passing the .Dv NETMAP_DO_RX_POLL flag to .Em NIOCREGIF updates receive rings even without read events. Note that on .Xr epoll 7 and .Xr kqueue 2 , .Dv NETMAP_NO_TX_POLL and .Dv NETMAP_DO_RX_POLL only have an effect when some event is posted for the file descriptor. .Sh LIBRARIES The .Nm API is supposed to be used directly, both because of its simplicity and for efficient integration with applications. .Pp For convenience, the .In net/netmap_user.h header provides a few macros and functions to ease creating a file descriptor and doing I/O with a .Nm port. These are loosely modeled after the .Xr pcap 3 API, to ease porting of libpcap-based applications to .Nm . To use these extra functions, programs should .Dl #define NETMAP_WITH_LIBS before .Dl #include .Pp The following functions are available: .Bl -tag -width XXXXX .It Va struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg ) similar to .Xr pcap_open_live 3 , binds a file descriptor to a port. 
 .Bl -tag -width XX
 .It Va ifname
 is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
 .Nm VALE
 port.
 .It Va req
 provides the initial values for the argument to the NIOCREGIF ioctl.
 The nr_flags and nr_ringid values are overwritten by parsing
 ifname and flags, and other fields can be overridden through
 the other two arguments.
 .It Va arg
 points to a struct nm_desc containing arguments (e.g., from a previously
 open file descriptor) that should override the defaults.
 The fields are used as described below.
 .It Va flags
 can be set to a combination of the following flags:
 .Va NETMAP_NO_TX_POLL ,
 .Va NETMAP_DO_RX_POLL
 (copied into nr_ringid);
 .Va NM_OPEN_NO_MMAP
 (if arg points to the same memory region,
 avoids the mmap and uses the values from it);
 .Va NM_OPEN_IFNAME
 (ignores ifname and uses the values in arg);
 .Va NM_OPEN_ARG1 ,
 .Va NM_OPEN_ARG2 ,
 .Va NM_OPEN_ARG3
 (uses the fields from arg);
 .Va NM_OPEN_RING_CFG
 (uses the ring number and sizes from arg).
 .El
 .It Va int nm_close(struct nm_desc *d )
 closes the file descriptor, unmaps memory, frees resources.
 .It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size )
 similar to
 .Va pcap_inject() ,
 pushes a packet to a ring, returns the size
 of the packet if successful, or 0 on error;
 .It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg )
 similar to
 .Va pcap_dispatch() ,
 applies a callback to incoming packets
 .It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr )
 similar to
 .Va pcap_next() ,
 fetches the next packet
 .El
 .Sh SUPPORTED DEVICES
 .Nm
 natively supports the following devices:
 .Pp
 On
 .Fx :
 .Xr cxgbe 4 ,
 .Xr em 4 ,
 .Xr iflib 4
 (providing igb, em and lem),
 .Xr ixgbe 4 ,
 .Xr ixl 4 ,
 .Xr re 4 ,
 .Xr vtnet 4 .
 .Pp
 On Linux: e1000, e1000e, i40e, igb, ixgbe, ixgbevf, r8169, virtio_net, vmxnet3.
 .Pp
 NICs without native support can still be used in
 .Nm
 mode through emulation.
 Performance is inferior to native netmap
 mode but still significantly higher than various raw socket types
 (bpf, PF_PACKET, etc.).
 Note that for slow devices (such as 1 Gbit/s and slower NICs,
 or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
 emulated and native mode will likely have similar or identical throughput.
 .Pp
 When emulation is in use, packet sniffer programs such as tcpdump
 could see received packets before they are diverted by netmap.
 This behaviour is not intentional, being just an artifact of the
 implementation of emulation.
 Note that in case the netmap application subsequently moves packets
 received from the emulated adapter onto the host RX ring,
 the sniffer will intercept those packets again, since the packets
 are injected to the host stack as they were received by the network interface.
 .Pp
 Emulation is also available for devices with native netmap support,
 which can be used for testing or performance comparison.
 The sysctl variable
 .Va dev.netmap.admode
 globally controls how netmap mode is implemented.
 .Sh SYSCTL VARIABLES AND MODULE PARAMETERS
 Some aspects of the operation of
 .Nm
 are controlled through sysctl variables on
 .Fx
 .Em ( dev.netmap.* )
 and module parameters on Linux
 .Em ( /sys/module/netmap/parameters/* ) :
 .Bl -tag -width indent
 .It Va dev.netmap.admode: 0
 Controls the use of native or emulated adapter mode.
 .Pp
 0 uses the best available option;
 .Pp
 1 forces native mode and fails if not available;
 .Pp
 2 forces emulated mode, hence never fails.
.It Va dev.netmap.generic_rings: 1 Number of rings used for emulated netmap mode .It Va dev.netmap.generic_ringsize: 1024 Ring size used for emulated netmap mode .It Va dev.netmap.generic_mit: 100000 Controls interrupt moderation for emulated mode .It Va dev.netmap.mmap_unreg: 0 .It Va dev.netmap.fwd: 0 Forces NS_FORWARD mode .It Va dev.netmap.flags: 0 .It Va dev.netmap.txsync_retry: 2 .It Va dev.netmap.no_pendintr: 1 Forces recovery of transmit buffers on system calls .It Va dev.netmap.mitigate: 1 Propagates interrupt mitigation to user processes .It Va dev.netmap.no_timestamp: 0 Disables the update of the timestamp in the netmap ring .It Va dev.netmap.verbose: 0 Verbose kernel messages .It Va dev.netmap.buf_num: 163840 .It Va dev.netmap.buf_size: 2048 .It Va dev.netmap.ring_num: 200 .It Va dev.netmap.ring_size: 36864 .It Va dev.netmap.if_num: 100 .It Va dev.netmap.if_size: 1024 Sizes and number of objects (netmap_if, netmap_ring, buffers) for the global memory region. The only parameter worth modifying is .Va dev.netmap.buf_num as it impacts the total amount of memory used by netmap. .It Va dev.netmap.buf_curr_num: 0 .It Va dev.netmap.buf_curr_size: 0 .It Va dev.netmap.ring_curr_num: 0 .It Va dev.netmap.ring_curr_size: 0 .It Va dev.netmap.if_curr_num: 0 .It Va dev.netmap.if_curr_size: 0 Actual values in use. .It Va dev.netmap.bridge_batch: 1024 Batch size used when moving packets across a .Nm VALE switch. Values above 64 generally guarantee good performance. .It Va dev.netmap.ptnet_vnet_hdr: 1 Allow ptnet devices to use virtio-net headers .El .Sh SYSTEM CALLS .Nm uses .Xr select 2 , .Xr poll 2 , .Xr epoll 7 and .Xr kqueue 2 to wake up processes when significant events occur, and .Xr mmap 2 to map memory. .Xr ioctl 2 is used to configure ports and .Nm VALE switches . .Pp Applications may need to create threads and bind them to specific cores to improve performance, using standard OS primitives, see .Xr pthread 3 . 
 In particular,
 .Xr pthread_setaffinity_np 3
 may be of use.
 .Sh EXAMPLES
 .Ss TEST PROGRAMS
 .Nm
 comes with a few programs that can be used for testing or
 simple applications.
 See the
 .Pa examples/
 directory in
 .Nm
 distributions, or
 .Pa tools/tools/netmap/
 directory in
 .Fx
 distributions.
 .Pp
 .Xr pkt-gen 8
 is a general purpose traffic source/sink.
 .Pp
 As an example
 .Dl pkt-gen -i ix0 -f tx -l 60
 can generate an infinite stream of minimum size packets, and
 .Dl pkt-gen -i ix0 -f rx
 is a traffic sink.
 Both print traffic statistics, to help monitor
 how the system performs.
 .Pp
 .Xr pkt-gen 8
 has many options that can be used to set packet sizes, addresses,
 rates, and use multiple send/receive threads and cores.
 .Pp
 .Xr bridge 8
 is another test program which interconnects two
 .Nm
 ports.
 It can be used for transparent forwarding between
 interfaces, as in
 .Dl bridge -i netmap:ix0 -i netmap:ix1
 or even connect the NIC to the host stack using netmap
 .Dl bridge -i netmap:ix0
 .Ss USING THE NATIVE API
 The following code implements a traffic generator:
 .Pp
 .Bd -literal -compact
 #include <fcntl.h>
 #include <poll.h>
 #include <string.h>
 #include <strings.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
 #include <net/netmap_user.h>
 \&...
 void sender(void)
 {
     struct netmap_if *nifp;
     struct netmap_ring *ring;
     struct nmreq nmr;
     struct pollfd fds;
     char *p, *buf;
     uint32_t i;
     int fd;

     fd = open("/dev/netmap", O_RDWR);
     bzero(&nmr, sizeof(nmr));
     strcpy(nmr.nr_name, "ix0");
     nmr.nr_version = NETMAP_API;
     ioctl(fd, NIOCREGIF, &nmr);
     /* map the shared region that holds rings and buffers */
     p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
         MAP_SHARED, fd, 0);
     nifp = NETMAP_IF(p, nmr.nr_offset);
     ring = NETMAP_TXRING(nifp, 0);
     fds.fd = fd;
     fds.events = POLLOUT;
     for (;;) {
         poll(&fds, 1, -1);
         /* fill every slot the kernel has made available */
         while (!nm_ring_empty(ring)) {
             i = ring->cur;
             buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
             ... prepare packet in buf ...
             ring->slot[i].len = ... packet length ...
             ring->head = ring->cur = nm_ring_next(ring, i);
         }
     }
 }
 .Ed
 .Ss HELPER FUNCTIONS
 A simple receiver can be implemented using the helper functions:
 .Bd -literal -compact
 #define NETMAP_WITH_LIBS
 #include <net/netmap_user.h>
 \&...
 void receiver(void)
 {
     struct nm_desc *d;
     struct pollfd fds;
     u_char *buf;
     struct nm_pkthdr h;
     ...
     d = nm_open("netmap:ix0", NULL, 0, 0);
     fds.fd = NETMAP_FD(d);
     fds.events = POLLIN;
     for (;;) {
         poll(&fds, 1, -1);
         while ( (buf = nm_nextpkt(d, &h)) )
             consume_pkt(buf, h.len);
     }
     nm_close(d);
 }
 .Ed
 .Ss ZERO-COPY FORWARDING
 Since physical interfaces share the same memory region,
 it is possible to do packet forwarding between ports
 by swapping buffers.
 The buffer from the transmit ring is used
 to replenish the receive ring:
 .Bd -literal -compact
     uint32_t tmp;
     struct netmap_slot *src, *dst;
     ...
     src = &rxr->slot[rxr->cur];
     dst = &txr->slot[txr->cur];
     tmp = dst->buf_idx;
     dst->buf_idx = src->buf_idx;
     dst->len = src->len;
     dst->flags = NS_BUF_CHANGED;
     src->buf_idx = tmp;
     src->flags = NS_BUF_CHANGED;
     rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
     txr->head = txr->cur = nm_ring_next(txr, txr->cur);
     ...
 .Ed
 .Ss ACCESSING THE HOST STACK
 The host stack is for all practical purposes just a regular ring pair,
 which you can access with the netmap API, e.g., with
 .Dl nm_open("netmap:eth0^", ... ) ;
 All packets that the host would send to an interface in
 .Nm
 mode end up in the RX ring, whereas all packets queued to the
 TX ring are sent up to the host stack.
 .Ss VALE SWITCH
 A simple way to test the performance of a
 .Nm VALE
 switch is to attach a sender and a receiver to it,
 e.g., running the following in two different terminals:
 .Dl pkt-gen -i vale1:a -f rx # receiver
 .Dl pkt-gen -i vale1:b -f tx # sender
 The same example can be used to test netmap pipes, by simply
 changing port names, e.g.,
 .Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
 .Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
 .Pp
 The following command attaches an interface and the host stack
 to a switch:
-.Dl vale-ctl -h vale2:em0
+.Dl valectl -h vale2:em0
 Other
 .Nm
 clients attached to the same switch can now communicate
 with the network card or the host.
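As a companion to the ZERO-COPY FORWARDING example earlier in this section, the buffer swap can be exercised without a netmap device by modelling the rings in plain C. The sketch below is an illustration only: the model_* types and functions and the MODEL_BUF_CHANGED constant are hypothetical stand-ins for struct netmap_ring, struct netmap_slot, nm_ring_next() and NS_BUF_CHANGED, and the space calculation assumes that the slots available to the user are those from cur up to, but not including, tail.

```c
#include <stdint.h>

#define MODEL_BUF_CHANGED 0x0001	/* stand-in for NS_BUF_CHANGED */
#define MODEL_SLOTS 8			/* arbitrary ring size for the model */

struct model_slot {
	uint32_t buf_idx;	/* buffer index */
	uint16_t len;		/* packet length */
	uint16_t flags;
};

struct model_ring {
	uint32_t num_slots;
	uint32_t head, cur, tail;
	struct model_slot slot[MODEL_SLOTS];
};

/* Next index modulo the ring size; no power-of-two assumption. */
static uint32_t
model_ring_next(const struct model_ring *r, uint32_t i)
{
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots between cur (inclusive) and tail (exclusive). */
static uint32_t
model_ring_space(const struct model_ring *r)
{
	int32_t n = (int32_t)r->tail - (int32_t)r->cur;

	if (n < 0)
		n += (int32_t)r->num_slots;
	return (uint32_t)n;
}

/*
 * Forward up to limit packets from rxr to txr by exchanging buffer
 * indexes, so that no payload is copied. Returns the number moved.
 */
static uint32_t
model_forward(struct model_ring *rxr, struct model_ring *txr, uint32_t limit)
{
	uint32_t n = 0;

	while (n < limit && model_ring_space(rxr) > 0 &&
	    model_ring_space(txr) > 0) {
		struct model_slot *src = &rxr->slot[rxr->cur];
		struct model_slot *dst = &txr->slot[txr->cur];
		uint32_t tmp = dst->buf_idx;

		dst->buf_idx = src->buf_idx;	/* tx slot takes the packet */
		dst->len = src->len;
		dst->flags = MODEL_BUF_CHANGED;
		src->buf_idx = tmp;		/* rx slot gets a free buffer */
		src->flags = MODEL_BUF_CHANGED;
		rxr->head = rxr->cur = model_ring_next(rxr, rxr->cur);
		txr->head = txr->cur = model_ring_next(txr, txr->cur);
		n++;
	}
	return n;
}
```

Bounding the loop by the space available in both rings keeps head and cur from moving past tail on either side, which is the invariant the ring description relies on. In a real program the kernel would only act on the swapped slots at the next NIOCTXSYNC/NIOCRXSYNC, poll() or select().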
.Sh SEE ALSO .Xr vale 4 , -.Xr vale-ctl 4 , +.Xr valectl 8 , .Xr bridge 8 , .Xr lb 8 , .Xr nmreplay 8 , .Xr pkt-gen 8 .Pp .Pa http://info.iet.unipi.it/~luigi/netmap/ .Pp Luigi Rizzo, Revisiting network I/O APIs: the netmap framework, Communications of the ACM, 55 (3), pp.45-51, March 2012 .Pp Luigi Rizzo, netmap: a novel framework for fast packet I/O, Usenix ATC'12, June 2012, Boston .Pp Luigi Rizzo, Giuseppe Lettieri, VALE, a switched ethernet for virtual machines, ACM CoNEXT'12, December 2012, Nice .Pp Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione, Speeding up packet I/O in virtual machines, ACM/IEEE ANCS'13, October 2013, San Jose .Sh AUTHORS .An -nosplit The .Nm framework has been originally designed and implemented at the Universita` di Pisa in 2011 by .An Luigi Rizzo , and further extended with help from .An Matteo Landi , .An Gaetano Catalli , .An Giuseppe Lettieri , and .An Vincenzo Maffione . .Pp .Nm and .Nm VALE have been funded by the European Commission within FP7 Projects CHANGE (257422) and OPENLAB (287581). .Sh CAVEATS No matter how fast the CPU and OS are, achieving line rate on 10G and faster interfaces requires hardware with sufficient performance. Several NICs are unable to sustain line rate with small packet sizes. Insufficient PCIe or memory bandwidth can also cause reduced performance. .Pp Another frequent reason for low performance is the use of flow control on the link: a slow receiver can limit the transmit speed. Be sure to disable flow control when running high speed experiments. .Ss SPECIAL NIC FEATURES .Nm is orthogonal to some NIC features such as multiqueue, schedulers, packet filters. .Pp Multiple transmit and receive rings are supported natively and can be configured with ordinary OS tools, such as .Xr ethtool 8 or device-specific sysctl variables. The same goes for Receive Packet Steering (RPS) and filtering of incoming traffic. 
.Pp .Nm .Em does not use features such as .Em checksum offloading , TCP segmentation offloading , .Em encryption , VLAN encapsulation/decapsulation , etc. When using netmap to exchange packets with the host stack, make sure to disable these features. Index: stable/12/sys/dev/netmap/netmap_bdg.c =================================================================== --- stable/12/sys/dev/netmap/netmap_bdg.c (revision 354470) +++ stable/12/sys/dev/netmap/netmap_bdg.c (revision 354471) @@ -1,1649 +1,1649 @@ /* * Copyright (C) 2013-2016 Universita` di Pisa * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * This module implements the VALE switch for netmap --- VALE SWITCH --- NMG_LOCK() serializes all modifications to switches and ports. 
A switch cannot be deleted until all ports are gone. For each switch, an SX lock (RWlock on Linux) protects deletion of ports. When configuring or deleting a port, the lock is acquired in exclusive mode (after holding NMG_LOCK). When forwarding, the lock is acquired in shared mode (without NMG_LOCK). The lock is held throughout the entire forwarding cycle, during which the thread may incur a page fault. Hence it is important that sleepable shared locks are used. On the rx ring, the per-port lock is grabbed initially to reserve a number of slots in the ring, then the lock is released, packets are copied from source to destination, and then the lock is acquired again and the receive ring is updated. (A similar thing is done on the tx ring for NIC and host stack ports attached to the switch) */ /* * OS-specific code that is used only within this file. * Other OS-specific code that must be accessed by drivers * is present in netmap_kern.h */ #if defined(__FreeBSD__) #include /* prerequisite */ __FBSDID("$FreeBSD$"); #include #include #include /* defines used in kernel.h */ #include /* types used in module initialization */ #include /* cdevsw struct, UID, GID */ #include #include /* struct socket */ #include #include #include #include /* sockaddrs */ #include #include #include #include #include /* BIOCIMMEDIATE */ #include /* bus_dmamap_* */ #include #include #include #elif defined(linux) #include "bsd_glue.h" #elif defined(__APPLE__) #warning OSX support is only partial #include "osx_glue.h" #elif defined(_WIN32) #include "win_glue.h" #else #error Unsupported platform #endif /* unsupported */ /* * common headers */ #include #include #include #include const char* netmap_bdg_name(struct netmap_vp_adapter *vp) { struct nm_bridge *b = vp->na_bdg; if (b == NULL) return NULL; return b->bdg_basename; } #ifndef CONFIG_NET_NS /* * XXX in principle nm_bridges could be created dynamically * Right now we have a static array and deletions are protected * by an exclusive lock.
*/ struct nm_bridge *nm_bridges; #endif /* !CONFIG_NET_NS */ static int nm_is_id_char(const char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || (c == '_'); } /* Validate the name of a bdg port and return the * position of the ":" character. */ static int nm_bdg_name_validate(const char *name, size_t prefixlen) { int colon_pos = -1; int i; if (!name || strlen(name) < prefixlen) { return -1; } for (i = 0; i < NM_BDG_IFNAMSIZ && name[i]; i++) { if (name[i] == ':') { colon_pos = i; break; } else if (!nm_is_id_char(name[i])) { return -1; } } if (strlen(name) - colon_pos > IFNAMSIZ) { /* interface name too long */ return -1; } return colon_pos; } /* * locate a bridge among the existing ones. * MUST BE CALLED WITH NMG_LOCK() * * a ':' in the name terminates the bridge name. Otherwise, just NM_NAME. * We assume that this is called with a name of at least NM_NAME chars. */ struct nm_bridge * nm_find_bridge(const char *name, int create, struct netmap_bdg_ops *ops) { int i, namelen; struct nm_bridge *b = NULL, *bridges; u_int num_bridges; NMG_LOCK_ASSERT(); netmap_bns_getbridges(&bridges, &num_bridges); namelen = nm_bdg_name_validate(name, (ops != NULL ? strlen(ops->name) : 0)); if (namelen < 0) { nm_prerr("invalid bridge name %s", name ? 
name : NULL); return NULL; } /* lookup the name, remember empty slot if there is one */ for (i = 0; i < num_bridges; i++) { struct nm_bridge *x = bridges + i; if ((x->bdg_flags & NM_BDG_ACTIVE) + x->bdg_active_ports == 0) { if (create && b == NULL) b = x; /* record empty slot */ } else if (x->bdg_namelen != namelen) { continue; } else if (strncmp(name, x->bdg_basename, namelen) == 0) { nm_prdis("found '%.*s' at %d", namelen, name, i); b = x; break; } } if (i == num_bridges && b) { /* name not found, can create entry */ /* initialize the bridge */ nm_prdis("create new bridge %s with ports %d", b->bdg_basename, b->bdg_active_ports); b->ht = nm_os_malloc(sizeof(struct nm_hash_ent) * NM_BDG_HASH); if (b->ht == NULL) { nm_prerr("failed to allocate hash table"); return NULL; } strncpy(b->bdg_basename, name, namelen); b->bdg_namelen = namelen; b->bdg_active_ports = 0; for (i = 0; i < NM_BDG_MAXPORTS; i++) b->bdg_port_index[i] = i; /* set the default function */ b->bdg_ops = b->bdg_saved_ops = *ops; b->private_data = b->ht; b->bdg_flags = 0; NM_BNS_GET(b); } return b; } int netmap_bdg_free(struct nm_bridge *b) { if ((b->bdg_flags & NM_BDG_ACTIVE) + b->bdg_active_ports != 0) { return EBUSY; } nm_prdis("marking bridge %s as free", b->bdg_basename); nm_os_free(b->ht); memset(&b->bdg_ops, 0, sizeof(b->bdg_ops)); memset(&b->bdg_saved_ops, 0, sizeof(b->bdg_saved_ops)); b->bdg_flags = 0; NM_BNS_PUT(b); return 0; } /* Called by external kernel modules (e.g., Openvswitch). * to modify the private data previously given to regops(). * 'name' may be just bridge's name (including ':' if it * is not just NM_BDG_NAME). * Called without NMG_LOCK. 
*/ int netmap_bdg_update_private_data(const char *name, bdg_update_private_data_fn_t callback, void *callback_data, void *auth_token) { void *private_data = NULL; struct nm_bridge *b; int error = 0; NMG_LOCK(); b = nm_find_bridge(name, 0 /* don't create */, NULL); if (!b) { error = EINVAL; goto unlock_update_priv; } if (!nm_bdg_valid_auth_token(b, auth_token)) { error = EACCES; goto unlock_update_priv; } BDG_WLOCK(b); private_data = callback(b->private_data, callback_data, &error); b->private_data = private_data; BDG_WUNLOCK(b); unlock_update_priv: NMG_UNLOCK(); return error; } /* remove from bridge b the ports in slots hw and sw * (sw can be -1 if not needed) */ void netmap_bdg_detach_common(struct nm_bridge *b, int hw, int sw) { int s_hw = hw, s_sw = sw; int i, lim =b->bdg_active_ports; uint32_t *tmp = b->tmp_bdg_port_index; /* New algorithm: make a copy of bdg_port_index; lookup NA(ifp)->bdg_port and SWNA(ifp)->bdg_port in the array of bdg_port_index, replacing them with entries from the bottom of the array; decrement bdg_active_ports; acquire BDG_WLOCK() and copy back the array. */ if (netmap_debug & NM_DEBUG_BDG) nm_prinf("detach %d and %d (lim %d)", hw, sw, lim); /* make a copy of the list of active ports, update it, * and then copy back within BDG_WLOCK(). 
*/ memcpy(b->tmp_bdg_port_index, b->bdg_port_index, sizeof(b->tmp_bdg_port_index)); for (i = 0; (hw >= 0 || sw >= 0) && i < lim; ) { if (hw >= 0 && tmp[i] == hw) { nm_prdis("detach hw %d at %d", hw, i); lim--; /* point to last active port */ tmp[i] = tmp[lim]; /* swap with i */ tmp[lim] = hw; /* now this is inactive */ hw = -1; } else if (sw >= 0 && tmp[i] == sw) { nm_prdis("detach sw %d at %d", sw, i); lim--; tmp[i] = tmp[lim]; tmp[lim] = sw; sw = -1; } else { i++; } } if (hw >= 0 || sw >= 0) { nm_prerr("delete failed hw %d sw %d, should panic...", hw, sw); } BDG_WLOCK(b); if (b->bdg_ops.dtor) b->bdg_ops.dtor(b->bdg_ports[s_hw]); b->bdg_ports[s_hw] = NULL; if (s_sw >= 0) { b->bdg_ports[s_sw] = NULL; } memcpy(b->bdg_port_index, b->tmp_bdg_port_index, sizeof(b->tmp_bdg_port_index)); b->bdg_active_ports = lim; BDG_WUNLOCK(b); nm_prdis("now %d active ports", lim); netmap_bdg_free(b); } /* nm_bdg_ctl callback for VALE ports */ int netmap_vp_bdg_ctl(struct nmreq_header *hdr, struct netmap_adapter *na) { struct netmap_vp_adapter *vpna = (struct netmap_vp_adapter *)na; struct nm_bridge *b = vpna->na_bdg; if (hdr->nr_reqtype == NETMAP_REQ_VALE_ATTACH) { return 0; /* nothing to do */ } if (b) { netmap_set_all_rings(na, 0 /* disable */); netmap_bdg_detach_common(b, vpna->bdg_port, -1); vpna->na_bdg = NULL; netmap_set_all_rings(na, 1 /* enable */); } /* I took a reference just for attach */ netmap_adapter_put(na); return 0; } int netmap_default_bdg_attach(const char *name, struct netmap_adapter *na, struct nm_bridge *b) { return NM_NEED_BWRAP; } /* Try to get a reference to a netmap adapter attached to a VALE switch. * If the adapter is found (or is created), this function returns 0, a * non-NULL pointer is returned into *na, and the caller holds a * reference to the adapter. * If an adapter is not found, then no reference is grabbed and the * function returns an error code, or 0 if there is just a VALE prefix * mismatch.
Therefore the caller holds a reference when * (*na != NULL && return == 0). */ int netmap_get_bdg_na(struct nmreq_header *hdr, struct netmap_adapter **na, struct netmap_mem_d *nmd, int create, struct netmap_bdg_ops *ops) { char *nr_name = hdr->nr_name; const char *ifname; struct ifnet *ifp = NULL; int error = 0; struct netmap_vp_adapter *vpna, *hostna = NULL; struct nm_bridge *b; uint32_t i, j; uint32_t cand = NM_BDG_NOPORT, cand2 = NM_BDG_NOPORT; int needed; *na = NULL; /* default return value */ /* first try to see if this is a bridge port. */ NMG_LOCK_ASSERT(); if (strncmp(nr_name, ops->name, strlen(ops->name) - 1)) { return 0; /* no error, but no VALE prefix */ } b = nm_find_bridge(nr_name, create, ops); if (b == NULL) { nm_prdis("no bridges available for '%s'", nr_name); return (create ? ENOMEM : ENXIO); } if (strlen(nr_name) < b->bdg_namelen) /* impossible */ panic("x"); /* Now we are sure that name starts with the bridge's name, * lookup the port in the bridge. We need to scan the entire * list. It is not important to hold a WLOCK on the bridge * during the search because NMG_LOCK already guarantees * that there are no other possible writers. */ /* lookup in the local list of ports */ for (j = 0; j < b->bdg_active_ports; j++) { i = b->bdg_port_index[j]; vpna = b->bdg_ports[i]; nm_prdis("checking %s", vpna->up.name); if (!strcmp(vpna->up.name, nr_name)) { netmap_adapter_get(&vpna->up); nm_prdis("found existing if %s refs %d", nr_name) *na = &vpna->up; return 0; } } /* not found, should we create it? 
*/ if (!create) return ENXIO; /* yes we should, see if we have space to attach entries */ needed = 2; /* in some cases we only need 1 */ if (b->bdg_active_ports + needed >= NM_BDG_MAXPORTS) { nm_prerr("bridge full %d, cannot create new port", b->bdg_active_ports); return ENOMEM; } /* record the next two ports available, but do not allocate yet */ cand = b->bdg_port_index[b->bdg_active_ports]; cand2 = b->bdg_port_index[b->bdg_active_ports + 1]; nm_prdis("+++ bridge %s port %s used %d avail %d %d", b->bdg_basename, ifname, b->bdg_active_ports, cand, cand2); /* * try see if there is a matching NIC with this name * (after the bridge's name) */ ifname = nr_name + b->bdg_namelen + 1; ifp = ifunit_ref(ifname); if (!ifp) { /* Create an ephemeral virtual port. * This block contains all the ephemeral-specific logic. */ if (hdr->nr_reqtype != NETMAP_REQ_REGISTER) { error = EINVAL; goto out; } /* bdg_netmap_attach creates a struct netmap_adapter */ error = b->bdg_ops.vp_create(hdr, NULL, nmd, &vpna); if (error) { if (netmap_debug & NM_DEBUG_BDG) nm_prerr("error %d", error); goto out; } /* shortcut - we can skip get_hw_na(), * ownership check and nm_bdg_attach() */ } else { struct netmap_adapter *hw; /* the vale:nic syntax is only valid for some commands */ switch (hdr->nr_reqtype) { case NETMAP_REQ_VALE_ATTACH: case NETMAP_REQ_VALE_DETACH: case NETMAP_REQ_VALE_POLLING_ENABLE: case NETMAP_REQ_VALE_POLLING_DISABLE: break; /* ok */ default: error = EINVAL; goto out; } error = netmap_get_hw_na(ifp, nmd, &hw); if (error || hw == NULL) goto out; /* host adapter might not be created */ error = hw->nm_bdg_attach(nr_name, hw, b); if (error == NM_NEED_BWRAP) { error = b->bdg_ops.bwrap_attach(nr_name, hw); } if (error) goto out; vpna = hw->na_vp; hostna = hw->na_hostvp; if (hdr->nr_reqtype == NETMAP_REQ_VALE_ATTACH) { /* Check if we need to skip the host rings. 
*/ struct nmreq_vale_attach *areq = (struct nmreq_vale_attach *)(uintptr_t)hdr->nr_body; if (areq->reg.nr_mode != NR_REG_NIC_SW) { hostna = NULL; } } } BDG_WLOCK(b); vpna->bdg_port = cand; nm_prdis("NIC %p to bridge port %d", vpna, cand); /* bind the port to the bridge (virtual ports are not active) */ b->bdg_ports[cand] = vpna; vpna->na_bdg = b; b->bdg_active_ports++; if (hostna != NULL) { /* also bind the host stack to the bridge */ b->bdg_ports[cand2] = hostna; hostna->bdg_port = cand2; hostna->na_bdg = b; b->bdg_active_ports++; nm_prdis("host %p to bridge port %d", hostna, cand2); } nm_prdis("if %s refs %d", ifname, vpna->up.na_refcount); BDG_WUNLOCK(b); *na = &vpna->up; netmap_adapter_get(*na); out: if (ifp) if_rele(ifp); return error; } int nm_is_bwrap(struct netmap_adapter *na) { return na->nm_register == netmap_bwrap_reg; } struct nm_bdg_polling_state; struct nm_bdg_kthread { struct nm_kctx *nmk; u_int qfirst; u_int qlast; struct nm_bdg_polling_state *bps; }; struct nm_bdg_polling_state { bool configured; bool stopped; struct netmap_bwrap_adapter *bna; uint32_t mode; u_int qfirst; u_int qlast; u_int cpu_from; u_int ncpus; struct nm_bdg_kthread *kthreads; }; static void netmap_bwrap_polling(void *data) { struct nm_bdg_kthread *nbk = data; struct netmap_bwrap_adapter *bna; u_int qfirst, qlast, i; struct netmap_kring **kring0, *kring; if (!nbk) return; qfirst = nbk->qfirst; qlast = nbk->qlast; bna = nbk->bps->bna; kring0 = NMR(bna->hwna, NR_RX); for (i = qfirst; i < qlast; i++) { kring = kring0[i]; kring->nm_notify(kring, 0); } } static int nm_bdg_create_kthreads(struct nm_bdg_polling_state *bps) { struct nm_kctx_cfg kcfg; int i, j; bps->kthreads = nm_os_malloc(sizeof(struct nm_bdg_kthread) * bps->ncpus); if (bps->kthreads == NULL) return ENOMEM; bzero(&kcfg, sizeof(kcfg)); kcfg.worker_fn = netmap_bwrap_polling; for (i = 0; i < bps->ncpus; i++) { struct nm_bdg_kthread *t = bps->kthreads + i; int all = (bps->ncpus == 1 && bps->mode == 
NETMAP_POLLING_MODE_SINGLE_CPU); int affinity = bps->cpu_from + i; t->bps = bps; t->qfirst = all ? bps->qfirst /* must be 0 */: affinity; t->qlast = all ? bps->qlast : t->qfirst + 1; if (netmap_verbose) nm_prinf("kthread %d a:%u qf:%u ql:%u", i, affinity, t->qfirst, t->qlast); kcfg.type = i; kcfg.worker_private = t; t->nmk = nm_os_kctx_create(&kcfg, NULL); if (t->nmk == NULL) { goto cleanup; } nm_os_kctx_worker_setaff(t->nmk, affinity); } return 0; cleanup: /* destroy only the contexts created before the failure (indices 0..i-1) */ for (j = 0; j < i; j++) { struct nm_bdg_kthread *t = bps->kthreads + j; nm_os_kctx_destroy(t->nmk); } nm_os_free(bps->kthreads); return EFAULT; } /* A variant of ptnetmap_start_kthreads() */ static int nm_bdg_polling_start_kthreads(struct nm_bdg_polling_state *bps) { int error, i, j; if (!bps) { nm_prerr("polling is not configured"); return EFAULT; } bps->stopped = false; for (i = 0; i < bps->ncpus; i++) { struct nm_bdg_kthread *t = bps->kthreads + i; error = nm_os_kctx_worker_start(t->nmk); if (error) { nm_prerr("error in nm_kthread_start(): %d", error); goto cleanup; } } return 0; cleanup: /* stop only the workers started before the failure (indices 0..i-1) */ for (j = 0; j < i; j++) { struct nm_bdg_kthread *t = bps->kthreads + j; nm_os_kctx_worker_stop(t->nmk); } bps->stopped = true; return error; } static void nm_bdg_polling_stop_delete_kthreads(struct nm_bdg_polling_state *bps) { int i; if (!bps) return; for (i = 0; i < bps->ncpus; i++) { struct nm_bdg_kthread *t = bps->kthreads + i; nm_os_kctx_worker_stop(t->nmk); nm_os_kctx_destroy(t->nmk); } bps->stopped = true; } static int get_polling_cfg(struct nmreq_vale_polling *req, struct netmap_adapter *na, struct nm_bdg_polling_state *bps) { unsigned int avail_cpus, core_from; unsigned int qfirst, qlast; uint32_t i = req->nr_first_cpu_id; uint32_t req_cpus = req->nr_num_polling_cpus; avail_cpus = nm_os_ncpus(); if (req_cpus == 0) { nm_prerr("req_cpus must be > 0"); return EINVAL; } else if (req_cpus >= avail_cpus) { nm_prerr("Cannot use all the CPUs in the system"); return EINVAL; } if (req->nr_mode == NETMAP_POLLING_MODE_MULTI_CPU)
{ /* Use a separate core for each ring. If nr_num_polling_cpus>1 * more consecutive rings are polled. * For example, if nr_first_cpu_id=2 and nr_num_polling_cpus=2, * ring 2 and 3 are polled by core 2 and 3, respectively. */ if (i + req_cpus > nma_get_nrings(na, NR_RX)) { nm_prerr("Rings %u-%u not in range (have %d rings)", i, i + req_cpus, nma_get_nrings(na, NR_RX)); return EINVAL; } qfirst = i; qlast = qfirst + req_cpus; core_from = qfirst; } else if (req->nr_mode == NETMAP_POLLING_MODE_SINGLE_CPU) { /* Poll all the rings using a core specified by nr_first_cpu_id. * the number of cores must be 1. */ if (req_cpus != 1) { nm_prerr("ncpus must be 1 for NETMAP_POLLING_MODE_SINGLE_CPU " "(was %d)", req_cpus); return EINVAL; } qfirst = 0; qlast = nma_get_nrings(na, NR_RX); core_from = i; } else { nm_prerr("Invalid polling mode"); return EINVAL; } bps->mode = req->nr_mode; bps->qfirst = qfirst; bps->qlast = qlast; bps->cpu_from = core_from; bps->ncpus = req_cpus; nm_prinf("%s qfirst %u qlast %u cpu_from %u ncpus %u", req->nr_mode == NETMAP_POLLING_MODE_MULTI_CPU ? 
"MULTI" : "SINGLE", qfirst, qlast, core_from, req_cpus); return 0; } static int nm_bdg_ctl_polling_start(struct nmreq_vale_polling *req, struct netmap_adapter *na) { struct nm_bdg_polling_state *bps; struct netmap_bwrap_adapter *bna; int error; bna = (struct netmap_bwrap_adapter *)na; if (bna->na_polling_state) { nm_prerr("ERROR adapter already in polling mode"); return EFAULT; } bps = nm_os_malloc(sizeof(*bps)); if (!bps) return ENOMEM; bps->configured = false; bps->stopped = true; if (get_polling_cfg(req, na, bps)) { nm_os_free(bps); return EINVAL; } if (nm_bdg_create_kthreads(bps)) { nm_os_free(bps); return EFAULT; } bps->configured = true; bna->na_polling_state = bps; bps->bna = bna; /* disable interrupts if possible */ nma_intr_enable(bna->hwna, 0); /* start kthread now */ error = nm_bdg_polling_start_kthreads(bps); if (error) { nm_prerr("ERROR nm_bdg_polling_start_kthread()"); nm_os_free(bps->kthreads); nm_os_free(bps); bna->na_polling_state = NULL; nma_intr_enable(bna->hwna, 1); } return error; } static int nm_bdg_ctl_polling_stop(struct netmap_adapter *na) { struct netmap_bwrap_adapter *bna = (struct netmap_bwrap_adapter *)na; struct nm_bdg_polling_state *bps; if (!bna->na_polling_state) { nm_prerr("ERROR adapter is not in polling mode"); return EFAULT; } bps = bna->na_polling_state; nm_bdg_polling_stop_delete_kthreads(bna->na_polling_state); bps->configured = false; nm_os_free(bps); bna->na_polling_state = NULL; /* reenable interrupts */ nma_intr_enable(bna->hwna, 1); return 0; } int nm_bdg_polling(struct nmreq_header *hdr) { struct nmreq_vale_polling *req = (struct nmreq_vale_polling *)(uintptr_t)hdr->nr_body; struct netmap_adapter *na = NULL; int error = 0; NMG_LOCK(); error = netmap_get_vale_na(hdr, &na, NULL, /*create=*/0); if (na && !error) { if (!nm_is_bwrap(na)) { error = EOPNOTSUPP; } else if (hdr->nr_reqtype == NETMAP_BDG_POLLING_ON) { error = nm_bdg_ctl_polling_start(req, na); if (!error) netmap_adapter_get(na); } else { error = 
nm_bdg_ctl_polling_stop(na); if (!error) netmap_adapter_put(na); } netmap_adapter_put(na); } else if (!na && !error) { /* Not VALE port. */ error = EINVAL; } NMG_UNLOCK(); return error; } /* Called by external kernel modules (e.g., Openvswitch). * to set configure/lookup/dtor functions of a VALE instance. * Register callbacks to the given bridge. 'name' may be just * bridge's name (including ':' if it is not just NM_BDG_NAME). * * Called without NMG_LOCK. */ int netmap_bdg_regops(const char *name, struct netmap_bdg_ops *bdg_ops, void *private_data, void *auth_token) { struct nm_bridge *b; int error = 0; NMG_LOCK(); b = nm_find_bridge(name, 0 /* don't create */, NULL); if (!b) { error = ENXIO; goto unlock_regops; } if (!nm_bdg_valid_auth_token(b, auth_token)) { error = EACCES; goto unlock_regops; } BDG_WLOCK(b); if (!bdg_ops) { /* resetting the bridge */ bzero(b->ht, sizeof(struct nm_hash_ent) * NM_BDG_HASH); b->bdg_ops = b->bdg_saved_ops; b->private_data = b->ht; } else { /* modifying the bridge */ b->private_data = private_data; #define nm_bdg_override(m) if (bdg_ops->m) b->bdg_ops.m = bdg_ops->m nm_bdg_override(lookup); nm_bdg_override(config); nm_bdg_override(dtor); nm_bdg_override(vp_create); nm_bdg_override(bwrap_attach); #undef nm_bdg_override } BDG_WUNLOCK(b); unlock_regops: NMG_UNLOCK(); return error; } int netmap_bdg_config(struct nm_ifreq *nr) { struct nm_bridge *b; int error = EINVAL; NMG_LOCK(); b = nm_find_bridge(nr->nifr_name, 0, NULL); if (!b) { NMG_UNLOCK(); return error; } NMG_UNLOCK(); /* Don't call config() with NMG_LOCK() held */ BDG_RLOCK(b); if (b->bdg_ops.config != NULL) error = b->bdg_ops.config(nr); BDG_RUNLOCK(b); return error; } /* nm_register callback for VALE ports */ int netmap_vp_reg(struct netmap_adapter *na, int onoff) { struct netmap_vp_adapter *vpna = (struct netmap_vp_adapter*)na; /* persistent ports may be put in netmap mode * before being attached to a bridge */ if (vpna->na_bdg) BDG_WLOCK(vpna->na_bdg); if (onoff) { 
netmap_krings_mode_commit(na, onoff); if (na->active_fds == 0) na->na_flags |= NAF_NETMAP_ON; /* XXX on FreeBSD, persistent VALE ports should also * toggle IFCAP_NETMAP in na->ifp (2014-03-16) */ } else { if (na->active_fds == 0) na->na_flags &= ~NAF_NETMAP_ON; netmap_krings_mode_commit(na, onoff); } if (vpna->na_bdg) BDG_WUNLOCK(vpna->na_bdg); return 0; } /* rxsync code used by VALE ports nm_rxsync callback and also * internally by the bwrap */ static int netmap_vp_rxsync_locked(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct netmap_ring *ring = kring->ring; u_int nm_i, lim = kring->nkr_num_slots - 1; u_int head = kring->rhead; int n; if (head > lim) { nm_prerr("ouch dangerous reset!!!"); n = netmap_ring_reinit(kring); goto done; } /* First part, import newly received packets. */ /* actually nothing to do here, they are already in the kring */ /* Second part, skip past packets that userspace has released. */ nm_i = kring->nr_hwcur; if (nm_i != head) { /* consistency check, but nothing really important here */ for (n = 0; likely(nm_i != head); n++) { struct netmap_slot *slot = &ring->slot[nm_i]; void *addr = NMB(na, slot); if (addr == NETMAP_BUF_BASE(kring->na)) { /* bad buf */ nm_prerr("bad buffer index %d, ignore ?", slot->buf_idx); } slot->flags &= ~NS_BUF_CHANGED; nm_i = nm_next(nm_i, lim); } kring->nr_hwcur = head; } n = 0; done: return n; } /* * nm_rxsync callback for VALE ports * user process reading from a VALE switch. * Already protected against concurrent calls from userspace, * but we must acquire the queue's lock to protect against * writers on the same queue. */ int netmap_vp_rxsync(struct netmap_kring *kring, int flags) { int n; mtx_lock(&kring->q_lock); n = netmap_vp_rxsync_locked(kring, flags); mtx_unlock(&kring->q_lock); return n; } int netmap_bwrap_attach(const char *nr_name, struct netmap_adapter *hwna, struct netmap_bdg_ops *ops) { return ops->bwrap_attach(nr_name, hwna); } /* Bridge wrapper code (bwrap).
* This is used to connect a non-VALE-port netmap_adapter (hwna) to a * VALE switch. * The main task is to swap the meaning of tx and rx rings to match the * expectations of the VALE switch code (see nm_bdg_flush). * * The bwrap works by interposing a netmap_bwrap_adapter between the * rest of the system and the hwna. The netmap_bwrap_adapter looks like * a netmap_vp_adapter to the rest of the system, but, internally, it * translates all callbacks to what the hwna expects. * * Note that we have to intercept callbacks coming from two sides: * * - callbacks coming from the netmap module are intercepted by * passing around the netmap_bwrap_adapter instead of the hwna * * - callbacks coming from outside of the netmap module only know * about the hwna. This, however, only happens in interrupt * handlers, where only the hwna->nm_notify callback is called. * What the bwrap does is to overwrite the hwna->nm_notify callback * with its own netmap_bwrap_intr_notify. * XXX This assumes that the hwna->nm_notify callback was the * standard netmap_notify(), as is the case for NIC adapters. * Any additional action performed by hwna->nm_notify will not be * performed by netmap_bwrap_intr_notify. * * Additionally, the bwrap can optionally attach the host ring pair * of the wrapped adapter to a different port of the switch. */ static void netmap_bwrap_dtor(struct netmap_adapter *na) { struct netmap_bwrap_adapter *bna = (struct netmap_bwrap_adapter*)na; struct netmap_adapter *hwna = bna->hwna; struct nm_bridge *b = bna->up.na_bdg, *bh = bna->host.na_bdg; if (bna->host.up.nm_mem) netmap_mem_put(bna->host.up.nm_mem); if (b) { netmap_bdg_detach_common(b, bna->up.bdg_port, (bh ? bna->host.bdg_port : -1)); } nm_prdis("na %p", na); na->ifp = NULL; bna->host.up.ifp = NULL; hwna->na_vp = bna->saved_na_vp; hwna->na_hostvp = NULL; hwna->na_private = NULL; hwna->na_flags &= ~NAF_BUSY; netmap_adapter_put(hwna); } /* * Intr callback for NICs connected to a bridge.
* Simply ignore tx interrupts (maybe we could try to recover space ?) * and pass received packets from nic to the bridge. * * XXX TODO check locking: this is called from the interrupt * handler so we should make sure that the interface is not * disconnected while passing down an interrupt. * * Note, no user process can access this NIC or the host stack. * The only part of the ring that is significant are the slots, * and head/cur/tail are set from the kring as needed * (part as a receive ring, part as a transmit ring). * * callback that overwrites the hwna notify callback. * Packets come from the outside or from the host stack and are put on an * hwna rx ring. * The bridge wrapper then sends the packets through the bridge. */ static int netmap_bwrap_intr_notify(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct netmap_bwrap_adapter *bna = na->na_private; struct netmap_kring *bkring; struct netmap_vp_adapter *vpna = &bna->up; u_int ring_nr = kring->ring_id; int ret = NM_IRQ_COMPLETED; int error; if (netmap_debug & NM_DEBUG_RXINTR) nm_prinf("%s %s 0x%x", na->name, kring->name, flags); bkring = vpna->up.tx_rings[ring_nr]; /* make sure the ring is not disabled */ if (nm_kr_tryget(kring, 0 /* can't sleep */, NULL)) { return EIO; } if (netmap_debug & NM_DEBUG_RXINTR) nm_prinf("%s head %d cur %d tail %d", na->name, kring->rhead, kring->rcur, kring->rtail); /* simulate a user wakeup on the rx ring * fetch packets that have arrived. */ error = kring->nm_sync(kring, 0); if (error) goto put_out; if (kring->nr_hwcur == kring->nr_hwtail) { if (netmap_verbose) nm_prlim(1, "interrupt with no packets on %s", kring->name); goto put_out; } /* new packets are kring->rcur to kring->nr_hwtail, and the bkring * had hwcur == bkring->rhead. So advance bkring->rhead to kring->nr_hwtail * to push all packets out. 
*/ bkring->rhead = bkring->rcur = kring->nr_hwtail; bkring->nm_sync(bkring, flags); /* mark all buffers as released on this ring */ kring->rhead = kring->rcur = kring->rtail = kring->nr_hwtail; /* another call to actually release the buffers */ error = kring->nm_sync(kring, 0); /* The second rxsync may have further advanced hwtail. If this happens, * return NM_IRQ_RESCHED, otherwise just return NM_IRQ_COMPLETED. */ if (kring->rcur != kring->nr_hwtail) { ret = NM_IRQ_RESCHED; } put_out: nm_kr_put(kring); return error ? error : ret; } /* nm_register callback for bwrap */ int netmap_bwrap_reg(struct netmap_adapter *na, int onoff) { struct netmap_bwrap_adapter *bna = (struct netmap_bwrap_adapter *)na; struct netmap_adapter *hwna = bna->hwna; struct netmap_vp_adapter *hostna = &bna->host; int error, i; enum txrx t; nm_prdis("%s %s", na->name, onoff ? "on" : "off"); if (onoff) { /* netmap_do_regif has been called on the bwrap na. * We need to pass the information about the * memory allocator down to the hwna before * putting it in netmap mode */ hwna->na_lut = na->na_lut; if (hostna->na_bdg) { /* if the host rings have been attached to switch, * we need to copy the memory allocator information * in the hostna also */ hostna->up.na_lut = na->na_lut; } } /* pass down the pending ring state information */ for_rx_tx(t) { for (i = 0; i < netmap_all_rings(na, t); i++) { NMR(hwna, nm_txrx_swap(t))[i]->nr_pending_mode = NMR(na, t)[i]->nr_pending_mode; } } /* forward the request to the hwna */ error = hwna->nm_register(hwna, onoff); if (error) return error; /* copy up the current ring state information */ for_rx_tx(t) { for (i = 0; i < netmap_all_rings(na, t); i++) { struct netmap_kring *kring = NMR(hwna, nm_txrx_swap(t))[i]; NMR(na, t)[i]->nr_mode = kring->nr_mode; } } /* impersonate a netmap_vp_adapter */ netmap_vp_reg(na, onoff); if (hostna->na_bdg) netmap_vp_reg(&hostna->up, onoff); if (onoff) { u_int i; /* intercept the hwna nm_notify callback on the hw rings */ for (i = 0;
i < hwna->num_rx_rings; i++) { hwna->rx_rings[i]->save_notify = hwna->rx_rings[i]->nm_notify; hwna->rx_rings[i]->nm_notify = netmap_bwrap_intr_notify; } i = hwna->num_rx_rings; /* for safety */ /* save the host ring notify unconditionally */ for (; i < netmap_real_rings(hwna, NR_RX); i++) { hwna->rx_rings[i]->save_notify = hwna->rx_rings[i]->nm_notify; if (hostna->na_bdg) { /* also intercept the host ring notify */ hwna->rx_rings[i]->nm_notify = netmap_bwrap_intr_notify; na->tx_rings[i]->nm_sync = na->nm_txsync; } } if (na->active_fds == 0) na->na_flags |= NAF_NETMAP_ON; } else { u_int i; if (na->active_fds == 0) na->na_flags &= ~NAF_NETMAP_ON; /* reset all notify callbacks (including host ring) */ for (i = 0; i < netmap_all_rings(hwna, NR_RX); i++) { hwna->rx_rings[i]->nm_notify = hwna->rx_rings[i]->save_notify; hwna->rx_rings[i]->save_notify = NULL; } hwna->na_lut.lut = NULL; hwna->na_lut.plut = NULL; hwna->na_lut.objtotal = 0; hwna->na_lut.objsize = 0; /* pass ownership of the netmap rings to the hwna */ for_rx_tx(t) { for (i = 0; i < netmap_all_rings(na, t); i++) { NMR(na, t)[i]->ring = NULL; } } /* reset the number of host rings to default */ for_rx_tx(t) { nma_set_host_nrings(hwna, t, 1); } } return 0; } /* nm_config callback for bwrap */ static int netmap_bwrap_config(struct netmap_adapter *na, struct nm_config_info *info) { struct netmap_bwrap_adapter *bna = (struct netmap_bwrap_adapter *)na; struct netmap_adapter *hwna = bna->hwna; int error; /* Forward the request to the hwna. It may happen that nobody * registered hwna yet, so netmap_mem_get_lut() may have not * been called yet. 
*/ error = netmap_mem_get_lut(hwna->nm_mem, &hwna->na_lut); if (error) return error; netmap_update_config(hwna); /* swap the results and propagate */ info->num_tx_rings = hwna->num_rx_rings; info->num_tx_descs = hwna->num_rx_desc; info->num_rx_rings = hwna->num_tx_rings; info->num_rx_descs = hwna->num_tx_desc; info->rx_buf_maxsize = hwna->rx_buf_maxsize; return 0; } /* nm_krings_create callback for bwrap */ int netmap_bwrap_krings_create_common(struct netmap_adapter *na) { struct netmap_bwrap_adapter *bna = (struct netmap_bwrap_adapter *)na; struct netmap_adapter *hwna = bna->hwna; struct netmap_adapter *hostna = &bna->host.up; int i, error = 0; enum txrx t; /* also create the hwna krings */ error = hwna->nm_krings_create(hwna); if (error) { return error; } /* increment the usage counter for all the hwna krings */ for_rx_tx(t) { for (i = 0; i < netmap_all_rings(hwna, t); i++) { NMR(hwna, t)[i]->users++; } } /* now create the actual rings */ error = netmap_mem_rings_create(hwna); if (error) { goto err_dec_users; } /* cross-link the netmap rings * The original number of rings comes from hwna, * rx rings on one side equals tx rings on the other. */ for_rx_tx(t) { enum txrx r = nm_txrx_swap(t); /* swap NR_TX <-> NR_RX */ for (i = 0; i < netmap_all_rings(hwna, r); i++) { NMR(na, t)[i]->nkr_num_slots = NMR(hwna, r)[i]->nkr_num_slots; NMR(na, t)[i]->ring = NMR(hwna, r)[i]->ring; } } if (na->na_flags & NAF_HOST_RINGS) { /* the hostna rings are the host rings of the bwrap. 
* The corresponding krings must point back to the * hostna */ hostna->tx_rings = &na->tx_rings[na->num_tx_rings]; hostna->rx_rings = &na->rx_rings[na->num_rx_rings]; for_rx_tx(t) { for (i = 0; i < nma_get_nrings(hostna, t); i++) { NMR(hostna, t)[i]->na = hostna; } } } return 0; err_dec_users: for_rx_tx(t) { for (i = 0; i < netmap_all_rings(hwna, t); i++) { NMR(hwna, t)[i]->users--; } } hwna->nm_krings_delete(hwna); return error; } void netmap_bwrap_krings_delete_common(struct netmap_adapter *na) { struct netmap_bwrap_adapter *bna = (struct netmap_bwrap_adapter *)na; struct netmap_adapter *hwna = bna->hwna; enum txrx t; int i; nm_prdis("%s", na->name); /* decrement the usage counter for all the hwna krings */ for_rx_tx(t) { for (i = 0; i < netmap_all_rings(hwna, t); i++) { NMR(hwna, t)[i]->users--; } } /* delete any netmap rings that are no longer needed */ netmap_mem_rings_delete(hwna); hwna->nm_krings_delete(hwna); } /* notify method for the bridge-->hwna direction */ int netmap_bwrap_notify(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct netmap_bwrap_adapter *bna = na->na_private; struct netmap_adapter *hwna = bna->hwna; u_int ring_n = kring->ring_id; u_int lim = kring->nkr_num_slots - 1; struct netmap_kring *hw_kring; int error; nm_prdis("%s: na %s hwna %s", (kring ? kring->name : "NULL!"), (na ? na->name : "NULL!"), (hwna ? 
hwna->name : "NULL!")); hw_kring = hwna->tx_rings[ring_n]; if (nm_kr_tryget(hw_kring, 0, NULL)) { return ENXIO; } /* first step: simulate a user wakeup on the rx ring */ netmap_vp_rxsync(kring, flags); nm_prdis("%s[%d] PRE rx(c%3d t%3d l%3d) ring(h%3d c%3d t%3d) tx(c%3d ht%3d t%3d)", na->name, ring_n, kring->nr_hwcur, kring->nr_hwtail, kring->nkr_hwlease, kring->rhead, kring->rcur, kring->rtail, hw_kring->nr_hwcur, hw_kring->nr_hwtail, hw_kring->rtail); /* second step: the new packets are sent on the tx ring * (which is actually the same ring) */ hw_kring->rhead = hw_kring->rcur = kring->nr_hwtail; error = hw_kring->nm_sync(hw_kring, flags); if (error) goto put_out; /* third step: now we are back the rx ring */ /* claim ownership on all hw owned bufs */ kring->rhead = kring->rcur = nm_next(hw_kring->nr_hwtail, lim); /* skip past reserved slot */ /* fourth step: the user goes to sleep again, causing another rxsync */ netmap_vp_rxsync(kring, flags); nm_prdis("%s[%d] PST rx(c%3d t%3d l%3d) ring(h%3d c%3d t%3d) tx(c%3d ht%3d t%3d)", na->name, ring_n, kring->nr_hwcur, kring->nr_hwtail, kring->nkr_hwlease, kring->rhead, kring->rcur, kring->rtail, hw_kring->nr_hwcur, hw_kring->nr_hwtail, hw_kring->rtail); put_out: nm_kr_put(hw_kring); return error ? error : NM_IRQ_COMPLETED; } /* nm_bdg_ctl callback for the bwrap. - * Called on bridge-attach and detach, as an effect of vale-ctl -[ahd]. + * Called on bridge-attach and detach, as an effect of valectl -[ahd]. * On attach, it needs to provide a fake netmap_priv_d structure and * perform a netmap_do_regif() on the bwrap. This will put both the * bwrap and the hwna in netmap mode, with the netmap rings shared * and cross linked. Moroever, it will start intercepting interrupts * directed to hwna. 
*/ static int netmap_bwrap_bdg_ctl(struct nmreq_header *hdr, struct netmap_adapter *na) { struct netmap_priv_d *npriv; struct netmap_bwrap_adapter *bna = (struct netmap_bwrap_adapter*)na; int error = 0; if (hdr->nr_reqtype == NETMAP_REQ_VALE_ATTACH) { struct nmreq_vale_attach *req = (struct nmreq_vale_attach *)(uintptr_t)hdr->nr_body; if (req->reg.nr_ringid != 0 || (req->reg.nr_mode != NR_REG_ALL_NIC && req->reg.nr_mode != NR_REG_NIC_SW)) { /* We only support attaching all the NIC rings * and/or the host stack. */ return EINVAL; } if (NETMAP_OWNED_BY_ANY(na)) { return EBUSY; } if (bna->na_kpriv) { /* nothing to do */ return 0; } npriv = netmap_priv_new(); if (npriv == NULL) return ENOMEM; npriv->np_ifp = na->ifp; /* let the priv destructor release the ref */ error = netmap_do_regif(npriv, na, req->reg.nr_mode, req->reg.nr_ringid, req->reg.nr_flags); if (error) { netmap_priv_delete(npriv); return error; } bna->na_kpriv = npriv; na->na_flags |= NAF_BUSY; } else { if (na->active_fds == 0) /* not registered */ return EINVAL; netmap_priv_delete(bna->na_kpriv); bna->na_kpriv = NULL; na->na_flags &= ~NAF_BUSY; } return error; } /* attach a bridge wrapper to the 'real' device */ int netmap_bwrap_attach_common(struct netmap_adapter *na, struct netmap_adapter *hwna) { struct netmap_bwrap_adapter *bna; struct netmap_adapter *hostna = NULL; int error = 0; enum txrx t; /* make sure the NIC is not already in use */ if (NETMAP_OWNED_BY_ANY(hwna)) { nm_prerr("NIC %s busy, cannot attach to bridge", hwna->name); return EBUSY; } bna = (struct netmap_bwrap_adapter *)na; /* make bwrap ifp point to the real ifp */ na->ifp = hwna->ifp; if_ref(na->ifp); na->na_private = bna; /* fill the ring data for the bwrap adapter with rx/tx meanings * swapped. The real cross-linking will be done during register, * when all the krings will have been created. 
*/ for_rx_tx(t) { enum txrx r = nm_txrx_swap(t); /* swap NR_TX <-> NR_RX */ nma_set_nrings(na, t, nma_get_nrings(hwna, r)); nma_set_ndesc(na, t, nma_get_ndesc(hwna, r)); } na->nm_dtor = netmap_bwrap_dtor; na->nm_config = netmap_bwrap_config; na->nm_bdg_ctl = netmap_bwrap_bdg_ctl; na->pdev = hwna->pdev; na->nm_mem = netmap_mem_get(hwna->nm_mem); na->virt_hdr_len = hwna->virt_hdr_len; na->rx_buf_maxsize = hwna->rx_buf_maxsize; bna->hwna = hwna; netmap_adapter_get(hwna); hwna->na_private = bna; /* weak reference */ bna->saved_na_vp = hwna->na_vp; hwna->na_vp = &bna->up; bna->up.up.na_vp = &(bna->up); if (hwna->na_flags & NAF_HOST_RINGS) { if (hwna->na_flags & NAF_SW_ONLY) na->na_flags |= NAF_SW_ONLY; na->na_flags |= NAF_HOST_RINGS; hostna = &bna->host.up; /* limit the number of host rings to that of hw */ nm_bound_var(&hostna->num_tx_rings, 1, 1, nma_get_nrings(hwna, NR_TX), NULL); nm_bound_var(&hostna->num_rx_rings, 1, 1, nma_get_nrings(hwna, NR_RX), NULL); snprintf(hostna->name, sizeof(hostna->name), "%s^", na->name); hostna->ifp = hwna->ifp; for_rx_tx(t) { enum txrx r = nm_txrx_swap(t); u_int nr = nma_get_nrings(hostna, t); nma_set_nrings(hostna, t, nr); nma_set_host_nrings(na, t, nr); if (nma_get_host_nrings(hwna, t) < nr) { nma_set_host_nrings(hwna, t, nr); } nma_set_ndesc(hostna, t, nma_get_ndesc(hwna, r)); } // hostna->nm_txsync = netmap_bwrap_host_txsync; // hostna->nm_rxsync = netmap_bwrap_host_rxsync; hostna->nm_mem = netmap_mem_get(na->nm_mem); hostna->na_private = bna; hostna->na_vp = &bna->up; na->na_hostvp = hwna->na_hostvp = hostna->na_hostvp = &bna->host; hostna->na_flags = NAF_BUSY; /* prevent NIOCREGIF */ hostna->rx_buf_maxsize = hwna->rx_buf_maxsize; } if (hwna->na_flags & NAF_MOREFRAG) na->na_flags |= NAF_MOREFRAG; nm_prdis("%s<->%s txr %d txd %d rxr %d rxd %d", na->name, ifp->if_xname, na->num_tx_rings, na->num_tx_desc, na->num_rx_rings, na->num_rx_desc); error = netmap_attach_common(na); if (error) { goto err_put; } hwna->na_flags |= NAF_BUSY; 
return 0; err_put: hwna->na_vp = hwna->na_hostvp = NULL; netmap_adapter_put(hwna); return error; } struct nm_bridge * netmap_init_bridges2(u_int n) { int i; struct nm_bridge *b; b = nm_os_malloc(sizeof(struct nm_bridge) * n); if (b == NULL) return NULL; for (i = 0; i < n; i++) BDG_RWINIT(&b[i]); return b; } void netmap_uninit_bridges2(struct nm_bridge *b, u_int n) { int i; if (b == NULL) return; for (i = 0; i < n; i++) BDG_RWDESTROY(&b[i]); nm_os_free(b); } int netmap_init_bridges(void) { #ifdef CONFIG_NET_NS return netmap_bns_register(); #else nm_bridges = netmap_init_bridges2(NM_BRIDGES); if (nm_bridges == NULL) return ENOMEM; return 0; #endif } void netmap_uninit_bridges(void) { #ifdef CONFIG_NET_NS netmap_bns_unregister(); #else netmap_uninit_bridges2(nm_bridges, NM_BRIDGES); #endif } Index: stable/12/sys/net/netmap_legacy.h =================================================================== --- stable/12/sys/net/netmap_legacy.h (revision 354470) +++ stable/12/sys/net/netmap_legacy.h (revision 354471) @@ -1,257 +1,257 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (C) 2011-2014 Matteo Landi, Luigi Rizzo. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``S IS''AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #ifndef _NET_NETMAP_LEGACY_H_ #define _NET_NETMAP_LEGACY_H_ /* * $FreeBSD$ * * ioctl names and related fields * * NIOCTXSYNC, NIOCRXSYNC synchronize tx or rx queues, * whose identity is set in NIOCREGIF through nr_ringid. * These are non blocking and take no argument. * * NIOCGINFO takes a struct ifreq, the interface name is the input, * the outputs are number of queues and number of descriptor * for each queue (useful to set number of threads etc.). * The info returned is only advisory and may change before * the interface is bound to a file descriptor. * * NIOCREGIF takes an interface name within a struct nmre, * and activates netmap mode on the interface (if possible). * * The argument to NIOCGINFO/NIOCREGIF overlays struct ifreq so we * can pass it down to other NIC-related ioctls. * * The actual argument (struct nmreq) has a number of options to request * different functions. * The following are used in NIOCREGIF when nr_cmd == 0: * * nr_name (in) * The name of the port (em0, valeXXX:YYY, etc.) * limited to IFNAMSIZ for backward compatibility. * * nr_version (in/out) * Must match NETMAP_API as used in the kernel, error otherwise. * Always returns the desired value on output. * * nr_tx_slots, nr_tx_slots, nr_tx_rings, nr_rx_rings (in/out) * On input, non-zero values may be used to reconfigure the port * according to the requested values, but this is not guaranteed. * On output the actual values in use are reported. 
* * nr_ringid (in) * Indicates how rings should be bound to the file descriptors. * If nr_flags != 0, then the low bits (in NETMAP_RING_MASK) * are used to indicate the ring number, and nr_flags specifies * the actual rings to bind. NETMAP_NO_TX_POLL is unaffected. * * NOTE: THE FOLLOWING (nr_flags == 0) IS DEPRECATED: * If nr_flags == 0, NETMAP_HW_RING and NETMAP_SW_RING control * the binding as follows: * 0 (default) binds all physical rings * NETMAP_HW_RING | ring number binds a single ring pair * NETMAP_SW_RING binds only the host tx/rx rings * * NETMAP_NO_TX_POLL can be OR-ed to make select()/poll() push * packets on tx rings only if POLLOUT is set. * The default is to push any pending packet. * * NETMAP_DO_RX_POLL can be OR-ed to make select()/poll() release * packets on rx rings also when POLLIN is NOT set. * The default is to touch the rx ring only with POLLIN. * Note that this is the opposite of TX because it * reflects the common usage. * * NOTE: NETMAP_PRIV_MEM IS DEPRECATED, use nr_arg2 instead. * NETMAP_PRIV_MEM is set on return for ports that do not use * the global memory allocator. * This information is not significant and applications * should look at the region id in nr_arg2 * * nr_flags is the recommended mode to indicate which rings should * be bound to a file descriptor. Values are NR_REG_* * * nr_arg1 (in) Reserved. * * nr_arg2 (in/out) The identity of the memory region used. * On input, 0 means the system decides autonomously, * other values may try to select a specific region. * On return the actual value is reported. * Region '1' is the global allocator, normally shared * by all interfaces. Other values are private regions. * If two ports the same region zero-copy is possible. * * nr_arg3 (in/out) number of extra buffers to be allocated. * * * * nr_cmd (in) if non-zero indicates a special command: * NETMAP_BDG_ATTACH and nr_name = vale*:ifname * attaches the NIC to the switch; nr_ringid specifies - * which rings to use. 
Used by vale-ctl -a ... + * which rings to use. Used by valectl -a ... * nr_arg1 = NETMAP_BDG_HOST also attaches the host port - * as in vale-ctl -h ... + * as in valectl -h ... * * NETMAP_BDG_DETACH and nr_name = vale*:ifname * disconnects a previously attached NIC. - * Used by vale-ctl -d ... + * Used by valectl -d ... * * NETMAP_BDG_LIST * list the configuration of VALE switches. * * NETMAP_BDG_VNET_HDR * Set the virtio-net header length used by the client * of a VALE switch port. * * NETMAP_BDG_NEWIF * create a persistent VALE port with name nr_name. - * Used by vale-ctl -n ... + * Used by valectl -n ... * * NETMAP_BDG_DELIF - * delete a persistent VALE port. Used by vale-ctl -d ... + * delete a persistent VALE port. Used by valectl -d ... * * nr_arg1, nr_arg2, nr_arg3 (in/out) command specific * * * */ /* * struct nmreq overlays a struct ifreq (just the name) */ struct nmreq { char nr_name[IFNAMSIZ]; uint32_t nr_version; /* API version */ uint32_t nr_offset; /* nifp offset in the shared region */ uint32_t nr_memsize; /* size of the shared region */ uint32_t nr_tx_slots; /* slots in tx rings */ uint32_t nr_rx_slots; /* slots in rx rings */ uint16_t nr_tx_rings; /* number of tx rings */ uint16_t nr_rx_rings; /* number of rx rings */ uint16_t nr_ringid; /* ring(s) we care about */ #define NETMAP_HW_RING 0x4000 /* single NIC ring pair */ #define NETMAP_SW_RING 0x2000 /* only host ring pair */ #define NETMAP_RING_MASK 0x0fff /* the ring number */ #define NETMAP_NO_TX_POLL 0x1000 /* no automatic txsync on poll */ #define NETMAP_DO_RX_POLL 0x8000 /* DO automatic rxsync on poll */ uint16_t nr_cmd; #define NETMAP_BDG_ATTACH 1 /* attach the NIC */ #define NETMAP_BDG_DETACH 2 /* detach the NIC */ #define NETMAP_BDG_REGOPS 3 /* register bridge callbacks */ #define NETMAP_BDG_LIST 4 /* get bridge's info */ #define NETMAP_BDG_VNET_HDR 5 /* set the port virtio-net-hdr length */ #define NETMAP_BDG_NEWIF 6 /* create a virtual port */ #define NETMAP_BDG_DELIF 7 /* destroy a 
virtual port */ #define NETMAP_PT_HOST_CREATE 8 /* create ptnetmap kthreads */ #define NETMAP_PT_HOST_DELETE 9 /* delete ptnetmap kthreads */ #define NETMAP_BDG_POLLING_ON 10 /* delete polling kthread */ #define NETMAP_BDG_POLLING_OFF 11 /* delete polling kthread */ #define NETMAP_VNET_HDR_GET 12 /* get the port virtio-net-hdr length */ uint16_t nr_arg1; /* extra arguments */ #define NETMAP_BDG_HOST 1 /* nr_arg1 value for NETMAP_BDG_ATTACH */ uint16_t nr_arg2; /* id of the memory allocator */ uint32_t nr_arg3; /* req. extra buffers in NIOCREGIF */ uint32_t nr_flags; /* specify NR_REG_* mode and other flags */ #define NR_REG_MASK 0xf /* to extract NR_REG_* mode from nr_flags */ /* various modes, extends nr_ringid */ uint32_t spare2[1]; }; #ifdef _WIN32 /* * Windows does not have _IOWR(). _IO(), _IOW() and _IOR() are defined * in ws2def.h but not sure if they are in the form we need. * We therefore redefine them in a convenient way to use for DeviceIoControl * signatures. */ #undef _IO // ws2def.h #define _WIN_NM_IOCTL_TYPE 40000 #define _IO(_c, _n) CTL_CODE(_WIN_NM_IOCTL_TYPE, ((_n) + 0x800) , \ METHOD_BUFFERED, FILE_ANY_ACCESS ) #define _IO_direct(_c, _n) CTL_CODE(_WIN_NM_IOCTL_TYPE, ((_n) + 0x800) , \ METHOD_OUT_DIRECT, FILE_ANY_ACCESS ) #define _IOWR(_c, _n, _s) _IO(_c, _n) /* We havesome internal sysctl in addition to the externally visible ones */ #define NETMAP_MMAP _IO_direct('i', 160) // note METHOD_OUT_DIRECT #define NETMAP_POLL _IO('i', 162) /* and also two setsockopt for sysctl emulation */ #define NETMAP_SETSOCKOPT _IO('i', 140) #define NETMAP_GETSOCKOPT _IO('i', 141) /* These linknames are for the Netmap Core Driver */ #define NETMAP_NT_DEVICE_NAME L"\\Device\\NETMAP" #define NETMAP_DOS_DEVICE_NAME L"\\DosDevices\\netmap" /* Definition of a structure used to pass a virtual address within an IOCTL */ typedef struct _MEMORY_ENTRY { PVOID pUsermodeVirtualAddress; } MEMORY_ENTRY, *PMEMORY_ENTRY; typedef struct _POLL_REQUEST_DATA { int events; int timeout; 
int revents; } POLL_REQUEST_DATA; #endif /* _WIN32 */ /* * Opaque structure that is passed to an external kernel * module via ioctl(fd, NIOCCONFIG, req) for a user-owned * bridge port (at this point ephemeral VALE interface). */ #define NM_IFRDATA_LEN 256 struct nm_ifreq { char nifr_name[IFNAMSIZ]; char data[NM_IFRDATA_LEN]; }; /* * FreeBSD uses the size value embedded in the _IOWR to determine * how much to copy in/out. So we need it to match the actual * data structure we pass. We put some spares in the structure * to ease compatibility with other versions */ #define NIOCGINFO _IOWR('i', 145, struct nmreq) /* return IF info */ #define NIOCREGIF _IOWR('i', 146, struct nmreq) /* interface register */ #define NIOCCONFIG _IOWR('i',150, struct nm_ifreq) /* for ext. modules */ #endif /* _NET_NETMAP_LEGACY_H_ */ Index: stable/12/tools/tools/netmap/vale-ctl.c =================================================================== --- stable/12/tools/tools/netmap/vale-ctl.c (revision 354470) +++ stable/12/tools/tools/netmap/vale-ctl.c (nonexistent) @@ -1,282 +0,0 @@ -/* - * Copyright (C) 2013-2014 Michio Honda. All rights reserved. - * - * Redistribution and use in source and binary forms, with or without - * modification, are permitted provided that the following conditions - * are met: - * 1. Redistributions of source code must retain the above copyright - * notice, this list of conditions and the following disclaimer. - * 2. Redistributions in binary form must reproduce the above copyright - * notice, this list of conditions and the following disclaimer in the - * documentation and/or other materials provided with the distribution. - * - * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND - * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE - * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE - * ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE - * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL - * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS - * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) - * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT - * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY - * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF - * SUCH DAMAGE. - */ - -/* $FreeBSD$ */ - -#define NETMAP_WITH_LIBS -#include -#include - -#include -#include -#include /* PRI* macros */ -#include /* strcmp */ -#include /* open */ -#include /* close */ -#include /* ioctl */ -#include -#include /* apple needs sockaddr */ -#include /* ifreq */ -#include /* basename */ -#include /* atoi, free */ - -/* XXX cut and paste from pkt-gen.c because I'm not sure whether this - * program may include nm_util.h - */ -void parse_nmr_config(const char* conf, struct nmreq *nmr) -{ - char *w, *tok; - int i, v; - - nmr->nr_tx_rings = nmr->nr_rx_rings = 0; - nmr->nr_tx_slots = nmr->nr_rx_slots = 0; - if (conf == NULL || ! 
*conf) - return; - w = strdup(conf); - for (i = 0, tok = strtok(w, ","); tok; i++, tok = strtok(NULL, ",")) { - v = atoi(tok); - switch (i) { - case 0: - nmr->nr_tx_slots = nmr->nr_rx_slots = v; - break; - case 1: - nmr->nr_rx_slots = v; - break; - case 2: - nmr->nr_tx_rings = nmr->nr_rx_rings = v; - break; - case 3: - nmr->nr_rx_rings = v; - break; - default: - D("ignored config: %s", tok); - break; - } - } - D("txr %d txd %d rxr %d rxd %d", - nmr->nr_tx_rings, nmr->nr_tx_slots, - nmr->nr_rx_rings, nmr->nr_rx_slots); - free(w); -} - -static int -bdg_ctl(const char *name, int nr_cmd, int nr_arg, char *nmr_config, int nr_arg2) -{ - struct nmreq nmr; - int error = 0; - int fd = open("/dev/netmap", O_RDWR); - - if (fd == -1) { - D("Unable to open /dev/netmap"); - return -1; - } - - bzero(&nmr, sizeof(nmr)); - nmr.nr_version = NETMAP_API; - if (name != NULL) /* might be NULL */ - strncpy(nmr.nr_name, name, sizeof(nmr.nr_name)-1); - nmr.nr_cmd = nr_cmd; - parse_nmr_config(nmr_config, &nmr); - nmr.nr_arg2 = nr_arg2; - - switch (nr_cmd) { - case NETMAP_BDG_DELIF: - case NETMAP_BDG_NEWIF: - error = ioctl(fd, NIOCREGIF, &nmr); - if (error == -1) { - ND("Unable to %s %s", nr_cmd == NETMAP_BDG_DELIF ? "delete":"create", name); - perror(name); - } else { - ND("Success to %s %s", nr_cmd == NETMAP_BDG_DELIF ? 
"delete":"create", name); - } - break; - case NETMAP_BDG_ATTACH: - case NETMAP_BDG_DETACH: - nmr.nr_flags = NR_REG_ALL_NIC; - if (nr_arg && nr_arg != NETMAP_BDG_HOST) { - nmr.nr_flags = NR_REG_NIC_SW; - nr_arg = 0; - } - nmr.nr_arg1 = nr_arg; - error = ioctl(fd, NIOCREGIF, &nmr); - if (error == -1) { - ND("Unable to %s %s to the bridge", nr_cmd == - NETMAP_BDG_DETACH?"detach":"attach", name); - perror(name); - } else - ND("Success to %s %s to the bridge", nr_cmd == - NETMAP_BDG_DETACH?"detach":"attach", name); - break; - - case NETMAP_BDG_LIST: - if (strlen(nmr.nr_name)) { /* name to bridge/port info */ - error = ioctl(fd, NIOCGINFO, &nmr); - if (error) { - ND("Unable to obtain info for %s", name); - perror(name); - } else - D("%s at bridge:%d port:%d", name, nmr.nr_arg1, - nmr.nr_arg2); - break; - } - - /* scan all the bridges and ports */ - nmr.nr_arg1 = nmr.nr_arg2 = 0; - for (; !ioctl(fd, NIOCGINFO, &nmr); nmr.nr_arg2++) { - D("bridge:%d port:%d %s", nmr.nr_arg1, nmr.nr_arg2, - nmr.nr_name); - nmr.nr_name[0] = '\0'; - } - - break; - - case NETMAP_BDG_POLLING_ON: - case NETMAP_BDG_POLLING_OFF: - /* We reuse nmreq fields as follows: - * nr_tx_slots: 0 and non-zero indicate REG_ALL_NIC - * REG_ONE_NIC, respectively. - * nr_rx_slots: CPU core index. This also indicates the - * first queue in the case of REG_ONE_NIC - * nr_tx_rings: (REG_ONE_NIC only) indicates the - * number of CPU cores or the last queue - */ - nmr.nr_flags |= nmr.nr_tx_slots ? - NR_REG_ONE_NIC : NR_REG_ALL_NIC; - nmr.nr_ringid = nmr.nr_rx_slots; - /* number of cores/rings */ - if (nmr.nr_flags == NR_REG_ALL_NIC) - nmr.nr_arg1 = 1; - else - nmr.nr_arg1 = nmr.nr_tx_rings; - - error = ioctl(fd, NIOCREGIF, &nmr); - if (!error) - D("polling on %s %s", nmr.nr_name, - nr_cmd == NETMAP_BDG_POLLING_ON ? - "started" : "stopped"); - else - D("polling on %s %s (err %d)", nmr.nr_name, - nr_cmd == NETMAP_BDG_POLLING_ON ? 
- "couldn't start" : "couldn't stop", error); - break; - - default: /* GINFO */ - nmr.nr_cmd = nmr.nr_arg1 = nmr.nr_arg2 = 0; - error = ioctl(fd, NIOCGINFO, &nmr); - if (error) { - ND("Unable to get if info for %s", name); - perror(name); - } else - D("%s: %d queues.", name, nmr.nr_rx_rings); - break; - } - close(fd); - return error; -} - -static void -usage(int errcode) -{ - fprintf(stderr, - "Usage:\n" - "vale-ctl arguments\n" - "\t-g interface interface name to get info\n" - "\t-d interface interface name to be detached\n" - "\t-a interface interface name to be attached\n" - "\t-h interface interface name to be attached with the host stack\n" - "\t-n interface interface name to be created\n" - "\t-r interface interface name to be deleted\n" - "\t-l list all or specified bridge's interfaces (default)\n" - "\t-C string ring/slot setting of an interface creating by -n\n" - "\t-p interface start polling. Additional -C x,y,z configures\n" - "\t\t x: 0 (REG_ALL_NIC) or 1 (REG_ONE_NIC),\n" - "\t\t y: CPU core id for ALL_NIC and core/ring for ONE_NIC\n" - "\t\t z: (ONE_NIC only) num of total cores/rings\n" - "\t-P interface stop polling\n" - "\t-m memid to use when creating a new interface\n"); - exit(errcode); -} - -int -main(int argc, char *argv[]) -{ - int ch, nr_cmd = 0, nr_arg = 0; - char *name = NULL, *nmr_config = NULL; - int nr_arg2 = 0; - - while ((ch = getopt(argc, argv, "d:a:h:g:l:n:r:C:p:P:m:")) != -1) { - if (ch != 'C' && ch != 'm') - name = optarg; /* default */ - switch (ch) { - default: - fprintf(stderr, "bad option %c %s", ch, optarg); - usage(-1); - break; - case 'd': - nr_cmd = NETMAP_BDG_DETACH; - break; - case 'a': - nr_cmd = NETMAP_BDG_ATTACH; - break; - case 'h': - nr_cmd = NETMAP_BDG_ATTACH; - nr_arg = NETMAP_BDG_HOST; - break; - case 'n': - nr_cmd = NETMAP_BDG_NEWIF; - break; - case 'r': - nr_cmd = NETMAP_BDG_DELIF; - break; - case 'g': - nr_cmd = 0; - break; - case 'l': - nr_cmd = NETMAP_BDG_LIST; - break; - case 'C': - nmr_config = 
strdup(optarg); - break; - case 'p': - nr_cmd = NETMAP_BDG_POLLING_ON; - break; - case 'P': - nr_cmd = NETMAP_BDG_POLLING_OFF; - break; - case 'm': - nr_arg2 = atoi(optarg); - break; - } - } - if (optind != argc) { - // fprintf(stderr, "optind %d argc %d\n", optind, argc); - usage(-1); - } - if (argc == 1) { - nr_cmd = NETMAP_BDG_LIST; - name = NULL; - } - return bdg_ctl(name, nr_cmd, nr_arg, nmr_config, nr_arg2) ? 1 : 0; -} Property changes on: stable/12/tools/tools/netmap/vale-ctl.c ___________________________________________________________________ Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Index: stable/12/tools/tools/netmap/vale-ctl.4 =================================================================== --- stable/12/tools/tools/netmap/vale-ctl.4 (revision 354470) +++ stable/12/tools/tools/netmap/vale-ctl.4 (nonexistent) @@ -1,163 +0,0 @@ -.\" Copyright (c) 2016 Michio Honda. -.\" All rights reserved. -.\" -.\" Redistribution and use in source and binary forms, with or without -.\" modification, are permitted provided that the following conditions -.\" are met: -.\" 1. Redistributions of source code must retain the above copyright -.\" notice, this list of conditions and the following disclaimer. -.\" 2. Redistributions in binary form must reproduce the above copyright -.\" notice, this list of conditions and the following disclaimer in the -.\" documentation and/or other materials provided with the distribution. -.\" -.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND -.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE -.\" ARE DISCLAIMED. 
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE -.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL -.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS -.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) -.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT -.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY -.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF -.\" SUCH DAMAGE. -.\" -.\" $FreeBSD$ -.\" -.Dd October 24, 2018 -.Dt VALE-CTL 4 -.Os -.Sh NAME -.Nm vale-ctl -.Nd manage VALE switches provided by netmap -.Sh SYNOPSIS -.Bk -words -.Bl -tag -width "vale-ctl" -.It Nm -.Op Fl g Ar valeSSS:PPP -.Op Fl a Ar valeSSS:interface -.Op Fl h Ar valeSSS:interface -.Op Fl d Ar valeSSS:interface -.Op Fl n Ar interface -.Op Fl r Ar interface -.Op Fl l Ar valeSSS:PPP -.Op Fl l -.Op Fl p Ar valeSSS:PPP -.Op Fl P Ar valeSSS:PPP -.Op Fl C Ar spec -.Op Fl m Ar memid -.El -.Ek -.Sh DESCRIPTION -.Nm -manages and inspects -.Xr vale 4 -switches, for instance attaching and detaching interfaces, creating -and deleting persistent VALE ports, or listing the existing switches -and their ports. -In the following, -.Ar valeSSS -is the name of a VALE switch, while -.Ar valeSSS:PPP -is the name of a VALE port of -.Ar valeSSS . -.Pp -When issued without options it lists all the existing switch ports together -with their internal bridge number and port number. -.Bl -tag -width Ds -.It Fl g Ar valeSSS:PPP -Print the number of receive rings of -.Ar valeSSS:PPP . -.It Fl a Ar valeSSS:interface -Attach -.Ar interface -(which must be an existing network interface) to -.Ar valeSSS -and detach it from the host stack. -.It Fl h Ar valeSSS:interface -Attach -.Ar interface -(which must be an existing network interface) to -.Ar valeSSS -while keeping it attached to the host stack. 
-More precisely, packets coming from -the host stack and directed to the interface will go through the switch, where -they can still reach the interface if the switch rules allow it. -Conversely, packets coming from the interface will go through the switch and, -if appropriate, will reach the host stack. -.It Fl d Ar valeSSS:interface -Detach -.Ar interface -from -.Ar valeSSS . -.It Fl n Ar interface -Create a new persistent VALE port with name -.Ar interface . -The name must be different from any other network interface -already present in the system. -.It Fl d Ar interface -Destroy the persistent VALE port with name -.Ar inteface . -.It Fl l Ar valeSSS:PPP -Show the internal bridge number and port number of the given switch port. -.It Fl p Ar valeSSS:PPP -Enable polling mode for -.Ar valeSSS:PPP . -In polling mode, a dedicated kernel thread is spawned to handle packets -received from -.Ar valeSSS:PPP -and push them into the switch. -The kernel thread busy waits on the switch port rather than relying on -interrupts or notifications. -Polling mode can only be used on physical NICs attached to a VALE switch. -.It Fl P Ar valeSSS:PPP -Disable polling mode for -.Ar valeSSS:PPP . -.It Fl C Ar x | Ar x,y | Ar x,y,z | Ar x,y,z,w -When used in conjunction with -.Fl n -it supplies the number of tx and rx rings and slots. -The full format with four numbers gives, in order, number of tx slots, number -of rx slots, number of tx rings and number of rx rings. -The form with three numbers uses -.Ar z -for both the number of tx and the number of rx rings. -The forms with less than two numbers use the default values for the number -of rings. -The form with two numbers supplies the numbers of tx and rx slots. -The form with only one number uses -.Ar x -for both the number of tx and the number of rx slots. -.Pp -When used in conjunction with -.Fl p -only the first three forms are used. -The first number may be either 0 or 1. 
-If 0, then all interface rings will be polled by a single thread, running -on the core id given by the second number (the third number, if present, -must be 1). -If the first number is 1, then the ring identified by the second number will -be polled by the core with the same id. -If a third number is given, then this is repeated for as many consecutive -rings and cores. -.It Fl m Ar memid -Used in conjunction with -.Fl n -supplies the netmap memory region identifier to use together with the newly -created persistent VALE port. -These ports use a private memory region by default. -Using this option you can let them share memory with other ports. -Pass 1 as -.Ar memid -to use the global memory region already shared by all -harware netmap ports. -.El -.Sh SEE ALSO -.Xr netmap 4 , -.Xr vale 4 -.Sh AUTHORS -.An -nosplit -.Nm -has been written by -.An Michio Honda -at NetApp. Property changes on: stable/12/tools/tools/netmap/vale-ctl.4 ___________________________________________________________________ Deleted: svn:eol-style ## -1 +0,0 ## -native \ No newline at end of property Deleted: svn:keywords ## -1 +0,0 ## -FreeBSD=%H \ No newline at end of property Deleted: svn:mime-type ## -1 +0,0 ## -text/plain \ No newline at end of property Index: stable/12/tools/tools/netmap/Makefile =================================================================== --- stable/12/tools/tools/netmap/Makefile (revision 354470) +++ stable/12/tools/tools/netmap/Makefile (revision 354471) @@ -1,39 +1,36 @@ # # $FreeBSD$ # # For multiple programs using a single source file each, # we can just define 'progs' and create custom targets. 
-PROGS = pkt-gen nmreplay bridge vale-ctl lb +PROGS = pkt-gen nmreplay bridge lb CLEANFILES = $(PROGS) *.o MAN= CFLAGS += -Werror -Wall CFLAGS += -Wextra LDFLAGS += -lpthread .ifdef WITHOUT_PCAP CFLAGS += -DNO_PCAP .else LDFLAGS += -lpcap .endif LDFLAGS += -lm # used by nmreplay .include .include all: $(PROGS) pkt-gen: pkt-gen.o $(CC) $(CFLAGS) -o pkt-gen pkt-gen.o $(LDFLAGS) bridge: bridge.o $(CC) $(CFLAGS) -o bridge bridge.o nmreplay: nmreplay.o $(CC) $(CFLAGS) -o nmreplay nmreplay.o $(LDFLAGS) - -vale-ctl: vale-ctl.o - $(CC) $(CFLAGS) -o vale-ctl vale-ctl.o lb: lb.o pkt_hash.o $(CC) $(CFLAGS) -o lb lb.o pkt_hash.o $(LDFLAGS) Index: stable/12/tools/tools/netmap/README =================================================================== --- stable/12/tools/tools/netmap/README (revision 354470) +++ stable/12/tools/tools/netmap/README (revision 354471) @@ -1,13 +1,11 @@ $FreeBSD$ This directory contains applications that use the netmap API pkt-gen a multi-function packet generator and traffic sink bridge a two-port jumper wire, also using the netmap API - vale-ctl the program to control and inspect VALE switches - lb an L3/L4 load balancer nmreplay a tool to playback a pcap file to a netmap port Index: stable/12/tools/tools/netmap/lb.8 =================================================================== --- stable/12/tools/tools/netmap/lb.8 (revision 354470) +++ stable/12/tools/tools/netmap/lb.8 (revision 354471) @@ -1,130 +1,130 @@ .\" Copyright (c) 2017 Corelight, Inc. and Universita` di Pisa .\" All rights reserved. .\" .\" Redistribution and use in source and binary forms, with or without .\" modification, are permitted provided that the following conditions .\" are met: .\" 1. Redistributions of source code must retain the above copyright .\" notice, this list of conditions and the following disclaimer. .\" 2. 
Redistributions in binary form must reproduce the above copyright .\" notice, this list of conditions and the following disclaimer in the .\" documentation and/or other materials provided with the distribution. .\" .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE .\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF .\" SUCH DAMAGE. .\" .\" $FreeBSD$ .\" -.Dd October 28, 2018 +.Dd October 26, 2019 .Dt LB 8 .Os .Sh NAME .Nm lb .Nd netmap-based load balancer .Sh SYNOPSIS .Bk -words .Bl -tag -width "lb" .It Nm .Op Fl i Ar port .Op Fl p Ar pipe-group .Op Fl B Ar extra-buffers .Op Fl b Ar batch-size .Op Fl w Ar wait-link .El .Ek .Sh DESCRIPTION .Nm reads packets from an input netmap port and sends them to a number of netmap pipes, trying to balance the packets received by each pipe. Packets belonging to the same connection will always be sent to the same pipe. .Pp Command line options are listed below. .Bl -tag -width Ds .It Fl i Ar port Name of a netmap port. It must be supplied exactly once to identify the input port. Any netmap port type (e.g., physical interface, VALE switch, pipe, monitor port) can be used. .It Fl p Ar name Ns Cm \&: Ns Ar number | number Add a new pipe group of the given number of pipes. The pipe group will receive all the packets read from the input port, balanced among the available pipes. 
The receiving ends of the pipes will be called .Dq Ar name Ns Em }0 to .Dq Ar name No Ns Em } Ns Aq Ar number No - 1 . The name is optional and defaults to the name of the input port (stripped of any netmap operator). If the name is omitted, the colon can also be omitted. .Pp This option can be supplied multiple times to define a sequence of pipe groups, each group receiving all the packets in turn. .Pp If no .Fl p option is given, a single group of two pipes with default name is assumed. .Pp It is allowed to use the same name for several groups. The pipe numbering in each group will start from where the previous identically-named group left off. .It Fl B Ar extra-buffers Try to reserve the given number of extra buffers. Extra buffers are shared among all pipes in all groups and work as an extension of the pipe rings. If a pipe ring is full for whatever reason, .Nm tries to use extra buffers before dropping any packets directed to that pipe. .Pp If all extra buffers are busy, some are stolen from the pipe with the longest backlog. This gives preference to newer packets over old ones, and prevents a stalled pipe from depleting the pool of extra buffers. .It Fl b Ar batch-size Maximum number of packets processed between two read operations from the input port. Higher values of batch-size improve performance by amortizing read operations, but increase the risk of filling up the port internal queues. .It Fl w Ar wait-link Indicates the number of seconds to wait before transmitting. It defaults to 2, and may be useful when talking to physical ports to let link negotiation complete before starting transmission. .El .Sh LIMITATIONS The group chaining assumes that the applications on the receiving end of the pipes are read-only: they must not modify the buffers or the pipe ring slots in any way. .Pp The group naming is currently implemented by creating a persistent VALE port with the given name. If .Nm does not exit cleanly the ports will not be removed.
Please use -.Xr vale-ctl 4 +.Xr valectl 8 to remove any stale persistent VALE port. .Sh SEE ALSO .Xr netmap 4 , .Xr bridge 8 , .Xr pkt-gen 8 .Pp .Pa http://info.iet.unipi.it/~luigi/netmap/ .Sh AUTHORS .An -nosplit .Nm has been written by .An Seth Hall at Corelight, USA. The facilities related to extra buffers and pipe groups have been added by .An Giuseppe Lettieri at University of Pisa, Italy, under contract by Corelight, USA. Index: stable/12/usr.sbin/Makefile =================================================================== --- stable/12/usr.sbin/Makefile (revision 354470) +++ stable/12/usr.sbin/Makefile (revision 354471) @@ -1,227 +1,228 @@ # From: @(#)Makefile 5.20 (Berkeley) 6/12/93 # $FreeBSD$ .include SUBDIR= adduser \ arp \ binmiscctl \ camdd \ cdcontrol \ chkgrp \ chown \ chroot \ ckdist \ clear_locks \ crashinfo \ cron \ ctladm \ ctld \ daemon \ dconschat \ devctl \ devinfo \ diskinfo \ dumpcis \ etcupdate \ extattr \ extattrctl \ fifolog \ fstyp \ fwcontrol \ getfmac \ getpmac \ gstat \ i2c \ ifmcstat \ iostat \ iovctl \ kldxref \ mailwrapper \ makefs \ memcontrol \ mergemaster \ mfiutil \ mixer \ mlxcontrol \ mountd \ mount_smbfs \ mpsutil \ mptutil \ mtest \ newsyslog \ nfscbd \ nfsd \ nfsdumpstate \ nfsrevoke \ nfsuserd \ nmtree \ nologin \ pciconf \ periodic \ pnfsdscopymr \ pnfsdsfile \ pnfsdskill \ powerd \ prometheus_sysctl_exporter \ pstat \ pw \ pwd_mkdb \ pwm \ quot \ rarpd \ rmt \ rpcbind \ rpc.lockd \ rpc.statd \ rpc.umntall \ rtprio \ rwhod \ service \ services_mkdb \ sesutil \ setfib \ setfmac \ setpmac \ smbmsg \ snapinfo \ spi \ spray \ syslogd \ sysrc \ tcpdrop \ tcpdump \ traceroute \ trim \ trpt \ tzsetup \ ugidfw \ + valectl \ vigr \ vipw \ wake \ watch \ watchdogd \ zic \ zonectl # NB: keep these sorted by MK_* knobs SUBDIR.${MK_ACCT}+= accton SUBDIR.${MK_ACCT}+= sa SUBDIR.${MK_AMD}+= amd SUBDIR.${MK_AUDIT}+= audit SUBDIR.${MK_AUDIT}+= auditd .if ${MK_OPENSSL} != "no" SUBDIR.${MK_AUDIT}+= auditdistd .endif SUBDIR.${MK_AUDIT}+= 
auditreduce SUBDIR.${MK_AUDIT}+= praudit SUBDIR.${MK_AUTHPF}+= authpf SUBDIR.${MK_AUTOFS}+= autofs SUBDIR.${MK_BLACKLIST}+= blacklistctl SUBDIR.${MK_BLACKLIST}+= blacklistd SUBDIR.${MK_BLUETOOTH}+= bluetooth SUBDIR.${MK_BOOTPARAMD}+= bootparamd SUBDIR.${MK_BSDINSTALL}+= bsdinstall SUBDIR.${MK_BSNMP}+= bsnmpd SUBDIR.${MK_CTM}+= ctm SUBDIR.${MK_CXGBETOOL}+= cxgbetool SUBDIR.${MK_DIALOG}+= bsdconfig SUBDIR.${MK_EFI}+= efivar efidp efibootmgr .if ${MK_OPENSSL} != "no" SUBDIR.${MK_EFI}+= uefisign .endif SUBDIR.${MK_FLOPPY}+= fdcontrol SUBDIR.${MK_FLOPPY}+= fdformat SUBDIR.${MK_FLOPPY}+= fdread SUBDIR.${MK_FLOPPY}+= fdwrite SUBDIR.${MK_FMTREE}+= fmtree SUBDIR.${MK_FREEBSD_UPDATE}+= freebsd-update SUBDIR.${MK_GSSAPI}+= gssd SUBDIR.${MK_GPIO}+= gpioctl SUBDIR.${MK_INET6}+= ip6addrctl SUBDIR.${MK_INET6}+= mld6query SUBDIR.${MK_INET6}+= ndp SUBDIR.${MK_INET6}+= rip6query SUBDIR.${MK_INET6}+= route6d SUBDIR.${MK_INET6}+= rrenumd SUBDIR.${MK_INET6}+= rtadvctl SUBDIR.${MK_INET6}+= rtadvd SUBDIR.${MK_INET6}+= rtsold SUBDIR.${MK_INET6}+= traceroute6 SUBDIR.${MK_INETD}+= inetd SUBDIR.${MK_IPFW}+= ipfwpcap SUBDIR.${MK_ISCSI}+= iscsid SUBDIR.${MK_JAIL}+= jail SUBDIR.${MK_JAIL}+= jexec SUBDIR.${MK_JAIL}+= jls # XXX MK_SYSCONS SUBDIR.${MK_LEGACY_CONSOLE}+= kbdcontrol SUBDIR.${MK_LEGACY_CONSOLE}+= kbdmap SUBDIR.${MK_LEGACY_CONSOLE}+= moused SUBDIR.${MK_LEGACY_CONSOLE}+= vidcontrol .if ${MK_LIBTHR} != "no" || ${MK_LIBPTHREAD} != "no" SUBDIR.${MK_PPP}+= pppctl SUBDIR.${MK_NS_CACHING}+= nscd .endif SUBDIR.${MK_LPR}+= lpr SUBDIR.${MK_MAN_UTILS}+= manctl SUBDIR.${MK_MLX5TOOL}+= mlx5tool SUBDIR.${MK_NAND}+= nandsim SUBDIR.${MK_NAND}+= nandtool SUBDIR.${MK_NETGRAPH}+= flowctl SUBDIR.${MK_NETGRAPH}+= ngctl SUBDIR.${MK_NETGRAPH}+= nghook SUBDIR.${MK_NIS}+= rpc.yppasswdd SUBDIR.${MK_NIS}+= rpc.ypupdated SUBDIR.${MK_NIS}+= rpc.ypxfrd SUBDIR.${MK_NIS}+= ypbind SUBDIR.${MK_NIS}+= ypldap SUBDIR.${MK_NIS}+= yp_mkdb SUBDIR.${MK_NIS}+= yppoll SUBDIR.${MK_NIS}+= yppush SUBDIR.${MK_NIS}+= ypserv 
SUBDIR.${MK_NIS}+= ypset SUBDIR.${MK_NTP}+= ntp SUBDIR.${MK_OPENSSL}+= keyserv SUBDIR.${MK_PC_SYSINSTALL}+= pc-sysinstall SUBDIR.${MK_PF}+= ftp-proxy SUBDIR.${MK_PKGBOOTSTRAP}+= pkg .if ${COMPILER_FEATURES:Mc++11} SUBDIR.${MK_PMC}+= pmc .endif SUBDIR.${MK_PMC}+= pmcannotate pmccontrol pmcstat pmcstudy SUBDIR.${MK_PORTSNAP}+= portsnap SUBDIR.${MK_PPP}+= ppp SUBDIR.${MK_QUOTAS}+= edquota SUBDIR.${MK_QUOTAS}+= quotaon SUBDIR.${MK_QUOTAS}+= repquota SUBDIR.${MK_SENDMAIL}+= editmap SUBDIR.${MK_SENDMAIL}+= mailstats SUBDIR.${MK_SENDMAIL}+= makemap SUBDIR.${MK_SENDMAIL}+= praliases SUBDIR.${MK_SENDMAIL}+= sendmail SUBDIR.${MK_TCP_WRAPPERS}+= tcpdchk SUBDIR.${MK_TCP_WRAPPERS}+= tcpdmatch SUBDIR.${MK_TIMED}+= timed SUBDIR.${MK_TOOLCHAIN}+= config SUBDIR.${MK_TOOLCHAIN}+= crunch SUBDIR.${MK_UNBOUND}+= unbound SUBDIR.${MK_USB}+= uathload SUBDIR.${MK_USB}+= uhsoctl SUBDIR.${MK_USB}+= usbconfig SUBDIR.${MK_USB}+= usbdump SUBDIR.${MK_UTMPX}+= ac SUBDIR.${MK_UTMPX}+= lastlogin SUBDIR.${MK_UTMPX}+= utx SUBDIR.${MK_WIRELESS}+= ancontrol SUBDIR.${MK_WIRELESS}+= wlandebug SUBDIR.${MK_WIRELESS}+= wpa SUBDIR.${MK_TESTS}+= tests .include SUBDIR_PARALLEL= .include Index: stable/12/usr.sbin/valectl/Makefile =================================================================== --- stable/12/usr.sbin/valectl/Makefile (nonexistent) +++ stable/12/usr.sbin/valectl/Makefile (revision 354471) @@ -0,0 +1,8 @@ +# $FreeBSD$ + +PROG= valectl +MAN= valectl.8 + +WARNS?= 3 + +.include Property changes on: stable/12/usr.sbin/valectl/Makefile ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: stable/12/usr.sbin/valectl/valectl.8 =================================================================== --- stable/12/usr.sbin/valectl/valectl.8 (nonexistent) +++ 
stable/12/usr.sbin/valectl/valectl.8 (revision 354471) @@ -0,0 +1,163 @@ +.\" Copyright (c) 2016 Michio Honda. +.\" All rights reserved. +.\" +.\" Redistribution and use in source and binary forms, with or without +.\" modification, are permitted provided that the following conditions +.\" are met: +.\" 1. Redistributions of source code must retain the above copyright +.\" notice, this list of conditions and the following disclaimer. +.\" 2. Redistributions in binary form must reproduce the above copyright +.\" notice, this list of conditions and the following disclaimer in the +.\" documentation and/or other materials provided with the distribution. +.\" +.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND +.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE +.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +.\" SUCH DAMAGE. 
+.\" +.\" $FreeBSD$ +.\" +.Dd October 26, 2019 +.Dt VALECTL 8 +.Os +.Sh NAME +.Nm valectl +.Nd manage VALE switches provided by netmap +.Sh SYNOPSIS +.Bk -words +.Bl -tag -width "valectl" +.It Nm +.Op Fl g Ar valeSSS:PPP +.Op Fl a Ar valeSSS:interface +.Op Fl h Ar valeSSS:interface +.Op Fl d Ar valeSSS:interface +.Op Fl n Ar interface +.Op Fl r Ar interface +.Op Fl l Ar valeSSS:PPP +.Op Fl l +.Op Fl p Ar valeSSS:PPP +.Op Fl P Ar valeSSS:PPP +.Op Fl C Ar spec +.Op Fl m Ar memid +.El +.Ek +.Sh DESCRIPTION +.Nm +manages and inspects +.Xr vale 4 +switches, for instance attaching and detaching interfaces, creating +and deleting persistent VALE ports, or listing the existing switches +and their ports. +In the following, +.Ar valeSSS +is the name of a VALE switch, while +.Ar valeSSS:PPP +is the name of a VALE port of +.Ar valeSSS . +.Pp +When issued without options it lists all the existing switch ports together +with their internal bridge number and port number. +.Bl -tag -width Ds +.It Fl g Ar valeSSS:PPP +Print the number of receive rings of +.Ar valeSSS:PPP . +.It Fl a Ar valeSSS:interface +Attach +.Ar interface +(which must be an existing network interface) to +.Ar valeSSS +and detach it from the host stack. +.It Fl h Ar valeSSS:interface +Attach +.Ar interface +(which must be an existing network interface) to +.Ar valeSSS +while keeping it attached to the host stack. +More precisely, packets coming from +the host stack and directed to the interface will go through the switch, where +they can still reach the interface if the switch rules allow it. +Conversely, packets coming from the interface will go through the switch and, +if appropriate, will reach the host stack. +.It Fl d Ar valeSSS:interface +Detach +.Ar interface +from +.Ar valeSSS . +.It Fl n Ar interface +Create a new persistent VALE port with name +.Ar interface . +The name must be different from any other network interface +already present in the system. 
+.It Fl r Ar interface +Destroy the persistent VALE port with name +.Ar interface . +.It Fl l Ar valeSSS:PPP +Show the internal bridge number and port number of the given switch port. +.It Fl p Ar valeSSS:PPP +Enable polling mode for +.Ar valeSSS:PPP . +In polling mode, a dedicated kernel thread is spawned to handle packets +received from +.Ar valeSSS:PPP +and push them into the switch. +The kernel thread busy waits on the switch port rather than relying on +interrupts or notifications. +Polling mode can only be used on physical NICs attached to a VALE switch. +.It Fl P Ar valeSSS:PPP +Disable polling mode for +.Ar valeSSS:PPP . +.It Fl C Ar x | Ar x,y | Ar x,y,z | Ar x,y,z,w +When used in conjunction with +.Fl n +it supplies the number of tx and rx rings and slots. +The full format with four numbers gives, in order, number of tx slots, number +of rx slots, number of tx rings and number of rx rings. +The form with three numbers uses +.Ar z +for both the number of tx and the number of rx rings. +The forms with fewer than three numbers use the default values for the number +of rings. +The form with two numbers supplies the numbers of tx and rx slots. +The form with only one number uses +.Ar x +for both the number of tx and the number of rx slots. +.Pp +When used in conjunction with +.Fl p +only the first three forms are used. +The first number may be either 0 or 1. +If 0, then all interface rings will be polled by a single thread, running +on the core id given by the second number (the third number, if present, +must be 1). +If the first number is 1, then the ring identified by the second number will +be polled by the core with the same id. +If a third number is given, then this is repeated for as many consecutive +rings and cores. +.It Fl m Ar memid +Used in conjunction with +.Fl n +supplies the netmap memory region identifier to use together with the newly +created persistent VALE port. +These ports use a private memory region by default.
+Using this option you can let them share memory with other ports. +Pass 1 as +.Ar memid +to use the global memory region already shared by all +hardware netmap ports. +.El +.Sh SEE ALSO +.Xr netmap 4 , +.Xr vale 4 +.Sh AUTHORS +.An -nosplit +.Nm +was written by +.An Michio Honda +at NetApp. Property changes on: stable/12/usr.sbin/valectl/valectl.8 ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: stable/12/usr.sbin/valectl/valectl.c =================================================================== --- stable/12/usr.sbin/valectl/valectl.c (nonexistent) +++ stable/12/usr.sbin/valectl/valectl.c (revision 354471) @@ -0,0 +1,280 @@ +/* + * Copyright (C) 2013-2014 Michio Honda. All rights reserved. + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + * notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + * notice, this list of conditions and the following disclaimer in the + * documentation and/or other materials provided with the distribution. + * + * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE + * ARE DISCLAIMED.
IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS + * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) + * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT + * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY + * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF + * SUCH DAMAGE. + */ + +/* $FreeBSD$ */ + +#define NETMAP_WITH_LIBS +#include <net/netmap_user.h> +#include <net/netmap.h> + +#include <errno.h> +#include <stdio.h> +#include <inttypes.h> /* PRI* macros */ +#include <string.h> /* strcmp */ +#include <fcntl.h> /* open */ +#include <unistd.h> /* close */ +#include <sys/ioctl.h> /* ioctl */ +#include <sys/param.h> +#include <sys/socket.h> /* apple needs sockaddr */ +#include <net/if.h> /* ifreq */ +#include <libgen.h> /* basename */ +#include <stdlib.h> /* atoi, free */ + +static void +parse_nmr_config(const char* conf, struct nmreq *nmr) +{ + char *w, *tok; + int i, v; + + nmr->nr_tx_rings = nmr->nr_rx_rings = 0; + nmr->nr_tx_slots = nmr->nr_rx_slots = 0; + if (conf == NULL || !
*conf) + return; + w = strdup(conf); + for (i = 0, tok = strtok(w, ","); tok; i++, tok = strtok(NULL, ",")) { + v = atoi(tok); + switch (i) { + case 0: + nmr->nr_tx_slots = nmr->nr_rx_slots = v; + break; + case 1: + nmr->nr_rx_slots = v; + break; + case 2: + nmr->nr_tx_rings = nmr->nr_rx_rings = v; + break; + case 3: + nmr->nr_rx_rings = v; + break; + default: + D("ignored config: %s", tok); + break; + } + } + D("txr %d txd %d rxr %d rxd %d", + nmr->nr_tx_rings, nmr->nr_tx_slots, + nmr->nr_rx_rings, nmr->nr_rx_slots); + free(w); +} + +static int +bdg_ctl(const char *name, int nr_cmd, int nr_arg, char *nmr_config, int nr_arg2) +{ + struct nmreq nmr; + int error = 0; + int fd = open("/dev/netmap", O_RDWR); + + if (fd == -1) { + D("Unable to open /dev/netmap"); + return -1; + } + + bzero(&nmr, sizeof(nmr)); + nmr.nr_version = NETMAP_API; + if (name != NULL) /* might be NULL */ + strncpy(nmr.nr_name, name, sizeof(nmr.nr_name)-1); + nmr.nr_cmd = nr_cmd; + parse_nmr_config(nmr_config, &nmr); + nmr.nr_arg2 = nr_arg2; + + switch (nr_cmd) { + case NETMAP_BDG_DELIF: + case NETMAP_BDG_NEWIF: + error = ioctl(fd, NIOCREGIF, &nmr); + if (error == -1) { + ND("Unable to %s %s", nr_cmd == NETMAP_BDG_DELIF ? "delete":"create", name); + perror(name); + } else { + ND("Success to %s %s", nr_cmd == NETMAP_BDG_DELIF ? 
"delete":"create", name); + } + break; + case NETMAP_BDG_ATTACH: + case NETMAP_BDG_DETACH: + nmr.nr_flags = NR_REG_ALL_NIC; + if (nr_arg && nr_arg != NETMAP_BDG_HOST) { + nmr.nr_flags = NR_REG_NIC_SW; + nr_arg = 0; + } + nmr.nr_arg1 = nr_arg; + error = ioctl(fd, NIOCREGIF, &nmr); + if (error == -1) { + ND("Unable to %s %s to the bridge", nr_cmd == + NETMAP_BDG_DETACH?"detach":"attach", name); + perror(name); + } else + ND("Success to %s %s to the bridge", nr_cmd == + NETMAP_BDG_DETACH?"detach":"attach", name); + break; + + case NETMAP_BDG_LIST: + if (strlen(nmr.nr_name)) { /* name to bridge/port info */ + error = ioctl(fd, NIOCGINFO, &nmr); + if (error) { + ND("Unable to obtain info for %s", name); + perror(name); + } else + D("%s at bridge:%d port:%d", name, nmr.nr_arg1, + nmr.nr_arg2); + break; + } + + /* scan all the bridges and ports */ + nmr.nr_arg1 = nmr.nr_arg2 = 0; + for (; !ioctl(fd, NIOCGINFO, &nmr); nmr.nr_arg2++) { + D("bridge:%d port:%d %s", nmr.nr_arg1, nmr.nr_arg2, + nmr.nr_name); + nmr.nr_name[0] = '\0'; + } + + break; + + case NETMAP_BDG_POLLING_ON: + case NETMAP_BDG_POLLING_OFF: + /* We reuse nmreq fields as follows: + * nr_tx_slots: 0 and non-zero indicate REG_ALL_NIC + * REG_ONE_NIC, respectively. + * nr_rx_slots: CPU core index. This also indicates the + * first queue in the case of REG_ONE_NIC + * nr_tx_rings: (REG_ONE_NIC only) indicates the + * number of CPU cores or the last queue + */ + nmr.nr_flags |= nmr.nr_tx_slots ? + NR_REG_ONE_NIC : NR_REG_ALL_NIC; + nmr.nr_ringid = nmr.nr_rx_slots; + /* number of cores/rings */ + if (nmr.nr_flags == NR_REG_ALL_NIC) + nmr.nr_arg1 = 1; + else + nmr.nr_arg1 = nmr.nr_tx_rings; + + error = ioctl(fd, NIOCREGIF, &nmr); + if (!error) + D("polling on %s %s", nmr.nr_name, + nr_cmd == NETMAP_BDG_POLLING_ON ? + "started" : "stopped"); + else + D("polling on %s %s (err %d)", nmr.nr_name, + nr_cmd == NETMAP_BDG_POLLING_ON ? 
+ "couldn't start" : "couldn't stop", error); + break; + + default: /* GINFO */ + nmr.nr_cmd = nmr.nr_arg1 = nmr.nr_arg2 = 0; + error = ioctl(fd, NIOCGINFO, &nmr); + if (error) { + ND("Unable to get if info for %s", name); + perror(name); + } else + D("%s: %d queues.", name, nmr.nr_rx_rings); + break; + } + close(fd); + return error; +} + +static void +usage(int errcode) +{ + fprintf(stderr, + "Usage:\n" + "valectl arguments\n" + "\t-g interface interface name to get info\n" + "\t-d interface interface name to be detached\n" + "\t-a interface interface name to be attached\n" + "\t-h interface interface name to be attached with the host stack\n" + "\t-n interface interface name to be created\n" + "\t-r interface interface name to be deleted\n" + "\t-l list all or specified bridge's interfaces (default)\n" + "\t-C string ring/slot setting of an interface creating by -n\n" + "\t-p interface start polling. Additional -C x,y,z configures\n" + "\t\t x: 0 (REG_ALL_NIC) or 1 (REG_ONE_NIC),\n" + "\t\t y: CPU core id for ALL_NIC and core/ring for ONE_NIC\n" + "\t\t z: (ONE_NIC only) num of total cores/rings\n" + "\t-P interface stop polling\n" + "\t-m memid to use when creating a new interface\n"); + exit(errcode); +} + +int +main(int argc, char *argv[]) +{ + int ch, nr_cmd = 0, nr_arg = 0; + char *name = NULL, *nmr_config = NULL; + int nr_arg2 = 0; + + while ((ch = getopt(argc, argv, "d:a:h:g:l:n:r:C:p:P:m:")) != -1) { + if (ch != 'C' && ch != 'm') + name = optarg; /* default */ + switch (ch) { + default: + fprintf(stderr, "bad option %c %s", ch, optarg); + usage(-1); + break; + case 'd': + nr_cmd = NETMAP_BDG_DETACH; + break; + case 'a': + nr_cmd = NETMAP_BDG_ATTACH; + break; + case 'h': + nr_cmd = NETMAP_BDG_ATTACH; + nr_arg = NETMAP_BDG_HOST; + break; + case 'n': + nr_cmd = NETMAP_BDG_NEWIF; + break; + case 'r': + nr_cmd = NETMAP_BDG_DELIF; + break; + case 'g': + nr_cmd = 0; + break; + case 'l': + nr_cmd = NETMAP_BDG_LIST; + break; + case 'C': + nmr_config = 
strdup(optarg); + break; + case 'p': + nr_cmd = NETMAP_BDG_POLLING_ON; + break; + case 'P': + nr_cmd = NETMAP_BDG_POLLING_OFF; + break; + case 'm': + nr_arg2 = atoi(optarg); + break; + } + } + if (optind != argc) { + // fprintf(stderr, "optind %d argc %d\n", optind, argc); + usage(-1); + } + if (argc == 1) { + nr_cmd = NETMAP_BDG_LIST; + name = NULL; + } + return bdg_ctl(name, nr_cmd, nr_arg, nmr_config, nr_arg2) ? 1 : 0; +} Property changes on: stable/12/usr.sbin/valectl/valectl.c ___________________________________________________________________ Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Index: stable/12 =================================================================== --- stable/12 (revision 354470) +++ stable/12 (revision 354471) Property changes on: stable/12 ___________________________________________________________________ Modified: svn:mergeinfo ## -0,0 +0,1 ## Merged /head:r354229
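The way the -C spec is interpreted when creating a port with -n (parse_nmr_config() in valectl.c above, and the -C entry in valectl.8) can be sketched as a standalone function. This is only an illustration under stated assumptions, not the code from the commit: struct ring_cfg and parse_ring_spec() are hypothetical names standing in for the four struct nmreq fields, and the sketch uses strtol() rather than strtok()/atoi(), so it stops at the first non-numeric field instead of skipping it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the four struct nmreq fields set by -C. */
struct ring_cfg {
	int tx_slots, rx_slots;
	int tx_rings, rx_rings;
};

/*
 * Parse "x", "x,y", "x,y,z" or "x,y,z,w" (assumed helper, not in valectl):
 *   field 0 sets both tx and rx slots,
 *   field 1 overrides rx slots,
 *   field 2 sets both tx and rx rings,
 *   field 3 overrides rx rings.
 * Fields that are not supplied stay at 0, as in parse_nmr_config().
 * Returns the number of fields consumed (0 to 4).
 */
static int
parse_ring_spec(const char *spec, struct ring_cfg *cfg)
{
	int n = 0;

	assert(cfg != NULL);
	memset(cfg, 0, sizeof(*cfg));
	if (spec == NULL)
		return (0);

	while (n < 4 && *spec != '\0') {
		char *end;
		int v = (int)strtol(spec, &end, 10);

		if (end == spec)	/* no digits: stop parsing */
			break;
		switch (n) {
		case 0:
			cfg->tx_slots = cfg->rx_slots = v;
			break;
		case 1:
			cfg->rx_slots = v;
			break;
		case 2:
			cfg->tx_rings = cfg->rx_rings = v;
			break;
		case 3:
			cfg->rx_rings = v;
			break;
		}
		n++;
		/* Step over a single separating comma, if present. */
		spec = (*end == ',') ? end + 1 : end;
	}
	return (n);
}
```

With a spec such as 1024,512,4 the sketch consumes three fields and leaves 1024 tx slots, 512 rx slots and 4 rings in each direction. A single number such as 256 sets both slot counts and leaves both ring counts at 0.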