Index: head/sys/dev/netmap/netmap.c =================================================================== --- head/sys/dev/netmap/netmap.c (revision 353774) +++ head/sys/dev/netmap/netmap.c (revision 353775) @@ -1,4340 +1,4341 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (C) 2011-2014 Matteo Landi * Copyright (C) 2011-2016 Luigi Rizzo * Copyright (C) 2011-2016 Giuseppe Lettieri * Copyright (C) 2011-2016 Vincenzo Maffione * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * $FreeBSD$ * * This module supports memory mapped access to network devices, * see netmap(4). * * The module uses a large memory pool allocated by the kernel * and accessible as mmapped memory by multiple userspace threads/processes. * The memory pool contains packet buffers and "netmap rings", * i.e. user-accessible copies of the interface's queues. * * Access to the network card works like this: * 1. a process/thread issues one or more open() on /dev/netmap, to create * a select()able file descriptor on which events are reported. * 2. on each descriptor, the process issues an ioctl() to identify * the interface that should report events to the file descriptor. * 3. on each descriptor, the process issues an mmap() request to * map the shared memory region within the process' address space. * The list of interesting queues is indicated by a location in * the shared memory region. * 4. using the functions in the netmap(4) userspace API, a process * can look up the occupation state of a queue, access memory buffers, * and retrieve received packets or enqueue packets to transmit. * 5. using some ioctl()s the process can synchronize the userspace view * of the queue with the actual status in the kernel. This includes both * receiving the notification of new packets, and transmitting new * packets on the output interface. * 6. select() or poll() can be used to wait for events on individual * transmit or receive queues (or all queues for a given interface). * SYNCHRONIZATION (USER) The netmap rings and data structures may be shared among multiple user threads or even independent processes. Any synchronization among those threads/processes is delegated to the threads themselves. Only one thread at a time can be in a system call on the same netmap ring.
The OS does not enforce this and only guarantees against system crashes in case of invalid usage. LOCKING (INTERNAL) Within the kernel, access to the netmap rings is protected as follows: - a spinlock on each ring, to handle producer/consumer races on RX rings attached to the host stack (against multiple host threads writing from the host stack to the same ring), and on 'destination' rings attached to a VALE switch (i.e. RX rings in VALE ports, and TX rings in NIC/host ports), protecting multiple active senders for the same destination - an atomic variable to guarantee that there is at most one instance of *_*xsync() on the ring at any time. For rings connected to user file descriptors, an atomic_test_and_set() protects this, and the lock on the ring is not actually used. For NIC RX rings connected to a VALE switch, an atomic_test_and_set() is also used to prevent multiple executions (the driver might indeed already guarantee this). For NIC TX rings connected to a VALE switch, the lock arbitrates access to the queue (both when allocating buffers and when pushing them out). - *xsync() should be protected against initializations of the card. On FreeBSD most devices have the reset routine protected by a RING lock (ixgbe, igb, em) or core lock (re). lem is missing the RING protection on rx_reset(); this should be added. On linux there is an external lock on the tx path, which probably also arbitrates access to the reset routine. XXX to be revised - a per-interface core_lock protecting access from the host stack while interfaces may be detached from netmap mode. XXX there should be no need for this lock if we detach the interfaces only while they are down. --- VALE SWITCH --- NMG_LOCK() serializes all modifications to switches and ports. A switch cannot be deleted until all ports are gone. For each switch, an SX lock (RWlock on linux) protects deletion of ports. When configuring a new port or deleting an existing one, the lock is acquired in exclusive mode (after holding NMG_LOCK). When forwarding, the lock is acquired in shared mode (without NMG_LOCK). The lock is held throughout the entire forwarding cycle, during which the thread may incur a page fault. Hence it is important that sleepable shared locks are used. On the rx ring, the per-port lock is grabbed initially to reserve a number of slots in the ring, then the lock is released, packets are copied from source to destination, and then the lock is acquired again and the receive ring is updated. (A similar thing is done on the tx ring for NIC and host stack ports attached to the switch) */ /* --- internals ---- * * Roadmap to the code that implements the above. * * > 1. a process/thread issues one or more open() on /dev/netmap, to create * > a select()able file descriptor on which events are reported. * * Internally, we allocate a netmap_priv_d structure, that will be * initialized on ioctl(NIOCREGIF). There is one netmap_priv_d * structure for each open(). * * os-specific: * FreeBSD: see netmap_open() (netmap_freebsd.c) * linux: see linux_netmap_open() (netmap_linux.c) * * > 2. on each descriptor, the process issues an ioctl() to identify * > the interface that should report events to the file descriptor. * * Implemented by netmap_ioctl(), NIOCREGIF case, with nmr->nr_cmd==0. * Most important things happen in netmap_get_na() and * netmap_do_regif(), called from there. Additional details can be * found in the comments above those functions.
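 *
 * From userspace, this registration step looks as follows (a minimal
 * sketch, assuming the newer nmreq API of net/netmap.h; the legacy
 * NIOCREGIF path is analogous):
 *
 *	struct nmreq_header hdr;
 *	struct nmreq_register reg;
 *
 *	bzero(&hdr, sizeof(hdr));
 *	bzero(&reg, sizeof(reg));
 *	hdr.nr_version = NETMAP_API;
 *	hdr.nr_reqtype = NETMAP_REQ_REGISTER;
 *	strlcpy(hdr.nr_name, "em0", sizeof(hdr.nr_name));
 *	hdr.nr_body = (uintptr_t)&reg;
 *	reg.nr_mode = NR_REG_ALL_NIC;	/* bind all hw rings */
 *	ioctl(fd, NIOCCTRL, &hdr);	/* fd comes from open("/dev/netmap") */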
* * In all cases, this action creates/takes-a-reference-to a * netmap_*_adapter describing the port, and allocates a netmap_if * and all necessary netmap rings, filling them with netmap buffers. * * In this phase, the sync callbacks for each ring are set (these are used * in steps 5 and 6 below). The callbacks depend on the type of adapter. * The adapter creation/initialization code puts them in the * netmap_adapter (fields na->nm_txsync and na->nm_rxsync). Then, they * are copied from there to the netmap_kring's during netmap_do_regif(), by * the nm_krings_create() callback. All the nm_krings_create callbacks * actually call netmap_krings_create() to perform this and the other * common stuff. netmap_krings_create() also takes care of the host rings, * if needed, by setting their sync callbacks appropriately. * * Additional actions depend on the kind of netmap_adapter that has been * registered: * * - netmap_hw_adapter: [netmap.c] * This is a system netdev/ifp with native netmap support. * The ifp is detached from the host stack by redirecting: * - transmissions (from the network stack) to netmap_transmit() * - receive notifications to the nm_notify() callback for * this adapter. The callback is normally netmap_notify(), unless * the ifp is attached to a bridge using bwrap, in which case it * is netmap_bwrap_intr_notify(). * * - netmap_generic_adapter: [netmap_generic.c] * A system netdev/ifp without native netmap support. * * (the decision about native/non-native support is taken in * netmap_get_hw_na(), called by netmap_get_na()) * * - netmap_vp_adapter [netmap_vale.c] * Returned by netmap_get_bdg_na(). * This is a persistent or ephemeral VALE port. Ephemeral ports * are created on the fly if they don't already exist, and are * always attached to a bridge. * Persistent VALE ports must be created separately, and * then attached like normal NICs. The NIOCREGIF we are examining * will find them only if they had previously been created and * attached (see VALE_CTL below). * * - netmap_pipe_adapter [netmap_pipe.c] * Returned by netmap_get_pipe_na(). * Both pipe ends are created, if they didn't already exist. * * - netmap_monitor_adapter [netmap_monitor.c] * Returned by netmap_get_monitor_na(). * If successful, the nm_sync callbacks of the monitored adapter * will be intercepted by the returned monitor. * * - netmap_bwrap_adapter [netmap_vale.c] * Cannot be obtained in this way, see VALE_CTL below * * * os-specific: * linux: we first go through linux_netmap_ioctl() to * adapt the FreeBSD interface to the linux one. * * * > 3. on each descriptor, the process issues an mmap() request to * > map the shared memory region within the process' address space. * > The list of interesting queues is indicated by a location in * > the shared memory region. * * os-specific: * FreeBSD: netmap_mmap_single (netmap_freebsd.c). * linux: linux_netmap_mmap (netmap_linux.c). * * > 4. using the functions in the netmap(4) userspace API, a process * > can look up the occupation state of a queue, access memory buffers, * > and retrieve received packets or enqueue packets to transmit. * * these actions do not involve the kernel. * * > 5. using some ioctl()s the process can synchronize the userspace view * > of the queue with the actual status in the kernel. This includes both * > receiving the notification of new packets, and transmitting new * > packets on the output interface. * * These are implemented in netmap_ioctl(), NIOCTXSYNC and NIOCRXSYNC * cases.
They invoke the nm_sync callbacks on the netmap_kring * structures, as initialized in step 2 and maybe later modified * by a monitor. Monitors, however, will always call the original * callback before doing anything else. * * * > 6. select() or poll() can be used to wait for events on individual * > transmit or receive queues (or all queues for a given interface). * * Implemented in netmap_poll(). This will call the same nm_sync() * callbacks as in step 5 above. * * os-specific: * linux: we first go through linux_netmap_poll() to adapt * the FreeBSD interface to the linux one. * * * ---- VALE_CTL ----- * * VALE switches are controlled by issuing a NIOCREGIF with a non-null * nr_cmd in the nmreq structure. These subcommands are handled by * netmap_bdg_ctl() in netmap_vale.c. Persistent VALE ports are created * and destroyed by issuing the NETMAP_BDG_NEWIF and NETMAP_BDG_DELIF * subcommands, respectively. * * Any network interface known to the system (including a persistent VALE * port) can be attached to a VALE switch by issuing the * NETMAP_REQ_VALE_ATTACH command. After the attachment, persistent VALE ports * look exactly like ephemeral VALE ports (as created in step 2 above). The * attachment of other interfaces, instead, requires the creation of a * netmap_bwrap_adapter. Moreover, the attached interface must be put in * netmap mode. This may require the creation of a netmap_generic_adapter if * we have no native support for the interface, or if generic adapters have * been forced by sysctl. * * Both persistent VALE ports and bwraps are handled by netmap_get_bdg_na(), * called by nm_bdg_ctl_attach(), and discriminated by the nm_bdg_attach() * callback. In the case of the bwrap, the callback creates the * netmap_bwrap_adapter. The initialization of the bwrap is then * completed by calling netmap_do_regif() on it, in the nm_bdg_ctl() * callback (netmap_bwrap_bdg_ctl in netmap_vale.c). * A generic adapter for the wrapped ifp will be created if needed, when * netmap_get_bdg_na() calls netmap_get_hw_na(). 
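 *
 * For example, attaching interface em0 to switch valeA is a
 * NETMAP_REQ_VALE_ATTACH request (a sketch, with the same
 * conventions and hedges as the registration example above):
 *
 *	struct nmreq_header hdr;
 *	struct nmreq_vale_attach req;
 *
 *	bzero(&hdr, sizeof(hdr));
 *	bzero(&req, sizeof(req));
 *	hdr.nr_version = NETMAP_API;
 *	hdr.nr_reqtype = NETMAP_REQ_VALE_ATTACH;
 *	strlcpy(hdr.nr_name, "valeA:em0", sizeof(hdr.nr_name));
 *	hdr.nr_body = (uintptr_t)&req;
 *	req.reg.nr_mode = NR_REG_ALL_NIC;	/* put em0 in netmap mode */
 *	ioctl(fd, NIOCCTRL, &hdr);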
* * * ---- DATAPATHS ----- * * -= SYSTEM DEVICE WITH NATIVE SUPPORT =- * * na == NA(ifp) == netmap_hw_adapter created in DEVICE_netmap_attach() * * - tx from netmap userspace: * concurrently: * 1) ioctl(NIOCTXSYNC)/netmap_poll() in process context * kring->nm_sync() == DEVICE_netmap_txsync() * 2) device interrupt handler * na->nm_notify() == netmap_notify() * - rx from netmap userspace: * concurrently: * 1) ioctl(NIOCRXSYNC)/netmap_poll() in process context * kring->nm_sync() == DEVICE_netmap_rxsync() * 2) device interrupt handler * na->nm_notify() == netmap_notify() * - rx from host stack * concurrently: * 1) host stack * netmap_transmit() * na->nm_notify == netmap_notify() * 2) ioctl(NIOCRXSYNC)/netmap_poll() in process context * kring->nm_sync() == netmap_rxsync_from_host * netmap_rxsync_from_host(na, NULL, NULL) * - tx to host stack * ioctl(NIOCTXSYNC)/netmap_poll() in process context * kring->nm_sync() == netmap_txsync_to_host * netmap_txsync_to_host(na) * nm_os_send_up() * FreeBSD: na->if_input() == ether_input() * linux: netif_rx() with NM_MAGIC_PRIORITY_RX * * * -= SYSTEM DEVICE WITH GENERIC SUPPORT =- * * na == NA(ifp) == generic_netmap_adapter created in generic_netmap_attach() * * - tx from netmap userspace: * concurrently: * 1) ioctl(NIOCTXSYNC)/netmap_poll() in process context * kring->nm_sync() == generic_netmap_txsync() * nm_os_generic_xmit_frame() * linux: dev_queue_xmit() with NM_MAGIC_PRIORITY_TX * ifp->ndo_start_xmit == generic_ndo_start_xmit() * gna->save_start_xmit == orig. dev. start_xmit * FreeBSD: na->if_transmit() == orig. dev if_transmit * 2) generic_mbuf_destructor() * na->nm_notify() == netmap_notify() * - rx from netmap userspace: * 1) ioctl(NIOCRXSYNC)/netmap_poll() in process context * kring->nm_sync() == generic_netmap_rxsync() * mbq_safe_dequeue() * 2) device driver * generic_rx_handler() * mbq_safe_enqueue() * na->nm_notify() == netmap_notify() * - rx from host stack * FreeBSD: same as native * Linux: same as native except: * 1) host stack * dev_queue_xmit() without NM_MAGIC_PRIORITY_TX * ifp->ndo_start_xmit == generic_ndo_start_xmit() * netmap_transmit() * na->nm_notify() == netmap_notify() * - tx to host stack (same as native): * * * -= VALE =- * * INCOMING: * * - VALE ports: * ioctl(NIOCTXSYNC)/netmap_poll() in process context * kring->nm_sync() == netmap_vp_txsync() * * - system device with native support: * from cable: * interrupt * na->nm_notify() == netmap_bwrap_intr_notify(ring_nr != host ring) * kring->nm_sync() == DEVICE_netmap_rxsync() * netmap_vp_txsync() * kring->nm_sync() == DEVICE_netmap_rxsync() * from host stack: * netmap_transmit() * na->nm_notify() == netmap_bwrap_intr_notify(ring_nr == host ring) * kring->nm_sync() == netmap_rxsync_from_host() * netmap_vp_txsync() * * - system device with generic support: * from device driver: * generic_rx_handler() * na->nm_notify() == netmap_bwrap_intr_notify(ring_nr != host ring) * kring->nm_sync() == generic_netmap_rxsync() * netmap_vp_txsync() * kring->nm_sync() == generic_netmap_rxsync() * from host stack: * netmap_transmit() * na->nm_notify() == netmap_bwrap_intr_notify(ring_nr == host ring) * kring->nm_sync() == netmap_rxsync_from_host() * netmap_vp_txsync() * * (all cases) --> nm_bdg_flush() * dest_na->nm_notify() == (see below) * * OUTGOING: * * - VALE ports: * concurrently: * 1) ioctl(NIOCRXSYNC)/netmap_poll() in process context * kring->nm_sync() == netmap_vp_rxsync() * 2) from nm_bdg_flush() * na->nm_notify() == netmap_notify() * * - system device with native support: * to cable: * 
na->nm_notify() == netmap_bwrap_notify() * netmap_vp_rxsync() * kring->nm_sync() == DEVICE_netmap_txsync() * netmap_vp_rxsync() * to host stack: * netmap_vp_rxsync() * kring->nm_sync() == netmap_txsync_to_host * netmap_vp_rxsync_locked() * * - system device with generic adapter: * to device driver: * na->nm_notify() == netmap_bwrap_notify() * netmap_vp_rxsync() * kring->nm_sync() == generic_netmap_txsync() * netmap_vp_rxsync() * to host stack: * netmap_vp_rxsync() * kring->nm_sync() == netmap_txsync_to_host * netmap_vp_rxsync() * */ /* * OS-specific code that is used only within this file. * Other OS-specific code that must be accessed by drivers * is present in netmap_kern.h */ #if defined(__FreeBSD__) #include <sys/cdefs.h> /* prerequisite */ #include <sys/types.h> #include <sys/errno.h> #include <sys/param.h> /* defines used in kernel.h */ #include <sys/kernel.h> /* types used in module initialization */ #include <sys/conf.h> /* cdevsw struct, UID, GID */ #include <sys/filio.h> /* FIONBIO */ #include <sys/sockio.h> #include <sys/socketvar.h> /* struct socket */ #include <sys/malloc.h> #include <sys/poll.h> #include <sys/rwlock.h> #include <sys/socket.h> /* sockaddrs */ #include <sys/selinfo.h> #include <sys/sysctl.h> #include <sys/jail.h> #include <net/vnet.h> #include <net/if.h> #include <net/if_var.h> #include <net/bpf.h> /* BIOCIMMEDIATE */ #include <machine/bus.h> /* bus_dmamap_* */ #include <sys/endian.h> #include <sys/refcount.h> #include <net/ethernet.h> /* ETHER_BPF_MTAP */ #elif defined(linux) #include "bsd_glue.h" #elif defined(__APPLE__) #warning OSX support is only partial #include "osx_glue.h" #elif defined (_WIN32) #include "win_glue.h" #else #error Unsupported platform #endif /* unsupported */ /* * common headers */ #include <net/netmap.h> #include <dev/netmap/netmap_kern.h> #include <dev/netmap/netmap_mem2.h> /* user-controlled variables */ int netmap_verbose; #ifdef CONFIG_NETMAP_DEBUG int netmap_debug; #endif /* CONFIG_NETMAP_DEBUG */ static int netmap_no_timestamp; /* don't timestamp on rxsync */ int netmap_no_pendintr = 1; int netmap_txsync_retry = 2; static int netmap_fwd = 0; /* force transparent forwarding */ /* * netmap_admode selects the netmap mode to use. * Invalid values are reset to NETMAP_ADMODE_BEST */ enum { NETMAP_ADMODE_BEST = 0, /* use native, fallback to generic */ NETMAP_ADMODE_NATIVE, /* either native or none */ NETMAP_ADMODE_GENERIC, /* force generic */ NETMAP_ADMODE_LAST }; static int netmap_admode = NETMAP_ADMODE_BEST; /* netmap_generic_mit controls mitigation of RX notifications for * the generic netmap adapter. The value is a time interval in * nanoseconds. */ int netmap_generic_mit = 100*1000; /* By default we use netmap-aware qdiscs with generic netmap adapters, * even if there can be a little performance hit with hardware NICs. * However, using the qdisc is the safer approach, for two reasons: * 1) it prevents non-fifo qdiscs from breaking the TX notification * scheme, which is based on mbuf destructors when txqdisc is * not used. * 2) it makes it possible to transmit over software devices that * change skb->dev, like bridge, veth, ... * * In any case, users looking for the best performance should * use native adapters. */ #ifdef linux int netmap_generic_txqdisc = 1; #endif /* Default number of slots and queues for generic adapters. */ int netmap_generic_ringsize = 1024; int netmap_generic_rings = 1; /* Non-zero to enable checksum offloading in NIC drivers */ int netmap_generic_hwcsum = 0; /* Non-zero if ptnet devices are allowed to use virtio-net headers.
*/ int ptnet_vnet_hdr = 1; /* * SYSCTL calls are grouped between SYSBEGIN and SYSEND to be emulated * in some other operating systems */ SYSBEGIN(main_init); SYSCTL_DECL(_dev_netmap); SYSCTL_NODE(_dev, OID_AUTO, netmap, CTLFLAG_RW, 0, "Netmap args"); SYSCTL_INT(_dev_netmap, OID_AUTO, verbose, CTLFLAG_RW, &netmap_verbose, 0, "Verbose mode"); #ifdef CONFIG_NETMAP_DEBUG SYSCTL_INT(_dev_netmap, OID_AUTO, debug, CTLFLAG_RW, &netmap_debug, 0, "Debug messages"); #endif /* CONFIG_NETMAP_DEBUG */ SYSCTL_INT(_dev_netmap, OID_AUTO, no_timestamp, CTLFLAG_RW, &netmap_no_timestamp, 0, "no_timestamp"); SYSCTL_INT(_dev_netmap, OID_AUTO, no_pendintr, CTLFLAG_RW, &netmap_no_pendintr, 0, "Always look for new received packets."); SYSCTL_INT(_dev_netmap, OID_AUTO, txsync_retry, CTLFLAG_RW, &netmap_txsync_retry, 0, "Number of txsync loops in bridge's flush."); SYSCTL_INT(_dev_netmap, OID_AUTO, fwd, CTLFLAG_RW, &netmap_fwd, 0, "Force NR_FORWARD mode"); SYSCTL_INT(_dev_netmap, OID_AUTO, admode, CTLFLAG_RW, &netmap_admode, 0, "Adapter mode. 0 selects the best option available, " "1 forces native adapter, 2 forces emulated adapter"); SYSCTL_INT(_dev_netmap, OID_AUTO, generic_hwcsum, CTLFLAG_RW, &netmap_generic_hwcsum, 0, "Hardware checksums. 0 to disable checksum generation by the NIC (default), " "1 to enable checksum generation by the NIC"); SYSCTL_INT(_dev_netmap, OID_AUTO, generic_mit, CTLFLAG_RW, &netmap_generic_mit, 0, "RX notification interval in nanoseconds"); SYSCTL_INT(_dev_netmap, OID_AUTO, generic_ringsize, CTLFLAG_RW, &netmap_generic_ringsize, 0, "Number of per-ring slots for emulated netmap mode"); SYSCTL_INT(_dev_netmap, OID_AUTO, generic_rings, CTLFLAG_RW, &netmap_generic_rings, 0, "Number of TX/RX queues for emulated netmap adapters"); #ifdef linux SYSCTL_INT(_dev_netmap, OID_AUTO, generic_txqdisc, CTLFLAG_RW, &netmap_generic_txqdisc, 0, "Use qdisc for generic adapters"); #endif SYSCTL_INT(_dev_netmap, OID_AUTO, ptnet_vnet_hdr, CTLFLAG_RW, &ptnet_vnet_hdr, 0, "Allow ptnet devices to use virtio-net headers"); SYSEND; NMG_LOCK_T netmap_global_lock; /* * mark the ring as stopped, and run through the locks * to make sure other users get to see it. * stopped must be either NM_KR_STOPPED (for unbounded stop) * or NM_KR_LOCKED (brief stop for mutual exclusion purposes) */ static void netmap_disable_ring(struct netmap_kring *kr, int stopped) { nm_kr_stop(kr, stopped); // XXX check if nm_kr_stop is sufficient mtx_lock(&kr->q_lock); mtx_unlock(&kr->q_lock); nm_kr_put(kr); } /* stop or enable a single ring */ void netmap_set_ring(struct netmap_adapter *na, u_int ring_id, enum txrx t, int stopped) { if (stopped) netmap_disable_ring(NMR(na, t)[ring_id], stopped); else NMR(na, t)[ring_id]->nkr_stopped = 0; } /* stop or enable all the rings of na */ void netmap_set_all_rings(struct netmap_adapter *na, int stopped) { int i; enum txrx t; if (!nm_netmap_on(na)) return; for_rx_tx(t) { for (i = 0; i < netmap_real_rings(na, t); i++) { netmap_set_ring(na, i, t, stopped); } } } /* * Convenience function used in drivers. Waits for current txsync()s/rxsync()s * to finish and prevents any new one from starting. Call this before turning * netmap mode off, or before removing the hardware rings (e.g., on module * unload). */ void netmap_disable_all_rings(struct ifnet *ifp) { if (NM_NA_VALID(ifp)) { netmap_set_all_rings(NA(ifp), NM_KR_STOPPED); } } /* * Convenience function used in drivers. Re-enables rxsync and txsync on the * adapter's rings. In linux drivers, this should be placed near each * napi_enable().
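 *
 * A typical driver reinit path brackets the hardware reset with the
 * disable/enable pair (a minimal sketch; DEVICE_reinit() and sc are
 * hypothetical driver names, not part of this file):
 *
 *	netmap_disable_all_rings(ifp);
 *	DEVICE_reinit(sc);	/* reprogram the NIC rings */
 *	netmap_enable_all_rings(ifp);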
*/ void netmap_enable_all_rings(struct ifnet *ifp) { if (NM_NA_VALID(ifp)) { netmap_set_all_rings(NA(ifp), 0 /* enabled */); } } void netmap_make_zombie(struct ifnet *ifp) { if (NM_NA_VALID(ifp)) { struct netmap_adapter *na = NA(ifp); netmap_set_all_rings(na, NM_KR_LOCKED); na->na_flags |= NAF_ZOMBIE; netmap_set_all_rings(na, 0); } } void netmap_undo_zombie(struct ifnet *ifp) { if (NM_NA_VALID(ifp)) { struct netmap_adapter *na = NA(ifp); if (na->na_flags & NAF_ZOMBIE) { netmap_set_all_rings(na, NM_KR_LOCKED); na->na_flags &= ~NAF_ZOMBIE; netmap_set_all_rings(na, 0); } } } /* * generic bound_checking function */ u_int nm_bound_var(u_int *v, u_int dflt, u_int lo, u_int hi, const char *msg) { u_int oldv = *v; const char *op = NULL; if (dflt < lo) dflt = lo; if (dflt > hi) dflt = hi; if (oldv < lo) { *v = dflt; op = "Bump"; } else if (oldv > hi) { *v = hi; op = "Clamp"; } if (op && msg) nm_prinf("%s %s to %d (was %d)", op, msg, *v, oldv); return *v; } /* * packet-dump function, user-supplied or static buffer. * The destination buffer must be at least 30+4*len */ const char * nm_dump_buf(char *p, int len, int lim, char *dst) { static char _dst[8192]; int i, j, i0; static char hex[] ="0123456789abcdef"; char *o; /* output position */ #define P_HI(x) hex[((x) & 0xf0)>>4] #define P_LO(x) hex[((x) & 0xf)] #define P_C(x) ((x) >= 0x20 && (x) <= 0x7e ? (x) : '.') if (!dst) dst = _dst; if (lim <= 0 || lim > len) lim = len; o = dst; sprintf(o, "buf 0x%p len %d lim %d\n", p, len, lim); o += strlen(o); /* hexdump routine */ for (i = 0; i < lim; ) { sprintf(o, "%5d: ", i); o += strlen(o); memset(o, ' ', 48); i0 = i; for (j=0; j < 16 && i < lim; i++, j++) { o[j*3] = P_HI(p[i]); o[j*3+1] = P_LO(p[i]); } i = i0; for (j=0; j < 16 && i < lim; i++, j++) o[j + 48] = P_C(p[i]); o[j+48] = '\n'; o += j+49; } *o = '\0'; #undef P_HI #undef P_LO #undef P_C return dst; } /* * Fetch configuration from the device, to cope with dynamic * reconfigurations after loading the module. 
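 *
 * The fetch goes through the na->nm_config() callback, which a driver
 * implements by filling in a struct nm_config_info (a sketch;
 * DEVICE_netmap_config() and the softc fields are hypothetical):
 *
 *	static int
 *	DEVICE_netmap_config(struct netmap_adapter *na,
 *		struct nm_config_info *info)
 *	{
 *		struct DEVICE_softc *sc = na->ifp->if_softc;
 *
 *		info->num_tx_rings = sc->num_queues;
 *		info->num_rx_rings = sc->num_queues;
 *		info->num_tx_descs = sc->num_tx_desc;
 *		info->num_rx_descs = sc->num_rx_desc;
 *		info->rx_buf_maxsize = sc->rx_buf_size;
 *		return 0;
 *	}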
*/ /* call with NMG_LOCK held */ int netmap_update_config(struct netmap_adapter *na) { struct nm_config_info info; bzero(&info, sizeof(info)); if (na->nm_config == NULL || na->nm_config(na, &info)) { /* take whatever we had at init time */ info.num_tx_rings = na->num_tx_rings; info.num_tx_descs = na->num_tx_desc; info.num_rx_rings = na->num_rx_rings; info.num_rx_descs = na->num_rx_desc; info.rx_buf_maxsize = na->rx_buf_maxsize; } if (na->num_tx_rings == info.num_tx_rings && na->num_tx_desc == info.num_tx_descs && na->num_rx_rings == info.num_rx_rings && na->num_rx_desc == info.num_rx_descs && na->rx_buf_maxsize == info.rx_buf_maxsize) return 0; /* nothing changed */ if (na->active_fds == 0) { na->num_tx_rings = info.num_tx_rings; na->num_tx_desc = info.num_tx_descs; na->num_rx_rings = info.num_rx_rings; na->num_rx_desc = info.num_rx_descs; na->rx_buf_maxsize = info.rx_buf_maxsize; if (netmap_verbose) nm_prinf("configuration changed for %s: txring %d x %d, " "rxring %d x %d, rxbufsz %d", na->name, na->num_tx_rings, na->num_tx_desc, na->num_rx_rings, na->num_rx_desc, na->rx_buf_maxsize); return 0; } nm_prerr("WARNING: configuration changed for %s while active: " "txring %d x %d, rxring %d x %d, rxbufsz %d", na->name, info.num_tx_rings, info.num_tx_descs, info.num_rx_rings, info.num_rx_descs, info.rx_buf_maxsize); return 1; } /* nm_sync callbacks for the host rings */ static int netmap_txsync_to_host(struct netmap_kring *kring, int flags); static int netmap_rxsync_from_host(struct netmap_kring *kring, int flags); /* create the krings array and initialize the fields common to all adapters. * The array layout is this: * * +----------+ * na->tx_rings ----->| | \ * | | } na->num_tx_rings * | | / * +----------+ * | | host tx kring * na->rx_rings ----> +----------+ * | | \ * | | } na->num_rx_rings * | | / * +----------+ * | | host rx kring * +----------+ * na->tailroom ----->| | \ * | | } tailroom bytes * | | / * +----------+ * * Note: for compatibility, host krings are created even when not needed. * The tailroom space is currently used by vale ports for allocating leases. */ /* call with NMG_LOCK held */ int netmap_krings_create(struct netmap_adapter *na, u_int tailroom) { u_int i, len, ndesc; struct netmap_kring *kring; u_int n[NR_TXRX]; enum txrx t; int err = 0; if (na->tx_rings != NULL) { if (netmap_debug & NM_DEBUG_ON) nm_prerr("warning: krings were already created"); return 0; } /* account for the (possibly fake) host rings */ n[NR_TX] = netmap_all_rings(na, NR_TX); n[NR_RX] = netmap_all_rings(na, NR_RX); len = (n[NR_TX] + n[NR_RX]) * (sizeof(struct netmap_kring) + sizeof(struct netmap_kring *)) + tailroom; na->tx_rings = nm_os_malloc((size_t)len); if (na->tx_rings == NULL) { nm_prerr("Cannot allocate krings"); return ENOMEM; } na->rx_rings = na->tx_rings + n[NR_TX]; na->tailroom = na->rx_rings + n[NR_RX]; /* link the krings in the krings array */ kring = (struct netmap_kring *)((char *)na->tailroom + tailroom); for (i = 0; i < n[NR_TX] + n[NR_RX]; i++) { na->tx_rings[i] = kring; kring++; } /* * All fields in krings are 0 except the ones initialized below, * but better be explicit on important kring fields. */ for_rx_tx(t) { ndesc = nma_get_ndesc(na, t); for (i = 0; i < n[t]; i++) { kring = NMR(na, t)[i]; bzero(kring, sizeof(*kring)); kring->notify_na = na; kring->ring_id = i; kring->tx = t; kring->nkr_num_slots = ndesc; kring->nr_mode = NKR_NETMAP_OFF; kring->nr_pending_mode = NKR_NETMAP_OFF; if (i < nma_get_nrings(na, t)) { kring->nm_sync = (t == NR_TX ?
na->nm_txsync : na->nm_rxsync); } else { if (!(na->na_flags & NAF_HOST_RINGS)) kring->nr_kflags |= NKR_FAKERING; kring->nm_sync = (t == NR_TX ? netmap_txsync_to_host: netmap_rxsync_from_host); } kring->nm_notify = na->nm_notify; kring->rhead = kring->rcur = kring->nr_hwcur = 0; /* * IMPORTANT: Always keep one slot empty. */ kring->rtail = kring->nr_hwtail = (t == NR_TX ? ndesc - 1 : 0); snprintf(kring->name, sizeof(kring->name) - 1, "%s %s%d", na->name, nm_txrx2str(t), i); nm_prdis("ktx %s h %d c %d t %d", kring->name, kring->rhead, kring->rcur, kring->rtail); err = nm_os_selinfo_init(&kring->si, kring->name); if (err) { netmap_krings_delete(na); return err; } mtx_init(&kring->q_lock, (t == NR_TX ? "nm_txq_lock" : "nm_rxq_lock"), NULL, MTX_DEF); kring->na = na; /* setting this field marks the mutex as initialized */ } err = nm_os_selinfo_init(&na->si[t], na->name); if (err) { netmap_krings_delete(na); return err; } } return 0; } /* undo the actions performed by netmap_krings_create */ /* call with NMG_LOCK held */ void netmap_krings_delete(struct netmap_adapter *na) { struct netmap_kring **kring = na->tx_rings; enum txrx t; if (na->tx_rings == NULL) { if (netmap_debug & NM_DEBUG_ON) nm_prerr("warning: krings were already deleted"); return; } for_rx_tx(t) nm_os_selinfo_uninit(&na->si[t]); /* we rely on the krings layout described above */ for ( ; kring != na->tailroom; kring++) { if ((*kring)->na != NULL) mtx_destroy(&(*kring)->q_lock); nm_os_selinfo_uninit(&(*kring)->si); } nm_os_free(na->tx_rings); na->tx_rings = na->rx_rings = na->tailroom = NULL; } /* * Destructor for NIC ports. They also have an mbuf queue * on the rings connected to the host so we need to purge * them first. */ /* call with NMG_LOCK held */ void netmap_hw_krings_delete(struct netmap_adapter *na) { u_int lim = netmap_real_rings(na, NR_RX), i; for (i = nma_get_nrings(na, NR_RX); i < lim; i++) { struct mbq *q = &NMR(na, NR_RX)[i]->rx_queue; nm_prdis("destroy sw mbq with len %d", mbq_len(q)); mbq_purge(q); mbq_safe_fini(q); } netmap_krings_delete(na); } static void netmap_mem_drop(struct netmap_adapter *na) { int last = netmap_mem_deref(na->nm_mem, na); /* if the native allocator had been overridden on regif, * restore it now and drop the temporary one */ if (last && na->nm_mem_prev) { netmap_mem_put(na->nm_mem); na->nm_mem = na->nm_mem_prev; na->nm_mem_prev = NULL; } } /* * Undo everything that was done in netmap_do_regif(). In particular, * call nm_register(ifp,0) to stop netmap mode on the interface and * revert to normal operation. */ /* call with NMG_LOCK held */ static void netmap_unset_ringid(struct netmap_priv_d *); static void netmap_krings_put(struct netmap_priv_d *); void netmap_do_unregif(struct netmap_priv_d *priv) { struct netmap_adapter *na = priv->np_na; NMG_LOCK_ASSERT(); na->active_fds--; /* unset nr_pending_mode and possibly release exclusive mode */ netmap_krings_put(priv); #ifdef WITH_MONITOR /* XXX check whether we have to do something with monitor * when rings change nr_mode. */ if (na->active_fds <= 0) { /* walk through all the rings and tell any monitor * that the port is going to exit netmap mode */ netmap_monitor_stop(na); } #endif if (na->active_fds <= 0 || nm_kring_pending(priv)) { na->nm_register(na, 0); } /* delete rings and buffers that are no longer needed */ netmap_mem_rings_delete(na); if (na->active_fds <= 0) { /* last instance */ /* * (TO CHECK) We enter here * when the last reference to this file descriptor goes * away.
This means we cannot have any pending poll() * or interrupt routine operating on the structure. * XXX The file may be closed in a thread while * another thread is using it. * Linux keeps the file opened until the last reference * by any outstanding ioctl/poll or mmap is gone. * FreeBSD does not track mmap()s (but we do) and * wakes up any sleeping poll(). Need to check what * happens if the close() occurs while a concurrent * syscall is running. */ if (netmap_debug & NM_DEBUG_ON) nm_prinf("deleting last instance for %s", na->name); if (nm_netmap_on(na)) { nm_prerr("BUG: netmap on while going to delete the krings"); } na->nm_krings_delete(na); /* restore the default number of host tx and rx rings */ if (na->na_flags & NAF_HOST_RINGS) { na->num_host_tx_rings = 1; na->num_host_rx_rings = 1; } else { na->num_host_tx_rings = 0; na->num_host_rx_rings = 0; } } /* possibly decrement counter of tx_si/rx_si users */ netmap_unset_ringid(priv); /* delete the nifp */ netmap_mem_if_delete(na, priv->np_nifp); /* drop the allocator */ netmap_mem_drop(na); /* mark the priv as unregistered */ priv->np_na = NULL; priv->np_nifp = NULL; } struct netmap_priv_d* netmap_priv_new(void) { struct netmap_priv_d *priv; priv = nm_os_malloc(sizeof(struct netmap_priv_d)); if (priv == NULL) return NULL; priv->np_refs = 1; nm_os_get_module(); return priv; } /* * Destructor of the netmap_priv_d, called when the fd is closed. * Action: undo all the things done by NIOCREGIF. * On FreeBSD we need to track whether there are active mmap()s, * and we use np_active_mmaps for that. On linux, the field is always 0. * Return: 1 if we can free priv, 0 otherwise. * */ /* call with NMG_LOCK held */ void netmap_priv_delete(struct netmap_priv_d *priv) { struct netmap_adapter *na = priv->np_na; /* number of active references to this fd */ if (--priv->np_refs > 0) { return; } nm_os_put_module(); if (na) { netmap_do_unregif(priv); } netmap_unget_na(na, priv->np_ifp); bzero(priv, sizeof(*priv)); /* for safety */ nm_os_free(priv); } /* call with NMG_LOCK *not* held */ void netmap_dtor(void *data) { struct netmap_priv_d *priv = data; NMG_LOCK(); netmap_priv_delete(priv); NMG_UNLOCK(); } /* * Handlers for synchronization of the rings from/to the host stack. * These are associated to a network interface and are just another * ring pair managed by userspace. * * Netmap also supports transparent forwarding (NS_FORWARD and NR_FORWARD * flags): * * - Before releasing buffers on hw RX rings, the application can mark * them with the NS_FORWARD flag. During the next RXSYNC or poll(), they * will be forwarded to the host stack, similarly to what would have * happened if the application moved them to the host TX ring. * * - Before releasing buffers on the host RX ring, the application can * mark them with the NS_FORWARD flag. During the next RXSYNC or poll(), * they will be forwarded to the hw TX rings, saving the application * from doing the same task in user-space. * * Transparent forwarding can be enabled per-ring, by setting the NR_FORWARD * flag, or globally with the netmap_fwd sysctl. * * The transfer NIC --> host is relatively easy, just encapsulate * into mbufs and we are done. The host --> NIC side is slightly * harder because there might not be room in the tx ring so it * might take a while before releasing the buffer. */ /* * Pass a whole queue of mbufs to the host stack as coming from 'dst' * We do not need to lock because the queue is private. * After this call the queue is empty.
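 *
 * For reference, the userspace side of the NS_FORWARD mechanism
 * described above is just (a sketch, using the net/netmap_user.h
 * helpers; 'nifp' and 'fd' come from the usual regif/mmap setup):
 *
 *	struct netmap_ring *ring = NETMAP_RXRING(nifp, i);
 *	uint32_t cur = ring->cur;
 *
 *	ring->slot[cur].flags |= NS_FORWARD;	/* pass buffer to the other side */
 *	ring->head = ring->cur = nm_ring_next(ring, cur);
 *	ioctl(fd, NIOCRXSYNC, NULL);	/* or poll(); needs NR_FORWARD/netmap_fwd */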
*/ static void netmap_send_up(struct ifnet *dst, struct mbq *q) { struct mbuf *m; struct mbuf *head = NULL, *prev = NULL; /* Send packets up, outside the lock; head/prev machinery * is only useful for Windows. */ while ((m = mbq_dequeue(q)) != NULL) { if (netmap_debug & NM_DEBUG_HOST) nm_prinf("sending up pkt %p size %d", m, MBUF_LEN(m)); prev = nm_os_send_up(dst, m, prev); if (head == NULL) head = prev; } if (head) nm_os_send_up(dst, NULL, head); mbq_fini(q); } /* * Scan the buffers from hwcur to ring->head, and put a copy of those * marked NS_FORWARD (or all of them if forced) into a queue of mbufs. * Drop remaining packets in the unlikely event * of an mbuf shortage. */ static void netmap_grab_packets(struct netmap_kring *kring, struct mbq *q, int force) { u_int const lim = kring->nkr_num_slots - 1; u_int const head = kring->rhead; u_int n; struct netmap_adapter *na = kring->na; for (n = kring->nr_hwcur; n != head; n = nm_next(n, lim)) { struct mbuf *m; struct netmap_slot *slot = &kring->ring->slot[n]; if ((slot->flags & NS_FORWARD) == 0 && !force) continue; if (slot->len < 14 || slot->len > NETMAP_BUF_SIZE(na)) { nm_prlim(5, "bad pkt at %d len %d", n, slot->len); continue; } slot->flags &= ~NS_FORWARD; // XXX needed ? /* XXX TODO: adapt to the case of a multisegment packet */ m = m_devget(NMB(na, slot), slot->len, 0, na->ifp, NULL); if (m == NULL) break; mbq_enqueue(q, m); } } static inline int _nm_may_forward(struct netmap_kring *kring) { return ((netmap_fwd || kring->ring->flags & NR_FORWARD) && kring->na->na_flags & NAF_HOST_RINGS && kring->tx == NR_RX); } static inline int nm_may_forward_up(struct netmap_kring *kring) { return _nm_may_forward(kring) && kring->ring_id != kring->na->num_rx_rings; } static inline int nm_may_forward_down(struct netmap_kring *kring, int sync_flags) { return _nm_may_forward(kring) && (sync_flags & NAF_CAN_FORWARD_DOWN) && kring->ring_id == kring->na->num_rx_rings; } /* * Send to the NIC rings packets marked NS_FORWARD between * kring->nr_hwcur and kring->rhead. * Called under kring->rx_queue.lock on the sw rx ring. * * It can only be called if the user opened all the TX hw rings, * see NAF_CAN_FORWARD_DOWN flag. * We can touch the TX netmap rings (slots, head and cur) since * we are in poll/ioctl system call context, and the application * is not supposed to touch the ring (using a different thread) * during the execution of the system call. */ static u_int netmap_sw_to_nic(struct netmap_adapter *na) { struct netmap_kring *kring = na->rx_rings[na->num_rx_rings]; struct netmap_slot *rxslot = kring->ring->slot; u_int i, rxcur = kring->nr_hwcur; u_int const head = kring->rhead; u_int const src_lim = kring->nkr_num_slots - 1; u_int sent = 0; /* scan rings to find space, then fill as much as possible */ for (i = 0; i < na->num_tx_rings; i++) { struct netmap_kring *kdst = na->tx_rings[i]; struct netmap_ring *rdst = kdst->ring; u_int const dst_lim = kdst->nkr_num_slots - 1; /* XXX do we trust ring or kring->rcur,rtail ? */ for (; rxcur != head && !nm_ring_empty(rdst); rxcur = nm_next(rxcur, src_lim) ) { struct netmap_slot *src, *dst, tmp; u_int dst_head = rdst->head; src = &rxslot[rxcur]; if ((src->flags & NS_FORWARD) == 0 && !netmap_fwd) continue; sent++; dst = &rdst->slot[dst_head]; tmp = *src; src->buf_idx = dst->buf_idx; src->flags = NS_BUF_CHANGED; dst->buf_idx = tmp.buf_idx; dst->len = tmp.len; dst->flags = NS_BUF_CHANGED; rdst->head = rdst->cur = nm_next(dst_head, dst_lim); } /* if (sent) XXX txsync ? 
it would be just an optimization */ } return sent; } /* * netmap_txsync_to_host() passes packets up. We are called from a * system call in user process context, and the only contention * can be among multiple user threads erroneously calling * this routine concurrently. */ static int netmap_txsync_to_host(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; u_int const lim = kring->nkr_num_slots - 1; u_int const head = kring->rhead; struct mbq q; /* Take packets from hwcur to head and pass them up. * Force hwcur = head since netmap_grab_packets() stops at head */ mbq_init(&q); netmap_grab_packets(kring, &q, 1 /* force */); nm_prdis("have %d pkts in queue", mbq_len(&q)); kring->nr_hwcur = head; kring->nr_hwtail = head + lim; if (kring->nr_hwtail > lim) kring->nr_hwtail -= lim + 1; netmap_send_up(na->ifp, &q); return 0; } /* * rxsync backend for packets coming from the host stack. * They have been put in kring->rx_queue by netmap_transmit(). * We protect access to the kring using kring->rx_queue.lock * * also moves to the nic hw rings any packet the user has marked * for transparent-mode forwarding, then sets the NR_FORWARD * flag in the kring to let the caller push them out */ static int netmap_rxsync_from_host(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->na; struct netmap_ring *ring = kring->ring; u_int nm_i, n; u_int const lim = kring->nkr_num_slots - 1; u_int const head = kring->rhead; int ret = 0; struct mbq *q = &kring->rx_queue, fq; mbq_init(&fq); /* fq holds packets to be freed */ mbq_lock(q); /* First part: import newly received packets */ n = mbq_len(q); if (n) { /* grab packets from the queue */ struct mbuf *m; uint32_t stop_i; nm_i = kring->nr_hwtail; stop_i = nm_prev(kring->nr_hwcur, lim); while ( nm_i != stop_i && (m = mbq_dequeue(q)) != NULL ) { int len = MBUF_LEN(m); struct netmap_slot *slot = &ring->slot[nm_i]; m_copydata(m, 0, len, NMB(na, slot)); nm_prdis("nm %d len %d", nm_i, len); if (netmap_debug & NM_DEBUG_HOST) nm_prinf("%s", nm_dump_buf(NMB(na, slot),len, 128, NULL)); slot->len = len; slot->flags = 0; nm_i = nm_next(nm_i, lim); mbq_enqueue(&fq, m); } kring->nr_hwtail = nm_i; } /* * Second part: skip past packets that userspace has released. */ nm_i = kring->nr_hwcur; if (nm_i != head) { /* something was released */ if (nm_may_forward_down(kring, flags)) { ret = netmap_sw_to_nic(na); if (ret > 0) { kring->nr_kflags |= NR_FORWARD; ret = 0; } } kring->nr_hwcur = head; } mbq_unlock(q); mbq_purge(&fq); mbq_fini(&fq); return ret; } /* Get a netmap adapter for the port. * * If it is possible to satisfy the request, return 0 * with *na containing the netmap adapter found. * Otherwise return an error code, with *na containing NULL. * * When the port is attached to a bridge, we always return * EBUSY. * Otherwise, if the port is already bound to a file descriptor, * then we unconditionally return the existing adapter into *na. 
* In all the other cases, we return (into *na) either native, * generic or NULL, according to the following table: * * native_support * active_fds dev.netmap.admode YES NO * ------------------------------------------------------- * >0 * NA(ifp) NA(ifp) * * 0 NETMAP_ADMODE_BEST NATIVE GENERIC * 0 NETMAP_ADMODE_NATIVE NATIVE NULL * 0 NETMAP_ADMODE_GENERIC GENERIC GENERIC * */ static void netmap_hw_dtor(struct netmap_adapter *); /* needed by NM_IS_NATIVE() */ int netmap_get_hw_na(struct ifnet *ifp, struct netmap_mem_d *nmd, struct netmap_adapter **na) { /* generic support */ int i = netmap_admode; /* Take a snapshot. */ struct netmap_adapter *prev_na; int error = 0; *na = NULL; /* default */ /* reset in case of invalid value */ if (i < NETMAP_ADMODE_BEST || i >= NETMAP_ADMODE_LAST) i = netmap_admode = NETMAP_ADMODE_BEST; if (NM_NA_VALID(ifp)) { prev_na = NA(ifp); /* If an adapter already exists, return it if * there are active file descriptors or if * netmap is not forced to use generic * adapters. */ if (NETMAP_OWNED_BY_ANY(prev_na) || i != NETMAP_ADMODE_GENERIC || prev_na->na_flags & NAF_FORCE_NATIVE #ifdef WITH_PIPES /* ugly, but we cannot allow an adapter switch * if some pipe is referring to this one */ || prev_na->na_next_pipe > 0 #endif ) { *na = prev_na; goto assign_mem; } } /* If there isn't native support and netmap is not allowed * to use generic adapters, we cannot satisfy the request. */ if (!NM_IS_NATIVE(ifp) && i == NETMAP_ADMODE_NATIVE) return EOPNOTSUPP; /* Otherwise, create a generic adapter and return it, * saving the previously used netmap adapter, if any. * * Note that here 'prev_na', if not NULL, MUST be a * native adapter, and CANNOT be a generic one. This is * true because generic adapters are created on demand, and * destroyed when not used anymore. Therefore, if the adapter * currently attached to an interface 'ifp' is generic, it * must be that * (NA(ifp)->active_fds > 0 || NETMAP_OWNED_BY_KERN(NA(ifp))). * Consequently, if NA(ifp) is generic, we will enter one of * the branches above. This ensures that we never override * a generic adapter with another generic adapter. */ error = generic_netmap_attach(ifp); if (error) return error; *na = NA(ifp); assign_mem: if (nmd != NULL && !((*na)->na_flags & NAF_MEM_OWNER) && (*na)->active_fds == 0 && ((*na)->nm_mem != nmd)) { (*na)->nm_mem_prev = (*na)->nm_mem; (*na)->nm_mem = netmap_mem_get(nmd); } return 0; } /* * MUST BE CALLED UNDER NMG_LOCK() * * Get a refcounted reference to a netmap adapter attached * to the interface specified by req. * This is always called in the execution of an ioctl(). * * Return ENXIO if the interface specified by the request does * not exist, ENOTSUP if netmap is not supported by the interface, * EBUSY if the interface is already attached to a bridge, * EINVAL if parameters are invalid, ENOMEM if needed resources * could not be allocated. * If successful, hold a reference to the netmap adapter. * * If the interface specified by req is a system one, also keep * a reference to it and return a valid *ifp. 
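 *
 * Typical caller sketch (error handling elided; see the ioctl
 * handlers in this file for the real usage):
 *
 *	NMG_LOCK();
 *	error = netmap_get_na(hdr, &na, &ifp, NULL, 1 /* create */);
 *	if (!error) {
 *		... use na ...
 *		netmap_unget_na(na, ifp);
 *	}
 *	NMG_UNLOCK();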
*/ int netmap_get_na(struct nmreq_header *hdr, struct netmap_adapter **na, struct ifnet **ifp, struct netmap_mem_d *nmd, int create) { struct nmreq_register *req = (struct nmreq_register *)(uintptr_t)hdr->nr_body; int error = 0; struct netmap_adapter *ret = NULL; int nmd_ref = 0; *na = NULL; /* default return value */ *ifp = NULL; if (hdr->nr_reqtype != NETMAP_REQ_REGISTER) { return EINVAL; } if (req->nr_mode == NR_REG_PIPE_MASTER || req->nr_mode == NR_REG_PIPE_SLAVE) { /* Do not accept deprecated pipe modes. */ nm_prerr("Deprecated pipe nr_mode, use xx{yy or xx}yy syntax"); return EINVAL; } NMG_LOCK_ASSERT(); /* if the request contains a memid, try to find the * corresponding memory region */ if (nmd == NULL && req->nr_mem_id) { nmd = netmap_mem_find(req->nr_mem_id); if (nmd == NULL) return EINVAL; /* keep the reference */ nmd_ref = 1; } /* We cascade through all possible types of netmap adapter. * All netmap_get_*_na() functions return an error and an na, * with the following combinations: * * error na * 0 NULL type doesn't match * !0 NULL type matches, but na creation/lookup failed * 0 !NULL type matches and na created/found * !0 !NULL impossible */ error = netmap_get_null_na(hdr, na, nmd, create); if (error || *na != NULL) goto out; /* try to see if this is a monitor port */ error = netmap_get_monitor_na(hdr, na, nmd, create); if (error || *na != NULL) goto out; /* try to see if this is a pipe port */ error = netmap_get_pipe_na(hdr, na, nmd, create); if (error || *na != NULL) goto out; /* try to see if this is a bridge port */ error = netmap_get_vale_na(hdr, na, nmd, create); if (error) goto out; if (*na != NULL) /* valid match in netmap_get_bdg_na() */ goto out; /* * This must be a hardware na, lookup the name in the system. * Note that by hardware we actually mean "it shows up in ifconfig". * This may still be a tap, a veth/epair, or even a * persistent VALE port. */ *ifp = ifunit_ref(hdr->nr_name); if (*ifp == NULL) { error = ENXIO; goto out; } error = netmap_get_hw_na(*ifp, nmd, &ret); if (error) goto out; *na = ret; netmap_adapter_get(ret); /* * if the adapter supports the host rings and it is not already open, * try to set the number of host rings as requested by the user */ if (((*na)->na_flags & NAF_HOST_RINGS) && (*na)->active_fds == 0) { if (req->nr_host_tx_rings) (*na)->num_host_tx_rings = req->nr_host_tx_rings; if (req->nr_host_rx_rings) (*na)->num_host_rx_rings = req->nr_host_rx_rings; } nm_prdis("%s: host tx %d rx %u", (*na)->name, (*na)->num_host_tx_rings, (*na)->num_host_rx_rings); out: if (error) { if (ret) netmap_adapter_put(ret); if (*ifp) { if_rele(*ifp); *ifp = NULL; } } if (nmd_ref) netmap_mem_put(nmd); return error; } /* undo netmap_get_na() */ void netmap_unget_na(struct netmap_adapter *na, struct ifnet *ifp) { if (ifp) if_rele(ifp); if (na) netmap_adapter_put(na); } #define NM_FAIL_ON(t) do { \ if (unlikely(t)) { \ nm_prlim(5, "%s: fail '" #t "' " \ "h %d c %d t %d " \ "rh %d rc %d rt %d " \ "hc %d ht %d", \ kring->name, \ head, cur, ring->tail, \ kring->rhead, kring->rcur, kring->rtail, \ kring->nr_hwcur, kring->nr_hwtail); \ return kring->nkr_num_slots; \ } \ } while (0) /* * validate parameters on entry for *_txsync() * Returns ring->cur if ok, or something >= kring->nkr_num_slots * in case of error. * * rhead, rcur and rtail=hwtail are stored from previous round. * hwcur is the next packet to send to the ring.
* * We want * hwcur <= *rhead <= head <= cur <= tail = *rtail <= hwtail * * hwcur, rhead, rtail and hwtail are reliable */ u_int nm_txsync_prologue(struct netmap_kring *kring, struct netmap_ring *ring) { u_int head = ring->head; /* read only once */ u_int cur = ring->cur; /* read only once */ u_int n = kring->nkr_num_slots; nm_prdis(5, "%s kcur %d ktail %d head %d cur %d tail %d", kring->name, kring->nr_hwcur, kring->nr_hwtail, ring->head, ring->cur, ring->tail); #if 1 /* kernel sanity checks; but we can trust the kring. */ NM_FAIL_ON(kring->nr_hwcur >= n || kring->rhead >= n || kring->rtail >= n || kring->nr_hwtail >= n); #endif /* kernel sanity checks */ /* * user sanity checks. We only use head, * A, B, ... are possible positions for head: * * 0 A rhead B rtail C n-1 * 0 D rtail E rhead F n-1 * * B, F, D are valid. A, C, E are wrong */ if (kring->rtail >= kring->rhead) { /* want rhead <= head <= rtail */ NM_FAIL_ON(head < kring->rhead || head > kring->rtail); /* and also head <= cur <= rtail */ NM_FAIL_ON(cur < head || cur > kring->rtail); } else { /* here rtail < rhead */ /* we need head outside rtail .. rhead */ NM_FAIL_ON(head > kring->rtail && head < kring->rhead); /* two cases now: head <= rtail or head >= rhead */ if (head <= kring->rtail) { /* want head <= cur <= rtail */ NM_FAIL_ON(cur < head || cur > kring->rtail); } else { /* head >= rhead */ /* cur must be outside rtail..head */ NM_FAIL_ON(cur > kring->rtail && cur < head); } } if (ring->tail != kring->rtail) { nm_prlim(5, "%s tail overwritten was %d need %d", kring->name, ring->tail, kring->rtail); ring->tail = kring->rtail; } kring->rhead = head; kring->rcur = cur; return head; } /* * validate parameters on entry for *_rxsync() * Returns ring->head if ok, kring->nkr_num_slots on error. * * For a valid configuration, * hwcur <= head <= cur <= tail <= hwtail * * We only consider head and cur. * hwcur and hwtail are reliable. * */ u_int nm_rxsync_prologue(struct netmap_kring *kring, struct netmap_ring *ring) { uint32_t const n = kring->nkr_num_slots; uint32_t head, cur; nm_prdis(5,"%s kc %d kt %d h %d c %d t %d", kring->name, kring->nr_hwcur, kring->nr_hwtail, ring->head, ring->cur, ring->tail); /* * Before storing the new values, we should check they do not * move backwards. 
However: * - head is not an issue because the previous value is hwcur; * - cur could in principle go back, however it does not matter * because we are processing a brand new rxsync() */ cur = kring->rcur = ring->cur; /* read only once */ head = kring->rhead = ring->head; /* read only once */ #if 1 /* kernel sanity checks */ NM_FAIL_ON(kring->nr_hwcur >= n || kring->nr_hwtail >= n); #endif /* kernel sanity checks */ /* user sanity checks */ if (kring->nr_hwtail >= kring->nr_hwcur) { /* want hwcur <= rhead <= hwtail */ NM_FAIL_ON(head < kring->nr_hwcur || head > kring->nr_hwtail); /* and also rhead <= rcur <= hwtail */ NM_FAIL_ON(cur < head || cur > kring->nr_hwtail); } else { /* we need rhead outside hwtail..hwcur */ NM_FAIL_ON(head < kring->nr_hwcur && head > kring->nr_hwtail); /* two cases now: head <= hwtail or head >= hwcur */ if (head <= kring->nr_hwtail) { /* want head <= cur <= hwtail */ NM_FAIL_ON(cur < head || cur > kring->nr_hwtail); } else { /* cur must be outside hwtail..head */ NM_FAIL_ON(cur < head && cur > kring->nr_hwtail); } } if (ring->tail != kring->rtail) { nm_prlim(5, "%s tail overwritten was %d need %d", kring->name, ring->tail, kring->rtail); ring->tail = kring->rtail; } return head; } /* * Error routine called when txsync/rxsync detects an error. * Can't do much more than resetting head = cur = hwcur, tail = hwtail * Return 1 on reinit. * * This routine is only called by the upper half of the kernel. * It only reads hwcur (which is changed only by the upper half, too) * and hwtail (which may be changed by the lower half, but only on * a tx ring and only to increase it, so any error will be recovered * on the next call). For the above, we don't strictly need to call * it under lock. */ int netmap_ring_reinit(struct netmap_kring *kring) { struct netmap_ring *ring = kring->ring; u_int i, lim = kring->nkr_num_slots - 1; int errors = 0; // XXX KASSERT nm_kr_tryget nm_prlim(10, "called for %s", kring->name); // XXX probably wrong to trust userspace kring->rhead = ring->head; kring->rcur = ring->cur; kring->rtail = ring->tail; if (ring->cur > lim) errors++; if (ring->head > lim) errors++; if (ring->tail > lim) errors++; for (i = 0; i <= lim; i++) { u_int idx = ring->slot[i].buf_idx; u_int len = ring->slot[i].len; if (idx < 2 || idx >= kring->na->na_lut.objtotal) { nm_prlim(5, "bad index at slot %d idx %d len %d ", i, idx, len); ring->slot[i].buf_idx = 0; ring->slot[i].len = 0; } else if (len > NETMAP_BUF_SIZE(kring->na)) { ring->slot[i].len = 0; nm_prlim(5, "bad len at slot %d idx %d len %d", i, idx, len); } } if (errors) { nm_prlim(10, "total %d errors", errors); nm_prlim(10, "%s reinit, cur %d -> %d tail %d -> %d", kring->name, ring->cur, kring->nr_hwcur, ring->tail, kring->nr_hwtail); ring->head = kring->rhead = kring->nr_hwcur; ring->cur = kring->rcur = kring->nr_hwcur; ring->tail = kring->rtail = kring->nr_hwtail; } return (errors ? 
1 : 0); } /* interpret the ringid and flags fields of an nmreq, by translating them * into a pair of intervals of ring indices: * * [priv->np_txqfirst, priv->np_txqlast) and * [priv->np_rxqfirst, priv->np_rxqlast) * */ int netmap_interp_ringid(struct netmap_priv_d *priv, uint32_t nr_mode, uint16_t nr_ringid, uint64_t nr_flags) { struct netmap_adapter *na = priv->np_na; int excluded_direction[] = { NR_TX_RINGS_ONLY, NR_RX_RINGS_ONLY }; enum txrx t; u_int j; for_rx_tx(t) { if (nr_flags & excluded_direction[t]) { priv->np_qfirst[t] = priv->np_qlast[t] = 0; continue; } switch (nr_mode) { case NR_REG_ALL_NIC: case NR_REG_NULL: priv->np_qfirst[t] = 0; priv->np_qlast[t] = nma_get_nrings(na, t); nm_prdis("ALL/PIPE: %s %d %d", nm_txrx2str(t), priv->np_qfirst[t], priv->np_qlast[t]); break; case NR_REG_SW: case NR_REG_NIC_SW: if (!(na->na_flags & NAF_HOST_RINGS)) { nm_prerr("host rings not supported"); return EINVAL; } priv->np_qfirst[t] = (nr_mode == NR_REG_SW ? nma_get_nrings(na, t) : 0); priv->np_qlast[t] = netmap_all_rings(na, t); nm_prdis("%s: %s %d %d", nr_mode == NR_REG_SW ? "SW" : "NIC+SW", nm_txrx2str(t), priv->np_qfirst[t], priv->np_qlast[t]); break; case NR_REG_ONE_NIC: if (nr_ringid >= na->num_tx_rings && nr_ringid >= na->num_rx_rings) { nm_prerr("invalid ring id %d", nr_ringid); return EINVAL; } /* if not enough rings, use the first one */ j = nr_ringid; if (j >= nma_get_nrings(na, t)) j = 0; priv->np_qfirst[t] = j; priv->np_qlast[t] = j + 1; nm_prdis("ONE_NIC: %s %d %d", nm_txrx2str(t), priv->np_qfirst[t], priv->np_qlast[t]); break; case NR_REG_ONE_SW: if (!(na->na_flags & NAF_HOST_RINGS)) { nm_prerr("host rings not supported"); return EINVAL; } if (nr_ringid >= na->num_host_tx_rings && nr_ringid >= na->num_host_rx_rings) { nm_prerr("invalid ring id %d", nr_ringid); return EINVAL; } /* if not enough rings, use the first one */ j = nr_ringid; if (j >= nma_get_host_nrings(na, t)) j = 0; priv->np_qfirst[t] = nma_get_nrings(na, t) + j; priv->np_qlast[t] = nma_get_nrings(na, t) + j + 1; nm_prdis("ONE_SW: %s %d %d", nm_txrx2str(t), priv->np_qfirst[t], priv->np_qlast[t]); break; default: nm_prerr("invalid regif type %d", nr_mode); return EINVAL; } } priv->np_flags = nr_flags; /* Allow transparent forwarding mode in the host --> nic * direction only if all the TX hw rings have been opened. */ if (priv->np_qfirst[NR_TX] == 0 && priv->np_qlast[NR_TX] >= na->num_tx_rings) { priv->np_sync_flags |= NAF_CAN_FORWARD_DOWN; } if (netmap_verbose) { nm_prinf("%s: tx [%d,%d) rx [%d,%d) id %d", na->name, priv->np_qfirst[NR_TX], priv->np_qlast[NR_TX], priv->np_qfirst[NR_RX], priv->np_qlast[NR_RX], nr_ringid); } return 0; } /* * Set the ring ID. For devices with a single queue, a request * for all rings is the same as a single ring. */ static int netmap_set_ringid(struct netmap_priv_d *priv, uint32_t nr_mode, uint16_t nr_ringid, uint64_t nr_flags) { struct netmap_adapter *na = priv->np_na; int error; enum txrx t; error = netmap_interp_ringid(priv, nr_mode, nr_ringid, nr_flags); if (error) { return error; } priv->np_txpoll = (nr_flags & NR_NO_TX_POLL) ? 0 : 1; /* optimization: count the users registered for more than * one ring, which are the ones sleeping on the global queue. 
* The default netmap_notify() callback will then * avoid signaling the global queue if nobody is using it */ for_rx_tx(t) { if (nm_si_user(priv, t)) na->si_users[t]++; } return 0; } static void netmap_unset_ringid(struct netmap_priv_d *priv) { struct netmap_adapter *na = priv->np_na; enum txrx t; for_rx_tx(t) { if (nm_si_user(priv, t)) na->si_users[t]--; priv->np_qfirst[t] = priv->np_qlast[t] = 0; } priv->np_flags = 0; priv->np_txpoll = 0; priv->np_kloop_state = 0; } /* Set the nr_pending_mode for the requested rings. * If requested, also try to get exclusive access to the rings, provided * the rings we want to bind are not exclusively owned by a previous bind. */ static int netmap_krings_get(struct netmap_priv_d *priv) { struct netmap_adapter *na = priv->np_na; u_int i; struct netmap_kring *kring; int excl = (priv->np_flags & NR_EXCLUSIVE); enum txrx t; if (netmap_debug & NM_DEBUG_ON) nm_prinf("%s: grabbing tx [%d, %d) rx [%d, %d)", na->name, priv->np_qfirst[NR_TX], priv->np_qlast[NR_TX], priv->np_qfirst[NR_RX], priv->np_qlast[NR_RX]); /* first round: check that all the requested rings * are not already exclusively owned, and that we are not * asking for exclusive ownership of rings already in use */ for_rx_tx(t) { for (i = priv->np_qfirst[t]; i < priv->np_qlast[t]; i++) { kring = NMR(na, t)[i]; if ((kring->nr_kflags & NKR_EXCLUSIVE) || (kring->users && excl)) { nm_prdis("ring %s busy", kring->name); return EBUSY; } } } /* second round: increment usage count (possibly marking them * as exclusive) and set the nr_pending_mode */ for_rx_tx(t) { for (i = priv->np_qfirst[t]; i < priv->np_qlast[t]; i++) { kring = NMR(na, t)[i]; kring->users++; if (excl) kring->nr_kflags |= NKR_EXCLUSIVE; kring->nr_pending_mode = NKR_NETMAP_ON; } } return 0; } /* Undo netmap_krings_get(). This is done by clearing the exclusive mode * if it was asked on regif, and unsetting the nr_pending_mode if we are the * last users of the involved rings. */ static void netmap_krings_put(struct netmap_priv_d *priv) { struct netmap_adapter *na = priv->np_na; u_int i; struct netmap_kring *kring; int excl = (priv->np_flags & NR_EXCLUSIVE); enum txrx t; nm_prdis("%s: releasing tx [%d, %d) rx [%d, %d)", na->name, priv->np_qfirst[NR_TX], priv->np_qlast[NR_TX], priv->np_qfirst[NR_RX], priv->np_qlast[NR_RX]); for_rx_tx(t) { for (i = priv->np_qfirst[t]; i < priv->np_qlast[t]; i++) { kring = NMR(na, t)[i]; if (excl) kring->nr_kflags &= ~NKR_EXCLUSIVE; kring->users--; if (kring->users == 0) kring->nr_pending_mode = NKR_NETMAP_OFF; } } } static int nm_priv_rx_enabled(struct netmap_priv_d *priv) { return (priv->np_qfirst[NR_RX] != priv->np_qlast[NR_RX]); } /* Validate the CSB entries for both directions (atok and ktoa). * To be called under NMG_LOCK().
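 *
 * Layout assumed here: each array has one entry per bound ring, TX
 * rings first, then RX rings, so with T bound tx rings the atok entry
 * of the r-th bound rx ring is found as follows (illustrative helper,
 * not part of this file):
 *
 *	static inline struct nm_csb_atok *
 *	csb_atok_of_rx(struct nm_csb_atok *base, int T, int r)
 *	{
 *		return base + T + r;	/* skip the T tx entries */
 *	}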
*/ static int netmap_csb_validate(struct netmap_priv_d *priv, struct nmreq_opt_csb *csbo) { struct nm_csb_atok *csb_atok_base = (struct nm_csb_atok *)(uintptr_t)csbo->csb_atok; struct nm_csb_ktoa *csb_ktoa_base = (struct nm_csb_ktoa *)(uintptr_t)csbo->csb_ktoa; enum txrx t; int num_rings[NR_TXRX], tot_rings; size_t entry_size[2]; void *csb_start[2]; int i; if (priv->np_kloop_state & NM_SYNC_KLOOP_RUNNING) { nm_prerr("Cannot update CSB while kloop is running"); return EBUSY; } tot_rings = 0; for_rx_tx(t) { num_rings[t] = priv->np_qlast[t] - priv->np_qfirst[t]; tot_rings += num_rings[t]; } if (tot_rings <= 0) return 0; if (!(priv->np_flags & NR_EXCLUSIVE)) { nm_prerr("CSB mode requires NR_EXCLUSIVE"); return EINVAL; } entry_size[0] = sizeof(*csb_atok_base); entry_size[1] = sizeof(*csb_ktoa_base); csb_start[0] = (void *)csb_atok_base; csb_start[1] = (void *)csb_ktoa_base; for (i = 0; i < 2; i++) { /* On Linux we could use access_ok() to simplify * the validation. However, the advantage of * this approach is that it also works on * FreeBSD. */ size_t csb_size = tot_rings * entry_size[i]; void *tmp; int err; if ((uintptr_t)csb_start[i] & (entry_size[i]-1)) { nm_prerr("Unaligned CSB address"); return EINVAL; } tmp = nm_os_malloc(csb_size); if (!tmp) return ENOMEM; if (i == 0) { /* Application --> kernel direction. */ err = copyin(csb_start[i], tmp, csb_size); } else { /* Kernel --> application direction. */ memset(tmp, 0, csb_size); err = copyout(tmp, csb_start[i], csb_size); } nm_os_free(tmp); if (err) { nm_prerr("Invalid CSB address"); return err; } } priv->np_csb_atok_base = csb_atok_base; priv->np_csb_ktoa_base = csb_ktoa_base; /* Initialize the CSB. */ for_rx_tx(t) { for (i = 0; i < num_rings[t]; i++) { struct netmap_kring *kring = NMR(priv->np_na, t)[i + priv->np_qfirst[t]]; struct nm_csb_atok *csb_atok = csb_atok_base + i; struct nm_csb_ktoa *csb_ktoa = csb_ktoa_base + i; if (t == NR_RX) { csb_atok += num_rings[NR_TX]; csb_ktoa += num_rings[NR_TX]; } CSB_WRITE(csb_atok, head, kring->rhead); CSB_WRITE(csb_atok, cur, kring->rcur); CSB_WRITE(csb_atok, appl_need_kick, 1); CSB_WRITE(csb_atok, sync_flags, 1); CSB_WRITE(csb_ktoa, hwcur, kring->nr_hwcur); CSB_WRITE(csb_ktoa, hwtail, kring->nr_hwtail); CSB_WRITE(csb_ktoa, kern_need_kick, 1); nm_prinf("csb_init for kring %s: head %u, cur %u, " "hwcur %u, hwtail %u", kring->name, kring->rhead, kring->rcur, kring->nr_hwcur, kring->nr_hwtail); } } return 0; } /* Ensure that the netmap adapter can support the given MTU. * @return EINVAL if the na cannot be set to mtu, 0 otherwise. */ int netmap_buf_size_validate(const struct netmap_adapter *na, unsigned mtu) { unsigned nbs = NETMAP_BUF_SIZE(na); if (mtu <= na->rx_buf_maxsize) { /* The MTU fits a single NIC slot. We only * need to check that netmap buffers are * large enough to hold an MTU. NS_MOREFRAG * cannot be used in this case. */ if (nbs < mtu) { nm_prerr("error: netmap buf size (%u) " "< device MTU (%u)", nbs, mtu); return EINVAL; } } else { /* More NIC slots may be needed to receive * or transmit a single packet. Check that * the adapter supports NS_MOREFRAG and that * netmap buffers are large enough to hold * the maximum per-slot size.
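		 * Example (illustrative numbers): with mtu = 9000,
		 * rx_buf_maxsize = 2048 and nbs = 2048, a packet can
		 * span up to 5 slots, so NAF_MOREFRAG is required and
		 * nbs must be at least rx_buf_maxsize.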
*/ if (!(na->na_flags & NAF_MOREFRAG)) { nm_prerr("error: large MTU (%d) needed " "but %s does not support " "NS_MOREFRAG", mtu, na->ifp->if_xname); return EINVAL; } else if (nbs < na->rx_buf_maxsize) { nm_prerr("error: using NS_MOREFRAG on " "%s requires netmap buf size " ">= %u", na->ifp->if_xname, na->rx_buf_maxsize); return EINVAL; } else { nm_prinf("info: netmap application on " "%s needs to support " "NS_MOREFRAG " "(MTU=%u,netmap_buf_size=%u)", na->ifp->if_xname, mtu, nbs); } } return 0; } /* * possibly move the interface to netmap mode. * On success it returns a pointer to the netmap_if, otherwise NULL. * This must be called with NMG_LOCK held. * * The following na callbacks are called in the process: * * na->nm_config() [by netmap_update_config] * (get current number and size of rings) * * We have a generic one for linux (netmap_linux_config). * The bwrap has to override this, since it has to forward * the request to the wrapped adapter (netmap_bwrap_config). * * * na->nm_krings_create() * (create and init the krings array) * * One of the following: * * * netmap_hw_krings_create, (hw ports) * creates the standard layout for the krings * and adds the mbq (used for the host rings). * * * netmap_vp_krings_create (VALE ports) * add leases and scratchpads * * * netmap_pipe_krings_create (pipes) * create the krings and rings of both ends and * cross-link them * * * netmap_monitor_krings_create (monitors) * avoid allocating the mbq * * * netmap_bwrap_krings_create (bwraps) * create both the bwrap krings array, * the krings array of the wrapped adapter, and * (if needed) the fake array for the host adapter * * na->nm_register(, 1) * (put the adapter in netmap mode) * * This may be one of the following: * * * netmap_hw_reg (hw ports) * checks that the ifp is still there, then calls * the hardware specific callback; * * * netmap_vp_reg (VALE ports) * If the port is connected to a bridge, * set the NAF_NETMAP_ON flag under the * bridge write lock. * * * netmap_pipe_reg (pipes) * inform the other pipe end that it is no * longer responsible for the lifetime of this * pipe end * * * netmap_monitor_reg (monitors) * intercept the sync callbacks of the monitored * rings * * * netmap_bwrap_reg (bwraps) * cross-link the bwrap and hwna rings, * forward the request to the hwna, override * the hwna notify callback (so that the frames * coming from outside go through the bridge). * * */ int netmap_do_regif(struct netmap_priv_d *priv, struct netmap_adapter *na, uint32_t nr_mode, uint16_t nr_ringid, uint64_t nr_flags) { struct netmap_if *nifp = NULL; int error; NMG_LOCK_ASSERT(); priv->np_na = na; /* store the reference */ error = netmap_mem_finalize(na->nm_mem, na); if (error) goto err; if (na->active_fds == 0) { /* cache the allocator info in the na */ error = netmap_mem_get_lut(na->nm_mem, &na->na_lut); if (error) goto err_drop_mem; nm_prdis("lut %p bufs %u size %u", na->na_lut.lut, na->na_lut.objtotal, na->na_lut.objsize); /* ring configuration may have changed, fetch from the card */ netmap_update_config(na); } /* compute the range of tx and rx rings to monitor */ error = netmap_set_ringid(priv, nr_mode, nr_ringid, nr_flags); if (error) goto err_put_lut; if (na->active_fds == 0) { /* * If this is the first registration of the adapter, * perform sanity checks and create the in-kernel view * of the netmap rings (the netmap krings). */ if (na->ifp && nm_priv_rx_enabled(priv)) { /* This netmap adapter is attached to an ifnet.
*/ unsigned mtu = nm_os_ifnet_mtu(na->ifp); nm_prdis("%s: mtu %d rx_buf_maxsize %d netmap_buf_size %d", na->name, mtu, na->rx_buf_maxsize, NETMAP_BUF_SIZE(na)); if (na->rx_buf_maxsize == 0) { nm_prerr("%s: error: rx_buf_maxsize == 0", na->name); error = EIO; goto err_drop_mem; } error = netmap_buf_size_validate(na, mtu); if (error) goto err_drop_mem; } /* * Depending on the adapter, this may also create * the netmap rings themselves */ error = na->nm_krings_create(na); if (error) goto err_put_lut; } /* now the krings must exist and we can check whether some * previous bind has exclusive ownership on them, and set * nr_pending_mode */ error = netmap_krings_get(priv); if (error) goto err_del_krings; /* create all needed missing netmap rings */ error = netmap_mem_rings_create(na); if (error) goto err_rel_excl; /* in all cases, create a new netmap if */ nifp = netmap_mem_if_new(na, priv); if (nifp == NULL) { error = ENOMEM; goto err_rel_excl; } if (nm_kring_pending(priv)) { /* Some kring is switching mode, tell the adapter to * react on this. */ error = na->nm_register(na, 1); if (error) goto err_del_if; } /* Commit the reference. */ na->active_fds++; /* * advertise that the interface is ready by setting np_nifp. * The barrier is needed because readers (poll, *SYNC and mmap) * check for priv->np_nifp != NULL without locking */ mb(); /* make sure previous writes are visible to all CPUs */ priv->np_nifp = nifp; return 0; err_del_if: netmap_mem_if_delete(na, nifp); err_rel_excl: netmap_krings_put(priv); netmap_mem_rings_delete(na); err_del_krings: if (na->active_fds == 0) na->nm_krings_delete(na); err_put_lut: if (na->active_fds == 0) memset(&na->na_lut, 0, sizeof(na->na_lut)); err_drop_mem: netmap_mem_drop(na); err: priv->np_na = NULL; return error; } /* * update kring and ring at the end of rxsync/txsync. */ static inline void nm_sync_finalize(struct netmap_kring *kring) { /* * Update ring tail to what the kernel knows * After txsync: head/rhead/hwcur might be behind cur/rcur * if no carrier. */ kring->ring->tail = kring->rtail = kring->nr_hwtail; nm_prdis(5, "%s now hwcur %d hwtail %d head %d cur %d tail %d", kring->name, kring->nr_hwcur, kring->nr_hwtail, kring->rhead, kring->rcur, kring->rtail); } /* set ring timestamp */ static inline void ring_timestamp_set(struct netmap_ring *ring) { if (netmap_no_timestamp == 0 || ring->flags & NR_TIMESTAMP) { microtime(&ring->ts); } } static int nmreq_copyin(struct nmreq_header *, int); static int nmreq_copyout(struct nmreq_header *, int); static int nmreq_checkoptions(struct nmreq_header *); /* * ioctl(2) support for the "netmap" device. * * Following a list of accepted commands: * - NIOCCTRL device control API * - NIOCTXSYNC sync TX rings * - NIOCRXSYNC sync RX rings * - SIOCGIFADDR just for convenience * - NIOCGINFO deprecated (legacy API) * - NIOCREGIF deprecated (legacy API) * * Return 0 on success, errno otherwise. 
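 *
 * A minimal userspace sketch of a NIOCCTRL register request
 * (illustrative only; error checking is omitted and "em0" is just a
 * placeholder port name):
 *
 *	struct nmreq_header hdr;
 *	struct nmreq_register req;
 *
 *	memset(&hdr, 0, sizeof(hdr));
 *	memset(&req, 0, sizeof(req));
 *	hdr.nr_version = NETMAP_API;
 *	hdr.nr_reqtype = NETMAP_REQ_REGISTER;
 *	strlcpy(hdr.nr_name, "em0", sizeof(hdr.nr_name));
 *	hdr.nr_body = (uintptr_t)&req;
 *	req.nr_mode = NR_REG_ALL_NIC;
 *	error = ioctl(fd, NIOCCTRL, &hdr);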
*/ int netmap_ioctl(struct netmap_priv_d *priv, u_long cmd, caddr_t data, struct thread *td, int nr_body_is_user) { struct mbq q; /* packets from RX hw queues to host stack */ struct netmap_adapter *na = NULL; struct netmap_mem_d *nmd = NULL; struct ifnet *ifp = NULL; int error = 0; u_int i, qfirst, qlast; struct netmap_kring **krings; int sync_flags; enum txrx t; switch (cmd) { case NIOCCTRL: { struct nmreq_header *hdr = (struct nmreq_header *)data; if (hdr->nr_version < NETMAP_MIN_API || hdr->nr_version > NETMAP_MAX_API) { nm_prerr("API mismatch: got %d need %d", hdr->nr_version, NETMAP_API); return EINVAL; } /* Make a kernel-space copy of the user-space nr_body. * For convenience, the nr_body pointer and the pointers * in the options list will be replaced with their * kernel-space counterparts. The original pointers are * saved internally and later restored by nmreq_copyout */ error = nmreq_copyin(hdr, nr_body_is_user); if (error) { return error; } /* Sanitize hdr->nr_name. */ hdr->nr_name[sizeof(hdr->nr_name) - 1] = '\0'; switch (hdr->nr_reqtype) { case NETMAP_REQ_REGISTER: { struct nmreq_register *req = (struct nmreq_register *)(uintptr_t)hdr->nr_body; struct netmap_if *nifp; /* Protect access to priv from concurrent requests. */ NMG_LOCK(); do { struct nmreq_option *opt; u_int memflags; if (priv->np_nifp != NULL) { /* thread already registered */ error = EBUSY; break; } #ifdef WITH_EXTMEM opt = nmreq_getoption(hdr, NETMAP_REQ_OPT_EXTMEM); if (opt != NULL) { struct nmreq_opt_extmem *e = (struct nmreq_opt_extmem *)opt; nmd = netmap_mem_ext_create(e->nro_usrptr, &e->nro_info, &error); opt->nro_status = error; if (nmd == NULL) break; } #endif /* WITH_EXTMEM */ if (nmd == NULL && req->nr_mem_id) { /* find the allocator and get a reference */ nmd = netmap_mem_find(req->nr_mem_id); if (nmd == NULL) { if (netmap_verbose) { nm_prerr("%s: failed to find mem_id %u", hdr->nr_name, req->nr_mem_id); } error = EINVAL; break; } } /* find the interface and a reference */ error = netmap_get_na(hdr, &na, &ifp, nmd, 1 /* create */); /* keep reference */ if (error) break; if (NETMAP_OWNED_BY_KERN(na)) { error = EBUSY; break; } if (na->virt_hdr_len && !(req->nr_flags & NR_ACCEPT_VNET_HDR)) { nm_prerr("virt_hdr_len=%d, but application does " "not accept it", na->virt_hdr_len); error = EIO; break; } error = netmap_do_regif(priv, na, req->nr_mode, req->nr_ringid, req->nr_flags); if (error) { /* reg. failed, release priv and ref */ break; } opt = nmreq_getoption(hdr, NETMAP_REQ_OPT_CSB); if (opt != NULL) { struct nmreq_opt_csb *csbo = (struct nmreq_opt_csb *)opt; error = netmap_csb_validate(priv, csbo); opt->nro_status = error; if (error) { netmap_do_unregif(priv); break; } } nifp = priv->np_nifp; /* return the offset of the netmap_if object */ req->nr_rx_rings = na->num_rx_rings; req->nr_tx_rings = na->num_tx_rings; req->nr_rx_slots = na->num_rx_desc; req->nr_tx_slots = na->num_tx_desc; req->nr_host_tx_rings = na->num_host_tx_rings; req->nr_host_rx_rings = na->num_host_rx_rings; error = netmap_mem_get_info(na->nm_mem, &req->nr_memsize, &memflags, &req->nr_mem_id); if (error) { netmap_do_unregif(priv); break; } if (memflags & NETMAP_MEM_PRIVATE) { *(uint32_t *)(uintptr_t)&nifp->ni_flags |= NI_PRIV_MEM; } for_rx_tx(t) { priv->np_si[t] = nm_si_user(priv, t) ?
&na->si[t] : &NMR(na, t)[priv->np_qfirst[t]]->si; } if (req->nr_extra_bufs) { if (netmap_verbose) nm_prinf("requested %d extra buffers", req->nr_extra_bufs); req->nr_extra_bufs = netmap_extra_alloc(na, &nifp->ni_bufs_head, req->nr_extra_bufs); if (netmap_verbose) nm_prinf("got %d extra buffers", req->nr_extra_bufs); } req->nr_offset = netmap_mem_if_offset(na->nm_mem, nifp); error = nmreq_checkoptions(hdr); if (error) { netmap_do_unregif(priv); break; } /* store ifp reference so that priv destructor may release it */ priv->np_ifp = ifp; } while (0); if (error) { netmap_unget_na(na, ifp); } /* release the reference from netmap_mem_find() or * netmap_mem_ext_create() */ if (nmd) netmap_mem_put(nmd); NMG_UNLOCK(); break; } case NETMAP_REQ_PORT_INFO_GET: { struct nmreq_port_info_get *req = (struct nmreq_port_info_get *)(uintptr_t)hdr->nr_body; NMG_LOCK(); do { u_int memflags; if (hdr->nr_name[0] != '\0') { /* Build a nmreq_register out of the nmreq_port_info_get, * so that we can call netmap_get_na(). */ struct nmreq_register regreq; bzero(®req, sizeof(regreq)); regreq.nr_mode = NR_REG_ALL_NIC; regreq.nr_tx_slots = req->nr_tx_slots; regreq.nr_rx_slots = req->nr_rx_slots; regreq.nr_tx_rings = req->nr_tx_rings; regreq.nr_rx_rings = req->nr_rx_rings; regreq.nr_host_tx_rings = req->nr_host_tx_rings; regreq.nr_host_rx_rings = req->nr_host_rx_rings; regreq.nr_mem_id = req->nr_mem_id; /* get a refcount */ hdr->nr_reqtype = NETMAP_REQ_REGISTER; hdr->nr_body = (uintptr_t)®req; error = netmap_get_na(hdr, &na, &ifp, NULL, 1 /* create */); hdr->nr_reqtype = NETMAP_REQ_PORT_INFO_GET; /* reset type */ hdr->nr_body = (uintptr_t)req; /* reset nr_body */ if (error) { na = NULL; ifp = NULL; break; } nmd = na->nm_mem; /* get memory allocator */ } else { nmd = netmap_mem_find(req->nr_mem_id ? req->nr_mem_id : 1); if (nmd == NULL) { if (netmap_verbose) nm_prerr("%s: failed to find mem_id %u", hdr->nr_name, req->nr_mem_id ? req->nr_mem_id : 1); error = EINVAL; break; } } error = netmap_mem_get_info(nmd, &req->nr_memsize, &memflags, &req->nr_mem_id); if (error) break; if (na == NULL) /* only memory info */ break; netmap_update_config(na); req->nr_rx_rings = na->num_rx_rings; req->nr_tx_rings = na->num_tx_rings; req->nr_rx_slots = na->num_rx_desc; req->nr_tx_slots = na->num_tx_desc; req->nr_host_tx_rings = na->num_host_tx_rings; req->nr_host_rx_rings = na->num_host_rx_rings; } while (0); netmap_unget_na(na, ifp); NMG_UNLOCK(); break; } #ifdef WITH_VALE case NETMAP_REQ_VALE_ATTACH: { error = netmap_vale_attach(hdr, NULL /* userspace request */); break; } case NETMAP_REQ_VALE_DETACH: { error = netmap_vale_detach(hdr, NULL /* userspace request */); break; } case NETMAP_REQ_VALE_LIST: { error = netmap_vale_list(hdr); break; } case NETMAP_REQ_PORT_HDR_SET: { struct nmreq_port_hdr *req = (struct nmreq_port_hdr *)(uintptr_t)hdr->nr_body; /* Build a nmreq_register out of the nmreq_port_hdr, * so that we can call netmap_get_bdg_na(). */ struct nmreq_register regreq; bzero(®req, sizeof(regreq)); regreq.nr_mode = NR_REG_ALL_NIC; /* For now we only support virtio-net headers, and only for * VALE ports, but this may change in future. Valid lengths * for the virtio-net header are 0 (no header), 10 and 12. 
*/ if (req->nr_hdr_len != 0 && req->nr_hdr_len != sizeof(struct nm_vnet_hdr) && req->nr_hdr_len != 12) { if (netmap_verbose) nm_prerr("invalid hdr_len %u", req->nr_hdr_len); error = EINVAL; break; } NMG_LOCK(); hdr->nr_reqtype = NETMAP_REQ_REGISTER; hdr->nr_body = (uintptr_t)®req; error = netmap_get_vale_na(hdr, &na, NULL, 0); hdr->nr_reqtype = NETMAP_REQ_PORT_HDR_SET; hdr->nr_body = (uintptr_t)req; if (na && !error) { struct netmap_vp_adapter *vpna = (struct netmap_vp_adapter *)na; na->virt_hdr_len = req->nr_hdr_len; if (na->virt_hdr_len) { vpna->mfs = NETMAP_BUF_SIZE(na); } if (netmap_verbose) nm_prinf("Using vnet_hdr_len %d for %p", na->virt_hdr_len, na); netmap_adapter_put(na); } else if (!na) { error = ENXIO; } NMG_UNLOCK(); break; } case NETMAP_REQ_PORT_HDR_GET: { /* Get vnet-header length for this netmap port */ struct nmreq_port_hdr *req = (struct nmreq_port_hdr *)(uintptr_t)hdr->nr_body; /* Build a nmreq_register out of the nmreq_port_hdr, * so that we can call netmap_get_bdg_na(). */ struct nmreq_register regreq; struct ifnet *ifp; bzero(®req, sizeof(regreq)); regreq.nr_mode = NR_REG_ALL_NIC; NMG_LOCK(); hdr->nr_reqtype = NETMAP_REQ_REGISTER; hdr->nr_body = (uintptr_t)®req; error = netmap_get_na(hdr, &na, &ifp, NULL, 0); hdr->nr_reqtype = NETMAP_REQ_PORT_HDR_GET; hdr->nr_body = (uintptr_t)req; if (na && !error) { req->nr_hdr_len = na->virt_hdr_len; } netmap_unget_na(na, ifp); NMG_UNLOCK(); break; } case NETMAP_REQ_VALE_NEWIF: { error = nm_vi_create(hdr); break; } case NETMAP_REQ_VALE_DELIF: { error = nm_vi_destroy(hdr->nr_name); break; } case NETMAP_REQ_VALE_POLLING_ENABLE: case NETMAP_REQ_VALE_POLLING_DISABLE: { error = nm_bdg_polling(hdr); break; } #endif /* WITH_VALE */ case NETMAP_REQ_POOLS_INFO_GET: { /* Get information from the memory allocator used for * hdr->nr_name. */ struct nmreq_pools_info *req = (struct nmreq_pools_info *)(uintptr_t)hdr->nr_body; NMG_LOCK(); do { /* Build a nmreq_register out of the nmreq_pools_info, * so that we can call netmap_get_na(). */ struct nmreq_register regreq; bzero(®req, sizeof(regreq)); regreq.nr_mem_id = req->nr_mem_id; regreq.nr_mode = NR_REG_ALL_NIC; hdr->nr_reqtype = NETMAP_REQ_REGISTER; hdr->nr_body = (uintptr_t)®req; error = netmap_get_na(hdr, &na, &ifp, NULL, 1 /* create */); hdr->nr_reqtype = NETMAP_REQ_POOLS_INFO_GET; /* reset type */ hdr->nr_body = (uintptr_t)req; /* reset nr_body */ if (error) { na = NULL; ifp = NULL; break; } nmd = na->nm_mem; /* grab the memory allocator */ if (nmd == NULL) { error = EINVAL; break; } /* Finalize the memory allocator, get the pools * information and release the allocator. */ error = netmap_mem_finalize(nmd, na); if (error) { break; } error = netmap_mem_pools_info_get(req, nmd); netmap_mem_drop(na); } while (0); netmap_unget_na(na, ifp); NMG_UNLOCK(); break; } case NETMAP_REQ_CSB_ENABLE: { struct nmreq_option *opt; opt = nmreq_getoption(hdr, NETMAP_REQ_OPT_CSB); if (opt == NULL) { error = EINVAL; } else { struct nmreq_opt_csb *csbo = (struct nmreq_opt_csb *)opt; NMG_LOCK(); error = netmap_csb_validate(priv, csbo); NMG_UNLOCK(); opt->nro_status = error; } break; } case NETMAP_REQ_SYNC_KLOOP_START: { error = netmap_sync_kloop(priv, hdr); break; } case NETMAP_REQ_SYNC_KLOOP_STOP: { error = netmap_sync_kloop_stop(priv); break; } default: { error = EINVAL; break; } } /* Write back request body to userspace and reset the * user-space pointer. 
*/ error = nmreq_copyout(hdr, error); break; } case NIOCTXSYNC: case NIOCRXSYNC: { if (unlikely(priv->np_nifp == NULL)) { error = ENXIO; break; } mb(); /* make sure following reads are not from cache */ if (unlikely(priv->np_csb_atok_base)) { nm_prerr("Invalid sync in CSB mode"); error = EBUSY; break; } na = priv->np_na; /* we have a reference */ mbq_init(&q); t = (cmd == NIOCTXSYNC ? NR_TX : NR_RX); krings = NMR(na, t); qfirst = priv->np_qfirst[t]; qlast = priv->np_qlast[t]; sync_flags = priv->np_sync_flags; for (i = qfirst; i < qlast; i++) { struct netmap_kring *kring = krings[i]; struct netmap_ring *ring = kring->ring; if (unlikely(nm_kr_tryget(kring, 1, &error))) { error = (error ? EIO : 0); continue; } if (cmd == NIOCTXSYNC) { if (netmap_debug & NM_DEBUG_TXSYNC) nm_prinf("pre txsync ring %d cur %d hwcur %d", i, ring->cur, kring->nr_hwcur); if (nm_txsync_prologue(kring, ring) >= kring->nkr_num_slots) { netmap_ring_reinit(kring); } else if (kring->nm_sync(kring, sync_flags | NAF_FORCE_RECLAIM) == 0) { nm_sync_finalize(kring); } if (netmap_debug & NM_DEBUG_TXSYNC) nm_prinf("post txsync ring %d cur %d hwcur %d", i, ring->cur, kring->nr_hwcur); } else { if (nm_rxsync_prologue(kring, ring) >= kring->nkr_num_slots) { netmap_ring_reinit(kring); } if (nm_may_forward_up(kring)) { /* transparent forwarding, see netmap_poll() */ netmap_grab_packets(kring, &q, netmap_fwd); } if (kring->nm_sync(kring, sync_flags | NAF_FORCE_READ) == 0) { nm_sync_finalize(kring); } ring_timestamp_set(ring); } nm_kr_put(kring); } if (mbq_peek(&q)) { netmap_send_up(na->ifp, &q); } break; } default: { return netmap_ioctl_legacy(priv, cmd, data, td); break; } } return (error); } size_t nmreq_size_by_type(uint16_t nr_reqtype) { switch (nr_reqtype) { case NETMAP_REQ_REGISTER: return sizeof(struct nmreq_register); case NETMAP_REQ_PORT_INFO_GET: return sizeof(struct nmreq_port_info_get); case NETMAP_REQ_VALE_ATTACH: return sizeof(struct nmreq_vale_attach); case NETMAP_REQ_VALE_DETACH: return sizeof(struct nmreq_vale_detach); case NETMAP_REQ_VALE_LIST: return sizeof(struct nmreq_vale_list); case NETMAP_REQ_PORT_HDR_SET: case NETMAP_REQ_PORT_HDR_GET: return sizeof(struct nmreq_port_hdr); case NETMAP_REQ_VALE_NEWIF: return sizeof(struct nmreq_vale_newif); case NETMAP_REQ_VALE_DELIF: case NETMAP_REQ_SYNC_KLOOP_STOP: case NETMAP_REQ_CSB_ENABLE: return 0; case NETMAP_REQ_VALE_POLLING_ENABLE: case NETMAP_REQ_VALE_POLLING_DISABLE: return sizeof(struct nmreq_vale_polling); case NETMAP_REQ_POOLS_INFO_GET: return sizeof(struct nmreq_pools_info); case NETMAP_REQ_SYNC_KLOOP_START: return sizeof(struct nmreq_sync_kloop_start); } return 0; } static size_t nmreq_opt_size_by_type(uint32_t nro_reqtype, uint64_t nro_size) { size_t rv = sizeof(struct nmreq_option); #ifdef NETMAP_REQ_OPT_DEBUG if (nro_reqtype & NETMAP_REQ_OPT_DEBUG) return (nro_reqtype & ~NETMAP_REQ_OPT_DEBUG); #endif /* NETMAP_REQ_OPT_DEBUG */ switch (nro_reqtype) { #ifdef WITH_EXTMEM case NETMAP_REQ_OPT_EXTMEM: rv = sizeof(struct nmreq_opt_extmem); break; #endif /* WITH_EXTMEM */ case NETMAP_REQ_OPT_SYNC_KLOOP_EVENTFDS: if (nro_size >= rv) rv = nro_size; break; case NETMAP_REQ_OPT_CSB: rv = sizeof(struct nmreq_opt_csb); break; case NETMAP_REQ_OPT_SYNC_KLOOP_MODE: rv = sizeof(struct nmreq_opt_sync_kloop_mode); break; } /* subtract the common header */ return rv - sizeof(struct nmreq_option); } /* * nmreq_copyin: create an in-kernel version of the request. 
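 * For instance (illustrative), a registration carrying a single
 * NETMAP_REQ_OPT_CSB option arrives as a two-element user-space
 * chain (the request body plus one option); both elements are copied
 * into a single kernel buffer and relinked as described below.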
* * We build the following data structure: * * hdr -> +-------+ buf * | | +---------------+ * +-------+ |usr body ptr | * |options|-. +---------------+ * +-------+ | |usr options ptr| * |body |--------->+---------------+ * +-------+ | | | * | | copy of body | * | | | * | +---------------+ * | | NULL | * | +---------------+ * | .---| |\ * | | +---------------+ | * | .------| | | * | | | +---------------+ \ option table * | | | | ... | / indexed by option * | | | +---------------+ | type * | | | | | | * | | | +---------------+/ * | | | |usr next ptr 1 | * `-|----->+---------------+ * | | | copy of opt 1 | * | | | | * | | .-| nro_next | * | | | +---------------+ * | | | |usr next ptr 2 | * | `-`>+---------------+ * | | copy of opt 2 | * | | | * | .-| nro_next | * | | +---------------+ * | | | | * ~ ~ ~ ... ~ * | .-| | * `----->+---------------+ * | |usr next ptr n | * `>+---------------+ * | copy of opt n | * | | * | nro_next(NULL)| * +---------------+ * * The options and body fields of the hdr structure are overwritten * with in-kernel valid pointers inside the buf. The original user * pointers are saved in the buf and restored on copyout. * The list of options is copied and the pointers adjusted. The * original pointers are saved right before the option they belong to. * * The option table has an entry for every available option. Entries * for options that have not been passed contain NULL. * */ int nmreq_copyin(struct nmreq_header *hdr, int nr_body_is_user) { size_t rqsz, optsz, bufsz; int error = 0; char *ker = NULL, *p; struct nmreq_option **next, *src, **opt_tab; struct nmreq_option buf; uint64_t *ptrs; if (hdr->nr_reserved) { if (netmap_verbose) nm_prerr("nr_reserved must be zero"); return EINVAL; } if (!nr_body_is_user) return 0; hdr->nr_reserved = nr_body_is_user; /* compute the total size of the buffer */ rqsz = nmreq_size_by_type(hdr->nr_reqtype); if (rqsz > NETMAP_REQ_MAXSIZE) { error = EMSGSIZE; goto out_err; } if ((rqsz && hdr->nr_body == (uintptr_t)NULL) || (!rqsz && hdr->nr_body != (uintptr_t)NULL)) { /* Request body expected, but not found; or * request body found but unexpected. */ if (netmap_verbose) nm_prerr("nr_body expected but not found, or vice versa"); error = EINVAL; goto out_err; } bufsz = 2 * sizeof(void *) + rqsz + NETMAP_REQ_OPT_MAX * sizeof(opt_tab); /* compute the size of the buf below the option table. * It must contain a copy of every received option structure. * For every option we also need to store a copy of the user * list pointer.
*/ optsz = 0; for (src = (struct nmreq_option *)(uintptr_t)hdr->nr_options; src; src = (struct nmreq_option *)(uintptr_t)buf.nro_next) { error = copyin(src, &buf, sizeof(*src)); if (error) goto out_err; optsz += sizeof(*src); optsz += nmreq_opt_size_by_type(buf.nro_reqtype, buf.nro_size); if (rqsz + optsz > NETMAP_REQ_MAXSIZE) { error = EMSGSIZE; goto out_err; } bufsz += sizeof(void *); } bufsz += optsz; ker = nm_os_malloc(bufsz); if (ker == NULL) { error = ENOMEM; goto out_err; } p = ker; /* write pointer into the buffer */ /* make a copy of the user pointers */ ptrs = (uint64_t*)p; *ptrs++ = hdr->nr_body; *ptrs++ = hdr->nr_options; p = (char *)ptrs; /* copy the body */ error = copyin((void *)(uintptr_t)hdr->nr_body, p, rqsz); if (error) goto out_restore; /* overwrite the user pointer with the in-kernel one */ hdr->nr_body = (uintptr_t)p; p += rqsz; /* start of the options table */ opt_tab = (struct nmreq_option **)p; p += sizeof(opt_tab) * NETMAP_REQ_OPT_MAX; /* copy the options */ next = (struct nmreq_option **)&hdr->nr_options; src = *next; while (src) { struct nmreq_option *opt; /* copy the option header */ ptrs = (uint64_t *)p; opt = (struct nmreq_option *)(ptrs + 1); error = copyin(src, opt, sizeof(*src)); if (error) goto out_restore; /* make a copy of the user next pointer */ *ptrs = opt->nro_next; /* overwrite the user pointer with the in-kernel one */ *next = opt; /* initialize the option as not supported. * Recognized options will update this field. */ opt->nro_status = EOPNOTSUPP; /* check for invalid types */ if (opt->nro_reqtype < 1) { if (netmap_verbose) nm_prinf("invalid option type: %u", opt->nro_reqtype); opt->nro_status = EINVAL; error = EINVAL; goto next; } if (opt->nro_reqtype >= NETMAP_REQ_OPT_MAX) { /* opt->nro_status is already EOPNOTSUPP */ error = EOPNOTSUPP; goto next; } /* if the type is valid, index the option in the table * unless it is a duplicate. 
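		 * Example (illustrative): if a request chains two
		 * NETMAP_REQ_OPT_CSB options, the second one is caught
		 * here and both instances are flagged with EINVAL.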
*/ if (opt_tab[opt->nro_reqtype] != NULL) { if (netmap_verbose) nm_prinf("duplicate option: %u", opt->nro_reqtype); opt->nro_status = EINVAL; opt_tab[opt->nro_reqtype]->nro_status = EINVAL; error = EINVAL; goto next; } opt_tab[opt->nro_reqtype] = opt; p = (char *)(opt + 1); /* copy the option body */ optsz = nmreq_opt_size_by_type(opt->nro_reqtype, opt->nro_size); if (optsz) { /* the option body follows the option header */ error = copyin(src + 1, p, optsz); if (error) goto out_restore; p += optsz; } next: /* move to next option */ next = (struct nmreq_option **)&opt->nro_next; src = *next; } if (error) nmreq_copyout(hdr, error); return error; out_restore: ptrs = (uint64_t *)ker; hdr->nr_body = *ptrs++; hdr->nr_options = *ptrs++; hdr->nr_reserved = 0; nm_os_free(ker); out_err: return error; } static int nmreq_copyout(struct nmreq_header *hdr, int rerror) { struct nmreq_option *src, *dst; void *ker = (void *)(uintptr_t)hdr->nr_body, *bufstart; uint64_t *ptrs; size_t bodysz; int error; if (!hdr->nr_reserved) return rerror; /* restore the user pointers in the header */ ptrs = (uint64_t *)ker - 2; bufstart = ptrs; hdr->nr_body = *ptrs++; src = (struct nmreq_option *)(uintptr_t)hdr->nr_options; hdr->nr_options = *ptrs; if (!rerror) { /* copy the body */ bodysz = nmreq_size_by_type(hdr->nr_reqtype); error = copyout(ker, (void *)(uintptr_t)hdr->nr_body, bodysz); if (error) { rerror = error; goto out; } } /* copy the options */ dst = (struct nmreq_option *)(uintptr_t)hdr->nr_options; while (src) { size_t optsz; uint64_t next; /* restore the user pointer */ next = src->nro_next; ptrs = (uint64_t *)src - 1; src->nro_next = *ptrs; /* always copy the option header */ error = copyout(src, dst, sizeof(*src)); if (error) { rerror = error; goto out; } /* copy the option body only if there was no error */ if (!rerror && !src->nro_status) { optsz = nmreq_opt_size_by_type(src->nro_reqtype, src->nro_size); if (optsz) { error = copyout(src + 1, dst + 1, optsz); if (error) { rerror = error; goto out; } } } src = (struct nmreq_option *)(uintptr_t)next; dst = (struct nmreq_option *)(uintptr_t)*ptrs; } out: hdr->nr_reserved = 0; nm_os_free(bufstart); return rerror; } struct nmreq_option * nmreq_getoption(struct nmreq_header *hdr, uint16_t reqtype) { struct nmreq_option **opt_tab; if (!hdr->nr_options) return NULL; - opt_tab = (struct nmreq_option **)(hdr->nr_options) - (NETMAP_REQ_OPT_MAX + 1); + opt_tab = (struct nmreq_option **)((uintptr_t)hdr->nr_options) - + (NETMAP_REQ_OPT_MAX + 1); return opt_tab[reqtype]; } static int nmreq_checkoptions(struct nmreq_header *hdr) { struct nmreq_option *opt; /* return error if there is still any option * marked as not supported */ for (opt = (struct nmreq_option *)(uintptr_t)hdr->nr_options; opt; opt = (struct nmreq_option *)(uintptr_t)opt->nro_next) if (opt->nro_status == EOPNOTSUPP) return EOPNOTSUPP; return 0; } /* * select(2) and poll(2) handlers for the "netmap" device. * * Can be called for one or more queues. * Return the event mask corresponding to ready events. * If there are no ready events (and 'sr' is not NULL), do a * selrecord on either the individual selinfo or on the global one. * Device-dependent parts (locking and sync of tx/rx rings) * are done through callbacks. * * On linux, arguments are really pwait, the poll table, and 'td' is struct file * * The first one is remapped to pwait as selrecord() uses the name as a * hidden argument.
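 *
 * A minimal userspace sketch (illustrative; fd is a /dev/netmap
 * descriptor already registered to a port):
 *
 *	struct pollfd pfd = { .fd = fd, .events = POLLIN };
 *
 *	if (poll(&pfd, 1, 1000) > 0 && (pfd.revents & POLLIN)) {
 *		... scan the RX rings bound to fd ...
 *	}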
*/ int netmap_poll(struct netmap_priv_d *priv, int events, NM_SELRECORD_T *sr) { struct netmap_adapter *na; struct netmap_kring *kring; struct netmap_ring *ring; u_int i, want[NR_TXRX], revents = 0; NM_SELINFO_T *si[NR_TXRX]; #define want_tx want[NR_TX] #define want_rx want[NR_RX] struct mbq q; /* packets from RX hw queues to host stack */ /* * In order to avoid nested locks, we need to "double check" * txsync and rxsync if we decide to do a selrecord(). * retry_tx (and retry_rx, later) prevent looping forever. */ int retry_tx = 1, retry_rx = 1; /* Transparent mode: send_down is 1 if we have found some * packets to forward (host RX ring --> NIC) during the rx * scan and we have not sent them down to the NIC yet. * Transparent mode requires binding all rings to a single * file descriptor. */ int send_down = 0; int sync_flags = priv->np_sync_flags; mbq_init(&q); if (unlikely(priv->np_nifp == NULL)) { return POLLERR; } mb(); /* make sure following reads are not from cache */ na = priv->np_na; if (unlikely(!nm_netmap_on(na))) return POLLERR; if (unlikely(priv->np_csb_atok_base)) { nm_prerr("Invalid poll in CSB mode"); return POLLERR; } if (netmap_debug & NM_DEBUG_ON) nm_prinf("device %s events 0x%x", na->name, events); want_tx = events & (POLLOUT | POLLWRNORM); want_rx = events & (POLLIN | POLLRDNORM); /* * If the card has more than one queue AND the file descriptor is * bound to all of them, we sleep on the "global" selinfo, otherwise * we sleep on individual selinfo (FreeBSD only allows two selinfo's * per file descriptor). * The interrupt routine in the driver wakes one or the other * (or both) depending on which clients are active. * * rxsync() is only called if we run out of buffers on a POLLIN. * txsync() is called if we run out of buffers on POLLOUT, or * there are pending packets to send. The latter can be disabled * passing NETMAP_NO_TX_POLL in the NIOCREGIF call. */ si[NR_RX] = priv->np_si[NR_RX]; si[NR_TX] = priv->np_si[NR_TX]; #ifdef __FreeBSD__ /* * We start with a lock free round which is cheap if we have * slots available. If this fails, then lock and call the sync * routines. We can't do this on Linux, as the contract says * that we must call nm_os_selrecord() unconditionally. */ if (want_tx) { const enum txrx t = NR_TX; for (i = priv->np_qfirst[t]; i < priv->np_qlast[t]; i++) { kring = NMR(na, t)[i]; if (kring->ring->cur != kring->ring->tail) { /* Some unseen TX space is available, so * we don't need to run txsync. */ revents |= want[t]; want[t] = 0; break; } } } if (want_rx) { const enum txrx t = NR_RX; int rxsync_needed = 0; for (i = priv->np_qfirst[t]; i < priv->np_qlast[t]; i++) { kring = NMR(na, t)[i]; if (kring->ring->cur == kring->ring->tail || kring->rhead != kring->ring->head) { /* There are no unseen packets on this ring, * or there are some buffers to be returned * to the netmap port. We therefore go ahead * and run rxsync. */ rxsync_needed = 1; break; } } if (!rxsync_needed) { revents |= want_rx; want_rx = 0; } } #endif #ifdef linux /* The selrecord must be unconditional on linux. */ nm_os_selrecord(sr, si[NR_RX]); nm_os_selrecord(sr, si[NR_TX]); #endif /* linux */ /* * If we want to push packets out (priv->np_txpoll) or * want_tx is still set, we must issue txsync calls * (on all rings, so that the tx rings do not stall). * Fortunately, normal tx mode has np_txpoll set. */ if (priv->np_txpoll || want_tx) { /* * The first round checks if anyone is ready, if not * do a selrecord and another round to handle races.
* want_tx goes to 0 if any space is found, and is * used to skip rings with no pending transmissions. */ flush_tx: for (i = priv->np_qfirst[NR_TX]; i < priv->np_qlast[NR_TX]; i++) { int found = 0; kring = na->tx_rings[i]; ring = kring->ring; /* * Don't try to txsync this TX ring if we already found some * space in some of the TX rings (want_tx == 0) and there are no * TX slots in this ring that need to be flushed to the NIC * (head == hwcur). */ if (!send_down && !want_tx && ring->head == kring->nr_hwcur) continue; if (nm_kr_tryget(kring, 1, &revents)) continue; if (nm_txsync_prologue(kring, ring) >= kring->nkr_num_slots) { netmap_ring_reinit(kring); revents |= POLLERR; } else { if (kring->nm_sync(kring, sync_flags)) revents |= POLLERR; else nm_sync_finalize(kring); } /* * If we found new slots, notify potential * listeners on the same ring. * Since we just did a txsync, look at the copies * of cur,tail in the kring. */ found = kring->rcur != kring->rtail; nm_kr_put(kring); if (found) { /* notify other listeners */ revents |= want_tx; want_tx = 0; #ifndef linux kring->nm_notify(kring, 0); #endif /* linux */ } } /* if there were any packets to forward, we must have handled them by now */ send_down = 0; if (want_tx && retry_tx && sr) { #ifndef linux nm_os_selrecord(sr, si[NR_TX]); #endif /* !linux */ retry_tx = 0; goto flush_tx; } } /* * If want_rx is still set scan receive rings. * Do it on all rings because otherwise we starve. */ if (want_rx) { /* two rounds here for race avoidance */ do_retry_rx: for (i = priv->np_qfirst[NR_RX]; i < priv->np_qlast[NR_RX]; i++) { int found = 0; kring = na->rx_rings[i]; ring = kring->ring; if (unlikely(nm_kr_tryget(kring, 1, &revents))) continue; if (nm_rxsync_prologue(kring, ring) >= kring->nkr_num_slots) { netmap_ring_reinit(kring); revents |= POLLERR; } /* now we can use kring->rcur, rtail */ /* * transparent mode support: collect packets from * hw rxring(s) that have been released by the user */ if (nm_may_forward_up(kring)) { netmap_grab_packets(kring, &q, netmap_fwd); } /* Clear the NR_FORWARD flag anyway, it may be set by * the nm_sync() below only for the host RX ring (see * netmap_rxsync_from_host()). */ kring->nr_kflags &= ~NR_FORWARD; if (kring->nm_sync(kring, sync_flags)) revents |= POLLERR; else nm_sync_finalize(kring); send_down |= (kring->nr_kflags & NR_FORWARD); ring_timestamp_set(ring); found = kring->rcur != kring->rtail; nm_kr_put(kring); if (found) { revents |= want_rx; retry_rx = 0; #ifndef linux kring->nm_notify(kring, 0); #endif /* linux */ } } #ifndef linux if (retry_rx && sr) { nm_os_selrecord(sr, si[NR_RX]); } #endif /* !linux */ if (send_down || retry_rx) { retry_rx = 0; if (send_down) goto flush_tx; /* and retry_rx */ else goto do_retry_rx; } } /* * Transparent mode: released bufs (i.e. between kring->nr_hwcur and * ring->head) marked with NS_FORWARD on hw rx rings are passed up * to the host stack. */ if (mbq_peek(&q)) { netmap_send_up(na->ifp, &q); } return (revents); #undef want_tx #undef want_rx } int nma_intr_enable(struct netmap_adapter *na, int onoff) { bool changed = false; enum txrx t; int i; for_rx_tx(t) { for (i = 0; i < nma_get_nrings(na, t); i++) { struct netmap_kring *kring = NMR(na, t)[i]; int on = !(kring->nr_kflags & NKR_NOINTR); if (!!onoff != !!on) { changed = true; } if (onoff) { kring->nr_kflags &= ~NKR_NOINTR; } else { kring->nr_kflags |= NKR_NOINTR; } } } if (!changed) { return 0; /* nothing to do */ } if (!na->nm_intr) { nm_prerr("Cannot %s interrupts for %s", onoff ?
"enable" : "disable", na->name); return -1; } na->nm_intr(na, onoff); return 0; } /*-------------------- driver support routines -------------------*/ /* default notify callback */ static int netmap_notify(struct netmap_kring *kring, int flags) { struct netmap_adapter *na = kring->notify_na; enum txrx t = kring->tx; nm_os_selwakeup(&kring->si); /* optimization: avoid a wake up on the global * queue if nobody has registered for more * than one ring */ if (na->si_users[t] > 0) nm_os_selwakeup(&na->si[t]); return NM_IRQ_COMPLETED; } /* called by all routines that create netmap_adapters. * provide some defaults and get a reference to the * memory allocator */ int netmap_attach_common(struct netmap_adapter *na) { if (!na->rx_buf_maxsize) { /* Set a conservative default (larger is safer). */ na->rx_buf_maxsize = PAGE_SIZE; } #ifdef __FreeBSD__ if (na->na_flags & NAF_HOST_RINGS && na->ifp) { na->if_input = na->ifp->if_input; /* for netmap_send_up */ } na->pdev = na; /* make sure netmap_mem_map() is called */ #endif /* __FreeBSD__ */ if (na->na_flags & NAF_HOST_RINGS) { if (na->num_host_rx_rings == 0) na->num_host_rx_rings = 1; if (na->num_host_tx_rings == 0) na->num_host_tx_rings = 1; } if (na->nm_krings_create == NULL) { /* we assume that we have been called by a driver, * since other port types all provide their own * nm_krings_create */ na->nm_krings_create = netmap_hw_krings_create; na->nm_krings_delete = netmap_hw_krings_delete; } if (na->nm_notify == NULL) na->nm_notify = netmap_notify; na->active_fds = 0; if (na->nm_mem == NULL) { /* use the global allocator */ na->nm_mem = netmap_mem_get(&nm_mem); } #ifdef WITH_VALE if (na->nm_bdg_attach == NULL) /* no special nm_bdg_attach callback. On VALE * attach, we need to interpose a bwrap */ na->nm_bdg_attach = netmap_default_bdg_attach; #endif return 0; } /* Wrapper for the register callback provided netmap-enabled * hardware drivers. * nm_iszombie(na) means that the driver module has been * unloaded, so we cannot call into it. * nm_os_ifnet_lock() must guarantee mutual exclusion with * module unloading. */ static int netmap_hw_reg(struct netmap_adapter *na, int onoff) { struct netmap_hw_adapter *hwna = (struct netmap_hw_adapter*)na; int error = 0; nm_os_ifnet_lock(); if (nm_iszombie(na)) { if (onoff) { error = ENXIO; } else if (na != NULL) { na->na_flags &= ~NAF_NETMAP_ON; } goto out; } error = hwna->nm_hw_register(na, onoff); out: nm_os_ifnet_unlock(); return error; } static void netmap_hw_dtor(struct netmap_adapter *na) { if (na->ifp == NULL) return; NM_DETACH_NA(na->ifp); } /* * Allocate a netmap_adapter object, and initialize it from the * 'arg' passed by the driver on attach. * We allocate a block of memory of 'size' bytes, which has room * for struct netmap_adapter plus additional room private to * the caller. * Return 0 on success, ENOMEM otherwise. 
*/ int netmap_attach_ext(struct netmap_adapter *arg, size_t size, int override_reg) { struct netmap_hw_adapter *hwna = NULL; struct ifnet *ifp = NULL; if (size < sizeof(struct netmap_hw_adapter)) { if (netmap_debug & NM_DEBUG_ON) nm_prerr("Invalid netmap adapter size %d", (int)size); return EINVAL; } if (arg == NULL || arg->ifp == NULL) { if (netmap_debug & NM_DEBUG_ON) nm_prerr("either arg or arg->ifp is NULL"); return EINVAL; } if (arg->num_tx_rings == 0 || arg->num_rx_rings == 0) { if (netmap_debug & NM_DEBUG_ON) nm_prerr("%s: invalid rings tx %d rx %d", arg->name, arg->num_tx_rings, arg->num_rx_rings); return EINVAL; } ifp = arg->ifp; if (NM_NA_CLASH(ifp)) { /* If NA(ifp) is not null but there is no valid netmap * adapter it means that someone else is using the same * pointer (e.g. ax25_ptr on linux). This happens for * instance when also PF_RING is in use. */ nm_prerr("Error: netmap adapter hook is busy"); return EBUSY; } hwna = nm_os_malloc(size); if (hwna == NULL) goto fail; hwna->up = *arg; hwna->up.na_flags |= NAF_HOST_RINGS | NAF_NATIVE; strlcpy(hwna->up.name, ifp->if_xname, sizeof(hwna->up.name)); if (override_reg) { hwna->nm_hw_register = hwna->up.nm_register; hwna->up.nm_register = netmap_hw_reg; } if (netmap_attach_common(&hwna->up)) { nm_os_free(hwna); goto fail; } netmap_adapter_get(&hwna->up); NM_ATTACH_NA(ifp, &hwna->up); nm_os_onattach(ifp); if (arg->nm_dtor == NULL) { hwna->up.nm_dtor = netmap_hw_dtor; } if_printf(ifp, "netmap queues/slots: TX %d/%d, RX %d/%d\n", hwna->up.num_tx_rings, hwna->up.num_tx_desc, hwna->up.num_rx_rings, hwna->up.num_rx_desc); return 0; fail: nm_prerr("fail, arg %p ifp %p na %p", arg, ifp, hwna); return (hwna ? EINVAL : ENOMEM); } int netmap_attach(struct netmap_adapter *arg) { return netmap_attach_ext(arg, sizeof(struct netmap_hw_adapter), 1 /* override nm_reg */); } void NM_DBG(netmap_adapter_get)(struct netmap_adapter *na) { if (!na) { return; } refcount_acquire(&na->na_refcount); } /* returns 1 iff the netmap_adapter is destroyed */ int NM_DBG(netmap_adapter_put)(struct netmap_adapter *na) { if (!na) return 1; if (!refcount_release(&na->na_refcount)) return 0; if (na->nm_dtor) na->nm_dtor(na); if (na->tx_rings) { /* XXX should not happen */ if (netmap_debug & NM_DEBUG_ON) nm_prerr("freeing leftover tx_rings"); na->nm_krings_delete(na); } netmap_pipe_dealloc(na); if (na->nm_mem) netmap_mem_put(na->nm_mem); bzero(na, sizeof(*na)); nm_os_free(na); return 1; } /* nm_krings_create callback for all hardware native adapters */ int netmap_hw_krings_create(struct netmap_adapter *na) { int ret = netmap_krings_create(na, 0); if (ret == 0) { /* initialize the mbq for the sw rx ring */ u_int lim = netmap_real_rings(na, NR_RX), i; for (i = na->num_rx_rings; i < lim; i++) { mbq_safe_init(&NMR(na, NR_RX)[i]->rx_queue); } nm_prdis("initialized sw rx queue %d", na->num_rx_rings); } return ret; } /* * Called on module unload by the netmap-enabled drivers */ void netmap_detach(struct ifnet *ifp) { struct netmap_adapter *na = NA(ifp); if (!na) return; NMG_LOCK(); netmap_set_all_rings(na, NM_KR_LOCKED); /* * if the netmap adapter is not native, somebody * changed it, so we can not release it here. * The NAF_ZOMBIE flag will notify the new owner that * the driver is gone. */ if (!(na->na_flags & NAF_NATIVE) || !netmap_adapter_put(na)) { na->na_flags |= NAF_ZOMBIE; } /* give active users a chance to notice that NAF_ZOMBIE has been * turned on, so that they can stop and return an error to userspace. 
* Note that this becomes a NOP if there are no active users and, * therefore, the put() above has deleted the na, since now NA(ifp) is * NULL. */ netmap_enable_all_rings(ifp); NMG_UNLOCK(); } /* * Intercept packets from the network stack and pass them * to netmap as incoming packets on the 'software' ring. * * We only store packets in a bounded mbq and then copy them * in the relevant rxsync routine. * * We rely on the OS to make sure that the ifp and na do not go * away (typically the caller checks for IFF_DRV_RUNNING or the like). * In nm_register() or whenever there is a reinitialization, * we make sure to make the mode change visible here. */ int netmap_transmit(struct ifnet *ifp, struct mbuf *m) { struct netmap_adapter *na = NA(ifp); struct netmap_kring *kring, *tx_kring; u_int len = MBUF_LEN(m); u_int error = ENOBUFS; unsigned int txr; struct mbq *q; int busy; u_int i; i = MBUF_TXQ(m); if (i >= na->num_host_rx_rings) { i = i % na->num_host_rx_rings; } kring = NMR(na, NR_RX)[nma_get_nrings(na, NR_RX) + i]; // XXX [Linux] we do not need this lock // if we follow the down/configure/up protocol -gl // mtx_lock(&na->core_lock); if (!nm_netmap_on(na)) { nm_prerr("%s not in netmap mode anymore", na->name); error = ENXIO; goto done; } txr = MBUF_TXQ(m); if (txr >= na->num_tx_rings) { txr %= na->num_tx_rings; } tx_kring = NMR(na, NR_TX)[txr]; if (tx_kring->nr_mode == NKR_NETMAP_OFF) { return MBUF_TRANSMIT(na, ifp, m); } q = &kring->rx_queue; // XXX reconsider long packets if we handle fragments if (len > NETMAP_BUF_SIZE(na)) { /* too long for us */ nm_prerr("%s from_host, drop packet size %d > %d", na->name, len, NETMAP_BUF_SIZE(na)); goto done; } if (!netmap_generic_hwcsum) { if (nm_os_mbuf_has_csum_offld(m)) { nm_prlim(1, "%s drop mbuf that needs checksum offload", na->name); goto done; } } if (nm_os_mbuf_has_seg_offld(m)) { nm_prlim(1, "%s drop mbuf that needs generic segmentation offload", na->name); goto done; } #ifdef __FreeBSD__ ETHER_BPF_MTAP(ifp, m); #endif /* __FreeBSD__ */ /* protect against netmap_rxsync_from_host(), netmap_sw_to_nic() * and maybe other instances of netmap_transmit (the latter * not possible on Linux). * We enqueue the mbuf only if we are sure there is going to be * enough room in the host RX ring, otherwise we drop it. */ mbq_lock(q); busy = kring->nr_hwtail - kring->nr_hwcur; if (busy < 0) busy += kring->nkr_num_slots; if (busy + mbq_len(q) >= kring->nkr_num_slots - 1) { nm_prlim(2, "%s full hwcur %d hwtail %d qlen %d", na->name, kring->nr_hwcur, kring->nr_hwtail, mbq_len(q)); } else { mbq_enqueue(q, m); nm_prdis(2, "%s %d bufs in queue", na->name, mbq_len(q)); /* notify outside the lock */ m = NULL; error = 0; } mbq_unlock(q); done: if (m) m_freem(m); /* unconditionally wake up listeners */ kring->nm_notify(kring, 0); /* this is normally netmap_notify(), but for nics * connected to a bridge it is netmap_bwrap_intr_notify(), * that possibly forwards the frames through the switch */ return (error); } /* * netmap_reset() is called by the driver routines when reinitializing * a ring. The driver is in charge of locking to protect the kring. * If native netmap mode is not set just return NULL. * If native netmap mode is set, in particular, we have to set nr_mode to * NKR_NETMAP_ON. 
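 *
 * Typical driver usage on ring (re)initialization (illustrative;
 * the surrounding foo_init_rx_ring() context is hypothetical):
 *
 *	slot = netmap_reset(na, NR_RX, ring_nr, 0);
 *	if (slot != NULL) {
 *		... the ring is in netmap mode: attach netmap
 *		buffers to the descriptors instead of mbufs ...
 *	}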
*/ struct netmap_slot * netmap_reset(struct netmap_adapter *na, enum txrx tx, u_int n, u_int new_cur) { struct netmap_kring *kring; int new_hwofs, lim; if (!nm_native_on(na)) { nm_prdis("interface not in native netmap mode"); return NULL; /* nothing to reinitialize */ } /* XXX note- in the new scheme, we are not guaranteed to be * under lock (e.g. when called on a device reset). * In this case, we should set a flag and do not trust too * much the values. In practice: TODO * - set a RESET flag somewhere in the kring * - do the processing in a conservative way * - let the *sync() fixup at the end. */ if (tx == NR_TX) { if (n >= na->num_tx_rings) return NULL; kring = na->tx_rings[n]; if (kring->nr_pending_mode == NKR_NETMAP_OFF) { kring->nr_mode = NKR_NETMAP_OFF; return NULL; } // XXX check whether we should use hwcur or rcur new_hwofs = kring->nr_hwcur - new_cur; } else { if (n >= na->num_rx_rings) return NULL; kring = na->rx_rings[n]; if (kring->nr_pending_mode == NKR_NETMAP_OFF) { kring->nr_mode = NKR_NETMAP_OFF; return NULL; } new_hwofs = kring->nr_hwtail - new_cur; } lim = kring->nkr_num_slots - 1; if (new_hwofs > lim) new_hwofs -= lim + 1; /* Always set the new offset value and realign the ring. */ if (netmap_debug & NM_DEBUG_ON) nm_prinf("%s %s%d hwofs %d -> %d, hwtail %d -> %d", na->name, tx == NR_TX ? "TX" : "RX", n, kring->nkr_hwofs, new_hwofs, kring->nr_hwtail, tx == NR_TX ? lim : kring->nr_hwtail); kring->nkr_hwofs = new_hwofs; if (tx == NR_TX) { kring->nr_hwtail = kring->nr_hwcur + lim; if (kring->nr_hwtail > lim) kring->nr_hwtail -= lim + 1; } /* * Wakeup on the individual and global selwait * We do the wakeup here, but the ring is not yet reconfigured. * However, we are under lock so there are no races. */ kring->nr_mode = NKR_NETMAP_ON; kring->nm_notify(kring, 0); return kring->ring->slot; } /* * Dispatch rx/tx interrupts to the netmap rings. * * "work_done" is non-null on the RX path, NULL for the TX path. * We rely on the OS to make sure that there is only one active * instance per queue, and that there is appropriate locking. * * The 'notify' routine depends on what the ring is attached to. * - for a netmap file descriptor, do a selwakeup on the individual * waitqueue, plus one on the global one if needed * (see netmap_notify) * - for a nic connected to a switch, call the proper forwarding routine * (see netmap_bwrap_intr_notify) */ int netmap_common_irq(struct netmap_adapter *na, u_int q, u_int *work_done) { struct netmap_kring *kring; enum txrx t = (work_done ? NR_RX : NR_TX); q &= NETMAP_RING_MASK; if (netmap_debug & (NM_DEBUG_RXINTR|NM_DEBUG_TXINTR)) { nm_prlim(5, "received %s queue %d", work_done ? "RX" : "TX", q); } if (q >= nma_get_nrings(na, t)) return NM_IRQ_PASS; // not a physical queue kring = NMR(na, t)[q]; if (kring->nr_mode == NKR_NETMAP_OFF) { return NM_IRQ_PASS; } if (t == NR_RX) { kring->nr_kflags |= NKR_PENDINTR; // XXX atomic ? *work_done = 1; /* do not fire napi again */ } return kring->nm_notify(kring, 0); } /* * Default functions to handle rx/tx interrupts from a physical device. * "work_done" is non-null on the RX path, NULL for the TX path. * * If the card is not in netmap mode, simply return NM_IRQ_PASS, * so that the caller proceeds with regular processing. * Otherwise call netmap_common_irq(). * * If the card is connected to a netmap file descriptor, * do a selwakeup on the individual queue, plus one on the global one * if needed (multiqueue card _and_ there are multiqueue listeners), * and return NM_IRQ_COMPLETED.
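 *
 * Illustrative driver-side hook (ring_nr and work_done are the
 * caller's variables):
 *
 *	if (netmap_rx_irq(ifp, ring_nr, &work_done) != NM_IRQ_PASS)
 *		return;	(netmap handled the interrupt)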
* * Finally, if called on rx from an interface connected to a switch, * calls the proper forwarding routine. */ int netmap_rx_irq(struct ifnet *ifp, u_int q, u_int *work_done) { struct netmap_adapter *na = NA(ifp); /* * XXX emulated netmap mode sets NAF_SKIP_INTR so * we still use the regular driver even though the previous * check fails. It is unclear whether we should use * nm_native_on() here. */ if (!nm_netmap_on(na)) return NM_IRQ_PASS; if (na->na_flags & NAF_SKIP_INTR) { nm_prdis("use regular interrupt"); return NM_IRQ_PASS; } return netmap_common_irq(na, q, work_done); } /* set/clear native flags and if_transmit/netdev_ops */ void nm_set_native_flags(struct netmap_adapter *na) { struct ifnet *ifp = na->ifp; /* We do the setup for intercepting packets only if we are the * first user of this adapter. */ if (na->active_fds > 0) { return; } na->na_flags |= NAF_NETMAP_ON; nm_os_onenter(ifp); nm_update_hostrings_mode(na); } void nm_clear_native_flags(struct netmap_adapter *na) { struct ifnet *ifp = na->ifp; /* We undo the setup for intercepting packets only if we are the * last user of this adapter. */ if (na->active_fds > 0) { return; } nm_update_hostrings_mode(na); nm_os_onexit(ifp); na->na_flags &= ~NAF_NETMAP_ON; } void netmap_krings_mode_commit(struct netmap_adapter *na, int onoff) { enum txrx t; for_rx_tx(t) { int i; for (i = 0; i < netmap_real_rings(na, t); i++) { struct netmap_kring *kring = NMR(na, t)[i]; if (onoff && nm_kring_pending_on(kring)) kring->nr_mode = NKR_NETMAP_ON; else if (!onoff && nm_kring_pending_off(kring)) kring->nr_mode = NKR_NETMAP_OFF; } } } /* * Module loader and unloader * * netmap_init() creates the /dev/netmap device and initializes * all global variables. Returns 0 on success, errno on failure * (but there is no chance) * * netmap_fini() destroys everything. */ static struct cdev *netmap_dev; /* /dev/netmap character device. */ extern struct cdevsw netmap_cdevsw; void netmap_fini(void) { if (netmap_dev) destroy_dev(netmap_dev); /* we assume that there are no longer netmap users */ nm_os_ifnet_fini(); netmap_uninit_bridges(); netmap_mem_fini(); NMG_LOCK_DESTROY(); nm_prinf("netmap: unloaded module."); } int netmap_init(void) { int error; NMG_LOCK_INIT(); error = netmap_mem_init(); if (error != 0) goto fail; /* * MAKEDEV_ETERNAL_KLD avoids an expensive check on syscalls * when the module is compiled in. * XXX could use make_dev_credv() to get error number */ netmap_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &netmap_cdevsw, 0, NULL, UID_ROOT, GID_WHEEL, 0600, "netmap"); if (!netmap_dev) goto fail; error = netmap_init_bridges(); if (error) goto fail; #ifdef __FreeBSD__ nm_os_vi_init_index(); #endif error = nm_os_ifnet_init(); if (error) goto fail; nm_prinf("netmap: loaded module"); return (0); fail: netmap_fini(); return (EINVAL); /* may be incorrect */ } Index: head/sys/dev/netmap/netmap_legacy.c =================================================================== --- head/sys/dev/netmap/netmap_legacy.c (revision 353774) +++ head/sys/dev/netmap/netmap_legacy.c (revision 353775) @@ -1,439 +1,439 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (C) 2018 Vincenzo Maffione * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2.
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* $FreeBSD$ */ #if defined(__FreeBSD__) #include <sys/cdefs.h> /* prerequisite */ #include <sys/types.h> #include <sys/param.h> /* defines used in kernel.h */ #include <sys/filio.h> /* FIONBIO */ #include <sys/malloc.h> #include <sys/socketvar.h> /* struct socket */ #include <sys/socket.h> /* sockaddrs */ #include <sys/selinfo.h> #include <net/if.h> #include <net/if_var.h> #include <net/bpf.h> /* BIOCIMMEDIATE */ #include <machine/bus.h> /* bus_dmamap_* */ #include <sys/endian.h> #elif defined(linux) #include "bsd_glue.h" #elif defined(__APPLE__) #warning OSX support is only partial #include "osx_glue.h" #elif defined (_WIN32) #include "win_glue.h" #endif /* * common headers */ #include <net/netmap.h> #include <dev/netmap/netmap_kern.h> #include <dev/netmap/netmap_mem2.h> static int nmreq_register_from_legacy(struct nmreq *nmr, struct nmreq_header *hdr, struct nmreq_register *req) { req->nr_offset = nmr->nr_offset; req->nr_memsize = nmr->nr_memsize; req->nr_tx_slots = nmr->nr_tx_slots; req->nr_rx_slots = nmr->nr_rx_slots; req->nr_tx_rings = nmr->nr_tx_rings; req->nr_rx_rings = nmr->nr_rx_rings; req->nr_host_tx_rings = 0; req->nr_host_rx_rings = 0; req->nr_mem_id = nmr->nr_arg2; req->nr_ringid = nmr->nr_ringid & NETMAP_RING_MASK; if ((nmr->nr_flags & NR_REG_MASK) == NR_REG_DEFAULT) { /* Convert the older nmr->nr_ringid (original * netmap control API) to nmr->nr_flags. */ u_int regmode = NR_REG_DEFAULT; if (req->nr_ringid & NETMAP_SW_RING) { regmode = NR_REG_SW; } else if (req->nr_ringid & NETMAP_HW_RING) { regmode = NR_REG_ONE_NIC; } else { regmode = NR_REG_ALL_NIC; } req->nr_mode = regmode; } else { req->nr_mode = nmr->nr_flags & NR_REG_MASK; } /* Fix nr_name, nr_mode and nr_ringid to handle pipe requests. */ if (req->nr_mode == NR_REG_PIPE_MASTER || req->nr_mode == NR_REG_PIPE_SLAVE) { char suffix[10]; snprintf(suffix, sizeof(suffix), "%c%d", (req->nr_mode == NR_REG_PIPE_MASTER ? '{' : '}'), req->nr_ringid); if (strlen(hdr->nr_name) + strlen(suffix) >= sizeof(hdr->nr_name)) { /* No space for the pipe suffix. */ return ENOBUFS; } - strncat(hdr->nr_name, suffix, strlen(suffix)); + strlcat(hdr->nr_name, suffix, sizeof(hdr->nr_name)); req->nr_mode = NR_REG_ALL_NIC; req->nr_ringid = 0; } req->nr_flags = nmr->nr_flags & (~NR_REG_MASK); if (nmr->nr_ringid & NETMAP_NO_TX_POLL) { req->nr_flags |= NR_NO_TX_POLL; } if (nmr->nr_ringid & NETMAP_DO_RX_POLL) { req->nr_flags |= NR_DO_RX_POLL; } /* nmr->nr_arg1 (nr_pipes) ignored */ req->nr_extra_bufs = nmr->nr_arg3; return 0; } /* Convert the legacy 'nmr' struct into one of the nmreq_xyz structs * (new API). The new struct is dynamically allocated.
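 * For example (illustrative): a legacy NIOCREGIF with
 * nr_ringid = (NETMAP_HW_RING | 2) and default flags becomes a
 * NETMAP_REQ_REGISTER with nr_mode = NR_REG_ONE_NIC and
 * nr_ringid = 2 (see nmreq_register_from_legacy() above).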
*/ static struct nmreq_header * nmreq_from_legacy(struct nmreq *nmr, u_long ioctl_cmd) { struct nmreq_header *hdr = nm_os_malloc(sizeof(*hdr)); if (hdr == NULL) { goto oom; } /* Sanitize nmr->nr_name by adding the string terminator. */ if (ioctl_cmd == NIOCGINFO || ioctl_cmd == NIOCREGIF) { nmr->nr_name[sizeof(nmr->nr_name) - 1] = '\0'; } /* First prepare the request header. */ hdr->nr_version = NETMAP_API; /* new API */ strlcpy(hdr->nr_name, nmr->nr_name, sizeof(nmr->nr_name)); hdr->nr_options = (uintptr_t)NULL; hdr->nr_body = (uintptr_t)NULL; switch (ioctl_cmd) { case NIOCREGIF: { switch (nmr->nr_cmd) { case 0: { /* Regular NIOCREGIF operation. */ struct nmreq_register *req = nm_os_malloc(sizeof(*req)); if (!req) { goto oom; } hdr->nr_body = (uintptr_t)req; hdr->nr_reqtype = NETMAP_REQ_REGISTER; if (nmreq_register_from_legacy(nmr, hdr, req)) { goto oom; } break; } case NETMAP_BDG_ATTACH: { struct nmreq_vale_attach *req = nm_os_malloc(sizeof(*req)); if (!req) { goto oom; } hdr->nr_body = (uintptr_t)req; hdr->nr_reqtype = NETMAP_REQ_VALE_ATTACH; if (nmreq_register_from_legacy(nmr, hdr, &req->reg)) { goto oom; } /* Fix nr_mode, starting from nr_arg1. */ if (nmr->nr_arg1 & NETMAP_BDG_HOST) { req->reg.nr_mode = NR_REG_NIC_SW; } else { req->reg.nr_mode = NR_REG_ALL_NIC; } break; } case NETMAP_BDG_DETACH: { hdr->nr_reqtype = NETMAP_REQ_VALE_DETACH; hdr->nr_body = (uintptr_t)nm_os_malloc(sizeof(struct nmreq_vale_detach)); break; } case NETMAP_BDG_VNET_HDR: case NETMAP_VNET_HDR_GET: { struct nmreq_port_hdr *req = nm_os_malloc(sizeof(*req)); if (!req) { goto oom; } hdr->nr_body = (uintptr_t)req; hdr->nr_reqtype = (nmr->nr_cmd == NETMAP_BDG_VNET_HDR) ? NETMAP_REQ_PORT_HDR_SET : NETMAP_REQ_PORT_HDR_GET; req->nr_hdr_len = nmr->nr_arg1; break; } case NETMAP_BDG_NEWIF : { struct nmreq_vale_newif *req = nm_os_malloc(sizeof(*req)); if (!req) { goto oom; } hdr->nr_body = (uintptr_t)req; hdr->nr_reqtype = NETMAP_REQ_VALE_NEWIF; req->nr_tx_slots = nmr->nr_tx_slots; req->nr_rx_slots = nmr->nr_rx_slots; req->nr_tx_rings = nmr->nr_tx_rings; req->nr_rx_rings = nmr->nr_rx_rings; req->nr_mem_id = nmr->nr_arg2; break; } case NETMAP_BDG_DELIF: { hdr->nr_reqtype = NETMAP_REQ_VALE_DELIF; break; } case NETMAP_BDG_POLLING_ON: case NETMAP_BDG_POLLING_OFF: { struct nmreq_vale_polling *req = nm_os_malloc(sizeof(*req)); if (!req) { goto oom; } hdr->nr_body = (uintptr_t)req; hdr->nr_reqtype = (nmr->nr_cmd == NETMAP_BDG_POLLING_ON) ? NETMAP_REQ_VALE_POLLING_ENABLE : NETMAP_REQ_VALE_POLLING_DISABLE; switch (nmr->nr_flags & NR_REG_MASK) { default: req->nr_mode = 0; /* invalid */ break; case NR_REG_ONE_NIC: req->nr_mode = NETMAP_POLLING_MODE_MULTI_CPU; break; case NR_REG_ALL_NIC: req->nr_mode = NETMAP_POLLING_MODE_SINGLE_CPU; break; } req->nr_first_cpu_id = nmr->nr_ringid & NETMAP_RING_MASK; req->nr_num_polling_cpus = nmr->nr_arg1; break; } case NETMAP_PT_HOST_CREATE: case NETMAP_PT_HOST_DELETE: { nm_prerr("Netmap passthrough not supported yet"); return NULL; break; } } break; } case NIOCGINFO: { if (nmr->nr_cmd == NETMAP_BDG_LIST) { struct nmreq_vale_list *req = nm_os_malloc(sizeof(*req)); if (!req) { goto oom; } hdr->nr_body = (uintptr_t)req; hdr->nr_reqtype = NETMAP_REQ_VALE_LIST; req->nr_bridge_idx = nmr->nr_arg1; req->nr_port_idx = nmr->nr_arg2; } else { /* Regular NIOCGINFO. 
*/ struct nmreq_port_info_get *req = nm_os_malloc(sizeof(*req)); if (!req) { goto oom; } hdr->nr_body = (uintptr_t)req; hdr->nr_reqtype = NETMAP_REQ_PORT_INFO_GET; req->nr_memsize = nmr->nr_memsize; req->nr_tx_slots = nmr->nr_tx_slots; req->nr_rx_slots = nmr->nr_rx_slots; req->nr_tx_rings = nmr->nr_tx_rings; req->nr_rx_rings = nmr->nr_rx_rings; req->nr_host_tx_rings = 0; req->nr_host_rx_rings = 0; req->nr_mem_id = nmr->nr_arg2; } break; } } return hdr; oom: if (hdr) { if (hdr->nr_body) { nm_os_free((void *)(uintptr_t)hdr->nr_body); } nm_os_free(hdr); } nm_prerr("Failed to allocate memory for nmreq_xyz struct"); return NULL; } static void nmreq_register_to_legacy(const struct nmreq_register *req, struct nmreq *nmr) { nmr->nr_offset = req->nr_offset; nmr->nr_memsize = req->nr_memsize; nmr->nr_tx_slots = req->nr_tx_slots; nmr->nr_rx_slots = req->nr_rx_slots; nmr->nr_tx_rings = req->nr_tx_rings; nmr->nr_rx_rings = req->nr_rx_rings; nmr->nr_arg2 = req->nr_mem_id; nmr->nr_arg3 = req->nr_extra_bufs; } /* Convert a nmreq_xyz struct (new API) to the legacy 'nmr' struct. * It also frees the nmreq_xyz struct, as it was allocated by * nmreq_from_legacy(). */ static int nmreq_to_legacy(struct nmreq_header *hdr, struct nmreq *nmr) { int ret = 0; /* We only write-back the fields that the user expects to be * written back. */ switch (hdr->nr_reqtype) { case NETMAP_REQ_REGISTER: { struct nmreq_register *req = (struct nmreq_register *)(uintptr_t)hdr->nr_body; nmreq_register_to_legacy(req, nmr); break; } case NETMAP_REQ_PORT_INFO_GET: { struct nmreq_port_info_get *req = (struct nmreq_port_info_get *)(uintptr_t)hdr->nr_body; nmr->nr_memsize = req->nr_memsize; nmr->nr_tx_slots = req->nr_tx_slots; nmr->nr_rx_slots = req->nr_rx_slots; nmr->nr_tx_rings = req->nr_tx_rings; nmr->nr_rx_rings = req->nr_rx_rings; nmr->nr_arg2 = req->nr_mem_id; break; } case NETMAP_REQ_VALE_ATTACH: { struct nmreq_vale_attach *req = (struct nmreq_vale_attach *)(uintptr_t)hdr->nr_body; nmreq_register_to_legacy(&req->reg, nmr); break; } case NETMAP_REQ_VALE_DETACH: { break; } case NETMAP_REQ_VALE_LIST: { struct nmreq_vale_list *req = (struct nmreq_vale_list *)(uintptr_t)hdr->nr_body; strlcpy(nmr->nr_name, hdr->nr_name, sizeof(nmr->nr_name)); nmr->nr_arg1 = req->nr_bridge_idx; nmr->nr_arg2 = req->nr_port_idx; break; } case NETMAP_REQ_PORT_HDR_SET: case NETMAP_REQ_PORT_HDR_GET: { struct nmreq_port_hdr *req = (struct nmreq_port_hdr *)(uintptr_t)hdr->nr_body; nmr->nr_arg1 = req->nr_hdr_len; break; } case NETMAP_REQ_VALE_NEWIF: { struct nmreq_vale_newif *req = (struct nmreq_vale_newif *)(uintptr_t)hdr->nr_body; nmr->nr_tx_slots = req->nr_tx_slots; nmr->nr_rx_slots = req->nr_rx_slots; nmr->nr_tx_rings = req->nr_tx_rings; nmr->nr_rx_rings = req->nr_rx_rings; nmr->nr_arg2 = req->nr_mem_id; break; } case NETMAP_REQ_VALE_DELIF: case NETMAP_REQ_VALE_POLLING_ENABLE: case NETMAP_REQ_VALE_POLLING_DISABLE: { break; } } return ret; } int netmap_ioctl_legacy(struct netmap_priv_d *priv, u_long cmd, caddr_t data, struct thread *td) { int error = 0; switch (cmd) { case NIOCGINFO: case NIOCREGIF: { /* Request for the legacy control API. Convert it to a * NIOCCTRL request. 
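*
* The sequence below is: nmreq_from_legacy() builds the new-style
* request, netmap_ioctl() executes it, and nmreq_to_legacy() copies
* the results back into the legacy struct. A minimal sketch of the
* old-style userspace code that this path keeps working (purely
* illustrative, the interface name is hypothetical and error
* handling is omitted):
*
*	struct nmreq nmr = { .nr_version = 14 };
*	strlcpy(nmr.nr_name, "em0", sizeof(nmr.nr_name));
*	ioctl(fd, NIOCREGIF, &nmr);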
*/ struct nmreq *nmr = (struct nmreq *) data; struct nmreq_header *hdr; if (nmr->nr_version < 14) { nm_prerr("Minimum supported API is 14 (requested %u)", nmr->nr_version); return EINVAL; } hdr = nmreq_from_legacy(nmr, cmd); if (hdr == NULL) { /* out of memory */ return ENOMEM; } error = netmap_ioctl(priv, NIOCCTRL, (caddr_t)hdr, td, /*nr_body_is_user=*/0); if (error == 0) { nmreq_to_legacy(hdr, nmr); } if (hdr->nr_body) { nm_os_free((void *)(uintptr_t)hdr->nr_body); } nm_os_free(hdr); break; } #ifdef WITH_VALE case NIOCCONFIG: { struct nm_ifreq *nr = (struct nm_ifreq *)data; error = netmap_bdg_config(nr); break; } #endif #ifdef __FreeBSD__ case FIONBIO: case FIOASYNC: /* FIONBIO/FIOASYNC are no-ops. */ break; case BIOCIMMEDIATE: case BIOCGHDRCMPLT: case BIOCSHDRCMPLT: case BIOCSSEESENT: /* Ignore these commands. */ break; default: /* allow device-specific ioctls */ { struct nmreq *nmr = (struct nmreq *)data; struct ifnet *ifp = ifunit_ref(nmr->nr_name); if (ifp == NULL) { error = ENXIO; } else { struct socket so; bzero(&so, sizeof(so)); so.so_vnet = ifp->if_vnet; // so->so_proto not null. error = ifioctl(&so, cmd, data, td); if_rele(ifp); } break; } #else /* linux */ default: error = EOPNOTSUPP; #endif /* linux */ } return error; } Index: head/sys/dev/netmap/netmap_mem2.c =================================================================== --- head/sys/dev/netmap/netmap_mem2.c (revision 353774) +++ head/sys/dev/netmap/netmap_mem2.c (revision 353775) @@ -1,2857 +1,2857 @@ /*- * SPDX-License-Identifier: BSD-2-Clause-FreeBSD * * Copyright (C) 2012-2014 Matteo Landi * Copyright (C) 2012-2016 Luigi Rizzo * Copyright (C) 2012-2016 Giuseppe Lettieri * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. 
*/ #ifdef linux #include "bsd_glue.h" #endif /* linux */ #ifdef __APPLE__ #include "osx_glue.h" #endif /* __APPLE__ */ #ifdef __FreeBSD__ #include /* prerequisite */ __FBSDID("$FreeBSD$"); #include #include #include /* MALLOC_DEFINE */ #include #include /* vtophys */ #include /* vtophys */ #include /* sockaddrs */ #include #include #include #include #include #include /* bus_dmamap_* */ /* M_NETMAP only used in here */ MALLOC_DECLARE(M_NETMAP); MALLOC_DEFINE(M_NETMAP, "netmap", "Network memory map"); #endif /* __FreeBSD__ */ #ifdef _WIN32 #include #endif #include #include #include #include "netmap_mem2.h" #ifdef _WIN32_USE_SMALL_GENERIC_DEVICES_MEMORY #define NETMAP_BUF_MAX_NUM 8*4096 /* if too big takes too much time to allocate */ #else #define NETMAP_BUF_MAX_NUM 20*4096*2 /* large machine */ #endif #define NETMAP_POOL_MAX_NAMSZ 32 enum { NETMAP_IF_POOL = 0, NETMAP_RING_POOL, NETMAP_BUF_POOL, NETMAP_POOLS_NR }; struct netmap_obj_params { u_int size; u_int num; u_int last_size; u_int last_num; }; struct netmap_obj_pool { char name[NETMAP_POOL_MAX_NAMSZ]; /* name of the allocator */ /* ---------------------------------------------------*/ /* these are only meaningful if the pool is finalized */ /* (see 'finalized' field in netmap_mem_d) */ size_t memtotal; /* actual total memory space */ struct lut_entry *lut; /* virt,phys addresses, objtotal entries */ uint32_t *bitmap; /* one bit per buffer, 1 means free */ uint32_t *invalid_bitmap;/* one bit per buffer, 1 means invalid */ uint32_t bitmap_slots; /* number of uint32 entries in bitmap */ u_int objtotal; /* actual total number of objects. */ u_int numclusters; /* actual number of clusters */ u_int objfree; /* number of free objects. */ int alloc_done; /* we have allocated the memory */ /* ---------------------------------------------------*/ /* limits */ u_int objminsize; /* minimum object size */ u_int objmaxsize; /* maximum object size */ u_int nummin; /* minimum number of objects */ u_int nummax; /* maximum number of objects */ /* these are changed only by config */ u_int _objtotal; /* total number of objects */ u_int _objsize; /* object size */ u_int _clustsize; /* cluster size */ u_int _clustentries; /* objects per cluster */ u_int _numclusters; /* number of clusters */ /* requested values */ u_int r_objtotal; u_int r_objsize; }; #define NMA_LOCK_T NM_MTX_T #define NMA_LOCK_INIT(n) NM_MTX_INIT((n)->nm_mtx) #define NMA_LOCK_DESTROY(n) NM_MTX_DESTROY((n)->nm_mtx) #define NMA_LOCK(n) NM_MTX_LOCK((n)->nm_mtx) #define NMA_SPINLOCK(n) NM_MTX_SPINLOCK((n)->nm_mtx) #define NMA_UNLOCK(n) NM_MTX_UNLOCK((n)->nm_mtx) struct netmap_mem_ops { int (*nmd_get_lut)(struct netmap_mem_d *, struct netmap_lut*); int (*nmd_get_info)(struct netmap_mem_d *, uint64_t *size, u_int *memflags, uint16_t *id); vm_paddr_t (*nmd_ofstophys)(struct netmap_mem_d *, vm_ooffset_t); int (*nmd_config)(struct netmap_mem_d *); int (*nmd_finalize)(struct netmap_mem_d *); void (*nmd_deref)(struct netmap_mem_d *); ssize_t (*nmd_if_offset)(struct netmap_mem_d *, const void *vaddr); void (*nmd_delete)(struct netmap_mem_d *); struct netmap_if * (*nmd_if_new)(struct netmap_adapter *, struct netmap_priv_d *); void (*nmd_if_delete)(struct netmap_adapter *, struct netmap_if *); int (*nmd_rings_create)(struct netmap_adapter *); void (*nmd_rings_delete)(struct netmap_adapter *); }; struct netmap_mem_d { NMA_LOCK_T nm_mtx; /* protect the allocator */ size_t nm_totalsize; /* shorthand */ u_int flags; #define NETMAP_MEM_FINALIZED 0x1 /* preallocation done */ #define NETMAP_MEM_HIDDEN 0x8 /* 
being prepared */ int lasterr; /* last error for curr config */ int active; /* active users */ int refcount; /* the three allocators */ struct netmap_obj_pool pools[NETMAP_POOLS_NR]; nm_memid_t nm_id; /* allocator identifier */ int nm_grp; /* iommu group id */ /* list of all existing allocators, sorted by nm_id */ struct netmap_mem_d *prev, *next; struct netmap_mem_ops *ops; struct netmap_obj_params params[NETMAP_POOLS_NR]; #define NM_MEM_NAMESZ 16 char name[NM_MEM_NAMESZ]; }; int netmap_mem_get_lut(struct netmap_mem_d *nmd, struct netmap_lut *lut) { int rv; NMA_LOCK(nmd); rv = nmd->ops->nmd_get_lut(nmd, lut); NMA_UNLOCK(nmd); return rv; } int netmap_mem_get_info(struct netmap_mem_d *nmd, uint64_t *size, u_int *memflags, nm_memid_t *memid) { int rv; NMA_LOCK(nmd); rv = nmd->ops->nmd_get_info(nmd, size, memflags, memid); NMA_UNLOCK(nmd); return rv; } vm_paddr_t netmap_mem_ofstophys(struct netmap_mem_d *nmd, vm_ooffset_t off) { vm_paddr_t pa; #if defined(__FreeBSD__) /* This function is called by netmap_dev_pager_fault(), which holds a * non-sleepable lock since FreeBSD 12. Since we cannot sleep, we * spin on the trylock. */ NMA_SPINLOCK(nmd); #else NMA_LOCK(nmd); #endif pa = nmd->ops->nmd_ofstophys(nmd, off); NMA_UNLOCK(nmd); return pa; } static int netmap_mem_config(struct netmap_mem_d *nmd) { if (nmd->active) { /* already in use. Not fatal, but we * cannot change the configuration */ return 0; } return nmd->ops->nmd_config(nmd); } ssize_t netmap_mem_if_offset(struct netmap_mem_d *nmd, const void *off) { ssize_t rv; NMA_LOCK(nmd); rv = nmd->ops->nmd_if_offset(nmd, off); NMA_UNLOCK(nmd); return rv; } static void netmap_mem_delete(struct netmap_mem_d *nmd) { nmd->ops->nmd_delete(nmd); } struct netmap_if * netmap_mem_if_new(struct netmap_adapter *na, struct netmap_priv_d *priv) { struct netmap_if *nifp; struct netmap_mem_d *nmd = na->nm_mem; NMA_LOCK(nmd); nifp = nmd->ops->nmd_if_new(na, priv); NMA_UNLOCK(nmd); return nifp; } void netmap_mem_if_delete(struct netmap_adapter *na, struct netmap_if *nif) { struct netmap_mem_d *nmd = na->nm_mem; NMA_LOCK(nmd); nmd->ops->nmd_if_delete(na, nif); NMA_UNLOCK(nmd); } int netmap_mem_rings_create(struct netmap_adapter *na) { int rv; struct netmap_mem_d *nmd = na->nm_mem; NMA_LOCK(nmd); rv = nmd->ops->nmd_rings_create(na); NMA_UNLOCK(nmd); return rv; } void netmap_mem_rings_delete(struct netmap_adapter *na) { struct netmap_mem_d *nmd = na->nm_mem; NMA_LOCK(nmd); nmd->ops->nmd_rings_delete(na); NMA_UNLOCK(nmd); } static int netmap_mem_map(struct netmap_obj_pool *, struct netmap_adapter *); static int netmap_mem_unmap(struct netmap_obj_pool *, struct netmap_adapter *); static int nm_mem_assign_group(struct netmap_mem_d *, struct device *); static void nm_mem_release_id(struct netmap_mem_d *); nm_memid_t netmap_mem_get_id(struct netmap_mem_d *nmd) { return nmd->nm_id; } #ifdef NM_DEBUG_MEM_PUTGET #define NM_DBG_REFC(nmd, func, line) \ nm_prinf("%d mem[%d] -> %d", line, (nmd)->nm_id, (nmd)->refcount); #else #define NM_DBG_REFC(nmd, func, line) #endif /* circular list of all existing allocators */ static struct netmap_mem_d *netmap_last_mem_d = &nm_mem; NM_MTX_T nm_mem_list_lock; struct netmap_mem_d * __netmap_mem_get(struct netmap_mem_d *nmd, const char *func, int line) { NM_MTX_LOCK(nm_mem_list_lock); nmd->refcount++; NM_DBG_REFC(nmd, func, line); NM_MTX_UNLOCK(nm_mem_list_lock); return nmd; } void __netmap_mem_put(struct netmap_mem_d *nmd, const char *func, int line) { int last; NM_MTX_LOCK(nm_mem_list_lock); last = (--nmd->refcount == 0); if (last)
nm_mem_release_id(nmd); NM_DBG_REFC(nmd, func, line); NM_MTX_UNLOCK(nm_mem_list_lock); if (last) netmap_mem_delete(nmd); } int netmap_mem_finalize(struct netmap_mem_d *nmd, struct netmap_adapter *na) { int lasterr = 0; if (nm_mem_assign_group(nmd, na->pdev) < 0) { return ENOMEM; } NMA_LOCK(nmd); if (netmap_mem_config(nmd)) goto out; nmd->active++; nmd->lasterr = nmd->ops->nmd_finalize(nmd); if (!nmd->lasterr && na->pdev) { nmd->lasterr = netmap_mem_map(&nmd->pools[NETMAP_BUF_POOL], na); } out: lasterr = nmd->lasterr; NMA_UNLOCK(nmd); if (lasterr) netmap_mem_deref(nmd, na); return lasterr; } static int nm_isset(uint32_t *bitmap, u_int i) { return bitmap[ (i>>5) ] & ( 1U << (i & 31U) ); } static int netmap_init_obj_allocator_bitmap(struct netmap_obj_pool *p) { u_int n, j; if (p->bitmap == NULL) { /* Allocate the bitmap */ n = (p->objtotal + 31) / 32; p->bitmap = nm_os_malloc(sizeof(p->bitmap[0]) * n); if (p->bitmap == NULL) { nm_prerr("Unable to create bitmap (%d entries) for allocator '%s'", (int)n, p->name); return ENOMEM; } p->bitmap_slots = n; } else { memset(p->bitmap, 0, p->bitmap_slots * sizeof(p->bitmap[0])); } p->objfree = 0; /* * Set all the bits in the bitmap that have * corresponding buffers to 1 to indicate they are * free. */ for (j = 0; j < p->objtotal; j++) { if (p->invalid_bitmap && nm_isset(p->invalid_bitmap, j)) { if (netmap_debug & NM_DEBUG_MEM) nm_prinf("skipping %s %d", p->name, j); continue; } p->bitmap[ (j>>5) ] |= ( 1U << (j & 31U) ); p->objfree++; } if (netmap_verbose) nm_prinf("%s free %u", p->name, p->objfree); if (p->objfree == 0) { if (netmap_verbose) nm_prerr("%s: no objects available", p->name); return ENOMEM; } return 0; } static int netmap_mem_init_bitmaps(struct netmap_mem_d *nmd) { int i, error = 0; for (i = 0; i < NETMAP_POOLS_NR; i++) { struct netmap_obj_pool *p = &nmd->pools[i]; error = netmap_init_obj_allocator_bitmap(p); if (error) return error; } /* * buffers 0 and 1 are reserved */ if (nmd->pools[NETMAP_BUF_POOL].objfree < 2) { nm_prerr("%s: not enough buffers", nmd->pools[NETMAP_BUF_POOL].name); return ENOMEM; } nmd->pools[NETMAP_BUF_POOL].objfree -= 2; if (nmd->pools[NETMAP_BUF_POOL].bitmap) { /* XXX This check is a workaround that prevents a * NULL pointer crash which currently happens only * with ptnetmap guests. * Removed shared-info --> is the bug still there? */ nmd->pools[NETMAP_BUF_POOL].bitmap[0] = ~3U; } return 0; } int netmap_mem_deref(struct netmap_mem_d *nmd, struct netmap_adapter *na) { int last_user = 0; NMA_LOCK(nmd); if (na->active_fds <= 0) netmap_mem_unmap(&nmd->pools[NETMAP_BUF_POOL], na); if (nmd->active == 1) { last_user = 1; /* * Reset the allocator when it falls out of use so that any * pool resources leaked by unclean application exits are * reclaimed. 
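* (netmap_mem_init_bitmaps() recomputes the free bitmaps from
* scratch, which implicitly marks every object, including any
* leaked one, as free again.)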
*/ netmap_mem_init_bitmaps(nmd); } nmd->ops->nmd_deref(nmd); nmd->active--; if (last_user) { nmd->nm_grp = -1; nmd->lasterr = 0; } NMA_UNLOCK(nmd); return last_user; } /* accessor functions */ static int netmap_mem2_get_lut(struct netmap_mem_d *nmd, struct netmap_lut *lut) { lut->lut = nmd->pools[NETMAP_BUF_POOL].lut; #ifdef __FreeBSD__ lut->plut = lut->lut; #endif lut->objtotal = nmd->pools[NETMAP_BUF_POOL].objtotal; lut->objsize = nmd->pools[NETMAP_BUF_POOL]._objsize; return 0; } static struct netmap_obj_params netmap_min_priv_params[NETMAP_POOLS_NR] = { [NETMAP_IF_POOL] = { .size = 1024, .num = 2, }, [NETMAP_RING_POOL] = { .size = 5*PAGE_SIZE, .num = 4, }, [NETMAP_BUF_POOL] = { .size = 2048, .num = 4098, }, }; /* * nm_mem is the memory allocator used for all physical interfaces * running in netmap mode. * Virtual (VALE) ports will have each its own allocator. */ extern struct netmap_mem_ops netmap_mem_global_ops; /* forward */ struct netmap_mem_d nm_mem = { /* Our memory allocator. */ .pools = { [NETMAP_IF_POOL] = { .name = "netmap_if", .objminsize = sizeof(struct netmap_if), .objmaxsize = 4096, .nummin = 10, /* don't be stingy */ .nummax = 10000, /* XXX very large */ }, [NETMAP_RING_POOL] = { .name = "netmap_ring", .objminsize = sizeof(struct netmap_ring), .objmaxsize = 32*PAGE_SIZE, .nummin = 2, .nummax = 1024, }, [NETMAP_BUF_POOL] = { .name = "netmap_buf", .objminsize = 64, .objmaxsize = 65536, .nummin = 4, .nummax = 1000000, /* one million! */ }, }, .params = { [NETMAP_IF_POOL] = { .size = 1024, .num = 100, }, [NETMAP_RING_POOL] = { .size = 9*PAGE_SIZE, .num = 200, }, [NETMAP_BUF_POOL] = { .size = 2048, .num = NETMAP_BUF_MAX_NUM, }, }, .nm_id = 1, .nm_grp = -1, .prev = &nm_mem, .next = &nm_mem, .ops = &netmap_mem_global_ops, .name = "1" }; /* blueprint for the private memory allocators */ /* XXX clang is not happy about using name as a print format */ static const struct netmap_mem_d nm_blueprint = { .pools = { [NETMAP_IF_POOL] = { .name = "%s_if", .objminsize = sizeof(struct netmap_if), .objmaxsize = 4096, .nummin = 1, .nummax = 100, }, [NETMAP_RING_POOL] = { .name = "%s_ring", .objminsize = sizeof(struct netmap_ring), .objmaxsize = 32*PAGE_SIZE, .nummin = 2, .nummax = 1024, }, [NETMAP_BUF_POOL] = { .name = "%s_buf", .objminsize = 64, .objmaxsize = 65536, .nummin = 4, .nummax = 1000000, /* one million! 
*/ }, }, .nm_grp = -1, .flags = NETMAP_MEM_PRIVATE, .ops = &netmap_mem_global_ops, }; /* memory allocator related sysctls */ #define STRINGIFY(x) #x #define DECLARE_SYSCTLS(id, name) \ SYSBEGIN(mem2_ ## name); \ SYSCTL_INT(_dev_netmap, OID_AUTO, name##_size, \ CTLFLAG_RW, &nm_mem.params[id].size, 0, "Requested size of netmap " STRINGIFY(name) "s"); \ SYSCTL_INT(_dev_netmap, OID_AUTO, name##_curr_size, \ CTLFLAG_RD, &nm_mem.pools[id]._objsize, 0, "Current size of netmap " STRINGIFY(name) "s"); \ SYSCTL_INT(_dev_netmap, OID_AUTO, name##_num, \ CTLFLAG_RW, &nm_mem.params[id].num, 0, "Requested number of netmap " STRINGIFY(name) "s"); \ SYSCTL_INT(_dev_netmap, OID_AUTO, name##_curr_num, \ CTLFLAG_RD, &nm_mem.pools[id].objtotal, 0, "Current number of netmap " STRINGIFY(name) "s"); \ SYSCTL_INT(_dev_netmap, OID_AUTO, priv_##name##_size, \ CTLFLAG_RW, &netmap_min_priv_params[id].size, 0, \ "Default size of private netmap " STRINGIFY(name) "s"); \ SYSCTL_INT(_dev_netmap, OID_AUTO, priv_##name##_num, \ CTLFLAG_RW, &netmap_min_priv_params[id].num, 0, \ "Default number of private netmap " STRINGIFY(name) "s"); \ SYSEND SYSCTL_DECL(_dev_netmap); DECLARE_SYSCTLS(NETMAP_IF_POOL, if); DECLARE_SYSCTLS(NETMAP_RING_POOL, ring); DECLARE_SYSCTLS(NETMAP_BUF_POOL, buf); /* call with nm_mem_list_lock held */ static int nm_mem_assign_id_locked(struct netmap_mem_d *nmd) { nm_memid_t id; struct netmap_mem_d *scan = netmap_last_mem_d; int error = ENOMEM; do { /* we rely on unsigned wrap around */ id = scan->nm_id + 1; if (id == 0) /* reserve 0 as error value */ id = 1; scan = scan->next; if (id != scan->nm_id) { nmd->nm_id = id; nmd->prev = scan->prev; nmd->next = scan; scan->prev->next = nmd; scan->prev = nmd; netmap_last_mem_d = nmd; nmd->refcount = 1; NM_DBG_REFC(nmd, __FUNCTION__, __LINE__); error = 0; break; } } while (scan != netmap_last_mem_d); return error; } /* call with nm_mem_list_lock *not* held */ static int nm_mem_assign_id(struct netmap_mem_d *nmd) { int ret; NM_MTX_LOCK(nm_mem_list_lock); ret = nm_mem_assign_id_locked(nmd); NM_MTX_UNLOCK(nm_mem_list_lock); return ret; } /* call with nm_mem_list_lock held */ static void nm_mem_release_id(struct netmap_mem_d *nmd) { nmd->prev->next = nmd->next; nmd->next->prev = nmd->prev; if (netmap_last_mem_d == nmd) netmap_last_mem_d = nmd->prev; nmd->prev = nmd->next = NULL; } struct netmap_mem_d * netmap_mem_find(nm_memid_t id) { struct netmap_mem_d *nmd; NM_MTX_LOCK(nm_mem_list_lock); nmd = netmap_last_mem_d; do { if (!(nmd->flags & NETMAP_MEM_HIDDEN) && nmd->nm_id == id) { nmd->refcount++; NM_DBG_REFC(nmd, __FUNCTION__, __LINE__); NM_MTX_UNLOCK(nm_mem_list_lock); return nmd; } nmd = nmd->next; } while (nmd != netmap_last_mem_d); NM_MTX_UNLOCK(nm_mem_list_lock); return NULL; } static int nm_mem_assign_group(struct netmap_mem_d *nmd, struct device *dev) { int err = 0, id; id = nm_iommu_group_id(dev); if (netmap_debug & NM_DEBUG_MEM) nm_prinf("iommu_group %d", id); NMA_LOCK(nmd); if (nmd->nm_grp < 0) nmd->nm_grp = id; if (nmd->nm_grp != id) { if (netmap_verbose) nm_prerr("iommu group mismatch: %u vs %u", nmd->nm_grp, id); nmd->lasterr = err = ENOMEM; } NMA_UNLOCK(nmd); return err; } static struct lut_entry * nm_alloc_lut(u_int nobj) { size_t n = sizeof(struct lut_entry) * nobj; struct lut_entry *lut; #ifdef linux lut = vmalloc(n); #else lut = nm_os_malloc(n); #endif return lut; } static void nm_free_lut(struct lut_entry *lut, u_int objtotal) { bzero(lut, sizeof(struct lut_entry) * objtotal); #ifdef linux vfree(lut); #else nm_os_free(lut); #endif } #if 
defined(linux) || defined(_WIN32) static struct plut_entry * nm_alloc_plut(u_int nobj) { size_t n = sizeof(struct plut_entry) * nobj; struct plut_entry *lut; lut = vmalloc(n); return lut; } static void nm_free_plut(struct plut_entry * lut) { vfree(lut); } #endif /* linux or _WIN32 */ /* * First, find the allocator that contains the requested offset, * then locate the cluster through a lookup table. */ static vm_paddr_t netmap_mem2_ofstophys(struct netmap_mem_d* nmd, vm_ooffset_t offset) { int i; vm_ooffset_t o = offset; vm_paddr_t pa; struct netmap_obj_pool *p; p = nmd->pools; for (i = 0; i < NETMAP_POOLS_NR; offset -= p[i].memtotal, i++) { if (offset >= p[i].memtotal) continue; // now look up the cluster's address #ifndef _WIN32 pa = vtophys(p[i].lut[offset / p[i]._objsize].vaddr) + offset % p[i]._objsize; #else pa = vtophys(p[i].lut[offset / p[i]._objsize].vaddr); pa.QuadPart += offset % p[i]._objsize; #endif return pa; } /* this is only in case of errors */ nm_prerr("invalid ofs 0x%x out of 0x%zx 0x%zx 0x%zx", (u_int)o, p[NETMAP_IF_POOL].memtotal, p[NETMAP_IF_POOL].memtotal + p[NETMAP_RING_POOL].memtotal, p[NETMAP_IF_POOL].memtotal + p[NETMAP_RING_POOL].memtotal + p[NETMAP_BUF_POOL].memtotal); #ifndef _WIN32 return 0; /* bad address */ #else vm_paddr_t res; res.QuadPart = 0; return res; #endif } #ifdef _WIN32 /* * win32_build_virtual_memory_for_userspace * * This function gets all the objects that make up the pools and maps * a contiguous virtual memory space for userspace. * It works this way: * 1 - allocate a Memory Descriptor List as wide as the sum * of the memory needed for the pools * 2 - cycle all the objects in every pool and for every object do * * 2a - cycle all the objects in every pool, get the list * of the physical address descriptors * 2b - calculate the offset in the array of page descriptors in the * main MDL * 2c - copy the descriptors of the object in the main MDL * * 3 - return the resulting MDL that needs to be mapped in userland * * In this way we will have an MDL that describes all the memory for the * objects in a single object */ PMDL win32_build_user_vm_map(struct netmap_mem_d* nmd) { u_int memflags, ofs = 0; PMDL mainMdl, tempMdl; uint64_t memsize; int i, j; if (netmap_mem_get_info(nmd, &memsize, &memflags, NULL)) { nm_prerr("memory not finalised yet"); return NULL; } mainMdl = IoAllocateMdl(NULL, memsize, FALSE, FALSE, NULL); if (mainMdl == NULL) { nm_prerr("failed to allocate mdl"); return NULL; } NMA_LOCK(nmd); for (i = 0; i < NETMAP_POOLS_NR; i++) { struct netmap_obj_pool *p = &nmd->pools[i]; int clsz = p->_clustsize; int clobjs = p->_clustentries; /* objects per cluster */ int mdl_len = sizeof(PFN_NUMBER) * BYTES_TO_PAGES(clsz); PPFN_NUMBER pSrc, pDst; /* each pool has a different cluster size so we need to reallocate */ tempMdl = IoAllocateMdl(p->lut[0].vaddr, clsz, FALSE, FALSE, NULL); if (tempMdl == NULL) { NMA_UNLOCK(nmd); nm_prerr("failed to allocate tempMdl"); IoFreeMdl(mainMdl); return NULL; } pSrc = MmGetMdlPfnArray(tempMdl); /* create one entry per cluster, the lut[] has one entry per object */ for (j = 0; j < p->numclusters; j++, ofs += clsz) { pDst = &MmGetMdlPfnArray(mainMdl)[BYTES_TO_PAGES(ofs)]; MmInitializeMdl(tempMdl, p->lut[j*clobjs].vaddr, clsz); MmBuildMdlForNonPagedPool(tempMdl); /* compute physical page addresses */ RtlCopyMemory(pDst, pSrc, mdl_len); /* copy the page descriptors */ mainMdl->MdlFlags = tempMdl->MdlFlags; /* XXX what is in here ?
*/ } IoFreeMdl(tempMdl); } NMA_UNLOCK(nmd); return mainMdl; } #endif /* _WIN32 */ /* * helper function for OS-specific mmap routines (currently only windows). * Given an nmd and a pool index, returns the cluster size and number of clusters. * Returns 0 if memory is finalised and the pool is valid, otherwise 1. * It should be called under NMA_LOCK(nmd) otherwise the underlying info can change. */ int netmap_mem2_get_pool_info(struct netmap_mem_d* nmd, u_int pool, u_int *clustsize, u_int *numclusters) { if (!nmd || !clustsize || !numclusters || pool >= NETMAP_POOLS_NR) return 1; /* invalid arguments */ // NMA_LOCK_ASSERT(nmd); if (!(nmd->flags & NETMAP_MEM_FINALIZED)) { *clustsize = *numclusters = 0; return 1; /* not ready yet */ } *clustsize = nmd->pools[pool]._clustsize; *numclusters = nmd->pools[pool].numclusters; return 0; /* success */ } static int netmap_mem2_get_info(struct netmap_mem_d* nmd, uint64_t* size, u_int *memflags, nm_memid_t *id) { int error = 0; error = netmap_mem_config(nmd); if (error) goto out; if (size) { if (nmd->flags & NETMAP_MEM_FINALIZED) { *size = nmd->nm_totalsize; } else { int i; *size = 0; for (i = 0; i < NETMAP_POOLS_NR; i++) { struct netmap_obj_pool *p = nmd->pools + i; *size += ((size_t)p->_numclusters * (size_t)p->_clustsize); } } } if (memflags) *memflags = nmd->flags; if (id) *id = nmd->nm_id; out: return error; } /* * we store objects by kernel address, need to find the offset * within the pool to export the value to userspace. * Algorithm: scan until we find the cluster, then add the * actual offset in the cluster */ static ssize_t netmap_obj_offset(struct netmap_obj_pool *p, const void *vaddr) { int i, k = p->_clustentries, n = p->objtotal; ssize_t ofs = 0; for (i = 0; i < n; i += k, ofs += p->_clustsize) { const char *base = p->lut[i].vaddr; ssize_t relofs = (const char *) vaddr - base; if (relofs < 0 || relofs >= p->_clustsize) continue; ofs = ofs + relofs; nm_prdis("%s: return offset %d (cluster %d) for pointer %p", p->name, ofs, i, vaddr); return ofs; } nm_prerr("address %p is not contained inside any cluster (%s)", vaddr, p->name); return 0; /* An error occurred */ } /* Helper functions which convert virtual addresses to offsets */ #define netmap_if_offset(n, v) \ netmap_obj_offset(&(n)->pools[NETMAP_IF_POOL], (v)) #define netmap_ring_offset(n, v) \ ((n)->pools[NETMAP_IF_POOL].memtotal + \ netmap_obj_offset(&(n)->pools[NETMAP_RING_POOL], (v))) static ssize_t netmap_mem2_if_offset(struct netmap_mem_d *nmd, const void *addr) { return netmap_if_offset(nmd, addr); } /* * report the index, and use start position as a hint, * otherwise buffer allocation becomes terribly expensive. 
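*
* The bitmap is scanned one uint32_t at a time, a set bit meaning
* "free". A worked example, with purely illustrative numbers: if
* *start == 3 and p->bitmap[3] == 0x104 (bits 2 and 8 set), the
* loop below clears bit 2 and returns the object at index
* 3*32 + 2 == 98; bit 8 is left for a later call, which resumes
* from the same word thanks to the start hint.
*/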
static void * netmap_obj_malloc(struct netmap_obj_pool *p, u_int len, uint32_t *start, uint32_t *index) { uint32_t i = 0; /* index in the bitmap */ uint32_t mask, j = 0; /* slot counter */ void *vaddr = NULL; if (len > p->_objsize) { nm_prerr("%s request size %d too large", p->name, len); return NULL; } if (p->objfree == 0) { nm_prerr("no more %s objects", p->name); return NULL; } if (start) i = *start; /* termination is guaranteed by p->objfree, but better check bounds on i */ while (vaddr == NULL && i < p->bitmap_slots) { uint32_t cur = p->bitmap[i]; if (cur == 0) { /* bitmask is fully used */ i++; continue; } /* locate a slot */ for (j = 0, mask = 1; (cur & mask) == 0; j++, mask <<= 1) ; p->bitmap[i] &= ~mask; /* mark object as in use */ p->objfree--; vaddr = p->lut[i * 32 + j].vaddr; if (index) *index = i * 32 + j; } nm_prdis("%s allocator: allocated object @ [%d][%d]: vaddr %p", p->name, i, j, vaddr); if (start) *start = i; return vaddr; } /* * free by index, not by address. * XXX should we also clean up the content? */ static int netmap_obj_free(struct netmap_obj_pool *p, uint32_t j) { uint32_t *ptr, mask; if (j >= p->objtotal) { nm_prerr("invalid index %u, max %u", j, p->objtotal); return 1; } ptr = &p->bitmap[j / 32]; mask = (1 << (j % 32)); if (*ptr & mask) { nm_prerr("ouch, double free on buffer %d", j); return 1; } else { *ptr |= mask; p->objfree++; return 0; } } /* * free by address. This is slow but is only used for a few * objects (rings, nifp) */ static void netmap_obj_free_va(struct netmap_obj_pool *p, void *vaddr) { u_int i, j, n = p->numclusters; for (i = 0, j = 0; i < n; i++, j += p->_clustentries) { void *base = p->lut[i * p->_clustentries].vaddr; ssize_t relofs = (ssize_t) vaddr - (ssize_t) base; /* The given address is out of the scope of the current cluster. */ if (base == NULL || vaddr < base || relofs >= p->_clustsize) continue; j = j + relofs / p->_objsize; /* KASSERT(j != 0, ("Cannot free object 0")); */ netmap_obj_free(p, j); return; } nm_prerr("address %p is not contained inside any cluster (%s)", vaddr, p->name); } unsigned netmap_mem_bufsize(struct netmap_mem_d *nmd) { return nmd->pools[NETMAP_BUF_POOL]._objsize; } #define netmap_if_malloc(n, len) netmap_obj_malloc(&(n)->pools[NETMAP_IF_POOL], len, NULL, NULL) #define netmap_if_free(n, v) netmap_obj_free_va(&(n)->pools[NETMAP_IF_POOL], (v)) #define netmap_ring_malloc(n, len) netmap_obj_malloc(&(n)->pools[NETMAP_RING_POOL], len, NULL, NULL) #define netmap_ring_free(n, v) netmap_obj_free_va(&(n)->pools[NETMAP_RING_POOL], (v)) #define netmap_buf_malloc(n, _pos, _index) \ netmap_obj_malloc(&(n)->pools[NETMAP_BUF_POOL], netmap_mem_bufsize(n), _pos, _index) #if 0 /* currently unused */ /* Return the index associated to the given packet buffer */ #define netmap_buf_index(n, v) \ (netmap_obj_offset(&(n)->pools[NETMAP_BUF_POOL], (v)) / NETMAP_BDG_BUF_SIZE(n)) #endif /* * allocate extra buffers in a linked list. * returns the actual number.
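*
* The list is threaded through the buffers themselves: the first
* uint32_t of each extra buffer holds the index of the next one,
* and 0 terminates the list. E.g. after allocating the (purely
* illustrative) indices 98, 99 and 100, in that order:
*
*	*head == 100, buf[100] -> 99, buf[99] -> 98, buf[98] -> 0
*/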
uint32_t netmap_extra_alloc(struct netmap_adapter *na, uint32_t *head, uint32_t n) { struct netmap_mem_d *nmd = na->nm_mem; uint32_t i, pos = 0; /* opaque, scan position in the bitmap */ NMA_LOCK(nmd); *head = 0; /* default, 'null' index i.e. empty list */ for (i = 0 ; i < n; i++) { uint32_t cur = *head; /* save current head */ uint32_t *p = netmap_buf_malloc(nmd, &pos, head); if (p == NULL) { nm_prerr("no more buffers after %d of %d", i, n); *head = cur; /* restore */ break; } nm_prdis(5, "allocate buffer %d -> %d", *head, cur); *p = cur; /* link to previous head */ } NMA_UNLOCK(nmd); return i; } static void netmap_extra_free(struct netmap_adapter *na, uint32_t head) { struct lut_entry *lut = na->na_lut.lut; struct netmap_mem_d *nmd = na->nm_mem; struct netmap_obj_pool *p = &nmd->pools[NETMAP_BUF_POOL]; uint32_t i, cur, *buf; nm_prdis("freeing the extra list"); for (i = 0; head >=2 && head < p->objtotal; i++) { cur = head; buf = lut[head].vaddr; head = *buf; *buf = 0; if (netmap_obj_free(p, cur)) break; } if (head != 0) nm_prerr("breaking with head %d", head); if (netmap_debug & NM_DEBUG_MEM) nm_prinf("freed %d buffers", i); } /* Return nonzero on error */ static int netmap_new_bufs(struct netmap_mem_d *nmd, struct netmap_slot *slot, u_int n) { struct netmap_obj_pool *p = &nmd->pools[NETMAP_BUF_POOL]; u_int i = 0; /* slot counter */ uint32_t pos = 0; /* slot in p->bitmap */ uint32_t index = 0; /* buffer index */ for (i = 0; i < n; i++) { void *vaddr = netmap_buf_malloc(nmd, &pos, &index); if (vaddr == NULL) { nm_prerr("no more buffers after %d of %d", i, n); goto cleanup; } slot[i].buf_idx = index; slot[i].len = p->_objsize; slot[i].flags = 0; slot[i].ptr = 0; } nm_prdis("%s: allocated %d buffers, %d available, first at %d", p->name, n, p->objfree, pos); return (0); cleanup: while (i > 0) { i--; netmap_obj_free(p, slot[i].buf_idx); } bzero(slot, n * sizeof(slot[0])); return (ENOMEM); } static void netmap_mem_set_ring(struct netmap_mem_d *nmd, struct netmap_slot *slot, u_int n, uint32_t index) { struct netmap_obj_pool *p = &nmd->pools[NETMAP_BUF_POOL]; u_int i; for (i = 0; i < n; i++) { slot[i].buf_idx = index; slot[i].len = p->_objsize; slot[i].flags = 0; } } static void netmap_free_buf(struct netmap_mem_d *nmd, uint32_t i) { struct netmap_obj_pool *p = &nmd->pools[NETMAP_BUF_POOL]; if (i < 2 || i >= p->objtotal) { nm_prerr("Cannot free buf#%d: should be in [2, %d[", i, p->objtotal); return; } netmap_obj_free(p, i); } static void netmap_free_bufs(struct netmap_mem_d *nmd, struct netmap_slot *slot, u_int n) { u_int i; for (i = 0; i < n; i++) { if (slot[i].buf_idx > 1) netmap_free_buf(nmd, slot[i].buf_idx); } nm_prdis("%s: released some buffers, available: %u", p->name, p->objfree); } static void netmap_reset_obj_allocator(struct netmap_obj_pool *p) { if (p == NULL) return; if (p->bitmap) nm_os_free(p->bitmap); p->bitmap = NULL; if (p->invalid_bitmap) nm_os_free(p->invalid_bitmap); p->invalid_bitmap = NULL; if (!p->alloc_done) { /* allocation was done by somebody else. * Let them clean up after themselves. */ return; } if (p->lut) { u_int i; /* * Free each cluster allocated in * netmap_finalize_obj_allocator(). The cluster start * addresses are stored at multiples of p->_clustentries * in the lut.
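* For example, with _clustentries == 2 the loop below calls
* contigfree() on lut[0].vaddr, lut[2].vaddr, lut[4].vaddr and so
* on: one free per cluster, not one per object.
*/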
for (i = 0; i < p->objtotal; i += p->_clustentries) { contigfree(p->lut[i].vaddr, p->_clustsize, M_NETMAP); } nm_free_lut(p->lut, p->objtotal); } p->lut = NULL; p->objtotal = 0; p->memtotal = 0; p->numclusters = 0; p->objfree = 0; p->alloc_done = 0; } /* * Free all resources related to an allocator. */ static void netmap_destroy_obj_allocator(struct netmap_obj_pool *p) { if (p == NULL) return; netmap_reset_obj_allocator(p); } /* * We receive a request for objtotal objects, of size objsize each. * Internally we may round up both numbers, as we allocate objects * in small clusters that are multiples of the page size. * We need to keep track of objtotal and clustentries, * as they are needed when freeing memory. * * XXX note -- userspace needs the buffers to be contiguous, * so we cannot afford gaps at the end of a cluster. */ /* call with NMA_LOCK held */ static int netmap_config_obj_allocator(struct netmap_obj_pool *p, u_int objtotal, u_int objsize) { int i; u_int clustsize; /* the cluster size, multiple of page size */ u_int clustentries; /* how many objects per cluster */ /* we store the current request, so we can * detect configuration changes later */ p->r_objtotal = objtotal; p->r_objsize = objsize; #define MAX_CLUSTSIZE (1<<22) // 4 MB #define LINE_ROUND NM_CACHE_ALIGN // 64 if (objsize >= MAX_CLUSTSIZE) { /* we could do it but there is no point */ nm_prerr("unsupported allocation for %d bytes", objsize); return EINVAL; } /* make sure objsize is a multiple of LINE_ROUND */ i = (objsize & (LINE_ROUND - 1)); if (i) { nm_prinf("aligning object by %d bytes", LINE_ROUND - i); objsize += LINE_ROUND - i; } if (objsize < p->objminsize || objsize > p->objmaxsize) { nm_prerr("requested objsize %d out of range [%d, %d]", objsize, p->objminsize, p->objmaxsize); return EINVAL; } if (objtotal < p->nummin || objtotal > p->nummax) { nm_prerr("requested objtotal %d out of range [%d, %d]", objtotal, p->nummin, p->nummax); return EINVAL; } /* * Compute number of objects using a brute-force approach: * given a max cluster size, * we try to fill it with objects keeping track of the * wasted space to the next page boundary. */ for (clustentries = 0, i = 1;; i++) { u_int delta, used = i * objsize; if (used > MAX_CLUSTSIZE) break; delta = used % PAGE_SIZE; if (delta == 0) { // exact solution clustentries = i; break; } } /* exact solution not found */ if (clustentries == 0) { nm_prerr("unsupported allocation for %d bytes", objsize); return EINVAL; } /* compute clustsize */ clustsize = clustentries * objsize; if (netmap_debug & NM_DEBUG_MEM) nm_prinf("objsize %d clustsize %d objects %d", objsize, clustsize, clustentries); /* * The number of clusters is n = ceil(objtotal/clustentries) * objtotal' = n * clustentries */ p->_clustentries = clustentries; p->_clustsize = clustsize; p->_numclusters = (objtotal + clustentries - 1) / clustentries; /* actual values (may be larger than requested) */ p->_objsize = objsize; p->_objtotal = p->_numclusters * clustentries; return 0; } /* call with NMA_LOCK held */ static int netmap_finalize_obj_allocator(struct netmap_obj_pool *p) { int i; /* must be signed */ size_t n; if (p->lut) { /* if the lut is already there we assume that also all the * clusters have already been allocated, possibly by somebody * else (e.g., extmem). In the latter case, the alloc_done flag * will remain at zero, so that we will not attempt to * deallocate the clusters by ourselves in * netmap_reset_obj_allocator.
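* (In short: alloc_done == 1 marks clusters that this allocator
* obtained via contigmalloc() below and is therefore responsible
* for freeing.)
*/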
*/ return 0; } /* optimistically assume we have enough memory */ p->numclusters = p->_numclusters; p->objtotal = p->_objtotal; p->alloc_done = 1; p->lut = nm_alloc_lut(p->objtotal); if (p->lut == NULL) { nm_prerr("Unable to create lookup table for '%s'", p->name); goto clean; } /* * Allocate clusters, init pointers */ n = p->_clustsize; for (i = 0; i < (int)p->objtotal;) { int lim = i + p->_clustentries; char *clust; /* * XXX Note, we only need contigmalloc() for buffers attached * to native interfaces. In all other cases (nifp, netmap rings * and even buffers for VALE ports or emulated interfaces) we * can live with standard malloc, because the hardware will not * access the pages directly. */ clust = contigmalloc(n, M_NETMAP, M_NOWAIT | M_ZERO, (size_t)0, -1UL, PAGE_SIZE, 0); if (clust == NULL) { /* * If we get here, there is a severe memory shortage, * so halve the allocated memory to reclaim some. */ nm_prerr("Unable to create cluster at %d for '%s' allocator", i, p->name); if (i < 2) /* nothing to halve */ goto out; lim = i / 2; for (i--; i >= lim; i--) { if (i % p->_clustentries == 0 && p->lut[i].vaddr) contigfree(p->lut[i].vaddr, n, M_NETMAP); p->lut[i].vaddr = NULL; } out: p->objtotal = i; /* we may have stopped in the middle of a cluster */ p->numclusters = (i + p->_clustentries - 1) / p->_clustentries; break; } /* * Set lut state for all buffers in the current cluster. * * [i, lim) is the set of buffer indexes that cover the * current cluster. * * 'clust' is really the address of the current buffer in * the current cluster as we index through it with a stride * of p->_objsize. */ for (; i < lim; i++, clust += p->_objsize) { p->lut[i].vaddr = clust; #if !defined(linux) && !defined(_WIN32) p->lut[i].paddr = vtophys(clust); #endif } } p->memtotal = (size_t)p->numclusters * (size_t)p->_clustsize; if (netmap_verbose) nm_prinf("Pre-allocated %d clusters (%d/%zuKB) for '%s'", p->numclusters, p->_clustsize >> 10, p->memtotal >> 10, p->name); return 0; clean: netmap_reset_obj_allocator(p); return ENOMEM; } /* call with lock held */ static int netmap_mem_params_changed(struct netmap_obj_params* p) { int i, rv = 0; for (i = 0; i < NETMAP_POOLS_NR; i++) { if (p[i].last_size != p[i].size || p[i].last_num != p[i].num) { p[i].last_size = p[i].size; p[i].last_num = p[i].num; rv = 1; } } return rv; } static void netmap_mem_reset_all(struct netmap_mem_d *nmd) { int i; if (netmap_debug & NM_DEBUG_MEM) nm_prinf("resetting %p", nmd); for (i = 0; i < NETMAP_POOLS_NR; i++) { netmap_reset_obj_allocator(&nmd->pools[i]); } nmd->flags &= ~NETMAP_MEM_FINALIZED; } static int netmap_mem_unmap(struct netmap_obj_pool *p, struct netmap_adapter *na) { int i, lim = p->objtotal; struct netmap_lut *lut = &na->na_lut; if (na == NULL || na->pdev == NULL) return 0; #if defined(__FreeBSD__) /* On FreeBSD mapping and unmapping is performed by the txsync * and rxsync routine, packet by packet. 
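* Only the Linux path keeps a per-adapter physical lookup table
* (na->na_lut.plut); there, the code below unloads one DMA mapping
* per cluster and then frees the table itself.
*/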
*/ (void)i; (void)lim; (void)lut; #elif defined(_WIN32) (void)i; (void)lim; (void)lut; nm_prerr("unsupported on Windows"); #else /* linux */ nm_prdis("unmapping and freeing plut for %s", na->name); if (lut->plut == NULL) return 0; for (i = 0; i < lim; i += p->_clustentries) { if (lut->plut[i].paddr) netmap_unload_map(na, (bus_dma_tag_t) na->pdev, &lut->plut[i].paddr, p->_clustsize); } nm_free_plut(lut->plut); lut->plut = NULL; #endif /* linux */ return 0; } static int netmap_mem_map(struct netmap_obj_pool *p, struct netmap_adapter *na) { int error = 0; int i, lim = p->objtotal; struct netmap_lut *lut = &na->na_lut; if (na->pdev == NULL) return 0; #if defined(__FreeBSD__) /* On FreeBSD mapping and unmapping is performed by the txsync * and rxsync routine, packet by packet. */ (void)i; (void)lim; (void)lut; #elif defined(_WIN32) (void)i; (void)lim; (void)lut; nm_prerr("unsupported on Windows"); #else /* linux */ if (lut->plut != NULL) { nm_prdis("plut already allocated for %s", na->name); return 0; } nm_prdis("allocating physical lut for %s", na->name); lut->plut = nm_alloc_plut(lim); if (lut->plut == NULL) { nm_prerr("Failed to allocate physical lut for %s", na->name); return ENOMEM; } for (i = 0; i < lim; i += p->_clustentries) { lut->plut[i].paddr = 0; } for (i = 0; i < lim; i += p->_clustentries) { int j; if (p->lut[i].vaddr == NULL) continue; error = netmap_load_map(na, (bus_dma_tag_t) na->pdev, &lut->plut[i].paddr, p->lut[i].vaddr, p->_clustsize); if (error) { nm_prerr("Failed to map cluster #%d from the %s pool", i, p->name); break; } for (j = 1; j < p->_clustentries; j++) { lut->plut[i + j].paddr = lut->plut[i + j - 1].paddr + p->_objsize; } } if (error) netmap_mem_unmap(p, na); #endif /* linux */ return error; } static int netmap_mem_finalize_all(struct netmap_mem_d *nmd) { int i; if (nmd->flags & NETMAP_MEM_FINALIZED) return 0; nmd->lasterr = 0; nmd->nm_totalsize = 0; for (i = 0; i < NETMAP_POOLS_NR; i++) { nmd->lasterr = netmap_finalize_obj_allocator(&nmd->pools[i]); if (nmd->lasterr) goto error; nmd->nm_totalsize += nmd->pools[i].memtotal; } nmd->lasterr = netmap_mem_init_bitmaps(nmd); if (nmd->lasterr) goto error; nmd->flags |= NETMAP_MEM_FINALIZED; if (netmap_verbose) nm_prinf("interfaces %zd KB, rings %zd KB, buffers %zd MB", nmd->pools[NETMAP_IF_POOL].memtotal >> 10, nmd->pools[NETMAP_RING_POOL].memtotal >> 10, nmd->pools[NETMAP_BUF_POOL].memtotal >> 20); if (netmap_verbose) nm_prinf("Free buffers: %d", nmd->pools[NETMAP_BUF_POOL].objfree); return 0; error: netmap_mem_reset_all(nmd); return nmd->lasterr; } /* * allocator for private memory */ static void * _netmap_mem_private_new(size_t size, struct netmap_obj_params *p, struct netmap_mem_ops *ops, int *perr) { struct netmap_mem_d *d = NULL; int i, err = 0; d = nm_os_malloc(size); if (d == NULL) { err = ENOMEM; goto error; } *d = nm_blueprint; d->ops = ops; err = nm_mem_assign_id(d); if (err) goto error_free; snprintf(d->name, NM_MEM_NAMESZ, "%d", d->nm_id); for (i = 0; i < NETMAP_POOLS_NR; i++) { snprintf(d->pools[i].name, NETMAP_POOL_MAX_NAMSZ, nm_blueprint.pools[i].name, d->name); d->params[i].num = p[i].num; d->params[i].size = p[i].size; } NMA_LOCK_INIT(d); err = netmap_mem_config(d); if (err) goto error_rel_id; d->flags &= ~NETMAP_MEM_FINALIZED; return d; error_rel_id: NMA_LOCK_DESTROY(d); nm_mem_release_id(d); error_free: nm_os_free(d); error: if (perr) *perr = err; return NULL; } struct netmap_mem_d * netmap_mem_private_new(u_int txr, u_int txd, u_int rxr, u_int rxd, u_int extra_bufs, u_int npipes, int *perr) { struct 
netmap_mem_d *d = NULL; struct netmap_obj_params p[NETMAP_POOLS_NR]; int i; u_int v, maxd; /* account for the fake host rings */ txr++; rxr++; /* copy the min values */ for (i = 0; i < NETMAP_POOLS_NR; i++) { p[i] = netmap_min_priv_params[i]; } /* possibly increase them to fit user request */ v = sizeof(struct netmap_if) + sizeof(ssize_t) * (txr + rxr); if (p[NETMAP_IF_POOL].size < v) p[NETMAP_IF_POOL].size = v; v = 2 + 4 * npipes; if (p[NETMAP_IF_POOL].num < v) p[NETMAP_IF_POOL].num = v; maxd = (txd > rxd) ? txd : rxd; v = sizeof(struct netmap_ring) + sizeof(struct netmap_slot) * maxd; if (p[NETMAP_RING_POOL].size < v) p[NETMAP_RING_POOL].size = v; /* each pipe endpoint needs two tx rings (1 normal + 1 host, fake) * and two rx rings (again, 1 normal and 1 fake host) */ v = txr + rxr + 8 * npipes; if (p[NETMAP_RING_POOL].num < v) p[NETMAP_RING_POOL].num = v; /* for each pipe we only need the buffers for the 4 "real" rings. * On the other hand, the pipe ring dimension may be different from * the parent port ring dimension. As a compromise, we allocate twice the * space that would actually be needed if the pipe rings were the same size as the parent rings */ v = (4 * npipes + rxr) * rxd + (4 * npipes + txr) * txd + 2 + extra_bufs; /* the +2 is for the tx and rx fake buffers (indices 0 and 1) */ if (p[NETMAP_BUF_POOL].num < v) p[NETMAP_BUF_POOL].num = v; if (netmap_verbose) nm_prinf("req if %d*%d ring %d*%d buf %d*%d", p[NETMAP_IF_POOL].num, p[NETMAP_IF_POOL].size, p[NETMAP_RING_POOL].num, p[NETMAP_RING_POOL].size, p[NETMAP_BUF_POOL].num, p[NETMAP_BUF_POOL].size); d = _netmap_mem_private_new(sizeof(*d), p, &netmap_mem_global_ops, perr); return d; } /* call with lock held */ static int netmap_mem2_config(struct netmap_mem_d *nmd) { int i; if (!netmap_mem_params_changed(nmd->params)) goto out; nm_prdis("reconfiguring"); if (nmd->flags & NETMAP_MEM_FINALIZED) { /* reset previous allocation */ for (i = 0; i < NETMAP_POOLS_NR; i++) { netmap_reset_obj_allocator(&nmd->pools[i]); } nmd->flags &= ~NETMAP_MEM_FINALIZED; } for (i = 0; i < NETMAP_POOLS_NR; i++) { nmd->lasterr = netmap_config_obj_allocator(&nmd->pools[i], nmd->params[i].num, nmd->params[i].size); if (nmd->lasterr) goto out; } out: return nmd->lasterr; } static int netmap_mem2_finalize(struct netmap_mem_d *nmd) { if (nmd->flags & NETMAP_MEM_FINALIZED) goto out; if (netmap_mem_finalize_all(nmd)) goto out; nmd->lasterr = 0; out: return nmd->lasterr; } static void netmap_mem2_delete(struct netmap_mem_d *nmd) { int i; for (i = 0; i < NETMAP_POOLS_NR; i++) { netmap_destroy_obj_allocator(&nmd->pools[i]); } NMA_LOCK_DESTROY(nmd); if (nmd != &nm_mem) nm_os_free(nmd); } #ifdef WITH_EXTMEM /* doubly linked list of all existing external allocators */ static struct netmap_mem_ext *netmap_mem_ext_list = NULL; NM_MTX_T nm_mem_ext_list_lock; #endif /* WITH_EXTMEM */ int netmap_mem_init(void) { NM_MTX_INIT(nm_mem_list_lock); NMA_LOCK_INIT(&nm_mem); netmap_mem_get(&nm_mem); #ifdef WITH_EXTMEM NM_MTX_INIT(nm_mem_ext_list_lock); #endif /* WITH_EXTMEM */ return (0); } void netmap_mem_fini(void) { netmap_mem_put(&nm_mem); } static void netmap_free_rings(struct netmap_adapter *na) { enum txrx t; for_rx_tx(t) { u_int i; for (i = 0; i < netmap_all_rings(na, t); i++) { struct netmap_kring *kring = NMR(na, t)[i]; struct netmap_ring *ring = kring->ring; if (ring == NULL || kring->users > 0 || (kring->nr_kflags & NKR_NEEDRING)) { if (netmap_debug & NM_DEBUG_MEM) nm_prinf("NOT deleting ring %s (ring %p, users %d needring %d)", kring->name, ring, kring->users, kring->nr_kflags &
NKR_NEEDRING); continue; } if (netmap_debug & NM_DEBUG_MEM) nm_prinf("deleting ring %s", kring->name); if (!(kring->nr_kflags & NKR_FAKERING)) { nm_prdis("freeing bufs for %s", kring->name); netmap_free_bufs(na->nm_mem, ring->slot, kring->nkr_num_slots); } else { nm_prdis("NOT freeing bufs for %s", kring->name); } netmap_ring_free(na->nm_mem, ring); kring->ring = NULL; } } } /* call with NMA_LOCK held. * * Allocate netmap rings and buffers for this card. * The rings are contiguous, but have variable size. * The kring array must follow the layout described * in netmap_krings_create(). */ static int netmap_mem2_rings_create(struct netmap_adapter *na) { enum txrx t; for_rx_tx(t) { u_int i; for (i = 0; i < netmap_all_rings(na, t); i++) { struct netmap_kring *kring = NMR(na, t)[i]; struct netmap_ring *ring = kring->ring; u_int len, ndesc; if (ring || (!kring->users && !(kring->nr_kflags & NKR_NEEDRING))) { /* unneeded, or already created by somebody else */ if (netmap_debug & NM_DEBUG_MEM) nm_prinf("NOT creating ring %s (ring %p, users %d needring %d)", kring->name, ring, kring->users, kring->nr_kflags & NKR_NEEDRING); continue; } if (netmap_debug & NM_DEBUG_MEM) nm_prinf("creating %s", kring->name); ndesc = kring->nkr_num_slots; len = sizeof(struct netmap_ring) + ndesc * sizeof(struct netmap_slot); ring = netmap_ring_malloc(na->nm_mem, len); if (ring == NULL) { nm_prerr("Cannot allocate %s_ring", nm_txrx2str(t)); goto cleanup; } nm_prdis("txring at %p", ring); kring->ring = ring; *(uint32_t *)(uintptr_t)&ring->num_slots = ndesc; *(int64_t *)(uintptr_t)&ring->buf_ofs = (na->nm_mem->pools[NETMAP_IF_POOL].memtotal + na->nm_mem->pools[NETMAP_RING_POOL].memtotal) - netmap_ring_offset(na->nm_mem, ring); /* copy values from kring */ ring->head = kring->rhead; ring->cur = kring->rcur; ring->tail = kring->rtail; *(uint32_t *)(uintptr_t)&ring->nr_buf_size = netmap_mem_bufsize(na->nm_mem); nm_prdis("%s h %d c %d t %d", kring->name, ring->head, ring->cur, ring->tail); nm_prdis("initializing slots for %s_ring", nm_txrx2str(t)); if (!(kring->nr_kflags & NKR_FAKERING)) { /* this is a real ring */ if (netmap_debug & NM_DEBUG_MEM) nm_prinf("allocating buffers for %s", kring->name); if (netmap_new_bufs(na->nm_mem, ring->slot, ndesc)) { nm_prerr("Cannot allocate buffers for %s_ring", nm_txrx2str(t)); goto cleanup; } } else { /* this is a fake ring, set all indices to 0 */ if (netmap_debug & NM_DEBUG_MEM) nm_prinf("NOT allocating buffers for %s", kring->name); netmap_mem_set_ring(na->nm_mem, ring->slot, ndesc, 0); } /* ring info */ *(uint16_t *)(uintptr_t)&ring->ringid = kring->ring_id; *(uint16_t *)(uintptr_t)&ring->dir = kring->tx; } } return 0; cleanup: /* we cannot actually clean up here, since we don't own kring->users * and kring->nr_kflags & NKR_NEEDRING. The caller must decrement * the first or zero-out the second, then call netmap_free_rings() * to do the cleanup */ return ENOMEM; } static void netmap_mem2_rings_delete(struct netmap_adapter *na) { /* last instance, release bufs and rings */ netmap_free_rings(na); } /* call with NMA_LOCK held */ /* * Allocate the per-fd structure netmap_if. * * We assume that the configuration stored in na * (number of tx/rx rings and descs) does not change while * the interface is in netmap mode.
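*
* The netmap_if is followed inline by one ssize_t per ring, holding
* the offset of that ring from the nifp itself. Userspace reaches
* a ring with pointer arithmetic, along the lines of the
* NETMAP_TXRING()/NETMAP_RXRING() macros (sketch):
*
*	struct netmap_ring *r = (struct netmap_ring *)
*		((char *)nifp + nifp->ring_ofs[i]);
*/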
*/ static struct netmap_if * netmap_mem2_if_new(struct netmap_adapter *na, struct netmap_priv_d *priv) { struct netmap_if *nifp; ssize_t base; /* handy for relative offsets between rings and nifp */ u_int i, len, n[NR_TXRX], ntot; enum txrx t; ntot = 0; for_rx_tx(t) { /* account for the (eventually fake) host rings */ n[t] = netmap_all_rings(na, t); ntot += n[t]; } /* * the descriptor is followed inline by an array of offsets * to the tx and rx rings in the shared memory region. */ len = sizeof(struct netmap_if) + (ntot * sizeof(ssize_t)); nifp = netmap_if_malloc(na->nm_mem, len); if (nifp == NULL) { NMA_UNLOCK(na->nm_mem); return NULL; } /* initialize base fields -- override const */ *(u_int *)(uintptr_t)&nifp->ni_tx_rings = na->num_tx_rings; *(u_int *)(uintptr_t)&nifp->ni_rx_rings = na->num_rx_rings; *(u_int *)(uintptr_t)&nifp->ni_host_tx_rings = (na->num_host_tx_rings ? na->num_host_tx_rings : 1); *(u_int *)(uintptr_t)&nifp->ni_host_rx_rings = (na->num_host_rx_rings ? na->num_host_rx_rings : 1); strlcpy(nifp->ni_name, na->name, sizeof(nifp->ni_name)); /* * fill the slots for the rx and tx rings. They contain the offset * between the ring and nifp, so the information is usable in * userspace to reach the ring from the nifp. */ base = netmap_if_offset(na->nm_mem, nifp); for (i = 0; i < n[NR_TX]; i++) { /* XXX instead of ofs == 0 maybe use the offset of an error * ring, like we do for buffers? */ ssize_t ofs = 0; if (na->tx_rings[i]->ring != NULL && i >= priv->np_qfirst[NR_TX] && i < priv->np_qlast[NR_TX]) { ofs = netmap_ring_offset(na->nm_mem, na->tx_rings[i]->ring) - base; } *(ssize_t *)(uintptr_t)&nifp->ring_ofs[i] = ofs; } for (i = 0; i < n[NR_RX]; i++) { /* XXX instead of ofs == 0 maybe use the offset of an error * ring, like we do for buffers? 
*/ ssize_t ofs = 0; if (na->rx_rings[i]->ring != NULL && i >= priv->np_qfirst[NR_RX] && i < priv->np_qlast[NR_RX]) { ofs = netmap_ring_offset(na->nm_mem, na->rx_rings[i]->ring) - base; } *(ssize_t *)(uintptr_t)&nifp->ring_ofs[i+n[NR_TX]] = ofs; } return (nifp); } static void netmap_mem2_if_delete(struct netmap_adapter *na, struct netmap_if *nifp) { if (nifp == NULL) /* nothing to do */ return; if (nifp->ni_bufs_head) netmap_extra_free(na, nifp->ni_bufs_head); netmap_if_free(na->nm_mem, nifp); } static void netmap_mem2_deref(struct netmap_mem_d *nmd) { if (netmap_debug & NM_DEBUG_MEM) nm_prinf("active = %d", nmd->active); } struct netmap_mem_ops netmap_mem_global_ops = { .nmd_get_lut = netmap_mem2_get_lut, .nmd_get_info = netmap_mem2_get_info, .nmd_ofstophys = netmap_mem2_ofstophys, .nmd_config = netmap_mem2_config, .nmd_finalize = netmap_mem2_finalize, .nmd_deref = netmap_mem2_deref, .nmd_delete = netmap_mem2_delete, .nmd_if_offset = netmap_mem2_if_offset, .nmd_if_new = netmap_mem2_if_new, .nmd_if_delete = netmap_mem2_if_delete, .nmd_rings_create = netmap_mem2_rings_create, .nmd_rings_delete = netmap_mem2_rings_delete }; int netmap_mem_pools_info_get(struct nmreq_pools_info *req, struct netmap_mem_d *nmd) { int ret; ret = netmap_mem_get_info(nmd, &req->nr_memsize, NULL, &req->nr_mem_id); if (ret) { return ret; } NMA_LOCK(nmd); req->nr_if_pool_offset = 0; req->nr_if_pool_objtotal = nmd->pools[NETMAP_IF_POOL].objtotal; req->nr_if_pool_objsize = nmd->pools[NETMAP_IF_POOL]._objsize; req->nr_ring_pool_offset = nmd->pools[NETMAP_IF_POOL].memtotal; req->nr_ring_pool_objtotal = nmd->pools[NETMAP_RING_POOL].objtotal; req->nr_ring_pool_objsize = nmd->pools[NETMAP_RING_POOL]._objsize; req->nr_buf_pool_offset = nmd->pools[NETMAP_IF_POOL].memtotal + nmd->pools[NETMAP_RING_POOL].memtotal; req->nr_buf_pool_objtotal = nmd->pools[NETMAP_BUF_POOL].objtotal; req->nr_buf_pool_objsize = nmd->pools[NETMAP_BUF_POOL]._objsize; NMA_UNLOCK(nmd); return 0; } #ifdef WITH_EXTMEM struct netmap_mem_ext { struct netmap_mem_d up; struct nm_os_extmem *os; struct netmap_mem_ext *next, *prev; }; /* call with nm_mem_list_lock held */ static void netmap_mem_ext_register(struct netmap_mem_ext *e) { NM_MTX_LOCK(nm_mem_ext_list_lock); if (netmap_mem_ext_list) netmap_mem_ext_list->prev = e; e->next = netmap_mem_ext_list; netmap_mem_ext_list = e; e->prev = NULL; NM_MTX_UNLOCK(nm_mem_ext_list_lock); } /* call with nm_mem_list_lock held */ static void netmap_mem_ext_unregister(struct netmap_mem_ext *e) { if (e->prev) e->prev->next = e->next; else netmap_mem_ext_list = e->next; if (e->next) e->next->prev = e->prev; e->prev = e->next = NULL; } static struct netmap_mem_ext * netmap_mem_ext_search(struct nm_os_extmem *os) { struct netmap_mem_ext *e; NM_MTX_LOCK(nm_mem_ext_list_lock); for (e = netmap_mem_ext_list; e; e = e->next) { if (nm_os_extmem_isequal(e->os, os)) { netmap_mem_get(&e->up); break; } } NM_MTX_UNLOCK(nm_mem_ext_list_lock); return e; } static void netmap_mem_ext_delete(struct netmap_mem_d *d) { int i; struct netmap_mem_ext *e = (struct netmap_mem_ext *)d; netmap_mem_ext_unregister(e); for (i = 0; i < NETMAP_POOLS_NR; i++) { struct netmap_obj_pool *p = &d->pools[i]; if (p->lut) { nm_free_lut(p->lut, p->objtotal); p->lut = NULL; } } if (e->os) nm_os_extmem_delete(e->os); netmap_mem2_delete(d); } static int netmap_mem_ext_config(struct netmap_mem_d *nmd) { return 0; } struct netmap_mem_ops netmap_mem_ext_ops = { .nmd_get_lut = netmap_mem2_get_lut, .nmd_get_info = netmap_mem2_get_info, .nmd_ofstophys = netmap_mem2_ofstophys, 
.nmd_config = netmap_mem_ext_config, .nmd_finalize = netmap_mem2_finalize, .nmd_deref = netmap_mem2_deref, .nmd_delete = netmap_mem_ext_delete, .nmd_if_offset = netmap_mem2_if_offset, .nmd_if_new = netmap_mem2_if_new, .nmd_if_delete = netmap_mem2_if_delete, .nmd_rings_create = netmap_mem2_rings_create, .nmd_rings_delete = netmap_mem2_rings_delete }; struct netmap_mem_d * netmap_mem_ext_create(uint64_t usrptr, struct nmreq_pools_info *pi, int *perror) { int error = 0; int i, j; struct netmap_mem_ext *nme; char *clust; size_t off; struct nm_os_extmem *os = NULL; int nr_pages; // XXX sanity checks if (pi->nr_if_pool_objtotal == 0) pi->nr_if_pool_objtotal = netmap_min_priv_params[NETMAP_IF_POOL].num; if (pi->nr_if_pool_objsize == 0) pi->nr_if_pool_objsize = netmap_min_priv_params[NETMAP_IF_POOL].size; if (pi->nr_ring_pool_objtotal == 0) pi->nr_ring_pool_objtotal = netmap_min_priv_params[NETMAP_RING_POOL].num; if (pi->nr_ring_pool_objsize == 0) pi->nr_ring_pool_objsize = netmap_min_priv_params[NETMAP_RING_POOL].size; if (pi->nr_buf_pool_objtotal == 0) pi->nr_buf_pool_objtotal = netmap_min_priv_params[NETMAP_BUF_POOL].num; if (pi->nr_buf_pool_objsize == 0) pi->nr_buf_pool_objsize = netmap_min_priv_params[NETMAP_BUF_POOL].size; if (netmap_verbose & NM_DEBUG_MEM) nm_prinf("if %d %d ring %d %d buf %d %d", pi->nr_if_pool_objtotal, pi->nr_if_pool_objsize, pi->nr_ring_pool_objtotal, pi->nr_ring_pool_objsize, pi->nr_buf_pool_objtotal, pi->nr_buf_pool_objsize); os = nm_os_extmem_create(usrptr, pi, &error); if (os == NULL) { nm_prerr("os extmem creation failed"); goto out; } nme = netmap_mem_ext_search(os); if (nme) { nm_os_extmem_delete(os); return &nme->up; } if (netmap_verbose & NM_DEBUG_MEM) nm_prinf("not found, creating new"); nme = _netmap_mem_private_new(sizeof(*nme), (struct netmap_obj_params[]){ { pi->nr_if_pool_objsize, pi->nr_if_pool_objtotal }, { pi->nr_ring_pool_objsize, pi->nr_ring_pool_objtotal }, { pi->nr_buf_pool_objsize, pi->nr_buf_pool_objtotal }}, &netmap_mem_ext_ops, &error); if (nme == NULL) goto out_unmap; nr_pages = nm_os_extmem_nr_pages(os); /* from now on pages will be released by nme destructor; * we let res = 0 to prevent release in out_unmap below */ nme->os = os; os = NULL; /* pass ownership */ clust = nm_os_extmem_nextpage(nme->os); off = 0; for (i = 0; i < NETMAP_POOLS_NR; i++) { struct netmap_obj_pool *p = &nme->up.pools[i]; struct netmap_obj_params *o = &nme->up.params[i]; p->_objsize = o->size; p->_clustsize = o->size; p->_clustentries = 1; p->lut = nm_alloc_lut(o->num); if (p->lut == NULL) { error = ENOMEM; goto out_delete; } p->bitmap_slots = (o->num + sizeof(uint32_t) - 1) / sizeof(uint32_t); p->invalid_bitmap = nm_os_malloc(sizeof(uint32_t) * p->bitmap_slots); if (p->invalid_bitmap == NULL) { error = ENOMEM; goto out_delete; } if (nr_pages == 0) { p->objtotal = 0; p->memtotal = 0; p->objfree = 0; continue; } for (j = 0; j < o->num && nr_pages > 0; j++) { size_t noff; p->lut[j].vaddr = clust + off; #if !defined(linux) && !defined(_WIN32) p->lut[j].paddr = vtophys(p->lut[j].vaddr); #endif nm_prdis("%s %d at %p", p->name, j, p->lut[j].vaddr); noff = off + p->_objsize; if (noff < PAGE_SIZE) { off = noff; continue; } nm_prdis("too big, recomputing offset..."); while (noff >= PAGE_SIZE) { char *old_clust = clust; noff -= PAGE_SIZE; clust = nm_os_extmem_nextpage(nme->os); nr_pages--; nm_prdis("noff %zu page %p nr_pages %d", noff, page_to_virt(*pages), nr_pages); if (noff > 0 && !nm_isset(p->invalid_bitmap, j) && (nr_pages == 0 || old_clust + PAGE_SIZE != clust)) { /* out 
of space or non contiguous, * drop this object * */ p->invalid_bitmap[ (j>>5) ] |= 1U << (j & 31U); nm_prdis("non contiguous at off %zu, drop", noff); } if (nr_pages == 0) break; } off = noff; } p->objtotal = j; p->numclusters = p->objtotal; p->memtotal = j * (size_t)p->_objsize; nm_prdis("%d memtotal %zu", j, p->memtotal); } netmap_mem_ext_register(nme); return &nme->up; out_delete: netmap_mem_put(&nme->up); out_unmap: if (os) nm_os_extmem_delete(os); out: if (perror) *perror = error; return NULL; } #endif /* WITH_EXTMEM */ #ifdef WITH_PTNETMAP struct mem_pt_if { struct mem_pt_if *next; struct ifnet *ifp; unsigned int nifp_offset; }; /* Netmap allocator for ptnetmap guests. */ struct netmap_mem_ptg { struct netmap_mem_d up; vm_paddr_t nm_paddr; /* physical address in the guest */ void *nm_addr; /* virtual address in the guest */ struct netmap_lut buf_lut; /* lookup table for BUF pool in the guest */ nm_memid_t host_mem_id; /* allocator identifier in the host */ struct ptnetmap_memdev *ptn_dev;/* ptnetmap memdev */ struct mem_pt_if *pt_ifs; /* list of interfaces in passthrough */ }; /* Link a passthrough interface to a passthrough netmap allocator. */ static int netmap_mem_pt_guest_ifp_add(struct netmap_mem_d *nmd, struct ifnet *ifp, unsigned int nifp_offset) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)nmd; struct mem_pt_if *ptif = nm_os_malloc(sizeof(*ptif)); if (!ptif) { return ENOMEM; } NMA_LOCK(nmd); ptif->ifp = ifp; ptif->nifp_offset = nifp_offset; if (ptnmd->pt_ifs) { ptif->next = ptnmd->pt_ifs; } ptnmd->pt_ifs = ptif; NMA_UNLOCK(nmd); nm_prinf("ifp=%s,nifp_offset=%u", ptif->ifp->if_xname, ptif->nifp_offset); return 0; } /* Called with NMA_LOCK(nmd) held. */ static struct mem_pt_if * netmap_mem_pt_guest_ifp_lookup(struct netmap_mem_d *nmd, struct ifnet *ifp) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)nmd; struct mem_pt_if *curr; for (curr = ptnmd->pt_ifs; curr; curr = curr->next) { if (curr->ifp == ifp) { return curr; } } return NULL; } /* Unlink a passthrough interface from a passthrough netmap allocator. 
*/ int netmap_mem_pt_guest_ifp_del(struct netmap_mem_d *nmd, struct ifnet *ifp) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)nmd; struct mem_pt_if *prev = NULL; struct mem_pt_if *curr; int ret = -1; NMA_LOCK(nmd); for (curr = ptnmd->pt_ifs; curr; curr = curr->next) { if (curr->ifp == ifp) { if (prev) { prev->next = curr->next; } else { ptnmd->pt_ifs = curr->next; } - nm_prinf("removed (ifp=%p,nifp_offset=%u)", - curr->ifp, curr->nifp_offset); + nm_prinf("removed (ifp=%s,nifp_offset=%u)", + curr->ifp->if_xname, curr->nifp_offset); nm_os_free(curr); ret = 0; break; } prev = curr; } NMA_UNLOCK(nmd); return ret; } static int netmap_mem_pt_guest_get_lut(struct netmap_mem_d *nmd, struct netmap_lut *lut) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)nmd; if (!(nmd->flags & NETMAP_MEM_FINALIZED)) { return EINVAL; } *lut = ptnmd->buf_lut; return 0; } static int netmap_mem_pt_guest_get_info(struct netmap_mem_d *nmd, uint64_t *size, u_int *memflags, uint16_t *id) { int error = 0; error = nmd->ops->nmd_config(nmd); if (error) goto out; if (size) *size = nmd->nm_totalsize; if (memflags) *memflags = nmd->flags; if (id) *id = nmd->nm_id; out: return error; } static vm_paddr_t netmap_mem_pt_guest_ofstophys(struct netmap_mem_d *nmd, vm_ooffset_t off) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)nmd; vm_paddr_t paddr; /* if the offset is valid, just return csb->base_addr + off */ paddr = (vm_paddr_t)(ptnmd->nm_paddr + off); nm_prdis("off %lx padr %lx", off, (unsigned long)paddr); return paddr; } static int netmap_mem_pt_guest_config(struct netmap_mem_d *nmd) { /* nothing to do, we are configured on creation * and configuration never changes thereafter */ return 0; } static int netmap_mem_pt_guest_finalize(struct netmap_mem_d *nmd) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)nmd; uint64_t mem_size; uint32_t bufsize; uint32_t nbuffers; uint32_t poolofs; vm_paddr_t paddr; char *vaddr; int i; int error = 0; if (nmd->flags & NETMAP_MEM_FINALIZED) goto out; if (ptnmd->ptn_dev == NULL) { nm_prerr("ptnetmap memdev not attached"); error = ENOMEM; goto out; } /* Map memory through ptnetmap-memdev BAR. */ error = nm_os_pt_memdev_iomap(ptnmd->ptn_dev, &ptnmd->nm_paddr, &ptnmd->nm_addr, &mem_size); if (error) goto out; /* Initialize the lut using the information contained in the * ptnetmap memory device. */ bufsize = nm_os_pt_memdev_ioread(ptnmd->ptn_dev, PTNET_MDEV_IO_BUF_POOL_OBJSZ); nbuffers = nm_os_pt_memdev_ioread(ptnmd->ptn_dev, PTNET_MDEV_IO_BUF_POOL_OBJNUM); /* allocate the lut */ if (ptnmd->buf_lut.lut == NULL) { nm_prinf("allocating lut"); ptnmd->buf_lut.lut = nm_alloc_lut(nbuffers); if (ptnmd->buf_lut.lut == NULL) { nm_prerr("lut allocation failed"); return ENOMEM; } } /* we have physically contiguous memory mapped through PCI BAR */ poolofs = nm_os_pt_memdev_ioread(ptnmd->ptn_dev, PTNET_MDEV_IO_BUF_POOL_OFS); vaddr = (char *)(ptnmd->nm_addr) + poolofs; paddr = ptnmd->nm_paddr + poolofs; for (i = 0; i < nbuffers; i++) { ptnmd->buf_lut.lut[i].vaddr = vaddr; vaddr += bufsize; paddr += bufsize; } ptnmd->buf_lut.objtotal = nbuffers; ptnmd->buf_lut.objsize = bufsize; nmd->nm_totalsize = mem_size; /* Initialize these fields as are needed by * netmap_mem_bufsize(). * XXX please improve this, why do we need this * replication? maybe we nmd->pools[] should no be * there for the guest allocator? 
*/ nmd->pools[NETMAP_BUF_POOL]._objsize = bufsize; nmd->pools[NETMAP_BUF_POOL]._objtotal = nbuffers; nmd->flags |= NETMAP_MEM_FINALIZED; out: return error; } static void netmap_mem_pt_guest_deref(struct netmap_mem_d *nmd) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)nmd; if (nmd->active == 1 && (nmd->flags & NETMAP_MEM_FINALIZED)) { nmd->flags &= ~NETMAP_MEM_FINALIZED; /* unmap ptnetmap-memdev memory */ if (ptnmd->ptn_dev) { nm_os_pt_memdev_iounmap(ptnmd->ptn_dev); } ptnmd->nm_addr = NULL; ptnmd->nm_paddr = 0; } } static ssize_t netmap_mem_pt_guest_if_offset(struct netmap_mem_d *nmd, const void *vaddr) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)nmd; return (const char *)(vaddr) - (char *)(ptnmd->nm_addr); } static void netmap_mem_pt_guest_delete(struct netmap_mem_d *nmd) { if (nmd == NULL) return; if (netmap_verbose) nm_prinf("deleting %p", nmd); if (nmd->active > 0) nm_prerr("bug: deleting mem allocator with active=%d!", nmd->active); if (netmap_verbose) nm_prinf("done deleting %p", nmd); NMA_LOCK_DESTROY(nmd); nm_os_free(nmd); } static struct netmap_if * netmap_mem_pt_guest_if_new(struct netmap_adapter *na, struct netmap_priv_d *priv) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)na->nm_mem; struct mem_pt_if *ptif; struct netmap_if *nifp = NULL; ptif = netmap_mem_pt_guest_ifp_lookup(na->nm_mem, na->ifp); if (ptif == NULL) { nm_prerr("interface %s is not in passthrough", na->name); goto out; } nifp = (struct netmap_if *)((char *)(ptnmd->nm_addr) + ptif->nifp_offset); out: return nifp; } static void netmap_mem_pt_guest_if_delete(struct netmap_adapter *na, struct netmap_if *nifp) { struct mem_pt_if *ptif; ptif = netmap_mem_pt_guest_ifp_lookup(na->nm_mem, na->ifp); if (ptif == NULL) { nm_prerr("interface %s is not in passthrough", na->name); } } static int netmap_mem_pt_guest_rings_create(struct netmap_adapter *na) { struct netmap_mem_ptg *ptnmd = (struct netmap_mem_ptg *)na->nm_mem; struct mem_pt_if *ptif; struct netmap_if *nifp; int i, error = -1; ptif = netmap_mem_pt_guest_ifp_lookup(na->nm_mem, na->ifp); if (ptif == NULL) { nm_prerr("interface %s is not in passthrough", na->name); goto out; } /* point each kring to the corresponding backend ring */ nifp = (struct netmap_if *)((char *)ptnmd->nm_addr + ptif->nifp_offset); for (i = 0; i < netmap_all_rings(na, NR_TX); i++) { struct netmap_kring *kring = na->tx_rings[i]; if (kring->ring) continue; kring->ring = (struct netmap_ring *) ((char *)nifp + nifp->ring_ofs[i]); } for (i = 0; i < netmap_all_rings(na, NR_RX); i++) { struct netmap_kring *kring = na->rx_rings[i]; if (kring->ring) continue; kring->ring = (struct netmap_ring *) ((char *)nifp + nifp->ring_ofs[netmap_all_rings(na, NR_TX) + i]); } error = 0; out: return error; } static void netmap_mem_pt_guest_rings_delete(struct netmap_adapter *na) { #if 0 enum txrx t; for_rx_tx(t) { u_int i; for (i = 0; i < nma_get_nrings(na, t) + 1; i++) { struct netmap_kring *kring = &NMR(na, t)[i]; kring->ring = NULL; } } #endif } static struct netmap_mem_ops netmap_mem_pt_guest_ops = { .nmd_get_lut = netmap_mem_pt_guest_get_lut, .nmd_get_info = netmap_mem_pt_guest_get_info, .nmd_ofstophys = netmap_mem_pt_guest_ofstophys, .nmd_config = netmap_mem_pt_guest_config, .nmd_finalize = netmap_mem_pt_guest_finalize, .nmd_deref = netmap_mem_pt_guest_deref, .nmd_if_offset = netmap_mem_pt_guest_if_offset, .nmd_delete = netmap_mem_pt_guest_delete, .nmd_if_new = netmap_mem_pt_guest_if_new, .nmd_if_delete = netmap_mem_pt_guest_if_delete, .nmd_rings_create = 
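/*
 * netmap_mem_pt_guest_rings_create() above recovers each ring from the
 * netmap_if: ring_ofs[] holds one offset per ring, TX rings first, then
 * RX rings starting at index netmap_all_rings(na, NR_TX), all relative
 * to the nifp itself. A standalone sketch of the same pointer arithmetic
 * over a fake two-ring layout (struct and sizes are illustrative):
 *
 *	#include <stdio.h>
 *	#include <stdint.h>
 *
 *	struct fake_if {		// stand-in for struct netmap_if
 *		char    pad[64];	// header area
 *		int64_t ring_ofs[2];	// [0] = TX ring, [1] = RX ring
 *	};
 *
 *	int
 *	main(void)
 *	{
 *		static char mem[4096];	// pretend shared region
 *		struct fake_if *nifp = (struct fake_if *)mem;
 *
 *		nifp->ring_ofs[0] = 1024;
 *		nifp->ring_ofs[1] = 2048;
 *
 *		// same arithmetic as rings_create: base is the nifp
 *		char *tx = (char *)nifp + nifp->ring_ofs[0];
 *		char *rx = (char *)nifp + nifp->ring_ofs[1];
 *		printf("tx at +%td, rx at +%td\n", tx - mem, rx - mem);
 *		return 0;
 *	}
 */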
netmap_mem_pt_guest_rings_create, .nmd_rings_delete = netmap_mem_pt_guest_rings_delete }; /* Called with nm_mem_list_lock held. */ static struct netmap_mem_d * netmap_mem_pt_guest_find_memid(nm_memid_t mem_id) { struct netmap_mem_d *mem = NULL; struct netmap_mem_d *scan = netmap_last_mem_d; do { /* find ptnetmap allocator through host ID */ if (scan->ops->nmd_deref == netmap_mem_pt_guest_deref && ((struct netmap_mem_ptg *)(scan))->host_mem_id == mem_id) { mem = scan; mem->refcount++; NM_DBG_REFC(mem, __FUNCTION__, __LINE__); break; } scan = scan->next; } while (scan != netmap_last_mem_d); return mem; } /* Called with nm_mem_list_lock held. */ static struct netmap_mem_d * netmap_mem_pt_guest_create(nm_memid_t mem_id) { struct netmap_mem_ptg *ptnmd; int err = 0; ptnmd = nm_os_malloc(sizeof(struct netmap_mem_ptg)); if (ptnmd == NULL) { err = ENOMEM; goto error; } ptnmd->up.ops = &netmap_mem_pt_guest_ops; ptnmd->host_mem_id = mem_id; ptnmd->pt_ifs = NULL; /* Assign new id in the guest (We have the lock) */ err = nm_mem_assign_id_locked(&ptnmd->up); if (err) goto error; ptnmd->up.flags &= ~NETMAP_MEM_FINALIZED; ptnmd->up.flags |= NETMAP_MEM_IO; NMA_LOCK_INIT(&ptnmd->up); snprintf(ptnmd->up.name, NM_MEM_NAMESZ, "%d", ptnmd->up.nm_id); return &ptnmd->up; error: netmap_mem_pt_guest_delete(&ptnmd->up); return NULL; } /* * find host id in guest allocators and create guest allocator * if it is not there */ static struct netmap_mem_d * netmap_mem_pt_guest_get(nm_memid_t mem_id) { struct netmap_mem_d *nmd; NM_MTX_LOCK(nm_mem_list_lock); nmd = netmap_mem_pt_guest_find_memid(mem_id); if (nmd == NULL) { nmd = netmap_mem_pt_guest_create(mem_id); } NM_MTX_UNLOCK(nm_mem_list_lock); return nmd; } /* * The guest allocator can be created by ptnetmap_memdev (during the device * attach) or by ptnetmap device (ptnet), during the netmap_attach. * * The order is not important (we have different order in LINUX and FreeBSD). * The first one, creates the device, and the second one simply attaches it. */ /* Called when ptnetmap_memdev is attaching, to attach a new allocator in * the guest */ struct netmap_mem_d * netmap_mem_pt_guest_attach(struct ptnetmap_memdev *ptn_dev, nm_memid_t mem_id) { struct netmap_mem_d *nmd; struct netmap_mem_ptg *ptnmd; nmd = netmap_mem_pt_guest_get(mem_id); /* assign this device to the guest allocator */ if (nmd) { ptnmd = (struct netmap_mem_ptg *)nmd; ptnmd->ptn_dev = ptn_dev; } return nmd; } /* Called when ptnet device is attaching */ struct netmap_mem_d * netmap_mem_pt_guest_new(struct ifnet *ifp, unsigned int nifp_offset, unsigned int memid) { struct netmap_mem_d *nmd; if (ifp == NULL) { return NULL; } nmd = netmap_mem_pt_guest_get((nm_memid_t)memid); if (nmd) { netmap_mem_pt_guest_ifp_add(nmd, ifp, nifp_offset); } return nmd; } #endif /* WITH_PTNETMAP */ Index: head/tools/tools/netmap/lb.c =================================================================== --- head/tools/tools/netmap/lb.c (revision 353774) +++ head/tools/tools/netmap/lb.c (revision 353775) @@ -1,1027 +1,1027 @@ /* * Copyright (C) 2017 Corelight, Inc. and Universita` di Pisa. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. 
Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* $FreeBSD$ */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sys/poll.h> #include <syslog.h> #define NETMAP_WITH_LIBS #include <net/netmap_user.h> #include <sys/socket.h> #include <netinet/in.h> /* htonl */ #include <pthread.h> #include "pkt_hash.h" #include "ctrs.h" /* * use our version of header structs, rather than bringing in a ton * of platform specific ones */ #ifndef ETH_ALEN #define ETH_ALEN 6 #endif struct compact_eth_hdr { unsigned char h_dest[ETH_ALEN]; unsigned char h_source[ETH_ALEN]; u_int16_t h_proto; }; struct compact_ip_hdr { u_int8_t ihl:4, version:4; u_int8_t tos; u_int16_t tot_len; u_int16_t id; u_int16_t frag_off; u_int8_t ttl; u_int8_t protocol; u_int16_t check; u_int32_t saddr; u_int32_t daddr; }; struct compact_ipv6_hdr { u_int8_t priority:4, version:4; u_int8_t flow_lbl[3]; u_int16_t payload_len; u_int8_t nexthdr; u_int8_t hop_limit; struct in6_addr saddr; struct in6_addr daddr; }; #define MAX_IFNAMELEN 64 #define MAX_PORTNAMELEN (MAX_IFNAMELEN + 40) #define DEF_OUT_PIPES 2 #define DEF_EXTRA_BUFS 0 #define DEF_BATCH 2048 #define DEF_WAIT_LINK 2 #define DEF_STATS_INT 600 #define BUF_REVOKE 100 #define STAT_MSG_MAXSIZE 1024 struct { char ifname[MAX_IFNAMELEN]; char base_name[MAX_IFNAMELEN]; int netmap_fd; uint16_t output_rings; uint16_t num_groups; uint32_t extra_bufs; uint16_t batch; int stdout_interval; int syslog_interval; int wait_link; bool busy_wait; } glob_arg; /* * the overflow queue is a circular queue of buffers */ struct overflow_queue { char name[MAX_IFNAMELEN + 16]; struct netmap_slot *slots; uint32_t head; uint32_t tail; uint32_t n; uint32_t size; }; struct overflow_queue *freeq; static inline int oq_full(struct overflow_queue *q) { return q->n >= q->size; } static inline int oq_empty(struct overflow_queue *q) { return q->n <= 0; } static inline void oq_enq(struct overflow_queue *q, const struct netmap_slot *s) { if (unlikely(oq_full(q))) { D("%s: queue full!", q->name); abort(); } q->slots[q->tail] = *s; q->n++; q->tail++; if (q->tail >= q->size) q->tail = 0; } static inline struct netmap_slot oq_deq(struct overflow_queue *q) { struct netmap_slot s = q->slots[q->head]; if (unlikely(oq_empty(q))) { D("%s: queue empty!", q->name); abort(); } q->n--; q->head++; if (q->head >= q->size) q->head = 0; return s; } static volatile int do_abort = 0; uint64_t dropped = 0; uint64_t forwarded = 0; uint64_t received_bytes = 0; uint64_t received_pkts = 0; uint64_t non_ip = 0; uint32_t freeq_n = 0; struct port_des { char interface[MAX_PORTNAMELEN]; struct my_ctrs ctr; unsigned int last_sync; uint32_t last_tail; struct overflow_queue *oq; struct nm_desc *nmd; struct netmap_ring *ring; struct
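/*
 * The overflow queue above is a plain circular buffer: n counts the
 * occupied slots, head is where oq_deq() reads, tail where oq_enq()
 * writes, both wrapping at size. A self-contained sketch with ints in
 * place of netmap slots (hypothetical names, same invariants):
 *
 *	#include <assert.h>
 *	#include <stdlib.h>
 *
 *	struct q { int *slots; unsigned head, tail, n, size; };
 *
 *	static void enq(struct q *q, int v)
 *	{
 *		assert(q->n < q->size);		// caller checks oq_full()
 *		q->slots[q->tail] = v;
 *		q->n++;
 *		if (++q->tail >= q->size)
 *			q->tail = 0;
 *	}
 *
 *	static int deq(struct q *q)
 *	{
 *		int v;
 *		assert(q->n > 0);		// caller checks oq_empty()
 *		v = q->slots[q->head];
 *		q->n--;
 *		if (++q->head >= q->size)
 *			q->head = 0;
 *		return v;
 *	}
 *
 *	int main(void)
 *	{
 *		struct q q = { calloc(4, sizeof(int)), 0, 0, 0, 4 };
 *		enq(&q, 1); enq(&q, 2);		// FIFO order preserved
 *		return deq(&q) == 1 && deq(&q) == 2 ? 0 : 1;
 *	}
 */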
group_des *group; }; struct port_des *ports; /* each group of pipes receives all the packets */ struct group_des { char pipename[MAX_IFNAMELEN]; struct port_des *ports; int first_id; int nports; int last; int custom_port; }; struct group_des *groups; /* statistcs */ struct counters { struct timeval ts; struct my_ctrs *ctrs; uint64_t received_pkts; uint64_t received_bytes; uint64_t non_ip; uint32_t freeq_n; int status __attribute__((aligned(64))); #define COUNTERS_EMPTY 0 #define COUNTERS_FULL 1 }; struct counters counters_buf; static void * print_stats(void *arg) { int npipes = glob_arg.output_rings; int sys_int = 0; (void)arg; struct my_ctrs cur, prev; struct my_ctrs *pipe_prev; pipe_prev = calloc(npipes, sizeof(struct my_ctrs)); if (pipe_prev == NULL) { D("out of memory"); exit(1); } char stat_msg[STAT_MSG_MAXSIZE] = ""; memset(&prev, 0, sizeof(prev)); while (!do_abort) { int j, dosyslog = 0, dostdout = 0, newdata; uint64_t pps = 0, dps = 0, bps = 0, dbps = 0, usec = 0; struct my_ctrs x; counters_buf.status = COUNTERS_EMPTY; newdata = 0; memset(&cur, 0, sizeof(cur)); sleep(1); if (counters_buf.status == COUNTERS_FULL) { __sync_synchronize(); newdata = 1; cur.t = counters_buf.ts; if (prev.t.tv_sec || prev.t.tv_usec) { usec = (cur.t.tv_sec - prev.t.tv_sec) * 1000000 + cur.t.tv_usec - prev.t.tv_usec; } } ++sys_int; if (glob_arg.stdout_interval && sys_int % glob_arg.stdout_interval == 0) dostdout = 1; if (glob_arg.syslog_interval && sys_int % glob_arg.syslog_interval == 0) dosyslog = 1; for (j = 0; j < npipes; ++j) { struct my_ctrs *c = &counters_buf.ctrs[j]; cur.pkts += c->pkts; cur.drop += c->drop; cur.drop_bytes += c->drop_bytes; cur.bytes += c->bytes; if (usec) { x.pkts = c->pkts - pipe_prev[j].pkts; x.drop = c->drop - pipe_prev[j].drop; x.bytes = c->bytes - pipe_prev[j].bytes; x.drop_bytes = c->drop_bytes - pipe_prev[j].drop_bytes; pps = (x.pkts*1000000 + usec/2) / usec; dps = (x.drop*1000000 + usec/2) / usec; bps = ((x.bytes*1000000 + usec/2) / usec) * 8; dbps = ((x.drop_bytes*1000000 + usec/2) / usec) * 8; } pipe_prev[j] = *c; if ( (dosyslog || dostdout) && newdata ) snprintf(stat_msg, STAT_MSG_MAXSIZE, "{" "\"ts\":%.6f," "\"interface\":\"%s\"," "\"output_ring\":%" PRIu16 "," "\"packets_forwarded\":%" PRIu64 "," "\"packets_dropped\":%" PRIu64 "," "\"data_forward_rate_Mbps\":%.4f," "\"data_drop_rate_Mbps\":%.4f," "\"packet_forward_rate_kpps\":%.4f," "\"packet_drop_rate_kpps\":%.4f," "\"overflow_queue_size\":%" PRIu32 "}", cur.t.tv_sec + (cur.t.tv_usec / 1000000.0), ports[j].interface, j, c->pkts, c->drop, (double)bps / 1024 / 1024, (double)dbps / 1024 / 1024, (double)pps / 1000, (double)dps / 1000, c->oq_n); if (dosyslog && stat_msg[0]) syslog(LOG_INFO, "%s", stat_msg); if (dostdout && stat_msg[0]) printf("%s\n", stat_msg); } if (usec) { x.pkts = cur.pkts - prev.pkts; x.drop = cur.drop - prev.drop; x.bytes = cur.bytes - prev.bytes; x.drop_bytes = cur.drop_bytes - prev.drop_bytes; pps = (x.pkts*1000000 + usec/2) / usec; dps = (x.drop*1000000 + usec/2) / usec; bps = ((x.bytes*1000000 + usec/2) / usec) * 8; dbps = ((x.drop_bytes*1000000 + usec/2) / usec) * 8; } if ( (dosyslog || dostdout) && newdata ) snprintf(stat_msg, STAT_MSG_MAXSIZE, "{" "\"ts\":%.6f," "\"interface\":\"%s\"," "\"output_ring\":null," "\"packets_received\":%" PRIu64 "," "\"packets_forwarded\":%" PRIu64 "," "\"packets_dropped\":%" PRIu64 "," "\"non_ip_packets\":%" PRIu64 "," "\"data_forward_rate_Mbps\":%.4f," "\"data_drop_rate_Mbps\":%.4f," "\"packet_forward_rate_kpps\":%.4f," "\"packet_drop_rate_kpps\":%.4f," 
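/*
 * The statistics thread here synchronizes with the forwarding loop
 * through the status flag alone: the consumer sets COUNTERS_EMPTY,
 * sleeps, and reads the snapshot only if the producer has set
 * COUNTERS_FULL after a __sync_synchronize() barrier. A reduced sketch
 * of that single-slot handoff (pthreads stand in for the two loops;
 * names are illustrative):
 *
 *	#include <pthread.h>
 *	#include <stdio.h>
 *	#include <unistd.h>
 *
 *	#define EMPTY 0
 *	#define FULL  1
 *
 *	static struct { long pkts; volatile int status; } snap;
 *
 *	static void *producer(void *arg)
 *	{
 *		(void)arg;
 *		for (;;) {
 *			if (snap.status == FULL)
 *				continue;	// consumer not done yet
 *			snap.pkts++;		// fill the snapshot
 *			__sync_synchronize();	// publish data before flag
 *			snap.status = FULL;
 *		}
 *		return NULL;
 *	}
 *
 *	int main(void)
 *	{
 *		pthread_t t;
 *		pthread_create(&t, NULL, producer, NULL);
 *		for (int i = 0; i < 3; i++) {
 *			snap.status = EMPTY;	// request a fresh snapshot
 *			sleep(1);
 *			if (snap.status == FULL)
 *				printf("pkts %ld\n", snap.pkts);
 *		}
 *		return 0;
 *	}
 */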
"\"free_buffer_slots\":%" PRIu32 "}", cur.t.tv_sec + (cur.t.tv_usec / 1000000.0), glob_arg.ifname, received_pkts, cur.pkts, cur.drop, counters_buf.non_ip, (double)bps / 1024 / 1024, (double)dbps / 1024 / 1024, (double)pps / 1000, (double)dps / 1000, counters_buf.freeq_n); if (dosyslog && stat_msg[0]) syslog(LOG_INFO, "%s", stat_msg); if (dostdout && stat_msg[0]) printf("%s\n", stat_msg); prev = cur; } free(pipe_prev); return NULL; } static void free_buffers(void) { int i, tot = 0; struct port_des *rxport = &ports[glob_arg.output_rings]; /* build a netmap free list with the buffers in all the overflow queues */ for (i = 0; i < glob_arg.output_rings + 1; i++) { struct port_des *cp = &ports[i]; struct overflow_queue *q = cp->oq; if (!q) continue; while (q->n) { struct netmap_slot s = oq_deq(q); uint32_t *b = (uint32_t *)NETMAP_BUF(cp->ring, s.buf_idx); *b = rxport->nmd->nifp->ni_bufs_head; rxport->nmd->nifp->ni_bufs_head = s.buf_idx; tot++; } } D("added %d buffers to netmap free list", tot); for (i = 0; i < glob_arg.output_rings + 1; ++i) { nm_close(ports[i].nmd); } } static void sigint_h(int sig) { (void)sig; /* UNUSED */ do_abort = 1; signal(SIGINT, SIG_DFL); } void usage() { printf("usage: lb [options]\n"); printf("where options are:\n"); printf(" -h view help text\n"); printf(" -i iface interface name (required)\n"); printf(" -p [prefix:]npipes add a new group of output pipes\n"); printf(" -B nbufs number of extra buffers (default: %d)\n", DEF_EXTRA_BUFS); printf(" -b batch batch size (default: %d)\n", DEF_BATCH); printf(" -w seconds wait for link up (default: %d)\n", DEF_WAIT_LINK); printf(" -W enable busy waiting. this will run your CPU at 100%%\n"); printf(" -s seconds seconds between syslog stats messages (default: 0)\n"); printf(" -o seconds seconds between stdout stats messages (default: 0)\n"); exit(0); } static int parse_pipes(char *spec) { char *end = index(spec, ':'); static int max_groups = 0; struct group_des *g; ND("spec %s num_groups %d", spec, glob_arg.num_groups); if (max_groups < glob_arg.num_groups + 1) { size_t size = sizeof(*g) * (glob_arg.num_groups + 1); groups = realloc(groups, size); if (groups == NULL) { D("out of memory"); return 1; } } g = &groups[glob_arg.num_groups]; memset(g, 0, sizeof(*g)); if (end != NULL) { if (end - spec > MAX_IFNAMELEN - 8) { D("name '%s' too long", spec); return 1; } if (end == spec) { D("missing prefix before ':' in '%s'", spec); return 1; } strncpy(g->pipename, spec, end - spec); g->custom_port = 1; end++; } else { /* no prefix, this group will use the * name of the input port. * This will be set in init_groups(), * since here the input port may still * be uninitialized */ end = spec; } if (*end == '\0') { g->nports = DEF_OUT_PIPES; } else { g->nports = atoi(end); if (g->nports < 1) { D("invalid number of pipes '%s' (must be at least 1)", end); return 1; } } glob_arg.output_rings += g->nports; glob_arg.num_groups++; return 0; } /* complete the initialization of the groups data structure */ void init_groups(void) { int i, j, t = 0; struct group_des *g = NULL; for (i = 0; i < glob_arg.num_groups; i++) { g = &groups[i]; g->ports = &ports[t]; for (j = 0; j < g->nports; j++) g->ports[j].group = g; t += g->nports; if (!g->custom_port) strcpy(g->pipename, glob_arg.base_name); for (j = 0; j < i; j++) { struct group_des *h = &groups[j]; if (!strcmp(h->pipename, g->pipename)) g->first_id += h->nports; } } g->last = 1; } /* push the packet described by slot rs to the group g. 
* This may cause other buffers to be pushed down the * chain headed by g. * Return a free buffer. */ uint32_t forward_packet(struct group_des *g, struct netmap_slot *rs) { uint32_t hash = rs->ptr; uint32_t output_port = hash % g->nports; struct port_des *port = &g->ports[output_port]; struct netmap_ring *ring = port->ring; struct overflow_queue *q = port->oq; /* Move the packet to the output pipe, unless there is * either no space left on the ring, or there is some * packet still in the overflow queue (since those must * take precedence over the new one) */ if (ring->head != ring->tail && (q == NULL || oq_empty(q))) { struct netmap_slot *ts = &ring->slot[ring->head]; struct netmap_slot old_slot = *ts; ts->buf_idx = rs->buf_idx; ts->len = rs->len; ts->flags |= NS_BUF_CHANGED; ts->ptr = rs->ptr; ring->head = nm_ring_next(ring, ring->head); port->ctr.bytes += rs->len; port->ctr.pkts++; forwarded++; return old_slot.buf_idx; } /* use the overflow queue, if available */ if (q == NULL || oq_full(q)) { /* no space left on the ring and no overflow queue * available: we are forced to drop the packet */ dropped++; port->ctr.drop++; port->ctr.drop_bytes += rs->len; return rs->buf_idx; } oq_enq(q, rs); /* * we cannot continue down the chain and we need to * return a free buffer now. We take it from the free queue. */ if (oq_empty(freeq)) { /* the free queue is empty. Revoke some buffers * from the longest overflow queue */ uint32_t j; struct port_des *lp = &ports[0]; uint32_t max = lp->oq->n; /* let lp point to the port with the longest queue */ for (j = 1; j < glob_arg.output_rings; j++) { struct port_des *cp = &ports[j]; if (cp->oq->n > max) { lp = cp; max = cp->oq->n; } } /* move the oldest BUF_REVOKE buffers from the * lp queue to the free queue */ // XXX optimize this cycle for (j = 0; lp->oq->n && j < BUF_REVOKE; j++) { struct netmap_slot tmp = oq_deq(lp->oq); dropped++; lp->ctr.drop++; lp->ctr.drop_bytes += tmp.len; oq_enq(freeq, &tmp); } ND(1, "revoked %d buffers from %s", j, lq->name); } return oq_deq(freeq).buf_idx; } int main(int argc, char **argv) { int ch; uint32_t i; int rv; unsigned int iter = 0; int poll_timeout = 10; /* default */ glob_arg.ifname[0] = '\0'; glob_arg.output_rings = 0; glob_arg.batch = DEF_BATCH; glob_arg.wait_link = DEF_WAIT_LINK; glob_arg.busy_wait = false; glob_arg.syslog_interval = 0; glob_arg.stdout_interval = 0; while ( (ch = getopt(argc, argv, "hi:p:b:B:s:o:w:W")) != -1) { switch (ch) { case 'i': D("interface is %s", optarg); if (strlen(optarg) > MAX_IFNAMELEN - 8) { D("ifname too long %s", optarg); return 1; } if (strncmp(optarg, "netmap:", 7) && strncmp(optarg, "vale", 4)) { sprintf(glob_arg.ifname, "netmap:%s", optarg); } else { strcpy(glob_arg.ifname, optarg); } break; case 'p': if (parse_pipes(optarg)) { usage(); return 1; } break; case 'B': glob_arg.extra_bufs = atoi(optarg); D("requested %d extra buffers", glob_arg.extra_bufs); break; case 'b': glob_arg.batch = atoi(optarg); D("batch is %d", glob_arg.batch); break; case 'w': glob_arg.wait_link = atoi(optarg); D("link wait for up time is %d", glob_arg.wait_link); break; case 'W': glob_arg.busy_wait = true; break; case 'o': glob_arg.stdout_interval = atoi(optarg); break; case 's': glob_arg.syslog_interval = atoi(optarg); break; case 'h': usage(); return 0; break; default: D("bad option %c %s", ch, optarg); usage(); return 1; } } if (glob_arg.ifname[0] == '\0') { D("missing interface name"); usage(); return 1; } /* extract the base name */ char *nscan = strncmp(glob_arg.ifname, "netmap:", 7) ? 
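/*
 * forward_packet() above never copies payloads: it moves the buffer
 * *index* from the receive slot into the transmit slot, sets
 * NS_BUF_CHANGED so the kernel reloads the buffer address, and hands the
 * transmit slot's old buffer back for reuse on the receive ring. A
 * minimal sketch of that swap over toy slot structs:
 *
 *	#include <stdio.h>
 *	#include <stdint.h>
 *
 *	#define BUF_CHANGED 0x1		// stands in for NS_BUF_CHANGED
 *
 *	struct slot { uint32_t buf_idx; uint16_t len, flags; };
 *
 *	// move rs's buffer into ts; return the buffer ts used to own
 *	static uint32_t swap_to_tx(struct slot *ts, struct slot *rs)
 *	{
 *		uint32_t old = ts->buf_idx;
 *		ts->buf_idx = rs->buf_idx;
 *		ts->len = rs->len;
 *		ts->flags |= BUF_CHANGED;
 *		return old;
 *	}
 *
 *	int main(void)
 *	{
 *		struct slot tx = { 7, 0, 0 }, rx = { 42, 60, 0 };
 *		rx.buf_idx = swap_to_tx(&tx, &rx); // rx now owns buffer 7
 *		rx.flags |= BUF_CHANGED;
 *		printf("tx buf %u, rx buf %u\n", tx.buf_idx, rx.buf_idx);
 *		return 0;
 *	}
 */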
glob_arg.ifname : glob_arg.ifname + 7; - strncpy(glob_arg.base_name, nscan, MAX_IFNAMELEN-1); + strncpy(glob_arg.base_name, nscan, MAX_IFNAMELEN - 1); for (nscan = glob_arg.base_name; *nscan && !index("-*^{}/@", *nscan); nscan++) ; *nscan = '\0'; if (glob_arg.num_groups == 0) parse_pipes(""); if (glob_arg.syslog_interval) { setlogmask(LOG_UPTO(LOG_INFO)); openlog("lb", LOG_CONS | LOG_PID | LOG_NDELAY, LOG_LOCAL1); } uint32_t npipes = glob_arg.output_rings; pthread_t stat_thread; ports = calloc(npipes + 1, sizeof(struct port_des)); if (!ports) { D("failed to allocate the stats array"); return 1; } struct port_des *rxport = &ports[npipes]; init_groups(); memset(&counters_buf, 0, sizeof(counters_buf)); counters_buf.ctrs = calloc(npipes, sizeof(struct my_ctrs)); if (!counters_buf.ctrs) { D("failed to allocate the counters snapshot buffer"); return 1; } /* we need base_req to specify pipes and extra bufs */ struct nmreq base_req; memset(&base_req, 0, sizeof(base_req)); base_req.nr_arg1 = npipes; base_req.nr_arg3 = glob_arg.extra_bufs; rxport->nmd = nm_open(glob_arg.ifname, &base_req, 0, NULL); if (rxport->nmd == NULL) { D("cannot open %s", glob_arg.ifname); return (1); } else { D("successfully opened %s (tx rings: %u)", glob_arg.ifname, rxport->nmd->req.nr_tx_slots); } uint32_t extra_bufs = rxport->nmd->req.nr_arg3; struct overflow_queue *oq = NULL; /* reference ring to access the buffers */ rxport->ring = NETMAP_RXRING(rxport->nmd->nifp, 0); if (!glob_arg.extra_bufs) goto run; D("obtained %d extra buffers", extra_bufs); if (!extra_bufs) goto run; /* one overflow queue for each output pipe, plus one for the * free extra buffers */ oq = calloc(npipes + 1, sizeof(struct overflow_queue)); if (!oq) { D("failed to allocated overflow queues descriptors"); goto run; } freeq = &oq[npipes]; rxport->oq = freeq; freeq->slots = calloc(extra_bufs, sizeof(struct netmap_slot)); if (!freeq->slots) { D("failed to allocate the free list"); } freeq->size = extra_bufs; snprintf(freeq->name, MAX_IFNAMELEN, "free queue"); /* * the list of buffers uses the first uint32_t in each buffer * as the index of the next buffer. */ uint32_t scan; for (scan = rxport->nmd->nifp->ni_bufs_head; scan; scan = *(uint32_t *)NETMAP_BUF(rxport->ring, scan)) { struct netmap_slot s; s.len = s.flags = 0; s.ptr = 0; s.buf_idx = scan; ND("freeq <- %d", s.buf_idx); oq_enq(freeq, &s); } if (freeq->n != extra_bufs) { D("something went wrong: netmap reported %d extra_bufs, but the free list contained %d", extra_bufs, freeq->n); return 1; } rxport->nmd->nifp->ni_bufs_head = 0; run: atexit(free_buffers); int j, t = 0; for (j = 0; j < glob_arg.num_groups; j++) { struct group_des *g = &groups[j]; int k; for (k = 0; k < g->nports; ++k) { struct port_des *p = &g->ports[k]; snprintf(p->interface, MAX_PORTNAMELEN, "%s%s{%d/xT@%d", (strncmp(g->pipename, "vale", 4) ? 
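/*
 * The ni_bufs_head walk above relies on netmap's free-list encoding:
 * each free buffer stores the index of the next free buffer in its
 * first 4 bytes, and index 0 terminates the list. A standalone sketch
 * of the same traversal over a toy buffer pool (sizes and the chain
 * below are illustrative):
 *
 *	#include <stdio.h>
 *	#include <stdint.h>
 *	#include <string.h>
 *
 *	#define BUFSZ 64
 *	static char bufs[8][BUFSZ];	// pretend buffer pool
 *
 *	int main(void)
 *	{
 *		uint32_t head = 3, scan;
 *
 *		// chain 3 -> 5 -> 1 -> end, as the kernel would
 *		memcpy(bufs[3], &(uint32_t){5}, 4);
 *		memcpy(bufs[5], &(uint32_t){1}, 4);
 *		memcpy(bufs[1], &(uint32_t){0}, 4);
 *
 *		// read the next index out of the current buffer
 *		for (scan = head; scan != 0; memcpy(&scan, bufs[scan], 4))
 *			printf("free buffer %u\n", scan);
 *		return 0;
 *	}
 */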
"netmap:" : ""), g->pipename, g->first_id + k, rxport->nmd->req.nr_arg2); D("opening pipe named %s", p->interface); p->nmd = nm_open(p->interface, NULL, 0, rxport->nmd); if (p->nmd == NULL) { D("cannot open %s", p->interface); return (1); } else if (p->nmd->req.nr_arg2 != rxport->nmd->req.nr_arg2) { D("failed to open pipe #%d in zero-copy mode, " "please close any application that uses either pipe %s}%d, " "or %s{%d, and retry", k + 1, g->pipename, g->first_id + k, g->pipename, g->first_id + k); return (1); } else { D("successfully opened pipe #%d %s (tx slots: %d)", k + 1, p->interface, p->nmd->req.nr_tx_slots); p->ring = NETMAP_TXRING(p->nmd->nifp, 0); p->last_tail = nm_ring_next(p->ring, p->ring->tail); } D("zerocopy %s", (rxport->nmd->mem == p->nmd->mem) ? "enabled" : "disabled"); if (extra_bufs) { struct overflow_queue *q = &oq[t + k]; q->slots = calloc(extra_bufs, sizeof(struct netmap_slot)); if (!q->slots) { D("failed to allocate overflow queue for pipe %d", k); /* make all overflow queue management fail */ extra_bufs = 0; } q->size = extra_bufs; snprintf(q->name, sizeof(q->name), "oq %s{%4d", g->pipename, k); p->oq = q; } } t += g->nports; } if (glob_arg.extra_bufs && !extra_bufs) { if (oq) { for (i = 0; i < npipes + 1; i++) { free(oq[i].slots); oq[i].slots = NULL; } free(oq); oq = NULL; } D("*** overflow queues disabled ***"); } sleep(glob_arg.wait_link); /* start stats thread after wait_link */ if (pthread_create(&stat_thread, NULL, print_stats, NULL) == -1) { D("unable to create the stats thread: %s", strerror(errno)); return 1; } struct pollfd pollfd[npipes + 1]; memset(&pollfd, 0, sizeof(pollfd)); signal(SIGINT, sigint_h); /* make sure we wake up as often as needed, even when there are no * packets coming in */ if (glob_arg.syslog_interval > 0 && glob_arg.syslog_interval < poll_timeout) poll_timeout = glob_arg.syslog_interval; if (glob_arg.stdout_interval > 0 && glob_arg.stdout_interval < poll_timeout) poll_timeout = glob_arg.stdout_interval; while (!do_abort) { u_int polli = 0; iter++; for (i = 0; i < npipes; ++i) { struct netmap_ring *ring = ports[i].ring; int pending = nm_tx_pending(ring); /* if there are packets pending, we want to be notified when * tail moves, so we let cur=tail */ ring->cur = pending ? ring->tail : ring->head; if (!glob_arg.busy_wait && !pending) { /* no need to poll, there are no packets pending */ continue; } pollfd[polli].fd = ports[i].nmd->fd; pollfd[polli].events = POLLOUT; pollfd[polli].revents = 0; ++polli; } pollfd[polli].fd = rxport->nmd->fd; pollfd[polli].events = POLLIN; pollfd[polli].revents = 0; ++polli; //RD(5, "polling %d file descriptors", polli+1); rv = poll(pollfd, polli, poll_timeout); if (rv <= 0) { if (rv < 0 && errno != EAGAIN && errno != EINTR) RD(1, "poll error %s", strerror(errno)); goto send_stats; } /* if there are several groups, try pushing released packets from * upstream groups to the downstream ones. * * It is important to do this before returned slots are reused * for new transmissions. For the same reason, this must be * done starting from the last group going backwards. 
*/ for (i = glob_arg.num_groups - 1U; i > 0; i--) { struct group_des *g = &groups[i - 1]; int j; for (j = 0; j < g->nports; j++) { struct port_des *p = &g->ports[j]; struct netmap_ring *ring = p->ring; uint32_t last = p->last_tail, stop = nm_ring_next(ring, ring->tail); /* slight abuse of the API here: we touch the slot * pointed to by tail */ for ( ; last != stop; last = nm_ring_next(ring, last)) { struct netmap_slot *rs = &ring->slot[last]; // XXX less aggressive? rs->buf_idx = forward_packet(g + 1, rs); rs->flags |= NS_BUF_CHANGED; rs->ptr = 0; } p->last_tail = last; } } if (oq) { /* try to push packets from the overflow queues * to the corresponding pipes */ for (i = 0; i < npipes; i++) { struct port_des *p = &ports[i]; struct overflow_queue *q = p->oq; uint32_t j, lim; struct netmap_ring *ring; struct netmap_slot *slot; if (oq_empty(q)) continue; ring = p->ring; lim = nm_ring_space(ring); if (!lim) continue; if (q->n < lim) lim = q->n; for (j = 0; j < lim; j++) { struct netmap_slot s = oq_deq(q), tmp; tmp.ptr = 0; slot = &ring->slot[ring->head]; tmp.buf_idx = slot->buf_idx; oq_enq(freeq, &tmp); *slot = s; slot->flags |= NS_BUF_CHANGED; ring->head = nm_ring_next(ring, ring->head); } } } /* push any new packets from the input port to the first group */ int batch = 0; for (i = rxport->nmd->first_rx_ring; i <= rxport->nmd->last_rx_ring; i++) { struct netmap_ring *rxring = NETMAP_RXRING(rxport->nmd->nifp, i); //D("prepare to scan rings"); - int next_cur = rxring->cur; - struct netmap_slot *next_slot = &rxring->slot[next_cur]; + int next_head = rxring->head; + struct netmap_slot *next_slot = &rxring->slot[next_head]; const char *next_buf = NETMAP_BUF(rxring, next_slot->buf_idx); while (!nm_ring_empty(rxring)) { struct netmap_slot *rs = next_slot; struct group_des *g = &groups[0]; ++received_pkts; received_bytes += rs->len; // CHOOSE THE CORRECT OUTPUT PIPE rs->ptr = pkt_hdr_hash((const unsigned char *)next_buf, 4, 'B'); if (rs->ptr == 0) { non_ip++; // XXX ?? } // prefetch the buffer for the next round - next_cur = nm_ring_next(rxring, next_cur); - next_slot = &rxring->slot[next_cur]; + next_head = nm_ring_next(rxring, next_head); + next_slot = &rxring->slot[next_head]; next_buf = NETMAP_BUF(rxring, next_slot->buf_idx); __builtin_prefetch(next_buf); // 'B' is just a hashing seed rs->buf_idx = forward_packet(g, rs); rs->flags |= NS_BUF_CHANGED; - rxring->head = rxring->cur = next_cur; + rxring->head = rxring->cur = next_head; batch++; if (unlikely(batch >= glob_arg.batch)) { ioctl(rxport->nmd->fd, NIOCRXSYNC, NULL); batch = 0; } ND(1, "Forwarded Packets: %"PRIu64" Dropped packets: %"PRIu64" Percent: %.2f", forwarded, dropped, ((float)dropped / (float)forwarded * 100)); } } send_stats: if (counters_buf.status == COUNTERS_FULL) continue; /* take a new snapshot of the counters */ gettimeofday(&counters_buf.ts, NULL); for (i = 0; i < npipes; i++) { struct my_ctrs *c = &counters_buf.ctrs[i]; *c = ports[i].ctr; /* * If there are overflow queues, copy the number of them for each * port to the ctrs.oq_n variable for each port. */ if (ports[i].oq != NULL) c->oq_n = ports[i].oq->n; } counters_buf.received_pkts = received_pkts; counters_buf.received_bytes = received_bytes; counters_buf.non_ip = non_ip; if (freeq != NULL) counters_buf.freeq_n = freeq->n; __sync_synchronize(); counters_buf.status = COUNTERS_FULL; } /* * If freeq exists, copy the number to the freeq_n member of the * message struct, otherwise set it to 0. 
*/ if (freeq != NULL) { freeq_n = freeq->n; } else { freeq_n = 0; } pthread_join(stat_thread, NULL); printf("%"PRIu64" packets forwarded. %"PRIu64" packets dropped. Total %"PRIu64"\n", forwarded, dropped, forwarded + dropped); return 0; } Index: head/tools/tools/netmap/pkt-gen.c =================================================================== --- head/tools/tools/netmap/pkt-gen.c (revision 353774) +++ head/tools/tools/netmap/pkt-gen.c (revision 353775) @@ -1,3249 +1,3250 @@ /* * Copyright (C) 2011-2014 Matteo Landi, Luigi Rizzo. All rights reserved. * Copyright (C) 2013-2015 Universita` di Pisa. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ /* * $FreeBSD$ * $Id: pkt-gen.c 12346 2013-06-12 17:36:25Z luigi $ * * Example program to show how to build a multithreaded packet * source/sink using the netmap device. * * In this example we create a programmable number of threads * to take care of all the queues of the interface used to * send or receive traffic. * */ #define _GNU_SOURCE /* for CPU_SET() */ #include <stdio.h> #define NETMAP_WITH_LIBS #include <net/netmap_user.h> #include <ctype.h> // isprint() #include <unistd.h> // sysconf() #include <sys/poll.h> #include <arpa/inet.h> /* ntohs */ #ifndef _WIN32 #include <sys/sysctl.h> /* sysctl */ #endif #include <ifaddrs.h> /* getifaddrs */ #include <net/ethernet.h> #include <netinet/in.h> #include <netinet/ip.h> #include <netinet/ip6.h> #include <netinet/udp.h> #ifdef linux #define IPV6_VERSION 0x60 #define IPV6_DEFHLIM 64 #endif #include <assert.h> #include <math.h> #include <pthread.h> #ifndef NO_PCAP #include <pcap/pcap.h> #endif #include "ctrs.h" static void usage(int); #ifdef _WIN32 #define cpuset_t DWORD_PTR //uint64_t static inline void CPU_ZERO(cpuset_t *p) { *p = 0; } static inline void CPU_SET(uint32_t i, cpuset_t *p) { *p |= 1<< (i & 0x3f); } #define pthread_setaffinity_np(a, b, c) !SetThreadAffinityMask(a, *c) //((void)a, 0) #define TAP_CLONEDEV "/dev/tap" #define AF_LINK 18 //defined in winsocks.h #define CLOCK_REALTIME_PRECISE CLOCK_REALTIME #include /* * Convert an ASCII representation of an ethernet address to * binary form. */ struct ether_addr * ether_aton(const char *a) { int i; static struct ether_addr o; unsigned int o0, o1, o2, o3, o4, o5; i = sscanf(a, "%x:%x:%x:%x:%x:%x", &o0, &o1, &o2, &o3, &o4, &o5); if (i != 6) return (NULL); o.octet[0]=o0; o.octet[1]=o1; o.octet[2]=o2; o.octet[3]=o3; o.octet[4]=o4; o.octet[5]=o5; return ((struct ether_addr *)&o); } /* * Convert a binary representation of an ethernet address to * an ASCII string.
*/ char * ether_ntoa(const struct ether_addr *n) { int i; static char a[18]; i = sprintf(a, "%02x:%02x:%02x:%02x:%02x:%02x", n->octet[0], n->octet[1], n->octet[2], n->octet[3], n->octet[4], n->octet[5]); return (i < 17 ? NULL : (char *)&a); } #endif /* _WIN32 */ #ifdef linux #define cpuset_t cpu_set_t #define ifr_flagshigh ifr_flags /* only the low 16 bits here */ #define IFF_PPROMISC IFF_PROMISC /* IFF_PPROMISC does not exist */ #include #include #define CLOCK_REALTIME_PRECISE CLOCK_REALTIME #include /* ether_aton */ #include /* sockaddr_ll */ #endif /* linux */ #ifdef __FreeBSD__ #include /* le64toh */ #include #include /* pthread w/ affinity */ #include /* cpu_set */ #include /* LLADDR */ #endif /* __FreeBSD__ */ #ifdef __APPLE__ #define cpuset_t uint64_t // XXX static inline void CPU_ZERO(cpuset_t *p) { *p = 0; } static inline void CPU_SET(uint32_t i, cpuset_t *p) { *p |= 1<< (i & 0x3f); } #define pthread_setaffinity_np(a, b, c) ((void)a, 0) #define ifr_flagshigh ifr_flags // XXX #define IFF_PPROMISC IFF_PROMISC #include /* LLADDR */ #define clock_gettime(a,b) \ do {struct timespec t0 = {0,0}; *(b) = t0; } while (0) #endif /* __APPLE__ */ const char *default_payload="netmap pkt-gen DIRECT payload\n" "http://info.iet.unipi.it/~luigi/netmap/ "; const char *indirect_payload="netmap pkt-gen indirect payload\n" "http://info.iet.unipi.it/~luigi/netmap/ "; int verbose = 0; int normalize = 1; #define VIRT_HDR_1 10 /* length of a base vnet-hdr */ #define VIRT_HDR_2 12 /* length of the extenede vnet-hdr */ #define VIRT_HDR_MAX VIRT_HDR_2 struct virt_header { uint8_t fields[VIRT_HDR_MAX]; }; #define MAX_BODYSIZE 65536 struct pkt { struct virt_header vh; struct ether_header eh; union { struct { struct ip ip; struct udphdr udp; uint8_t body[MAX_BODYSIZE]; /* hardwired */ } ipv4; struct { struct ip6_hdr ip; struct udphdr udp; uint8_t body[MAX_BODYSIZE]; /* hardwired */ } ipv6; }; } __attribute__((__packed__)); #define PKT(p, f, af) \ ((af) == AF_INET ? (p)->ipv4.f: (p)->ipv6.f) struct ip_range { char *name; union { struct { uint32_t start, end; /* same as struct in_addr */ } ipv4; struct { struct in6_addr start, end; uint8_t sgroup, egroup; } ipv6; }; uint16_t port0, port1; }; struct mac_range { char *name; struct ether_addr start, end; }; /* ifname can be netmap:foo-xxxx */ #define MAX_IFNAMELEN 64 /* our buffer for ifname */ #define MAX_PKTSIZE MAX_BODYSIZE /* XXX: + IP_HDR + ETH_HDR */ /* compact timestamp to fit into 60 byte packet. 
(enough to obtain RTT) */ struct tstamp { uint32_t sec; uint32_t nsec; }; /* * global arguments for all threads */ struct glob_arg { int af; /* address family AF_INET/AF_INET6 */ struct ip_range src_ip; struct ip_range dst_ip; struct mac_range dst_mac; struct mac_range src_mac; int pkt_size; int pkt_min_size; int burst; int forever; uint64_t npackets; /* total packets to send */ int frags; /* fragments per packet */ u_int frag_size; /* size of each fragment */ int nthreads; int cpus; /* cpus used for running */ int system_cpus; /* cpus on the system */ int options; /* testing */ #define OPT_PREFETCH 1 #define OPT_ACCESS 2 #define OPT_COPY 4 #define OPT_MEMCPY 8 #define OPT_TS 16 /* add a timestamp */ #define OPT_INDIRECT 32 /* use indirect buffers, tx only */ #define OPT_DUMP 64 /* dump rx/tx traffic */ #define OPT_RUBBISH 256 /* send wathever the buffers contain */ #define OPT_RANDOM_SRC 512 #define OPT_RANDOM_DST 1024 #define OPT_PPS_STATS 2048 int dev_type; #ifndef NO_PCAP pcap_t *p; #endif int tx_rate; struct timespec tx_period; int affinity; int main_fd; struct nm_desc *nmd; int report_interval; /* milliseconds between prints */ void *(*td_body)(void *); int td_type; void *mmap_addr; char ifname[MAX_IFNAMELEN]; char *nmr_config; int dummy_send; int virt_header; /* send also the virt_header */ char *packet_file; /* -P option */ #define STATS_WIN 15 int win_idx; int64_t win[STATS_WIN]; int wait_link; int framing; /* #bits of framing (for bw output) */ }; enum dev_type { DEV_NONE, DEV_NETMAP, DEV_PCAP, DEV_TAP }; enum { TD_TYPE_SENDER = 1, TD_TYPE_RECEIVER, TD_TYPE_OTHER, }; /* * Arguments for a new thread. The same structure is used by * the source and the sink */ struct targ { struct glob_arg *g; int used; int completed; int cancel; int fd; struct nm_desc *nmd; /* these ought to be volatile, but they are * only sampled and errors should not accumulate */ struct my_ctrs ctr; struct timespec tic, toc; int me; pthread_t thread; int affinity; struct pkt pkt; void *frame; uint16_t seed[3]; u_int frags; u_int frag_size; }; static __inline uint16_t cksum_add(uint16_t sum, uint16_t a) { uint16_t res; res = sum + a; return (res + (res < a)); } static void extract_ipv4_addr(char *name, uint32_t *addr, uint16_t *port) { struct in_addr a; char *pp; pp = strchr(name, ':'); if (pp != NULL) { /* do we have ports ? */ *pp++ = '\0'; *port = (uint16_t)strtol(pp, NULL, 0); } inet_pton(AF_INET, name, &a); *addr = ntohl(a.s_addr); } static void extract_ipv6_addr(char *name, struct in6_addr *addr, uint16_t *port, uint8_t *group) { char *pp; /* * We accept IPv6 address in the following form: * group@[2001:DB8::1001]:port (w/ brackets and port) * group@[2001:DB8::1] (w/ brackets and w/o port) * group@2001:DB8::1234 (w/o brackets and w/o port) */ pp = strchr(name, '@'); if (pp != NULL) { *pp++ = '\0'; *group = (uint8_t)strtol(name, NULL, 0); if (*group > 7) *group = 7; name = pp; } if (name[0] == '[') name++; pp = strchr(name, ']'); if (pp != NULL) *pp++ = '\0'; if (pp != NULL && *pp != ':') pp = NULL; if (pp != NULL) { /* do we have ports ? */ *pp++ = '\0'; *port = (uint16_t)strtol(pp, NULL, 0); } inet_pton(AF_INET6, name, addr); } /* * extract the extremes from a range of ipv4 addresses. 
* addr_lo[-addr_hi][:port_lo[-port_hi]] */ static int extract_ip_range(struct ip_range *r, int af) { char *name, *ap, start[INET6_ADDRSTRLEN]; char end[INET6_ADDRSTRLEN]; struct in_addr a; uint32_t tmp; if (verbose) D("extract IP range from %s", r->name); name = strdup(r->name); if (name == NULL) { D("strdup failed"); usage(-1); } /* the first - splits start/end of range */ ap = strchr(name, '-'); if (ap != NULL) *ap++ = '\0'; r->port0 = 1234; /* default port */ if (af == AF_INET6) { r->ipv6.sgroup = 7; /* default group */ extract_ipv6_addr(name, &r->ipv6.start, &r->port0, &r->ipv6.sgroup); } else extract_ipv4_addr(name, &r->ipv4.start, &r->port0); r->port1 = r->port0; if (af == AF_INET6) { if (ap != NULL) { r->ipv6.egroup = r->ipv6.sgroup; extract_ipv6_addr(ap, &r->ipv6.end, &r->port1, &r->ipv6.egroup); } else { r->ipv6.end = r->ipv6.start; r->ipv6.egroup = r->ipv6.sgroup; } } else { if (ap != NULL) { extract_ipv4_addr(ap, &r->ipv4.end, &r->port1); if (r->ipv4.start > r->ipv4.end) { tmp = r->ipv4.end; r->ipv4.end = r->ipv4.start; r->ipv4.start = tmp; } } else r->ipv4.end = r->ipv4.start; } if (r->port0 > r->port1) { tmp = r->port0; r->port0 = r->port1; r->port1 = tmp; } if (af == AF_INET) { a.s_addr = htonl(r->ipv4.start); inet_ntop(af, &a, start, sizeof(start)); a.s_addr = htonl(r->ipv4.end); inet_ntop(af, &a, end, sizeof(end)); } else { inet_ntop(af, &r->ipv6.start, start, sizeof(start)); inet_ntop(af, &r->ipv6.end, end, sizeof(end)); } if (af == AF_INET) D("range is %s:%d to %s:%d", start, r->port0, end, r->port1); else D("range is %d@[%s]:%d to %d@[%s]:%d", r->ipv6.sgroup, start, r->port0, r->ipv6.egroup, end, r->port1); free(name); if (r->port0 != r->port1 || (af == AF_INET && r->ipv4.start != r->ipv4.end) || (af == AF_INET6 && !IN6_ARE_ADDR_EQUAL(&r->ipv6.start, &r->ipv6.end))) return (OPT_COPY); return (0); } static int extract_mac_range(struct mac_range *r) { struct ether_addr *e; if (verbose) D("extract MAC range from %s", r->name); e = ether_aton(r->name); if (e == NULL) { D("invalid MAC address '%s'", r->name); return 1; } bcopy(e, &r->start, 6); bcopy(e, &r->end, 6); #if 0 bcopy(targ->src_mac, eh->ether_shost, 6); p = index(targ->g->src_mac, '-'); if (p) targ->src_mac_range = atoi(p+1); bcopy(ether_aton(targ->g->dst_mac), targ->dst_mac, 6); bcopy(targ->dst_mac, eh->ether_dhost, 6); p = index(targ->g->dst_mac, '-'); if (p) targ->dst_mac_range = atoi(p+1); #endif if (verbose) D("%s starts at %s", r->name, ether_ntoa(&r->start)); return 0; } static int get_if_mtu(const struct glob_arg *g) { char ifname[IFNAMSIZ]; struct ifreq ifreq; int s, ret; if (!strncmp(g->ifname, "netmap:", 7) && !strchr(g->ifname, '{') && !strchr(g->ifname, '}')) { /* Parse the interface name and ask the kernel for the * MTU value. */ strncpy(ifname, g->ifname+7, IFNAMSIZ-1); ifname[strcspn(ifname, "-*^{}/@")] = '\0'; s = socket(AF_INET, SOCK_DGRAM, 0); if (s < 0) { D("socket() failed: %s", strerror(errno)); return s; } memset(&ifreq, 0, sizeof(ifreq)); strncpy(ifreq.ifr_name, ifname, IFNAMSIZ); ret = ioctl(s, SIOCGIFMTU, &ifreq); if (ret) { D("ioctl(SIOCGIFMTU) failed: %s", strerror(errno)); } return ifreq.ifr_mtu; } /* This is a pipe or a VALE port, where the MTU is very large, * so we use some practical limit. 
*/ return 65536; } static struct targ *targs; static int global_nthreads; /* control-C handler */ static void sigint_h(int sig) { int i; (void)sig; /* UNUSED */ D("received control-C on thread %p", (void *)pthread_self()); for (i = 0; i < global_nthreads; i++) { targs[i].cancel = 1; } } /* sysctl wrapper to return the number of active CPUs */ static int system_ncpus(void) { int ncpus; #if defined (__FreeBSD__) int mib[2] = { CTL_HW, HW_NCPU }; size_t len = sizeof(mib); sysctl(mib, 2, &ncpus, &len, NULL, 0); #elif defined(linux) ncpus = sysconf(_SC_NPROCESSORS_ONLN); #elif defined(_WIN32) { SYSTEM_INFO sysinfo; GetSystemInfo(&sysinfo); ncpus = sysinfo.dwNumberOfProcessors; } #else /* others */ ncpus = 1; #endif /* others */ return (ncpus); } #ifdef __linux__ #define sockaddr_dl sockaddr_ll #define sdl_family sll_family #define AF_LINK AF_PACKET #define LLADDR(s) s->sll_addr; #include #define TAP_CLONEDEV "/dev/net/tun" #endif /* __linux__ */ #ifdef __FreeBSD__ #include #define TAP_CLONEDEV "/dev/tap" #endif /* __FreeBSD */ #ifdef __APPLE__ // #warning TAP not supported on apple ? #include #define TAP_CLONEDEV "/dev/tap" #endif /* __APPLE__ */ /* * parse the vale configuration in conf and put it in nmr. * Return the flag set if necessary. * The configuration may consist of 1 to 4 numbers separated * by commas: #tx-slots,#rx-slots,#tx-rings,#rx-rings. * Missing numbers or zeroes stand for default values. * As an additional convenience, if exactly one number * is specified, then this is assigned to both #tx-slots and #rx-slots. * If there is no 4th number, then the 3rd is assigned to both #tx-rings * and #rx-rings. */ int parse_nmr_config(const char* conf, struct nmreq *nmr) { char *w, *tok; int i, v; if (conf == NULL || ! *conf) return 0; nmr->nr_tx_rings = nmr->nr_rx_rings = 0; nmr->nr_tx_slots = nmr->nr_rx_slots = 0; w = strdup(conf); for (i = 0, tok = strtok(w, ","); tok; i++, tok = strtok(NULL, ",")) { v = atoi(tok); switch (i) { case 0: nmr->nr_tx_slots = nmr->nr_rx_slots = v; break; case 1: nmr->nr_rx_slots = v; break; case 2: nmr->nr_tx_rings = nmr->nr_rx_rings = v; break; case 3: nmr->nr_rx_rings = v; break; default: D("ignored config: %s", tok); break; } } D("txr %d txd %d rxr %d rxd %d", nmr->nr_tx_rings, nmr->nr_tx_slots, nmr->nr_rx_rings, nmr->nr_rx_slots); free(w); return (nmr->nr_tx_rings || nmr->nr_tx_slots || nmr->nr_rx_rings || nmr->nr_rx_slots) ? NM_OPEN_RING_CFG : 0; } /* * locate the src mac address for our interface, put it * into the user-supplied buffer. return 0 if ok, -1 on error. */ static int source_hwaddr(const char *ifname, char *buf) { struct ifaddrs *ifaphead, *ifap; if (getifaddrs(&ifaphead) != 0) { D("getifaddrs %s failed", ifname); return (-1); } for (ifap = ifaphead; ifap; ifap = ifap->ifa_next) { struct sockaddr_dl *sdl = (struct sockaddr_dl *)ifap->ifa_addr; uint8_t *mac; if (!sdl || sdl->sdl_family != AF_LINK) continue; if (strncmp(ifap->ifa_name, ifname, IFNAMSIZ) != 0) continue; mac = (uint8_t *)LLADDR(sdl); sprintf(buf, "%02x:%02x:%02x:%02x:%02x:%02x", mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]); if (verbose) D("source hwaddr %s", buf); break; } freeifaddrs(ifaphead); return ifap ? 0 : 1; } /* set the thread affinity. 
*/ static int setaffinity(pthread_t me, int i) { cpuset_t cpumask; if (i == -1) return 0; /* Set thread affinity affinity.*/ CPU_ZERO(&cpumask); CPU_SET(i, &cpumask); if (pthread_setaffinity_np(me, sizeof(cpuset_t), &cpumask) != 0) { D("Unable to set affinity: %s", strerror(errno)); return 1; } return 0; } /* Compute the checksum of the given ip header. */ static uint32_t checksum(const void *data, uint16_t len, uint32_t sum) { const uint8_t *addr = data; uint32_t i; /* Checksum all the pairs of bytes first... */ for (i = 0; i < (len & ~1U); i += 2) { sum += (u_int16_t)ntohs(*((u_int16_t *)(addr + i))); if (sum > 0xFFFF) sum -= 0xFFFF; } /* * If there's a single byte left over, checksum it, too. * Network byte order is big-endian, so the remaining byte is * the high byte. */ if (i < len) { sum += addr[i] << 8; if (sum > 0xFFFF) sum -= 0xFFFF; } return sum; } static uint16_t wrapsum(uint32_t sum) { sum = ~sum & 0xFFFF; return (htons(sum)); } /* Check the payload of the packet for errors (use it for debug). * Look for consecutive ascii representations of the size of the packet. */ static void dump_payload(const char *_p, int len, struct netmap_ring *ring, int cur) { char buf[128]; int i, j, i0; const unsigned char *p = (const unsigned char *)_p; /* get the length in ASCII of the length of the packet. */ printf("ring %p cur %5d [buf %6d flags 0x%04x len %5d]\n", ring, cur, ring->slot[cur].buf_idx, ring->slot[cur].flags, len); /* hexdump routine */ for (i = 0; i < len; ) { memset(buf, ' ', sizeof(buf)); sprintf(buf, "%5d: ", i); i0 = i; for (j=0; j < 16 && i < len; i++, j++) sprintf(buf+7+j*3, "%02x ", (uint8_t)(p[i])); i = i0; for (j=0; j < 16 && i < len; i++, j++) sprintf(buf+7+j + 48, "%c", isprint(p[i]) ? p[i] : '.'); printf("%s\n", buf); } } /* * Fill a packet with some payload. * We create a UDP packet so the payload starts at * 14+20+8 = 42 bytes. 
*/ #ifdef __linux__ #define uh_sport source #define uh_dport dest #define uh_ulen len #define uh_sum check #endif /* linux */ static void update_ip(struct pkt *pkt, struct targ *t) { struct glob_arg *g = t->g; struct ip ip; struct udphdr udp; uint32_t oaddr, naddr; uint16_t oport, nport; uint16_t ip_sum, udp_sum; memcpy(&ip, &pkt->ipv4.ip, sizeof(ip)); memcpy(&udp, &pkt->ipv4.udp, sizeof(udp)); do { ip_sum = udp_sum = 0; naddr = oaddr = ntohl(ip.ip_src.s_addr); nport = oport = ntohs(udp.uh_sport); if (g->options & OPT_RANDOM_SRC) { ip.ip_src.s_addr = nrand48(t->seed); udp.uh_sport = nrand48(t->seed); naddr = ntohl(ip.ip_src.s_addr); nport = ntohs(udp.uh_sport); break; } if (oport < g->src_ip.port1) { nport = oport + 1; udp.uh_sport = htons(nport); break; } nport = g->src_ip.port0; udp.uh_sport = htons(nport); if (oaddr < g->src_ip.ipv4.end) { naddr = oaddr + 1; ip.ip_src.s_addr = htonl(naddr); break; } naddr = g->src_ip.ipv4.start; ip.ip_src.s_addr = htonl(naddr); } while (0); /* update checksums if needed */ if (oaddr != naddr) { ip_sum = cksum_add(ip_sum, ~oaddr >> 16); ip_sum = cksum_add(ip_sum, ~oaddr & 0xffff); ip_sum = cksum_add(ip_sum, naddr >> 16); ip_sum = cksum_add(ip_sum, naddr & 0xffff); } if (oport != nport) { udp_sum = cksum_add(udp_sum, ~oport); udp_sum = cksum_add(udp_sum, nport); } do { naddr = oaddr = ntohl(ip.ip_dst.s_addr); nport = oport = ntohs(udp.uh_dport); if (g->options & OPT_RANDOM_DST) { ip.ip_dst.s_addr = nrand48(t->seed); udp.uh_dport = nrand48(t->seed); naddr = ntohl(ip.ip_dst.s_addr); nport = ntohs(udp.uh_dport); break; } if (oport < g->dst_ip.port1) { nport = oport + 1; udp.uh_dport = htons(nport); break; } nport = g->dst_ip.port0; udp.uh_dport = htons(nport); if (oaddr < g->dst_ip.ipv4.end) { naddr = oaddr + 1; ip.ip_dst.s_addr = htonl(naddr); break; } naddr = g->dst_ip.ipv4.start; ip.ip_dst.s_addr = htonl(naddr); } while (0); /* update checksums */ if (oaddr != naddr) { ip_sum = cksum_add(ip_sum, ~oaddr >> 16); ip_sum = cksum_add(ip_sum, ~oaddr & 0xffff); ip_sum = cksum_add(ip_sum, naddr >> 16); ip_sum = cksum_add(ip_sum, naddr & 0xffff); } if (oport != nport) { udp_sum = cksum_add(udp_sum, ~oport); udp_sum = cksum_add(udp_sum, nport); } if (udp_sum != 0) udp.uh_sum = ~cksum_add(~udp.uh_sum, htons(udp_sum)); if (ip_sum != 0) { ip.ip_sum = ~cksum_add(~ip.ip_sum, htons(ip_sum)); udp.uh_sum = ~cksum_add(~udp.uh_sum, htons(ip_sum)); } memcpy(&pkt->ipv4.ip, &ip, sizeof(ip)); memcpy(&pkt->ipv4.udp, &udp, sizeof(udp)); } #ifndef s6_addr16 #define s6_addr16 __u6_addr.__u6_addr16 #endif static void update_ip6(struct pkt *pkt, struct targ *t) { struct glob_arg *g = t->g; struct ip6_hdr ip6; struct udphdr udp; uint16_t udp_sum; uint16_t oaddr, naddr; uint16_t oport, nport; uint8_t group; memcpy(&ip6, &pkt->ipv6.ip, sizeof(ip6)); memcpy(&udp, &pkt->ipv6.udp, sizeof(udp)); do { udp_sum = 0; group = g->src_ip.ipv6.sgroup; naddr = oaddr = ntohs(ip6.ip6_src.s6_addr16[group]); nport = oport = ntohs(udp.uh_sport); if (g->options & OPT_RANDOM_SRC) { ip6.ip6_src.s6_addr16[group] = nrand48(t->seed); udp.uh_sport = nrand48(t->seed); naddr = ntohs(ip6.ip6_src.s6_addr16[group]); nport = ntohs(udp.uh_sport); break; } if (oport < g->src_ip.port1) { nport = oport + 1; udp.uh_sport = htons(nport); break; } nport = g->src_ip.port0; udp.uh_sport = htons(nport); if (oaddr < ntohs(g->src_ip.ipv6.end.s6_addr16[group])) { naddr = oaddr + 1; ip6.ip6_src.s6_addr16[group] = htons(naddr); break; } naddr = ntohs(g->src_ip.ipv6.start.s6_addr16[group]); ip6.ip6_src.s6_addr16[group] = 
htons(naddr); } while (0); /* update checksums if needed */ if (oaddr != naddr) udp_sum = cksum_add(~oaddr, naddr); if (oport != nport) udp_sum = cksum_add(udp_sum, cksum_add(~oport, nport)); do { group = g->dst_ip.ipv6.egroup; naddr = oaddr = ntohs(ip6.ip6_dst.s6_addr16[group]); nport = oport = ntohs(udp.uh_dport); if (g->options & OPT_RANDOM_DST) { ip6.ip6_dst.s6_addr16[group] = nrand48(t->seed); udp.uh_dport = nrand48(t->seed); naddr = ntohs(ip6.ip6_dst.s6_addr16[group]); nport = ntohs(udp.uh_dport); break; } if (oport < g->dst_ip.port1) { nport = oport + 1; udp.uh_dport = htons(nport); break; } nport = g->dst_ip.port0; udp.uh_dport = htons(nport); if (oaddr < ntohs(g->dst_ip.ipv6.end.s6_addr16[group])) { naddr = oaddr + 1; ip6.ip6_dst.s6_addr16[group] = htons(naddr); break; } naddr = ntohs(g->dst_ip.ipv6.start.s6_addr16[group]); ip6.ip6_dst.s6_addr16[group] = htons(naddr); } while (0); /* update checksums */ if (oaddr != naddr) udp_sum = cksum_add(udp_sum, cksum_add(~oaddr, naddr)); if (oport != nport) udp_sum = cksum_add(udp_sum, cksum_add(~oport, nport)); if (udp_sum != 0) udp.uh_sum = ~cksum_add(~udp.uh_sum, udp_sum); memcpy(&pkt->ipv6.ip, &ip6, sizeof(ip6)); memcpy(&pkt->ipv6.udp, &udp, sizeof(udp)); } static void update_addresses(struct pkt *pkt, struct targ *t) { if (t->g->af == AF_INET) update_ip(pkt, t); else update_ip6(pkt, t); } /* * initialize one packet and prepare for the next one. * The copy could be done better instead of repeating it each time. */ static void initialize_packet(struct targ *targ) { struct pkt *pkt = &targ->pkt; struct ether_header *eh; struct ip6_hdr ip6; struct ip ip; struct udphdr udp; void *udp_ptr; uint16_t paylen; uint32_t csum = 0; const char *payload = targ->g->options & OPT_INDIRECT ? indirect_payload : default_payload; int i, l0 = strlen(payload); #ifndef NO_PCAP char errbuf[PCAP_ERRBUF_SIZE]; pcap_t *file; struct pcap_pkthdr *header; const unsigned char *packet; /* Read a packet from a PCAP file if asked. */ if (targ->g->packet_file != NULL) { if ((file = pcap_open_offline(targ->g->packet_file, errbuf)) == NULL) D("failed to open pcap file %s", targ->g->packet_file); if (pcap_next_ex(file, &header, &packet) < 0) D("failed to read packet from %s", targ->g->packet_file); if ((targ->frame = malloc(header->caplen)) == NULL) D("out of memory"); bcopy(packet, (unsigned char *)targ->frame, header->caplen); targ->g->pkt_size = header->caplen; pcap_close(file); return; } #endif paylen = targ->g->pkt_size - sizeof(*eh) - (targ->g->af == AF_INET ? 
		sizeof(ip) : sizeof(ip6));

	/* create a nice NUL-terminated string */
	for (i = 0; i < paylen; i += l0) {
		if (l0 > paylen - i)
			l0 = paylen - i; // last round
		bcopy(payload, PKT(pkt, body, targ->g->af) + i, l0);
	}
	PKT(pkt, body, targ->g->af)[i - 1] = '\0';

	/* prepare the headers */
	eh = &pkt->eh;
	bcopy(&targ->g->src_mac.start, eh->ether_shost, 6);
	bcopy(&targ->g->dst_mac.start, eh->ether_dhost, 6);

	if (targ->g->af == AF_INET) {
		eh->ether_type = htons(ETHERTYPE_IP);
		memcpy(&ip, &pkt->ipv4.ip, sizeof(ip));
		udp_ptr = &pkt->ipv4.udp;
		ip.ip_v = IPVERSION;
		ip.ip_hl = sizeof(ip) >> 2;
		ip.ip_id = 0;
		ip.ip_tos = IPTOS_LOWDELAY;
		ip.ip_len = htons(targ->g->pkt_size - sizeof(*eh));
		ip.ip_off = htons(IP_DF); /* Don't fragment */
		ip.ip_ttl = IPDEFTTL;
		ip.ip_p = IPPROTO_UDP;
		ip.ip_dst.s_addr = htonl(targ->g->dst_ip.ipv4.start);
		ip.ip_src.s_addr = htonl(targ->g->src_ip.ipv4.start);
		ip.ip_sum = wrapsum(checksum(&ip, sizeof(ip), 0));
		memcpy(&pkt->ipv4.ip, &ip, sizeof(ip));
	} else {
		eh->ether_type = htons(ETHERTYPE_IPV6);
		memcpy(&ip6, &pkt->ipv6.ip, sizeof(ip6));
		udp_ptr = &pkt->ipv6.udp;
		ip6.ip6_flow = 0;
		ip6.ip6_plen = htons(paylen);
		ip6.ip6_vfc = IPV6_VERSION;
		ip6.ip6_nxt = IPPROTO_UDP;
		ip6.ip6_hlim = IPV6_DEFHLIM;
		ip6.ip6_src = targ->g->src_ip.ipv6.start;
		ip6.ip6_dst = targ->g->dst_ip.ipv6.start;
	}
	memcpy(&udp, udp_ptr, sizeof(udp));

	udp.uh_sport = htons(targ->g->src_ip.port0);
	udp.uh_dport = htons(targ->g->dst_ip.port0);
	udp.uh_ulen = htons(paylen);
	if (targ->g->af == AF_INET) {
		/* Magic: taken from sbin/dhclient/packet.c */
		udp.uh_sum = wrapsum(
		    checksum(&udp, sizeof(udp),	/* udp header */
		    checksum(pkt->ipv4.body,	/* udp payload */
		    paylen - sizeof(udp),
		    checksum(&pkt->ipv4.ip.ip_src, /* pseudo header */
			2 * sizeof(pkt->ipv4.ip.ip_src),
			IPPROTO_UDP + (u_int32_t)ntohs(udp.uh_ulen)))));
		memcpy(&pkt->ipv4.ip, &ip, sizeof(ip));
	} else {
		/* Save part of pseudo header checksum into csum */
		csum = IPPROTO_UDP << 24;
		csum = checksum(&csum, sizeof(csum), paylen);
		udp.uh_sum = wrapsum(
		    checksum(udp_ptr, sizeof(udp),	/* udp header */
		    checksum(pkt->ipv6.body,	/* udp payload */
		    paylen - sizeof(udp),
		    checksum(&pkt->ipv6.ip.ip6_src, /* pseudo header */
			2 * sizeof(pkt->ipv6.ip.ip6_src), csum))));
		memcpy(&pkt->ipv6.ip, &ip6, sizeof(ip6));
	}
	memcpy(udp_ptr, &udp, sizeof(udp));

	bzero(&pkt->vh, sizeof(pkt->vh));
	// dump_payload((void *)pkt, targ->g->pkt_size, NULL, 0);
}

static void
get_vnet_hdr_len(struct glob_arg *g)
{
	struct nmreq req;
	int err;

	memset(&req, 0, sizeof(req));
	bcopy(g->nmd->req.nr_name, req.nr_name, sizeof(req.nr_name));
	req.nr_version = NETMAP_API;
	req.nr_cmd = NETMAP_VNET_HDR_GET;
	err = ioctl(g->main_fd, NIOCREGIF, &req);
	if (err) {
		D("Unable to get virtio-net header length");
		return;
	}

	g->virt_header = req.nr_arg1;
	if (g->virt_header) {
		D("Port requires virtio-net header, length = %d",
		  g->virt_header);
	}
}

static void
set_vnet_hdr_len(struct glob_arg *g)
{
	int err, l = g->virt_header;
	struct nmreq req;

	if (l == 0)
		return;

	memset(&req, 0, sizeof(req));
	bcopy(g->nmd->req.nr_name, req.nr_name, sizeof(req.nr_name));
	req.nr_version = NETMAP_API;
	req.nr_cmd = NETMAP_BDG_VNET_HDR;
	req.nr_arg1 = l;
	err = ioctl(g->main_fd, NIOCREGIF, &req);
	if (err) {
		D("Unable to set virtio-net header length %d", l);
	}
}

/*
 * create and enqueue a batch of packets on a ring.
 * On the last one set NS_REPORT to tell the driver to generate
 * an interrupt when done.
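 *
 * The loop below is an instance of the usual netmap TX idiom; a
 * minimal sketch of it (single fragment, copy mode) looks like:
 *
 *	u_int head = ring->head;
 *	while (count-- > 0 && nm_ring_space(ring) > 0) {
 *		struct netmap_slot *slot = &ring->slot[head];
 *		nm_pkt_copy(frame, NETMAP_BUF(ring, slot->buf_idx), size);
 *		slot->len = size;
 *		head = nm_ring_next(ring, head);
 *	}
 *	ring->head = ring->cur = head;	// publish the slots to the kernel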
*/ static int send_packets(struct netmap_ring *ring, struct pkt *pkt, void *frame, int size, struct targ *t, u_int count, int options) { u_int n, sent, head = ring->head; u_int frags = t->frags; u_int frag_size = t->frag_size; struct netmap_slot *slot = &ring->slot[head]; n = nm_ring_space(ring); #if 0 if (options & (OPT_COPY | OPT_PREFETCH) ) { for (sent = 0; sent < count; sent++) { struct netmap_slot *slot = &ring->slot[head]; char *p = NETMAP_BUF(ring, slot->buf_idx); __builtin_prefetch(p); head = nm_ring_next(ring, head); } head = ring->head; } #endif for (sent = 0; sent < count && n >= frags; sent++, n--) { char *p; int buf_changed; u_int tosend = size; slot = &ring->slot[head]; p = NETMAP_BUF(ring, slot->buf_idx); buf_changed = slot->flags & NS_BUF_CHANGED; slot->flags = 0; if (options & OPT_RUBBISH) { /* do nothing */ } else if (options & OPT_INDIRECT) { slot->flags |= NS_INDIRECT; slot->ptr = (uint64_t)((uintptr_t)frame); } else if (frags > 1) { u_int i; const char *f = frame; char *fp = p; for (i = 0; i < frags - 1; i++) { memcpy(fp, f, frag_size); slot->len = frag_size; slot->flags = NS_MOREFRAG; if (options & OPT_DUMP) dump_payload(fp, frag_size, ring, head); tosend -= frag_size; f += frag_size; head = nm_ring_next(ring, head); slot = &ring->slot[head]; fp = NETMAP_BUF(ring, slot->buf_idx); } n -= (frags - 1); p = fp; slot->flags = 0; memcpy(p, f, tosend); update_addresses(pkt, t); } else if ((options & (OPT_COPY | OPT_MEMCPY)) || buf_changed) { if (options & OPT_COPY) nm_pkt_copy(frame, p, size); else memcpy(p, frame, size); update_addresses(pkt, t); } else if (options & OPT_PREFETCH) { __builtin_prefetch(p); } slot->len = tosend; if (options & OPT_DUMP) dump_payload(p, tosend, ring, head); head = nm_ring_next(ring, head); } if (sent) { slot->flags |= NS_REPORT; ring->head = ring->cur = head; } if (sent < count) { /* tell netmap that we need more slots */ ring->cur = ring->tail; } return (sent); } /* * Index of the highest bit set */ uint32_t msb64(uint64_t x) { uint64_t m = 1ULL << 63; int i; for (i = 63; i >= 0; i--, m >>=1) if (m & x) return i; return 0; } /* * wait until ts, either busy or sleeping if more than 1ms. * Return wakeup time. */ static struct timespec wait_time(struct timespec ts) { for (;;) { struct timespec w, cur; clock_gettime(CLOCK_REALTIME_PRECISE, &cur); w = timespec_sub(ts, cur); if (w.tv_sec < 0) return cur; else if (w.tv_sec > 0 || w.tv_nsec > 1000000) poll(NULL, 0, 1); } } /* * Send a packet, and wait for a response. * The payload (after UDP header, ofs 42) has a 4-byte sequence * followed by a struct timeval (or bintime?) 
*/ static void * ping_body(void *data) { struct targ *targ = (struct targ *) data; struct pollfd pfd = { .fd = targ->fd, .events = POLLIN }; struct netmap_if *nifp = targ->nmd->nifp; int i, m, rx = 0; void *frame; int size; struct timespec ts, now, last_print; struct timespec nexttime = {0, 0}; /* silence compiler */ uint64_t sent = 0, n = targ->g->npackets; uint64_t count = 0, t_cur, t_min = ~0, av = 0; uint64_t g_min = ~0, g_av = 0; uint64_t buckets[64]; /* bins for delays, ns */ int rate_limit = targ->g->tx_rate, tosend = 0; frame = (char*)&targ->pkt + sizeof(targ->pkt.vh) - targ->g->virt_header; size = targ->g->pkt_size + targ->g->virt_header; if (targ->g->nthreads > 1) { D("can only ping with 1 thread"); return NULL; } bzero(&buckets, sizeof(buckets)); clock_gettime(CLOCK_REALTIME_PRECISE, &last_print); now = last_print; if (rate_limit) { targ->tic = timespec_add(now, (struct timespec){2,0}); targ->tic.tv_nsec = 0; wait_time(targ->tic); nexttime = targ->tic; } while (!targ->cancel && (n == 0 || sent < n)) { struct netmap_ring *ring = NETMAP_TXRING(nifp, targ->nmd->first_tx_ring); struct netmap_slot *slot; char *p; int rv; uint64_t limit, event = 0; if (rate_limit && tosend <= 0) { tosend = targ->g->burst; nexttime = timespec_add(nexttime, targ->g->tx_period); wait_time(nexttime); } limit = rate_limit ? tosend : targ->g->burst; if (n > 0 && n - sent < limit) limit = n - sent; for (m = 0; (unsigned)m < limit; m++) { slot = &ring->slot[ring->head]; slot->len = size; p = NETMAP_BUF(ring, slot->buf_idx); if (nm_ring_empty(ring)) { D("-- ouch, cannot send"); break; } else { struct tstamp *tp; nm_pkt_copy(frame, p, size); clock_gettime(CLOCK_REALTIME_PRECISE, &ts); bcopy(&sent, p+42, sizeof(sent)); tp = (struct tstamp *)(p+46); tp->sec = (uint32_t)ts.tv_sec; tp->nsec = (uint32_t)ts.tv_nsec; sent++; ring->head = ring->cur = nm_ring_next(ring, ring->head); } } if (m > 0) event++; targ->ctr.pkts = sent; targ->ctr.bytes = sent*size; targ->ctr.events = event; if (rate_limit) tosend -= m; #ifdef BUSYWAIT rv = ioctl(pfd.fd, NIOCTXSYNC, NULL); if (rv < 0) { D("TXSYNC error on queue %d: %s", targ->me, strerror(errno)); } again: ioctl(pfd.fd, NIOCRXSYNC, NULL); #else /* should use a parameter to decide how often to send */ if ( (rv = poll(&pfd, 1, 3000)) <= 0) { D("poll error on queue %d: %s", targ->me, (rv ? 
			    strerror(errno) : "timeout"));
			continue;
		}
#endif /* BUSYWAIT */
		/* see what we got back */
		rx = 0;
		for (i = targ->nmd->first_rx_ring;
			i <= targ->nmd->last_rx_ring; i++) {
			ring = NETMAP_RXRING(nifp, i);
			while (!nm_ring_empty(ring)) {
				uint32_t seq;
				struct tstamp *tp;
				int pos;

				slot = &ring->slot[ring->head];
				p = NETMAP_BUF(ring, slot->buf_idx);

				clock_gettime(CLOCK_REALTIME_PRECISE, &now);
				bcopy(p+42, &seq, sizeof(seq));
				tp = (struct tstamp *)(p+46);
				ts.tv_sec = (time_t)tp->sec;
				ts.tv_nsec = (long)tp->nsec;
				ts.tv_sec = now.tv_sec - ts.tv_sec;
				ts.tv_nsec = now.tv_nsec - ts.tv_nsec;
				if (ts.tv_nsec < 0) {
					ts.tv_nsec += 1000000000;
					ts.tv_sec--;
				}
				if (0) D("seq %d/%llu delta %d.%09d", seq,
					(unsigned long long)sent,
					(int)ts.tv_sec, (int)ts.tv_nsec);
				t_cur = ts.tv_sec * 1000000000UL + ts.tv_nsec;
				if (t_cur < t_min)
					t_min = t_cur;
				count++;
				av += t_cur;
				pos = msb64(t_cur);
				buckets[pos]++; /* now store it in a bucket */
				ring->head = ring->cur =
					nm_ring_next(ring, ring->head);
				rx++;
			}
		}
		//D("tx %d rx %d", sent, rx);
		//usleep(100000);
		ts.tv_sec = now.tv_sec - last_print.tv_sec;
		ts.tv_nsec = now.tv_nsec - last_print.tv_nsec;
		if (ts.tv_nsec < 0) {
			ts.tv_nsec += 1000000000;
			ts.tv_sec--;
		}
		if (ts.tv_sec >= 1) {
			D("count %d RTT: min %d av %d ns",
				(int)count, (int)t_min, (int)(av/count));
			int k, j, kmin, off;
			char buf[512];

			for (kmin = 0; kmin < 64; kmin++)
				if (buckets[kmin])
					break;
			for (k = 63; k >= kmin; k--)
				if (buckets[k])
					break;
			buf[0] = '\0';
			off = 0;
			for (j = kmin; j <= k; j++) {
				off += sprintf(buf + off, " %5d", (int)buckets[j]);
			}
			D("k: %d .. %d\n\t%s", 1<<kmin, 1<<k, buf);
			/* reset the per-interval stats, fold them into the totals */
			bzero(&buckets, sizeof(buckets));
			count = 0;
			g_av += av;
			av = 0;
			if (t_min < g_min)
				g_min = t_min;
			t_min = ~0;
			last_print = now;
		}
#ifdef BUSYWAIT
		if (rx < m && !targ->cancel)
			goto again;
#endif /* BUSYWAIT */
	}

	if (sent > 0) {
		D("RTT over %llu packets: min %d av %d ns",
			(long long unsigned)sent, (int)g_min,
			(int)((double)g_av/sent));
	}
	targ->completed = 1;

	/* reset the ``used`` flag. */
	targ->used = 0;

	return NULL;
}

/*
 * reply to ping requests
 */
static void *
pong_body(void *data)
{
	struct targ *targ = (struct targ *) data;
	struct pollfd pfd = { .fd = targ->fd, .events = POLLIN };
	struct netmap_if *nifp = targ->nmd->nifp;
	struct netmap_ring *txring, *rxring;
	int i, rx = 0;
	uint64_t sent = 0, n = targ->g->npackets;

	if (targ->g->nthreads > 1) {
		D("can only reply ping with 1 thread");
		return NULL;
	}
	if (n > 0)
		D("understood ponger %llu but don't know how to do it",
			(unsigned long long)n);

	while (!targ->cancel && (n == 0 || sent < n)) {
		uint32_t txhead, txavail;
//#define BUSYWAIT
#ifdef BUSYWAIT
		ioctl(pfd.fd, NIOCRXSYNC, NULL);
#else
		int rv;
		if ( (rv = poll(&pfd, 1, 1000)) <= 0) {
			D("poll error on queue %d: %s", targ->me,
				rv ? strerror(errno) : "timeout");
			continue;
		}
#endif
		txring = NETMAP_TXRING(nifp, targ->nmd->first_tx_ring);
		txhead = txring->head;
		txavail = nm_ring_space(txring);
		/* see what we got back */
		for (i = targ->nmd->first_rx_ring;
			i <= targ->nmd->last_rx_ring; i++) {
			rxring = NETMAP_RXRING(nifp, i);
			while (!nm_ring_empty(rxring)) {
				uint16_t *spkt, *dpkt;
				uint32_t head = rxring->head;
				struct netmap_slot *slot = &rxring->slot[head];
				char *src, *dst;
				src = NETMAP_BUF(rxring, slot->buf_idx);
				//D("got pkt %p of size %d", src, slot->len);
				rxring->head = rxring->cur =
					nm_ring_next(rxring, head);
				rx++;
				if (txavail == 0)
					continue;
				dst = NETMAP_BUF(txring,
				    txring->slot[txhead].buf_idx);
				/* copy...
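 * The first 12 bytes of the frame are six 16-bit words: dst MAC
 * in words 0..2 and src MAC in words 3..5, which is why the swap
 * below moves spkt[3..5] into dpkt[0..2] and vice versa.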
*/ dpkt = (uint16_t *)dst; spkt = (uint16_t *)src; nm_pkt_copy(src, dst, slot->len); /* swap source and destination MAC */ dpkt[0] = spkt[3]; dpkt[1] = spkt[4]; dpkt[2] = spkt[5]; dpkt[3] = spkt[0]; dpkt[4] = spkt[1]; dpkt[5] = spkt[2]; txring->slot[txhead].len = slot->len; txhead = nm_ring_next(txring, txhead); txavail--; sent++; } } txring->head = txring->cur = txhead; targ->ctr.pkts = sent; #ifdef BUSYWAIT ioctl(pfd.fd, NIOCTXSYNC, NULL); #endif //D("tx %d rx %d", sent, rx); } targ->completed = 1; /* reset the ``used`` flag. */ targ->used = 0; return NULL; } static void * sender_body(void *data) { struct targ *targ = (struct targ *) data; struct pollfd pfd = { .fd = targ->fd, .events = POLLOUT }; struct netmap_if *nifp; struct netmap_ring *txring = NULL; int i; uint64_t n = targ->g->npackets / targ->g->nthreads; uint64_t sent = 0; uint64_t event = 0; int options = targ->g->options | OPT_COPY; struct timespec nexttime = { 0, 0}; // XXX silence compiler int rate_limit = targ->g->tx_rate; struct pkt *pkt = &targ->pkt; void *frame; int size; if (targ->frame == NULL) { frame = (char *)pkt + sizeof(pkt->vh) - targ->g->virt_header; size = targ->g->pkt_size + targ->g->virt_header; } else { frame = targ->frame; size = targ->g->pkt_size; } D("start, fd %d main_fd %d", targ->fd, targ->g->main_fd); if (setaffinity(targ->thread, targ->affinity)) goto quit; /* main loop.*/ clock_gettime(CLOCK_REALTIME_PRECISE, &targ->tic); if (rate_limit) { targ->tic = timespec_add(targ->tic, (struct timespec){2,0}); targ->tic.tv_nsec = 0; wait_time(targ->tic); nexttime = targ->tic; } if (targ->g->dev_type == DEV_TAP) { D("writing to file desc %d", targ->g->main_fd); for (i = 0; !targ->cancel && (n == 0 || sent < n); i++) { if (write(targ->g->main_fd, frame, size) != -1) sent++; update_addresses(pkt, targ); if (i > 10000) { targ->ctr.pkts = sent; targ->ctr.bytes = sent*size; targ->ctr.events = sent; i = 0; } } #ifndef NO_PCAP } else if (targ->g->dev_type == DEV_PCAP) { pcap_t *p = targ->g->p; for (i = 0; !targ->cancel && (n == 0 || sent < n); i++) { if (pcap_inject(p, frame, size) != -1) sent++; update_addresses(pkt, targ); if (i > 10000) { targ->ctr.pkts = sent; targ->ctr.bytes = sent*size; targ->ctr.events = sent; i = 0; } } #endif /* NO_PCAP */ } else { int tosend = 0; u_int bufsz, frag_size = targ->g->frag_size; nifp = targ->nmd->nifp; txring = NETMAP_TXRING(nifp, targ->nmd->first_tx_ring); bufsz = txring->nr_buf_size; if (bufsz < frag_size) frag_size = bufsz; targ->frag_size = targ->g->pkt_size / targ->frags; if (targ->frag_size > frag_size) { targ->frags = targ->g->pkt_size / frag_size; targ->frag_size = frag_size; if (targ->g->pkt_size % frag_size != 0) targ->frags++; } D("frags %u frag_size %u", targ->frags, targ->frag_size); while (!targ->cancel && (n == 0 || sent < n)) { int rv; if (rate_limit && tosend <= 0) { tosend = targ->g->burst; nexttime = timespec_add(nexttime, targ->g->tx_period); wait_time(nexttime); } /* * wait for available room in the send queue(s) */ #ifdef BUSYWAIT (void)rv; if (ioctl(pfd.fd, NIOCTXSYNC, NULL) < 0) { D("ioctl error on queue %d: %s", targ->me, strerror(errno)); goto quit; } #else /* !BUSYWAIT */ if ( (rv = poll(&pfd, 1, 2000)) <= 0) { if (targ->cancel) break; D("poll error on queue %d: %s", targ->me, rv ? 
strerror(errno) : "timeout"); // goto quit; } if (pfd.revents & POLLERR) { D("poll error on %d ring %d-%d", pfd.fd, targ->nmd->first_tx_ring, targ->nmd->last_tx_ring); goto quit; } #endif /* !BUSYWAIT */ /* * scan our queues and send on those with room */ if (options & OPT_COPY && sent > 100000 && !(targ->g->options & OPT_COPY) ) { D("drop copy"); options &= ~OPT_COPY; } for (i = targ->nmd->first_tx_ring; i <= targ->nmd->last_tx_ring; i++) { int m; uint64_t limit = rate_limit ? tosend : targ->g->burst; if (n > 0 && n == sent) break; if (n > 0 && n - sent < limit) limit = n - sent; txring = NETMAP_TXRING(nifp, i); if (nm_ring_empty(txring)) continue; if (targ->g->pkt_min_size > 0) { size = nrand48(targ->seed) % (targ->g->pkt_size - targ->g->pkt_min_size) + targ->g->pkt_min_size; } m = send_packets(txring, pkt, frame, size, targ, limit, options); ND("limit %lu tail %d m %d", limit, txring->tail, m); sent += m; if (m > 0) //XXX-ste: can m be 0? event++; targ->ctr.pkts = sent; targ->ctr.bytes += m*size; targ->ctr.events = event; if (rate_limit) { tosend -= m; if (tosend <= 0) break; } } } /* flush any remaining packets */ if (txring != NULL) { D("flush tail %d head %d on thread %p", txring->tail, txring->head, (void *)pthread_self()); ioctl(pfd.fd, NIOCTXSYNC, NULL); } /* final part: wait all the TX queues to be empty. */ for (i = targ->nmd->first_tx_ring; i <= targ->nmd->last_tx_ring; i++) { txring = NETMAP_TXRING(nifp, i); while (!targ->cancel && nm_tx_pending(txring)) { RD(5, "pending tx tail %d head %d on ring %d", txring->tail, txring->head, i); ioctl(pfd.fd, NIOCTXSYNC, NULL); usleep(1); /* wait 1 tick */ } } } /* end DEV_NETMAP */ clock_gettime(CLOCK_REALTIME_PRECISE, &targ->toc); targ->completed = 1; targ->ctr.pkts = sent; targ->ctr.bytes = sent*size; targ->ctr.events = event; quit: /* reset the ``used`` flag. */ targ->used = 0; return (NULL); } #ifndef NO_PCAP static void receive_pcap(u_char *user, const struct pcap_pkthdr * h, const u_char * bytes) { struct my_ctrs *ctr = (struct my_ctrs *)user; (void)bytes; /* UNUSED */ ctr->bytes += h->len; ctr->pkts++; } #endif /* !NO_PCAP */ static int receive_packets(struct netmap_ring *ring, u_int limit, int dump, uint64_t *bytes) { u_int head, rx, n; uint64_t b = 0; u_int complete = 0; if (bytes == NULL) bytes = &b; head = ring->head; n = nm_ring_space(ring); if (n < limit) limit = n; for (rx = 0; rx < limit; rx++) { struct netmap_slot *slot = &ring->slot[head]; char *p = NETMAP_BUF(ring, slot->buf_idx); *bytes += slot->len; if (dump) dump_payload(p, slot->len, ring, head); if (!(slot->flags & NS_MOREFRAG)) complete++; head = nm_ring_next(ring, head); } ring->head = ring->cur = head; return (complete); } static void * receiver_body(void *data) { struct targ *targ = (struct targ *) data; struct pollfd pfd = { .fd = targ->fd, .events = POLLIN }; struct netmap_if *nifp; struct netmap_ring *rxring; int i; struct my_ctrs cur; memset(&cur, 0, sizeof(cur)); if (setaffinity(targ->thread, targ->affinity)) goto quit; D("reading from %s fd %d main_fd %d", targ->g->ifname, targ->fd, targ->g->main_fd); /* unbounded wait for the first packet. 
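 * targ->tic is taken only after this loop, so the rates computed
 * at the end measure from the first received packet rather than
 * from thread start.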
*/ for (;!targ->cancel;) { i = poll(&pfd, 1, 1000); if (i > 0 && !(pfd.revents & POLLERR)) break; if (i < 0) { D("poll() error: %s", strerror(errno)); goto quit; } if (pfd.revents & POLLERR) { D("fd error"); goto quit; } RD(1, "waiting for initial packets, poll returns %d %d", i, pfd.revents); } /* main loop, exit after 1s silence */ clock_gettime(CLOCK_REALTIME_PRECISE, &targ->tic); if (targ->g->dev_type == DEV_TAP) { while (!targ->cancel) { char buf[MAX_BODYSIZE]; /* XXX should we poll ? */ i = read(targ->g->main_fd, buf, sizeof(buf)); if (i > 0) { targ->ctr.pkts++; targ->ctr.bytes += i; targ->ctr.events++; } } #ifndef NO_PCAP } else if (targ->g->dev_type == DEV_PCAP) { while (!targ->cancel) { /* XXX should we poll ? */ pcap_dispatch(targ->g->p, targ->g->burst, receive_pcap, (u_char *)&targ->ctr); targ->ctr.events++; } #endif /* !NO_PCAP */ } else { int dump = targ->g->options & OPT_DUMP; nifp = targ->nmd->nifp; while (!targ->cancel) { /* Once we started to receive packets, wait at most 1 seconds before quitting. */ #ifdef BUSYWAIT if (ioctl(pfd.fd, NIOCRXSYNC, NULL) < 0) { D("ioctl error on queue %d: %s", targ->me, strerror(errno)); goto quit; } #else /* !BUSYWAIT */ if (poll(&pfd, 1, 1 * 1000) <= 0 && !targ->g->forever) { clock_gettime(CLOCK_REALTIME_PRECISE, &targ->toc); targ->toc.tv_sec -= 1; /* Subtract timeout time. */ goto out; } if (pfd.revents & POLLERR) { D("poll err"); goto quit; } #endif /* !BUSYWAIT */ uint64_t cur_space = 0; for (i = targ->nmd->first_rx_ring; i <= targ->nmd->last_rx_ring; i++) { int m; rxring = NETMAP_RXRING(nifp, i); /* compute free space in the ring */ m = rxring->head + rxring->num_slots - rxring->tail; if (m >= (int) rxring->num_slots) m -= rxring->num_slots; cur_space += m; if (nm_ring_empty(rxring)) continue; m = receive_packets(rxring, targ->g->burst, dump, &cur.bytes); cur.pkts += m; if (m > 0) cur.events++; } cur.min_space = targ->ctr.min_space; if (cur_space < cur.min_space) cur.min_space = cur_space; targ->ctr = cur; } } clock_gettime(CLOCK_REALTIME_PRECISE, &targ->toc); #if !defined(BUSYWAIT) out: #endif targ->completed = 1; targ->ctr = cur; quit: /* reset the ``used`` flag. */ targ->used = 0; return (NULL); } static void * txseq_body(void *data) { struct targ *targ = (struct targ *) data; struct pollfd pfd = { .fd = targ->fd, .events = POLLOUT }; struct netmap_ring *ring; int64_t sent = 0; uint64_t event = 0; int options = targ->g->options | OPT_COPY; struct timespec nexttime = {0, 0}; int rate_limit = targ->g->tx_rate; struct pkt *pkt = &targ->pkt; int frags = targ->g->frags; uint32_t sequence = 0; int budget = 0; void *frame; int size; if (targ->g->nthreads > 1) { D("can only txseq ping with 1 thread"); return NULL; } if (targ->g->npackets > 0) { D("Ignoring -n argument"); } frame = (char *)pkt + sizeof(pkt->vh) - targ->g->virt_header; size = targ->g->pkt_size + targ->g->virt_header; D("start, fd %d main_fd %d", targ->fd, targ->g->main_fd); if (setaffinity(targ->thread, targ->affinity)) goto quit; clock_gettime(CLOCK_REALTIME_PRECISE, &targ->tic); if (rate_limit) { targ->tic = timespec_add(targ->tic, (struct timespec){2,0}); targ->tic.tv_nsec = 0; wait_time(targ->tic); nexttime = targ->tic; } /* Only use the first queue. 
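 * txseq is restricted to a single thread (checked above), and it
 * emits one global stream of sequence numbers, so everything goes
 * through the first TX ring.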
*/ ring = NETMAP_TXRING(targ->nmd->nifp, targ->nmd->first_tx_ring); while (!targ->cancel) { int64_t limit; unsigned int space; unsigned int head; int fcnt; uint16_t sum = 0; int rv; if (!rate_limit) { budget = targ->g->burst; } else if (budget <= 0) { budget = targ->g->burst; nexttime = timespec_add(nexttime, targ->g->tx_period); wait_time(nexttime); } /* wait for available room in the send queue */ #ifdef BUSYWAIT (void)rv; if (ioctl(pfd.fd, NIOCTXSYNC, NULL) < 0) { D("ioctl error on queue %d: %s", targ->me, strerror(errno)); goto quit; } #else /* !BUSYWAIT */ if ( (rv = poll(&pfd, 1, 2000)) <= 0) { if (targ->cancel) break; D("poll error on queue %d: %s", targ->me, rv ? strerror(errno) : "timeout"); // goto quit; } if (pfd.revents & POLLERR) { D("poll error on %d ring %d-%d", pfd.fd, targ->nmd->first_tx_ring, targ->nmd->last_tx_ring); goto quit; } #endif /* !BUSYWAIT */ /* If no room poll() again. */ space = nm_ring_space(ring); if (!space) { continue; } limit = budget; if (space < limit) { limit = space; } /* Cut off ``limit`` to make sure is multiple of ``frags``. */ if (frags > 1) { limit = (limit / frags) * frags; } limit = sent + limit; /* Convert to absolute. */ for (fcnt = frags, head = ring->head; sent < limit; sent++, sequence++) { struct netmap_slot *slot = &ring->slot[head]; char *p = NETMAP_BUF(ring, slot->buf_idx); uint16_t *w = (uint16_t *)PKT(pkt, body, targ->g->af), t; memcpy(&sum, targ->g->af == AF_INET ? &pkt->ipv4.udp.uh_sum : &pkt->ipv6.udp.uh_sum, sizeof(sum)); slot->flags = 0; t = *w; PKT(pkt, body, targ->g->af)[0] = sequence >> 24; PKT(pkt, body, targ->g->af)[1] = (sequence >> 16) & 0xff; sum = ~cksum_add(~sum, cksum_add(~t, *w)); t = *++w; PKT(pkt, body, targ->g->af)[2] = (sequence >> 8) & 0xff; PKT(pkt, body, targ->g->af)[3] = sequence & 0xff; sum = ~cksum_add(~sum, cksum_add(~t, *w)); memcpy(targ->g->af == AF_INET ? &pkt->ipv4.udp.uh_sum : &pkt->ipv6.udp.uh_sum, &sum, sizeof(sum)); nm_pkt_copy(frame, p, size); if (fcnt == frags) { update_addresses(pkt, targ); } if (options & OPT_DUMP) { dump_payload(p, size, ring, head); } slot->len = size; if (--fcnt > 0) { slot->flags |= NS_MOREFRAG; } else { fcnt = frags; } if (sent == limit - 1) { /* Make sure we don't push an incomplete * packet. */ assert(!(slot->flags & NS_MOREFRAG)); slot->flags |= NS_REPORT; } head = nm_ring_next(ring, head); if (rate_limit) { budget--; } } ring->cur = ring->head = head; event ++; targ->ctr.pkts = sent; targ->ctr.bytes = sent * size; targ->ctr.events = event; } /* flush any remaining packets */ D("flush tail %d head %d on thread %p", ring->tail, ring->head, (void *)pthread_self()); ioctl(pfd.fd, NIOCTXSYNC, NULL); /* final part: wait the TX queues to become empty. */ while (!targ->cancel && nm_tx_pending(ring)) { RD(5, "pending tx tail %d head %d on ring %d", ring->tail, ring->head, targ->nmd->first_tx_ring); ioctl(pfd.fd, NIOCTXSYNC, NULL); usleep(1); /* wait 1 tick */ } clock_gettime(CLOCK_REALTIME_PRECISE, &targ->toc); targ->completed = 1; targ->ctr.pkts = sent; targ->ctr.bytes = sent * size; targ->ctr.events = event; quit: /* reset the ``used`` flag. 
*/ targ->used = 0; return (NULL); } static char * multi_slot_to_string(struct netmap_ring *ring, unsigned int head, unsigned int nfrags, char *strbuf, size_t strbuflen) { unsigned int f; char *ret = strbuf; for (f = 0; f < nfrags; f++) { struct netmap_slot *slot = &ring->slot[head]; int m = snprintf(strbuf, strbuflen, "|%u,%x|", slot->len, slot->flags); if (m >= (int)strbuflen) { break; } strbuf += m; strbuflen -= m; head = nm_ring_next(ring, head); } return ret; } static void * rxseq_body(void *data) { struct targ *targ = (struct targ *) data; struct pollfd pfd = { .fd = targ->fd, .events = POLLIN }; int dump = targ->g->options & OPT_DUMP; struct netmap_ring *ring; unsigned int frags_exp = 1; struct my_ctrs cur; unsigned int frags = 0; int first_packet = 1; int first_slot = 1; int i, j, af, nrings; uint32_t seq, *seq_exp = NULL; memset(&cur, 0, sizeof(cur)); if (setaffinity(targ->thread, targ->affinity)) goto quit; nrings = targ->nmd->last_rx_ring - targ->nmd->first_rx_ring + 1; seq_exp = calloc(nrings, sizeof(uint32_t)); if (seq_exp == NULL) { D("failed to allocate seq array"); goto quit; } D("reading from %s fd %d main_fd %d", targ->g->ifname, targ->fd, targ->g->main_fd); /* unbounded wait for the first packet. */ for (;!targ->cancel;) { i = poll(&pfd, 1, 1000); if (i > 0 && !(pfd.revents & POLLERR)) break; RD(1, "waiting for initial packets, poll returns %d %d", i, pfd.revents); } clock_gettime(CLOCK_REALTIME_PRECISE, &targ->tic); while (!targ->cancel) { unsigned int head; int limit; #ifdef BUSYWAIT if (ioctl(pfd.fd, NIOCRXSYNC, NULL) < 0) { D("ioctl error on queue %d: %s", targ->me, strerror(errno)); goto quit; } #else /* !BUSYWAIT */ if (poll(&pfd, 1, 1 * 1000) <= 0 && !targ->g->forever) { clock_gettime(CLOCK_REALTIME_PRECISE, &targ->toc); targ->toc.tv_sec -= 1; /* Subtract timeout time. */ goto out; } if (pfd.revents & POLLERR) { D("poll err"); goto quit; } #endif /* !BUSYWAIT */ for (j = targ->nmd->first_rx_ring; j <= targ->nmd->last_rx_ring; j++) { ring = NETMAP_RXRING(targ->nmd->nifp, j); if (nm_ring_empty(ring)) continue; limit = nm_ring_space(ring); if (limit > targ->g->burst) limit = targ->g->burst; #if 0 /* Enable this if * 1) we remove the early-return optimization from * the netmap poll implementation, or * 2) pipes get NS_MOREFRAG support. * With the current netmap implementation, an experiment like * pkt-gen -i vale:1{1 -f txseq -F 9 * pkt-gen -i vale:1}1 -f rxseq * would get stuck as soon as we find nm_ring_space(ring) < 9, * since here limit is rounded to 0 and * pipe rxsync is not called anymore by the poll() of this loop. */ if (frags_exp > 1) { int o = limit; /* Cut off to the closest smaller multiple. 
*/ limit = (limit / frags_exp) * frags_exp; RD(2, "LIMIT %d --> %d", o, limit); } #endif for (head = ring->head, i = 0; i < limit; i++) { struct netmap_slot *slot = &ring->slot[head]; char *p = NETMAP_BUF(ring, slot->buf_idx); int len = slot->len; struct pkt *pkt; if (dump) { dump_payload(p, slot->len, ring, head); } frags++; if (!(slot->flags & NS_MOREFRAG)) { if (first_packet) { first_packet = 0; } else if (frags != frags_exp) { char prbuf[512]; RD(1, "Received packets with %u frags, " "expected %u, '%s'", frags, frags_exp, multi_slot_to_string(ring, head-frags+1, frags, prbuf, sizeof(prbuf))); } first_packet = 0; frags_exp = frags; frags = 0; } p -= sizeof(pkt->vh) - targ->g->virt_header; len += sizeof(pkt->vh) - targ->g->virt_header; pkt = (struct pkt *)p; if (ntohs(pkt->eh.ether_type) == ETHERTYPE_IP) af = AF_INET; else af = AF_INET6; if ((char *)pkt + len < ((char *)PKT(pkt, body, af)) + sizeof(seq)) { RD(1, "%s: packet too small (len=%u)", __func__, slot->len); } else { seq = (PKT(pkt, body, af)[0] << 24) | (PKT(pkt, body, af)[1] << 16) | (PKT(pkt, body, af)[2] << 8) | PKT(pkt, body, af)[3]; if (first_slot) { /* Grab the first one, whatever it is. */ seq_exp[j] = seq; first_slot = 0; } else if (seq != seq_exp[j]) { uint32_t delta = seq - seq_exp[j]; if (delta < (0xFFFFFFFF >> 1)) { RD(2, "Sequence GAP: exp %u found %u", seq_exp[j], seq); } else { RD(2, "Sequence OUT OF ORDER: " "exp %u found %u", seq_exp[j], seq); } seq_exp[j] = seq; } seq_exp[j]++; } cur.bytes += slot->len; head = nm_ring_next(ring, head); cur.pkts++; } ring->cur = ring->head = head; cur.events++; targ->ctr = cur; } } clock_gettime(CLOCK_REALTIME_PRECISE, &targ->toc); #ifndef BUSYWAIT out: #endif /* !BUSYWAIT */ targ->completed = 1; targ->ctr = cur; quit: if (seq_exp != NULL) free(seq_exp); /* reset the ``used`` flag. */ targ->used = 0; return (NULL); } static void tx_output(struct glob_arg *g, struct my_ctrs *cur, double delta, const char *msg) { double bw, raw_bw, pps, abs; char b1[40], b2[80], b3[80]; int size; if (cur->pkts == 0) { printf("%s nothing.\n", msg); return; } size = (int)(cur->bytes / cur->pkts); printf("%s %llu packets %llu bytes %llu events %d bytes each in %.2f seconds.\n", msg, (unsigned long long)cur->pkts, (unsigned long long)cur->bytes, (unsigned long long)cur->events, size, delta); if (delta == 0) delta = 1e-6; if (size < 60) /* correct for min packet size */ size = 60; pps = cur->pkts / delta; bw = (8.0 * cur->bytes) / delta; raw_bw = (8.0 * cur->bytes + cur->pkts * g->framing) / delta; abs = cur->pkts / (double)(cur->events); printf("Speed: %spps Bandwidth: %sbps (raw %sbps). Average batch: %.2f pkts\n", norm(b1, pps, normalize), norm(b2, bw, normalize), norm(b3, raw_bw, normalize), abs); } static void usage(int errcode) { /* This usage is generated from the pkt-gen man page: * $ man pkt-gen > x * and pasted here adding the string terminators and endlines with simple * regular expressions. */ const char *cmd = "pkt-gen"; fprintf(stderr, "Usage:\n" "%s arguments\n" " -h Show program usage and exit.\n" "\n" " -i interface\n" " Name of the network interface that pkt-gen operates on. It can be a system network interface\n" " (e.g., em0), the name of a vale(4) port (e.g., valeSSS:PPP), the name of a netmap pipe or\n" " monitor, or any valid netmap port name accepted by the nm_open library function, as docu-\n" " mented in netmap(4) (NIOCREGIF section).\n" "\n" " -f function\n" " The function to be executed by pkt-gen. 
Specify tx for transmission, rx for reception, ping\n"
	    "     for client-side ping-pong operation, and pong for server-side ping-pong operation.\n"
	    "\n"
	    " -n count\n"
	    "     Number of iterations of the pkt-gen function (with 0 meaning infinite). In case of tx or rx,\n"
	    "     count is the number of packets to receive or transmit. In case of ping or pong, count is the\n"
	    "     number of ping-pong transactions.\n"
	    "\n"
	    " -l pkt_size\n"
	    "     Packet size in bytes excluding CRC. If passed a second time, use random sizes larger than or\n"
	    "     equal to the second one and lower than the first one.\n"
	    "\n"
	    " -b burst_size\n"
	    "     Transmit or receive up to burst_size packets at a time.\n"
	    "\n"
	    " -4 Use IPv4 addresses.\n"
	    "\n"
	    " -6 Use IPv6 addresses.\n"
	    "\n"
	    " -d dst_ip[:port[-dst_ip:port]]\n"
	    "     Destination IPv4/IPv6 address and port, single or range.\n"
	    "\n"
	    " -s src_ip[:port[-src_ip:port]]\n"
	    "     Source IPv4/IPv6 address and port, single or range.\n"
	    "\n"
	    " -D dst_mac\n"
	    "     Destination MAC address in colon notation (e.g., aa:bb:cc:dd:ee:00).\n"
	    "\n"
	    " -S src_mac\n"
	    "     Source MAC address in colon notation.\n"
	    "\n"
	    " -a cpu_id\n"
	    "     Pin the first thread of pkt-gen to a particular CPU using pthread_setaffinity_np(3). If more\n"
	    "     threads are used, they are pinned to the subsequent CPUs, one per thread.\n"
	    "\n"
	    " -c cpus\n"
	    "     Maximum number of CPUs to use (0 means to use all the available ones).\n"
	    "\n"
	    " -p threads\n"
	    "     Number of threads to use. By default, only a single thread is used to handle all the netmap\n"
	    "     rings. If threads is larger than one, each thread handles a single TX ring (in tx mode), a\n"
	    "     single RX ring (in rx mode), or a TX/RX ring couple. The number of threads must be less than\n"
	    "     or equal to the number of TX (or RX) rings available in the device specified by interface.\n"
	    "\n"
	    " -T report_ms\n"
	    "     Number of milliseconds between reports.\n"
	    "\n"
	    " -w wait_for_link_time\n"
	    "     Number of seconds to wait before starting the pkt-gen function, useful to make sure that the\n"
	    "     network link is up. A network device driver may take some time to enter netmap mode, or to\n"
	    "     create a new transmit/receive ring pair when netmap(4) requests one.\n"
	    "\n"
	    " -R rate\n"
	    "     Packet transmission rate. Not setting the packet transmission rate tells pkt-gen to transmit\n"
	    "     packets as quickly as possible. On servers from 2010 onwards netmap(4) is able to com-\n"
	    "     pletely use all of the bandwidth of a 10 or 40Gbps link, so this option should be used unless\n"
	    "     your intention is to saturate the link.\n"
	    "\n"
	    " -X Dump payload of each packet transmitted or received.\n"
	    "\n"
	    " -H len Add empty virtio-net-header with size 'len'. Valid sizes are 0, 10 and 12. This option is\n"
	    "     only used with Virtual Machine technologies that use virtio as a network interface.\n"
	    "\n"
	    " -P file\n"
	    "     Load the packet to be transmitted from a pcap file rather than constructing it within\n"
	    "     pkt-gen.\n"
	    "\n"
	    " -z Use random IPv4/IPv6 src address/port.\n"
	    "\n"
	    " -Z Use random IPv4/IPv6 dst address/port.\n"
	    "\n"
	    " -N Do not normalize units (i.e., use bps, pps instead of Mbps, Kpps, etc.).\n"
	    "\n"
	    " -F num_frags\n"
	    "     Send multi-slot packets, each one with num_frags fragments. A multi-slot packet is repre-\n"
	    "     sented by two or more consecutive netmap slots with the NS_MOREFRAG flag set (except for the\n"
	    "     last slot).
 This is useful to transmit or receive packets larger than the netmap buffer\n"
	    "     size.\n"
	    "\n"
	    " -M frag_size\n"
	    "     In multi-slot mode, frag_size specifies the size of each fragment, if smaller than the packet\n"
	    "     length divided by num_frags.\n"
	    "\n"
	    " -I Use indirect buffers. It is only valid for transmitting on VALE ports, and it is implemented\n"
	    "     by setting the NS_INDIRECT flag in the netmap slots.\n"
	    "\n"
	    " -W Exit immediately if all the RX rings are empty the first time they are examined.\n"
	    "\n"
	    " -v Increase the verbosity level.\n"
	    "\n"
	    " -r In tx mode, do not initialize packets, but send whatever the content of the uninitialized\n"
	    "     netmap buffers is (rubbish mode).\n"
	    "\n"
	    " -A Compute mean and standard deviation (over a sliding window) for the transmit or receive rate.\n"
	    "\n"
	    " -B Take Ethernet framing and CRC into account when computing the average bps. This adds 4 bytes\n"
	    "     of CRC and 20 bytes of framing to each packet.\n"
	    "\n"
	    " -C tx_slots[,rx_slots[,tx_rings[,rx_rings]]]\n"
	    "     Configuration in terms of number of rings and slots to be used when opening the netmap port.\n"
	    "     Such configuration has effect on software ports created on the fly, such as VALE ports and\n"
	    "     netmap pipes. The configuration may consist of 1 to 4 numbers separated by commas: tx_slots,\n"
	    "     rx_slots, tx_rings, rx_rings. Missing numbers or zeroes stand for default values. As an\n"
	    "     additional convenience, if exactly one number is specified, then this is assigned to both\n"
	    "     tx_slots and rx_slots. If there is no fourth number, then the third one is assigned to both\n"
	    "     tx_rings and rx_rings.\n"
	    "\n"
	    " -o options data generation options (parsed using atoi)\n"
	    "     OPT_PREFETCH 1\n"
	    "     OPT_ACCESS 2\n"
	    "     OPT_COPY 4\n"
	    "     OPT_MEMCPY 8\n"
	    "     OPT_TS 16 (add a timestamp)\n"
	    "     OPT_INDIRECT 32 (use indirect buffers)\n"
	    "     OPT_DUMP 64 (dump rx/tx traffic)\n"
	    "     OPT_RUBBISH 256\n"
	    "     (send whatever the buffers contain)\n"
	    "     OPT_RANDOM_SRC 512\n"
	    "     OPT_RANDOM_DST 1024\n"
	    "     OPT_PPS_STATS 2048\n"
	    "", cmd);

	exit(errcode);
}

static void
start_threads(struct glob_arg *g)
{
	int i;

	targs = calloc(g->nthreads, sizeof(*targs));
	struct targ *t;
	/*
	 * Now create the desired number of threads, each one
	 * using a single descriptor.
	 */
	for (i = 0; i < g->nthreads; i++) {
		uint64_t seed = time(0) | ((uint64_t)time(0) << 32);
		t = &targs[i];

		bzero(t, sizeof(*t));
		t->fd = -1; /* default, with pcap */
		t->g = g;
		memcpy(t->seed, &seed, sizeof(t->seed));

		if (g->dev_type == DEV_NETMAP) {
			struct nm_desc nmd = *g->nmd; /* copy, we overwrite ringid */
			uint64_t nmd_flags = 0;
			nmd.self = &nmd;

			if (i > 0) {
				/* the first thread uses the fd opened by the main
				 * thread, the other threads re-open /dev/netmap
				 */
				if (g->nthreads > 1) {
					nmd.req.nr_flags =
						g->nmd->req.nr_flags & ~NR_REG_MASK;
					nmd.req.nr_flags |= NR_REG_ONE_NIC;
					nmd.req.nr_ringid = i;
				}
				/* Only touch one of the rings (rx is already ok) */
				if (g->td_type == TD_TYPE_RECEIVER)
					nmd_flags |= NETMAP_NO_TX_POLL;

				/* register interface. Override ifname and ringid etc. */
				t->nmd = nm_open(t->g->ifname, NULL, nmd_flags |
					NM_OPEN_IFNAME | NM_OPEN_NO_MMAP, &nmd);
				if (t->nmd == NULL) {
					D("Unable to open %s: %s", t->g->ifname,
						strerror(errno));
					continue;
				}
			} else {
				t->nmd = g->nmd;
			}
			t->fd = t->nmd->fd;
			t->frags = g->frags;
		} else {
			targs[i].fd = g->main_fd;
		}
		t->used = 1;
		t->me = i;
		if (g->affinity >= 0) {
			t->affinity = (g->affinity + i) % g->cpus;
		} else {
			t->affinity = -1;
		}
		/* default, init packets */
		initialize_packet(t);
	}
	/* Wait for PHY reset.
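 *
 * (see the -w option above: entering netmap mode may reset the
 * NIC, and the link can take a few seconds to come back up)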
*/ D("Wait %d secs for phy reset", g->wait_link); sleep(g->wait_link); D("Ready..."); for (i = 0; i < g->nthreads; i++) { t = &targs[i]; if (pthread_create(&t->thread, NULL, g->td_body, t) == -1) { D("Unable to create thread %d: %s", i, strerror(errno)); t->used = 0; } } } static void main_thread(struct glob_arg *g) { int i; struct my_ctrs prev, cur; double delta_t; struct timeval tic, toc; prev.pkts = prev.bytes = prev.events = 0; gettimeofday(&prev.t, NULL); for (;;) { char b1[40], b2[40], b3[40], b4[100]; uint64_t pps, usec; struct my_ctrs x; double abs; int done = 0; usec = wait_for_next_report(&prev.t, &cur.t, g->report_interval); cur.pkts = cur.bytes = cur.events = 0; cur.min_space = 0; if (usec < 10000) /* too short to be meaningful */ continue; /* accumulate counts for all threads */ for (i = 0; i < g->nthreads; i++) { cur.pkts += targs[i].ctr.pkts; cur.bytes += targs[i].ctr.bytes; cur.events += targs[i].ctr.events; cur.min_space += targs[i].ctr.min_space; targs[i].ctr.min_space = 99999; if (targs[i].used == 0) done++; } x.pkts = cur.pkts - prev.pkts; x.bytes = cur.bytes - prev.bytes; x.events = cur.events - prev.events; pps = (x.pkts*1000000 + usec/2) / usec; abs = (x.events > 0) ? (x.pkts / (double) x.events) : 0; if (!(g->options & OPT_PPS_STATS)) { strcpy(b4, ""); } else { /* Compute some pps stats using a sliding window. */ double ppsavg = 0.0, ppsdev = 0.0; int nsamples = 0; g->win[g->win_idx] = pps; g->win_idx = (g->win_idx + 1) % STATS_WIN; for (i = 0; i < STATS_WIN; i++) { ppsavg += g->win[i]; if (g->win[i]) { nsamples ++; } } ppsavg /= nsamples; for (i = 0; i < STATS_WIN; i++) { if (g->win[i] == 0) { continue; } ppsdev += (g->win[i] - ppsavg) * (g->win[i] - ppsavg); } ppsdev /= nsamples; ppsdev = sqrt(ppsdev); snprintf(b4, sizeof(b4), "[avg/std %s/%s pps]", norm(b1, ppsavg, normalize), norm(b2, ppsdev, normalize)); } D("%spps %s(%spkts %sbps in %llu usec) %.2f avg_batch %d min_space", norm(b1, pps, normalize), b4, norm(b2, (double)x.pkts, normalize), - norm(b3, (double)x.bytes*8+(double)x.pkts*g->framing, normalize), + norm(b3, 1000000*((double)x.bytes*8+(double)x.pkts*g->framing)/usec, normalize), (unsigned long long)usec, abs, (int)cur.min_space); prev = cur; if (done == g->nthreads) break; } timerclear(&tic); timerclear(&toc); cur.pkts = cur.bytes = cur.events = 0; /* final round */ for (i = 0; i < g->nthreads; i++) { struct timespec t_tic, t_toc; /* * Join active threads, unregister interfaces and close * file descriptors. */ if (targs[i].used) pthread_join(targs[i].thread, NULL); /* blocking */ if (g->dev_type == DEV_NETMAP) { nm_close(targs[i].nmd); targs[i].nmd = NULL; } else { close(targs[i].fd); } if (targs[i].completed == 0) D("ouch, thread %d exited with error", i); /* * Collect threads output and extract information about * how long it took to send all the packets. */ cur.pkts += targs[i].ctr.pkts; cur.bytes += targs[i].ctr.bytes; cur.events += targs[i].ctr.events; /* collect the largest start (tic) and end (toc) times, * XXX maybe we should do the earliest tic, or do a weighted * average ? */ t_tic = timeval2spec(&tic); t_toc = timeval2spec(&toc); if (!timerisset(&tic) || timespec_ge(&targs[i].tic, &t_tic)) tic = timespec2val(&targs[i].tic); if (!timerisset(&toc) || timespec_ge(&targs[i].toc, &t_toc)) toc = timespec2val(&targs[i].toc); } /* print output. 
 */
	timersub(&toc, &tic, &toc);
	delta_t = toc.tv_sec + 1e-6 * toc.tv_usec;
	if (g->td_type == TD_TYPE_SENDER)
		tx_output(g, &cur, delta_t, "Sent");
	else if (g->td_type == TD_TYPE_RECEIVER)
		tx_output(g, &cur, delta_t, "Received");
}

struct td_desc {
	int ty;
	char *key;
	void *f;
	int default_burst;
};

static struct td_desc func[] = {
	{ TD_TYPE_RECEIVER,	"rx",		receiver_body,	512},	/* default */
	{ TD_TYPE_SENDER,	"tx",		sender_body,	512 },
	{ TD_TYPE_OTHER,	"ping",		ping_body,	1 },
	{ TD_TYPE_OTHER,	"pong",		pong_body,	1 },
	{ TD_TYPE_SENDER,	"txseq",	txseq_body,	512 },
	{ TD_TYPE_RECEIVER,	"rxseq",	rxseq_body,	512 },
	{ 0,			NULL,		NULL,		0 }
};

static int
tap_alloc(char *dev)
{
	struct ifreq ifr;
	int fd, err;
	char *clonedev = TAP_CLONEDEV;

	(void)err;
	(void)dev;
	/* Arguments taken by the function:
	 *
	 * char *dev: the name of an interface (or '\0'). MUST have enough
	 *	space to hold the interface name if '\0' is passed
	 * int flags: interface flags (eg, IFF_TUN etc.)
	 */

#ifdef __FreeBSD__
	if (dev[3]) { /* tapSomething */
		static char buf[128];
		snprintf(buf, sizeof(buf), "/dev/%s", dev);
		clonedev = buf;
	}
#endif
	/* open the device */
	if ((fd = open(clonedev, O_RDWR)) < 0) {
		return fd;
	}
	D("%s open successful", clonedev);

	/* preparation of the struct ifr, of type "struct ifreq" */
	memset(&ifr, 0, sizeof(ifr));

#ifdef linux
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI;

	if (*dev) {
		/* if a device name was specified, put it in the structure; otherwise,
		 * the kernel will try to allocate the "next" device of the
		 * specified type */
		size_t len = strlen(dev);
		if (len > IFNAMSIZ) {
			D("%s too long", dev);
			return -1;
		}
		memcpy(ifr.ifr_name, dev, len);
	}

	/* try to create the device */
	if ((err = ioctl(fd, TUNSETIFF, (void *)&ifr)) < 0) {
		D("failed to do a TUNSETIFF: %s", strerror(errno));
		close(fd);
		return err;
	}

	/* if the operation was successful, write back the name of the
	 * interface to the variable "dev", so the caller can know
	 * it.
Note that the caller MUST reserve space in *dev (see calling * code below) */ strcpy(dev, ifr.ifr_name); D("new name is %s", dev); #endif /* linux */ /* this is the special file descriptor that the caller will use to talk * with the virtual interface */ return fd; } int main(int arc, char **argv) { int i; struct sigaction sa; sigset_t ss; struct glob_arg g; int ch; int devqueues = 1; /* how many device queues */ int wait_link_arg = 0; int pkt_size_done = 0; struct td_desc *fn = func; bzero(&g, sizeof(g)); g.main_fd = -1; g.td_body = fn->f; g.td_type = fn->ty; g.report_interval = 1000; /* report interval */ g.affinity = -1; /* ip addresses can also be a range x.x.x.x-x.x.x.y */ g.af = AF_INET; /* default */ g.src_ip.name = "10.0.0.1"; g.dst_ip.name = "10.1.0.1"; g.dst_mac.name = "ff:ff:ff:ff:ff:ff"; g.src_mac.name = NULL; g.pkt_size = 60; g.pkt_min_size = 0; g.nthreads = 1; g.cpus = 1; /* default */ g.forever = 1; g.tx_rate = 0; g.frags = 1; g.frag_size = (u_int)-1; /* use the netmap buffer size by default */ g.nmr_config = ""; g.virt_header = 0; g.wait_link = 2; /* wait 2 seconds for physical ports */ while ((ch = getopt(arc, argv, "46a:f:F:Nn:i:Il:d:s:D:S:b:c:o:p:" "T:w:WvR:XC:H:rP:zZAhBM:")) != -1) { switch(ch) { default: D("bad option %c %s", ch, optarg); usage(-1); break; case 'h': usage(0); break; case '4': g.af = AF_INET; break; case '6': g.af = AF_INET6; break; case 'N': normalize = 0; break; case 'n': g.npackets = strtoull(optarg, NULL, 10); break; case 'F': i = atoi(optarg); if (i < 1 || i > 63) { D("invalid frags %d [1..63], ignore", i); break; } g.frags = i; break; case 'M': g.frag_size = atoi(optarg); break; case 'f': for (fn = func; fn->key; fn++) { if (!strcmp(fn->key, optarg)) break; } if (fn->key) { g.td_body = fn->f; g.td_type = fn->ty; } else { D("unrecognised function %s", optarg); } break; case 'o': /* data generation options */ g.options |= atoi(optarg); break; case 'a': /* force affinity */ g.affinity = atoi(optarg); break; case 'i': /* interface */ /* a prefix of tap: netmap: or pcap: forces the mode. 
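 * e.g. "tap:tap0", "pcap:em0", "netmap:em0" or a vale port name;
 * a bare "em0" gets "netmap:" prepended below.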
* otherwise we guess */ D("interface is %s", optarg); if (strlen(optarg) > MAX_IFNAMELEN - 8) { D("ifname too long %s", optarg); break; } strcpy(g.ifname, optarg); if (!strcmp(optarg, "null")) { g.dev_type = DEV_NETMAP; g.dummy_send = 1; } else if (!strncmp(optarg, "tap:", 4)) { g.dev_type = DEV_TAP; strcpy(g.ifname, optarg + 4); } else if (!strncmp(optarg, "pcap:", 5)) { g.dev_type = DEV_PCAP; strcpy(g.ifname, optarg + 5); } else if (!strncmp(optarg, "netmap:", 7) || !strncmp(optarg, "vale", 4)) { g.dev_type = DEV_NETMAP; } else if (!strncmp(optarg, "tap", 3)) { g.dev_type = DEV_TAP; } else { /* prepend netmap: */ g.dev_type = DEV_NETMAP; sprintf(g.ifname, "netmap:%s", optarg); } break; case 'I': g.options |= OPT_INDIRECT; /* use indirect buffers */ break; case 'l': /* pkt_size */ if (pkt_size_done) { g.pkt_min_size = atoi(optarg); } else { g.pkt_size = atoi(optarg); pkt_size_done = 1; } break; case 'd': g.dst_ip.name = optarg; break; case 's': g.src_ip.name = optarg; break; case 'T': /* report interval */ g.report_interval = atoi(optarg); break; case 'w': g.wait_link = atoi(optarg); wait_link_arg = 1; break; case 'W': g.forever = 0; /* exit RX with no traffic */ break; case 'b': /* burst */ g.burst = atoi(optarg); break; case 'c': g.cpus = atoi(optarg); break; case 'p': g.nthreads = atoi(optarg); break; case 'D': /* destination mac */ g.dst_mac.name = optarg; break; case 'S': /* source mac */ g.src_mac.name = optarg; break; case 'v': verbose++; break; case 'R': g.tx_rate = atoi(optarg); break; case 'X': g.options |= OPT_DUMP; break; case 'C': + D("WARNING: the 'C' option is deprecated, use the '+conf:' libnetmap option instead"); g.nmr_config = strdup(optarg); break; case 'H': g.virt_header = atoi(optarg); break; case 'P': g.packet_file = strdup(optarg); break; case 'r': g.options |= OPT_RUBBISH; break; case 'z': g.options |= OPT_RANDOM_SRC; break; case 'Z': g.options |= OPT_RANDOM_DST; break; case 'A': g.options |= OPT_PPS_STATS; break; case 'B': /* raw packets have4 bytes crc + 20 bytes framing */ // XXX maybe add an option to pass the IFG g.framing = 24 * 8; break; } } if (strlen(g.ifname) <=0 ) { D("missing ifname"); usage(-1); } if (g.burst == 0) { g.burst = fn->default_burst; D("using default burst size: %d", g.burst); } g.system_cpus = i = system_ncpus(); if (g.cpus < 0 || g.cpus > i) { D("%d cpus is too high, have only %d cpus", g.cpus, i); usage(-1); } D("running on %d cpus (have %d)", g.cpus, i); if (g.cpus == 0) g.cpus = i; if (!wait_link_arg && !strncmp(g.ifname, "vale", 4)) { g.wait_link = 0; } if (g.pkt_size < 16 || g.pkt_size > MAX_PKTSIZE) { D("bad pktsize %d [16..%d]\n", g.pkt_size, MAX_PKTSIZE); usage(-1); } if (g.pkt_min_size > 0 && (g.pkt_min_size < 16 || g.pkt_min_size > g.pkt_size)) { D("bad pktminsize %d [16..%d]\n", g.pkt_min_size, g.pkt_size); usage(-1); } if (g.src_mac.name == NULL) { static char mybuf[20] = "00:00:00:00:00:00"; /* retrieve source mac address. 
*/ if (source_hwaddr(g.ifname, mybuf) == -1) { D("Unable to retrieve source mac"); // continue, fail later } g.src_mac.name = mybuf; } /* extract address ranges */ if (extract_mac_range(&g.src_mac) || extract_mac_range(&g.dst_mac)) usage(-1); g.options |= extract_ip_range(&g.src_ip, g.af); g.options |= extract_ip_range(&g.dst_ip, g.af); if (g.virt_header != 0 && g.virt_header != VIRT_HDR_1 && g.virt_header != VIRT_HDR_2) { D("bad virtio-net-header length"); usage(-1); } if (g.dev_type == DEV_TAP) { D("want to use tap %s", g.ifname); g.main_fd = tap_alloc(g.ifname); if (g.main_fd < 0) { D("cannot open tap %s", g.ifname); usage(-1); } #ifndef NO_PCAP } else if (g.dev_type == DEV_PCAP) { char pcap_errbuf[PCAP_ERRBUF_SIZE]; pcap_errbuf[0] = '\0'; // init the buffer g.p = pcap_open_live(g.ifname, 256 /* XXX */, 1, 100, pcap_errbuf); if (g.p == NULL) { D("cannot open pcap on %s", g.ifname); usage(-1); } g.main_fd = pcap_fileno(g.p); D("using pcap on %s fileno %d", g.ifname, g.main_fd); #endif /* !NO_PCAP */ } else if (g.dummy_send) { /* but DEV_NETMAP */ D("using a dummy send routine"); } else { struct nm_desc base_nmd; char errmsg[MAXERRMSG]; u_int flags; bzero(&base_nmd, sizeof(base_nmd)); parse_nmr_config(g.nmr_config, &base_nmd.req); base_nmd.req.nr_flags |= NR_ACCEPT_VNET_HDR; if (nm_parse(g.ifname, &base_nmd, errmsg) < 0) { D("Invalid name '%s': %s", g.ifname, errmsg); goto out; } /* * Open the netmap device using nm_open(). * * protocol stack and may cause a reset of the card, * which in turn may take some time for the PHY to * reconfigure. We do the open here to have time to reset. */ flags = NM_OPEN_IFNAME | NM_OPEN_ARG1 | NM_OPEN_ARG2 | NM_OPEN_ARG3 | NM_OPEN_RING_CFG; if (g.nthreads > 1) { base_nmd.req.nr_flags &= ~NR_REG_MASK; base_nmd.req.nr_flags |= NR_REG_ONE_NIC; base_nmd.req.nr_ringid = 0; } g.nmd = nm_open(g.ifname, NULL, flags, &base_nmd); if (g.nmd == NULL) { D("Unable to open %s: %s", g.ifname, strerror(errno)); goto out; } g.main_fd = g.nmd->fd; D("mapped %luKB at %p", (unsigned long)(g.nmd->req.nr_memsize>>10), g.nmd->mem); if (g.virt_header) { /* Set the virtio-net header length, since the user asked * for it explicitely. */ set_vnet_hdr_len(&g); } else { /* Check whether the netmap port we opened requires us to send * and receive frames with virtio-net header. */ get_vnet_hdr_len(&g); } /* get num of queues in tx or rx */ if (g.td_type == TD_TYPE_SENDER) devqueues = g.nmd->req.nr_tx_rings; else devqueues = g.nmd->req.nr_rx_rings; /* validate provided nthreads. */ if (g.nthreads < 1 || g.nthreads > devqueues) { D("bad nthreads %d, have %d queues", g.nthreads, devqueues); // continue, fail later } if (g.td_type == TD_TYPE_SENDER) { int mtu = get_if_mtu(&g); if (mtu > 0 && g.pkt_size > mtu) { D("pkt_size (%d) must be <= mtu (%d)", g.pkt_size, mtu); return -1; } } if (verbose) { struct netmap_if *nifp = g.nmd->nifp; struct nmreq *req = &g.nmd->req; D("nifp at offset %d, %d tx %d rx region %d", req->nr_offset, req->nr_tx_rings, req->nr_rx_rings, req->nr_arg2); for (i = 0; i <= req->nr_tx_rings; i++) { struct netmap_ring *ring = NETMAP_TXRING(nifp, i); D(" TX%d at 0x%p slots %d", i, (void *)((char *)ring - (char *)nifp), ring->num_slots); } for (i = 0; i <= req->nr_rx_rings; i++) { struct netmap_ring *ring = NETMAP_RXRING(nifp, i); D(" RX%d at 0x%p slots %d", i, (void *)((char *)ring - (char *)nifp), ring->num_slots); } } /* Print some debug information. */ fprintf(stdout, "%s %s: %d queues, %d threads and %d cpus.\n", (g.td_type == TD_TYPE_SENDER) ? 
"Sending on" : ((g.td_type == TD_TYPE_RECEIVER) ? "Receiving from" : "Working on"), g.ifname, devqueues, g.nthreads, g.cpus); if (g.td_type == TD_TYPE_SENDER) { fprintf(stdout, "%s -> %s (%s -> %s)\n", g.src_ip.name, g.dst_ip.name, g.src_mac.name, g.dst_mac.name); } out: /* Exit if something went wrong. */ if (g.main_fd < 0) { D("aborting"); usage(-1); } } if (g.options) { D("--- SPECIAL OPTIONS:%s%s%s%s%s%s\n", g.options & OPT_PREFETCH ? " prefetch" : "", g.options & OPT_ACCESS ? " access" : "", g.options & OPT_MEMCPY ? " memcpy" : "", g.options & OPT_INDIRECT ? " indirect" : "", g.options & OPT_COPY ? " copy" : "", g.options & OPT_RUBBISH ? " rubbish " : ""); } g.tx_period.tv_sec = g.tx_period.tv_nsec = 0; if (g.tx_rate > 0) { /* try to have at least something every second, * reducing the burst size to some 0.01s worth of data * (but no less than one full set of fragments) */ uint64_t x; int lim = (g.tx_rate)/300; if (g.burst > lim) g.burst = lim; if (g.burst == 0) g.burst = 1; x = ((uint64_t)1000000000 * (uint64_t)g.burst) / (uint64_t) g.tx_rate; g.tx_period.tv_nsec = x; g.tx_period.tv_sec = g.tx_period.tv_nsec / 1000000000; g.tx_period.tv_nsec = g.tx_period.tv_nsec % 1000000000; } if (g.td_type == TD_TYPE_SENDER) D("Sending %d packets every %ld.%09ld s", g.burst, g.tx_period.tv_sec, g.tx_period.tv_nsec); /* Install ^C handler. */ global_nthreads = g.nthreads; sigemptyset(&ss); sigaddset(&ss, SIGINT); /* block SIGINT now, so that all created threads will inherit the mask */ if (pthread_sigmask(SIG_BLOCK, &ss, NULL) < 0) { D("failed to block SIGINT: %s", strerror(errno)); } start_threads(&g); /* Install the handler and re-enable SIGINT for the main thread */ memset(&sa, 0, sizeof(sa)); sa.sa_handler = sigint_h; if (sigaction(SIGINT, &sa, NULL) < 0) { D("failed to install ^C handler: %s", strerror(errno)); } if (pthread_sigmask(SIG_UNBLOCK, &ss, NULL) < 0) { D("failed to re-enable SIGINT: %s", strerror(errno)); } main_thread(&g); free(targs); return 0; } /* end of file */