Index: head/share/man/man4/netmap.4
===================================================================
--- head/share/man/man4/netmap.4	(revision 340278)
+++ head/share/man/man4/netmap.4	(revision 340279)
@@ -1,1148 +1,1151 @@
 .\" Copyright (c) 2011-2014 Matteo Landi, Luigi Rizzo, Universita` di Pisa
 .\" All rights reserved.
 .\"
 .\" Redistribution and use in source and binary forms, with or without
 .\" modification, are permitted provided that the following conditions
 .\" are met:
 .\" 1. Redistributions of source code must retain the above copyright
 .\"    notice, this list of conditions and the following disclaimer.
 .\" 2. Redistributions in binary form must reproduce the above copyright
 .\"    notice, this list of conditions and the following disclaimer in the
 .\"    documentation and/or other materials provided with the distribution.
 .\"
 .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 .\" SUCH DAMAGE.
 .\"
 .\" This document is derived in part from the enet man page (enet.4)
 .\" distributed with 4.3BSD Unix.
 .\"
 .\" $FreeBSD$
 .\"
-.Dd October 23, 2018
+.Dd October 28, 2018
 .Dt NETMAP 4
 .Os
 .Sh NAME
 .Nm netmap
 .Nd a framework for fast packet I/O
 .Nm VALE
 .Nd a fast VirtuAl Local Ethernet using the netmap API
 .Pp
 .Nm netmap pipes
 .Nd a shared memory packet transport channel
 .Sh SYNOPSIS
 .Cd device netmap
 .Sh DESCRIPTION
 .Nm
 is a framework for extremely fast and efficient packet I/O
 for userspace and kernel clients, and for Virtual Machines.
 It runs on
 .Fx ,
 Linux and some versions of Windows, and supports a variety of
 .Nm netmap ports ,
 including
 .Bl -tag -width XXXX
 .It Nm physical NIC ports
 to access individual queues of network interfaces;
 .It Nm host ports
 to inject packets into the host stack;
 .It Nm VALE ports
 implementing a very fast and modular in-kernel software switch/dataplane;
 .It Nm netmap pipes
 a shared memory packet transport channel;
 .It Nm netmap monitors
 a mechanism similar to
 .Xr bpf 4
 to capture traffic
 .El
 .Pp
 All these
 .Nm netmap ports
 are accessed interchangeably with the same API,
 and are at least one order of magnitude faster than
 standard OS mechanisms
 (sockets, bpf, tun/tap interfaces, native switches, pipes).
 With suitably fast hardware (NICs, PCIe buses, CPUs),
 packet I/O using
 .Nm
 on supported NICs
 reaches 14.88 million packets per second (Mpps)
 with much less than one core on 10 Gbit/s NICs;
 35-40 Mpps on 40 Gbit/s NICs (limited by the hardware);
 about 20 Mpps per core for VALE ports;
 and over 100 Mpps for
 .Nm netmap pipes .
 NICs without native
 .Nm
 support can still use the API in emulated mode,
 which uses unmodified device drivers and is 3-5 times faster than
 .Xr bpf 4
 or raw sockets.
 .Pp
 Userspace clients can dynamically switch NICs into
 .Nm
 mode and send and receive raw packets through
 memory mapped buffers.
 Similarly,
 .Nm VALE
 switch instances and ports,
 .Nm netmap pipes
 and
 .Nm netmap monitors
 can be created dynamically,
 providing high speed packet I/O between processes,
 virtual machines, NICs and the host stack.
 .Pp
 .Nm
 supports both non-blocking I/O through
 .Xr ioctl 2 ,
 synchronization and blocking I/O through a file descriptor
 and standard OS mechanisms such as
 .Xr select 2 ,
 .Xr poll 2 ,
 .Xr epoll 2 ,
 and
 .Xr kqueue 2 .
 All types of
 .Nm netmap ports
 and the
 .Nm VALE switch
 are implemented by a single kernel module, which also emulates the
 .Nm
 API over standard drivers.
 For best performance,
 .Nm
 requires native support in device drivers.
 A list of such devices is at the end of this document.
 .Pp
 In the rest of this (long) manual page we document
 various aspects of the
 .Nm
 and
 .Nm VALE
 architecture, features and usage.
 .Sh ARCHITECTURE
 .Nm
 supports raw packet I/O through a
 .Em port ,
 which can be connected to a physical interface
 .Em ( NIC ) ,
 to the host stack,
 or to a
 .Nm VALE
 switch.
 Ports use preallocated circular queues of buffers
 .Em ( rings )
 residing in an mmapped region.
 There is one ring for each transmit/receive queue of a
 NIC or virtual port.
 An additional ring pair connects to the host stack.
 .Pp
 After binding a file descriptor to a port, a
 .Nm
 client can send or receive packets in batches through
 the rings, and possibly implement zero-copy forwarding
 between ports.
 .Pp
 All NICs operating in
 .Nm
 mode use the same memory region,
 accessible to all processes that own
 .Pa /dev/netmap
 file descriptors bound to NICs.
 Independent
 .Nm VALE
 and
 .Nm netmap pipe
 ports
 by default use separate memory regions,
 but can be independently configured to share memory.
 .Sh ENTERING AND EXITING NETMAP MODE
 The following section describes the system calls to create
 and control
 .Nm netmap
 ports (including
 .Nm VALE
 and
 .Nm netmap pipe
 ports).
 Simpler, higher level functions are described in the
 .Sx LIBRARIES
 section.
 .Pp
 Ports and rings are created and controlled through a file descriptor,
 created by opening a special device
 .Dl fd = open("/dev/netmap", O_RDWR);
 and then bound to a specific port with an
 .Dl ioctl(fd, NIOCREGIF, (struct nmreq *)arg);
 .Pp
 .Nm
 has multiple modes of operation controlled by the
 .Vt struct nmreq
 argument.
 .Va arg.nr_name
 specifies the netmap port name, as follows:
 .Bl -tag -width XXXX
 .It Dv OS network interface name (e.g., 'em0', 'eth1', ... )
 the data path of the NIC is disconnected from the host stack,
 and the file descriptor is bound to the NIC (one or all queues),
 or to the host stack;
 .It Dv valeSSS:PPP
 the file descriptor is bound to port PPP of VALE switch SSS.
 Switch instances and ports are dynamically created if necessary.
 .Pp
 Both SSS and PPP have the form [0-9a-zA-Z_]+ , the string
 cannot exceed IFNAMSIZ characters, and PPP cannot
 be the name of any existing OS network interface.
 .El
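 .Pp
 The naming rule for VALE ports can be checked before issuing the ioctl.
 The following is a minimal sketch, not code from netmap itself: the
 sketch_ names are hypothetical, SKETCH_IFNAMSIZ stands in for IFNAMSIZ
 from net/if.h, and it assumes the whole name plus its terminating NUL
 must fit in nr_name.
 It does not check whether PPP collides with an existing OS interface.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for IFNAMSIZ from <net/if.h> (assumption: 16). */
#define SKETCH_IFNAMSIZ 16

/* True if s[0..len) is non-empty and matches [0-9a-zA-Z_]+ . */
static bool
sketch_valid_part(const char *s, size_t len)
{
    if (len == 0)
        return false;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '_')
            return false;
    }
    return true;
}

/*
 * Check a "valeSSS:PPP" port name.
 * Assumption: the string and its NUL must fit in nr_name[IFNAMSIZ].
 */
static bool
sketch_valid_vale_name(const char *name)
{
    const char *colon;

    assert(name != NULL);
    if (strlen(name) >= SKETCH_IFNAMSIZ || strncmp(name, "vale", 4) != 0)
        return false;
    colon = strchr(name, ':');
    if (colon == NULL)
        return false;
    /* SSS lies between the "vale" prefix and the colon, PPP after it. */
    return sketch_valid_part(name + 4, (size_t)(colon - (name + 4))) &&
        sketch_valid_part(colon + 1, strlen(colon + 1));
}
```

 With these assumptions "vale1:a" is accepted, while "em0" and "vale1:"
 are rejected.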
 .Pp
 On return,
 .Va arg
 indicates the size of the shared memory region,
 and the number, size and location of all the
 .Nm
 data structures, which can be accessed by mmapping the memory
 .Dl char *mem = mmap(NULL, arg.nr_memsize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
 .Pp
 Non-blocking I/O is done with special
 .Xr ioctl 2
 calls, while
 .Xr select 2
 and
 .Xr poll 2
 on the file descriptor permit blocking I/O.
 .Xr epoll 2
 and
 .Xr kqueue 2
 are not supported on
 .Nm
 file descriptors.
 .Pp
 While a NIC is in
 .Nm
 mode, the OS will still believe the interface is up and running.
 OS-generated packets for that NIC end up in a
 .Nm
 ring, and another ring is used to send packets into the OS network stack.
 A
 .Xr close 2
 on the file descriptor removes the binding,
 and returns the NIC to normal mode (reconnecting the data path
 to the host stack), or destroys the virtual port.
 .Sh DATA STRUCTURES
 The data structures in the mmapped memory region are detailed in
 .In sys/net/netmap.h ,
 which is the ultimate reference for the
 .Nm
 API.
 The main structures and fields are indicated below:
 .Bl -tag -width XXX
 .It Dv struct netmap_if (one per interface)
 .Bd -literal
 struct netmap_if {
     ...
     const uint32_t   ni_flags;      /* properties              */
     ...
     const uint32_t   ni_tx_rings;   /* NIC tx rings            */
     const uint32_t   ni_rx_rings;   /* NIC rx rings            */
     uint32_t         ni_bufs_head;  /* head of extra bufs list */
     ...
 };
 .Ed
 .Pp
 Indicates the number of available rings
 .Pa ( struct netmap_ring )
 and their position in the mmapped region.
 The number of tx and rx rings
 .Pa ( ni_tx_rings , ni_rx_rings )
 normally depends on the hardware.
 NICs also have an extra tx/rx ring pair connected to the host stack.
 .Em NIOCREGIF
 can also request additional unbound buffers in the same memory space,
 to be used as temporary storage for packets.
 .Pa ni_bufs_head
 contains the index of the first of these free buffers,
 which are connected in a list (the first uint32_t of each
 buffer being the index of the next buffer in the list).
 A
 .Dv 0
 indicates the end of the list.
 .It Dv struct netmap_ring (one per ring)
 .Bd -literal
 struct netmap_ring {
     ...
     const uint32_t num_slots;   /* slots in each ring            */
     const uint32_t nr_buf_size; /* size of each buffer           */
     ...
     uint32_t       head;        /* (u) first buf owned by user   */
     uint32_t       cur;         /* (u) wakeup position           */
     const uint32_t tail;        /* (k) first buf owned by kernel */
     ...
     uint32_t       flags;
     struct timeval ts;          /* (k) time of last rxsync()     */
     ...
     struct netmap_slot slot[0]; /* array of slots                */
 };
 .Ed
 .Pp
 Implements transmit and receive rings, with read/write
 pointers, metadata and an array of
 .Em slots
 describing the buffers.
 .It Dv struct netmap_slot (one per buffer)
 .Bd -literal
 struct netmap_slot {
     uint32_t buf_idx;           /* buffer index                 */
     uint16_t len;               /* packet length                */
     uint16_t flags;             /* buf changed, etc.            */
     uint64_t ptr;               /* address for indirect buffers */
 };
 .Ed
 .Pp
 Describes a packet buffer, which normally is identified by
 an index and resides in the mmapped region.
 .It Dv packet buffers
 Fixed size (normally 2 KB) packet buffers allocated by the kernel.
 .El
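 .Pp
 The list of extra buffers rooted at ni_bufs_head can be walked by
 following the index stored in the first uint32_t of each buffer until
 a 0 is found.
 The sketch below illustrates this on a plain byte array that stands in
 for the mmapped buffer area; the sketch_ names and the fixed 2048-byte
 buffer size are assumptions, and real code would convert an index into
 a pointer with NETMAP_BUF.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed buffer size ("normally 2 KB"); real code reads nr_buf_size. */
#define SKETCH_BUF_SIZE 2048

/*
 * Count the buffers in the list that starts at index 'head'.
 * 'pool' stands in for the buffer area, 'nbufs' is its size in buffers.
 * The walk stops at index 0 or after nbufs steps, so a corrupted
 * (cyclic) list cannot loop forever.
 */
static uint32_t
sketch_count_extra_bufs(const char *pool, uint32_t nbufs, uint32_t head)
{
    uint32_t count = 0;
    uint32_t idx = head;

    while (idx != 0 && count < nbufs) {
        uint32_t next;

        assert(idx < nbufs);
        /* The first uint32_t of a free buffer holds the next index. */
        memcpy(&next, pool + (size_t)idx * SKETCH_BUF_SIZE, sizeof(next));
        count++;
        idx = next;
    }
    return count;
}
```

 Copying the index out with memcpy avoids alignment assumptions about
 the buffer start.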
 .Pp
 The offset of the
 .Pa struct netmap_if
 in the mmapped region is indicated by the
 .Pa nr_offset
 field in the structure returned by
 .Dv NIOCREGIF .
 From there, all other objects are reachable through
 relative references (offsets or indexes).
 Macros and functions in
 .In net/netmap_user.h
 help convert them into actual pointers:
 .Pp
 .Dl struct netmap_if  *nifp = NETMAP_IF(mem, arg.nr_offset);
 .Dl struct netmap_ring *txr = NETMAP_TXRING(nifp, ring_index);
 .Dl struct netmap_ring *rxr = NETMAP_RXRING(nifp, ring_index);
 .Pp
 .Dl char *buf = NETMAP_BUF(ring, buffer_index);
 .Sh RINGS, BUFFERS AND DATA I/O
 .Va Rings
 are circular queues of packets with three indexes/pointers
 .Va ( head , cur , tail ) ;
 one slot is always kept empty.
 The ring size
 .Va ( num_slots )
 should not be assumed to be a power of two.
 .Pp
 .Va head
 is the first slot available to userspace;
 .Pp
 .Va cur
 is the wakeup point:
 select/poll will unblock when
 .Va tail
 passes
 .Va cur ;
 .Pp
 .Va tail
 is the first slot reserved to the kernel.
 .Pp
 Slot indexes
 .Em must
 only move forward;
 for convenience, the function
 .Dl nm_ring_next(ring, index)
 returns the next index modulo the ring size.
 .Pp
 .Va head
 and
 .Va cur
 are only modified by the user program;
 .Va tail
 is only modified by the kernel.
 The kernel only reads/writes the
 .Vt struct netmap_ring
 slots and buffers
 during the execution of a netmap-related system call.
 The only exceptions are slots (and buffers) in the range
 .Va tail\  . . . head-1 ,
 that are explicitly assigned to the kernel.
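 .Pp
 The index arithmetic described above can be sketched on a local
 structure that carries only the index fields of a ring.
 The sketch_ names are hypothetical; real programs should use
 struct netmap_ring and nm_ring_next() from net/netmap_user.h.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in holding only the index fields of struct netmap_ring. */
struct sketch_ring {
    uint32_t num_slots;
    uint32_t head;
    uint32_t cur;
    uint32_t tail;
};

/* Next slot index; wraps by comparison since num_slots need not be 2^n. */
static uint32_t
sketch_ring_next(const struct sketch_ring *r, uint32_t i)
{
    assert(i < r->num_slots);
    return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/* Number of slots in the range head ... tail-1, modulo the ring size. */
static uint32_t
sketch_ring_space(const struct sketch_ring *r)
{
    int64_t n = (int64_t)r->tail - (int64_t)r->head;

    if (n < 0)
        n += r->num_slots;
    return (uint32_t)n;
}
```

 For a ring of 10 slots with head at 8 and tail at 3, the range wraps
 around and covers slots 8, 9, 0, 1 and 2.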
 .Pp
 .Ss TRANSMIT RINGS
 On transmit rings, after a
 .Nm
 system call, slots in the range
 .Va head\  . . . tail-1
 are available for transmission.
 User code should fill the slots sequentially
 and advance
 .Va head
 and
 .Va cur
 past slots ready to transmit.
 .Va cur
 may be moved further ahead if the user code needs
 more slots before further transmissions (see
 .Sx SCATTER GATHER I/O ) .
 .Pp
 At the next NIOCTXSYNC/select()/poll(),
 slots up to
 .Va head-1
 are pushed to the port, and
 .Va tail
 may advance if further slots have become available.
 Below is an example of the evolution of a TX ring:
 .Bd -literal
     after the syscall, slots between cur and tail are (a)vailable
               head=cur   tail
                |          |
                v          v
      TX  [.....aaaaaaaaaaa.............]
 
     user creates new packets to (T)ransmit
                 head=cur tail
                     |     |
                     v     v
      TX  [.....TTTTTaaaaaa.............]
 
     NIOCTXSYNC/poll()/select() sends packets and reports new slots
                 head=cur      tail
                     |          |
                     v          v
      TX  [..........aaaaaaaaaaa........]
 .Ed
 .Pp
 .Fn select
 and
 .Fn poll
 will block if there is no space in the ring, i.e.,
 .Dl ring->cur == ring->tail
 and return when new slots have become available.
 .Pp
 High speed applications may want to amortize the cost of system calls
 by preparing as many packets as possible before issuing them.
 .Pp
 A transmit ring with pending transmissions has
 .Dl ring->head != ring->tail + 1 (modulo the ring size).
 The function
 .Va int nm_tx_pending(ring)
 implements this test.
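 .Pp
 The pending test can be written out as follows.
 This is a sketch on a hypothetical local structure, not the library
 source; it only restates the condition given above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-in with the fields needed for the test. */
struct sketch_txring {
    uint32_t num_slots;
    uint32_t head;
    uint32_t tail;
};

/*
 * True when head != tail + 1 (modulo the ring size), i.e. some slots
 * handed to the kernel have not been reported back as available yet.
 */
static bool
sketch_tx_pending(const struct sketch_txring *r)
{
    uint32_t after_tail;

    assert(r->head < r->num_slots && r->tail < r->num_slots);
    after_tail = (r->tail + 1 == r->num_slots) ? 0 : r->tail + 1;
    return r->head != after_tail;
}
```

 A program that wants to close the port only after everything has been
 sent could repeat NIOCTXSYNC until this test becomes false.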
 .Ss RECEIVE RINGS
 On receive rings, after a
 .Nm
 system call, the slots in the range
 .Va head\& . . . tail-1
 contain received packets.
 User code should process them and advance
 .Va head
 and
 .Va cur
 past slots it wants to return to the kernel.
 .Va cur
 may be moved further ahead if the user code wants to
 wait for more packets
 without returning all the previous slots to the kernel.
 .Pp
 At the next NIOCRXSYNC/select()/poll(),
 slots up to
 .Va head-1
 are returned to the kernel for further receives, and
 .Va tail
 may advance to report new incoming packets.
 .Pp
 Below is an example of the evolution of an RX ring:
 .Bd -literal
     after the syscall, there are some (h)eld and some (R)eceived slots
            head  cur     tail
             |     |       |
             v     v       v
      RX  [..hhhhhhRRRRRRRR..........]
 
     user advances head and cur, releasing some slots and holding others
                head cur  tail
                  |  |     |
                  v  v     v
      RX  [..*****hhhRRRRRR...........]
 
     NIOCRXSYNC/poll()/select() recovers slots and reports new packets
                head cur        tail
                  |  |           |
                  v  v           v
      RX  [.......hhhRRRRRRRRRRRR....]
 .Ed
 .Sh SLOTS AND PACKET BUFFERS
 Normally, packets should be stored in the netmap-allocated buffers
 assigned to slots when ports are bound to a file descriptor.
 One packet is fully contained in a single buffer.
 .Pp
 The following flags affect slot and buffer processing:
 .Bl -tag -width XXX
 .It NS_BUF_CHANGED
 .Em must
 be used when the
 .Va buf_idx
 in the slot is changed.
 This can be used to implement
 zero-copy forwarding, see
 .Sx ZERO-COPY FORWARDING .
 .It NS_REPORT
 reports when this buffer has been transmitted.
 Normally,
 .Nm
 notifies transmit completions in batches, hence signals
 can be delayed indefinitely.
 This flag helps detect
 when packets have been sent and a file descriptor can be closed.
 .It NS_FORWARD
 When a ring is in 'transparent' mode (see
 .Sx TRANSPARENT MODE ) ,
 packets marked with this flag are forwarded to the other endpoint
 at the next system call, thus restoring (in a selective way)
 the connection between a NIC and the host stack.
 .It NS_NO_LEARN
 tells the forwarding code that the source MAC address for this
 packet must not be used in the learning bridge code.
 .It NS_INDIRECT
 indicates that the packet's payload is in a user-supplied buffer
 whose user virtual address is in the 'ptr' field of the slot.
 The size can reach 65535 bytes.
 .Pp
 This is only supported on the transmit ring of
 .Nm VALE
 ports, and it helps reduce data copies in the interconnection
 of virtual machines.
 .It NS_MOREFRAG
 indicates that the packet continues with subsequent buffers;
 the last buffer in a packet must have the flag clear.
 .El
 .Sh SCATTER GATHER I/O
 Packets can span multiple slots if the
 .Va NS_MOREFRAG
 flag is set in all but the last slot.
 The maximum length of a chain is 64 buffers.
 This is normally used with
 .Nm VALE
 ports when connecting virtual machines, as they generate large
 TSO segments that are not split unless they reach a physical device.
 .Pp
 NOTE: The length field always refers to the individual
 fragment; there is no place with the total length of a packet.
 .Pp
 On receive rings the macro
 .Va NS_RFRAGS(slot)
 indicates the remaining number of slots for this packet,
 including the current one.
 Slots with a value greater than 1 also have NS_MOREFRAG set.
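 .Pp
 Since no field carries the total packet length, it has to be computed
 by adding up the fragments.
 The sketch below does this on a local copy of the slot layout; the
 sketch_ names are hypothetical, SKETCH_NS_MOREFRAG is a stand-in for
 NS_MOREFRAG from net/netmap.h, and the caller is assumed to have
 checked that the chain lies within the slots it owns.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in flag value; real code must use NS_MOREFRAG from net/netmap.h. */
#define SKETCH_NS_MOREFRAG 0x0020
/* Maximum chain length stated in the text. */
#define SKETCH_MAX_FRAGS 64

/* Local copy of the slot layout shown in DATA STRUCTURES. */
struct sketch_slot {
    uint32_t buf_idx;
    uint16_t len;
    uint16_t flags;
    uint64_t ptr;
};

/*
 * Add up the fragment lengths of the packet starting at slot[first].
 * Stores the number of slots used in *nfrags and returns the total
 * length; returns 0 with *nfrags == 0 if the chain exceeds 64 buffers.
 */
static uint32_t
sketch_packet_len(const struct sketch_slot *slot, uint32_t num_slots,
    uint32_t first, uint32_t *nfrags)
{
    uint32_t total = 0;
    uint32_t i = first;

    assert(first < num_slots);
    for (uint32_t n = 1; n <= SKETCH_MAX_FRAGS; n++) {
        total += slot[i].len;
        if (!(slot[i].flags & SKETCH_NS_MOREFRAG)) {
            *nfrags = n;        /* last fragment has the flag clear */
            return total;
        }
        i = (i + 1 == num_slots) ? 0 : i + 1;
    }
    *nfrags = 0;
    return 0;
}
```

 The returned fragment count tells the caller how far to advance head
 and cur once the packet has been processed.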
 .Sh IOCTLS
 .Nm
 uses two ioctls (NIOCTXSYNC, NIOCRXSYNC)
 for non-blocking I/O.
 They take no argument.
 Two more ioctls (NIOCGINFO, NIOCREGIF) are used
 to query and configure ports, with the following argument:
 .Bd -literal
 struct nmreq {
     char      nr_name[IFNAMSIZ]; /* (i) port name                  */
     uint32_t  nr_version;        /* (i) API version                */
     uint32_t  nr_offset;         /* (o) nifp offset in mmap region */
     uint32_t  nr_memsize;        /* (o) size of the mmap region    */
     uint32_t  nr_tx_slots;       /* (i/o) slots in tx rings        */
     uint32_t  nr_rx_slots;       /* (i/o) slots in rx rings        */
     uint16_t  nr_tx_rings;       /* (i/o) number of tx rings       */
     uint16_t  nr_rx_rings;       /* (i/o) number of rx rings       */
     uint16_t  nr_ringid;         /* (i/o) ring(s) we care about    */
     uint16_t  nr_cmd;            /* (i) special command            */
     uint16_t  nr_arg1;           /* (i/o) extra arguments          */
     uint16_t  nr_arg2;           /* (i/o) extra arguments          */
     uint32_t  nr_arg3;           /* (i/o) extra arguments          */
     uint32_t  nr_flags;          /* (i/o) open mode                */
     ...
 };
 .Ed
 .Pp
 A file descriptor obtained through
 .Pa /dev/netmap
 also supports the ioctls supported by network devices, see
 .Xr netintro 4 .
 .Bl -tag -width XXXX
 .It Dv NIOCGINFO
 returns EINVAL if the named port does not support netmap.
 Otherwise, it returns 0 and (advisory) information
 about the port.
 Note that all the information below can change before the
 interface is actually put in netmap mode.
 .Bl -tag -width XX
 .It Pa nr_memsize
 indicates the size of the
 .Nm
 memory region.
 NICs in
 .Nm
 mode all share the same memory region,
 whereas
 .Nm VALE
 ports have independent regions for each port.
 .It Pa nr_tx_slots , nr_rx_slots
 indicate the size of transmit and receive rings.
 .It Pa nr_tx_rings , nr_rx_rings
 indicate the number of transmit
 and receive rings.
 Both ring number and sizes may be configured at runtime
 using interface-specific functions (e.g.,
 .Xr ethtool 8
 ).
 .El
 .It Dv NIOCREGIF
 binds the port named in
 .Va nr_name
 to the file descriptor.
 For a physical device this also switches it into
 .Nm
 mode, disconnecting
 it from the host stack.
 Multiple file descriptors can be bound to the same port,
 with proper synchronization left to the user.
 .Pp
 The recommended way to bind a file descriptor to a port is
 to use function
 .Va nm_open(..)
 (see
 .Sx LIBRARIES )
 which parses names to access specific port types and
 enable features.
 In the following we document the main features.
 .Pp
 .Dv NIOCREGIF
 can also bind a file descriptor to one endpoint of a
 .Em netmap pipe ,
 consisting of two netmap ports with a crossover connection.
 A netmap pipe shares the same memory space as the parent port,
 and is meant to enable configurations where a master process acts
 as a dispatcher towards slave processes.
 .Pp
 To enable this function, the
 .Pa nr_arg1
 field of the structure can be used as a hint to the kernel to
 indicate how many pipes we expect to use, and reserve extra space
 in the memory region.
 .Pp
 On return, it gives the same info as NIOCGINFO,
 with
 .Pa nr_ringid
 and
 .Pa nr_flags
 indicating the identity of the rings controlled through the file
 descriptor.
 .Pp
 .Va nr_flags
 and
 .Va nr_ringid
 select which rings are controlled through this file descriptor.
 Possible values of
 .Pa nr_flags
 are indicated below, together with the naming schemes
 that application libraries (such as the
 .Nm nm_open
 indicated below) can use to indicate the specific set of rings.
 In the example below, "netmap:foo" is any valid netmap port name.
 .Bl -tag -width XXXXX
 .It NR_REG_ALL_NIC                         "netmap:foo"
 (default) all hardware ring pairs
 .It NR_REG_SW            "netmap:foo^"
 the ``host rings'', connecting to the host stack.
 .It NR_REG_NIC_SW        "netmap:foo+"
 all hardware rings and the host rings
 .It NR_REG_ONE_NIC       "netmap:foo-i"
 only the i-th hardware ring pair, where the number is in
 .Pa nr_ringid ;
 .It NR_REG_PIPE_MASTER  "netmap:foo{i"
 the master side of the netmap pipe whose identifier (i) is in
 .Pa nr_ringid ;
 .It NR_REG_PIPE_SLAVE   "netmap:foo}i"
 the slave side of the netmap pipe whose identifier (i) is in
 .Pa nr_ringid .
 .Pp
 The identifier of a pipe must be thought of as part of the pipe name,
 and does not need to be sequential.
 On return the pipe
 will only have a single ring pair with index 0,
 irrespective of the value of
 .Va i .
 .El
 .Pp
 By default, a
 .Xr poll 2
 or
 .Xr select 2
 call pushes out any pending packets on the transmit ring, even if
 no write events are specified.
 The feature can be disabled by or-ing
 .Va NETMAP_NO_TX_POLL
 to the value written to
 .Va nr_ringid .
 When this feature is used,
 packets are transmitted only when
 .Va ioctl(NIOCTXSYNC)
 is called, or when select()/poll() is called with a write event
 (POLLOUT/wfdset) or a full ring.
 .Pp
 When registering a virtual interface that is dynamically created to a
 .Xr vale 4
 switch, we can specify the desired number of rings (1 by default,
 and currently up to 16) on it using nr_tx_rings and nr_rx_rings fields.
 .It Dv NIOCTXSYNC
 tells the hardware of new packets to transmit, and updates the
 number of slots available for transmission.
 .It Dv NIOCRXSYNC
 tells the hardware of consumed packets, and asks for newly available
 packets.
 .El
 .Sh SELECT, POLL, EPOLL, KQUEUE.
 .Xr select 2
 and
 .Xr poll 2
 on a
 .Nm
 file descriptor process rings as indicated in
 .Sx TRANSMIT RINGS
 and
 .Sx RECEIVE RINGS ,
 respectively when write (POLLOUT) and read (POLLIN) events are requested.
 Both block if no slots are available in the ring
 .Va ( ring->cur == ring->tail ) .
 Depending on the platform,
 .Xr epoll 2
 and
 .Xr kqueue 2
 are supported too.
 .Pp
 Packets in transmit rings are normally pushed out
 (and buffers reclaimed) even without
 requesting write events.
 Passing the
 .Dv NETMAP_NO_TX_POLL
 flag to
 .Em NIOCREGIF
 disables this feature.
 By default, receive rings are processed only if read
 events are requested.
 Passing the
 .Dv NETMAP_DO_RX_POLL
 flag to
 .Em NIOCREGIF
 updates receive rings even without read events.
 Note that on epoll and kqueue,
 .Dv NETMAP_NO_TX_POLL
 and
 .Dv NETMAP_DO_RX_POLL
 only have an effect when some event is posted for the file descriptor.
 .Sh LIBRARIES
 The
 .Nm
 API is supposed to be used directly, both because of its simplicity and
 for efficient integration with applications.
 .Pp
 For convenience, the
 .In net/netmap_user.h
 header provides a few macros and functions to ease creating
 a file descriptor and doing I/O with a
 .Nm
 port.
 These are loosely modeled after the
 .Xr pcap 3
 API, to ease porting of libpcap-based applications to
 .Nm .
 To use these extra functions, programs should
 .Dl #define NETMAP_WITH_LIBS
 before
 .Dl #include <net/netmap_user.h>
 .Pp
 The following functions are available:
 .Bl -tag -width XXXXX
 .It Va  struct nm_desc * nm_open(const char *ifname, const struct nmreq *req, uint64_t flags, const struct nm_desc *arg)
 similar to
 .Xr pcap_open 3pcap ,
 binds a file descriptor to a port.
 .Bl -tag -width XX
 .It Va ifname
 is a port name, in the form "netmap:PPP" for a NIC and "valeSSS:PPP" for a
 .Nm VALE
 port.
 .It Va req
 provides the initial values for the argument to the NIOCREGIF ioctl.
 The nr_flags and nr_ringid values are overwritten by parsing
 ifname and flags, and other fields can be overridden through
 the other two arguments.
 .It Va arg
 points to a struct nm_desc containing arguments (e.g., from a previously
 open file descriptor) that should override the defaults.
 The fields are used as described below.
 .It Va flags
 can be set to a combination of the following flags:
 .Va NETMAP_NO_TX_POLL ,
 .Va NETMAP_DO_RX_POLL
 (copied into nr_ringid);
 .Va NM_OPEN_NO_MMAP
 (if arg points to the same memory region,
 avoids the mmap and uses the values from it);
 .Va NM_OPEN_IFNAME
 (ignores ifname and uses the values in arg);
 .Va NM_OPEN_ARG1 ,
 .Va NM_OPEN_ARG2 ,
 .Va NM_OPEN_ARG3
 (uses the fields from arg);
 .Va NM_OPEN_RING_CFG
 (uses the ring number and sizes from arg).
 .El
 .It Va int nm_close(struct nm_desc *d)
 closes the file descriptor, unmaps memory, frees resources.
 .It Va int nm_inject(struct nm_desc *d, const void *buf, size_t size)
 similar to pcap_inject(), pushes a packet to a ring, returns the size
 of the packet if successful, or 0 on error.
 .It Va int nm_dispatch(struct nm_desc *d, int cnt, nm_cb_t cb, u_char *arg)
 similar to pcap_dispatch(), applies a callback to incoming packets
 .It Va u_char * nm_nextpkt(struct nm_desc *d, struct nm_pkthdr *hdr)
 similar to pcap_next(), fetches the next packet
 .El
 .Sh SUPPORTED DEVICES
 .Nm
 natively supports the following devices:
 .Pp
 On FreeBSD:
 .Xr cxgbe 4 ,
 .Xr em 4 ,
 .Xr igb 4 ,
 .Xr ixgbe 4 ,
 .Xr ixl 4 ,
 .Xr lem 4 ,
 .Xr re 4 .
 .Pp
 On Linux:
 .Xr e1000 4 ,
 .Xr e1000e 4 ,
 .Xr i40e 4 ,
 .Xr igb 4 ,
 .Xr ixgbe 4 ,
 .Xr r8169 4 .
 .Pp
 NICs without native support can still be used in
 .Nm
 mode through emulation.
 Performance is inferior to native netmap
 mode but still significantly higher than various raw socket types
 (bpf, PF_PACKET, etc.).
 Note that for slow devices (such as 1 Gbit/s and slower NICs,
 or several 10 Gbit/s NICs whose hardware is unable to sustain line rate),
 emulated and native mode will likely have similar or identical throughput.
 .Pp
 When emulation is in use, packet sniffer programs such as tcpdump
 could see received packets before they are diverted by netmap.
 This behaviour is not intentional, being just an artifact of the implementation
 of emulation.
 Note that in case the netmap application subsequently moves packets received
 from the emulated adapter onto the host RX ring, the sniffer will intercept
 those packets again, since the packets are injected to the host stack as they
 were received by the network interface.
 .Pp
 Emulation is also available for devices with native netmap support,
 which can be used for testing or performance comparison.
 The sysctl variable
 .Va dev.netmap.admode
 globally controls how netmap mode is implemented.
 .Sh SYSCTL VARIABLES AND MODULE PARAMETERS
 Some aspects of the operation of
 .Nm
 are controlled through sysctl variables on FreeBSD
 .Em ( dev.netmap.* )
 and module parameters on Linux
 .Em ( /sys/module/netmap_lin/parameters/* ) :
 .Bl -tag -width indent
 .It Va dev.netmap.admode: 0
 Controls the use of native or emulated adapter mode.
 .Pp
 0 uses the best available option;
 .Pp
 1 forces native mode and fails if not available;
 .Pp
 2 forces emulated mode, hence never fails.
 .It Va dev.netmap.generic_ringsize: 1024
 Ring size used for emulated netmap mode
 .It Va dev.netmap.generic_mit: 100000
 Controls interrupt moderation for emulated mode
 .It Va dev.netmap.mmap_unreg: 0
 .It Va dev.netmap.fwd: 0
 Forces NS_FORWARD mode
 .It Va dev.netmap.flags: 0
 .It Va dev.netmap.txsync_retry: 2
 .It Va dev.netmap.no_pendintr: 1
 Forces recovery of transmit buffers on system calls
 .It Va dev.netmap.mitigate: 1
 Propagates interrupt mitigation to user processes
 .It Va dev.netmap.no_timestamp: 0
 Disables the update of the timestamp in the netmap ring
 .It Va dev.netmap.verbose: 0
 Verbose kernel messages
 .It Va dev.netmap.buf_num: 163840
 .It Va dev.netmap.buf_size: 2048
 .It Va dev.netmap.ring_num: 200
 .It Va dev.netmap.ring_size: 36864
 .It Va dev.netmap.if_num: 100
 .It Va dev.netmap.if_size: 1024
 Sizes and number of objects (netmap_if, netmap_ring, buffers)
 for the global memory region.
 The only parameter worth modifying is
 .Va dev.netmap.buf_num
 as it impacts the total amount of memory used by netmap.
 .It Va dev.netmap.buf_curr_num: 0
 .It Va dev.netmap.buf_curr_size: 0
 .It Va dev.netmap.ring_curr_num: 0
 .It Va dev.netmap.ring_curr_size: 0
 .It Va dev.netmap.if_curr_num: 0
 .It Va dev.netmap.if_curr_size: 0
 Actual values in use.
 .It Va dev.netmap.bridge_batch: 1024
 Batch size used when moving packets across a
 .Nm VALE
 switch.
 Values above 64 generally guarantee good
 performance.
 .El
 .Sh SYSTEM CALLS
 .Nm
 uses
 .Xr select 2 ,
 .Xr poll 2 ,
 .Xr epoll 2
 and
 .Xr kqueue 2
 to wake up processes when significant events occur, and
 .Xr mmap 2
 to map memory.
 .Xr ioctl 2
 is used to configure ports and
 .Nm VALE switches .
 .Pp
 Applications may need to create threads and bind them to
 specific cores to improve performance, using standard
 OS primitives, see
 .Xr pthread 3 .
 In particular,
 .Xr pthread_setaffinity_np 3
 may be of use.
 .Sh EXAMPLES
 .Ss TEST PROGRAMS
 .Nm
 comes with a few programs that can be used for testing or
 simple applications.
 See the
 .Pa examples/
 directory in
 .Nm
 distributions, or
 .Pa tools/tools/netmap/
 directory in
 .Fx
 distributions.
 .Pp
 .Xr pkt-gen 8
 is a general purpose traffic source/sink.
 .Pp
 As an example
 .Dl pkt-gen -i ix0 -f tx -l 60
 can generate an infinite stream of minimum size packets, and
 .Dl pkt-gen -i ix0 -f rx
 is a traffic sink.
 Both print traffic statistics, to help monitor
 how the system performs.
 .Pp
 .Xr pkt-gen 8
 has many options that can be used to set packet sizes, addresses,
 rates, and use multiple send/receive threads and cores.
 .Pp
 .Xr bridge 8
 is another test program which interconnects two
 .Nm
 ports.
 It can be used for transparent forwarding between
 interfaces, as in
 .Dl bridge -i ix0 -i ix1
 or even connect the NIC to the host stack using netmap
 .Dl bridge -i ix0 -i ix0
 .Ss USING THE NATIVE API
 The following code implements a traffic generator
 .Pp
 .Bd -literal -compact
 #include <net/netmap_user.h>
 \&...
 void sender(void)
 {
     struct netmap_if *nifp;
     struct netmap_ring *ring;
     struct nmreq nmr;
     struct pollfd fds;
     char *p, *buf;
     uint32_t i;
     int fd;
 
     fd = open("/dev/netmap", O_RDWR);
     bzero(&nmr, sizeof(nmr));
     strcpy(nmr.nr_name, "ix0");
     nmr.nr_version = NETMAP_API;
     ioctl(fd, NIOCREGIF, &nmr);
     p = mmap(0, nmr.nr_memsize, PROT_READ | PROT_WRITE,
         MAP_SHARED, fd, 0);
     nifp = NETMAP_IF(p, nmr.nr_offset);
     ring = NETMAP_TXRING(nifp, 0);
     fds.fd = fd;
     fds.events = POLLOUT;
     for (;;) {
 	poll(&fds, 1, -1);
 	while (!nm_ring_empty(ring)) {
 	    i = ring->cur;
 	    buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
 	    ... prepare packet in buf ...
 	    ring->slot[i].len = ... packet length ...
 	    ring->head = ring->cur = nm_ring_next(ring, i);
 	}
     }
 }
 .Ed
 .Ss HELPER FUNCTIONS
 A simple receiver can be implemented using the helper functions
 .Bd -literal -compact
 #define NETMAP_WITH_LIBS
 #include <net/netmap_user.h>
 \&...
 void receiver(void)
 {
     struct nm_desc *d;
     struct pollfd fds;
     u_char *buf;
     struct nm_pkthdr h;
     ...
     d = nm_open("netmap:ix0", NULL, 0, 0);
     fds.fd = NETMAP_FD(d);
     fds.events = POLLIN;
     for (;;) {
 	poll(&fds, 1, -1);
         while ( (buf = nm_nextpkt(d, &h)) )
 	    consume_pkt(buf, h.len);
     }
     nm_close(d);
 }
 .Ed
 .Ss ZERO-COPY FORWARDING
 Since physical interfaces share the same memory region,
 it is possible to do packet forwarding between ports
 by swapping buffers.
 The buffer from the transmit ring is used
 to replenish the receive ring:
 .Bd -literal -compact
     uint32_t tmp;
     struct netmap_slot *src, *dst;
     ...
     src = &rxr->slot[rxr->cur];
     dst = &txr->slot[txr->cur];
     tmp = dst->buf_idx;
     dst->buf_idx = src->buf_idx;
     dst->len = src->len;
     dst->flags = NS_BUF_CHANGED;
     src->buf_idx = tmp;
     src->flags = NS_BUF_CHANGED;
     rxr->head = rxr->cur = nm_ring_next(rxr, rxr->cur);
     txr->head = txr->cur = nm_ring_next(txr, txr->cur);
     ...
 .Ed
 .Ss ACCESSING THE HOST STACK
 The host stack is for all practical purposes just a regular ring pair,
 which you can access with the netmap API, e.g., with
 .Dl nm_open("netmap:eth0^", ... ) ;
 All packets that the host would send to an interface in
 .Nm
 mode end up in the RX ring, whereas all packets queued to the
 TX ring are sent up to the host stack.
 .Ss VALE SWITCH
 A simple way to test the performance of a
 .Nm VALE
 switch is to attach a sender and a receiver to it,
 e.g., running the following in two different terminals:
 .Dl pkt-gen -i vale1:a -f rx # receiver
 .Dl pkt-gen -i vale1:b -f tx # sender
 The same example can be used to test netmap pipes, by simply
 changing port names, e.g.,
 .Dl pkt-gen -i vale2:x{3 -f rx # receiver on the master side
 .Dl pkt-gen -i vale2:x}3 -f tx # sender on the slave side
 .Pp
 The following command attaches an interface and the host stack
 to a switch:
 .Dl vale-ctl -h vale2:em0
 Other
 .Nm
 clients attached to the same switch can now communicate
 with the network card or the host.
 .Sh SEE ALSO
-.Xr pkt-gen 8 ,
-.Xr bridge 8
+.Xr vale 4 ,
+.Xr vale-ctl 4 ,
+.Xr bridge 8 ,
+.Xr lb 8 ,
+.Xr pkt-gen 8
 .Pp
 .Pa http://info.iet.unipi.it/~luigi/netmap/
 .Pp
 Luigi Rizzo, Revisiting network I/O APIs: the netmap framework,
 Communications of the ACM, 55 (3), pp.45-51, March 2012
 .Pp
 Luigi Rizzo, netmap: a novel framework for fast packet I/O,
 Usenix ATC'12, June 2012, Boston
 .Pp
 Luigi Rizzo, Giuseppe Lettieri,
 VALE, a switched ethernet for virtual machines,
 ACM CoNEXT'12, December 2012, Nice
 .Pp
 Luigi Rizzo, Giuseppe Lettieri, Vincenzo Maffione,
 Speeding up packet I/O in virtual machines,
 ACM/IEEE ANCS'13, October 2013, San Jose
 .Sh AUTHORS
 .An -nosplit
 The
 .Nm
 framework was originally designed and implemented at the
 Universita` di Pisa in 2011 by
 .An Luigi Rizzo ,
 and further extended with help from
 .An Matteo Landi ,
 .An Gaetano Catalli ,
 .An Giuseppe Lettieri ,
 and
 .An Vincenzo Maffione .
 .Pp
 .Nm
 and
 .Nm VALE
 have been funded by the European Commission within FP7 Projects
 CHANGE (257422) and OPENLAB (287581).
 .Sh CAVEATS
 No matter how fast the CPU and OS are,
 achieving line rate on 10G and faster interfaces
 requires hardware with sufficient performance.
 Several NICs are unable to sustain line rate with
 small packet sizes.
 Insufficient PCIe or memory bandwidth
 can also cause reduced performance.
 .Pp
 Another frequent reason for low performance is the use
 of flow control on the link: a slow receiver can limit
 the transmit speed.
 Be sure to disable flow control when running high
 speed experiments.
 .Ss SPECIAL NIC FEATURES
 .Nm
 is orthogonal to some NIC features such as
 multiqueue, schedulers, and packet filters.
 .Pp
 Multiple transmit and receive rings are supported natively
 and can be configured with ordinary OS tools,
 such as
 .Xr ethtool 8
 or
 device-specific sysctl variables.
 The same goes for Receive Packet Steering (RPS)
 and filtering of incoming traffic.
 .Pp
 .Nm
 .Em does not use
 features such as
 .Em checksum offloading , TCP segmentation offloading ,
 .Em encryption , VLAN encapsulation/decapsulation ,
 etc.
 When using netmap to exchange packets with the host stack,
 make sure to disable these features.
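
The buffer swap shown in the ZERO-COPY FORWARDING subsection of netmap.4 can be tried without a netmap device. The following is a minimal sketch, not the netmap API itself: `toy_slot`, `toy_ring`, `toy_ring_next`, `toy_forward_one`, and `TOY_BUF_CHANGED` are hypothetical stand-ins that only mirror the fields the man page example touches (`buf_idx`, `len`, `flags`, `head`, `cur`). It illustrates that forwarding exchanges buffer indices between the two slots and never copies packet payload.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NSLOTS	4	/* hypothetical ring size, for illustration only */
#define TOY_BUF_CHANGED	0x0001	/* stand-in for the NS_BUF_CHANGED slot flag */

/* Simplified stand-in for a netmap slot: only the fields used by the swap. */
struct toy_slot {
	uint32_t buf_idx;	/* index of the packet buffer owned by this slot */
	uint16_t len;		/* packet length */
	uint16_t flags;
};

/* Simplified stand-in for a netmap ring. */
struct toy_ring {
	uint32_t head, cur;
	uint32_t num_slots;
	struct toy_slot slot[TOY_NSLOTS];
};

/* Advance an index by one slot, wrapping at the end of the ring. */
static uint32_t
toy_ring_next(const struct toy_ring *r, uint32_t i)
{
	return (i + 1 == r->num_slots) ? 0 : i + 1;
}

/*
 * Forward one packet from rxr to txr by swapping buffer indices.
 * The tx slot takes over the received buffer; the rx slot is
 * replenished with the buffer that the tx slot owned before.
 */
static void
toy_forward_one(struct toy_ring *rxr, struct toy_ring *txr)
{
	struct toy_slot *src, *dst;
	uint32_t tmp;

	assert(rxr->num_slots > 0 && txr->num_slots > 0);
	src = &rxr->slot[rxr->cur];
	dst = &txr->slot[txr->cur];
	tmp = dst->buf_idx;
	dst->buf_idx = src->buf_idx;
	dst->len = src->len;
	dst->flags = TOY_BUF_CHANGED;
	src->buf_idx = tmp;
	src->flags = TOY_BUF_CHANGED;
	rxr->head = rxr->cur = toy_ring_next(rxr, rxr->cur);
	txr->head = txr->cur = toy_ring_next(txr, txr->cur);
}
```

With an rx slot holding buffer 10 (length 60) and a tx slot holding buffer 20, one call to `toy_forward_one` leaves buffer 10 in the tx slot and buffer 20 in the rx slot, and advances both rings by one slot. In real netmap code the flag tells the kernel that the slot now refers to a different buffer.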
Index: head/tools/tools/netmap/Makefile
===================================================================
--- head/tools/tools/netmap/Makefile	(revision 340278)
+++ head/tools/tools/netmap/Makefile	(revision 340279)
@@ -1,36 +1,39 @@
 #
 # $FreeBSD$
 #
 # For multiple programs using a single source file each,
 # we can just define 'progs' and create custom targets.
-PROGS	=	pkt-gen nmreplay bridge vale-ctl
+PROGS	=	pkt-gen nmreplay bridge vale-ctl lb
 
 CLEANFILES = $(PROGS) *.o
 MAN=
 CFLAGS += -Werror -Wall
 CFLAGS += -Wextra
 
 LDFLAGS += -lpthread
 .ifdef WITHOUT_PCAP
 CFLAGS += -DNO_PCAP
 .else
 LDFLAGS += -lpcap
 .endif
 LDFLAGS += -lm # used by nmreplay
 
 .include <bsd.prog.mk>
 .include <bsd.lib.mk>
 
 all: $(PROGS)
 
 pkt-gen: pkt-gen.o
 	$(CC) $(CFLAGS) -o pkt-gen pkt-gen.o $(LDFLAGS)
 
 bridge: bridge.o
 	$(CC) $(CFLAGS) -o bridge bridge.o
 
 nmreplay: nmreplay.o
 	$(CC) $(CFLAGS) -o nmreplay nmreplay.o $(LDFLAGS)
 
 vale-ctl: vale-ctl.o
 	$(CC) $(CFLAGS) -o vale-ctl vale-ctl.o
+
+lb: lb.o pkt_hash.o
+	$(CC) $(CFLAGS) -o lb lb.o pkt_hash.o $(LDFLAGS)
Index: head/tools/tools/netmap/README
===================================================================
--- head/tools/tools/netmap/README	(revision 340278)
+++ head/tools/tools/netmap/README	(revision 340279)
@@ -1,9 +1,13 @@
 $FreeBSD$
 
-This directory contains examples that use netmap
+This directory contains applications that use the netmap API
 
-	pkt-gen		a packet sink/source using the netmap API
+	pkt-gen		a multi-function packet generator and traffic sink
 
-	bridge		a two-port jumper wire, also using the native API
+	bridge		a two-port jumper wire, also using the netmap API
 
-	vale-ctl	the program to control VALE bridges
+	vale-ctl	the program to control and inspect VALE switches
+
+	lb		an L3/L4 load balancer
+
+	nmreplay	a tool to playback a pcap file to a netmap port
Index: head/tools/tools/netmap/bridge.8
===================================================================
--- head/tools/tools/netmap/bridge.8	(revision 340278)
+++ head/tools/tools/netmap/bridge.8	(revision 340279)
@@ -1,82 +1,83 @@
 .\" Copyright (c) 2016 Luigi Rizzo, Universita` di Pisa
 .\"
 .\" Redistribution and use in source and binary forms, with or without
 .\" modification, are permitted provided that the following conditions
 .\" are met:
 .\" 1. Redistributions of source code must retain the above copyright
 .\"    notice, this list of conditions and the following disclaimer.
 .\" 2. Redistributions in binary form must reproduce the above copyright
 .\"    notice, this list of conditions and the following disclaimer in the
 .\"    documentation and/or other materials provided with the distribution.
 .\"
 .\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
 .\" SUCH DAMAGE.
 .\"
 .\" $FreeBSD$
 .\"
-.Dd October 23, 2018
+.Dd October 28, 2018
 .Dt BRIDGE 8
 .Os
 .Sh NAME
 .Nm bridge
 .Nd netmap client to bridge two netmap ports
 .Sh SYNOPSIS
 .Bk -words
 .Bl -tag -width "bridge"
 .It Nm
 .Op Fl i Ar port
 .Op Fl b Ar batch-size
 .Op Fl w Ar wait-link
 .Op Fl v
 .Op Fl c
 .El
 .Ek
 .Sh DESCRIPTION
 .Nm
 is a simple netmap application that bridges packets between two netmap ports.
 If the two netmap ports use the same netmap memory region
 .Nm
 forwards packets without copying the packets payload (zero-copy mode), unless
 explicitly prevented by the
 .Fl c
 flag.
 .Bl -tag -width Ds
 .It Fl i Ar port
 Name of the netmap port.
 It can be supplied up to two times to identify the ports that must be bridged.
 Any netmap port type (physical interface, VALE switch, pipe, monitor port...)
 can be used.
 If the option is supplied only once, then it must be for a physical interface and, in that case,
 .Nm
 will bridge the port and the host stack.
 .It Fl b Ar batch-size
 Maximum number of packets to send in one operation.
 .It Fl w Ar wait-link
 Indicates the number of seconds to wait before transmitting.
 It defaults to 2, and may be useful when talking to physical
 ports to let link negotiation complete before starting transmission.
 .It Fl v
 Enable verbose mode.
 .It Fl c
 Disable zero-copy mode.
 .El
 .Sh SEE ALSO
 .Xr netmap 4 ,
-.Xr pkt-gen 8
+.Xr pkt-gen 8 ,
+.Xr lb 8
 .Sh AUTHORS
 .An -nosplit
 .Nm
 has been written by
 .An Luigi Rizzo
 and
 .An Matteo Landi
 at the Universita` di Pisa, Italy.
Index: head/tools/tools/netmap/ctrs.h
===================================================================
--- head/tools/tools/netmap/ctrs.h	(revision 340278)
+++ head/tools/tools/netmap/ctrs.h	(revision 340279)
@@ -1,108 +1,116 @@
 #ifndef CTRS_H_
 #define CTRS_H_
 
 /* $FreeBSD$ */
 
 #include <sys/time.h>
 
 /* counters to accumulate statistics */
 struct my_ctrs {
-	uint64_t pkts, bytes, events, drop;
+	uint64_t pkts, bytes, events;
+	uint64_t drop, drop_bytes;
 	uint64_t min_space;
 	struct timeval t;
+	uint32_t oq_n; /* number of elements in overflow queue (used in lb) */
 };
 
 /* very crude code to print a number in normalized form.
  * Caller has to make sure that the buffer is large enough.
  */
 static const char *
-norm2(char *buf, double val, char *fmt)
+norm2(char *buf, double val, char *fmt, int normalize)
 {
 	char *units[] = { "", "K", "M", "G", "T" };
 	u_int i;
-
-	for (i = 0; val >=1000 && i < sizeof(units)/sizeof(char *) - 1; i++)
-		val /= 1000;
+	if (normalize)
+		for (i = 0; val >=1000 && i < sizeof(units)/sizeof(char *) - 1; i++)
+			val /= 1000;
+	else
+		i = 0;
 	sprintf(buf, fmt, val, units[i]);
 	return buf;
 }
 
 static __inline const char *
-norm(char *buf, double val)
+norm(char *buf, double val, int normalize)
 {
-	return norm2(buf, val, "%.3f %s");
+	if (normalize)
+		return norm2(buf, val, "%.3f %s", normalize);
+	else
+		return norm2(buf, val, "%.0f %s", normalize);
 }
 
 static __inline int
 timespec_ge(const struct timespec *a, const struct timespec *b)
 {
 
 	if (a->tv_sec > b->tv_sec)
 		return (1);
 	if (a->tv_sec < b->tv_sec)
 		return (0);
 	if (a->tv_nsec >= b->tv_nsec)
 		return (1);
 	return (0);
 }
 
 static __inline struct timespec
 timeval2spec(const struct timeval *a)
 {
 	struct timespec ts = {
 		.tv_sec = a->tv_sec,
 		.tv_nsec = a->tv_usec * 1000
 	};
 	return ts;
 }
 
 static __inline struct timeval
 timespec2val(const struct timespec *a)
 {
 	struct timeval tv = {
 		.tv_sec = a->tv_sec,
 		.tv_usec = a->tv_nsec / 1000
 	};
 	return tv;
 }
 
 
 static __inline struct timespec
 timespec_add(struct timespec a, struct timespec b)
 {
 	struct timespec ret = { a.tv_sec + b.tv_sec, a.tv_nsec + b.tv_nsec };
 	if (ret.tv_nsec >= 1000000000) {
 		ret.tv_sec++;
 		ret.tv_nsec -= 1000000000;
 	}
 	return ret;
 }
 
 static __inline struct timespec
 timespec_sub(struct timespec a, struct timespec b)
 {
 	struct timespec ret = { a.tv_sec - b.tv_sec, a.tv_nsec - b.tv_nsec };
 	if (ret.tv_nsec < 0) {
 		ret.tv_sec--;
 		ret.tv_nsec += 1000000000;
 	}
 	return ret;
 }
 
-static uint64_t
+static __inline uint64_t
 wait_for_next_report(struct timeval *prev, struct timeval *cur,
 		int report_interval)
 {
 	struct timeval delta;
 
 	delta.tv_sec = report_interval/1000;
 	delta.tv_usec = (report_interval%1000)*1000;
 	if (select(0, NULL, NULL, NULL, &delta) < 0 && errno != EINTR) {
 		perror("select");
 		abort();
 	}
 	gettimeofday(cur, NULL);
 	timersub(cur, prev, &delta);
 	return delta.tv_sec* 1000000 + delta.tv_usec;
 }
 #endif /* CTRS_H_ */
+
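
The ctrs.h change above adds a `normalize` argument to `norm()` and `norm2()`. The following standalone sketch shows the effect of that flag. `fmt_count` is a hypothetical name, and the use of `snprintf` with an explicit buffer length is an assumption made here for safety; the patched header uses `sprintf` and leaves buffer sizing to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format val into buf. With normalize set, scale the value down by
 * powers of 1000 and append the matching unit prefix ("%.3f %s").
 * With normalize clear, print the raw value with no decimals and an
 * empty unit ("%.0f %s"), which leaves a trailing space.
 */
static const char *
fmt_count(char *buf, size_t len, double val, int normalize)
{
	static const char *units[] = { "", "K", "M", "G", "T" };
	size_t i = 0;

	assert(buf != NULL && len > 0);
	if (normalize) {
		while (val >= 1000 && i < sizeof(units) / sizeof(units[0]) - 1) {
			val /= 1000;
			i++;
		}
		snprintf(buf, len, "%.3f %s", val, units[i]);
	} else {
		snprintf(buf, len, "%.0f %s", val, units[i]);
	}
	return buf;
}
```

For example, `fmt_count(buf, sizeof(buf), 1500.0, 1)` yields `"1.500 K"`, while the same value with `normalize` set to 0 yields `"1500 "`. The unnormalized form is easier to parse by scripts that consume the counters.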
Index: head/tools/tools/netmap/lb.8
===================================================================
--- head/tools/tools/netmap/lb.8	(nonexistent)
+++ head/tools/tools/netmap/lb.8	(revision 340279)
@@ -0,0 +1,130 @@
+.\" Copyright (c) 2017 Corelight, Inc. and Universita` di Pisa
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\"    notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\"    notice, this list of conditions and the following disclaimer in the
+.\"    documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.Dd October 28, 2018
+.Dt LB 8
+.Os
+.Sh NAME
+.Nm lb
+.Nd netmap-based load balancer
+.Sh SYNOPSIS
+.Bk -words
+.Bl -tag -width "lb"
+.It Nm
+.Op Fl i Ar port
+.Op Fl p Ar pipe-group
+.Op Fl B Ar extra-buffers
+.Op Fl b Ar batch-size
+.Op Fl w Ar wait-link
+.El
+.Ek
+.Sh DESCRIPTION
+.Nm
+reads packets from an input netmap port and sends them to a number of netmap pipes,
+trying to balance the packets received by each pipe.
+Packets belonging to the
+same connection will always be sent to the same pipe.
+.Pp
+Command line options are listed below.
+.Bl -tag -width Ds
+.It Fl i Ar port
+Name of a netmap port.
+It must be supplied exactly once to identify
+the input port.
+Any netmap port type (e.g., physical interface, VALE switch, pipe,
+monitor port) can be used.
+.It Fl p Ar name Ns Cm \&: Ns Ar number | number
+Add a new pipe group of the given number of pipes.
+The pipe group will receive all the packets read from the input port, balanced
+among the available pipes.
+The receiving ends of the pipes
+will be called
+.Dq Ar name Ns Em }0
+to
+.Dq Ar name No Ns Em } Ns Aq Ar number No - 1 .
+The name is optional and defaults to
+the name of the input port (stripped of any netmap operator).
+If the name is omitted, the colon can be omitted as well.
+.Pp
+This option can be supplied multiple times to define a sequence of pipe groups,
+each group receiving all the packets in turn.
+.Pp
+If no
+.Fl p
+option is given, a single group of two pipes with default name is assumed.
+.Pp
+It is allowed to use the same name for several groups.
+The pipe numbering in each
+group will start from where the previous identically-named group left off.
+.It Fl B Ar extra-buffers
+Try to reserve the given number of extra buffers.
+Extra buffers are shared among
+all pipes in all groups and work as an extension of the pipe rings.
+If a pipe ring is full for whatever reason,
+.Nm
+tries to use extra buffers before dropping any packets directed to that pipe.
+.Pp
+If all extra buffers are busy, some are stolen from the pipe with the longest
+backlog.
+This gives preference to newer packets over old ones, and prevents a
+stalled pipe from depleting the pool of extra buffers.
+.It Fl b Ar batch-size
+Maximum number of packets processed between two read operations from the input port.
+Higher values of batch-size improve performance by amortizing read operations,
+but increase the risk of filling up the port internal queues.
+.It Fl w Ar wait-link
+Indicates the number of seconds to wait before transmitting.
+It defaults to 2, and may be useful when talking to physical
+ports to let link negotiation complete before starting transmission.
+.El
+.Sh LIMITATIONS
+The group chaining assumes that the applications on the receiving end of the
+pipes are read-only: they must not modify the buffers or the pipe ring slots
+in any way.
+.Pp
+The group naming is currently implemented by creating a persistent VALE port
+with the given name.
+If
+.Nm
+does not exit cleanly, the ports will not be removed.
+Please use
+.Xr vale-ctl 4
+to remove any stale persistent VALE port.
+.Sh SEE ALSO
+.Xr netmap 4 ,
+.Xr bridge 8 ,
+.Xr pkt-gen 8
+.Pp
+.Pa http://info.iet.unipi.it/~luigi/netmap/
+.Sh AUTHORS
+.An -nosplit
+.Nm
+has been written by
+.An Seth Hall
+at Corelight, USA.
+The facilities related to extra buffers and pipe groups have been added by
+.An Giuseppe Lettieri
+at University of Pisa, Italy, under contract by Corelight, USA.

Property changes on: head/tools/tools/netmap/lb.8
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
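
The pipe group naming rules described in lb.8 above (an optional name, a default name taken from the input port, and numbering that continues across identically-named groups) can be sketched as follows. This is an illustrative stand-in, not the code from lb.c: `toy_group`, `toy_parse_group`, `toy_assign_ids`, and `toy_pipe_name` are hypothetical names. The fallback to two pipes when the number is empty follows the default stated in the man page.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NAMELEN	64	/* hypothetical name limit */
#define TOY_DEF_PIPES	2	/* default group size, as stated in lb.8 */

struct toy_group {
	char name[TOY_NAMELEN];
	int npipes;
	int first_id;		/* number of the first pipe in this group */
};

/* Parse "[name:]number"; a missing name falls back to base_name. */
static int
toy_parse_group(const char *spec, const char *base_name, struct toy_group *g)
{
	const char *colon = strchr(spec, ':');
	const char *num = spec;

	memset(g, 0, sizeof(*g));
	if (colon != NULL) {
		size_t n = (size_t)(colon - spec);

		if (n == 0 || n >= sizeof(g->name))
			return -1;	/* empty or too long name */
		memcpy(g->name, spec, n);
		num = colon + 1;
	} else {
		snprintf(g->name, sizeof(g->name), "%s", base_name);
	}
	g->npipes = (*num == '\0') ? TOY_DEF_PIPES : atoi(num);
	return (g->npipes < 1) ? -1 : 0;
}

/* Continue the numbering from earlier groups that share the same name. */
static void
toy_assign_ids(struct toy_group *groups, int ngroups)
{
	for (int i = 0; i < ngroups; i++) {
		groups[i].first_id = 0;
		for (int j = 0; j < i; j++)
			if (strcmp(groups[j].name, groups[i].name) == 0)
				groups[i].first_id += groups[j].npipes;
	}
}

/* Build the receiving end name of pipe k in group g, e.g. "foo}3". */
static void
toy_pipe_name(const struct toy_group *g, int k, char *buf, size_t len)
{
	assert(k >= 0 && k < g->npipes);
	snprintf(buf, len, "%s}%d", g->name, g->first_id + k);
}
```

Under these assumptions, the options `-p foo:2 -p 3 -p foo:2` on input port `ix0` would give receiving ends `foo}0` and `foo}1` for the first group, `ix0}0` to `ix0}2` for the second, and `foo}2` and `foo}3` for the third.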
Index: head/tools/tools/netmap/lb.c
===================================================================
--- head/tools/tools/netmap/lb.c	(nonexistent)
+++ head/tools/tools/netmap/lb.c	(revision 340279)
@@ -0,0 +1,1027 @@
+/*
+ * Copyright (C) 2017 Corelight, Inc. and Universita` di Pisa. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *   1. Redistributions of source code must retain the above copyright
+ *      notice, this list of conditions and the following disclaimer.
+ *   2. Redistributions in binary form must reproduce the above copyright
+ *      notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+/* $FreeBSD$ */
+#include <stdio.h>
+#include <string.h>
+#include <ctype.h>
+#include <stdbool.h>
+#include <inttypes.h>
+#include <syslog.h>
+
+#define NETMAP_WITH_LIBS
+#include <net/netmap_user.h>
+#include <sys/poll.h>
+
+#include <netinet/in.h>		/* htonl */
+
+#include <pthread.h>
+
+#include "pkt_hash.h"
+#include "ctrs.h"
+
+
+/*
+ * use our version of header structs, rather than bringing in a ton
+ * of platform specific ones
+ */
+#ifndef ETH_ALEN
+#define ETH_ALEN 6
+#endif
+
+struct compact_eth_hdr {
+	unsigned char h_dest[ETH_ALEN];
+	unsigned char h_source[ETH_ALEN];
+	u_int16_t h_proto;
+};
+
+struct compact_ip_hdr {
+	u_int8_t ihl:4, version:4;
+	u_int8_t tos;
+	u_int16_t tot_len;
+	u_int16_t id;
+	u_int16_t frag_off;
+	u_int8_t ttl;
+	u_int8_t protocol;
+	u_int16_t check;
+	u_int32_t saddr;
+	u_int32_t daddr;
+};
+
+struct compact_ipv6_hdr {
+	u_int8_t priority:4, version:4;
+	u_int8_t flow_lbl[3];
+	u_int16_t payload_len;
+	u_int8_t nexthdr;
+	u_int8_t hop_limit;
+	struct in6_addr saddr;
+	struct in6_addr daddr;
+};
+
+#define MAX_IFNAMELEN 	64
+#define MAX_PORTNAMELEN	(MAX_IFNAMELEN + 40)
+#define DEF_OUT_PIPES 	2
+#define DEF_EXTRA_BUFS 	0
+#define DEF_BATCH	2048
+#define DEF_WAIT_LINK	2
+#define DEF_STATS_INT	600
+#define BUF_REVOKE	100
+#define STAT_MSG_MAXSIZE 1024
+
+struct {
+	char ifname[MAX_IFNAMELEN];
+	char base_name[MAX_IFNAMELEN];
+	int netmap_fd;
+	uint16_t output_rings;
+	uint16_t num_groups;
+	uint32_t extra_bufs;
+	uint16_t batch;
+	int stdout_interval;
+	int syslog_interval;
+	int wait_link;
+	bool busy_wait;
+} glob_arg;
+
+/*
+ * the overflow queue is a circular queue of buffers
+ */
+struct overflow_queue {
+	char name[MAX_IFNAMELEN + 16];
+	struct netmap_slot *slots;
+	uint32_t head;
+	uint32_t tail;
+	uint32_t n;
+	uint32_t size;
+};
+
+struct overflow_queue *freeq;
+
+static inline int
+oq_full(struct overflow_queue *q)
+{
+	return q->n >= q->size;
+}
+
+static inline int
+oq_empty(struct overflow_queue *q)
+{
+	return q->n <= 0;
+}
+
+static inline void
+oq_enq(struct overflow_queue *q, const struct netmap_slot *s)
+{
+	if (unlikely(oq_full(q))) {
+		D("%s: queue full!", q->name);
+		abort();
+	}
+	q->slots[q->tail] = *s;
+	q->n++;
+	q->tail++;
+	if (q->tail >= q->size)
+		q->tail = 0;
+}
+
+static inline struct netmap_slot
+oq_deq(struct overflow_queue *q)
+{
+	struct netmap_slot s = q->slots[q->head];
+	if (unlikely(oq_empty(q))) {
+		D("%s: queue empty!", q->name);
+		abort();
+	}
+	q->n--;
+	q->head++;
+	if (q->head >= q->size)
+		q->head = 0;
+	return s;
+}
+
+static volatile int do_abort = 0;
+
+uint64_t dropped = 0;
+uint64_t forwarded = 0;
+uint64_t received_bytes = 0;
+uint64_t received_pkts = 0;
+uint64_t non_ip = 0;
+uint32_t freeq_n = 0;
+
+struct port_des {
+	char interface[MAX_PORTNAMELEN];
+	struct my_ctrs ctr;
+	unsigned int last_sync;
+	uint32_t last_tail;
+	struct overflow_queue *oq;
+	struct nm_desc *nmd;
+	struct netmap_ring *ring;
+	struct group_des *group;
+};
+
+struct port_des *ports;
+
+/* each group of pipes receives all the packets */
+struct group_des {
+	char pipename[MAX_IFNAMELEN];
+	struct port_des *ports;
+	int first_id;
+	int nports;
+	int last;
+	int custom_port;
+};
+
+struct group_des *groups;
+
+/* statistics */
+struct counters {
+	struct timeval ts;
+	struct my_ctrs *ctrs;
+	uint64_t received_pkts;
+	uint64_t received_bytes;
+	uint64_t non_ip;
+	uint32_t freeq_n;
+	int status __attribute__((aligned(64)));
+#define COUNTERS_EMPTY	0
+#define COUNTERS_FULL	1
+};
+
+struct counters counters_buf;
+
+static void *
+print_stats(void *arg)
+{
+	int npipes = glob_arg.output_rings;
+	int sys_int = 0;
+	(void)arg;
+	struct my_ctrs cur, prev;
+	struct my_ctrs *pipe_prev;
+
+	pipe_prev = calloc(npipes, sizeof(struct my_ctrs));
+	if (pipe_prev == NULL) {
+		D("out of memory");
+		exit(1);
+	}
+
+	char stat_msg[STAT_MSG_MAXSIZE] = "";
+
+	memset(&prev, 0, sizeof(prev));
+	while (!do_abort) {
+		int j, dosyslog = 0, dostdout = 0, newdata;
+		uint64_t pps = 0, dps = 0, bps = 0, dbps = 0, usec = 0;
+		struct my_ctrs x;
+
+		counters_buf.status = COUNTERS_EMPTY;
+		newdata = 0;
+		memset(&cur, 0, sizeof(cur));
+		sleep(1);
+		if (counters_buf.status == COUNTERS_FULL) {
+			__sync_synchronize();
+			newdata = 1;
+			cur.t = counters_buf.ts;
+			if (prev.t.tv_sec || prev.t.tv_usec) {
+				usec = (cur.t.tv_sec - prev.t.tv_sec) * 1000000 +
+					cur.t.tv_usec - prev.t.tv_usec;
+			}
+		}
+
+		++sys_int;
+		if (glob_arg.stdout_interval && sys_int % glob_arg.stdout_interval == 0)
+				dostdout = 1;
+		if (glob_arg.syslog_interval && sys_int % glob_arg.syslog_interval == 0)
+				dosyslog = 1;
+
+		for (j = 0; j < npipes; ++j) {
+			struct my_ctrs *c = &counters_buf.ctrs[j];
+			cur.pkts += c->pkts;
+			cur.drop += c->drop;
+			cur.drop_bytes += c->drop_bytes;
+			cur.bytes += c->bytes;
+
+			if (usec) {
+				x.pkts = c->pkts - pipe_prev[j].pkts;
+				x.drop = c->drop - pipe_prev[j].drop;
+				x.bytes = c->bytes - pipe_prev[j].bytes;
+				x.drop_bytes = c->drop_bytes - pipe_prev[j].drop_bytes;
+				pps = (x.pkts*1000000 + usec/2) / usec;
+				dps = (x.drop*1000000 + usec/2) / usec;
+				bps = ((x.bytes*1000000 + usec/2) / usec) * 8;
+				dbps = ((x.drop_bytes*1000000 + usec/2) / usec) * 8;
+			}
+			pipe_prev[j] = *c;
+
+			if ( (dosyslog || dostdout) && newdata )
+				snprintf(stat_msg, STAT_MSG_MAXSIZE,
+				       "{"
+				       "\"ts\":%.6f,"
+				       "\"interface\":\"%s\","
+				       "\"output_ring\":%" PRIu16 ","
+				       "\"packets_forwarded\":%" PRIu64 ","
+				       "\"packets_dropped\":%" PRIu64 ","
+				       "\"data_forward_rate_Mbps\":%.4f,"
+				       "\"data_drop_rate_Mbps\":%.4f,"
+				       "\"packet_forward_rate_kpps\":%.4f,"
+				       "\"packet_drop_rate_kpps\":%.4f,"
+				       "\"overflow_queue_size\":%" PRIu32
+				       "}", cur.t.tv_sec + (cur.t.tv_usec / 1000000.0),
+				            ports[j].interface,
+				            j,
+				            c->pkts,
+				            c->drop,
+				            (double)bps / 1024 / 1024,
+				            (double)dbps / 1024 / 1024,
+				            (double)pps / 1000,
+				            (double)dps / 1000,
+				            c->oq_n);
+
+			if (dosyslog && stat_msg[0])
+				syslog(LOG_INFO, "%s", stat_msg);
+			if (dostdout && stat_msg[0])
+				printf("%s\n", stat_msg);
+		}
+		if (usec) {
+			x.pkts = cur.pkts - prev.pkts;
+			x.drop = cur.drop - prev.drop;
+			x.bytes = cur.bytes - prev.bytes;
+			x.drop_bytes = cur.drop_bytes - prev.drop_bytes;
+			pps = (x.pkts*1000000 + usec/2) / usec;
+			dps = (x.drop*1000000 + usec/2) / usec;
+			bps = ((x.bytes*1000000 + usec/2) / usec) * 8;
+			dbps = ((x.drop_bytes*1000000 + usec/2) / usec) * 8;
+		}
+
+		if ( (dosyslog || dostdout) && newdata )
+			snprintf(stat_msg, STAT_MSG_MAXSIZE,
+			         "{"
+			         "\"ts\":%.6f,"
+			         "\"interface\":\"%s\","
+			         "\"output_ring\":null,"
+			         "\"packets_received\":%" PRIu64 ","
+			         "\"packets_forwarded\":%" PRIu64 ","
+			         "\"packets_dropped\":%" PRIu64 ","
+			         "\"non_ip_packets\":%" PRIu64 ","
+			         "\"data_forward_rate_Mbps\":%.4f,"
+			         "\"data_drop_rate_Mbps\":%.4f,"
+			         "\"packet_forward_rate_kpps\":%.4f,"
+			         "\"packet_drop_rate_kpps\":%.4f,"
+			         "\"free_buffer_slots\":%" PRIu32
+			         "}", cur.t.tv_sec + (cur.t.tv_usec / 1000000.0),
+			              glob_arg.ifname,
+			              received_pkts,
+			              cur.pkts,
+			              cur.drop,
+			              counters_buf.non_ip,
+			              (double)bps / 1024 / 1024,
+			              (double)dbps / 1024 / 1024,
+			              (double)pps / 1000,
+			              (double)dps / 1000,
+			              counters_buf.freeq_n);
+
+		if (dosyslog && stat_msg[0])
+			syslog(LOG_INFO, "%s", stat_msg);
+		if (dostdout && stat_msg[0])
+			printf("%s\n", stat_msg);
+
+		prev = cur;
+	}
+
+	free(pipe_prev);
+
+	return NULL;
+}
+
+static void
+free_buffers(void)
+{
+	int i, tot = 0;
+	struct port_des *rxport = &ports[glob_arg.output_rings];
+
+	/* build a netmap free list with the buffers in all the overflow queues */
+	for (i = 0; i < glob_arg.output_rings + 1; i++) {
+		struct port_des *cp = &ports[i];
+		struct overflow_queue *q = cp->oq;
+
+		if (!q)
+			continue;
+
+		while (q->n) {
+			struct netmap_slot s = oq_deq(q);
+			uint32_t *b = (uint32_t *)NETMAP_BUF(cp->ring, s.buf_idx);
+
+			*b = rxport->nmd->nifp->ni_bufs_head;
+			rxport->nmd->nifp->ni_bufs_head = s.buf_idx;
+			tot++;
+		}
+	}
+	D("added %d buffers to netmap free list", tot);
+
+	for (i = 0; i < glob_arg.output_rings + 1; ++i) {
+		nm_close(ports[i].nmd);
+	}
+}
+
+
+static void sigint_h(int sig)
+{
+	(void)sig;		/* UNUSED */
+	do_abort = 1;
+	signal(SIGINT, SIG_DFL);
+}
+
+void usage()
+{
+	printf("usage: lb [options]\n");
+	printf("where options are:\n");
+	printf("  -h              	view help text\n");
+	printf("  -i iface        	interface name (required)\n");
+	printf("  -p [prefix:]npipes	add a new group of output pipes\n");
+	printf("  -B nbufs        	number of extra buffers (default: %d)\n", DEF_EXTRA_BUFS);
+	printf("  -b batch        	batch size (default: %d)\n", DEF_BATCH);
+	printf("  -w seconds        	wait for link up (default: %d)\n", DEF_WAIT_LINK);
+	printf("  -W                    enable busy waiting. this will run your CPU at 100%%\n");
+	printf("  -s seconds      	seconds between syslog stats messages (default: 0)\n");
+	printf("  -o seconds      	seconds between stdout stats messages (default: 0)\n");
+	exit(0);
+}
+
+static int
+parse_pipes(char *spec)
+{
+	char *end = index(spec, ':');
+	static int max_groups = 0;
+	struct group_des *g;
+
+	ND("spec %s num_groups %d", spec, glob_arg.num_groups);
+	if (max_groups < glob_arg.num_groups + 1) {
+		size_t size = sizeof(*g) * (glob_arg.num_groups + 1);
+		groups = realloc(groups, size);
+		if (groups == NULL) {
+			D("out of memory");
+			return 1;
+		}
+	}
+	g = &groups[glob_arg.num_groups];
+	memset(g, 0, sizeof(*g));
+
+	if (end != NULL) {
+		if (end - spec > MAX_IFNAMELEN - 8) {
+			D("name '%s' too long", spec);
+			return 1;
+		}
+		if (end == spec) {
+			D("missing prefix before ':' in '%s'", spec);
+			return 1;
+		}
+		strncpy(g->pipename, spec, end - spec);
+		g->custom_port = 1;
+		end++;
+	} else {
+		/* no prefix, this group will use the
+		 * name of the input port.
+		 * This will be set in init_groups(),
+		 * since here the input port may still
+		 * be uninitialized
+		 */
+		end = spec;
+	}
+	if (*end == '\0') {
+		g->nports = DEF_OUT_PIPES;
+	} else {
+		g->nports = atoi(end);
+		if (g->nports < 1) {
+			D("invalid number of pipes '%s' (must be at least 1)", end);
+			return 1;
+		}
+	}
+	glob_arg.output_rings += g->nports;
+	glob_arg.num_groups++;
+	return 0;
+}
+
+/* complete the initialization of the groups data structure */
+void init_groups(void)
+{
+	int i, j, t = 0;
+	struct group_des *g = NULL;
+	for (i = 0; i < glob_arg.num_groups; i++) {
+		g = &groups[i];
+		g->ports = &ports[t];
+		for (j = 0; j < g->nports; j++)
+			g->ports[j].group = g;
+		t += g->nports;
+		if (!g->custom_port)
+			strcpy(g->pipename, glob_arg.base_name);
+		for (j = 0; j < i; j++) {
+			struct group_des *h = &groups[j];
+			if (!strcmp(h->pipename, g->pipename))
+				g->first_id += h->nports;
+		}
+	}
+	g->last = 1;
+}
+
+/* push the packet described by slot rs to the group g.
+ * This may cause other buffers to be pushed down the
+ * chain headed by g.
+ * Return a free buffer.
+ */
+uint32_t forward_packet(struct group_des *g, struct netmap_slot *rs)
+{
+	uint32_t hash = rs->ptr;
+	uint32_t output_port = hash % g->nports;
+	struct port_des *port = &g->ports[output_port];
+	struct netmap_ring *ring = port->ring;
+	struct overflow_queue *q = port->oq;
+
+	/* Move the packet to the output pipe, unless there is
+	 * either no space left on the ring, or there is some
+	 * packet still in the overflow queue (since those must
+	 * take precedence over the new one)
+	*/
+	if (ring->head != ring->tail && (q == NULL || oq_empty(q))) {
+		struct netmap_slot *ts = &ring->slot[ring->head];
+		struct netmap_slot old_slot = *ts;
+
+		ts->buf_idx = rs->buf_idx;
+		ts->len = rs->len;
+		ts->flags |= NS_BUF_CHANGED;
+		ts->ptr = rs->ptr;
+		ring->head = nm_ring_next(ring, ring->head);
+		port->ctr.bytes += rs->len;
+		port->ctr.pkts++;
+		forwarded++;
+		return old_slot.buf_idx;
+	}
+
+	/* use the overflow queue, if available */
+	if (q == NULL || oq_full(q)) {
+		/* no space left on the ring and no overflow queue
+		 * available: we are forced to drop the packet
+		 */
+		dropped++;
+		port->ctr.drop++;
+		port->ctr.drop_bytes += rs->len;
+		return rs->buf_idx;
+	}
+
+	oq_enq(q, rs);
+
+	/*
+	 * we cannot continue down the chain and we need to
+	 * return a free buffer now. We take it from the free queue.
+	 */
+	if (oq_empty(freeq)) {
+		/* the free queue is empty. Revoke some buffers
+		 * from the longest overflow queue
+		 */
+		uint32_t j;
+		struct port_des *lp = &ports[0];
+		uint32_t max = lp->oq->n;
+
+		/* let lp point to the port with the longest queue */
+		for (j = 1; j < glob_arg.output_rings; j++) {
+			struct port_des *cp = &ports[j];
+			if (cp->oq->n > max) {
+				lp = cp;
+				max = cp->oq->n;
+			}
+		}
+
+		/* move the oldest BUF_REVOKE buffers from the
+		 * lp queue to the free queue
+		 */
+		// XXX optimize this cycle
+		for (j = 0; lp->oq->n && j < BUF_REVOKE; j++) {
+			struct netmap_slot tmp = oq_deq(lp->oq);
+
+			dropped++;
+			lp->ctr.drop++;
+			lp->ctr.drop_bytes += tmp.len;
+
+			oq_enq(freeq, &tmp);
+		}
+
+		ND(1, "revoked %d buffers from %s", j, lp->oq->name);
+	}
+
+	return oq_deq(freeq).buf_idx;
+}
+
+int main(int argc, char **argv)
+{
+	int ch;
+	uint32_t i;
+	int rv;
+	unsigned int iter = 0;
+	int poll_timeout = 10; /* default */
+
+	glob_arg.ifname[0] = '\0';
+	glob_arg.output_rings = 0;
+	glob_arg.batch = DEF_BATCH;
+	glob_arg.wait_link = DEF_WAIT_LINK;
+	glob_arg.busy_wait = false;
+	glob_arg.syslog_interval = 0;
+	glob_arg.stdout_interval = 0;
+
+	while ( (ch = getopt(argc, argv, "hi:p:b:B:s:o:w:W")) != -1) {
+		switch (ch) {
+		case 'i':
+			D("interface is %s", optarg);
+			if (strlen(optarg) > MAX_IFNAMELEN - 8) {
+				D("ifname too long %s", optarg);
+				return 1;
+			}
+			if (strncmp(optarg, "netmap:", 7) && strncmp(optarg, "vale", 4)) {
+				sprintf(glob_arg.ifname, "netmap:%s", optarg);
+			} else {
+				strcpy(glob_arg.ifname, optarg);
+			}
+			break;
+
+		case 'p':
+			if (parse_pipes(optarg)) {
+				usage();
+				return 1;
+			}
+			break;
+
+		case 'B':
+			glob_arg.extra_bufs = atoi(optarg);
+			D("requested %d extra buffers", glob_arg.extra_bufs);
+			break;
+
+		case 'b':
+			glob_arg.batch = atoi(optarg);
+			D("batch is %d", glob_arg.batch);
+			break;
+
+		case 'w':
+			glob_arg.wait_link = atoi(optarg);
+			D("wait time for link up is %d", glob_arg.wait_link);
+			break;
+
+		case 'W':
+			glob_arg.busy_wait = true;
+			break;
+
+		case 'o':
+			glob_arg.stdout_interval = atoi(optarg);
+			break;
+
+		case 's':
+			glob_arg.syslog_interval = atoi(optarg);
+			break;
+
+		case 'h':
+			usage();
+			return 0;
+			break;
+
+		default:
+			D("bad option %c %s", ch, optarg);
+			usage();
+			return 1;
+		}
+	}
+
+	if (glob_arg.ifname[0] == '\0') {
+		D("missing interface name");
+		usage();
+		return 1;
+	}
+
+	/* extract the base name */
+	char *nscan = strncmp(glob_arg.ifname, "netmap:", 7) ?
+			glob_arg.ifname : glob_arg.ifname + 7;
+	strncpy(glob_arg.base_name, nscan, MAX_IFNAMELEN-1);
+	for (nscan = glob_arg.base_name; *nscan && !strchr("-*^{}/@", *nscan); nscan++)
+		;
+	*nscan = '\0';
+
+	if (glob_arg.num_groups == 0)
+		parse_pipes("");
+
+	if (glob_arg.syslog_interval) {
+		setlogmask(LOG_UPTO(LOG_INFO));
+		openlog("lb", LOG_CONS | LOG_PID | LOG_NDELAY, LOG_LOCAL1);
+	}
+
+	uint32_t npipes = glob_arg.output_rings;
+
+
+	pthread_t stat_thread;
+
+	ports = calloc(npipes + 1, sizeof(struct port_des));
+	if (!ports) {
+		D("failed to allocate the stats array");
+		return 1;
+	}
+	struct port_des *rxport = &ports[npipes];
+	init_groups();
+
+	memset(&counters_buf, 0, sizeof(counters_buf));
+	counters_buf.ctrs = calloc(npipes, sizeof(struct my_ctrs));
+	if (!counters_buf.ctrs) {
+		D("failed to allocate the counters snapshot buffer");
+		return 1;
+	}
+
+	/* we need base_req to specify pipes and extra bufs */
+	struct nmreq base_req;
+	memset(&base_req, 0, sizeof(base_req));
+
+	base_req.nr_arg1 = npipes;
+	base_req.nr_arg3 = glob_arg.extra_bufs;
+
+	rxport->nmd = nm_open(glob_arg.ifname, &base_req, 0, NULL);
+
+	if (rxport->nmd == NULL) {
+		D("cannot open %s", glob_arg.ifname);
+		return (1);
+	} else {
+		D("successfully opened %s (tx slots: %u)", glob_arg.ifname,
+		  rxport->nmd->req.nr_tx_slots);
+	}
+
+	uint32_t extra_bufs = rxport->nmd->req.nr_arg3;
+	struct overflow_queue *oq = NULL;
+	/* reference ring to access the buffers */
+	rxport->ring = NETMAP_RXRING(rxport->nmd->nifp, 0);
+
+	if (!glob_arg.extra_bufs)
+		goto run;
+
+	D("obtained %d extra buffers", extra_bufs);
+	if (!extra_bufs)
+		goto run;
+
+	/* one overflow queue for each output pipe, plus one for the
+	 * free extra buffers
+	 */
+	oq = calloc(npipes + 1, sizeof(struct overflow_queue));
+	if (!oq) {
+		D("failed to allocate overflow queue descriptors");
+		goto run;
+	}
+
+	freeq = &oq[npipes];
+	rxport->oq = freeq;
+
+	freeq->slots = calloc(extra_bufs, sizeof(struct netmap_slot));
+	if (!freeq->slots) {
+		D("failed to allocate the free list");
+	}
+	freeq->size = extra_bufs;
+	snprintf(freeq->name, MAX_IFNAMELEN, "free queue");
+
+	/*
+	 * the list of buffers uses the first uint32_t in each buffer
+	 * as the index of the next buffer.
+	 */
+	uint32_t scan;
+	for (scan = rxport->nmd->nifp->ni_bufs_head;
+	     scan;
+	     scan = *(uint32_t *)NETMAP_BUF(rxport->ring, scan))
+	{
+		struct netmap_slot s;
+		s.len = s.flags = 0;
+		s.ptr = 0;
+		s.buf_idx = scan;
+		ND("freeq <- %d", s.buf_idx);
+		oq_enq(freeq, &s);
+	}
+
+
+	if (freeq->n != extra_bufs) {
+		D("something went wrong: netmap reported %d extra_bufs, but the free list contained %d",
+				extra_bufs, freeq->n);
+		return 1;
+	}
+	rxport->nmd->nifp->ni_bufs_head = 0;
+
+run:
+	atexit(free_buffers);
+
+	int j, t = 0;
+	for (j = 0; j < glob_arg.num_groups; j++) {
+		struct group_des *g = &groups[j];
+		int k;
+		for (k = 0; k < g->nports; ++k) {
+			struct port_des *p = &g->ports[k];
+			snprintf(p->interface, MAX_PORTNAMELEN, "%s%s{%d/xT@%d",
+					(strncmp(g->pipename, "vale", 4) ? "netmap:" : ""),
+					g->pipename, g->first_id + k,
+					rxport->nmd->req.nr_arg2);
+			D("opening pipe named %s", p->interface);
+
+			p->nmd = nm_open(p->interface, NULL, 0, rxport->nmd);
+
+			if (p->nmd == NULL) {
+				D("cannot open %s", p->interface);
+				return (1);
+			} else if (p->nmd->req.nr_arg2 != rxport->nmd->req.nr_arg2) {
+				D("failed to open pipe #%d in zero-copy mode, "
+					"please close any application that uses either pipe %s}%d, "
+				        "or %s{%d, and retry",
+					k + 1, g->pipename, g->first_id + k, g->pipename, g->first_id + k);
+				return (1);
+			} else {
+				D("successfully opened pipe #%d %s (tx slots: %d)",
+				  k + 1, p->interface, p->nmd->req.nr_tx_slots);
+				p->ring = NETMAP_TXRING(p->nmd->nifp, 0);
+				p->last_tail = nm_ring_next(p->ring, p->ring->tail);
+			}
+			D("zerocopy %s",
+			  (rxport->nmd->mem == p->nmd->mem) ? "enabled" : "disabled");
+
+			if (extra_bufs) {
+				struct overflow_queue *q = &oq[t + k];
+				q->slots = calloc(extra_bufs, sizeof(struct netmap_slot));
+				if (!q->slots) {
+					D("failed to allocate overflow queue for pipe %d", k);
+					/* make all overflow queue management fail */
+					extra_bufs = 0;
+				}
+				q->size = extra_bufs;
+				snprintf(q->name, sizeof(q->name), "oq %s{%4d", g->pipename, k);
+				p->oq = q;
+			}
+		}
+		t += g->nports;
+	}
+
+	if (glob_arg.extra_bufs && !extra_bufs) {
+		if (oq) {
+			for (i = 0; i < npipes + 1; i++) {
+				free(oq[i].slots);
+				oq[i].slots = NULL;
+			}
+			free(oq);
+			oq = NULL;
+		}
+		D("*** overflow queues disabled ***");
+	}
+
+	sleep(glob_arg.wait_link);
+
+	/* start stats thread after wait_link */
+	if ((rv = pthread_create(&stat_thread, NULL, print_stats, NULL)) != 0) {
+		D("unable to create the stats thread: %s", strerror(rv));
+		return 1;
+	}
+
+	struct pollfd pollfd[npipes + 1];
+	memset(&pollfd, 0, sizeof(pollfd));
+	signal(SIGINT, sigint_h);
+
+	/* make sure we wake up as often as needed, even when there are no
+	 * packets coming in
+	 */
+	if (glob_arg.syslog_interval > 0 && glob_arg.syslog_interval < poll_timeout)
+		poll_timeout = glob_arg.syslog_interval;
+	if (glob_arg.stdout_interval > 0 && glob_arg.stdout_interval < poll_timeout)
+		poll_timeout = glob_arg.stdout_interval;
+
+	while (!do_abort) {
+		u_int polli = 0;
+		iter++;
+
+		for (i = 0; i < npipes; ++i) {
+			struct netmap_ring *ring = ports[i].ring;
+			int pending = nm_tx_pending(ring);
+
+			/* if there are packets pending, we want to be notified when
+			 * tail moves, so we let cur=tail
+			 */
+			ring->cur = pending ? ring->tail : ring->head;
+
+			if (!glob_arg.busy_wait && !pending) {
+				/* no need to poll, there are no packets pending */
+				continue;
+			}
+			pollfd[polli].fd = ports[i].nmd->fd;
+			pollfd[polli].events = POLLOUT;
+			pollfd[polli].revents = 0;
+			++polli;
+		}
+
+		pollfd[polli].fd = rxport->nmd->fd;
+		pollfd[polli].events = POLLIN;
+		pollfd[polli].revents = 0;
+		++polli;
+
+		//RD(5, "polling %d file descriptors", polli+1);
+		rv = poll(pollfd, polli, poll_timeout);
+		if (rv <= 0) {
+			if (rv < 0 && errno != EAGAIN && errno != EINTR)
+				RD(1, "poll error %s", strerror(errno));
+			goto send_stats;
+		}
+
+		/* if there are several groups, try pushing released packets from
+		 * upstream groups to the downstream ones.
+		 *
+		 * It is important to do this before returned slots are reused
+		 * for new transmissions. For the same reason, this must be
+		 * done starting from the last group going backwards.
+		 */
+		for (i = glob_arg.num_groups - 1U; i > 0; i--) {
+			struct group_des *g = &groups[i - 1];
+			int j;
+
+			for (j = 0; j < g->nports; j++) {
+				struct port_des *p = &g->ports[j];
+				struct netmap_ring *ring = p->ring;
+				uint32_t last = p->last_tail,
+					 stop = nm_ring_next(ring, ring->tail);
+
+				/* slight abuse of the API here: we touch the slot
+				 * pointed to by tail
+				 */
+				for ( ; last != stop; last = nm_ring_next(ring, last)) {
+					struct netmap_slot *rs = &ring->slot[last];
+					// XXX less aggressive?
+					rs->buf_idx = forward_packet(g + 1, rs);
+					rs->flags |= NS_BUF_CHANGED;
+					rs->ptr = 0;
+				}
+				p->last_tail = last;
+			}
+		}
+
+
+
+		if (oq) {
+			/* try to push packets from the overflow queues
+			 * to the corresponding pipes
+			 */
+			for (i = 0; i < npipes; i++) {
+				struct port_des *p = &ports[i];
+				struct overflow_queue *q = p->oq;
+				uint32_t j, lim;
+				struct netmap_ring *ring;
+				struct netmap_slot *slot;
+
+				if (oq_empty(q))
+					continue;
+				ring = p->ring;
+				lim = nm_ring_space(ring);
+				if (!lim)
+					continue;
+				if (q->n < lim)
+					lim = q->n;
+				for (j = 0; j < lim; j++) {
+					struct netmap_slot s = oq_deq(q), tmp;
+					tmp.ptr = 0;
+					slot = &ring->slot[ring->head];
+					tmp.buf_idx = slot->buf_idx;
+					oq_enq(freeq, &tmp);
+					*slot = s;
+					slot->flags |= NS_BUF_CHANGED;
+					ring->head = nm_ring_next(ring, ring->head);
+				}
+			}
+		}
+
+		/* push any new packets from the input port to the first group */
+		int batch = 0;
+		for (i = rxport->nmd->first_rx_ring; i <= rxport->nmd->last_rx_ring; i++) {
+			struct netmap_ring *rxring = NETMAP_RXRING(rxport->nmd->nifp, i);
+
+			//D("prepare to scan rings");
+			int next_cur = rxring->cur;
+			struct netmap_slot *next_slot = &rxring->slot[next_cur];
+			const char *next_buf = NETMAP_BUF(rxring, next_slot->buf_idx);
+			while (!nm_ring_empty(rxring)) {
+				struct netmap_slot *rs = next_slot;
+				struct group_des *g = &groups[0];
+				++received_pkts;
+				received_bytes += rs->len;
+
+				// CHOOSE THE CORRECT OUTPUT PIPE
+				// 'B' is just a hashing seed
+				rs->ptr = pkt_hdr_hash((const unsigned char *)next_buf, 4, 'B');
+				if (rs->ptr == 0) {
+					non_ip++; // XXX ??
+				}
+				// prefetch the buffer for the next round
+				next_cur = nm_ring_next(rxring, next_cur);
+				next_slot = &rxring->slot[next_cur];
+				next_buf = NETMAP_BUF(rxring, next_slot->buf_idx);
+				__builtin_prefetch(next_buf);
+				rs->buf_idx = forward_packet(g, rs);
+				rs->flags |= NS_BUF_CHANGED;
+				rxring->head = rxring->cur = next_cur;
+
+				batch++;
+				if (unlikely(batch >= glob_arg.batch)) {
+					ioctl(rxport->nmd->fd, NIOCRXSYNC, NULL);
+					batch = 0;
+				}
+				ND(1,
+				   "Forwarded Packets: %"PRIu64" Dropped packets: %"PRIu64"   Percent: %.2f",
+				   forwarded, dropped,
+				   ((float)dropped / (float)forwarded * 100));
+			}
+
+		}
+
+	send_stats:
+		if (counters_buf.status == COUNTERS_FULL)
+			continue;
+		/* take a new snapshot of the counters */
+		gettimeofday(&counters_buf.ts, NULL);
+		for (i = 0; i < npipes; i++) {
+			struct my_ctrs *c = &counters_buf.ctrs[i];
+			*c = ports[i].ctr;
+			/*
+			 * If there are overflow queues, copy each port's current
+			 * queue length into the oq_n field of its counters.
+			 */
+			if (ports[i].oq != NULL)
+				c->oq_n = ports[i].oq->n;
+		}
+		counters_buf.received_pkts = received_pkts;
+		counters_buf.received_bytes = received_bytes;
+		counters_buf.non_ip = non_ip;
+		if (freeq != NULL)
+			counters_buf.freeq_n = freeq->n;
+		__sync_synchronize();
+		counters_buf.status = COUNTERS_FULL;
+	}
+
+	/*
+	 * If freeq exists, copy its current length to freeq_n;
+	 * otherwise set freeq_n to 0.
+	 */
+	if (freeq != NULL) {
+		freeq_n = freeq->n;
+	} else {
+		freeq_n = 0;
+	}
+
+	pthread_join(stat_thread, NULL);
+
+	printf("%"PRIu64" packets forwarded.  %"PRIu64" packets dropped. Total %"PRIu64"\n", forwarded,
+	       dropped, forwarded + dropped);
+	return 0;
+}

Property changes on: head/tools/tools/netmap/lb.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
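
The forwarding policy in forward_packet() above can be summarized in a small
standalone sketch. This is not code from lb.c: the enum, the struct and both
function names are hypothetical, and the "queue full" test assumes that
oq_full() means the queue length has reached its capacity. It only restates
the three outcomes (ring, overflow queue, drop) and the hash-to-port mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: these mirror, but are not, the lb.c definitions. */
enum fwd_action { FWD_TO_RING, FWD_TO_QUEUE, FWD_DROP };

struct port_state {
	uint32_t ring_space;	/* free slots on the output ring */
	bool	 has_queue;	/* was an overflow queue allocated? */
	uint32_t queue_len;	/* packets waiting in the overflow queue */
	uint32_t queue_size;	/* overflow queue capacity */
};

/* Map a flow hash to an output port index within a group. */
static uint32_t
pick_port(uint32_t hash, uint32_t nports)
{
	return hash % nports;
}

/* Decide what happens to a new packet destined for port p. */
static enum fwd_action
decide(const struct port_state *p)
{
	bool queue_empty = !p->has_queue || p->queue_len == 0;
	/* assumption: "full" means length has reached capacity */
	bool queue_full = !p->has_queue || p->queue_len >= p->queue_size;

	/* queued packets take precedence over the new one */
	if (p->ring_space > 0 && queue_empty)
		return FWD_TO_RING;
	/* no ring space (or a backlog) and nowhere to park the packet */
	if (queue_full)
		return FWD_DROP;
	return FWD_TO_QUEUE;
}
```

Note that a port with free ring slots but a non-empty overflow queue still
enqueues the new packet, which is how ordering within a port is preserved.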
Index: head/tools/tools/netmap/pkt_hash.c
===================================================================
--- head/tools/tools/netmap/pkt_hash.c	(nonexistent)
+++ head/tools/tools/netmap/pkt_hash.c	(revision 340279)
@@ -0,0 +1,396 @@
+/*
+ ** Copyright (c) 2015, Asim Jamshed, Robin Sommer, Seth Hall
+ ** and the International Computer Science Institute. All rights reserved.
+ **
+ ** Redistribution and use in source and binary forms, with or without
+ ** modification, are permitted provided that the following conditions are met:
+ **
+ ** (1) Redistributions of source code must retain the above copyright
+ **     notice, this list of conditions and the following disclaimer.
+ **
+ ** (2) Redistributions in binary form must reproduce the above copyright
+ **     notice, this list of conditions and the following disclaimer in the
+ **     documentation and/or other materials provided with the distribution.
+ **
+ **
+ ** THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ ** AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ ** IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ ** ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ ** LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ ** CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ ** SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ ** INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ ** CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ ** ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ ** POSSIBILITY OF SUCH DAMAGE.
+ **/
+/* $FreeBSD$ */
+/* for func prototypes */
+#include "pkt_hash.h"
+
+/* Make Linux headers choose BSD versions of some of the data structures */
+#define __FAVOR_BSD
+
+/* for types */
+#include <sys/types.h>
+/* for [n/h]to[h/n][ls] */
+#include <netinet/in.h>
+/* iphdr */
+#include <netinet/ip.h>
+/* ipv6hdr */
+#include <netinet/ip6.h>
+/* tcphdr */
+#include <netinet/tcp.h>
+/* udphdr */
+#include <netinet/udp.h>
+/* eth hdr */
+#include <net/ethernet.h>
+/* for memset */
+#include <string.h>
+
+#include <stdio.h>
+#include <assert.h>
+
+//#include <libnet.h>
+/*---------------------------------------------------------------------*/
+/**
+ ** The cache table is used to pick a nice seed for the hash value. It is
+ ** built only once when sym_hash_fn is called for the very first time
+ **/
+static void
+build_sym_key_cache(uint32_t *cache, int cache_len)
+{
+	static const uint8_t key[] = { 0x50, 0x6d };
+
+	uint32_t result = (((uint32_t)key[0]) << 24) |
+		(((uint32_t)key[1]) << 16) |
+		(((uint32_t)key[0]) << 8)  |
+		((uint32_t)key[1]);
+
+	uint32_t idx = 32;
+	int i;
+
+	for (i = 0; i < cache_len; i++, idx++) {
+		uint8_t shift = (idx % 8);
+		uint32_t bit;
+
+		cache[i] = result;
+		bit = ((key[(idx/8) & 1] << shift) & 0x80) ? 1 : 0;
+		result = ((result << 1) | bit);
+	}
+}
+
+static void
+build_byte_cache(uint32_t byte_cache[256][4])
+{
+#define KEY_CACHE_LEN			96
+	int i, j, k;
+	uint32_t key_cache[KEY_CACHE_LEN];
+
+	build_sym_key_cache(key_cache, KEY_CACHE_LEN);
+
+	for (i = 0; i < 4; i++) {
+		for (j = 0; j < 256; j++) {
+			uint8_t b = j;
+			byte_cache[j][i] = 0;
+			for (k = 0; k < 8; k++) {
+				if (b & 0x80)
+					byte_cache[j][i] ^= key_cache[8 * i + k];
+				b <<= 1U;
+			}
+		}
+	}
+}
+
+
+/*---------------------------------------------------------------------*/
+/**
+ ** Computes symmetric hash based on the 4-tuple header data
+ **/
+static uint32_t
+sym_hash_fn(uint32_t sip, uint32_t dip, uint16_t sp, uint16_t dp)
+{
+	uint32_t rc = 0;
+	static int first_time = 1;
+	static uint32_t byte_cache[256][4];
+	uint8_t *sip_b = (uint8_t *)&sip,
+		*dip_b = (uint8_t *)&dip,
+		*sp_b  = (uint8_t *)&sp,
+		*dp_b  = (uint8_t *)&dp;
+
+	if (first_time) {
+		build_byte_cache(byte_cache);
+		first_time = 0;
+	}
+
+	rc = byte_cache[sip_b[3]][0] ^
+	     byte_cache[sip_b[2]][1] ^
+	     byte_cache[sip_b[1]][2] ^
+	     byte_cache[sip_b[0]][3] ^
+	     byte_cache[dip_b[3]][0] ^
+	     byte_cache[dip_b[2]][1] ^
+	     byte_cache[dip_b[1]][2] ^
+	     byte_cache[dip_b[0]][3] ^
+	     byte_cache[sp_b[1]][0] ^
+	     byte_cache[sp_b[0]][1] ^
+	     byte_cache[dp_b[1]][2] ^
+	     byte_cache[dp_b[0]][3];
+
+	return rc;
+}
+static uint32_t decode_gre_hash(const uint8_t *, uint8_t, uint8_t);
+/*---------------------------------------------------------------------*/
+/**
+ ** Parser + hash function for the IPv4 packet
+ **/
+static uint32_t
+decode_ip_n_hash(struct ip *iph, uint8_t hash_split, uint8_t seed)
+{
+	uint32_t rc = 0;
+
+	if (hash_split == 2) {
+		rc = sym_hash_fn(ntohl(iph->ip_src.s_addr),
+			ntohl(iph->ip_dst.s_addr),
+			ntohs(0xFFFD) + seed,
+			ntohs(0xFFFE) + seed);
+	} else {
+		struct tcphdr *tcph = NULL;
+		struct udphdr *udph = NULL;
+
+		switch (iph->ip_p) {
+		case IPPROTO_TCP:
+			tcph = (struct tcphdr *)((uint8_t *)iph + (iph->ip_hl<<2));
+			rc = sym_hash_fn(ntohl(iph->ip_src.s_addr),
+					 ntohl(iph->ip_dst.s_addr),
+					 ntohs(tcph->th_sport) + seed,
+					 ntohs(tcph->th_dport) + seed);
+			break;
+		case IPPROTO_UDP:
+			udph = (struct udphdr *)((uint8_t *)iph + (iph->ip_hl<<2));
+			rc = sym_hash_fn(ntohl(iph->ip_src.s_addr),
+					 ntohl(iph->ip_dst.s_addr),
+					 ntohs(udph->uh_sport) + seed,
+					 ntohs(udph->uh_dport) + seed);
+			break;
+		case IPPROTO_IPIP:
+			/* tunneling */
+			rc = decode_ip_n_hash((struct ip *)((uint8_t *)iph + (iph->ip_hl<<2)),
+					      hash_split, seed);
+			break;
+		case IPPROTO_GRE:
+			rc = decode_gre_hash((uint8_t *)iph + (iph->ip_hl<<2),
+					hash_split, seed);
+			break;
+		case IPPROTO_ICMP:
+		case IPPROTO_ESP:
+		case IPPROTO_PIM:
+		case IPPROTO_IGMP:
+		default:
+			/*
+			 ** the hash is weaker with only 2 fields, but it
+			 ** should still hold
+			 **/
+			rc = sym_hash_fn(ntohl(iph->ip_src.s_addr),
+					 ntohl(iph->ip_dst.s_addr),
+					 ntohs(0xFFFD) + seed,
+					 ntohs(0xFFFE) + seed);
+			break;
+		}
+	}
+	return rc;
+}
+/*---------------------------------------------------------------------*/
+/**
+ ** Parser + hash function for the IPv6 packet
+ **/
+static uint32_t
+decode_ipv6_n_hash(struct ip6_hdr *ipv6h, uint8_t hash_split, uint8_t seed)
+{
+	uint32_t saddr, daddr;
+	uint32_t rc = 0;
+
+	/* Get only the first 4 octets */
+	saddr = (uint32_t)ipv6h->ip6_src.s6_addr[0] |
+		((uint32_t)ipv6h->ip6_src.s6_addr[1] << 8) |
+		((uint32_t)ipv6h->ip6_src.s6_addr[2] << 16) |
+		((uint32_t)ipv6h->ip6_src.s6_addr[3] << 24);
+	daddr = (uint32_t)ipv6h->ip6_dst.s6_addr[0] |
+		((uint32_t)ipv6h->ip6_dst.s6_addr[1] << 8) |
+		((uint32_t)ipv6h->ip6_dst.s6_addr[2] << 16) |
+		((uint32_t)ipv6h->ip6_dst.s6_addr[3] << 24);
+
+	if (hash_split == 2) {
+		rc = sym_hash_fn(ntohl(saddr),
+				 ntohl(daddr),
+				 ntohs(0xFFFD) + seed,
+				 ntohs(0xFFFE) + seed);
+	} else {
+		struct tcphdr *tcph = NULL;
+		struct udphdr *udph = NULL;
+
+		switch (ipv6h->ip6_ctlun.ip6_un1.ip6_un1_nxt) {
+		case IPPROTO_TCP:
+			tcph = (struct tcphdr *)(ipv6h + 1);
+			rc = sym_hash_fn(ntohl(saddr),
+					 ntohl(daddr),
+					 ntohs(tcph->th_sport) + seed,
+					 ntohs(tcph->th_dport) + seed);
+			break;
+		case IPPROTO_UDP:
+			udph = (struct udphdr *)(ipv6h + 1);
+			rc = sym_hash_fn(ntohl(saddr),
+					 ntohl(daddr),
+					 ntohs(udph->uh_sport) + seed,
+					 ntohs(udph->uh_dport) + seed);
+			break;
+		case IPPROTO_IPIP:
+			/* tunneling */
+			rc = decode_ip_n_hash((struct ip *)(ipv6h + 1),
+					      hash_split, seed);
+			break;
+		case IPPROTO_IPV6:
+			/* tunneling */
+			rc = decode_ipv6_n_hash((struct ip6_hdr *)(ipv6h + 1),
+						hash_split, seed);
+			break;
+		case IPPROTO_GRE:
+			rc = decode_gre_hash((uint8_t *)(ipv6h + 1), hash_split, seed);
+			break;
+		case IPPROTO_ICMP:
+		case IPPROTO_ESP:
+		case IPPROTO_PIM:
+		case IPPROTO_IGMP:
+		default:
+			/*
+			 ** the hash is weaker with only 2 fields, but it
+			 ** should still hold
+			 **/
+			rc = sym_hash_fn(ntohl(saddr),
+					 ntohl(daddr),
+					 ntohs(0xFFFD) + seed,
+					 ntohs(0xFFFE) + seed);
+		}
+	}
+	return rc;
+}
+/*---------------------------------------------------------------------*/
+/**
+ ** A temporary solution until hashes for other protocols are filled in...
+ ** (See decode_vlan_n_hash & pkt_hdr_hash functions).
+ **/
+static uint32_t
+decode_others_n_hash(struct ether_header *ethh, uint8_t seed)
+{
+	uint32_t saddr, daddr, rc;
+
+	saddr = (uint32_t)ethh->ether_shost[5] |
+		((uint32_t)ethh->ether_shost[4] << 8) |
+		((uint32_t)ethh->ether_shost[3] << 16) |
+		((uint32_t)ethh->ether_shost[2] << 24);
+	daddr = (uint32_t)ethh->ether_dhost[5] |
+		((uint32_t)ethh->ether_dhost[4] << 8) |
+		((uint32_t)ethh->ether_dhost[3] << 16) |
+		((uint32_t)ethh->ether_dhost[2] << 24);
+
+	rc = sym_hash_fn(ntohl(saddr),
+			 ntohl(daddr),
+			 ntohs(0xFFFD) + seed,
+			 ntohs(0xFFFE) + seed);
+
+	return rc;
+}
+/*---------------------------------------------------------------------*/
+/**
+ ** Parser + hash function for VLAN packet
+ **/
+static inline uint32_t
+decode_vlan_n_hash(struct ether_header *ethh, uint8_t hash_split, uint8_t seed)
+{
+	uint32_t rc = 0;
+	struct vlanhdr *vhdr = (struct vlanhdr *)(ethh + 1);
+
+	switch (ntohs(vhdr->proto)) {
+	case ETHERTYPE_IP:
+		rc = decode_ip_n_hash((struct ip *)(vhdr + 1),
+				      hash_split, seed);
+		break;
+	case ETHERTYPE_IPV6:
+		rc = decode_ipv6_n_hash((struct ip6_hdr *)(vhdr + 1),
+					hash_split, seed);
+		break;
+	case ETHERTYPE_ARP:
+	default:
+		/* others */
+		rc = decode_others_n_hash(ethh, seed);
+		break;
+	}
+	return rc;
+}
+
+/*---------------------------------------------------------------------*/
+/**
+ ** General parser + hash function...
+ **/
+uint32_t
+pkt_hdr_hash(const unsigned char *buffer, uint8_t hash_split, uint8_t seed)
+{
+	uint32_t rc = 0;
+	struct ether_header *ethh = (struct ether_header *)buffer;
+
+	switch (ntohs(ethh->ether_type)) {
+	case ETHERTYPE_IP:
+		rc = decode_ip_n_hash((struct ip *)(ethh + 1),
+				      hash_split, seed);
+		break;
+	case ETHERTYPE_IPV6:
+		rc = decode_ipv6_n_hash((struct ip6_hdr *)(ethh + 1),
+					hash_split, seed);
+		break;
+	case ETHERTYPE_VLAN:
+		rc = decode_vlan_n_hash(ethh, hash_split, seed);
+		break;
+	case ETHERTYPE_ARP:
+	default:
+		/* others */
+		rc = decode_others_n_hash(ethh, seed);
+		break;
+	}
+
+	return rc;
+}
+
+/*---------------------------------------------------------------------*/
+/**
+ ** Parser + hash function for the GRE packet
+ **/
+static uint32_t
+decode_gre_hash(const uint8_t *grehdr, uint8_t hash_split, uint8_t seed)
+{
+	uint32_t rc = 0;
+	/* C/R/K/S are the top four bits of the first byte; each field is 4 bytes */
+	int len = 4 + 4 * (!!(*grehdr & 0xC0) + /* Checksum + Offset (C or R) */
+			   !!(*grehdr & 0x20) + /* Key */
+			   !!(*grehdr & 0x10)); /* Sequence Number */
+	uint16_t proto = ntohs(*(uint16_t *)(void *)(grehdr + 2));
+
+	switch (proto) {
+	case ETHERTYPE_IP:
+		rc = decode_ip_n_hash((struct ip *)(grehdr + len),
+				      hash_split, seed);
+		break;
+	case ETHERTYPE_IPV6:
+		rc = decode_ipv6_n_hash((struct ip6_hdr *)(grehdr + len),
+					hash_split, seed);
+		break;
+	case 0x6558: /* Transparent Ethernet Bridging */
+		rc = pkt_hdr_hash(grehdr + len, hash_split, seed);
+		break;
+	default:
+		/* others */
+		break;
+	}
+	return rc;
+}
+/*---------------------------------------------------------------------*/
+

Property changes on: head/tools/tools/netmap/pkt_hash.c
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property
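
Why sym_hash_fn() above is symmetric can be shown with a short standalone
sketch. This is not the pkt_hash.c implementation: key_window(), toeplitz()
and sym_hash() are hypothetical names, the input byte order follows the usual
Toeplitz layout rather than the lookup-table order used above, and the
numeric results therefore differ from those of sym_hash_fn(). The point it
illustrates is the same: when the key repeats every 16 bits, a field shifted
by a multiple of 16 bits sees the same key windows, so swapping source and
destination does not change the hash.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit window of the repeating 16-bit key, starting at bit `pos`. */
static uint32_t
key_window(uint16_t key, unsigned pos)
{
	uint64_t stream = 0;
	int i;

	for (i = 0; i < 4; i++)		/* key repeated 4 times: 64 bits */
		stream = (stream << 16) | key;
	pos %= 16;			/* the key stream has period 16 */
	return (uint32_t)(stream >> (32 - pos));
}

/* Plain Toeplitz hash: XOR one key window per set input bit. */
static uint32_t
toeplitz(const uint8_t *data, unsigned len, uint16_t key)
{
	uint32_t h = 0;
	unsigned i, b;

	for (i = 0; i < len; i++)
		for (b = 0; b < 8; b++)
			if (data[i] & (0x80u >> b))
				h ^= key_window(key, 8 * i + b);
	return h;
}

/* Hash the 4-tuple; the key bytes 0x50, 0x6d are the ones used above. */
static uint32_t
sym_hash(uint32_t sip, uint32_t dip, uint16_t sp, uint16_t dp)
{
	uint8_t in[12] = {
		(uint8_t)(sip >> 24), (uint8_t)(sip >> 16),
		(uint8_t)(sip >> 8), (uint8_t)sip,
		(uint8_t)(dip >> 24), (uint8_t)(dip >> 16),
		(uint8_t)(dip >> 8), (uint8_t)dip,
		(uint8_t)(sp >> 8), (uint8_t)sp,
		(uint8_t)(dp >> 8), (uint8_t)dp
	};

	return toeplitz(in, sizeof(in), 0x506d);
}
```

Both directions of a flow, for example (10.0.0.1:1234 -> 10.0.0.2:80) and
(10.0.0.2:80 -> 10.0.0.1:1234), hash to the same value, so a load balancer
keyed on this hash sends them to the same output pipe.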
Index: head/tools/tools/netmap/pkt_hash.h
===================================================================
--- head/tools/tools/netmap/pkt_hash.h	(nonexistent)
+++ head/tools/tools/netmap/pkt_hash.h	(revision 340279)
@@ -0,0 +1,79 @@
+/*
+ ** Copyright (c) 2015, Asim Jamshed, Robin Sommer, Seth Hall
+ ** and the International Computer Science Institute. All rights reserved.
+ **
+ ** Redistribution and use in source and binary forms, with or without
+ ** modification, are permitted provided that the following conditions are met:
+ **
+ ** (1) Redistributions of source code must retain the above copyright
+ **     notice, this list of conditions and the following disclaimer.
+ **
+ ** (2) Redistributions in binary form must reproduce the above copyright
+ **     notice, this list of conditions and the following disclaimer in the
+ **     documentation and/or other materials provided with the distribution.
+ **
+ **
+ ** THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ ** AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ ** IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ ** ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ ** LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ ** CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ ** SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ ** INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ ** CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ ** ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ ** POSSIBILITY OF SUCH DAMAGE.
+ **/
+/* $FreeBSD$ */
+#ifndef LB_PKT_HASH_H
+#define LB_PKT_HASH_H
+/*---------------------------------------------------------------------*/
+/**
+ ** Packet header hashing function utility - This file declares functions
+ ** that parse the packet headers and compute hash values based on
+ ** the header fields. See pkt_hash.c for more details.
+ **/
+/*---------------------------------------------------------------------*/
+/* for type def'n */
+#include <stdint.h>
+/*---------------------------------------------------------------------*/
+#ifdef __GNUC__
+#define likely(x)       __builtin_expect(!!(x), 1)
+#define unlikely(x)     __builtin_expect(!!(x), 0)
+#else
+#define likely(x)       (x)
+#define unlikely(x)     (x)
+#endif
+
+#define HTONS(n) (((((unsigned short)(n) & 0xFF)) << 8) | \
+		  (((unsigned short)(n) & 0xFF00) >> 8))
+#define NTOHS(n) (((((unsigned short)(n) & 0xFF)) << 8) | \
+		  (((unsigned short)(n) & 0xFF00) >> 8))
+
+#define HTONL(n) (((((unsigned long)(n) & 0xFF)) << 24) | \
+        ((((unsigned long)(n) & 0xFF00)) << 8) | \
+        ((((unsigned long)(n) & 0xFF0000)) >> 8) | \
+		  ((((unsigned long)(n) & 0xFF000000)) >> 24))
+
+#define NTOHL(n) (((((unsigned long)(n) & 0xFF)) << 24) | \
+        ((((unsigned long)(n) & 0xFF00)) << 8) | \
+        ((((unsigned long)(n) & 0xFF0000)) >> 8) | \
+		  ((((unsigned long)(n) & 0xFF000000)) >> 24))
+/*---------------------------------------------------------------------*/
+typedef struct vlanhdr {
+	uint16_t pri_cfi_vlan;
+	uint16_t proto;
+} vlanhdr;
+/*---------------------------------------------------------------------*/
+/**
+ ** Analyzes the packet header and computes a corresponding
+ ** hash value.
+ **/
+uint32_t
+pkt_hdr_hash(const unsigned char *buffer,
+	     uint8_t hash_split,
+	     uint8_t seed);
+/*---------------------------------------------------------------------*/
+#endif /* LB_PKT_HASH_H */
+

Property changes on: head/tools/tools/netmap/pkt_hash.h
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+FreeBSD=%H
\ No newline at end of property
Added: svn:mime-type
## -0,0 +1 ##
+text/plain
\ No newline at end of property