
[bhyve][virtio-net] Allow guest VMs to set a jumbo MTU when using a VALE switch.
Abandoned · Public

Authored by aleksandr.fedorov_itglobal.com on May 16 2019, 11:23 AM.

Details

Reviewers
jhb
rgrimes
krion
vmaffione
Group Reviewers
bhyve
Summary

The current implementation of the virtio-net backend does not allow transferring packets larger than the netmap(4) buffer size (2048 bytes), or larger than the maximum guest descriptor size in the tap(4) case. The reason is that there is no support for mergeable buffers (VIRTIO_NET_F_MRG_RXBUF in the virtio specification). See PR 215737. This significantly limits TCP throughput.

This patch adds support for mergeable buffers using netmap's ability to chain its own buffers (see NS_MOREFRAG in netmap(4)). The same approach is used by QEMU (virtio-net with the netmap backend).

We see a significant increase in throughput, both for transfers between VMs on the same host and between VMs on different hosts. See the tests below.
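To illustrate the chaining idea the patch relies on, here is a minimal, self-contained sketch of how a packet larger than one buffer can be split across fixed-size slots using a "more fragments" flag, in the spirit of netmap's NS_MOREFRAG. This is not bhyve's actual code: `split_packet`, `gather_packet`, `struct slot`, and the `MOREFRAG` constant are simplified stand-ins invented for the example.

```c
#include <assert.h>

#define BUF_SIZE 2048          /* netmap(4) default buffer size */
#define MOREFRAG 0x1           /* stand-in for the NS_MOREFRAG slot flag */

struct slot {                  /* simplified stand-in for struct netmap_slot */
    unsigned len;
    unsigned flags;
};

/* Split a packet of pkt_len bytes across fixed-size slots, setting the
 * "more fragments" flag on every slot except the last one.  Returns the
 * number of slots used (0 if the packet does not fit). */
static int split_packet(struct slot *slots, int nslots, unsigned pkt_len)
{
    int i = 0;
    while (pkt_len > 0 && i < nslots) {
        unsigned chunk = pkt_len > BUF_SIZE ? BUF_SIZE : pkt_len;
        slots[i].len = chunk;
        slots[i].flags = (pkt_len > BUF_SIZE) ? MOREFRAG : 0;
        pkt_len -= chunk;
        i++;
    }
    return pkt_len == 0 ? i : 0;
}

/* Receive side: walk slots until one without MOREFRAG, summing lengths. */
static unsigned gather_packet(const struct slot *slots, int nslots, int *used)
{
    unsigned total = 0;
    int i;
    for (i = 0; i < nslots; i++) {
        total += slots[i].len;
        if ((slots[i].flags & MOREFRAG) == 0)
            break;
    }
    *used = i + 1;
    return total;
}
```

With a 9014-byte frame (MTU 9000 plus the Ethernet header) this yields five 2048-byte-bounded fragments, which the receiver reassembles by following the flag chain.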

Test Plan

1. 'iperf3 -c X.X.X.X -t 60 -R' between two Ubuntu 16.04 VMs through a VALE switch.

MTU 1500. VM - virtio-net - VALE - virtio-net - VM:

[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-60.04 sec 28.6 GBytes 4.09 Gbits/sec 3 sender
[ 5] 0.00-60.04 sec 28.6 GBytes 4.09 Gbits/sec receiver

MTU 3000. VM - virtio-net - VALE - virtio-net - VM:

[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-60.04 sec 51.8 GBytes 7.42 Gbits/sec 651 sender
[ 5] 0.00-60.04 sec 51.8 GBytes 7.42 Gbits/sec receiver

MTU 9000. VM - virtio-net - VALE - virtio-net - VM:

[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-60.04 sec 99.7 GBytes 14.3 Gbits/sec 100 sender
[ 5] 0.00-60.04 sec 99.7 GBytes 14.3 Gbits/sec receiver

MTU 16000. VM - virtio-net - VALE - virtio-net - VM:

[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-60.04 sec 122 GBytes 17.5 Gbits/sec 0 sender
[ 5] 0.00-60.04 sec 122 GBytes 17.5 Gbits/sec receiver

MTU 32000. VM - virtio-net - VALE - virtio-net - VM:

[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-60.04 sec 152 GBytes 21.7 Gbits/sec 64 sender
[ 5] 0.00-60.04 sec 152 GBytes 21.7 Gbits/sec receiver

MTU 64000. VM - virtio-net - VALE - virtio-net - VM:

[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-60.04 sec 220 GBytes 31.4 Gbits/sec 60 sender
[ 5] 0.00-60.04 sec 220 GBytes 31.4 Gbits/sec receiver

2. 'iperf3 -c X.X.X.X -t 60 -R' between two FreeBSD 12-RELEASE VMs through a VALE switch.

MTU 1500. VM - virtio-net - VALE - virtio-net - VM: 1.30 Gbits/sec
MTU 3000. VM - virtio-net - VALE - virtio-net - VM: 2.14 Gbits/sec
MTU 9000. VM - virtio-net - VALE - virtio-net - VM: 4.80 Gbits/sec
MTU 16000. VM - virtio-net - VALE - virtio-net - VM: 7.25 Gbits/sec
MTU 32000. VM - virtio-net - VALE - virtio-net - VM: 12.8 Gbits/sec
MTU 64000. VM - virtio-net - VALE - virtio-net - VM: 13.3 Gbits/sec


Event Timeline

novel added a subscriber: novel.May 16 2019, 1:06 PM
vmaffione added inline comments.May 16 2019, 8:14 PM
usr.sbin/bhyve/pci_virtio_net.c
405

nmd->cur_tx_ring

Why did you drop iterating over all the possible TX rings?

usr.sbin/bhyve/pci_virtio_net.c
405

I dropped it to simplify the code, under the assumption that the current setup code

void
pci_vtnet_netmap_setup(struct pci_vtnet_softc *sc, char *ifname)
{
...
    sc->vsc_nmd = nm_open(ifname, NULL, 0, 0);
...
}

always sets the number of TX/RX rings to one, and there is no way to change it using bhyve(8).

In the future, I want to replace the legacy API, add the ability to specify not only a VALE switch but also other types of netmap ports, and restore the iteration over the rings.

Or I can restore this code now. What do you think?

mizhka added a subscriber: mizhka.May 17 2019, 9:53 AM
vmaffione added inline comments.May 17 2019, 7:28 PM
usr.sbin/bhyve/pci_virtio_net.c
405

That's fine, but you still need to replace cur_rx_ring with cur_tx_ring, for consistency.

Fix incorrect usage of 'cur_rx_ring' on TX path.

aleksandr.fedorov_itglobal.com marked an inline comment as done.May 20 2019, 7:25 AM
usr.sbin/bhyve/pci_virtio_net.c
405

Vincenzo, do you have any other suggestion about netmap API usage?

Netmap usage by itself looks mostly ok, except where noted. We could improve the management of r->cur in certain cases, but it's not particularly important for now (for reference, please look at my QEMU implementation, for instance the transmit routine is here https://github.com/qemu/qemu/blob/master/net/netmap.c#L160-L225).
I still need to review your changes to the virtqueue processing. But why did you drop vq_getchain() and write a custom method? If another method is needed, it should be added to virtio.c IMHO.

usr.sbin/bhyve/pci_virtio_net.c
466–476

r->head is better than r->cur, although there is no difference right now (see comment below).

494

In theory this NIOCRXSYNC is not needed, because poll() or kqueue_wait() calls NIOCRXSYNC internally. This works perfectly with poll(), at least. As far as I know bhyve uses kqueue to wait on the netmap file descriptor. What happens if you remove this ioctl()?

547

You should use r->head to get the next slot to use. r->cur is for synchronization and points at the next unseen slot, which may be ahead of r->head, although not in your code.
This may cause problems in the future.

562

r->head rather than r->cur

Vincenzo, thanks for the review.
The main motivation for writing the custom method is optimization. vq_getchain() returns virtio descriptors that are chained using the VRING_DESC_F_NEXT flag, but virtio-net guest drivers do not present the available descriptors as a chain, so vq_getchain() returns only one descriptor per call. This function also has a side effect: it increments the vq->vq_last_avail index, which must be decremented if there are not enough descriptors to store the RX packet.
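The rollback issue described above can be sketched in isolation. This is a toy model, not bhyve's virtqueue code: `struct vq`, `vq_avail`, and `reserve_bufs` are invented names, and each "buffer" is a single descriptor. The point is that a consumer which advances last_avail per descriptor must undo the advance when the packet cannot be fully placed.

```c
#include <assert.h>

/* Minimal stand-in for the fields of a virtqueue relevant here. */
struct vq {
    unsigned last_avail;   /* next available-ring index to consume */
    unsigned avail_idx;    /* index published by the guest driver */
};

/* Number of single-descriptor buffers the guest has made available. */
static unsigned vq_avail(const struct vq *vq)
{
    return vq->avail_idx - vq->last_avail;
}

/* Try to reserve 'need' buffers for one RX packet.  Each successful
 * reservation advances last_avail (mirroring the side effect discussed
 * above); if the queue runs short, roll last_avail back so the
 * partially consumed descriptors are returned to the queue. */
static int reserve_bufs(struct vq *vq, unsigned need)
{
    unsigned got = 0;
    while (got < need && vq_avail(vq) > 0) {
        vq->last_avail++;      /* per-descriptor side effect */
        got++;
    }
    if (got < need) {
        vq->last_avail -= got; /* undo: not enough room for the packet */
        return 0;
    }
    return 1;
}
```

Without the rollback, a packet that needs five buffers while only three are available would silently consume those three descriptors and desynchronize the ring.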

Change r->cur to r->head in RX path

aleksandr.fedorov_itglobal.com marked 3 inline comments as done.May 23 2019, 2:35 PM
usr.sbin/bhyve/pci_virtio_net.c
494

I agree, it seems there is no need to call ioctl(..., NIOCRXSYNC, ...). I also observed some throughput increase (~300-500 Mbit/s at MTU 1500) when the RX packet occupied one netmap buffer, which is explained by the extra syscall being gone.

But I need some time to investigate how this affects performance with a large MTU.

Yes, but I think the story is more complex than that. Guests can publish single-descriptor chains or multi-descriptor chains, depending on feature negotiation and driver implementation. With mergeable buffers enabled, or TSO features disabled, guests will normally submit single-descriptor chains (because it makes sense), but this is not mandatory. With TSO enabled and mergeable buffers disabled, guests will normally pass in multi-descriptor chains, each one describing a 64K or 32K buffer.
It makes sense to add more vq methods to handle the mergeable RX bufs case, but I think they should be moved to usr.sbin/bhyve/virtio.c so that they can be reused.

usr.sbin/bhyve/pci_virtio_net.c
522

Why this check? If there are not enough descriptors to cover 'len' bytes, the function will return 0 anyway.

523

You are mixing declarations and code here. Is it allowed?

637

I think you should call vq_inc_used_idx_and_last_avail() after each packet, otherwise you are introducing artificial latency. As a side effect, this allows you to remove the "start" argument and the "used" local variable.

640

Why this return?

Vincenzo, you are right. I'm trying to rewrite the code to handle the various negotiated features, but I need some more time to test it.

  • Reuse vq_getchain() to handle various negotiated features (TSO, MRG_RXBUF, INDIRECT descriptors).
  • Add two helper functions, vq_get_mrgrx_bufs() and vq_relchain_mrgrx(), and move them to virtio.[ch].

Vincenzo, what do you think about this approach?

aleksandr.fedorov_itglobal.com marked 2 inline comments as done.Jun 7 2019, 8:21 AM

Rename vq_get_mrgrx_bufs() to vq_getbufs_mrgrx() and vq_relchain_mrgrx() to vq_relbufs_mrgrx() to match the overall naming style.

This looks better, but the problem is that I was refactoring this file to separate the virtio-specific code from the backends (netmap, tap) and add support for offloads (including TSO); this patch creates many conflicts with my pending work, and IMHO it should be rebased on top of my work to prevent a total mess.

usr.sbin/bhyve/pci_virtio_net.c
547

Why are you using this temporary array instead of writing directly to the used ring?
Used entries are only exposed to the guest when vu_idx is incremented, so you would not have race conditions.
This would also greatly simplify your vq_relbufs_mrgrx() below.
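The publication pattern behind this suggestion can be shown with a toy used ring. This is an illustrative sketch, not virtio.c code: `stage_used` and `publish_used` are invented names, and memory barriers are reduced to a comment. Entries are written in place ahead of the index; the guest only observes them once the index advances, so no temporary array is needed.

```c
#include <assert.h>

#define QSZ 8                 /* ring size, power of two */

struct used_elem { unsigned id, len; };

struct used_ring {
    struct used_elem ring[QSZ];
    unsigned idx;             /* the only index the guest polls */
};

/* Write a used entry 'slot' positions ahead of the published index.
 * The guest cannot see it yet, so there is no race condition. */
static void stage_used(struct used_ring *u, unsigned slot,
                       unsigned id, unsigned len)
{
    u->ring[(u->idx + slot) % QSZ].id = id;
    u->ring[(u->idx + slot) % QSZ].len = len;
}

/* Publish n staged entries at once by bumping the index the guest polls. */
static void publish_used(struct used_ring *u, unsigned n)
{
    /* In real code a write barrier would go here, so the staged entries
     * become globally visible before the index update. */
    u->idx += n;
}
```

Staging directly into the ring and publishing with a single index bump keeps the entries invisible until they are complete, which is the property the reviewer is pointing at.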

usr.sbin/bhyve/virtio.c
412

The name of the 'iov_len' variable is really confusing, since struct iovec's iov_len refers to the length of a single slot, while your variable refers to the sum of the lengths of possibly many slots.
It's better to use something like 'chain_len'.