
bhyve: add support for virtio-net mergeable rx buffers
Needs ReviewPublic

Authored by vmaffione on Jul 20 2019, 9:29 AM.

Details

Reviewers
markj
jhb
bryanv
pmooney_pfmooney.com
Group Reviewers
bhyve
Summary

Mergeable rx buffers is a virtio-net feature that allows the hypervisor to use multiple RX descriptor chains to receive a single packet. Without this feature, a TSO-enabled guest is compelled to publish only 64K (or 32K) long chains, and each of these large buffers is consumed to receive a single packet, even a very short one. This wastes memory, as an RX queue has room for 256 chains, which means up to 16MB of buffer memory for each (single-queue) vtnet device.
With the feature on, the guest can publish 2K long chains, and the hypervisor can merge them as needed.

This change also enables the feature in the netmap backend, which supports virtio-net offloads.
The plan is to add support to the tap backend too.
Note that, unlike QEMU/KVM, we implement single-copy receive here, while QEMU uses two copies.

This patch depends on https://reviews.freebsd.org/D20987

Test Plan

Two VMs connected on the same VALE switch. Debug kernel (GENERIC).
I ran netperf TCP_MAERTS and TCP_STREAM.
With mergeable RX buffers on: ~7.5 Gbps
Without mergeable RX buffers: ~6.5 Gbps

More testing appreciated (maybe GENERIC-NODEBUG).

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

vmaffione created this revision.Jul 20 2019, 9:29 AM
vmaffione retitled this revision from bhyve: add support virtio-net mergeable rx buffers to bhyve: add support for virtio-net mergeable rx buffers.Jul 20 2019, 9:29 AM
aleksandr.fedorov_itglobal.com added inline comments.
usr.sbin/bhyve/pci_virtio_net.c
228

I caught this assert(n == 0) in my tests (two Ubuntu 16.04 VMs plus a VALE switch). It seems there is nothing to prevent vq_getchain() from returning 0.

vmaffione marked an inline comment as done.Jul 22 2019, 7:30 PM
vmaffione added inline comments.
usr.sbin/bhyve/pci_virtio_net.c
228

Thank you, this helps. I forgot to check that the chains after the first one are indeed available.
The issue should be fixed now. Could you please check what happens with your testbed?

I tested the updated patch with iperf3 in various combinations:

  1. vm (ubuntu 16.04) - vale - vm (ubuntu 16.04)
  2. vm (freebsd 13) - vale - vm (ubuntu 16.04)
  3. vm (ubuntu 16.04) - vale - host(if_epair)
  4. vm (freebsd 13) - vale - host(if_epair)

I didn't find any problems.

vmaffione marked an inline comment as done.Jul 23 2019, 4:27 PM

Thanks. Did you notice any change in terms of performance?

Sorry, I didn't compare the performance. I ran my tests on a machine loaded with other tasks. The throughput between two Ubuntu 16.04 VMs floats between 16 and 18 Gbit/s, sometimes increasing up to 28 Gbit/s. FreeBSD to FreeBSD is ~7-8 Gbit/s. But as I said, the host machine was loaded with other tasks. The machine also has two processors, and I clearly observed NUMA effects. I will try to compare the performance on a dedicated test server tomorrow.

I tried to compare performance on a dedicated server.
Host: Single-processor Xeon E5-2630 v4 @ 2.20GHz, 128 GB RAM, FreeBSD latest CURRENT.

iperf3 tests.

Before patching:

  • VM (Ubuntu 16.04) - vale - VM (Ubuntu 16.04): ~21.9 Gbit/s
  • VM (FreeBSD CURRENT) - vale - VM (FreeBSD CURRENT): ~6.0 Gbit/s
  • VM (FreeBSD 12R) - vale - VM (FreeBSD 12R): ~11.2 Gbit/s

With mergeable buffers:

  • VM (Ubuntu 16.04) - vale - VM (Ubuntu 16.04): ~27.3 Gbit/s
  • VM (FreeBSD CURRENT) - vale - VM (FreeBSD CURRENT): ~6.3 Gbit/s
  • VM (FreeBSD 12R) - vale - VM (FreeBSD 12R): ~12.2 Gbit/s

So, for Ubuntu VM there is a clear increase in throughput.
It seems that the FreeBSD VMs reach a limit before the difference becomes noticeable.

Thanks a lot for your effort!
To be honest I'm not sure why the throughput increases so much, since TSO (64KB unchecksummed packets) is used in both cases.
The main difference is that with mergeable rx buffers there is less pressure on the guest memory allocators, since the driver can allocate 2K clusters rather than much larger buffers.
Also, with mergeable rx buffers bhyve may do a little more work, because it needs to call vq_getchain() 33 times to receive each (64K) packet.
In any case, your results look very good to me, and they also agree with mine (taken with a smaller and less powerful machine).

Any opinions on this change?