bhyve: add support for virtio-net mergeable rx buffers
ClosedPublic

Authored by vmaffione on Jul 20 2019, 9:29 AM.

Details

Summary

Mergeable rx buffers is a virtio-net feature that allows the hypervisor to use multiple RX descriptor chains to receive a single packet. Without this feature, a TSO-enabled guest is compelled to publish only 64K (or 32K) long chains, and each of these large buffers is consumed to receive a single packet, even a very short one. This wastes memory: an RX queue has room for 256 chains, which means up to 16MB of buffer memory for each (single-queue) vtnet device.
With the feature on, the guest can publish 2K long chains, and the hypervisor can merge them as needed.
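
The 16MB figure above follows from simple multiplication, and the same arithmetic shows what 2K chains would cost. The helper below is only an illustrative sketch; `rx_ring_bytes` is a hypothetical name and is not part of bhyve or the vtnet driver.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Upper bound on the buffer memory a guest publishes on one RX queue:
 * the number of chains the queue can hold times the length of each chain.
 * Hypothetical helper, for illustration only.
 */
static size_t
rx_ring_bytes(size_t nchains, size_t chain_len)
{
	assert(nchains > 0 && chain_len > 0);
	return (nchains * chain_len);
}
```

With 256 chains, `rx_ring_bytes(256, 64 * 1024)` gives 16MB, while `rx_ring_bytes(256, 2048)` gives 512KB.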

This change also enables the feature in the netmap backend, which supports virtio-net offloads.
The plan is to add support to the tap backend too.
Note that, unlike QEMU/KVM, which uses two copies, here we implement one-copy receive.

This patch depends on https://reviews.freebsd.org/D20987

Test Plan

Two VMs connected on the same VALE switch. Debug kernel (GENERIC).
I ran netperf TCP_MAERTS and TCP_STREAM.
With mergeable RX buffers on: ~7.5 Gbps
Without mergeable RX buffers: ~6.5 Gbps

More testing appreciated (maybe GENERIC-NODEBUG).

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 25460

Event Timeline

vmaffione retitled this revision from "bhyve: add support virtio-net mergeable rx buffers" to "bhyve: add support for virtio-net mergeable rx buffers". Jul 20 2019, 9:29 AM
afedorov added inline comments.
usr.sbin/bhyve/pci_virtio_net.c
257

I hit this assert(n==0) in my tests (two Ubuntu 16.04 VMs + VALE switch). It seems there is nothing to prevent vq_getchain() from returning 0.

Fix issue identified by @aleksandr.fedorov_itglobal.com

vmaffione added inline comments.
usr.sbin/bhyve/pci_virtio_net.c
257

Thank you, this helps. I forgot to check that chains after the first one are indeed available.
Now the issue should be fixed. Could you please check what happens with your testbed now?
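
The kind of check described above (make sure every chain needed for a packet is actually available before committing to the receive) can be sketched with a toy queue. This is not the code from the patch: `struct toy_queue` and `toy_reserve` are hypothetical stand-ins, and the real implementation works on virtqueues through vq_getchain().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an RX virtqueue: a list of chain lengths. */
struct toy_queue {
	const size_t *chain_len;	/* length of each published chain */
	size_t nchains;			/* number of published chains */
	size_t next;			/* index of the next chain to consume */
};

/*
 * Try to reserve enough chains to hold pkt_len bytes.
 * Returns the number of chains consumed. If the queue runs out of
 * chains before the packet fits, nothing is consumed and 0 is returned,
 * so the caller can drop or defer the packet instead of asserting.
 */
static size_t
toy_reserve(struct toy_queue *q, size_t pkt_len)
{
	size_t start = q->next;
	size_t got = 0;

	assert(pkt_len > 0);
	while (got < pkt_len) {
		if (q->next == q->nchains) {
			/* A later chain is missing: roll back. */
			q->next = start;
			return (0);
		}
		got += q->chain_len[q->next++];
	}
	return (q->next - start);
}
```

The point of the sketch is the rollback: the first chain being available says nothing about the second or third, so a shortfall has to be handled as a normal condition.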

I tested the updated patch with iperf3 in various combinations:

  1. vm (ubuntu 16.04) - vale - vm (ubuntu 16.04)
  2. vm (freebsd 13) - vale - vm (ubuntu 16.04)
  3. vm (ubuntu 16.04) - vale - host(if_epair)
  4. vm (freebsd 13) - vale - host(if_epair)

And I didn't find any problems.

Thanks. Did you notice any change in terms of performance?

Sorry, but I didn't compare the performance. I ran my tests on a machine loaded with other tasks. The throughput between two Ubuntu 16.04 VMs fluctuates between 16 and 18 Gbit/s, sometimes rising to 28 Gbit/s. FreeBSD to FreeBSD is ~7-8 Gbit/s. But as I said, the host machine was loaded with other tasks. This machine also has two processors, and I clearly observed NUMA effects. I will try to compare the performance on a separate test server tomorrow.

I tried to compare performance on a dedicated server.
Host: Single-processor Xeon E5-2630 v4 @ 2.20GHz, 128 GB RAM, FreeBSD latest CURRENT.

iperf3 tests.

Before patching:

  • VM (Ubuntu 16.04) - vale - VM (Ubuntu 16.04) ~21.9 Gbit/s
  • VM (FreeBSD CURRENT) - vale - VM (FreeBSD CURRENT) ~6.0 Gbit/s
  • VM (FreeBSD 12R) - vale - VM (FreeBSD 12R) ~11.2 Gbit/s

With mergeable buffers:

  • VM (Ubuntu 16.04) - vale - VM (Ubuntu 16.04) ~27.3 Gbit/s
  • VM (FreeBSD CURRENT) - vale - VM (FreeBSD CURRENT) ~6.3 Gbit/s
  • VM (FreeBSD 12R) - vale - VM (FreeBSD 12R) ~12.2 Gbit/s

So, for the Ubuntu VMs there is a clear increase in throughput.
It seems that the FreeBSD VMs hit some other limit before the difference becomes noticeable.

Thanks a lot for your effort!
To be honest, I'm not sure why the throughput increases so much, since TSO (64KB unchecksummed packets) is being used in both cases.
The main difference is that with mergeable rx buffers there is less pressure on the guest memory allocators, since the driver can allocate 2K clusters rather than much larger buffers.
Also, with mergeable rx buffers bhyve may do a little more work, because it needs to call vq_getchain() 33 times in order to receive each packet.
In any case, your results look very good to me, and they also agree with mine (taken with a smaller and less powerful machine).
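
The count of 33 calls mentioned above is consistent with a ceiling division: a 64KB payload alone fills exactly 32 buffers of 2K, and the virtio-net header pushes it into a 33rd. The sketch below assumes a 12-byte header (the size of the virtio-net header that carries the num_buffers field); that figure, like the helper name `chains_needed`, is an assumption made for illustration and is not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of fixed-size RX buffers needed to hold one packet plus its
 * virtio-net header, rounded up. Hypothetical helper, for illustration.
 */
static size_t
chains_needed(size_t payload_len, size_t hdr_len, size_t buf_len)
{
	assert(buf_len > 0);
	return ((payload_len + hdr_len + buf_len - 1) / buf_len);
}
```

Under these assumptions, `chains_needed(65536, 12, 2048)` yields 33, whereas a short packet such as `chains_needed(60, 12, 2048)` needs a single 2K buffer instead of a whole 64K chain.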

This looks ok to me generally. Do you need to refresh this after other recent commits?

Thanks for looking at this.
No, the patch is ready as is (I just retested everything again).

This revision is now accepted and ready to land. Nov 8 2019, 5:24 PM