Page MenuHomeFreeBSD

bhyve: move virtio-net header processing to pci_virtio_net
ClosedPublic

Authored by vmaffione on Jan 23 2020, 10:04 PM.

Details

Summary

This patch cleans up the API between the net frontends (e1000, virtio-net) and the net backends (tap and netmap).
We move the virtio-net header stripping/prepending to the virtio-net code, where this functionality belongs.
In this way, the netbe_send() and netbe_recv() signatures can have const struct iov * rather than struct iov *.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Minor man page fix.

usr.sbin/bhyve/bhyve.8
588 ↗(On Diff #67230)

A line break is required after a sentence stop.

usr.sbin/bhyve/pci_virtio_net.c
275 ↗(On Diff #67230)

I tried to test this patch. And I found a bug.
With a tap backend, the netbe_recv() function calls iov_trim() which changes the value of iov[0].iov_base (It increases the address by the size of the struct virtio_net_rxhdr).
https://svnweb.freebsd.org/base/head/usr.sbin/bhyve/net_backends.c?view=markup#l819
https://svnweb.freebsd.org/base/head/usr.sbin/bhyve/net_backends.c?view=markup#l766

So, hdr points to an incorrect location.

If I move the 'hdr' declaration before calling the netbe_recv () function, then everything works correctly:
ping -f -s 8000 and iperf3 with MTU == 9000.

usr.sbin/bhyve/pci_virtio_net.c
254 ↗(On Diff #67230)

We need to initialize 'struct virtio_net_rxhdr *hdr = iov[0].iov_base;' before calling netbe_recv() which can change value of iov[0].iov_base.

usr.sbin/bhyve/pci_virtio_net.c
254 ↗(On Diff #67230)

You're right, thank you!

275 ↗(On Diff #67230)

Thank you for the suggestion. I will try to take a different approach to avoid visible changes to the iov

vmaffione marked 3 inline comments as done.

I refactored the code so that iov trimming is done by the virtio-net code, where this operation belongs.
As a nice result, now we can pass struct iovec arrays as const to the backend code.

Tested several combinations (vm-2-host, vm-2-vm, tap, vale, mergeable rx buffers on and off).
Everything seems to work correctly.

The changes looks good to me.

Also, I tested Win10, CentOS7, Ubuntu 16 and18 in our test lab - everything works correctly.

In D23342#512819, @aleksandr.fedorov_itglobal.com wrote:

The changes looks good to me.

Also, I tested Win10, CentOS7, Ubuntu 16 and18 in our test lab - everything works correctly.

Thanks for the tests!

@vmaffione Is there a reason we need an options for this or at the very least a reason it shouldn't default to on with an option to disable it. I would consider 9000 mtu not working to be the POLA not other way around?

I'd suggest going even further by not making it an option and fixing it as always on. There doesn't seem to be any reason to have it disabled.

Is "mtu 9000" really not working? I tried on head (without this patch), by setting the mtu to 9000 for both if_vtnet guest interfaces and host interfaces, and I'm able to ping -s 8900 x.y.z.w. So what exactly is not working?

Regarding the option... Actually, at the moment being there is a performance reason to keep it disabled for tap and enabled for netmap.
In the current implementation of mergeable rx buffers, for each packet to be received bhyve will collect ~32 guest buffers (calling vq_getchain ~32 times), so that it can handle a 64KB input packet (either TSO or large MTU). In the default case of vtnet+tap without jumbo frames, this is a huge waste of time (since 1 guest buffer is always enough, and the other 31 are returned to the vq), and it slows down the guest considerably.
The problem is that we keep calling vq_getchain() again and again on the same guest buffers, and vq_getchain() is kind of expensive.. to solve this issue, we can avoid returning unused guest buffers to the vq, and reuse those for the next call. That's something I would like to implement. The alternative solution (the same used by QEMU) is to read the packet in a local bounce buffer (64KB), so that we figure out its actual length and then we call vq_getchain() the corrent number of times (e.g. just once for 1500 bytes packets). The cost for this solution is an additional packet copy, which is not good imho.

If you don't really like the idea of adding an option that may become superfluos in the future, we could postpone the tap+mergeable-rx-buffer support (until we get a performant solution) and just commit the refactoring work (most of this diff, actually).

The alternative solution (the same used by QEMU) is to read the packet in a local bounce buffer (64KB), so that we figure out its actual length and then we call vq_getchain() the corrent number of times (e.g. just once for 1500 bytes packets). The cost for this solution is an additional packet copy, which is not good imho.

Tap has enough overhead already that I suspect using a bounce-buffer would be noise (and I've done that in a proprietary implementation).

Tap needs to be fixed anyways e.g. by providing the length of the first available buffer in kqueue data, and providing the length of the following buffer if any in msgdata.

@vmaffione Don't forget to add -D flag to your ping testing otherwise ping is fragmented and reassembled.

So yes jumbo is broken with tap and gives us a theoretical 2-4x hit in performance in some case the questions is whether merging is a lower hit then that. Bhyve networking speed has been one of the biggest things holding us back on bhyve at iX.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=215737

The alternative solution (the same used by QEMU) is to read the packet in a local bounce buffer (64KB), so that we figure out its actual length and then we call vq_getchain() the corrent number of times (e.g. just once for 1500 bytes packets). The cost for this solution is an additional packet copy, which is not good imho.

Tap has enough overhead already that I suspect using a bounce-buffer would be noise (and I've done that in a proprietary implementation).

Tap needs to be fixed anyways e.g. by providing the length of the first available buffer in kqueue data, and providing the length of the following buffer if any in msgdata.

That is also a viable solution. Ok, so I will remove the options-related changes and keep the refactoring (move the header adaptation code to pci_virtio_net.c). Then I will keep working on tap in the following patches.

@vmaffione Don't forget to add -D flag to your ping testing otherwise ping is fragmented and reassembled.

Thanks for the suggestion. I retried with -D (on head), and it still works. I cannot reproduce the broken jumbo-frames+tap...

So yes jumbo is broken with tap and gives us a theoretical 2-4x hit in performance in some case the questions is whether merging is a lower hit then that. Bhyve networking speed has been one of the biggest things holding us back on bhyve at iX.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=215737

I think we can get much better with a few changes.

vmaffione retitled this revision from bhyve: add option to enable mergeable rx buffers to bhyve: move virtio-net header processing to pci_virtio_net.
vmaffione edited the summary of this revision. (Show Details)
vmaffione edited the test plan for this revision. (Show Details)

Removed the mergeable rx buffers changes. Keep the refactoring.

Thanks for the suggestion. I retried with -D (on head), and it still works. I cannot reproduce the broken jumbo-frames+tap...

It is very strange. In my tests, the version from the HEAD doesn't work:

Guest: Ubuntu 16.04 mtu 9000 192.168.1.4
Host: ifconfig tap1001 192.168.1.55/24 mtu 9000 up

Ping from host doesn't work.
ping -s 8000 -D 192.168.1.4

I tried to debug this situation, and saw that the guest descriptors are 1536 bytes in size. Therefore, I do not understand how it could work with MTU == 9000.

I think the previous version of the patch is more adequate. And there was no POLA violation because with tap backend and MTU > 1500, it didn't work.

In D23342#514125, @aleksandr.fedorov_itglobal.com wrote:

Thanks for the suggestion. I retried with -D (on head), and it still works. I cannot reproduce the broken jumbo-frames+tap...

It is very strange. In my tests, the version from the HEAD doesn't work:

Guest: Ubuntu 16.04 mtu 9000 192.168.1.4
Host: ifconfig tap1001 192.168.1.55/24 mtu 9000 up

Ping from host doesn't work.
ping -s 8000 -D 192.168.1.4

Have you set mtu 9000 on the vtnet0 in the guest?
My tests were VM-2-VM, both freebsd guests.

I tried to debug this situation, and saw that the guest descriptors are 1536 bytes in size. Therefore, I do not understand how it could work with MTU == 9000.

I think the previous version of the patch is more adequate. And there was no POLA violation because with tap backend and MTU > 1500, it didn't work.

Yeah, but with mergeable buffers on, tap throughput dropped to 50%.
I'm not giving up on adding support for mergeable buffers. I just think it makes sense to split the patch into more changes.

Have you set mtu 9000 on the vtnet0 in the guest?
My tests were VM-2-VM, both freebsd guests.

Yes.
ifconfig from the guest (Ubuntu 16.04):

root@ubuntu:~# ifconfig
enp0s2    Link encap:Ethernet  HWaddr 00:a0:98:cb:89:bb
          inet addr:192.168.1.4  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2a0:98ff:fecb:89bb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:150 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8066 (8.0 KB)  TX bytes:48636 (48.6 KB)

Ping from the guest:

root@ubuntu:~# ping 192.168.1.77
PING 192.168.1.77 (192.168.1.77) 56(84) bytes of data.
64 bytes from 192.168.1.77: icmp_seq=1 ttl=64 time=1.05 ms
64 bytes from 192.168.1.77: icmp_seq=2 ttl=64 time=0.197 ms

--- 192.168.1.77 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.197/0.627/1.057/0.430 ms
^Croot@ubuntu:~#
^Croot@ubuntu:~# ping -s 8000 192.168.1.77
PING 192.168.1.77 (192.168.1.77) 8000(8028) bytes of data.
^C
--- 192.168.1.77 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3019ms

root@ubuntu:~#

I tried to debug this situation, and saw that the guest descriptors are 1536 bytes in size. Therefore, I do not understand how it could work with MTU == 9000.

I think the previous version of the patch is more adequate. And there was no POLA violation because with tap backend and MTU > 1500, it didn't work.

Yeah, but with mergeable buffers on, tap throughput dropped to 50%.
I'm not giving up on adding support for mergeable buffers. I just think it makes sense to split the patch into more changes.

Host:

root@mothership:/afedorov/vm # ifconfig tap1003
tap1003: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=80000<LINKSTATE>
        ether 58:9c:fc:10:cf:64
        inet 192.168.1.77 netmask 0xffffff00 broadcast 192.168.1.255
        groups: tap
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        Opened by PID 89835
root@mothership:/afedorov/vm #

Ping from host:

root@mothership:/afedorov/vm # ping 192.168.1.4
PING 192.168.1.4 (192.168.1.4): 56 data bytes
64 bytes from 192.168.1.4: icmp_seq=0 ttl=64 time=0.156 ms
64 bytes from 192.168.1.4: icmp_seq=1 ttl=64 time=0.155 ms
^C
--- 192.168.1.4 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.155/0.156/0.156/0.001 ms
root@mothership:/afedorov/vm #
root@mothership:/afedorov/vm # ping -s 8000 192.168.1.4
PING 192.168.1.4 (192.168.1.4): 8000 data bytes
^C
--- 192.168.1.4 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
root@mothership:/afedorov/vm #

bhyve:

bhyve \
        -A -P -H \
        -s 0:0,hostbridge \
        -s 1:0,lpc \
        -s 2:0,virtio-net,tap1003 \
        -s 3:0,virtio-blk,./ubuntu-16.04.img \
        -l com1,stdio \
        -c 4\
        -m 2048M \
        ubuntu-16.04

Maybe of course I'm doing something wrong.

Yeah, but with mergeable buffers on, tap throughput dropped to 50%.
I'm not giving up on adding support for mergeable buffers. I just think it makes sense to split the patch into more changes.

Oh, then it's good. +1 for splitting patch.

Hah, I checked fFreeBSD 12 guest and it works. It seems that, the FreeBSD guest driver allocates descriptors of size >= MTU.

In D23342#514144, @aleksandr.fedorov_itglobal.com wrote:

Hah, I checked fFreeBSD 12 guest and it works. It seems that, the FreeBSD guest driver allocates descriptors of size >= MTU.

Yes indeed. But also the Linux virtio-net driver should do the same when mergeable rx buffers are not allocated.... so it is weird that it does not work in that case!

In D23342#514141, @aleksandr.fedorov_itglobal.com wrote:

Have you set mtu 9000 on the vtnet0 in the guest?
My tests were VM-2-VM, both freebsd guests.

Yes.
ifconfig from the guest (Ubuntu 16.04):

root@ubuntu:~# ifconfig
enp0s2    Link encap:Ethernet  HWaddr 00:a0:98:cb:89:bb
          inet addr:192.168.1.4  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2a0:98ff:fecb:89bb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:150 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8066 (8.0 KB)  TX bytes:48636 (48.6 KB)

Ping from the guest:

root@ubuntu:~# ping 192.168.1.77
PING 192.168.1.77 (192.168.1.77) 56(84) bytes of data.
64 bytes from 192.168.1.77: icmp_seq=1 ttl=64 time=1.05 ms
64 bytes from 192.168.1.77: icmp_seq=2 ttl=64 time=0.197 ms

--- 192.168.1.77 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.197/0.627/1.057/0.430 ms
^Croot@ubuntu:~#
^Croot@ubuntu:~# ping -s 8000 192.168.1.77
PING 192.168.1.77 (192.168.1.77) 8000(8028) bytes of data.
^C
--- 192.168.1.77 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3019ms

root@ubuntu:~#

I tried to debug this situation, and saw that the guest descriptors are 1536 bytes in size. Therefore, I do not understand how it could work with MTU == 9000.

I think the previous version of the patch is more adequate. And there was no POLA violation because with tap backend and MTU > 1500, it didn't work.

Yeah, but with mergeable buffers on, tap throughput dropped to 50%.
I'm not giving up on adding support for mergeable buffers. I just think it makes sense to split the patch into more changes.

Host:

root@mothership:/afedorov/vm # ifconfig tap1003
tap1003: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=80000<LINKSTATE>
        ether 58:9c:fc:10:cf:64
        inet 192.168.1.77 netmask 0xffffff00 broadcast 192.168.1.255
        groups: tap
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        Opened by PID 89835
root@mothership:/afedorov/vm #

Ping from host:

root@mothership:/afedorov/vm # ping 192.168.1.4
PING 192.168.1.4 (192.168.1.4): 56 data bytes
64 bytes from 192.168.1.4: icmp_seq=0 ttl=64 time=0.156 ms
64 bytes from 192.168.1.4: icmp_seq=1 ttl=64 time=0.155 ms
^C
--- 192.168.1.4 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.155/0.156/0.156/0.001 ms
root@mothership:/afedorov/vm #
root@mothership:/afedorov/vm # ping -s 8000 192.168.1.4
PING 192.168.1.4 (192.168.1.4): 8000 data bytes
^C
--- 192.168.1.4 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
root@mothership:/afedorov/vm #

bhyve:

bhyve \
        -A -P -H \
        -s 0:0,hostbridge \
        -s 1:0,lpc \
        -s 2:0,virtio-net,tap1003 \
        -s 3:0,virtio-blk,./ubuntu-16.04.img \
        -l com1,stdio \
        -c 4\
        -m 2048M \
        ubuntu-16.04

Maybe of course I'm doing something wrong.

Yeah, but with mergeable buffers on, tap throughput dropped to 50%.
I'm not giving up on adding support for mergeable buffers. I just think it makes sense to split the patch into more changes.

Oh, then it's good. +1 for splitting patch.

Yeah, just waiting for someone to accept the revision.

Yes indeed. But also the Linux virtio-net driver should do the same when mergeable rx buffers are not allocated.... so it is weird that it does not work in that case!

This is a big virtio specification problem. Too many unspecified actions and interpretations. Two weeks ago, I observed at a Windows guest that descriptors stand out as follows: | 12 | 1536 | 1536 | .... Even if the mergable buffers are enabled.

In D23342#514165, @aleksandr.fedorov_itglobal.com wrote:

Yes indeed. But also the Linux virtio-net driver should do the same when mergeable rx buffers are not allocated.... so it is weird that it does not work in that case!

This is a big virtio specification problem. Too many unspecified actions and interpretations. Two weeks ago, I observed at a Windows guest that descriptors stand out as follows: | 12 | 1536 | 1536 | .... Even if the mergable buffers are enabled.

You mean a single 12 bytes entry every two MTU-sized ones? That's weird and unconvenient indeed, although technically correct..

This revision is now accepted and ready to land.Feb 12 2020, 11:02 AM

Ping. Are there any issues that prevent committing these changes?

Is this good to go?

The alternative solution (the same used by QEMU) is to read the packet in a local bounce buffer (64KB), so that we figure out its actual length and then we call vq_getchain() the corrent number of times (e.g. just once for 1500 bytes packets). The cost for this solution is an additional packet copy, which is not good imho.

Tap has enough overhead already that I suspect using a bounce-buffer would be noise (and I've done that in a proprietary implementation).

Tap needs to be fixed anyways e.g. by providing the length of the first available buffer in kqueue data, and providing the length of the following buffer if any in msgdata.

Ok for the length of the first packet in kqueue data. But which msgdata are you talking about? Maybe cmsgdata in recvfrom(2)? if_tuntap exposes a character device to user space, so how could we use recvfrom on it?

Is this good to go?

I gave it a tick.

Ok for the length of the first packet in kqueue data. But which msgdata are you talking about? Maybe cmsgdata in recvfrom(2)? if_tuntap exposes a character device to user space, so how could we use recvfrom on it?

I was in socket-land when writing that comment (which would be good: recvmmsg would be a great way to avoid syscall overhead). But, should be a simple matter to prepend a header in front of every packet ala Linux tap/tun.

Is this good to go?

I gave it a tick.

Yes, this has been merged already.

Ok for the length of the first packet in kqueue data. But which msgdata are you talking about? Maybe cmsgdata in recvfrom(2)? if_tuntap exposes a character device to user space, so how could we use recvfrom on it?

I was in socket-land when writing that comment (which would be good: recvmmsg would be a great way to avoid syscall overhead). But, should be a simple matter to prepend a header in front of every packet ala Linux tap/tun.

Yes, this should be doable.