This patch cleans up the API between the net frontends (e1000, virtio-net) and the net backends (tap and netmap).
We move the virtio-net header stripping/prepending to the virtio-net code, where this functionality belongs.
In this way, the netbe_send() and netbe_recv() signatures can take const struct iovec * rather than struct iovec *.
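As a reference for the discussion below, here is a sketch of the reworked backend entry points implied by this summary; the exact declarations in net_backends.h may differ.

```c
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Sketch of the refactored backend interface (parameter names are
 * assumptions): since the virtio-net header is now handled by
 * pci_virtio_net.c, the backends no longer need to modify the iovec
 * array, so it can be passed as const.
 */
struct net_backend;

ssize_t netbe_send(struct net_backend *be, const struct iovec *iov,
    int iovcnt);
ssize_t netbe_recv(struct net_backend *be, const struct iovec *iov,
    int iovcnt);
```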
Details
- Reviewers: jhb, grehan, markj, bryanv
- Group Reviewers: manpages, bhyve
- Commits: rS358185 (MFC r357846), rS357846 (bhyve: move virtio-net header processing to pci_virtio_net)
Event Timeline
Minor man page fix.
usr.sbin/bhyve/bhyve.8:588: A line break is required after a sentence stop.
usr.sbin/bhyve/pci_virtio_net.c:337-338: I tried to test this patch and found a bug: hdr points to an incorrect location. If I move the 'hdr' assignment before the netbe_recv() call, everything works correctly.
usr.sbin/bhyve/pci_virtio_net.c:314: We need to initialize 'struct virtio_net_rxhdr *hdr = iov[0].iov_base;' before calling netbe_recv(), which can change the value of iov[0].iov_base.
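A minimal sketch of the fix these comments converge on (simplified context, not the committed hunk; 'sc', 'iov', and 'n' stand in for the real locals in the rx path):

```c
/*
 * Capture the header pointer while iov[0].iov_base still points at
 * the start of the guest buffer chain, because netbe_recv() can
 * change iov[0].iov_base.
 */
struct virtio_net_rxhdr *hdr;
ssize_t len;

hdr = iov[0].iov_base;		/* BEFORE netbe_recv(), not after */
len = netbe_recv(sc->vsc_be, iov, n);
if (len > 0) {
	/* ... now it is safe to fill in the header via 'hdr' ... */
}
```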
I refactored the code so that iov trimming is done by the virtio-net code, where this operation belongs.
As a nice result, now we can pass struct iovec arrays as const to the backend code.
Tested several combinations (vm-2-host, vm-2-vm, tap, vale, mergeable rx buffers on and off).
Everything seems to work correctly.
The changes look good to me.
Also, I tested Win10, CentOS 7, and Ubuntu 16 and 18 in our test lab; everything works correctly.
@vmaffione Is there a reason we need an option for this, or at the very least a reason it shouldn't default to on with an option to disable it? I would consider 9000 MTU not working to be the POLA violation, not the other way around.
I'd suggest going even further: don't make it an option at all, just make it always on. There doesn't seem to be any reason to have it disabled.
Is "mtu 9000" really not working? I tried on head (without this patch), by setting the mtu to 9000 for both if_vtnet guest interfaces and host interfaces, and I'm able to ping -s 8900 x.y.z.w. So what exactly is not working?
Regarding the option... at the moment there is actually a performance reason to keep it disabled for tap and enabled for netmap.
In the current implementation of mergeable rx buffers, for each packet to be received bhyve will collect ~32 guest buffers (calling vq_getchain ~32 times), so that it can handle a 64KB input packet (either TSO or large MTU). In the default case of vtnet+tap without jumbo frames, this is a huge waste of time (since 1 guest buffer is always enough, and the other 31 are returned to the vq), and it slows down the guest considerably.
The problem is that we keep calling vq_getchain() again and again on the same guest buffers, and vq_getchain() is kind of expensive. To solve this issue, we can avoid returning unused guest buffers to the vq and reuse them for the next call. That's something I would like to implement. The alternative solution (the same one used by QEMU) is to read the packet into a local bounce buffer (64KB), so that we figure out its actual length and then call vq_getchain() the correct number of times (e.g. just once for 1500-byte packets); see the sketch below. The cost of this solution is an additional packet copy, which is not good imho.
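The following sketch illustrates that bounce-buffer alternative; copy_to_guest_chain() is a hypothetical helper standing in for the vq_getchain()/copy/vq_relchain() sequence, and all error handling is omitted.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <stdint.h>

#define RX_BOUNCE_SIZE	65536		/* max virtio-net packet (TSO) */

struct net_backend;
extern ssize_t netbe_recv(struct net_backend *, const struct iovec *, int);
/* Hypothetical helper: dequeue ONE guest chain, copy up to 'len'
 * bytes into it, release it, and return how many bytes were copied. */
extern ssize_t copy_to_guest_chain(void *vq, const uint8_t *buf, size_t len);

static uint8_t rx_bounce[RX_BOUNCE_SIZE];

static void
rx_bounce_sketch(struct net_backend *be, void *vq)
{
	struct iovec biov = {
		.iov_base = rx_bounce,
		.iov_len = sizeof(rx_bounce),
	};
	ssize_t len, off, done;

	/* One read into the bounce buffer; now the true length is known. */
	len = netbe_recv(be, &biov, 1);

	/* Dequeue only as many chains as the packet actually needs:
	 * one iteration for a 1500-byte packet instead of ~32 eager
	 * vq_getchain() calls. The price is this extra copy. */
	for (off = 0; off < len; off += done) {
		done = copy_to_guest_chain(vq, rx_bounce + off,
		    (size_t)(len - off));
		if (done <= 0)
			break;		/* out of guest buffers: drop rest */
	}
}
```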
If you don't really like the idea of adding an option that may become superfluous in the future, we could postpone the tap+mergeable-rx-buffers support (until we have a performant solution) and just commit the refactoring work (most of this diff, actually).
> The alternative solution (the same one used by QEMU) is to read the packet into a local bounce buffer (64KB), so that we figure out its actual length and then call vq_getchain() the correct number of times (e.g. just once for 1500-byte packets). The cost of this solution is an additional packet copy, which is not good imho.
Tap has enough overhead already that I suspect using a bounce-buffer would be noise (and I've done that in a proprietary implementation).
Tap needs to be fixed anyway, e.g. by providing the length of the first available buffer in kqueue data, and the length of the following buffer, if any, in msgdata.
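To make that concrete, here is a hypothetical sketch; it assumes (this is the proposed tap change, not behavior to rely on today) that EVFILT_READ on the tap fd reports the first queued packet's length in the kevent data field.

```c
#include <sys/event.h>
#include <sys/types.h>

static void
tap_wait_and_size_read(int kq, int tapfd)
{
	struct kevent ev;
	ssize_t pktlen;

	if (kevent(kq, NULL, 0, &ev, 1, NULL) != 1)
		return;
	if ((int)ev.ident != tapfd)
		return;

	/* ASSUMPTION: ev.data carries the first packet's length. */
	pktlen = (ssize_t)ev.data;
	/*
	 * With pktlen known up front, the rx path could dequeue just
	 * enough guest buffers before issuing the read, instead of
	 * preparing a worst-case 64KB iovec every time.
	 */
	(void)pktlen;
}
```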
@vmaffione Don't forget to add the -D flag to your ping testing; otherwise the ping is fragmented and reassembled.
So yes, jumbo is broken with tap, and it gives us a theoretical 2-4x performance hit in some cases; the question is whether merging is a lower hit than that. Bhyve networking speed has been one of the biggest things holding us back on bhyve at iX.
That is also a viable solution. Ok, so I will remove the option-related changes and keep the refactoring (moving the header adaptation code to pci_virtio_net.c). Then I will keep working on tap in follow-up patches.
Thanks for the suggestion. I retried with -D (on head), and it still works. I cannot reproduce the broken jumbo-frames+tap...
> So yes, jumbo is broken with tap, and it gives us a theoretical 2-4x performance hit in some cases; the question is whether merging is a lower hit than that. Bhyve networking speed has been one of the biggest things holding us back on bhyve at iX.
I think we can get much better with a few changes.
> Thanks for the suggestion. I retried with -D (on head), and it still works. I cannot reproduce the broken jumbo-frames+tap...
That is very strange. In my tests, the version from head doesn't work:
Guest: Ubuntu 16.04 mtu 9000 192.168.1.4
Host: ifconfig tap1001 192.168.1.55/24 mtu 9000 up
Ping from host doesn't work.
ping -s 8000 -D 192.168.1.4
I tried to debug this situation and saw that the guest descriptors are 1536 bytes in size. Therefore, I do not understand how it could work with MTU == 9000.
I think the previous version of the patch was more appropriate. And there was no POLA violation, because with the tap backend and MTU > 1500 it didn't work anyway.
Have you set mtu 9000 on vtnet0 in the guest?
My tests were vm-2-vm, both FreeBSD guests.
> I tried to debug this situation and saw that the guest descriptors are 1536 bytes in size. Therefore, I do not understand how it could work with MTU == 9000.
> I think the previous version of the patch was more appropriate. And there was no POLA violation, because with the tap backend and MTU > 1500 it didn't work anyway.
Yeah, but with mergeable buffers on, tap throughput dropped to 50%.
I'm not giving up on adding support for mergeable buffers. I just think it makes sense to split the patch into smaller changes.
> Have you set mtu 9000 on vtnet0 in the guest?
> My tests were vm-2-vm, both FreeBSD guests.
Yes.
ifconfig from the guest (Ubuntu 16.04):
```
root@ubuntu:~# ifconfig
enp0s2    Link encap:Ethernet  HWaddr 00:a0:98:cb:89:bb
          inet addr:192.168.1.4  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::2a0:98ff:fecb:89bb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:11 errors:0 dropped:0 overruns:0 frame:0
          TX packets:150 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8066 (8.0 KB)  TX bytes:48636 (48.6 KB)
```
Ping from the guest:
```
root@ubuntu:~# ping 192.168.1.77
PING 192.168.1.77 (192.168.1.77) 56(84) bytes of data.
64 bytes from 192.168.1.77: icmp_seq=1 ttl=64 time=1.05 ms
64 bytes from 192.168.1.77: icmp_seq=2 ttl=64 time=0.197 ms
^C
--- 192.168.1.77 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1005ms
rtt min/avg/max/mdev = 0.197/0.627/1.057/0.430 ms
root@ubuntu:~# ping -s 8000 192.168.1.77
PING 192.168.1.77 (192.168.1.77) 8000(8028) bytes of data.
^C
--- 192.168.1.77 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3019ms
root@ubuntu:~#
```
> I tried to debug this situation and saw that the guest descriptors are 1536 bytes in size. Therefore, I do not understand how it could work with MTU == 9000.
> I think the previous version of the patch was more appropriate. And there was no POLA violation, because with the tap backend and MTU > 1500 it didn't work anyway.
> Yeah, but with mergeable buffers on, tap throughput dropped to 50%.
> I'm not giving up on adding support for mergeable buffers. I just think it makes sense to split the patch into smaller changes.
Host:
```
root@mothership:/afedorov/vm # ifconfig tap1003
tap1003: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 9000
        options=80000<LINKSTATE>
        ether 58:9c:fc:10:cf:64
        inet 192.168.1.77 netmask 0xffffff00 broadcast 192.168.1.255
        groups: tap
        media: Ethernet autoselect
        status: active
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        Opened by PID 89835
root@mothership:/afedorov/vm #
```
Ping from host:
```
root@mothership:/afedorov/vm # ping 192.168.1.4
PING 192.168.1.4 (192.168.1.4): 56 data bytes
64 bytes from 192.168.1.4: icmp_seq=0 ttl=64 time=0.156 ms
64 bytes from 192.168.1.4: icmp_seq=1 ttl=64 time=0.155 ms
^C
--- 192.168.1.4 ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.155/0.156/0.156/0.001 ms
root@mothership:/afedorov/vm # ping -s 8000 192.168.1.4
PING 192.168.1.4 (192.168.1.4): 8000 data bytes
^C
--- 192.168.1.4 ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
root@mothership:/afedorov/vm #
```
bhyve:
```
bhyve \
    -A -P -H \
    -s 0:0,hostbridge \
    -s 1:0,lpc \
    -s 2:0,virtio-net,tap1003 \
    -s 3:0,virtio-blk,./ubuntu-16.04.img \
    -l com1,stdio \
    -c 4 \
    -m 2048M \
    ubuntu-16.04
```
Of course, maybe I'm doing something wrong.
> Yeah, but with mergeable buffers on, tap throughput dropped to 50%.
> I'm not giving up on adding support for mergeable buffers. I just think it makes sense to split the patch into smaller changes.
Oh, then it's good. +1 for splitting the patch.
Hah, I checked a FreeBSD 12 guest and it works. It seems that the FreeBSD guest driver allocates descriptors of size >= MTU.
Yes indeed. But the Linux virtio-net driver should also do the same when mergeable rx buffers are not allocated... so it is weird that it does not work in that case!
> Yes indeed. But the Linux virtio-net driver should also do the same when mergeable rx buffers are not allocated... so it is weird that it does not work in that case!
This is a big problem with the virtio specification: too many unspecified actions and interpretations. Two weeks ago, I observed on a Windows guest that the descriptors are laid out as follows: | 12 | 1536 | 1536 | ... even when mergeable buffers are enabled.
You mean a single 12-byte entry for every two MTU-sized ones? That's weird and inconvenient indeed, although technically correct.
Is this good to go?
Ok for the length of the first packet in kqueue data. But which msgdata are you talking about? Maybe cmsgdata in recvfrom(2)? if_tuntap exposes a character device to user space, so how could we use recvfrom on it?
> Is this good to go?
I gave it a tick.
> Ok for the length of the first packet in kqueue data. But which msgdata are you talking about? Maybe cmsgdata in recvfrom(2)? if_tuntap exposes a character device to user space, so how could we use recvfrom on it?
I was in socket-land when writing that comment (which would be good: recvmmsg(2) would be a great way to avoid syscall overhead). But it should be a simple matter to prepend a header in front of every packet, a la Linux tap/tun.
Yes, this has been merged already.
> Ok for the length of the first packet in kqueue data. But which msgdata are you talking about? Maybe cmsgdata in recvfrom(2)? if_tuntap exposes a character device to user space, so how could we use recvfrom on it?
> I was in socket-land when writing that comment (which would be good: recvmmsg(2) would be a great way to avoid syscall overhead). But it should be a simple matter to prepend a header in front of every packet, a la Linux tap/tun.
Yes, this should be doable.
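For illustration, a sketch of how a consumer could parse such length-prefixed packets out of one large read; the tap_pkthdr layout is invented for this example, not an existing kernel ABI.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/*
 * Invented header layout, in the spirit of Linux tun/tap's
 * packet-info header: a small length field prepended to each packet.
 */
struct tap_pkthdr {
	uint32_t tph_len;	/* length of the packet that follows */
};

/* Hypothetical consumer: walk a buffer filled by one large read and
 * hand each complete packet to the rx path via 'rx_one'. */
static void
consume_batch(const uint8_t *buf, size_t avail,
    void (*rx_one)(const uint8_t *, size_t))
{
	struct tap_pkthdr ph;

	while (avail >= sizeof(ph)) {
		memcpy(&ph, buf, sizeof(ph));
		if (sizeof(ph) + ph.tph_len > avail)
			break;		/* partial packet: wait for more */
		rx_one(buf + sizeof(ph), ph.tph_len);
		buf += sizeof(ph) + ph.tph_len;
		avail -= sizeof(ph) + ph.tph_len;
	}
}
```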