Details

Reviewers

vmaffione
jhb
grehan
rgrimes
julian
donner

Group Reviewers

bhyve

Commits

rS360958: Add a new bhyve network backend that allow to connect the VM to the netgraph(4)…

Summary

I want to introduce a new bhyve network backend that allow to connect VM to netgraph(4) mesh.
The backend uses socket API with PF_NETGRAPH protocol family which is provided by the ng_socket(4).

The main motivation:

Netgraph is the well-known mature FreeBSD feature for creating flexible networks.
Netgraph already has several useful modules: L2 bridge, vlan, nat, ipfw integration, pseudo-interface (ng_eiface (4)), QoS, etc.
The Socket API looks more flexible and has good performance.
Bhyve allowed to run in a Jail. Netgraph is also virtualized through VNET, so it might be interesting to run a group of virtual machines inside Jail.

Some notes:
The default socket buffer is too small (net.graph.recvspace: 20480 and net.graph.maxdgram: 20480). To achieve good performance, you may need to increase the value of kern.ipc.maxsockbuf. In my tests the optimal value is ~4 Mbytes.

With this value on XEON v4, I achieved the following results:
iperf3 5-6 GBit/s: VM (mtu == 1500) - ng_bridge - VM (mtu == 1500)
iperf3 11-12 GBit/s: VM (mtu == 9000) - ng_bridge - VM (mtu == 9000)
iperf3 22 GBit/s: VM (mtu == 64K) - ng_bridge - VM (mtu == 64K) - this is just for testing troughput if we enabled virtio-net TSO.

To use new backend:

-s X:Y:Z,[virtio-net,e1000],netgraph,socket=[ng_socket name],path=[destination node],hook=[our socket src hook],peerhook=[dst node hook]

with ng_bridge:

-s X:Y:Z,[virtio-net,e1000],netgraph,socket=vmX,path=vmbridge:,hook=vmlink,peerhook=link0

or a short version:

-s X:Y:Z,[virtio-net,e1000],netgraph,path=vmbridge:,peerhook=link0

To connect VM directly to the network interface:

# kldload ng_ether
# bhyve ... -s 5,[virtio-net,e1000],netgraph,path=ix0:,peerhook=lower

Test Plan

I tested various combinations, but the simplest is:

host# ngctl mkpeer . eiface test ether
host# ngctl mkpeer ngeth0: bridge ether link99
host# ngctl name ngeth0:ether vmbridge
host# ifconfig ngeth0 192.168.1.1/24 up
host# sh vmrun.sh -c 4 -m 1024M -t netgraph,socket=vm0,path=vmbridge:,hook=vm0link,peerhook=link0 -d freebsd-0.img freebsd-0
host# sh vmrun.sh -c 4 -m 1024M -t netgraph,socket=vm1,path=vmbridge:,hook=vm1link,peerhook=link1 -d freebsd-1.img freebsd-1
root@fbsd0:~ # ifconfig vtnet0 192.168.1.2/24 up
root@fbsd1:~ # ifconfig vtnet0 192.168.1.3/24 up

Test ping, iperf3 between 192.168.1.1, 192.168.1.2, 192.168.1.3

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

afedorov created this revision.Apr 29 2020, 5:04 PM

Herald added a reviewer: bhyve. · View Herald TranscriptApr 29 2020, 5:04 PM

Herald added subscribers: bcran, jonathan. · View Herald Transcript

afedorov requested review of this revision.Apr 29 2020, 5:04 PM

If I understand correctly, you are adding code in the VM-setup (copied from ngctl) to create a ng_socket and connect it to a specified node (ng_bridge). Then you are using the data channel of the ng_socket to transmit Ethernet frames.

IMHO that's gross inefficient, because the ng_socket interface is designed to handle small messages only. That's why you need to tweak the socket buffer parameters.

What stops you from creating an ng_ether node in the VM-environment (using the mechanism, you already wrote) and connect this? This is a far better way of efficient communication.

In D24620#541979, @lutz_donnerhacke.de wrote:

If I understand correctly, you are adding code in the VM-setup (copied from ngctl) to create a ng_socket and connect it to a specified node (ng_bridge). Then you are using the data channel of the ng_socket to transmit Ethernet frames.

Yes. This is how bhyve network backends works. The guest OS sends / receives packets through the guest driver, bhyve processes them in user space and redirects them to the appropriate device using read / write / mmap system calls. Bhyve currently supports packet processing through /dev/tapX and /dev/netmap.

This review add support packet processing through ng_socket(4). I only know two useful ways to send/receive packets to/from the Netgraph network from the user space: ng_socket and ng_device (open /dev/ngdN and read/write).

IMHO that's gross inefficient, because the ng_socket interface is designed to handle small messages only. That's why you need to tweak the socket buffer parameters.

I am not sure if the ng_socket interface has any design restrictions. This is a regular datagram socket. The increase in buffer size is more due to the fact that the guest OS exports its packet queues (virtio queue) and bhyve processes them in batched mode.

What stops you from creating an ng_ether node in the VM-environment (using the mechanism, you already wrote) and connect this? This is a far better way of efficient communication.

I don’t quite understand what you mean. A typical use is to create several virtual machines, connect them to the L2 bridge, which is connected to the physical interface.

For example with tap backend:

vm0 - tap0 - if_bridge(4) - igb0
vmN - tapN ---/

Or with netmap backend:

vm0 - vale - igb0
vmN - /

With proposed patch:

vm0 - ng_socket[0] - ng_bridge - igb0 (ng_ether)
vmN - ng_socket[N] -----/

Of course, with the proposed patch, you can connect ng_socket hook directly to ng_ether lower hook:

vm0 - ng_socket[0] - igb0(ng_ether)

Moreover, you can connect the VM to any Netgraph node that supports the processing of Ethernet frames.

In D24620#542160, @aleksandr.fedorov_itglobal.com wrote:

In D24620#541979, @lutz_donnerhacke.de wrote:

If I understand correctly, you are adding code in the VM-setup (copied from ngctl) to create a ng_socket and connect it to a specified node (ng_bridge). Then you are using the data channel of the ng_socket to transmit Ethernet frames.

Yes. This is how bhyve network backends works. The guest OS sends / receives packets through the guest driver, bhyve processes them in user space and redirects them to the appropriate device using read / write / mmap system calls. Bhyve currently supports packet processing through /dev/tapX and /dev/netmap.

This review add support packet processing through ng_socket(4). I only know two useful ways to send/receive packets to/from the Netgraph network from the user space: ng_socket and ng_device (open /dev/ngdN and read/write).

I do not know anything about bhyve, that's why I assumed something like a real VM. My fault.

Yes, you are correct: ng_socket is the correct way to handle the communication with user space processes.
So the approach is on the right track.

IMHO that's gross inefficient, because the ng_socket interface is designed to handle small messages only. That's why you need to tweak the socket buffer parameters.

I am not sure if the ng_socket interface has any design restrictions. This is a regular datagram socket.

The interface assumes, that everything is small enough to pas through in a single step.
But the assumption is proven wrong, hence I started with ideas like D23850.

My question so far was about the right approach.
I did not look into the details right now.

ping

Your test defines a node named "vmbridge".

host# ngctl name ngeth0:ether vmbridge

and then referes to a node "vmbr"

host# sh vmrun.sh -c 4 -m 1024M -t netgraph:socket=vm0:path=vmbr:hook=vm0link:peerhook=link0 -d freebsd-0.img freebsd-0

Where does this new node come from?

usr.sbin/bhyve/net_backends.c
1015	Your code assumes, that the name of the netgraph node is specified, so better call this option "nodename" instead of "path".
1016	How do you specify a "unnamed" node? Like lagg12:lower.vlan34 (assuming a ng_ether on lagg12, connected to an unnamed ng_vlan). Your separator is conflicting with the valid path characters.
1017	Assuming you have provided an argument longer than NG_PATHSIZ, so you silently truncate the path argument, correct? Furthermore appending a colon assumes, that the path to be the name of an node only.
1088	Is your manual optimization always better than the compiler code?
1121	You have an error state, your parameters might be invalid or unset. Especially be->fd might be 0 (from memset). ng_cleanup will close this arbitrary file descriptor unconditionally. Is this really intended?
1217	Better return EINVAL, because no parameter can be set.

vmaffione added inline comments.May 3 2020, 9:20 PM

usr.sbin/bhyve/net_backends.c
992	we haven't allocated anything here, so it may be more readable to avoid a goto and just return -1
1000	Can we use "," as a separator? This would be consistent with the other places where we do this kind of parsing.
1027	same thing here, it's probably cleaner to just return -1
1034	same as above
1039	same as above
1052	shouldn't you close the control socket here, to avoid resource leak?
1088	+1 Also, this is initialization code, so it would not make any difference.
1092	I'm curious: have you tried what happens if you try to set a value greater than the maximum allowed one? Do you get a failure or a saturation (to the kernel max value)?
1217	+1 we just do the same as `tap_set_cap()`

Return -1 earlier if an error occurs.
Change separator from ':' to '/' to allow different Netgraph address types (See man 4 netgraph 'Addressing').
Fix resources leak through ctrl_sock.
Change the "path" option to "relpath" to match the ngctl connect command.

ngctl: usage: connect [path] <relpath> <hook> <peerhook>

In D24620#543377, @aleksandr.fedorov_itglobal.com wrote:

Change the "path" option to "relpath" to match the ngctl connect command.

Please keep it "path".
"relpath" is useless in this situation, because your node is not yet connected and can't use a relative path at all.

afedorov marked 9 inline comments as done.May 4 2020, 11:27 AM

afedorov added inline comments.

usr.sbin/bhyve/net_backends.c
1000	The main problem is that the frontend does not pass options separated by the “,” symbol to the backend. Therefore, if we pass “netgraph,socket=X,etc.” from the command line, ng_init() will only get “netgraph”. To change this behavior, we need to change all the backends. What do you think about this?
1015	What do you think about 'relpath', similar to ngctl connect options naming?
1088	I just don't want to use float to int conversion. The problem with kern.ipc.maxsockbuf is described at https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204438
1092	f I try to set the values greater than or equal to kern.ipc.maxsockbuf, then setsockopt () return ENOBUFS.
1121	The value be->fd is initialized to -1 on line 794. So I think ng_cleaunup() must work as expected.

Please keep it "path".
"relpath" is useless in this situation, because your node is not yet connected and can't use a relative path at all.

Ок.

I tried the code generation with:

int testFP(int i) {
   return i*0.75;
}

Then compiling with "cc -S -O3" produces the assembly:

        .section        .rodata.cst8,"aM",@progbits,8
        .p2align        3               # -- Begin function testFP
.LCPI0_0:
        .quad   4604930618986332160     # double 0.75
        .text
        .globl  testFP
        .p2align        4, 0x90
        .type   testFP,@function
testFP:                                 # @testFP
        .cfi_startproc
# %bb.0:
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        cvtsi2sd        %edi, %xmm0
        mulsd   .LCPI0_0(%rip), %xmm0
        cvttsd2si       %xmm0, %eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
.Lfunc_end0:
        .size   testFP, .Lfunc_end0-testFP
        .cfi_endproc
                                        # -- End function

So your objection is correct, the code is using floating point. Somebody may argue, that this is one-time-setup code, so the penalty is irrelevant.

But you can also check for integer only calculation:

int testI(int i) {
   return i*3/4;
}

This compiles to

        .globl  testI                   # -- Begin function testI
        .p2align        4, 0x90
        .type   testI,@function
testI:                                  # @testI
        .cfi_startproc
# %bb.0:
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
                                        # kill: def $edi killed $edi def $rdi
        leal    (%rdi,%rdi,2), %ecx
        movl    %ecx, %eax
        sarl    $31, %eax
        shrl    $30, %eax
        addl    %ecx, %eax
        sarl    $2, %eax
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
.Lfunc_end1:
        .size   testI, .Lfunc_end1-testI
        .cfi_endproc
                                        # -- End function

So the compiler did choose your approach (at least at my machine).

Please choose wisely.

Revert: 'relpath' -> 'path' option.
Correctly calculate the maximum available socket buffer size, as is done in the kernel.

afedorov marked 2 inline comments as done.May 4 2020, 12:07 PM

I assume, it is common practice to not explicitly assert(be != NULL) in each of the functions.

This revision is now accepted and ready to land.May 4 2020, 12:42 PM

@vmaffione do you have any objections?

vmaffione added inline comments.May 5 2020, 3:56 PM

usr.sbin/bhyve/net_backends.c
942	Most of the code is copy-pasted from tap. And the `priv` is identical. We could reuse `struct tap_priv` and all the callbacks but `ng_init`. This would simplify this patch to just `ng_init` and the `ng_backend` struct.
1000	Right, I missed that. The `/` separator, in particular, looks very confusing because it is used in file path. Choosing `;` looks better to me. In any case, from the bhyve user point of view, the resulting syntax would look confusing in any case. You would have something like: -s 2:0,frontend-name,backend-name;beopt1=beval1;...;beoptN=bevalN,feopt1=feval1,...,feoptN=fevalN and so with a mixture of `,` and `;`, and the backend arguments before the frontend arguments. Rather than this, I would rather make the life easier to the user, by allowing him to use `,` as a separator in any case, and specify frontend and backend options in the order she wishes. Yes, we would need to change the frontends. But I think all we need to do is to make an additional copy of the options string, and pass that copy as `devname` to `netbe_init()`. Then the frontend will parse only the options that it recognizes, and the backend will do the same. We can leave the frontend in charge of freeing the additional string. The real problem here is that frontend and backend should have been separate options (like in QEMU), but in bhyve we specify them both with a single `-s` option. But such a change would be too invasive and not backward-compat.
1088	Yes, I was suggesting using `3*x/4`, which does not use floating point instructions.

afedorov updated this revision to Diff 71428.May 5 2020, 6:42 PM

This revision now requires review to proceed.May 5 2020, 6:42 PM

Reuse tap backend functions.

afedorov marked an inline comment as done.May 5 2020, 6:44 PM

afedorov marked 3 inline comments as done.

@vmaffione , I left '/' as a separator, because ';' used by the shell (tcsh, sh), so there is a need to enclosure the option string.
I think the best solution is still to pass the full line of options to the backend. Then we can use ',' as the delimiter.
I have not completely understood how backward compatibility can be broken? May you clarify?

Yes, my suggestion was indeed to avoid any additional separator (/, ; or whatever), and use only ,. This requires small changes to the frontend (to strdup and then free the whole option string to be passed to the backend), as I described above.

My comment about backward compatibility was not referred to this solution, nor to your current patch. It was referred to the idea of switching from this option style:

-s 2:0,frontend-name,backend-name,opt1=val1,opt2=val2,...

to something like

-s 2:0,frontend-name,backend_id=ID,feopt1=val1,feopt2=val2,...
--backend backend-name,backend_id=ID,beopt1=val1,beopt2=val2,...

similarly to what QEMU does. This solution would be non-backward compatible. I'm not suggesting going this way. I was just illustrating.

afedorov mentioned this in D24735: [bhyve] Pass a complete string of parameters to network backends.May 6 2020, 3:00 PM

I'm still fine with the netgraph part.

Rebase
Use ',' as options separator.
Move '#include <sys/sysctl.h>' to NETGRAPH section.

@vmaffione , do you have any objections?

Looks good to me.

This revision is now accepted and ready to land.May 9 2020, 7:54 AM

Closed by commit rS360958: Add a new bhyve network backend that allow to connect the VM to the netgraph(4)… (authored by afedorov). · Explain WhyMay 12 2020, 11:18 AM

This revision was automatically updated to reflect the committed changes.

afedorov added a commit: rS360958: Add a new bhyve network backend that allow to connect the VM to the netgraph(4)….

Herald added a subscriber: imp. · View Herald TranscriptMay 12 2020, 11:18 AM

[bhyve] Add Netgraph network backend.
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 71558

usr.sbin/bhyve/Makefile

usr.sbin/bhyve/net_backends.c

[bhyve] Add Netgraph network backend.ClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 71558

usr.sbin/bhyve/Makefile

usr.sbin/bhyve/net_backends.c

[bhyve] Add Netgraph network backend.
ClosedPublic
Actions

Revision Contents
Changeset List