bridge: Add support for emulated netmap mode
ClosedPublic

Authored by markj on Jan 15 2023, 5:04 PM.
Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

sys/net/if_bridge.c:2480

This might be too early. I think we still want the kernel to handle 802.1D packets, for instance.

I'm sorry, what are we going to achieve here, exactly?

As far as I understand, bridge_input gets called when a bridge member interface receives an input packet. Now, the packet's destination may be one or more other member interfaces, the bridge interface itself (e.g. bridge0), or several of these at once.
To offer a consistent behavior, I think that only the packets with destination bridge0 should be passed to netmap. Others should go to their normal non-netmap fate.

> For this to make sense from the user perspective, attaching to a bridge should capture all packets associated with the bridge, e.g. as seen by bpf (although for now bpf might be circumvented). The reason is that we don't want to modify and restart user programs; instead we simply reconfigure the bridge device, akin to how lagg netmap works now.

Doesn't bpf receive a copy of the packets, while the packets keep following their normal path? Netmap steals the packets instead, so that's a completely different use case.

I think the lagg case is also different, because the lagg interface receives all the packets that arrive through any of its member interfaces. The bridge0 interface, on the other hand, only receives packets destined to the IP address associated with bridge0 (plus broadcasts).

> I'm sorry, what are we going to achieve here, exactly?

All packets received by any bridge interface will get shunted to the netmap application.

> As far as I understand, bridge_input gets called when a bridge member interface receives an input packet. Now, the packet's destination may be one or more other member interfaces, the bridge interface itself (e.g. bridge0), or several of these at once.

That's correct.

> To offer a consistent behavior, I think that only the packets with destination bridge0 should be passed to netmap. Others should go to their normal non-netmap fate.

Isn't the consistent behaviour to capture all packets received on the interface? It's up to the application to define some policy to handle packets with an unexpected destination address. Suppose I enable netmap on a member port of a bridge. That port will be in promiscuous mode and may receive packets destined to the bridge interface. But the netmap application will receive those packets anyway, no?

  • Capture packets after monitor mode and span ports have had a chance to tap them.
  • Update the learning table before handing packets to netmap. Otherwise if the bridge has no opportunity to learn addresses, it'll flood every port on all transmits.
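The second point can be illustrated with a toy learning table. This is a sketch, not the sys/net/if_bridge.c code: if source addresses are never learned before packets are diverted to netmap, every destination lookup misses and the bridge has no choice but to flood.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative model of a bridge MAC learning table (not the real
 * if_bridge(4) code). An unknown destination forces flooding to
 * every port, which is what happens if the bridge never gets a
 * chance to learn addresses before netmap steals the packets.
 */
#define BR_TABLE_SIZE 16
#define BR_PORT_FLOOD (-1)      /* unknown destination: flood all ports */

struct br_entry {
    uint8_t mac[6];
    int     port;
    int     valid;
};

static struct br_entry br_table[BR_TABLE_SIZE];

/* Record which port a source MAC was seen on. */
void br_learn(const uint8_t mac[6], int port)
{
    for (int i = 0; i < BR_TABLE_SIZE; i++) {
        if (!br_table[i].valid || memcmp(br_table[i].mac, mac, 6) == 0) {
            memcpy(br_table[i].mac, mac, 6);
            br_table[i].port = port;
            br_table[i].valid = 1;
            return;
        }
    }
}

/* Look up the output port for a destination MAC; flood on a miss. */
int br_lookup(const uint8_t mac[6])
{
    for (int i = 0; i < BR_TABLE_SIZE; i++)
        if (br_table[i].valid && memcmp(br_table[i].mac, mac, 6) == 0)
            return br_table[i].port;
    return BR_PORT_FLOOD;
}
```

The ordering argument above is that `br_learn()` must run before the packet is handed to netmap; otherwise every subsequent transmit toward that address resolves to the flood case.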

Yes, but a member interface is something different from the bridge0 interface.
If I run netmap on a member interface I expect to see any packets that come across that interface (irrespective of src or dst addresses), but I do not expect to see packets that come across other member interfaces (unless those packets also happen to pass through the member ifnet open in netmap mode).
Similarly, if I run netmap on bridge0, I expect to see any packets that would be received on that interface if it were not in netmap mode. If I am not mistaken, bridge0 only gets packets with IP destination matching the IP address of bridge0 (plus broadcasts).
In other words, I do not expect an interface to behave functionally different when open in netmap mode. I just want to use a different API. Would bridge0 in this case work differently when open in netmap mode?

> Yes, but a member interface is something different from the bridge0 interface.
> If I run netmap on a member interface I expect to see any packets that come across that interface (irrespective of src or dst addresses), but I do not expect to see packets that come across other member interfaces (unless those packets also happen to pass through the member ifnet open in netmap mode).
> Similarly, if I run netmap on bridge0, I expect to see any packets that would be received on that interface if it were not in netmap mode. If I am not mistaken, bridge0 only gets packets with IP destination matching the IP address of bridge0 (plus broadcasts).

I'm not sure what "gets packets" means. bridge_input() is called for any packet received on any member port. There is no IP-layer processing that happens first. bridge_input() looks at the dst ether address to see if it belongs to the bridge or any member port. If so, then the bridge learns the address and finishes, i.e., the packet gets passed to ether_demux(), which calls a handler based on the ethertype (e.g., ip_input()). If not, the bridge tries to forward the packet out of one of the member ports, so no IP-layer processing happens.

So when a bridge is open in netmap mode, I would expect the application to get all packets that the bridge sees, which is all of the packets received on all member ports.
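The bridge_input() decision described above can be sketched as a simplified userspace model (illustrative only, not the actual sys/net/if_bridge.c logic): classify each frame by destination Ethernet address, with no IP-layer processing involved.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Simplified model of the bridge_input() classification: a frame
 * received on a member port is either locally destined (passed up
 * via ether_demux() and then e.g. ip_input()) or a candidate for
 * L2 forwarding out of another member port.
 */
enum br_verdict { BR_LOCAL, BR_FORWARD };

struct br_model {
    uint8_t bridge_mac[6];      /* MAC of the bridge interface */
    uint8_t member_macs[4][6];  /* MACs of the member ports */
    int     nmembers;
};

enum br_verdict br_classify(const struct br_model *br, const uint8_t dst[6])
{
    /* Destination is the bridge interface itself? */
    if (memcmp(dst, br->bridge_mac, 6) == 0)
        return BR_LOCAL;
    /* Destination is one of the member ports? */
    for (int i = 0; i < br->nmembers; i++)
        if (memcmp(dst, br->member_macs[i], 6) == 0)
            return BR_LOCAL;
    /* Otherwise, try to forward out of a member port. */
    return BR_FORWARD;
}
```

The whole disagreement in this thread is about which of these two verdicts should hand the packet to netmap when bridge0 is opened in netmap mode.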

> In other words, I do not expect an interface to behave functionally different when open in netmap mode. I just want to use a different API. Would bridge0 in this case work differently when open in netmap mode?

If I understand your question correctly, I don't think so, assuming that the netmap application is simply passing packets between netmap and host rings.

Sure, I didn't mean to imply that any IP processing happens within the bridge code. I'll try to reformulate my question starting from your response. As you say there are two cases: depending on the dst ether address, the packet may (1) belong to the bridge or (2) belong to something else (reachable through some member port). In the first case, you have ether_demux() and maybe ip_input() on the bridge0 interface. In the second case, you forward to one (or more member ports). What I was trying to express (and failed to do so far), is that maybe when opening the netmap port in bridge0 you should only see the packets that match case (1) and not those that match case (2), assuming bridge0 is not in promisc mode (in promisc mode, you should also see packets of case (2)). This would match the behaviour of any other physical interface that you happen to open in netmap mode (again assuming no promisc mode), since you would only see packets with dst ether matching the MAC address of the physical interface.
I hope I managed to explain myself now...
What do you think?


Thanks for your patience, and sorry for the delayed follow up. I see what you're suggesting now. I think it's reasonable to require the bridge be in promiscuous mode in order to intercept packets that would otherwise be forwarded. As far as I know, setting promiscuous mode has no effect on a bridge interface today.

Yes, I've seen that IFF_PROMISC is not handled by if_bridge right now...
This could allow us to ignore the problem for the time being and pass them all to netmap. But handling IFF_PROMISC properly would be the more reasonable approach.

Address Vincenzo's comment: only capture non-local packets if the
bridge is in promiscuous mode.

Update the if_bridge manual page to describe behaviour wrt netmap.

Fix a problem with forwarding in netmap mode: when the application
writes packets to the host ring, the packets are injected via the
bridge's if_input to ether_input(). But this means that the bridge
interface won't see them again and in particular won't perform L2
forwarding.

Fix the problem by interposing bridge_inject() between if_input and
ether_input(). In netmap mode, bridge_inject() flags the packet
such that ether_input() will bring it back to the bridge, which can
then select a fake source port based on the source L2 addr. This
then allows forwarding to work.
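The reinjection scheme in the update above can be sketched as a toy model. The flag name and the packet struct here are illustrative stand-ins for the real mbuf machinery: bridge_inject() marks packets coming off the host TX ring so that ether_input() hands them back to the bridge instead of the protocol layers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the bridge_inject() interposition (not the committed
 * code). PKT_F_BRIDGE_INJECT is a hypothetical flag standing in
 * for whatever marking the real implementation uses.
 */
#define PKT_F_BRIDGE_INJECT 0x1

struct pkt {
    uint32_t flags;
};

enum input_path { PATH_BRIDGE, PATH_PROTOCOLS };

/* Interposed between netmap's if_input call and ether_input(). */
void bridge_inject_model(struct pkt *p)
{
    p->flags |= PKT_F_BRIDGE_INJECT;
}

/* ether_input() routes flagged packets back through the bridge,
 * which can then pick a source port from the learned src MAC. */
enum input_path ether_input_model(struct pkt *p)
{
    if (p->flags & PKT_F_BRIDGE_INJECT) {
        p->flags &= ~PKT_F_BRIDGE_INJECT;
        return PATH_BRIDGE;
    }
    return PATH_PROTOCOLS;
}
```

Without the flag, a reinjected packet looks like it was received on bridge0 itself, so bridge_input() never sees it again and L2 forwarding is skipped; with the flag, the second pass through the bridge restores the forwarding behaviour.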

> Yes, I've seen that IFF_PROMISC is not handled by if_bridge right now...
> This could allow us to ignore the problem for the time being and pass them all to netmap. But handling IFF_PROMISC properly would be the more reasonable approach.

I implemented this - now only locally destined packets are visible to netmap by default.

To be honest, I'm still not entirely convinced that this makes sense. If if_bridge natively required IFF_PROMISC to be set in order to perform L2 forwarding, then I would certainly agree. But now IFF_PROMISC has a special meaning for netmap mode.

> Fix a problem with forwarding in netmap mode: when the application
> writes packets to the host ring, the packets are injected via the
> bridge's if_input to ether_input(). But this means that the bridge
> interface won't see them again and in particular won't perform L2
> forwarding.

> Fix the problem by interposing bridge_inject() between if_input and
> ether_input(). In netmap mode, bridge_inject() flags the packet
> such that ether_input() will bring it back to the bridge, which can
> then select a fake source port based on the source L2 addr. This
> then allows forwarding to work.

I'm sorry, I don't want to slow down your work, but I do not understand why this complication is needed...

The purpose of the host stack is to allow for some subset of the traffic (e.g. ssh traffic on interface em0) to keep going to the kernel stack (e.g. the sshd socket), although all the traffic is intercepted by netmap.
So you have a packet ready in your hw RX ring of em0, you look at that and you decide that it should go ahead like netmap never intercepted it; so you write it to the em0 host (sw) TX ring. Netmap will process it by converting it into an mbuf and calling if_input on em0, so that the packet will appear on em0 as if netmap did not exist.

So IMHO the bridge0 should behave the same. On the "hw" RX ring of bridge0 you will receive all the locally destined packets (or all of them in promiscuous mode). If you want some packets to go ahead like netmap never intercepted them, you would write them to the bridge0 host TX ring. Netmap will call the if_input method of bridge0, and I don't see why we should forward the packet across the bridge...

> I implemented this - now only locally destined packets are visible to netmap by default.

> To be honest, I'm still not entirely convinced that this makes sense. If if_bridge natively required IFF_PROMISC to be set in order to perform L2 forwarding, then I would certainly agree. But now IFF_PROMISC has a special meaning for netmap mode.

Ok, let me try to clarify further my point of view. The meaning of "opening an interface in netmap mode" is to steal and detach that interface from the kernel stack, so that netmap will see all the RX traffic arriving on the interface, and the kernel stack will see nothing. When the kernel stack transmits on the interface, packets end up into the host RX ring, where it is up to the netmap application to process them (in any way, maybe dropping).

Now, to understand what we should do in the bridge(4) case, it may help to think about the physical equivalent of bridge(4). Let's say I have a bridge0 with members tap1, tap2 and em0.
Everything should behave as if the bridge were a physical switch, external to the host machine, where tap1, tap2 and em0 are physical ports of the switch (so external to the host). Likely, tap1 and tap2 would each be the end of a point-to-point physical link towards the vtnet0 "physical" interface of a bhyve/qemu "physical" VM. The em0 is physical for real, so nothing special here. Finally, bridge0 would be an L3-capable physical port of the switch, but at the same time bridge0 would be somehow attached to the PCI bus of the host machine under consideration, so that the bridge0 driver runs on the host machine. The `bridge0` interface allows the host to connect to the physical switch to transmit/receive traffic.
Now, we should think about what happens if I run a non-netmap application on bridge0, and make sure the same logic applies in the netmap case; there should be no differences. For instance, if the switch receives a packet on tap1 and the fwd table says it should go to em0, the switch will forward the packet and (if no promisc mode) bridge0 won't see the packet, and so pkt-gen -f rx on netmap:bridge0 won't see it. If a netcat client runs on bridge0 talking to a remote server (reachable through tap2), that traffic will be forwarded by the switch between bridge0 and tap2, so a netmap application sending/receiving on netmap:bridge0 towards/from a remote server will cause the same forwarding behaviour.

Regarding the IFF_PROMISC case, I think it is up to us to decide whether we want bridge0 to see the non-local packets (i.e. the ones forwarded between tap1 and em0) or not. IOW, bridge0 could or could not behave like a sort of SPAN port. However, I think that if we want bridge0 to see any packets passing through the physical switch, that's fine, but then it should behave like a SPAN port and send a copy of non-local packets to bridge0, rather than stealing them, so that they keep being forwarded. If you think about the physical equivalence, it does not make sense that putting bridge0 in promisc mode should make the physical switch stop forwarding packets! Once again, this behavior should be the same in both the non-netmap and netmap cases. The only difference would be that in netmap mode bridge0 is detached from the host stack, but in that regard bridge0 would behave like any real physical interface.

Sorry for the long post, but IMHO the implementation could be simpler: local packets go to the bridge0 if_input (or hw RX ring in the netmap case); in promisc mode a copy of non-local packets can go to bridge0 if_input (or hw RX ring in the netmap case), whereas in non-promisc mode non-local packets just follow their path (and don't go to bridge0). Note that there is no difference between the netmap and non-netmap cases.
Any thoughts?

> I'm sorry, I don't want to slow down your work, but I do not understand why this complication is needed...

No problem at all, thank you for reviewing.

> The purpose of the host stack is to allow for some subset of the traffic (e.g. ssh traffic on interface em0) to keep going to the kernel stack (e.g. the sshd socket), although all the traffic is intercepted by netmap.
> So you have a packet ready in your hw RX ring of em0, you look at that and you decide that it should go ahead like netmap never intercepted it; so you write it to the em0 host (sw) TX ring. Netmap will process it by converting it into an mbuf and calling if_input on em0, so that the packet will appear on em0 as if netmap did not exist.

> So IMHO the bridge0 should behave the same. On the "hw" RX ring of bridge0 you will receive all the locally destined packets (or all of them in promiscuous mode). If you want some packets to go ahead like netmap never intercepted them, you would write them to the bridge0 host TX ring. Netmap will call the if_input method of bridge0, and I don't see why we should forward the packet across the bridge...

Suppose netmap intercepts a packet that would be forwarded from one bridge port to another, say em0 and em1. The application (a firewall, perhaps) decides it wants to allow the packet through, so it writes the packet to bridge0's host TX ring. netmap will call if_input of bridge0, but now the packet's receiving interface is bridge0, not em0. Thus, bridge_input() does not see the packet again, and the packet is sent to the protocol layers instead of being forwarded.

So, to make sure that the packet goes ahead like netmap never intercepted it, we need some special handling to make sure that the packet goes back to bridge_input(). There, we take advantage of the fact that the bridge already saw the packet once and learned the src MAC address, so it can decide which input port to use. The basic problem is that some information, i.e., the receiving interface, is lost when the packet is intercepted by netmap and reinjected into the host stack. Perhaps there is a more elegant way to handle this, but I don't see how.

> I implemented this - now only locally destined packets are visible to netmap by default.

> To be honest, I'm still not entirely convinced that this makes sense. If if_bridge natively required IFF_PROMISC to be set in order to perform L2 forwarding, then I would certainly agree. But now IFF_PROMISC has a special meaning for netmap mode.

> Ok, let me try to clarify further my point of view. The meaning of "opening an interface in netmap mode" is to steal and detach that interface from the kernel stack, so that netmap will see all the RX traffic arriving on the interface, and the kernel stack will see nothing. When the kernel stack transmits on the interface, packets end up into the host RX ring, where it is up to the netmap application to process them (in any way, maybe dropping).

I'm with you so far.

> Now, to understand what we should do in the bridge(4) case, it may help to think about the physical equivalent of bridge(4). Let's say I have a bridge0 with members tap1, tap2 and em0.
> Everything should behave as if the bridge were a physical switch, external to the host machine, where tap1, tap2 and em0 are physical ports of the switch (so external to the host). Likely, tap1 and tap2 would each be the end of a point-to-point physical link towards the vtnet0 "physical" interface of a bhyve/qemu "physical" VM. The em0 is physical for real, so nothing special here. Finally, bridge0 would be an L3-capable physical port of the switch, but at the same time bridge0 would be somehow attached to the PCI bus of the host machine under consideration, so that the bridge0 driver runs on the host machine. The `bridge0` interface allows the host to connect to the physical switch to transmit/receive traffic.

So really you are referring to bridge0 as this fake port on a switch, and L2 forwarding is a separate function of the switch, not of bridge0.

Then, in netmap's model, it doesn't really make sense for netmap to see non-local forwarded packets. But this makes netmap+if_bridge less useful. It also just seems like a surprising behaviour to me as a non-expert: above you wrote, 'the meaning of "opening an interface in netmap mode" is to steal and detach that interface from the kernel stack, so that netmap will see all the RX traffic arriving on the interface', and I would consider non-local packets arriving at bridge0 as "RX traffic arriving on the interface", so why shouldn't they be intercepted by netmap?

I understand that in your description of the physical bridge0, this makes sense, but I think it's surprising and limiting behaviour.

> Now, we should think about what happens if I run a non-netmap application on bridge0, and make sure the same logic applies in the netmap case; there should be no differences. For instance, if the switch receives a packet on tap1 and the fwd table says it should go to em0, the switch will forward the packet and (if no promisc mode) bridge0 won't see the packet, and so pkt-gen -f rx on netmap:bridge0 won't see it. If a netcat client runs on bridge0 talking to a remote server (reachable through tap2), that traffic will be forwarded by the switch between bridge0 and tap2, so a netmap application sending/receiving on netmap:bridge0 towards/from a remote server will cause the same forwarding behaviour.

> Regarding the IFF_PROMISC case, I think it is up to us to decide whether we want bridge0 to see the non-local packets (i.e. the ones forwarded between tap1 and em0) or not. IOW, bridge0 could or could not behave like a sort of SPAN port. However, I think that if we want bridge0 to see any packets passing through the physical switch, that's fine, but then it should behave like a SPAN port and send a copy of non-local packets to bridge0, rather than stealing them, so that they keep being forwarded.

Suppose I wanted to implement a userspace firewall using netmap. The behaviour you describe makes it impossible to filter non-local packets, since they get forwarded no matter what policy the netmap application implements.

> If you think about the physical equivalence, it does not make sense that putting bridge0 in promisc mode should make the physical switch stop forwarding packets! Once again, this behavior should be the same in both the non-netmap and netmap cases. The only difference would be that in netmap mode bridge0 is detached from the host stack, but in that regard bridge0 would behave like any real physical interface.

To me this is a signal that we should not attach any special meaning to IFF_PROMISC. Suppose I use tools/tools/netmap/bridge.c between the bridge0 netmap port and host rings. I should expect the interface to behave the same as it would if netmap were disabled, right?

> Sorry for the long post, but IMHO the implementation could be simpler: local packets go to the bridge0 if_input (or hw RX ring in the netmap case); in promisc mode a copy of non-local packets can go to bridge0 if_input (or hw RX ring in the netmap case), whereas in non-promisc mode non-local packets just follow their path (and don't go to bridge0). Note that there is no difference between the netmap and non-netmap cases.
> Any thoughts?

As I suggested above, this behaviour makes it impossible to filter non-local packets. I'm not sure whether you consider this to be a real problem, but it is the motivating use-case for this patch.

And if we copy forwarded packets, then whether or not there's a difference depends on what the netmap application does with those copies. If the application is just bridging the netmap and host ports of bridge0, then your proposed behaviour will cause all non-local packets to 1) be forwarded by the kernel, and 2) be reinjected back into the host stack and handled by L3 protocol layers or dropped. So, I believe that behaving like a SPAN port isn't the right option. It seems that netmap+if_bridge must either intercept all non-local packets, or do nothing at all with non-local packets.

If I remove special handling of IFF_PROMISC from this patch, and run bridge -i netmap:bridge0 -i netmap:bridge0^, then I believe bridge0 behaves the same as without netmap enabled, which is why I prefer not to have this special IFF_PROMISC handling. Then the question boils down to, do we intercept non-local packets or not?

> I'm sorry, I don't want to slow down your work, but I do not understand why this complication is needed...

> No problem at all, thank you for reviewing.

> The purpose of the host stack is to allow for some subset of the traffic (e.g. ssh traffic on interface em0) to keep going to the kernel stack (e.g. the sshd socket), although all the traffic is intercepted by netmap.
> So you have a packet ready in your hw RX ring of em0, you look at that and you decide that it should go ahead like netmap never intercepted it; so you write it to the em0 host (sw) TX ring. Netmap will process it by converting it into an mbuf and calling if_input on em0, so that the packet will appear on em0 as if netmap did not exist.

> So IMHO the bridge0 should behave the same. On the "hw" RX ring of bridge0 you will receive all the locally destined packets (or all of them in promiscuous mode). If you want some packets to go ahead like netmap never intercepted them, you would write them to the bridge0 host TX ring. Netmap will call the if_input method of bridge0, and I don't see why we should forward the packet across the bridge...

> Suppose netmap intercepts a packet that would be forwarded from one bridge port to another, say em0 and em1. The application (a firewall, perhaps) decides it wants to allow the packet through, so it writes the packet to bridge0's host TX ring. netmap will call if_input of bridge0, but now the packet's receiving interface is bridge0, not em0. Thus, bridge_input() does not see the packet again, and the packet is sent to the protocol layers instead of being forwarded.

Now I start to see the source of mutual misunderstanding. See below.

> So, to make sure that the packet goes ahead like netmap never intercepted it, we need some special handling to make sure that the packet goes back to bridge_input(). There, we take advantage of the fact that the bridge already saw the packet once and learned the src MAC address, so it can decide which input port to use. The basic problem is that some information, i.e., the receiving interface, is lost when the packet is intercepted by netmap and reinjected into the host stack. Perhaps there is a more elegant way to handle this, but I don't see how.

> I implemented this - now only locally destined packets are visible to netmap by default.

> To be honest, I'm still not entirely convinced that this makes sense. If if_bridge natively required IFF_PROMISC to be set in order to perform L2 forwarding, then I would certainly agree. But now IFF_PROMISC has a special meaning for netmap mode.

> Ok, let me try to clarify further my point of view. The meaning of "opening an interface in netmap mode" is to steal and detach that interface from the kernel stack, so that netmap will see all the RX traffic arriving on the interface, and the kernel stack will see nothing. When the kernel stack transmits on the interface, packets end up into the host RX ring, where it is up to the netmap application to process them (in any way, maybe dropping).

> I'm with you so far.

> Now, to understand what we should do in the bridge(4) case, it may help to think about the physical equivalent of bridge(4). Let's say I have a bridge0 with members tap1, tap2 and em0.
> Everything should behave as if the bridge were a physical switch, external to the host machine, where tap1, tap2 and em0 are physical ports of the switch (so external to the host). Likely, tap1 and tap2 would each be the end of a point-to-point physical link towards the vtnet0 "physical" interface of a bhyve/qemu "physical" VM. The em0 is physical for real, so nothing special here. Finally, bridge0 would be an L3-capable physical port of the switch, but at the same time bridge0 would be somehow attached to the PCI bus of the host machine under consideration, so that the bridge0 driver runs on the host machine. The `bridge0` interface allows the host to connect to the physical switch to transmit/receive traffic.

> So really you are referring to bridge0 as this fake port on a switch, and L2 forwarding is a separate function of the switch, not of bridge0.

Yes, exactly. Sorry, I was taking it for granted. But I think that's what it is, and I think it is true irrespective of netmap.

> Then, in netmap's model, it doesn't really make sense for netmap to see non-local forwarded packets. But this makes netmap+if_bridge less useful. It also just seems like a surprising behaviour to me as a non-expert: above you wrote, 'the meaning of "opening an interface in netmap mode" is to steal and detach that interface from the kernel stack, so that netmap will see all the RX traffic arriving on the interface', and I would consider non-local packets arriving at bridge0 as "RX traffic arriving on the interface", so why shouldn't they be intercepted by netmap?

Non-local packets are not arriving at bridge0, IMHO, because they are meant to be forwarded, rather than be received by the host. That's why they should not be intercepted by netmap.
I don't think this is related to netmap. Netmap is just a different API to access the ifnet (alternative to raw sockets, for example), but it should not change the behaviour of an interface.
For example, what happens if you bind() to the IP address of bridge0, port TCP/5555? You'll only see local traffic, not forwarded traffic that happens to have TCP port 5555.
In your firewall example above, suppose there is no netmap. Would you be able to implement a firewall by only bind()ing one or more raw sockets to bridge0? I don't think so, and then it makes sense that you won't be able to implement a firewall by just opening bridge0 in netmap mode.

> I understand that in your description of the physical bridge0, this makes sense, but I think it's surprising and limiting behaviour.

> Now, we should think about what happens if I run a non-netmap application on bridge0, and make sure the same logic applies in the netmap case; there should be no differences. For instance, if the switch receives a packet on tap1 and the fwd table says it should go to em0, the switch will forward the packet and (if no promisc mode) bridge0 won't see the packet, and so pkt-gen -f rx on netmap:bridge0 won't see it. If a netcat client runs on bridge0 talking to a remote server (reachable through tap2), that traffic will be forwarded by the switch between bridge0 and tap2, so a netmap application sending/receiving on netmap:bridge0 towards/from a remote server will cause the same forwarding behaviour.

> Regarding the IFF_PROMISC case, I think it is up to us to decide whether we want bridge0 to see the non-local packets (i.e. the ones forwarded between tap1 and em0) or not. IOW, bridge0 could or could not behave like a sort of SPAN port. However, I think that if we want bridge0 to see any packets passing through the physical switch, that's fine, but then it should behave like a SPAN port and send a copy of non-local packets to bridge0, rather than stealing them, so that they keep being forwarded.

> Suppose I wanted to implement a userspace firewall using netmap. The behaviour you describe makes it impossible to filter non-local packets, since they get forwarded no matter what policy the netmap application implements.

Why is that limiting? A netmap firewall, router, switch, or any other middlebox is supposed to be implemented by opening all the involved interfaces in netmap mode, and implementing the middlebox logic entirely in userspace. For example, if you want to implement a router between em0, tap1 and tap2, you will open netmap:em0, netmap:tap1 and netmap:tap2, and route packets by moving/copying them between the RX rings and TX rings of the three interfaces. The tools/tools/netmap/bridge application is an example L2 bridge supporting two interfaces, e.g., ./bridge -i netmap:em0 -i netmap:em1.
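The middlebox pattern described here can be sketched with a toy model. Real code would use the netmap API from <net/netmap_user.h> (nm_open() and the ring macros); the plain-array rings below are stand-ins so the control flow is self-contained.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy model of a userspace netmap middlebox: each interface has an
 * RX and a TX ring, and the application moves packets between them,
 * applying its policy (filtering, routing) entirely in userspace.
 */
#define RING_SLOTS 8

struct ring {
    int head, tail;             /* consume at head, produce at tail */
    int slots[RING_SLOTS];      /* packet ids stand in for buffers */
};

struct port {
    struct ring rx, tx;         /* one netmap-mode interface */
};

static int ring_empty(const struct ring *r) { return r->head == r->tail; }

static void ring_put(struct ring *r, int pkt)
{
    r->slots[r->tail] = pkt;
    r->tail = (r->tail + 1) % RING_SLOTS;
}

static int ring_get(struct ring *r)
{
    int pkt = r->slots[r->head];
    r->head = (r->head + 1) % RING_SLOTS;
    return pkt;
}

/* Move everything received on 'in' to the TX ring of 'out'; a
 * firewall would drop packets here instead of forwarding them. */
int forward_all(struct port *in, struct port *out)
{
    int moved = 0;
    while (!ring_empty(&in->rx)) {
        ring_put(&out->tx, ring_get(&in->rx));
        moved++;
    }
    return moved;
}
```

The point being made is that filtering logic lives in this loop, in the application, rather than in the kernel's if_bridge(4) code.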

> If you think about the physical equivalence, it does not make sense that putting bridge0 in promisc mode should make the physical switch stop forwarding packets! Once again, this behavior should be the same in both the non-netmap and netmap cases. The only difference would be that in netmap mode bridge0 is detached from the host stack, but in that regard bridge0 would behave like any real physical interface.

> To me this is a signal that we should not attach any special meaning to IFF_PROMISC. Suppose I use tools/tools/netmap/bridge.c between the bridge0 netmap port and host rings. I should expect the interface to behave the same as it would if netmap were disabled, right?

Yes, indeed, I also think we should not. But I thought you were interested in attaching a special meaning.
Yes, bridging between netmap:eth0 and netmap:eth0^ is supposed to behave as if netmap were disabled (assuming no offloads). But that's just a special case! The general use of the bridge program is to bridge two different interfaces (e.g. em0 and em1).

Sorry for the long post, but IMHO the implementation could be simpler: local packets go to the bridge0 if_input (or hw RX ring in case of netmap); in promisc mode a copy of non-local packets can go to bridge0 if_input (or hw RX ring in case of netmap), whereas in non-promisc mode non-local packets just follow their path (and don't go to bridge0). Note that there is no difference between the netmap and non-netmap cases.
Any thoughts?

As I suggested above, this behaviour makes it impossible to do filtering of non-local packets. I'm not sure if you consider this to be a real problem or not, this is just the motivating use-case for this patch.

Filtering packets between interfaces is of course a real problem. However, that should not be done by trying to reuse the if_bridge(4) kernel code and attach a special meaning to opening bridge0 in netmap mode. Rather, filtering should be implemented by a userspace netmap application that opens all the interfaces that are involved in filtering and forwarding.

And if we copy forwarded packets, then whether or not there's a difference depends on what the netmap application does with those copies. If the application is just bridging the netmap and host ports of bridge0, then your proposed behaviour will cause all non-local packets to 1) be forwarded by the kernel, and 2) be reinjected back into the host stack and handled by L3 protocol layers or dropped. So, I believe that behaving like a SPAN port isn't the right option. It seems that netmap+if_bridge must either intercept all non-local packets, or do nothing at all with non-local packets.

Yeah, all of this discussion about the SPAN port started from my misunderstanding of your view.

If I remove special handling of IFF_PROMISC from this patch, and run bridge -i netmap:bridge0 -i netmap:bridge0^, then I believe bridge0 behaves the same as without netmap enabled, which is why I prefer not to have this special IFF_PROMISC handling. Then the question boils down to, do we intercept non-local packets or not?

Yes, I also do not like the idea of giving special meaning to IFF_PROMISC. Again, a guiding principle should be that netmap should be just a hopefully faster alternative to raw sockets, but it should not change the behaviour of an interface. We should not give a special meaning to bridge0 when it is open in netmap mode vs when it is not.

That's why my opinion is that we should not intercept non-local packets, because bridge0 does not receive them. If the final goal is to implement middleboxes, that should not be done trying to reuse if_bridge(4)...

Then, in netmap's model, it doesn't really make sense for netmap to see non-local forwarded packets. But this makes netmap+if_bridge less useful. It also just seems like a surprising behaviour to me as a non-expert: above you wrote, 'the meaning of "opening an interface in netmap mode" is to steal and detach that interface from the kernel stack, so that netmap will see all the RX traffic arriving on the interface', and I would consider non-local packets arriving at bridge0 as "RX traffic arriving on the interface", so why shouldn't they be intercepted by netmap?

Non-local packets are not arriving at bridge0, IMHO, because they are meant to be forwarded, rather than be received by the host. That's why they should not be intercepted by netmap.
I don't think this is related to netmap. Netmap is just a different API to access the ifnet (alternative to raw sockets, for example), but it should not change the behaviour of an interface.

From an ifnet perspective, if_bridge is already special. There is a unique ifnet hook, if_bridge_input, by which it receives packets. The if_input hook of a bridge interface is not used.

For example, what happens if you bind() to the IP address of bridge0, port TCP/5555? You'll only see local traffic, not forwarded traffic that happens to have TCP port 5555.

Well, yes, but raw sockets are implemented purely by the L3 protocol layers. I don't think this is a very convincing example to be honest. If I run tcpdump on bridge0, then I'll see both local and non-local traffic. Is this an argument in favour of my approach? bridge_forward() also (optionally) runs pfil hooks so that kernel firewalls can apply policy to forwarded packets.

Suppose I have a physical port in promiscuous mode, opened by netmap. Should netmap only see packets with dst MAC address equal to the port's address? Right now it will see both local and L2-forwarded traffic.

In your firewall example above, suppose there is no netmap. Would you be able to implement a firewall by only bind()ing one or more raw sockets to bridge0? I don't think so, and then it makes sense that you won't be able to implement a firewall by just opening bridge0 in netmap mode.

No, but I don't see how you can implement a useful firewall with raw_ip at all, independent of bridge0.

I understand that in your description of the physical bridge0, this makes sense, but I think it's surprising and limiting behaviour.

Now, we should think about what happens if I run a non-netmap application on bridge0, and make sure the same logic applies in the netmap case; there should be no differences. For instance, if the switch receives a packet on tap1 and the fwd table says it should go to em0, the switch will forward the packet and (if no promisc mode) bridge0 won't see the packet, and so pkt-gen -f rx on netmap:bridge0 won't see it. If a netcat client runs on bridge0 talking to a remote server (reachable through tap2), that traffic will be forwarded by the switch between bridge0 and tap2, so a netmap application sending/receiving on netmap:bridge0 towards/from a remote server will cause the same forwarding behaviour.

Regarding the IFF_PROMISC case, I think it is up to us to decide whether we want bridge0 to see the non-local packets (i.e. the ones forwarded between tap1 and em0) or not. IOW, bridge0 could or could not behave like a sort of SPAN port. However, I think that if we want bridge0 to see any packets passing through the physical switch, that's fine, but then it should behave like a SPAN port and send a copy of non-local packets to bridge0, rather than stealing them, so that they keep being forwarded.

Suppose I wanted to implement a userspace firewall using netmap. The behaviour you describe makes it impossible to filter non-local packets, since they get forwarded no matter what policy the netmap application implements.

Why is that limiting? A netmap firewall, router, switch, or any other middlebox is supposed to be implemented by opening all the involved interfaces in netmap mode, and implementing the middlebox logic entirely in userspace. For example, if you want to implement a router between em0, tap1 and tap2, you will open netmap:em0, netmap:tap1 and netmap:tap2, and route packets by moving/copying them between the RX rings and TX rings of the three interfaces. The tools/tools/netmap/bridge application is an example L2 bridge supporting two interfaces, e.g., ./bridge -i netmap:em0 -i netmap:em1

But these aren't the same thing. Suppose I want to start a new VM and create tap3 for that purpose, and add it to bridge0. Now my running netmap application needs to somehow discover this and open the port, manage extra rings, etc.. This is possible, of course, but it's a bad user interface.

If I want to apply some policy to packets going through bridge0, then it would be best if the application can simply open bridge0 in netmap mode.

If you think about the physical equivalent, it does not make sense that if the host puts bridge0 in promisc mode then the physical switch stops forwarding packets! Once again, this behavior should be the same for both the non-netmap and netmap cases. The only difference would be that in netmap mode bridge0 is detached from the host stack, but in that regard bridge0 would behave like any real physical interface.

To me this is a signal that we should not attach any special meaning to IFF_PROMISC. Suppose I use tools/tools/netmap/bridge.c between the bridge0 netmap port and host rings. I should expect the interface to behave the same as it would if netmap were disabled, right?

Yes, indeed, I also think we should not. But I thought you were interested in attaching a special meaning.
Yes, bridging between netmap:eth0 and netmap:eth0^ is supposed to behave as if netmap were disabled (assuming no offloads). But that's just a special case! The general use of the bridge program is to bridge two different interfaces (e.g. em0 and em1).

Sure. I use this example to address your comments that the interface must behave the same as before when it's in netmap mode. To me, this means, "if I run bridge -i netmap:eth0 -i netmap:eth0^ then I should not observe any difference in the system behaviour."

Sorry for the long post, but IMHO the implementation could be simpler: local packets go to the bridge0 if_input (or hw RX ring in case of netmap); in promisc mode a copy of non-local packets can go to bridge0 if_input (or hw RX ring in case of netmap), whereas in non-promisc mode non-local packets just follow their path (and don't go to bridge0). Note that there is no difference between the netmap and non-netmap cases.
Any thoughts?

As I suggested above, this behaviour makes it impossible to do filtering of non-local packets. I'm not sure if you consider this to be a real problem or not, this is just the motivating use-case for this patch.

Filtering packets between interfaces is of course a real problem. However, that should not be done by trying to reuse the if_bridge(4) kernel code and attach a special meaning to opening bridge0 in netmap mode. Rather, filtering should be implemented by a userspace netmap application that opens all the interfaces that are involved in filtering and forwarding.

This is possible of course, but it simply makes netmap less useful as I tried to illustrate above. The netmap application might not have any particular knowledge of the system's networking configuration, so it's very useful to be able to say, "open bridge0 and do something with the packets that go through it." An application which wants to ignore forwarded packets can simply pass them back through to the host stack, and with this patch they will be forwarded as if netmap didn't intercept them.

If I remove special handling of IFF_PROMISC from this patch, and run bridge -i netmap:bridge0 -i netmap:bridge0^, then I believe bridge0 behaves the same as without netmap enabled, which is why I prefer not to have this special IFF_PROMISC handling. Then the question boils down to, do we intercept non-local packets or not?

Yes, I also do not like the idea of giving special meaning to IFF_PROMISC.

We agree on something at least. :)

Again, a guiding principle should be that netmap should be just a hopefully faster alternative to raw sockets,

Sorry, I don't understand this at all. What kind of raw sockets? Something other than PF_INET/INET6?

but it should not change the behaviour of an interface. We should not give a special meaning to bridge0 when it is open in netmap mode vs when it is not.
That's why my opinion is that we should not intercept non-local packets, because bridge0 does not receive them.

I understand that in your physical model, bridge0 does not receive non-local packets, but then if_bridge's BPF and pfil integration do not behave according to your model either.

From my POV, this patch really does not give any special behaviour to if_bridge when in netmap mode. Every packet that comes in to bridge_input() goes to netmap. Every packet that comes out of netmap:bridge0^ goes to the host stack. It behaves consistently with respect to other networking features. Yes the patch is a bit complicated, but that's only to provide consistent behaviour with non-netmap mode.

If the final goal is to implement middleboxes, that should not be done trying to reuse if_bridge(4)...

The point is to allow netmap applications to be used more flexibly, in deployments where the application author cannot control the network configuration.

Then, in netmap's model, it doesn't really make sense for netmap to see non-local forwarded packets. But this makes netmap+if_bridge less useful. It also just seems like a surprising behaviour to me as a non-expert: above you wrote, 'the meaning of "opening an interface in netmap mode" is to steal and detach that interface from the kernel stack, so that netmap will see all the RX traffic arriving on the interface', and I would consider non-local packets arriving at bridge0 as "RX traffic arriving on the interface", so why shouldn't they be intercepted by netmap?

Non-local packets are not arriving at bridge0, IMHO, because they are meant to be forwarded, rather than be received by the host. That's why they should not be intercepted by netmap.
I don't think this is related to netmap. Netmap is just a different API to access the ifnet (alternative to raw sockets, for example), but it should not change the behaviour of an interface.

Sorry for the delayed answer.

From an ifnet perspective, if_bridge is already special. There is a unique ifnet hook, if_bridge_input, by which it receives packets. The if_input hook of a bridge interface is not used.

Just out of curiosity: isn't the if_input bridge0 hook used at line 2542 of if_bridge.c (multicast and broadcast)?
I'm trying to understand here... how is it possible that bridge0 if_input is not used? How can local packets reach the protocol stack for bridge0 (e.g. TCP/UDP sockets bound to bridge0) if not by means of if_input?

For example, what happens if you bind() to the IP address of bridge0, port TCP/5555? You'll only see local traffic, not forwarded traffic that happens to have TCP port 5555.

Well, yes, but raw sockets are implemented purely by the L3 protocol layers. I don't think this is a very convincing example to be honest. If I run tcpdump on bridge0, then I'll see both local and non-local traffic. Is this an argument in favour of my approach? bridge_forward() also (optionally) runs pfil hooks so that kernel firewalls can apply policy to forwarded packets.

I realized just now that bpf(4) does on FreeBSD what I thought could be done with raw sockets (like in other OSs). Sorry for that, I generated a lot of confusion.
But tcpdump sets IFF_PROMISC, and that's why you see non-local packets! That's independent of the specific OS API that you use to access the ifnet traffic. I'm not sure about bpf(4), but when you open an interface in netmap mode you don't automatically set IFF_PROMISC: that is up to the userspace applications.
Regarding pfil hooks... sure if_bridge is special in that sense, but netmap is not something akin to pfil. I'd say it is more similar to bpf(4) as an API that can inject and receive packets directly at the Ethernet level, but with no receive filters.

Suppose I have a physical port in promiscuous mode, opened by netmap. Should netmap only see packets with dst MAC address equal to the port's address? Right now it will see both local and L2-forwarded traffic.

Absolutely, netmap (like bpf) will see anything that the physical port receives in its physical RX rings. If IFF_PROMISC is set, it will also see non-local traffic. If MCAST filters are enabled in the hardware, it will see multicast traffic. Netmap does not specify any policy about what RX packets should or should not be seen, it is just a different API to access what is available in the ifnet RX rings (and ofc to access TX rings for transmissions).

In your firewall example above, suppose there is no netmap. Would you be able to implement a firewall by only bind()ing one or more raw sockets to bridge0? I don't think so, and then it makes sense that you won't be able to implement a firewall by just opening bridge0 in netmap mode.

No, but I don't see how you can implement a useful firewall with raw_ip at all, independent of bridge0.

Sorry for the confusion, I should have said bpf(4). I believe with bpf(4) you could implement an userspace firewall, because you can receive and send raw Ethernet packets from multiple network interfaces, forwarding and filtering as needed.

I understand that in your description of the physical bridge0, this makes sense, but I think it's surprising and limiting behaviour.

Now, we should think about what happens if I run a non-netmap application on bridge0, and make sure the same logic applies in the netmap case; there should be no differences. For instance, if the switch receives a packet on tap1 and the fwd table says it should go to em0, the switch will forward the packet and (if no promisc mode) bridge0 won't see the packet, and so pkt-gen -f rx on netmap:bridge0 won't see it. If a netcat client runs on bridge0 talking to a remote server (reachable through tap2), that traffic will be forwarded by the switch between bridge0 and tap2, so a netmap application sending/receiving on netmap:bridge0 towards/from a remote server will cause the same forwarding behaviour.

Regarding the IFF_PROMISC case, I think it is up to us to decide whether we want bridge0 to see the non-local packets (i.e. the ones forwarded between tap1 and em0) or not. IOW, bridge0 could or could not behave like a sort of SPAN port. However, I think that if we want bridge0 to see any packets passing through the physical switch, that's fine, but then it should behave like a SPAN port and send a copy of non-local packets to bridge0, rather than stealing them, so that they keep being forwarded.

Suppose I wanted to implement a userspace firewall using netmap. The behaviour you describe makes it impossible to filter non-local packets, since they get forwarded no matter what policy the netmap application implements.

Why is that limiting? A netmap firewall, router, switch, or any other middlebox is supposed to be implemented by opening all the involved interfaces in netmap mode, and implementing the middlebox logic entirely in userspace. For example, if you want to implement a router between em0, tap1 and tap2, you will open netmap:em0, netmap:tap1 and netmap:tap2, and route packets by moving/copying them between the RX rings and TX rings of the three interfaces. The tools/tools/netmap/bridge application is an example L2 bridge supporting two interfaces, e.g., ./bridge -i netmap:em0 -i netmap:em1

But these aren't the same thing. Suppose I want to start a new VM and create tap3 for that purpose, and add it to bridge0. Now my running netmap application needs to somehow discover this and open the port, manage extra rings, etc.. This is possible, of course, but it's a bad user interface.

Wait, in the plain middlebox implementation approach above, there would not be any bridge0. The middlebox functionality would be implemented entirely in userspace, including some specific tool and/or API to add/remove interfaces. An example of this approach is Linux vhost-user (https://www.redhat.com/en/blog/how-vhost-user-came-being-virtio-networking-and-dpdk).
As another example, there is FreeBSD vale(4), where the datapath actually runs in the kernel, but otherwise the concept is the same (and you have the valectl tool to add/remove interfaces). There is no bridge0.
I do not see why this is a bad user interface...
Also, with this approach you can actually leverage the native netmap support of physical network interfaces (iflib, vtnet), which gives you a real speed-up, because no mbufs are involved (ofc you would need to get rid of if_tap(4) and implement a more efficient interface like vhost-user).
With the bridge0 approach you propose, you are forced to use emulated netmap, so performance will never be great.

If I want to apply some policy to packets going through bridge0, then it would be best if the application can simply open bridge0 in netmap mode.

See comments at the end.

If you think about the physical equivalent, it does not make sense that if the host puts bridge0 in promisc mode then the physical switch stops forwarding packets! Once again, this behavior should be the same for both the non-netmap and netmap cases. The only difference would be that in netmap mode bridge0 is detached from the host stack, but in that regard bridge0 would behave like any real physical interface.

To me this is a signal that we should not attach any special meaning to IFF_PROMISC. Suppose I use tools/tools/netmap/bridge.c between the bridge0 netmap port and host rings. I should expect the interface to behave the same as it would if netmap were disabled, right?

Yes, indeed, I also think we should not. But I thought you were interested in attaching a special meaning.
Yes, bridging between netmap:eth0 and netmap:eth0^ is supposed to behave as if netmap were disabled (assuming no offloads). But that's just a special case! The general use of the bridge program is to bridge two different interfaces (e.g. em0 and em1).

Sure. I use this example to address your comments that the interface must behave the same as before when it's in netmap mode. To me, this means, "if I run bridge -i netmap:eth0 -i netmap:eth0^ then I should not observe any difference in the system behaviour."

Sorry for the long post, but IMHO the implementation could be simpler: local packets go to the bridge0 if_input (or hw RX ring in case of netmap); in promisc mode a copy of non-local packets can go to bridge0 if_input (or hw RX ring in case of netmap), whereas in non-promisc mode non-local packets just follow their path (and don't go to bridge0). Note that there is no difference between the netmap and non-netmap cases.
Any thoughts?

As I suggested above, this behaviour makes it impossible to do filtering of non-local packets. I'm not sure if you consider this to be a real problem or not, this is just the motivating use-case for this patch.

Filtering packets between interfaces is of course a real problem. However, that should not be done by trying to reuse the if_bridge(4) kernel code and attach a special meaning to opening bridge0 in netmap mode. Rather, filtering should be implemented by a userspace netmap application that opens all the interfaces that are involved in filtering and forwarding.

This is possible of course, but it simply makes netmap less useful as I tried to illustrate above. The netmap application might not have any particular knowledge of the system's networking configuration, so it's very useful to be able to say, "open bridge0 and do something with the packets that go through it." An application which wants to ignore forwarded packets can simply pass them back through to the host stack, and with this patch they will be forwarded as if netmap didn't intercept them.

Yes, but please note that we are not on the same page here because you assume there must be a bridge0 in your VM deployment (bhyve host, I guess). As I tried to explain above, the intended approach (for userspace networking frameworks such as netmap) would be to get rid of the kernel and implement the whole datapath and control path in userspace. (Please note this is not something I invented myself, ofc, but it is a standard approach for userspace networking).
However, I do understand this is complex (if you do not have enough components yet), and I see why reusing if_bridge(4) looks appealing. More on this below.

If I remove special handling of IFF_PROMISC from this patch, and run bridge -i netmap:bridge0 -i netmap:bridge0^, then I believe bridge0 behaves the same as without netmap enabled, which is why I prefer not to have this special IFF_PROMISC handling. Then the question boils down to, do we intercept non-local packets or not?

Yes, I also do not like the idea of giving special meaning to IFF_PROMISC.

We agree on something at least. :)

Again, a guiding principle should be that netmap should be just a hopefully faster alternative to raw sockets,

Sorry, I don't understand this at all. What kind of raw sockets? Something other than PF_INET/INET6?

Sorry again, please read "raw sockets" as bpf(4).

but it should not change the behaviour of an interface. We should not give a special meaning to bridge0 when it is open in netmap mode vs when it is not.
That's why my opinion is that we should not intercept non-local packets, because bridge0 does not receive them.

I understand that in your physical model, bridge0 does not receive non-local packets, but then if_bridge's BPF and pfil integration do not behave according to your model either.

Can you please elaborate a little bit on how bpf(4) integration works for if_bridge(4)? I don't think pfil is relevant to the discussion here, since netmap is not about packet filtering (bpf is about packet filtering, but it also looks like an interface for packet I/O, like netmap).

From my POV, this patch really does not give any special behaviour to if_bridge when in netmap mode. Every packet that comes in to bridge_input() goes to netmap. Every packet that comes out of netmap:bridge0^ goes to the host stack. It behaves consistently with respect to other networking features. Yes the patch is a bit complicated, but that's only to provide consistent behaviour with non-netmap mode.

Yes, I see that's consistent.

If the final goal is to implement middleboxes, that should not be done trying to reuse if_bridge(4)...

The point is to allow netmap applications to be used more flexibly, in deployments where the application author cannot control the network configuration.

So as I see it this whole discussion boils down to what bridge0 represents and therefore what opening it in netmap mode is supposed to mean.

I am still convinced that bridge0 should represent the interface that lets the host participate in the L2 switch, and not the whole L2 switch, because that's straightforward, unsurprising behaviour for the netmap user. To the contrary, if bridge0 represents the whole L2 switch, and therefore netmap RX rings receive (and steal) any packet that passes through the switch, that is surprising behaviour. And netmap has no metadata to store information like the receiving interface (mbufs have that metadata, but one of the points of netmap is to avoid the mbuf), so you need workarounds, which is not a good sign.

However, if you really think that there is a specialized use case for bridge0 that can make the latter interpretation convenient, that's ok for me.
So if I understand correctly your use case is a netmap application that opens all the RX rings of netmap:bridge0, filtering packets according to some firewall rules, and then forwarding the non-filtered packets to the host TX ring of netmap:bridge0?

From an ifnet perspective, if_bridge is already special. There is a unique ifnet hook, if_bridge_input, by which it receives packets. The if_input hook of a bridge interface is not used.

Just out of curiosity: isn't the if_input bridge0 hook used at line 2542 of if_bridge.c (multicast and broadcast)?
I'm trying to understand here... how is it possible that bridge0 if_input is not used? How can local packets reach the protocol stack for bridge0 (e.g. TCP/UDP sockets bound to bridge0) if not by means of if_input?

Yes, it's true that the bridge's if_input is called for that one special case, but that's not part of the regular data path.

Each ifnet which belongs to a bridge has a special hook, if_bridge_input, pointing to bridge_input(). Each ifnet also carries a pointer to the bridge softc. When a bridge member receives a packet, an mbuf chain is passed to ether_input_internal(), which checks to see if the receiving ifnet has if_bridge_input set. If so, the packet is passed to if_bridge_input/bridge_input().

bridge_input() can consume the packet and return NULL, which it does in the forwarding case, and then ether_input_internal() does nothing further. If the packet is local, bridge_input() uses the dst MAC to figure out which bridge port "received" the packet (this may be the bridge interface itself), and then returns the mbuf chain back to ether_input_internal(), which dispatches it to the protocol layers.

For example, what happens if you bind() to the IP address of bridge0, port TCP/5555? You'll only see local traffic, not forwarded traffic that happens to have TCP port 5555.

Well, yes, but raw sockets are implemented purely by the L3 protocol layers. I don't think this is a very convincing example to be honest. If I run tcpdump on bridge0, then I'll see both local and non-local traffic. Is this an argument in favour of my approach? bridge_forward() also (optionally) runs pfil hooks so that kernel firewalls can apply policy to forwarded packets.

I realized just now that bpf(4) does on FreeBSD what I thought could be done with raw sockets (like in other OSs). Sorry for that, I generated a lot of confusion.
But tcpdump sets IFF_PROMISC, and that's why you see non-local packets! That's independent of the specific OS API that you use to access the ifnet traffic. I'm not sure about bpf(4), but when you open an interface in netmap mode you don't automatically set IFF_PROMISC: that is up to the userspace applications.

But if_bridge does nothing special in promiscuous mode. There is no reference to IFF_PROMISC in if_bridge.c. In particular, tcpdump --no-promiscuous-mode -i bridge0 shows forwarded packets.

Regarding pfil hooks... sure if_bridge is special in that sense, but netmap is not something akin to pfil. I'd say it is more similar to bpf(4) as an API that can inject and receive packets directly at the Ethernet level, but with no receive filters.

Suppose I have a physical port in promiscuous mode, opened by netmap. Should netmap only see packets with dst MAC address equal to the port's address? Right now it will see both local and L2-forwarded traffic.

Absolutely, netmap (like bpf) will see anything that the physical port receives in its physical RX rings. If IFF_PROMISC is set, it will also see non-local traffic. If MCAST filters are enabled in the hardware, it will see multicast traffic. Netmap does not specify any policy about what RX packets should or should not be seen, it is just a different API to access what is available in the ifnet RX rings (and ofc to access TX rings for transmissions).

Ok, that makes sense. So what exactly is an RX packet for if_bridge? Netmap generic mode consumes every packet that comes in from if_input, but as discussed above, that doesn't make sense here, so we need some special definition for if_bridge. In keeping with the notion that "netmap is a user-mode API for ifnets", I think it is natural for netmap to consume everything that comes in via bridge_input().

As I said, if you think that this interpretation is more useful than the one I am proposing, that's fine, and we can go ahead.
(I think it would make sense to have the mbufs passed up to netmap for local traffic in ether_input_internal, as outlined above, and that looks more natural to me).

I understand that in your description of the physical bridge0, this makes sense, but I think it's surprising and limiting behaviour.

Now, we should think about what happens if I run a non-netmap application on bridge0, and make sure the same logic applies in the netmap case; there should be no differences. For instance, if the switch receives a packet on tap1 and the fwd table says it should go to em0, the switch will forward the packet and (if no promisc mode) bridge0 won't see the packet, and so pkt-gen -f rx on netmap:bridge0 won't see it. If a netcat client runs on bridge0 talking to a remote server (reachable through tap2), that traffic will be forwarded by the switch between bridge0 and tap2, so a netmap application sending/receiving on netmap:bridge0 towards/from a remote server will cause the same forwarding behaviour.

Regarding the IFF_PROMISC case, I think it is up to us to decide whether we want bridge0 to see the non-local packets (i.e. the ones forwarded between tap1 and em0) or not. IOW, bridge0 could or could not behave like a sort of SPAN port. However, I think that if we want bridge0 to see any packets passing through the physical switch, that's fine, but then it should behave like a SPAN port and send a copy of non-local packets to bridge0, rather than stealing them, so that they keep being forwarded.

Suppose I wanted to implement a userspace firewall using netmap. The behaviour you describe makes it impossible to filter non-local packets, since they get forwarded no matter what policy the netmap application implements.

Why is that limiting? A netmap firewall, router, switch, or any other middlebox is supposed to be implemented by opening all the involved interfaces in netmap mode, and implementing the middlebox logic entirely in userspace. For example, if you want to implement a router between em0, tap1 and tap2, you will open netmap:em0, netmap:tap1 and netmap:tap2, and route packets by moving/copying them between the RX rings and TX rings of the three interfaces. The tools/tools/netmap/bridge application is an example L2 bridge supporting two interfaces, e.g., ./bridge -i netmap:em0 -i netmap:em1

But these aren't the same thing. Suppose I want to start a new VM and create tap3 for that purpose, and add it to bridge0. Now my running netmap application needs to somehow discover this and open the port, manage extra rings, etc. This is possible, of course, but it's a bad user interface.

Wait, in the plain middlebox implementation approach above, there would not be any bridge0. The middlebox functionality would be implemented entirely in userspace, including some specific tool and/or API to add/remove interfaces. An example of this approach is Linux vhost-user (https://www.redhat.com/en/blog/how-vhost-user-came-being-virtio-networking-and-dpdk).
As another example, there is FreeBSD vale(4), where datapath is actually kernelspace, but otherwise the concept is the same (and you have the valectl tool to add/remove interfaces). There is no bridge0.
I do not see why this is a bad user interface...

I am assuming that the netmap application has no ability to define networking configuration. There is a bridge0 with a bunch of member ports, and it has to stay that way. Your suggestion was to open all member ports in netmap mode and intercept traffic from them, instead of attaching to bridge0 directly. This /can/ work, but is more complicated to implement. That is what I'm referring to as a bad interface.

The assumption is correct by definition, simply because netmap is userspace networking, which means completely bypassing kernel networking (if_bridge, pfil, vlans, etc.) and any network stack configuration. In this light, forcing netmap to work with such network configurations (like if_bridge), and with custom semantics, is indeed a stretch. But, again, if that's useful for some use cases then so be it.
I agree that userspace networking is more complicated to implement, but it is also (way) more performant and cleaner, since you don't mix kernel network configuration into the userspace application.

Also, with this approach you can actually leverage the native netmap support of physical network interfaces (iflib, vtnet), which gives you a real speed-up, because no mbufs are involved (ofc you would need to get rid of if_tap(4) and implement a more efficient interface like vhost-user).
With the bridge0 approach you propose, you are forced to use emulated netmap, so performance will never be great.

From what I understand (I am not the author of this netmap application), this is not a major concern. The point is to be flexible and adapt to different networking configurations.

I don't know which netmap application you are referring to, but OK; as I said, if there are use cases, I'm fine with it, although it's not the natural approach.

Sorry for the long post, but IMHO the implementation could be simpler: local packets go to the bridge0 if_input (or hw RX ring in the netmap case); in promisc mode a copy of non-local packets can go to bridge0 if_input (or hw RX ring in the netmap case), whereas in non-promisc mode non-local packets just follow their path (and don't go to bridge0). Note that there is no difference between the netmap and non-netmap cases.
Any thoughts?

As I suggested above, this behaviour makes it impossible to do filtering of non-local packets. I'm not sure if you consider this to be a real problem or not, this is just the motivating use-case for this patch.

Filtering packets between interfaces is of course a real problem. However, that should not be done by trying to reuse the if_bridge(4) kernel code and attach a special meaning to opening bridge0 in netmap mode. Rather, filtering should be implemented by a userspace netmap application that opens all the interfaces that are involved in filtering and forwarding.

This is possible of course, but it simply makes netmap less useful as I tried to illustrate above. The netmap application might not have any particular knowledge of the system's networking configuration, so it's very useful to be able to say, "open bridge0 and do something with the packets that go through it." An application which wants to ignore forwarded packets can simply pass them back through to the host stack, and with this patch they will be forwarded as if netmap didn't intercept them.

Yes, but please note that we are not on the same page here because you assume there must be a bridge0 in your VM deployment (bhyve host, I guess). As I tried to explain above, the intended approach (for userspace networking frameworks such as netmap) would be to get rid of the kernel and implement the whole datapath and control path in userspace. (Please note this is not something I invented myself, ofc, but it is a standard approach for userspace networking).
However, I do understand this is complex (if you do not have enough components yet), and I see why reusing if_bridge(4) looks appealing. More on this below.

To be clear, I don't want to make if_bridge a core component of some kind of middlebox solution. The target is configurations which already use if_bridge somehow and want to deploy a netmap application on top of it without changing anything else.

The example with a VM switch is just to demonstrate that the approach of opening all bridge /members/ in netmap mode is not very appealing. I understand that if_bridge is not the best tool for high-performance VM networking.

Ok.

but it should not change the behaviour of an interface. We should not give a special meaning to bridge0 when it is open in netmap mode vs when it is not.
That's why my opinion is that we should not intercept non-local packets, because bridge0 does not receive them.

I understand that in your physical model, bridge0 does not receive non-local packets, but then if_bridge's BPF and pfil integration do not behave according to your model either.

Can you please elaborate a little bit on how bpf(4) integration works for if_bridge(4)? I don't think pfil is relevant to the discussion here, since netmap is not about packet filtering (bpf is about packet filtering, but it also looks like an interface for packet I/O, like netmap).

A BPF filter on bridge0 will capture packets at all calls to ETHER_BPF_MTAP() in if_bridge.c where ifp is the bridge ifnet pointer. In the RX path, that happens for local packets in the GRAB_OUR_PACKETS macro, and for non-local packets in bridge_forward():

	/*
	 * If we have a destination interface which is a member of our bridge,
	 * OR this is a unicast packet, push it through the bpf(4) machinery.
	 * For broadcast or multicast packets, don't bother because it will
	 * be reinjected into ether_input. We do this before we pass the packets
	 * through the pfil(9) framework, as it is possible that pfil(9) will
	 * drop the packet, or possibly modify it, making it difficult to debug
	 * firewall issues on the bridge.
	 */
	if (dst_if != NULL || (m->m_flags & (M_BCAST | M_MCAST)) == 0)
		ETHER_BPF_MTAP(ifp, m);

So with my patch, netmap will intercept local packets (GRAB_OUR_PACKETS) and non-local packets (bridge_forward()) in a similar way.

Ok, that makes sense.

If the final goal is to implement middleboxes, that should not be done trying to reuse if_bridge(4)...

The point is to allow netmap applications to be used more flexibly, in deployments where the application author cannot control the network configuration.

So as I see it this whole discussion boils down to what bridge0 represents and therefore what opening it in netmap mode is supposed to mean.

I am still convinced that bridge0 should represent the interface that lets the host participate in the L2 switch, and not the whole L2 switch, because that is straightforward, unsurprising behaviour for the netmap user. To the contrary, if bridge0 represents the whole L2 switch, and therefore netmap RX rings receive (and steal) any packet that passes through the switch, that is surprising behaviour. And netmap has no metadata to store information like the receiving interface (mbufs have that metadata, but one of the points of netmap is to avoid mbufs), so you need workarounds, which is not a good sign.

However, if you really think that there is a specialized use case for bridge0 that makes the latter interpretation convenient, that's OK with me.
So, if I understand correctly, your use case is a netmap application that opens all the RX rings of netmap:bridge0, filters packets according to some firewall rules, and then forwards the non-filtered packets to the host TX ring of netmap:bridge0?

Yes, that's correct.

Just wondering what the added value of netmap is here... can't the same be achieved with pf or other kernel features?
With all the packet copies and overhead of emulated netmap, I do not expect speed-ups...

Is the patch ready for review or maybe the man page changes still need to be reworked?

From an ifnet perspective, if_bridge is already special. There is a unique ifnet hook, if_bridge_input, by which it receives packets. The if_input hook of a bridge interface is not used.

Just out of curiosity: isn't the if_input bridge0 hook used at line 2542 of if_bridge.c (multicast and broadcast)?
I'm trying to understand here... how is it possible that bridge0 if_input is not used? How can local packets reach the protocol stack for bridge0 (e.g. TCP/UDP sockets bound to bridge0) if not by means of if_input?

Yes, it's true that the bridge's if_input is called for that one special case, but that's not part of the regular data path.

Each ifnet which belongs to a bridge has a special hook, if_bridge_input, pointing to bridge_input(). Each ifnet also carries a pointer to the bridge softc. When a bridge member receives a packet, an mbuf chain is passed to ether_input_internal(), which checks to see if the receiving ifnet has if_bridge_input set. If so, the packet is passed to if_bridge_input/bridge_input().

bridge_input() can consume the packet and return NULL, which it does in the forwarding case, and then ether_input_internal() does nothing further. If the packet is local, bridge_input() uses the dst MAC to figure out which bridge port "received" the packet (this may be the bridge interface itself), and then returns the mbuf chain back to ether_input_internal(), which dispatches it to the protocol layers.

Thanks a lot for the detailed explanation!
(Following up on your explanation, I cannot help myself noting that one possibility would be to have ether_input_internal() pass up to netmap the mbuf chains that bridge_input() returns: in this way netmap receives only the "local" packets, whereas in the forwarding case you would have a NULL pointer, so nothing gets passed to netmap).

Unless I am missing something, this can't be implemented easily: to pass the chain to netmap, ether_input_internal() needs the if_bridge ifp, but it doesn't have it, only the receiving interface's ifp.

For example, what happens if you bind() to the IP address of bridge0, port TCP/5555? You'll only see local traffic, not forwarded traffic that happens to have TCP port 5555.

Well, yes, but raw sockets are implemented purely by the L3 protocol layers. I don't think this is a very convincing example to be honest. If I run tcpdump on bridge0, then I'll see both local and non-local traffic. Is this an argument in favour of my approach? bridge_forward() also (optionally) runs pfil hooks so that kernel firewalls can apply policy to forwarded packets.

I realized just now that bpf(4) does on FreeBSD what I thought could be done with raw sockets (like in other OSs). Sorry for that, I generated a lot of confusion.
But tcpdump sets IFF_PROMISC, and that's why you see non-local packets! That's independent of the specific OS API that you use to access the ifnet traffic. I'm not sure about bpf(4), but when you open an interface in netmap mode you don't automatically set IFF_PROMISC: that is up to the userspace applications.

But if_bridge does nothing special in promiscuous mode. There is no reference to IFF_PROMISC in if_bridge.c. In particular, tcpdump --no-promiscuous-mode -i bridge0 shows forwarded packets.

I know, but nothing prevents us from adding proper IFF_PROMISC support if it makes sense.

Aside from backward compatibility, yes.

Regarding pfil hooks... sure, if_bridge is special in that sense, but netmap is not something akin to pfil. I'd say it is more similar to bpf(4): an API that can inject and receive packets directly at the Ethernet level, but with no receive filters.

Suppose I have a physical port in promiscuous mode, opened by netmap. Should netmap only see packets with dst MAC address equal to the port's address? Right now it will see both local and L2-forwarded traffic.

Absolutely: netmap (like bpf) will see anything that the physical port receives in its physical RX rings. If IFF_PROMISC is set, it will also see non-local traffic. If multicast filters are enabled in the hardware, it will see multicast traffic. Netmap does not specify any policy about which RX packets should or should not be seen; it is just a different API to access what is available in the ifnet RX rings (and, of course, to access the TX rings for transmission).

Ok, that makes sense. So what exactly is an RX packet for if_bridge? Netmap generic mode consumes every packet that comes in from if_input, but as discussed above, that doesn't make sense here, so we need some special definition for if_bridge. In keeping with the notion that "netmap is a user-mode API for ifnets", I think it is natural for netmap to consume everything that comes in via bridge_input().

As I said, if you think that this interpretation is more useful than the one I am proposing, that's fine, and we can go ahead.
(I think it would make sense to have the mbufs passed up to netmap for local traffic in ether_input_internal(), as outlined above; that looks more natural to me.)

I do not think it is possible without some additional surgery. The if_bridge field of struct ifnet is an opaque pointer; ether_input_internal() has no reference to the bridge ifnet or softc.

Also, with this approach you can actually leverage the native netmap support of physical network interfaces (iflib, vtnet), which gives you a real speed-up, because no mbufs are involved (of course you would need to get rid of if_tap(4) and implement a more efficient interface like vhost-user).
With the bridge0 approach you propose, you are forced to use emulated netmap, so performance will never be great.

From what I understand (I am not the author of this netmap application), this is not a major concern. The point is to be flexible and adapt to different networking configurations.

Don't know what netmap application you are referring to, but ok, as I said if there are use cases I'm ok with it, although it's not the natural approach.

Suricata and Zenarmor are the motivating applications.

If the final goal is to implement middleboxes, that should not be done by trying to reuse if_bridge(4)...

The point is to allow netmap applications to be used more flexibly, in deployments where the application author cannot control the network configuration.

So as I see it this whole discussion boils down to what bridge0 represents and therefore what opening it in netmap mode is supposed to mean.

I am still convinced that bridge0 should represent the interface that lets the host participate in the L2 switch, and not the whole L2 switch, because that is straightforward, unsurprising behaviour for the netmap user. By contrast, if bridge0 represents the whole L2 switch, and therefore the netmap RX rings receive (and steal) any packet that passes through the switch, that is surprising behaviour. And netmap has no metadata to store information like the receiving interface (mbufs carry that metadata, but one of the points of netmap is to avoid mbufs), so you need workarounds, which is not a good sign.

However, if you really think that there is a specialized use case for bridge0 that can make the latter interpretation convenient, that's ok for me.
So if I understand correctly your use case is a netmap application that opens all the RX rings of netmap:bridge0, filtering packets according to some firewall rules, and then forwarding the non-filtered packets to the host TX ring of netmap:bridge0?

Yes, that's correct.

Just wondering what the added value of netmap is here... can't the same be achieved with pf or other kernel features?
With all the packet copies and the overhead of emulated netmap, I do not expect speed-ups...

The added value is that existing software which uses netmap (zenarmor and suricata in this case) can be deployed in configurations which already make use of if_bridge.

Is the patch ready for review or maybe the man page changes still need to be reworked?

Not quite ready, I was waiting for the discussion to resolve somehow before doing anything further. I will remove the IFF_PROMISC handling and update the manual page accordingly.

  • Remove special handling for IFF_PROMISC. Now, a netmap application will see all packets received by member interfaces, and packets injected via the host ring will be revisited by bridge_input().
  • Update the manual page correspondingly.
vmaffione added inline comments.
sys/net/if_bridge.c
2333

this change looks unrelated... why is that needed here?

2358

Is the mention to promiscuous mode a leftover?

In any case, if I'm not mistaken, within the if clause it should be that if_input == freebsd_generic_rx_handler, so that netmap intercepts forwarded packets here. Maybe the comment could be a little clearer and explain this?

2482

Maybe avoid pointer leak, and use if_name(ifp)?

2490

The rcvif field is a metadata field that is reconstructed here after netmap has intercepted the packet and reinjected it through the host TX ring. Are we sure there are no additional metadata fields that got lost because of netmap interception?

2682

Maybe avoid the kernel pointer leak to protect ASLR?

sys/net/if_ethersubr.c
673 ↗(On Diff #118432)

So in this case it must be that ifp->if_bridge == NULL and ifp->if_bridge_input != NULL, which happens for the bridge ifp itself.
I think a comment is needed here to explain this a little.

An alternative may be to set ifp->if_bridge = ifp for the bridge ifp, so that you avoid the second check in this hot path?

This revision now requires changes to proceed.Mar 11 2023, 4:10 PM
markj marked 5 inline comments as done.

Address comments from Vincenzo.

sys/net/if_bridge.c
2333

It's an oversight, this is an unrelated change. (The check is redundant.)

2358

Yes, the comment needs to be updated.

The check for IFF_NETMAP in the capenable bits is sufficient. Is there some reason to check if_input specifically?

2490

Yes, I believe this is sufficient, judging from bridge_forward(). I'm not sure what else could be reconstructed: there is no way to tell whether this is an existing packet that was reinjected by netmap or a new packet being transmitted by the netmap application.

2682

There is no KASLR, so nothing is leaked really. There are many, many ways to leak kernel pointers besides.

Also, this is a panic string, so at that point there is nothing to hide.

sys/net/if_ethersubr.c
673 ↗(On Diff #118432)

I like your suggestion, but there's code elsewhere in the network stack which assumes that ifp->if_bridge != NULL means that ifp is a bridge member port, so some more work would be needed. It's still possible to disambiguate by checking ifp->if_type == IFT_BRIDGE, but I'd prefer to leave that for a separate change.

sys/net/if_bridge.c
2358

No, I was not suggesting adding that check.
I was just asking that the comment clarify that we pass the packet to netmap here when netmap is enabled. So that's fine now.

2484

Does this handle the case of a local packet, e.g., one that is destined to bifp, intercepted by netmap and then reinjected?

2682

That's right, but it's more informative if we give the ifp name.

vmaffione added inline comments.
sys/net/if_bridge.c
763

Maybe clarify that this is assumed to be ether_input at this point?
(The whole thing is quite convoluted...)

768

...to make it clear that bridge_input is called on bifp as a result of bridge_inject.

This revision is now accepted and ready to land.Mar 12 2023, 9:55 AM
sys/net/if_bridge.c
2563

So these packets won't be intercepted by netmap. Is this the desired behaviour?

markj added inline comments.
sys/net/if_bridge.c
2563

Only a copy is passed to netmap, indeed. Probably we should avoid doing this in netmap mode unless the packet has M_BRIDGE_INJECT set.

This revision now requires review to proceed.Mar 13 2023, 5:02 PM
This revision is now accepted and ready to land.Mar 13 2023, 5:14 PM
This revision was automatically updated to reflect the committed changes.