Details

Reviewers

tuexen
kevans
kp
ivy
bryanv
rgrimes
peter.lei_ieee.org
markj

Group Reviewers

transport
manpages

Commits

rGdd8b6e2ce3fd: sctp, tcp, udp: improve deferred computation of checksums
rGbcb298fa9e23: sctp, tcp, udp: improve deferred computation of checksums

Summary

When the SCTP, TCP, or UDP implementation send a packet, it does not compute the corresponding checksum but defers that. The network layer will determine whether the network interface selected for the packet has the requested capability and computes the checksum in software, if the selected network interface does not have the requested capability.
Do this not only for packets being sent by the local SCTP, TCP, and UDP stack, but also when forwarding packets. Furthermore, when such packets are delivered to a local SCTP, TCP, or UDP stack, do not compute or validate the checksum, since such packets have never been on the wire. This allows to support checksum offloading also in the case of local virtual machines or jails. Support for epair, vtnet, and tap interfaces will be added in separate commits.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

timo.voelker_fh-muenster.de created this revision.Jul 23 2025, 12:46 PM

Herald added a reviewer: transport. · View Herald TranscriptJul 23 2025, 12:46 PM

Herald added subscribers: vegeta_tuxpowered.net, glebius, melifaro and 2 others. · View Herald Transcript

timo.voelker_fh-muenster.de requested review of this revision.Jul 23 2025, 12:46 PM

timo.voelker_fh-muenster.de mentioned this in D51298: tuntap bridge: set tx csum mbuf flags if in a bridge.Jul 23 2025, 12:53 PM

How is CSUM_IP6_UDP different from CSUM_DATA_VALID_IPV6 now?

sys/net/if_epair.c
598	I don't see how this can be correct. if_epair doesn't provide checksum offload. (Yes, we could potentially lie because there's not likely to be corruption between tx and rx here, but that makes if_epair less like real hardware, less useful for testing and more likely to have very hard to diagnose checksum issues in the future.)
sys/netinet/ip_fastfwd.c
471	ip_output() also handles SCTP checksums. Shouldn't we do that for fastforward too?

In D51475#1175448, @kp wrote:

How is CSUM_IP6_UDP different from CSUM_DATA_VALID_IPV6 now?

CSUM_DATA_VALID_IPV6 is the IPv6 version of CSUM_DATA_VALID. The mbuf man page says:

CSUM_DATA_VALID  The checksum of the data portion of the IP packet
                 has been computed and stored in the field
                 csum_data in network byte order.

With this patch, CSUM_IP6_UDP means for the input path that the checksum is unnecessary. It does not imply that the checksum has been computed. With CSUM_IP6_UDP the csum_data field instead contains the offset of the checksum field, which is important in case the packet somehow leaves the host.

timo.voelker_fh-muenster.de added inline comments.Jul 23 2025, 2:23 PM

sys/net/if_epair.c
598	Well it's all virtual. A virtual checksum offloading for a virtual interface. The advantage is that TCP/UDP in a VNET can use checksum offloading where the checksum is calculated in hardware if the packet leaves the host over a real interface with that capability.

timo.voelker_fh-muenster.de added inline comments.Jul 23 2025, 2:33 PM

sys/netinet/ip_fastfwd.c
471	That's a good point. I have this on my to-do list, but it shouldn't be an issue right now. A VM only uses TCP/UDP checksum offloading. Between the VM and the host we have this conversion between VIRTIO header flags and mbuf header flags. The `VIRTIO_NET_HDR_F_NEEDS_CSUM` flag is specified for TCP/UDP only. For the epair interface, I only enabled checksum offloading for TCP and UDP. ifp->if_hwassist = CSUM_IP_TCP \| CSUM_IP_UDP \| CSUM_IP6_TCP \| CSUM_IP6_UDP;

In D51475#1175460, @timo.voelker_fh-muenster.de wrote:
In D51475#1175448, @kp wrote:

How is CSUM_IP6_UDP different from CSUM_DATA_VALID_IPV6 now?

CSUM_DATA_VALID_IPV6 is the IPv6 version of CSUM_DATA_VALID. The mbuf man page says:
CSUM_DATA_VALID  The checksum of the data portion of the IP packet
                 has been computed and stored in the field
                 csum_data in network byte order.
With this patch, CSUM_IP6_UDP means for the input path that the checksum is unnecessary. It does not imply that the checksum has been computed. With CSUM_IP6_UDP the csum_data field instead contains the offset of the checksum field, which is important in case the packet somehow leaves the host.

But we're actively using CSUM_IP6_UDP on the input side now. It's an output flag, to indicate that the NIC is expected to calculate the checksum, isn't it?

I'm worried we're conflating things here, and creating a mess of what is already very messy and touchy.

sys/net/if_epair.c
598	Yeah, I know you can get away with it in simple setups, but it's begging for other bugs when bridging, firewalling or netgraph-ing. I strongly object to at least this part of the diff.

In D51475#1175479, @kp wrote:
In D51475#1175460, @timo.voelker_fh-muenster.de wrote:
In D51475#1175448, @kp wrote:

How is CSUM_IP6_UDP different from CSUM_DATA_VALID_IPV6 now?

CSUM_DATA_VALID_IPV6 is the IPv6 version of CSUM_DATA_VALID. The mbuf man page says:
CSUM_DATA_VALID  The checksum of the data portion of the IP packet
                 has been computed and stored in the field
                 csum_data in network byte order.
With this patch, CSUM_IP6_UDP means for the input path that the checksum is unnecessary. It does not imply that the checksum has been computed. With CSUM_IP6_UDP the csum_data field instead contains the offset of the checksum field, which is important in case the packet somehow leaves the host.
But we're actively using CSUM_IP6_UDP on the input side now.

I haven't been aware of that. Can you point me to the code where this flag is used on the input path?

It's an output flag, to indicate that the NIC is expected to calculate the checksum, isn't it?

Right, my understanding was that this is the only meaning of the flag before this patch.

I'm worried we're conflating things here, and creating a mess of what is already very messy and touchy.

tuexen added inline comments.Jul 23 2025, 4:50 PM

sys/net/if_epair.c
598	I think using this for an epair interface makes sense (for example in the container context). I would suggest to add support for it and disable it per default. So people can try it and report if it does not work. This enables a way forward... We should also include SCTP here, since at has a substantial CPU impact.
598	We can do it in a separate commit. What about doing that and not enabling it by default. This allows people to test it and report any problems. We are happy to look at them.
sys/netinet/ip_fastfwd.c
471	I agree with kp here. Let us get SCTP support in here, even if we don't make use of it right now. Let us only touch the code here once.

I have one generic question: I would have added in ip_input() before calling the transport header the code to add the flag CSUM_DATA_VALID in case of CSUM_IP_UDP is requested. That would mean switching from output request to input handling as soon as we are sure to do input processing. So we would handle SCTP/TCP/UDP in ip_input() instead of sctp_input()/tcp_input()/udp_input().

Looking at the files changed, I would find it much more intuitive that you need changes in ip_fastfwd.c and ip_input.c compared to ip_fastfwd.c and three transport specific files (and the corresponding IPv6 stuff).

Any opinions on that?

sys/dev/virtio/network/virtio_net.h
263 ↗	(On Diff #158939)	I am wondering if this need to be protected by #if defined(INET) \|\| defined(INET6) like the next line.
299 ↗	(On Diff #158939)	I am wondering if this need to be protected by #if defined(INET) \|\| defined(INET6) like the next line.
302 ↗	(On Diff #158939)	I am wondering if this need to be protected by #if defined(INET) \|\| defined(INET6) like the next line.

tuexen added reviewers: rgrimes, peter.lei_ieee.org.Jul 23 2025, 5:04 PM

tuexen added inline comments.Jul 23 2025, 5:09 PM

sys/dev/virtio/network/virtio_net.h
260 ↗	(On Diff #158939)	Isn't the name now confusing? You are setting the tx flags, not rx.

In D51475#1175481, @timo.voelker_fh-muenster.de wrote:

In D51475#1175479, @kp wrote:

But we're actively using CSUM_IP6_UDP on the input side now.

I haven't been aware of that. Can you point me to the code where this flag is used on the input path?

It's being introduced in this patch in both udp6_input() and tcp_input().

Overall I really don't like this approach. We're spreading something that should only affect tun tap around the entire stack. I've had long and difficult arguments with checksum flags before, and I fear that this is something that's going to create even more confusion about checksum flags in the kernel, and even more scenarios that only sometimes work.

Why not just calculate the checksums in tuntap and be done with it? Or just plain don't announce checksum offload support.
It's not going to make any difference at all to its performance (the bottleneck for bhyve's use of tun tap is the bounce to userspace, both the copying in and out and the context switches. On top of the mutex, of course).

In D51475#1175555, @kp wrote:

In D51475#1175481, @timo.voelker_fh-muenster.de wrote:

In D51475#1175479, @kp wrote:

But we're actively using CSUM_IP6_UDP on the input side now.

I haven't been aware of that. Can you point me to the code where this flag is used on the input path?

It's being introduced in this patch in both udp6_input() and tcp_input().

Overall I really don't like this approach. We're spreading something that should only affect tun tap around the entire stack. I've had long and difficult arguments

This work is about two things:

Fix the issue that right now FreeBSD sends packets with a wrong checksum for TCP und UDP.
Improve the performance by leveraging offload capabilities of physical interfaces or of computations not necessary.

You can do the first by always computing them. But I think addressing the second is also necessary. Doing checksum offloading and large segment offloading requires actions at the transport layer and at the interface layer. We have the code already there.

with checksum flags before, and I fear that this is something that's going to create even more confusion about checksum flags in the kernel, and even more scenarios that only sometimes work.

I think a better way to look at this work is:

It will allow FreeBSD to work in setups where it currently doesn't work. See the bug report Timo is referring to. Of course this needs corresponding upcoming patches in bhyve. That basically improves the situation for vtnet by making it work in several scenarios where it is not working yet.

With the fixes to improve the situation for vtnet, we can just enable the feature for epair. It will just work. For me, this is an indication that the infrastructure changes done for vtnet are not vtnet-specific, but improve the situation in epair.

I am all for deploying them off by default first. There might be other fields, where stuff is not working correctly, but people need to report it such that we are aware of it and can improve it.

Why not just calculate the checksums in tuntap and be done with it? Or just plain don't announce checksum offload support.

That does not help from a performance perspective. And this is a prerequisite for TCP segment offloading.

It's not going to make any difference at all to its performance (the bottleneck for bhyve's use of tun tap is the bounce to userspace, both the copying in and out and the context switches. On top of the mutex, of course).

That is a good point. I think what we need is a way that you can move an mbuf chain from within a VM to the host and vice versa such that flags for checksum offloading and TSO and so on are preserved. I think this is the intention of the vtnet interface. The current implementation of this might not be performant. But this work is to make use of the offload capabilities on the host side.
So if we eventually come up with an improved version of such an mbuf chain transfer mode, it will just work and make use for the offload capabilities. Another example of such a transfer mode is the epair interface. So I think the infrastructure work is independent from the particular implementation of the vtnet interface and helps leveraging the available offload capabilities.

Michael, Timo and I had an chat about this just now, and that's clarified things for me.
My major objections have been addressed, I would still like this to be documented a bit better, both in sys/mbuf.h in the checksum flag comment block, as well as the commit message for this patch.

They're going to update this review to keep the if_epair and stack changes, but split off the virtio changes.
Michael also suggested making the if_epair changes opt-in with a sysctl, but I'm not sure that's required. (And I'm inclined to think that it's going to get a lot more testing if enabled by default.)

(We also touched on the notion of translating the checksum flags from output flags to input flags in if_epair rather than in the network stack itself, but that doesn't work, because when we get the packet on the if_epair we don't know if it's going to end up being forwarded or bridged or sent to the host IP stack as an input packet.)

Thanks for adding the ip forwarding check, this should fix my issue with virutalized routers using tunnel like transports (pppOtcp, ipsec, etc).

sys/dev/virtio/network/virtio_net.h
263 ↗	(On Diff #158939)	Yes it needs INET+INEt6 protection, not repeating the YES to the rest of this same question, but it appears to me that they all need the protection.

Removed the tap changes to focus this revision more on the handling of incoming packets with checksum offloading flags.
Added code to handle SCTP packets.
Added documentation in mbuf man page and mbuf.h.
Extended changes in epair to be able to get and set interface capabilities.

Herald added a subscriber: ziaee. · View Herald TranscriptJul 26 2025, 9:08 PM

timo.voelker_fh-muenster.de added inline comments.Jul 26 2025, 9:27 PM

sys/net/if_epair.c
440	Did not add `CSUM_IP_SCTP` on purpose. If one side of the epair-interface is added to a bridge, the bridge leaves `txcsum` only if all other interfaces in the bridge offer that feature. However, it is not clear what `txcsum` includes. Certainly `CSUM_IP_TCP` and `CSUM_IP_UDP`, but maybe not `CSUM_IP_SCTP`. If the epair-Interface would offer `CSUM_IP_SCTP` but the physical outgoing interface in the bridge not, then the SCTP packets that would be sent through the epair-interface in the bridge with an incorrect checksum and `CSUM_IP_SCTP` set leave the host over the physical interface still with an incorrect checksum.

Addressing comments from @tuexen I got from a chat. I touched only the code I added before.

ip6_forward.c and ip_forward.c: The return value from ip6_lasthdr is now uniformly stored in an int-variable.
sctp_input.c and sctp6_usrreq.c: With CSUM_IP_SCTP or CSUM_IP6_SCTP set, increment the same counter as if CSUM_SCTP_VALID were set (sctps_recvhwcrc).
Let epair annouce RXCSUM capability. This is only to inform the user that packets with, for example, the mbuf flag CSUM_DATA_VALID set may arrive. This happens (independently of this patch) if the packet was originally received over a physical interface with real RXCSUM (which sets the mbuf flag CSUM_DATA_VALID) and then switched or routed out over an epair interface.
Restrict the added documentation in mbuf man page to the handling of incoming packets with the CSUM_TCP, CSUM_UDP, or CSUM_SCTP flag set and keep the existing format.

Also, I added a paragraph in the epair main page to document the TXCSUM, TXCSUM6, RXCSUM, and RXCSUM6 capabilities.

Are we sure about the predict_false's being added into a file which appears to not have any predict in it? I can see that a system with assist hardware and lots of VM's this _false would actually be a pessimization as almost all packets are going to be marked with offloading capabilities. Perhaps some dtrace data might answer this.

In D51475#1178852, @rgrimes wrote:

Are we sure about the predict_false's being added into a file which appears to not have any predict in it? I can see that a system with assist hardware and lots of VM's this _false would actually be a pessimization as almost all packets are going to be marked with offloading capabilities. Perhaps some dtrace data might answer this.

I haven't done any measurements so far. How well these __predict_false's are depends on how people use FreeBSD.

I assume these predictions will be more often correct than wrong (that's why I added them). The predictions are wrong only if

A packet sent with CSUM_IP_TCP, CSUM_IP_UDP, or CSUM_IP_SCTP switches to the input path (e.g., sent from a VM or Jail).
The packet does not switch back to the output path on link layer (e.g., input interface is not in a bridge or bridge decides its for us).
The packet switches back to the output path by IP routing.
The outgoing interface selected by IP does not support checksum offloading for TCP, UDP, or SCTP, respectively.

In a setup with lots of packets from a VM or Jail that will be routed out by IP, points 1 to 3 are often true, but I assume that point 4 is false in most setups.