offload: Compute and insert checksums as late as possible
Needs ReviewPublic

Authored by timo.voelker_fh-muenster.de on Tue, Apr 21, 9:11 PM.
Details

Summary

Cases exist where, for an outgoing packet, the IP/SCTP/TCP/UDP checksum is computed in software, even though the hardware could have done it, or, worse, where the checksum is not computed in software even though the hardware is incapable of doing so. To avoid such cases, this patch moves the computation of checksums in software to the point right before the packet is passed to the driver.

This patch changes the following:

  1. mbuf.h: Make csum_data two bytes shorter and add offload_l3_hdr_offset and offload_l4_hdr_offset in the mbuf packet header.
  2. ip_output.c/ip6_output.c: Remove code that computes the checksum. Set offload_l4_hdr_offset.
  3. if_ethersubr.c/if_infiniband.c: Set offload_l3_hdr_offset.
  4. if.c: When an interface comes up or its capabilities have been changed, call the new if_offload_caps_changed, which sets ifp->if_transmit to the new if_offload_transmit if the interface driver does not support all expected offloading capabilities.
  5. if_offload.c: When if_offload_transmit is called, compute and insert all checksums that are still required and cannot be offloaded to the interface (driver), and call the original if_transmit function.
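
Steps 4 and 5 can be pictured with a small user-space sketch. Everything prefixed `sk_`/`SK_` is hypothetical and only mimics the shape of the patch: the flag values are made up, `struct sk_ifnet` stands in for `struct ifnet`, and clearing the handled flags after the software step is an assumption rather than something the summary states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values; the real CSUM_* constants live in sys/mbuf.h. */
#define SK_CSUM_IP	0x01u
#define SK_CSUM_TCP	0x02u
#define SK_CSUM_UDP	0x04u
#define SK_CSUM_SCTP	0x08u
#define SK_OFFLOAD_EXPECTED \
	(SK_CSUM_IP | SK_CSUM_TCP | SK_CSUM_UDP | SK_CSUM_SCTP)

struct sk_pkt {
	uint32_t csum_flags;		/* checksums still to be done */
};

struct sk_ifnet;
typedef int (*sk_transmit_fn)(struct sk_ifnet *, struct sk_pkt *);

struct sk_ifnet {
	uint32_t hwassist;		/* offloads the driver announces */
	sk_transmit_fn transmit;	/* what the stack calls */
	sk_transmit_fn transmit_orig;	/* the driver's own routine */
};

/* Flags seen by the stand-in driver on its most recent call. */
static uint32_t sk_last_flags;

/* Stand-in driver routine: records what it would hand to the hardware. */
static int
sk_driver_transmit(struct sk_ifnet *ifp, struct sk_pkt *p)
{
	(void)ifp;
	sk_last_flags = p->csum_flags;
	return (0);
}

/* Wrapper: do in software what the driver cannot, then call the driver. */
static int
sk_offload_transmit(struct sk_ifnet *ifp, struct sk_pkt *p)
{
	uint32_t req;

	req = p->csum_flags & SK_OFFLOAD_EXPECTED & ~ifp->hwassist;
	/* A real implementation would compute and insert checksums here. */
	p->csum_flags &= ~req;	/* assumption: handled flags are cleared */
	return (ifp->transmit_orig(ifp, p));
}

/* Install the wrapper only if some expected offload is missing. */
static void
sk_offload_caps_changed(struct sk_ifnet *ifp)
{
	assert(ifp != NULL && ifp->transmit_orig != NULL);
	if ((SK_OFFLOAD_EXPECTED & ~ifp->hwassist) != 0)
		ifp->transmit = sk_offload_transmit;
	else
		ifp->transmit = ifp->transmit_orig;
}
```

In this sketch an interface that announces every expected capability keeps calling the driver routine directly, so the wrapper adds no cost for it.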

https://wiki.freebsd.org/Networking/ChecksumOffloading

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/net/if_offload.c
83

This is so trivial, maybe it would be easier / faster to just always calculate the checksum when composing the header.

89

The header will almost certainly be there. I'd check if (__predict_false(m->m_len < offset + m->m_pkthdr.offload_l4_hdr_offset)) and only then call m_pullup()

160

Can this happen?

167

I'd make everything above this an inline, and only call into a function for a slow path

sys/net/if_offload.c
83

Calculating the IPv4 header checksum is really not a big deal, and it is probably better done when composing the header than here. However, the expectation is that interfaces support IPv4 header checksum offloading, so that it does not have to be done in software. My assumption is that it is faster in hardware than in software. This assumption might be wrong. Do you have measurement data?

160

I don't know. I just chose the defensive way.

Is it guaranteed that an mbuf passed over the if_transmit function has a packet header?

167

Like so?

static inline int
if_offload_transmit(struct ifnet *ifp, struct mbuf *m)
{
	uint32_t offload_req;

	IF_OFFLOAD_LOG("Enter if_offload_transmit for %s", ifp->if_xname);
	/* Without a packet header there are no offload requests to honor. */
	if ((m->m_flags & M_PKTHDR) == 0)
		return ((ifp->if_transmit_org)(ifp, m));

	/* Determine the offload tasks that must be performed here. */
	offload_req = m->m_pkthdr.csum_flags & IF_OFFLOAD_EXPECTED
	    & ~ifp->if_hwassist;
	/* Fast path: the interface handles everything that was requested. */
	if (offload_req == 0)
		return ((ifp->if_transmit_org)(ifp, m));

	/* Slow path: compute the remaining checksums in software. */
	if_offload_perform(m, offload_req);
	return ((ifp->if_transmit_org)(ifp, m));
}

I like the idea. Do you know if this has an effect?

In the past, I ran some tests with clang and inlining in which clang did not inline functions that were called from another C file. In those tests, I called the functions directly. This function is called through a function pointer, so my assumption is that clang does not even know which function is called. Do I underestimate clang?

timo.voelker_fh-muenster.de marked an inline comment as done and 3 inline comments as not done. Wed, Apr 22, 8:20 AM
sys/net/if_offload.c
83

It's not so much a question of it being "faster" in software than in hardware; I'm just advocating for simplicity.
On any modern CPU, calculating the IPv4 sum in software is trivial, as long as it is done while the header is in cache.

In terms of performance data: I don't have anything modern. I did experiments back in 2013 or so with enabling and disabling IP checksum offload on a variety of NICs and could never notice a difference from having it enabled at the level of CPU use consumed by network benchmarks.
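
For reference, the software computation under discussion is the 16-bit one's-complement Internet checksum (RFC 1071) over the IPv4 header. The following is a minimal user-space sketch, not the kernel's in_cksum(); the function name is hypothetical, and it assumes the caller has zeroed the checksum field (bytes 10 and 11) beforehand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: one's-complement checksum over an IPv4 header.
 * 'len' is the header length in bytes (20 to 60, a multiple of 4).
 * The result is the 16-bit value whose high byte belongs in hdr[10]
 * and whose low byte belongs in hdr[11].
 */
static uint16_t
sk_ipv4_hdr_cksum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;

	assert(hdr != NULL && len >= 20 && len <= 60 && len % 4 == 0);

	/* Add the header as a sequence of big-endian 16-bit words. */
	for (size_t i = 0; i < len; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];

	/* Fold the carries back into the low 16 bits. */
	while ((sum >> 16) != 0)
		sum = (sum & 0xffff) + (sum >> 16);

	return ((uint16_t)~sum);
}
```

For a header without options this is ten additions and a fold. Running the same function over a header that already carries a correct checksum yields 0, which is how a receiver can verify it.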

160

Yes. Without a packet header, the NIC would have no idea what to do with the packet. If you want to be defensive, you could change that check to a KASSERT that the mbuf has M_PKTHDR set.

167

I suspect it will have a small effect at high packet rates, by removing a call from the critical path. But I'm not in a position to prove it.

timo.voelker_fh-muenster.de added inline comments.
sys/net/if_offload.c
83

Currently, the IPv4 header is composed by SCTP/TCP/UDP and the header checksum is calculated in ip_output if the selected outgoing interface does not support it. Is your suggestion to calculate the IPv4 header checksum in ip_output() unconditionally?

What about modern but rather slow CPUs? I guess we find such CPUs in embedded environments. Would you expect a benefit from using IPv4 header checksum offloading there?

I'd suggest adding links to the discussion about deferring the checksum calculation. I believe I saw it before, but I cannot find it now.

Deferring the checksum calculation to the leaf interface (the interface transmitting packets that leave the OS) has many benefits. For example, modern hardware can handle checksums efficiently, and it simplifies the checksum handling logic in bridge(4) etc.

Ideally, I think software checksum calculation should be per-driver. But modifying every driver would be too much work and may introduce regressions in some drivers without extensive testing.

The introduction of if_offload_transmit() is smart, but I think handling software checksum offload is the driver's responsibility, not the network stack's. Maybe we should focus on widely used drivers for now and add software checksum offload to them, and keep drivers that are not widely used as they are.

> I'd suggest adding links to the discussion about deferring the checksum calculation. I believe I saw it before, but I cannot find it now.

I added the link in the summary.

> Deferring the checksum calculation to the leaf interface (the interface transmitting packets that leave the OS) has many benefits. For example, modern hardware can handle checksums efficiently, and it simplifies the checksum handling logic in bridge(4) etc.
>
> Ideally, I think software checksum calculation should be per-driver. But modifying every driver would be too much work and may introduce regressions in some drivers without extensive testing.
>
> The introduction of if_offload_transmit() is smart, but I think handling software checksum offload is the driver's responsibility, not the network stack's. Maybe we should focus on widely used drivers for now and add software checksum offload to them, and keep drivers that are not widely used as they are.

I consider this patch an intermediate solution. If an interface does not support all expected offload capabilities and it is more efficient to do some offload tasks in the driver, one could add that functionality to the driver. The driver can then announce the offload capability via the hwassist field to prevent if_offload_transmit() from doing this offload task. Once the driver announces all expected offload capabilities in hwassist, if_offload_transmit() is not even called. Once all drivers support all expected offload capabilities (in software or hardware), if_offload_transmit() is obsolete. However, I guess it will take a while until we get there.

sys/net/if_private.h
135

'if_transmit_orig' might be more obvious, at the cost of making the name one letter longer.

(I also wonder if we shouldn't 'just' make sure that everything that sends packets calls if_transmit(ifp, m) rather than ifp->if_transmit(m). The checksum work could then be done in if_transmit() without having to store an additional pointer in struct ifnet. But that may be a lot of effort, so take this as nothing more than a passing thought.)

Leave the lines that initialize ip_sum with 0, which some NICs require.

timo.voelker_fh-muenster.de added inline comments.
sys/net/if_private.h
135

I changed the name of if_transmit_org to if_transmit_orig.

I designed the solution to have no performance impact on NICs that support all expected offload capabilities. If, for example, Ethernet calls if_transmit(ifp, m) (which calls ifp->if_transmit(m)) instead of calling ifp->if_transmit(m) directly, we always have one more function call (even for NICs that support all expected offload capabilities).

To check the impact of IPv4 header checksum, I ran tests.

I have two computers with an Intel Sandy Bridge CPU (Core i7-2600, 3.40 GHz) and an Intel 82599ES network card that connects them with a direct 10 Gb/s link. I used iperf3 to do a throughput measurement.

On the sender, I disabled tso4 and started an iperf3 run 20 times, each for 25 seconds. During the run, the sender is CPU limited.

On the receiver, I measured the throughput every ten seconds and used the second measurement for my results.

I ran the test twice, with different kernels on the sender: once with the current head kernel, to get throughputs when the kernel offloads the TCP and IPv4 header checksums (tp_ipv4off.txt), and once with the following modification, to get throughputs when the kernel offloads the TCP checksum but not the IPv4 header checksum (tp_noipv4off.txt).

diff --git a/sys/netinet/ip_output.c b/sys/netinet/ip_output.c
index 200f281f34a7..e2375b06e8a7 100644
--- a/sys/netinet/ip_output.c
+++ b/sys/netinet/ip_output.c
@@ -758,7 +758,6 @@ ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags,
                }
        }
 
-       m->m_pkthdr.csum_flags |= CSUM_IP;
        if (m->m_pkthdr.csum_flags & CSUM_DELAY_DATA & ~ifp->if_hwassist) {
                in_delayed_cksum(m);
                m->m_pkthdr.csum_flags &= ~CSUM_DELAY_DATA;
@@ -781,10 +780,7 @@ ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro, int flags,
            (m->m_pkthdr.csum_flags & ifp->if_hwassist &
            (CSUM_TSO | CSUM_INNER_TSO)) != 0) {
                ip->ip_sum = 0;
-               if (m->m_pkthdr.csum_flags & CSUM_IP & ~ifp->if_hwassist) {
-                       ip->ip_sum = in_cksum(m, hlen);
-                       m->m_pkthdr.csum_flags &= ~CSUM_IP;
-               }
+               ip->ip_sum = in_cksum(m, hlen);
 
                /*
                 * Record statistics for this interface address.

With ministat, I get the following output.

% ministat tp_ipv4off.txt tp_noipv4off.txt 
x tp_ipv4off.txt
+ tp_noipv4off.txt
+------------------------------------------------------------------------------------------------------+
|                                              +     +    +                              xx           x|
|+                             +      +   +++ ++   ++++ ++++  +       x  x         x  x xxxx xx xx    x|
|                                 |_____________A__M_________|                    |_______A______|     |
+------------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  20          6573          6674        6636.5       6635.05     24.737517
+  20          6358          6550          6516        6503.9     42.319462
Difference at 95.0% confidence
	-131.15 +/- 22.1851
	-1.97662% +/- 0.332692%
	(Student's t, pooled s = 34.6618)
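
The absolute figures in this output can be re-derived by hand, assuming ministat applies the usual pooled two-sample Student's t formula (which its "pooled s" label suggests). The critical value of about 2.024 for 38 degrees of freedom is taken from a standard t table, not from the output. Here subscript 1 denotes tp_ipv4off.txt (x) and subscript 2 denotes tp_noipv4off.txt (+).

```latex
\begin{align*}
\bar{x}_2 - \bar{x}_1 &= 6503.9 - 6635.05 = -131.15 \\
s_p &= \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}
     = \sqrt{\frac{19 \cdot 24.737517^2 + 19 \cdot 42.319462^2}{38}}
     \approx 34.66 \\
t_{0.975,\,38} \cdot s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}
    &\approx 2.024 \cdot 34.66 \cdot \sqrt{\frac{2}{20}} \approx 22.19 \\
\frac{\bar{x}_2 - \bar{x}_1}{\bar{x}_1} &= \frac{-131.15}{6635.05} \approx -1.977\,\%
\end{align*}
```

The interval of -131.15 +/- 22.19 does not include zero, which is why ministat reports a difference at 95% confidence.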

It looks like using IPv4 header checksum offloading has a benefit over always computing the IPv4 header checksum in ip_output(): without offloading, throughput is about 2% lower. Given the better efficiency, I would accept the slightly more complex code.

sys/net/if_offload.c
83

As described in my comment above, I tested it. Not on the most modern CPU, but still, as a result, I would keep using IPv4 header checksum offloading.