To handle the transfer rates of tomorrow we need to reduce the number of calls into the TCP stack when receiving packets. Currently there is a 65535-byte limit, which stems from using the 16-bit payload length field in the IP header when accumulating LRO data. Instead, use the 32-bit m_len field of the mbuf for data accumulation.
Except for the signed / unsigned issue, I like this change. However, please get approval from some of the networking guys before committing.
Shouldn't this be unsigned?
Shouldn't this be unsigned?
This really isn't a compile-time constant (though I think FreeBSD has magic to kinda make it one). Honestly, I don't see what's wrong with uint16_t p_len = htons(65535). The enum type is also signed, which may cause some grief with subtle assumptions in the code.
My main concern with this patch is the special casing of what is the "packet length" being sprinkled all throughout the code. It feels like we could be chasing down obscure "is this the right length for this kind of packet" bugs for quite some time.
Is there any way we could unify this stuff a bit? I'm open to ideas.
Historically, I believe that we've allowed additional trailer content/padding to be included at the end of mbuf chains that isn't in the IP datagram, and is silently ignored, as it isn't included in the IP packet length. I may misunderstand what was going on there, but it gives me serious pause in pondering the proposed change, as those two semantics are not really compatible with one another.
See comments added.
Because the variable is later compared to the "mtu", which is also an "int". Otherwise you will get a compiler warning about a signed comparison mismatch. The same goes for the other locations where "int" is used.
This is a dummy value which should not be used in the new LRO implementation. 65535 is chosen so that the packet is dropped as invalid if any code that has not been updated checks "p_len" against "m_pkthdr.len".
rwatson: The LRO code ensures that all trailing content and padding is stripped. This happens both before and after my change. Beware: if the packet does not originate from the LRO code, it is treated as before. That's why I have a custom HASHTYPE for these packets.
*laugh* I know you're not; I'm just worried that we'll have lots of subtle bugs here and there with the semantics of "sometimes the length is in ip_hdr; sometimes the length is in m_pkthdr; and there's no clear definition of when we should check both."
If we had a clearer definition of that and we had accessor macros that abstracted that away, it'll make things clearer and it'll give us a place to put assertions about our assumptions.
LRO affects the kernel TCP code in subtle (and almost always undesirable) ways by "compressing" multiple TCP headers into one. Think TCP timestamps, bursty changes in sequence space as seen by the kernel, what happens to pure acks, etc. Our LRO implementation is even willing to combine multiple received TCP PSHes into one. All this is a decent tradeoff when a handful of segments are involved but the proposed LRO_PLEN_MAX (1MB) is 700+ MSS sized segments at 1500 MTU. I wonder how well the kernel TCP will deal with such big bubbles of data. Please do get the big picture reviewed by one of our TCP protocol experts.
M_HASHTYPE_LRO_TCP isn't really a hash type and will likely confuse the RSS code. There is some value in providing the hash type to the stack but with your proposed change the hash type will be clobbered by tcp_lro_flush. Data for a single stream will show up in the stack with either the correct hash type or M_HASHTYPE_LRO_TCP. Not pretty.
I wonder what one of these gigantic LRO'd packets looks like to bpf consumers? If they go by the IP header then they will likely get confused. A good test would be to see whether wireshark is able to follow the TCP stream accurately, or whether it loses track when it encounters one of these VLRO (Very Large RO) packets.
At the very least, allow drivers to opt out of this VLRO by
a) making LRO_PLEN_MAX per lro_ctrl, to be set when the LRO structures are initialized by the driver.
b) never clobbering the hash type in tcp_lro.c if the total length accumulated is <= 65535.
.. and yes, I don't think you should be using the hashtype field to flag LRO information. They're totally separate things.
(It's totally feasible that a NIC doing > 64KiB LRO will reassemble frames with correct RSS information and feed it to an RSS kernel, which can then ensure it's running on the correct cpuset with the TCP state and locks.)
np@'s comments are good ones, both with respect to ACK clocking and BPF -- but this will also affect local firewalls that do state tracking and/or transformation, and likewise will run into problems.
Given this discussion, I'm increasingly convinced that this is Not A Good Idea In Its Current Form -- far more thought is required about how we might want to handle this. I would recommend we set aside time at a BSDCan devsummit session to talk about the implications of VLRO for the stack as a whole.
Given our default direct-dispatch input configuration, presumably much of the benefit from VLRO comes from reducing the number of tcpcb lookups, and a bit more comes from avoiding reassembly costs. I do wonder if we might choose to present VLRO in a different way: not as a single, very large datagram, but instead as a series of datagrams that the lower layer promises are for the same TCP connection, and sequential, allowing a single lookup and reassembly avoidance, while avoiding divergence from the current assumption that an mbuf chain corresponds to a single IP datagram that meets its invariant properties. We could easily assert that the IP addresses match and that the segments are sequential, allowing us to catch bugs while making strong assumptions.
I still don't want to see the hash type stuff polluted with things that aren't hashes, making it more difficult to implement RSS work distribution along with other kinds of features.
We need to stash that flag somewhere else. If we've run out of mbuf header flag bits then it's time to think about extending them before 11-REL.
Would a set of macros hiding how we set and clear the LRO_TCP "flag" be acceptable, then we can later resolve that minor detail, exactly how the bit is encoded?
#define M_SET_LRO_TCP(m) ...
#define M_CLR_LRO_TCP(m) ...
#define M_GET_LRO_TCP(m) ...
I don't think we should be using the hash bits like this. There's even more hash types to add (all the variations on symmetric and non-RSS hashing; hashing XOR versus RSS, different fields, etc). I'd rather we don't make things more complicated by having the flowid/hashtype fields get more overloaded than they already are.
Let's see if we can do something more sensible by adding more flags; we can always look at growing the mbuf to include these new features. As I said before, there are more features than this one coming in the pipeline, so we will be better off listing what's coming so we can all figure out what new bits and what shuffling we need.
There's also robert's mbuf shuffling that needs to finish; once that's finished we may be in a much better situation to add fields. :-)
Just because some hardware is capable of coalescing more than 64k of data doesn't mean we should feel obligated to support the functionality. I'd be curious to understand the anticipated use cases that led to hardware support being added. Without some compelling data to show that this is useful, I think this work should be put on ice until such time as it can be shown to be worthwhile. If such data exists, I'm willing to give it due consideration and revise my judgment, but at this stage I strongly suspect there is no workload we support, or will support in the near future, that would benefit significantly from raising the LRO chunk size above 64k versus the hacks required to make it work, so that's why I'm voting against this patch outright rather than suggesting changes. The real goal is to remove LRO entirely anyway, which I believe we have ideas on how to do, e.g. via packet batching techniques.
As an aside, it would be useful to socialise ideas like this a bit more along with good data before investing the time and energy into doing the work unless it's trivial enough that it doesn't matter. Ideally we should have had this discussion on the mailing list centered around proposed use case(s) and a data set showing the limitations of the 64k limit on those use cases before the patch was proposed.
I hope I didn't delete it... from what I could see online, the "Abandon" Phabricator action is the means by which a reviewer indicates they have permanently rejected the patch (as opposed to suggesting changes).
As to people using the patch, can you say who and why?
lawrence: It is someone well known to be using FreeBSD. This patch makes such a big difference when applied to 10+ Gbit/s connections that we can run 2 TCP streams totalling 37.5 Gbit/s on a single 2.x GHz CPU core instead of only one.
Ok, but that's anecdotal and gives us reviewers nothing to go on - without any methodology or raw data who knows whether the LRO change is solely responsible for the improvement and if it introduced any undesired side effects. It's also possible that with tuning, the same results could have been obtained without the "jumbo" LRO change.
As there seems to be some sensitivity around sharing specific details from field deployments which is fine, the path forward is therefore for you and/or Mellanox test engineers to run experiments, capture + analyse data and present it for discussion. You should provide your methodology so anyone wanting to replicate your experiments and results can do so.
That being said, I personally feel the energy would be better spent on batching, which would allow a tunable number of 64k correctly formed packets to be passed up the stack which should give 99% of the benefits of this work without the hackiness, plus gives us a win in many other workloads when LRO is unavailable or not used.
lstewart: We can put together a paper documenting the benefits of enlarging the IP-packet input payload, so that we can fully understand what is going on. The multipacket approach seems a bit more tricky, since it involves changing TCP, and possibly the if_output() calls would need to handle it as well, though it's not impossible.
lstewart: What should an mbuf accessor that computes the total pktheader payload for multiple mbufs be called?