Details
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
- Not Applicable
- Unit Tests
- Not Applicable
Event Timeline
Internally at Netflix, we have a version of this patch with a slightly different name. This is because, in addition to avoiding code duplication, it also allows a bit more room to innovate with the LRO internals.
Is there any chance you could call this tcp_lro_flush_all() rather than tcp_lro_rx_done()? The "flush" name follows the existing convention. (And, selfishly, it will avoid merge conflicts when it trickles into our tree, as we called it tcp_lro_flush_all().)
Yeah, in addition to deduplicating the code, I have been working on improving the internals of the LRO; my first step is changing the SLIST to a LIST, so that medium-sized LRO entry counts (128, as used by hyperv's network driver) will not slow things down when an entry is removed from the end of the list :). Testing of this change will be posted shortly, after I test it in Azure. And beyond that, I will add a hash table, so that tcp_lro_rx() doesn't need to iterate the active list.
> Is there any chance you could call this tcp_lro_flush_all() rather than tcp_lro_rx_done()? The "flush" name follows the existing convention. (And, selfishly, it will avoid merge conflicts when it trickles into our tree, as we called it tcp_lro_flush_all().)
Well, tcp_lro_flush_all() was already taken by the mbuf queuing path; that's why I used tcp_lro_rx_done() here. But we could rename tcp_lro_flush_all() to tcp_lro_flush_mbuf(), and then rename tcp_lro_rx_done() to tcp_lro_flush_all(). Would you and Hans agree to this approach?
I'd forgotten that was what hps had named his flusher. But, the cool thing is that the existing tcp_lro_flush_all() does exactly what you want. All you need to do is get hps' approval to remove the __predict_false() from the check at the top of tcp_lro_flush_all().
As to hash tables and such...
We (Netflix) already have a hash table. Randall (rrs@) has been working on this for the last year or so, on and off. He's trying to determine whether it is faster or slower than hps's RSS-sorted method.
The advantage of the hash is that all drivers would benefit. The disadvantage is that it is arguably less cache-friendly. We're in the middle of a comparison. Let's collaborate before you end up duplicating all our work. I'll send you our current code off-line.
Please also add rrs to this review.
Grrr, I didn't read the code closely. I will remove the __predict_false() and call tcp_lro_flush_all() in the drivers. Thanks.
> As to hash tables and such...
> We (Netflix) already have a hash table. Randall (rrs@) has been working on this for the last year or so, on and off. He's trying to determine whether it is faster or slower than hps's RSS-sorted method.
> The advantage of the hash is that all drivers would benefit. The disadvantage is that it is arguably less cache-friendly. We're in the middle of a comparison. Let's collaborate before you end up duplicating all our work. I'll send you our current code off-line.
Yeah, sure, I could test the patch in Azure.
> Please also add rrs to this review.
Done.
Use tcp_lro_flush_all() and remove the __predict_false() at the beginning of tcp_lro_flush_all(). Drivers now call tcp_lro_flush_all(), and tcp_lro_rx_done() is made static. As suggested by gallatin.
for these two:
M sys/dev/cxgb/cxgb_sge.c (6 lines)
M sys/dev/cxgbe/t4_sge.c (7 lines)
The call to tcp_lro_flush_all is not equivalent to the code it's replacing. The other LRO receive paths in these drivers do not do the qsort before tcp_lro_rx. Neither should the final flush.
By the way, the names of the routines are a bit awkward. tcp_lro_flush_all sounds like something that would be a simple loop around tcp_lro_flush but isn't. That threw me off initially and I was about to accept this revision.
Well, tcp_lro_flush_all == tcp_lro_rx_done == "iterate the lro active list and call tcp_lro_flush", if the mbuf qsort is _not_ used. I thought tcp_lro_flush_all was just for the mbuf qsort, as you did, but as gallatin@ pointed out: if the mbuf qsort is not used, it just flushes the lro active list.
> By the way, the names of the routines are a bit awkward. tcp_lro_flush_all sounds like something that would be a simple loop around tcp_lro_flush but isn't. That threw me off initially and I was about to accept this revision.
See the above comment.
One unrelated comment: why doesn't tcp_lro_queue_mbuf() do a flush_all check right after adding the mbuf to lro_mbuf_data? There doesn't seem to be much point in waiting for the next enqueue to flush a list that's already full. Basically, move the flush_all to the end, like this:
    /* store sequence number */
    TCP_LRO_SEQUENCE(mb) = lc->lro_mbuf_count;
    /* enter mbuf */
    lc->lro_mbuf_data[lc->lro_mbuf_count++] = mb;
    /* check if array is full */
    if (__predict_false(lc->lro_mbuf_count == lc->lro_mbuf_max))
        tcp_lro_flush_all(lc);
This probably shouldn't have been committed without a gentle prod towards the Intel Networking group.
Was this tested on Intel hardware at all? If so, what model cards?
Sean:
While I agree a prod to Networking might have been in order, I think it's a generally
*good* idea here. I don't think having drivers know the internal make-up of the
tcp-lro code is good. I have had something similar running at NF for quite some time
now...
I very much doubt that any driver that was doing the SLIST thing,
and removing that to make it call the function that does the "SLIST" thing,
is going to have a problem...
IMO this really cleans up one of what I am sure are a number of little pieces of driver duplication, repeated over and over
again.. :-)
If you guys popped this onto machines with Intel networking cards, then I'm fine with it.