Use insertion sort instead of bubble sort in TCP LRO

Authored by • hselasky on May 28 2016, 8:44 PM.

Details

Reviewers

gallatin
pieter_degoeje.nl
gnn
ed

Group Reviewers

transport

Summary

Hi,

Replacing the bubble sort with insertion sort gives an 80% reduction in runtime on average (with randomized keys) for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and even if the keys are reversed, insertion sort is faster than bubble sort, although not by much. See below for measurements.
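For readers who want to see why pre-sorted input is the cheap case, here is a minimal sketch of insertion sort over plain 64-bit keys. The function name and the bare uint64_t array are assumptions for illustration; the actual code in tcp_lro.c sorts its own per-mbuf records and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: sorts an array of 64-bit keys in ascending order.
 * insertion_sort_u64() is a hypothetical name, not a function from the diff.
 */
static void
insertion_sort_u64(uint64_t *keys, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint64_t tmp = keys[i];
        size_t j = i;

        /*
         * Shift larger elements one slot to the right. On pre-sorted
         * input this loop body never runs, so the whole sort is one
         * linear scan.
         */
        while (j > 0 && keys[j - 1] > tmp) {
            keys[j] = keys[j - 1];
            j--;
        }
        keys[j] = tmp;
    }
}
```

Each element is written once per shift rather than swapped pairwise, which is one plausible reason it beats bubble sort even on reversed input.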

Suggested by: Pieter de Goeje

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

• hselasky updated this revision to Diff 17042. May 28 2016, 8:44 PM

• hselasky retitled this revision to "Use insertion sort instead of bubble sort in TCP LRO".

• hselasky updated this object.

• hselasky edited the test plan for this revision.

• hselasky added reviewers: gallatin, pieter_degoeje.nl.

• hselasky set the repository for this revision to rS FreeBSD src repository - subversion.

Herald added a reviewer: transport. May 28 2016, 8:44 PM

Herald added a subscriber: imp.

Don't change the bend-off constant.

Herald edited edge metadata. May 28 2016, 8:45 PM

Add comment from Ed Schouten about Radix Sort while at it.

Herald edited edge metadata. May 29 2016, 10:57 AM

In D6619#140014, @hselasky wrote:

Add comment from Ed Schouten about Radix Sort while at it.

Thanks! Tiny random remark (read: there is nothing you need to do, just something informational, which you may find interesting):

An interesting aspect of radix sort functions is that they can also be implemented without any recursion at all. This can be accomplished by sorting the list by just the least significant bit, followed by the next bit, until you finally reach the most significant bit. You end up with a fully sorted list, as long as you make sure that the sorting that you perform at every step is stable.

My guess is that such an algorithm is also quite friendly from a buffering/caching perspective, as the algorithm just consists of a couple of linear scans on the input.
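The non-recursive variant described above can be sketched as follows. This is a hedged illustration, not code from the review: the function name, the caller-supplied scratch buffer, and the use of bare uint64_t keys are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: least-significant-bit-first radix-2 sort.
 * 'scratch' must have room for n keys. Every pass is a stable split
 * into "bit clear" followed by "bit set", so after the pass over the
 * most significant bit the array is fully sorted.
 */
static void
lsd_radix2_sort_u64(uint64_t *keys, uint64_t *scratch, size_t n)
{
    for (unsigned bit = 0; bit < 64; bit++) {
        size_t zeros = 0;

        /* First linear scan: count the keys with this bit clear. */
        for (size_t i = 0; i < n; i++)
            if (((keys[i] >> bit) & 1) == 0)
                zeros++;

        /* Second linear scan: stable scatter into two runs. */
        size_t z = 0, o = zeros;
        for (size_t i = 0; i < n; i++) {
            if (((keys[i] >> bit) & 1) == 0)
                scratch[z++] = keys[i];
            else
                scratch[o++] = keys[i];
        }
        memcpy(keys, scratch, n * sizeof(*keys));
    }
}
```

There is no recursion and no data-dependent branching on partition sizes, but every pass touches all n keys, including keys that are already in their final order.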

In D6619#140018, @ed wrote:

An interesting aspect of radix sort functions is that they can also be implemented without any recursion at all. This can be accomplished by sorting the list by just the least significant bit, followed by the next bit, until you finally reach the most significant bit. You end up with a fully sorted list, as long as you make sure that the sorting that you perform at every step is stable.

My guess is that such an algorithm is also quite friendly from a buffering/caching perspective, as the algorithm just consists of a couple of linear scans on the input.

Bottom-up radix sort ends up moving a lot of elements into positions that are far from their eventual location. Top-down adaptive algorithms (as implemented here) are usually friendlier to the cache and faster on general-purpose hardware. They tend to sort sub-partitions fully before moving on, improving cache utilization.
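To make the top-down idea concrete, here is a rough sketch of a most-significant-bit-first radix-2 sort that falls back to insertion sort on small partitions. It is not the code from tcp_lro.c: the function names, the bare uint64_t keys, and the cut-off value of 12 are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define SMALL_PARTITION 12  /* hypothetical cut-off, not taken from the diff */

/* Insertion sort used once a partition is small enough. */
static void
small_insertion_sort(uint64_t *keys, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        uint64_t tmp = keys[i];
        size_t j = i;

        while (j > 0 && keys[j - 1] > tmp) {
            keys[j] = keys[j - 1];
            j--;
        }
        keys[j] = tmp;
    }
}

/*
 * Top-down radix-2 sort: split on 'bit' (start with 63), then sort
 * each half completely before touching the other one.
 */
static void
msd_radix2_sort(uint64_t *keys, size_t n, int bit)
{
    if (n <= SMALL_PARTITION || bit < 0) {
        small_insertion_sort(keys, n);
        return;
    }

    uint64_t mask = (uint64_t)1 << bit;
    size_t lo = 0, hi = n;

    /* In-place partition: keys with the bit clear move to the front. */
    while (lo < hi) {
        if ((keys[lo] & mask) == 0) {
            lo++;
        } else {
            hi--;
            uint64_t tmp = keys[lo];
            keys[lo] = keys[hi];
            keys[hi] = tmp;
        }
    }

    msd_radix2_sort(keys, lo, bit - 1);
    msd_radix2_sort(keys + lo, n - lo, bit - 1);
}
```

A caller would start with msd_radix2_sort(keys, n, 63). The recursion depth is bounded by the key width, and each recursive call works on a contiguous sub-range, which is the cache-friendliness argument made above.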

It would be interesting to test the impact of a radix greater than 2, for instance 256. The obvious advantage is that at most 8 passes are necessary, but each pass requires two sub-passes (key-indexed counting sort), plus the memory overhead of a 256-element histogram. Seeing as the current implementation already uses very little CPU, it is unlikely to be worth it. I like the current radix-2 implementation because it can use a very simple partition algorithm. Perhaps for some insanely high number of connections it would be worth it to revisit this stuff.
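One way to read the radix-256 idea is sketched below: eight key-indexed counting-sort passes over a 64-bit key, one byte per pass. This is an assumption-laden illustration only. The function names and the scratch buffer are hypothetical, and the passes run from the least significant byte upward, which may not be the variant the comment has in mind.

```c
#include <stddef.h>
#include <stdint.h>

/* One key-indexed counting-sort pass on the byte at 'shift'. */
static void
counting_pass_256(const uint64_t *src, uint64_t *dst, size_t n,
    unsigned shift)
{
    size_t count[256] = { 0 };  /* the 256-element histogram */

    /* Sub-pass 1: count how many keys fall into each bucket. */
    for (size_t i = 0; i < n; i++)
        count[(src[i] >> shift) & 0xff]++;

    /* Turn the counts into the start offset of each bucket. */
    size_t sum = 0;
    for (size_t b = 0; b < 256; b++) {
        size_t c = count[b];
        count[b] = sum;
        sum += c;
    }

    /* Sub-pass 2: stable scatter into the destination buffer. */
    for (size_t i = 0; i < n; i++)
        dst[count[(src[i] >> shift) & 0xff]++] = src[i];
}

/* 'scratch' must have room for n keys. */
static void
radix256_sort_u64(uint64_t *keys, uint64_t *scratch, size_t n)
{
    uint64_t *src = keys, *dst = scratch;

    for (unsigned shift = 0; shift < 64; shift += 8) {
        counting_pass_256(src, dst, n, shift);
        uint64_t *tmp = src;
        src = dst;
        dst = tmp;
    }
    /* Eight passes is an even number, so the result ends up in 'keys'. */
}
```

The two loops over the input in counting_pass_256() are the "two sub-passes" mentioned above, and the count array is the extra memory cost.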

/edit: removed nonsense

For Netflix workloads (~80Gb/s, ~80K connections), this seems to reduce the percentage of time spent in tcp_lro_sort() from ~1.1% down to 0.85%.

It seems to be an improvement for us, so I'm happy.

Thanks again,

Drew

Why are we implementing new sorts in the TCP stack? The stack should depend on generic sorting functions placed in a library.

@gnn

I don't see a problem with making tcp_lro_sort() into a library function that can sort any kind of mbufs based on a 64-bit criterion. Would this be a more "generic" approach? Then it should probably be renamed, for example to mbuf_sort_by_u64(), and moved into sys/kern, right?

The characteristics of tcp_lro_sort() are such that it can safely be used for network traffic, taking stack usage and runtime into consideration. It should be its own function with this specific set of properties.

--HPS

In D6619#140530, @gnn wrote:

Why are we implementing new sorts in the TCP stack? The stack should depend on generic sorting functions placed in a library.

Because Hans is exploiting domain-specific knowledge of mbufs sorted via RSS hash with this sort, and there is exactly a single user of the functionality. Let's not let the perfect be the enemy of the good here.

gnn accepted this revision. Jun 2 2016, 4:31 PM

gnn added a reviewer: gnn.

This revision is now accepted and ready to land. Jun 2 2016, 4:31 PM

Committed to FreeBSD-head as r301249.

Revision Contents
Changeset List

Path: sys/netinet/tcp_lro.c
Size: 16 lines

Diff 17043
