Page MenuHomeFreeBSD

Use insertion sort instead of bubble sort in TCP LRO
ClosedPublic

Authored by hselasky on May 28 2016, 8:44 PM.
Tags
None
Referenced Files
Unknown Object (File)
Oct 3 2024, 6:26 PM
Unknown Object (File)
Sep 16 2024, 9:06 AM
Unknown Object (File)
Sep 16 2024, 9:06 AM
Unknown Object (File)
Sep 16 2024, 9:06 AM
Unknown Object (File)
Sep 16 2024, 9:06 AM
Unknown Object (File)
Sep 16 2024, 8:54 AM
Unknown Object (File)
Sep 12 2024, 2:35 AM
Unknown Object (File)
Sep 12 2024, 2:34 AM
Subscribers

Details

Reviewers
gallatin
pieter_degoeje.nl
gnn
ed
Group Reviewers
transport
Summary

Hi,

Replacing the bubble sort with insertion sort gives an 80% reduction in runtime on average (with randomized keys) for small partitions.

If the keys are pre-sorted, insertion sort runs in linear time, and even if the keys are reversed, insertion sort is faster than bubble sort, although not by much. See below for measurements.

Suggested by: Pieter de Goeje

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

hselasky retitled this revision from to Use insertion sort instead of bubble sort in TCP LRO.
hselasky updated this object.
hselasky edited the test plan for this revision. (Show Details)
hselasky set the repository for this revision to rS FreeBSD src repository - subversion.
hselasky edited edge metadata.

Don't change the bend-off constant.

hselasky updated this object.
hselasky added a reviewer: ed.

Add comment from Ed Schouten about Radix Sort while at it.

Add comment from Ed Schouten about Radix Sort while at it.

Thanks! Tiny random remark (read: there is nothing you need to do, just something informational, which you may find interesting):

An interesting aspect of radix sort functions is that they can also be implemented without any recursion at all. This can be accomplished by sorting the list by just the least significant bit, followed by the next bit, until you finally reach the most significant bit. You end up with a fully sorted list, as long as you make sure that the sorting that you perform at every step is stable.

My guess is that such an algorithm is also quite friendly from a buffering/caching perspective, as the algorithm just consists of a couple of linear scans on the input.

In D6619#140018, @ed wrote:

An interesting aspect of radix sort functions is that they can also be implemented without any recursion at all. This can be accomplished by sorting the list by just the least significant bit, followed by the next bit, until you finally reach the most significant bit. You end up with a fully sorted list, as long as you make sure that the sorting that you perform at every step is stable.

My guess is that such an algorithm is also quite friendly from a buffering/caching perspective, as the algorithm just consists of a couple of linear scans on the input.

Bottom up radix sort ends up moving a lot of elements in positions that are far from their eventual location. Top-down adaptive algorithms (as implemented here) are usually friendlier for cache and faster on general purpose hardware. They tend to sort sub partitions fully before moving on, improving cache utilization.

It would be interesting to test the impact of a radix greater than 2, for instance 256. The obvious advantage is that at most 8 passes are necessary, but each pass requires two sub passes (key-indexed counting sort), plus the memory overhead of a 256-element histogram. Seeing as the currently implementation already uses very little CPU, it is unlikely to be worth it. I like the current radix-2 implementation because it can use a very simple partition algorithm. Perhaps for some insanely high number of connections it would be worth it to revisit this stuff.

/edit: removed nonsense

gallatin edited edge metadata.

For Netflix workloads (~80Gb/s, ~80K connections), this seems to reduce the percentage of time spent in tcp_lro_sort() from ~1.1% down to 0.85%.

So it seems to be an improvement for us, so I"m happy.

Thanks again,

Drew

Why are we implementing new sorts in the TCP stack? The stack should depend on generic sorting functions placed in a library.

@gnn

I don't see a problem to make the tcp_lro_sort() into a library function which can sort any kind of mbufs based on a 64-bit criteria. Would this be a more "generic" approach? Then it should probably be re-named for example like, mbuf_sort_by_u64() and moved into sys/kern - right?

The characteristics of tcp_lro_sort() is such it can safely be used for network type of traffic, taking stack usage and runtime into consideration. It should be its own function with these specific set of properties.

--HPS

In D6619#140530, @gnn wrote:

Why are we implementing new sorts in the TCP stack? The stack should depend on generic sorting functions placed in a library.

Because Hans is exploiting domain-specific knowledge of mbufs sorted via RSS hash with this sort, and there is exactly a single user of the functionality. Let's not let the perfect be the enemy of the good here.

gnn added a reviewer: gnn.
This revision is now accepted and ready to land.Jun 2 2016, 4:31 PM

Committed to FreeBSD-head as r301249.