Details

Reviewers

gallatin
rrs
sepherosa_gmail.com
gnn

Group Reviewers

transport

Summary

The kernel's qsort() routine can, in the worst case, perform O(N*N) comparisons before the result is sorted.

Because the sorting key is very small, 64 bits, we can use a bit-slice sorting algorithm instead, which is faster than mergesort() and comparable to qsort().
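
To illustrate the idea in the summary, here is a minimal sketch of one way a bit-slice sort over 64-bit keys can work. It is not the code from this diff: the names `highest_bit` and `bitslice_sort` are hypothetical, and it sorts bare `uint64_t` keys instead of the LRO entries. Each pass partitions the array on the most significant bit that still differs between keys, so the recursion depth is bounded by the key width (64) and the total work is about 64 passes over N keys, regardless of input order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: isolate the most significant set bit of x (x != 0). */
static uint64_t
highest_bit(uint64_t x)
{
	x |= x >> 1;
	x |= x >> 2;
	x |= x >> 4;
	x |= x >> 8;
	x |= x >> 16;
	x |= x >> 32;
	return (x ^ (x >> 1));
}

/* Hypothetical sketch: sort keys[0..n-1] in ascending order. */
static void
bitslice_sort(uint64_t *keys, size_t n)
{
	while (n > 1) {
		uint64_t ones = 0, zeros = 0, bit, tmp;
		size_t lo, i;

		/* Collect the bits that are set in some key and clear in another. */
		for (i = 0; i != n; i++) {
			ones |= keys[i];
			zeros |= ~keys[i];
		}
		ones &= zeros;
		if (ones == 0)
			return;		/* all remaining keys are equal */

		/* Partition on the most significant differing bit. */
		bit = highest_bit(ones);
		for (lo = i = 0; i != n; i++) {
			if (keys[i] & bit)
				continue;
			tmp = keys[lo];
			keys[lo] = keys[i];
			keys[i] = tmp;
			lo++;
		}

		/* The chosen bit differs, so both halves are non-empty. */
		assert(lo != 0 && lo != n);

		/* Recurse into the "bit clear" half, loop on the "bit set" half. */
		bitslice_sort(keys, lo);
		keys += lo;
		n -= lo;
	}
}
```

Because every level of recursion fixes one more key bit, the stack depth cannot exceed 64 frames, which is the property that makes this kind of sort safe where a worst-case qsort() is not.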

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

• hselasky updated this revision to Diff 16619. May 20 2016, 1:27 PM

• hselasky retitled this revision to Use optimised complexity safe sort routine instead of the kernel's qsort.

• hselasky updated this object.

• hselasky edited the test plan for this revision.

• hselasky added reviewers: gallatin, sepherosa_gmail.com, gnn, rrs.

• hselasky set the repository for this revision to rS FreeBSD src repository - subversion.

Herald added a reviewer: transport. May 20 2016, 1:27 PM

Herald added a subscriber: imp.

Implement suggestion from Drew:

Keep all data in same array to optimize cache line usage

Further optimize sorting function.
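
As a rough illustration of the "same array" suggestion above, the sketch below keeps the 64-bit key and its payload pointer in one array element, so comparing and moving an entry touches a single contiguous slot instead of two separate arrays. The names `sort_entry` and `entry_insertion_sort` are hypothetical and are not taken from the diff; insertion sort is used here only because it is short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: the sort key and its payload share one array slot. */
struct sort_entry {
	uint64_t key;		/* 64-bit sorting key */
	void *payload;		/* e.g. pointer to the object being sorted */
};

/* Insertion sort by key; every move carries the payload along with the key. */
static void
entry_insertion_sort(struct sort_entry *e, size_t n)
{
	size_t i, j;

	assert(e != NULL || n == 0);

	for (i = 1; i < n; i++) {
		struct sort_entry tmp = e[i];

		for (j = i; j > 0 && e[j - 1].key > tmp.key; j--)
			e[j] = e[j - 1];
		e[j] = tmp;
	}
}
```

With this layout the sort never dereferences the payload pointer to read a key, which is the cache-line benefit the comment refers to.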

Herald edited edge metadata. May 20 2016, 5:35 PM

kbowling added a subscriber: kbowling. May 20 2016, 6:01 PM

kbowling added inline comments.

sys/netinet/tcp_lro.c
384	Simple typo alorithm >> algorithm

Fix spelling. Found by Kevin.

Herald edited edge metadata. May 20 2016, 6:17 PM

Fixed.

Looks good to me.

Use ffsll() when it is provided by the CPU.

Suggested by Drew.

Herald edited edge metadata. May 23 2016, 3:29 PM

Use flsll() instead of ffsll(). Otherwise the sorting result will be bit-reversed.

--HPS
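
The difference between the two calls can be sketched as follows. ffsll() reports the least significant set bit and flsll() the most significant one, both as 1-based indexes with 0 meaning "no bits set". Partitioning on the least significant differing bit first orders keys by their reversed bit pattern, while the most significant differing bit gives numeric order. The helpers below are hypothetical portable stand-ins written for this example (`sketch_ffsll`, `sketch_flsll`, `pick_partition_bit`), not the kernel or libc implementations.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for ffsll(): 1-based index of the lowest set bit, 0 if x == 0. */
static int
sketch_ffsll(uint64_t x)
{
	int n;

	if (x == 0)
		return (0);
	for (n = 1; (x & 1) == 0; n++)
		x >>= 1;
	return (n);
}

/* Stand-in for flsll(): 1-based index of the highest set bit, 0 if x == 0. */
static int
sketch_flsll(uint64_t x)
{
	int n;

	for (n = 0; x != 0; n++)
		x >>= 1;
	return (n);
}

/*
 * Hypothetical helper: given the mask of bits that differ between keys,
 * return the single bit to partition on so that the result is in
 * numeric order (the most significant differing bit).
 */
static uint64_t
pick_partition_bit(uint64_t differing)
{
	assert(differing != 0);
	return ((uint64_t)1 << (sketch_flsll(differing) - 1));
}
```

For a mask of 0x28 (binary 101000), the ffsll-style helper selects bit 4 and the flsll-style helper selects bit 6, so only the latter splits the keys on their highest-order difference.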

Herald edited edge metadata. May 23 2016, 9:14 PM

Optimize sorting algorithm.

Herald edited edge metadata. May 24 2016, 1:57 PM

Great work -- faster AND safer

BTW, I tested a version of this patch in our Netflix code base. When serving roughly 80Gb/s with 80K TCP connections, the old method (qsort + tcp_lro_mbuf_compare_header) used 1.4% CPU, while the new (tcp_lro_sort) used 1.1% for LRO related sorting as measured by Intel Vtune. This test was done with a sysctl toggle to switch between qsort and the new sort.

That is why I mentioned that in addition to being safer (by limiting recursion), this is also faster.

If Drew says it works I am happy with it as well :-)

This revision is now accepted and ready to land. May 25 2016, 8:11 PM

Committed to head, r300731.

Use optimised complexity safe sort routine instead of the kernel's qsort
Closed, Public

Revision Contents
Changeset List

Diff 16802

sys/netinet/tcp_lro.h

sys/netinet/tcp_lro.c
