
tcp/lro: Implement hash table for LRO entries.
ClosedPublic

Authored by sepherosa_gmail.com on Jun 2 2016, 3:10 AM.
Details

Summary

This helps a lot when a large number of LRO entries is configured.

Obtained from: rrs, gallatin (Netflix)
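A rough sketch of the idea, assuming the packet already carries a 32-bit flow hash (e.g. Toeplitz/RSS). The type and function names below are illustrative only, not the committed code in sys/netinet/tcp_lro.c:

    #include <stdint.h>
    #include <sys/queue.h>

    struct lro_entry_sk {
        LIST_ENTRY(lro_entry_sk) next;
        uint32_t saddr, daddr;          /* IPv4 source/destination */
        uint16_t sport, dport;          /* TCP ports */
        /* ... aggregated segment state ... */
    };
    LIST_HEAD(lro_head_sk, lro_entry_sk);

    struct lro_ctrl_sk {
        struct lro_head_sk *lro_hash;   /* array of buckets */
        uint32_t lro_hashsz;            /* bucket count */
    };

    /* One modulus plus a short bucket walk, instead of scanning every active entry. */
    static struct lro_entry_sk *
    lro_lookup_sk(struct lro_ctrl_sk *lc, uint32_t hash, uint32_t saddr,
        uint32_t daddr, uint16_t sport, uint16_t dport)
    {
        struct lro_head_sk *bucket;
        struct lro_entry_sk *le;

        bucket = &lc->lro_hash[hash % lc->lro_hashsz];
        LIST_FOREACH(le, bucket, next) {
            if (le->saddr == saddr && le->daddr == daddr &&
                le->sport == sport && le->dport == dport)
                return (le);
        }
        return (NULL);  /* caller allocates a new entry and inserts it here */
    }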


Event Timeline

sepherosa_gmail.com retitled this revision to tcp/lro: Implement hash table for LRO entries.
sepherosa_gmail.com updated this object.
sepherosa_gmail.com edited the test plan for this revision.
hselasky edited edge metadata.

Hi Sepherosa,

There is no need to create the hash-table. It will just lead to more cache misses I think.

The hashing should be put into "tcp_lro_queue_mbuf()". Instead you should use the existing "tcp_lro_flush_all()" to get the mbufs sorted by hash and then re-use the existing lro_head structure to input them all.

Can you test that?

--HPS
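For readers following the thread, the queue-and-sort path referred to here looks roughly like this from a driver's point of view; the softc layout and hw_next_rx_mbuf() are hypothetical and only the two tcp_lro_*() calls are real API:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <netinet/tcp_lro.h>

    struct my_softc {                   /* hypothetical driver softc */
        struct lro_ctrl rx_lro;
        /* ... */
    };

    struct mbuf *hw_next_rx_mbuf(struct my_softc *);    /* hypothetical */

    /*
     * tcp_lro_queue_mbuf() stashes the mbuf together with its flow hash;
     * tcp_lro_flush_all() later sorts the queued mbufs so packets of the
     * same stream become adjacent, feeds them through the LRO engine, and
     * flushes everything.
     */
    static void
    my_rx_ring_poll(struct my_softc *sc)
    {
        struct lro_ctrl *lc = &sc->rx_lro;
        struct mbuf *m;

        while ((m = hw_next_rx_mbuf(sc)) != NULL) {
            /* m->m_pkthdr flowid/hashtype must be set, e.g. from RSS. */
            tcp_lro_queue_mbuf(lc, m);
        }
        tcp_lro_flush_all(lc);
    }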

This revision now requires changes to proceed. Jun 2 2016, 7:34 AM

Hi Sepherosa,

There is no need to create the hash-table. It will just lead to more cache misses I think.

It will benefit all drivers.

The hashing should be put into "tcp_lro_queue_mbuf()". Instead you should use the existing "tcp_lro_flush_all()" to get the mbufs sorted by hash and then re-use the existing lro_head structure to input them all.

Can you test that?

Well, to be frank, as far as we have tested in Hyper-V, choosing a proper number of mbufs to be used by the sorting is difficult. That's also why for Hyper-V we chose to aggregate fewer data segments and fewer ACKs. And a VM's CPU is slow compared with physical machines; extra sorting is probably going to hurt.

In D6689#141069, @hselasky wrote:

Hi Sepherosa,

There is no need to create the hash-table. It will just lead to more cache misses I think.

It will benefit all drivers.

It will hurt the existing tcp_lro_queue_mbuf(), which only needs one hash bucket; now multiple hash buckets, and also more memory, will be used for no good reason.

The hashing should be put into "tcp_lro_queue_mbuf()". Instead you should use the existing "tcp_lro_flush_all()" to get the mbufs sorted by hash and then re-use the existing lro_head structure to input them all.

Can you test that?

Well, to be frank, as far as we have tested in Hyper-V, choosing a proper number of mbufs to be used by the sorting is difficult.

Can you explain this a bit more? I don't understand. If you need to handle 64 streams per "ring" you should have 64 mbufs at least in the sorting array.

That's also why for Hyper-V we chose to aggregate fewer data segments and fewer ACKs. And a VM's CPU is slow compared with physical machines; extra sorting is probably going to hurt.

Please keep in mind that the hash-bucket approach is only accurate up to the number of hash bits you set up, which directly influences the memory usage, while the sorting approach is far more accurate and keeps the memory usage very low.

Further, when you're using a non-random hash, like summing the port number fields and IP addresses, it can easily be predicted to overflow a bucket. And then it is even more important to use the full size of the hash.

With the "tcp_lro_queue_mbuf()" approach you have 64-bits of hash. That means you could possibly store the IPv4 source and destination address into a 32-bit variable, and even the port numbers into an 8-bit field. That would give much better results than the hash-bucket approach, and will also protect against many low-speed streams drowning among high-speed ones.

Only in the single TCP stream case I've seen minor degradation.

--HPS
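The packing described above might look roughly like this; the field layout is purely illustrative and is not what the stack actually uses:

    #include <stdint.h>

    /*
     * Fold the connection identity into the upper bits of the 64-bit
     * per-mbuf sort value, so that sorting groups exact flows together
     * rather than flows that merely collide in a hash bucket.
     */
    static uint64_t
    lro_sort_key(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport)
    {
        uint64_t key;

        key  = (uint64_t)(saddr ^ daddr) << 32;             /* 32 bits of address identity */
        key |= (uint64_t)((sport ^ dport) & 0xff) << 24;    /* 8 bits of port identity */
        /* the low 24 bits remain free, e.g. for part of an RSS hash */
        return (key);
    }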

In D6689#141069, @hselasky wrote:

Hi Sepherosa,

Yo :)

There is no need to create the hash-table. It will just lead to more cache misses I think.

It will benefit all drivers.

It will hurt the existing tcp_lro_queue_mbuf(), which only needs one hash bucket; now multiple hash buckets, and also more memory, will be used for no good reason.

I see your point. I believe that can be easily fixed.

The hashing should be put into "tcp_lro_queue_mbuf()". Instead you should use the existing "tcp_lro_flush_all()" to get the mbufs sorted by hash and then re-use the existing lro_head structure to input them all.

Can you test that?

Well, to be frank, as far as we have tested in Hyper-V, choosing a proper number of mbufs to be used by the sorting is difficult.

Can you explain this a bit more? I don't understand. If you need to handle 64 streams per "ring" you should have 64 mbufs at least in the sorting array.

The trouble is that we don't know how many streams will be used by users of Azure or Hyper-V. Even our local testing runs 1~10K concurrent-connection tests. Setting the size of the mbuf list too large hurt performance last time I tried (data/ACKs were trapped in the driver for too long), but setting it too small hurts performance too, since a certain amount of LRO aggregation (4~8 data segments) is required to achieve good performance on a 10GbE/40GbE physical network.

That's also why for Hyper-V we chose to aggregate fewer data segments and fewer ACKs. And a VM's CPU is slow compared with physical machines; extra sorting is probably going to hurt.

Please keep in mind that the hash-bucket approach is only accurate up to the number of hash bits you set up, which directly influences the memory usage, while the sorting approach is far more accurate and keeps the memory usage very low.

Not sure what you mean by the "hash number of bits". Toeplitz results are pretty random and are 32 bits. And the hash table is intentionally created with a prime-sized bucket count to avoid the "hash & 127" side effect. rrs' testing at Netflix shows a good enough result: 95% of lookups take one check.
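To illustrate the "hash & 127" point: with a power-of-two bucket count only the low bits of the Toeplitz result pick the bucket, while a prime bucket count lets all 32 bits influence the choice. A toy comparison, illustrative only and not the code in this diff:

    #include <stdint.h>

    #define LRO_BUCKETS_POW2    128u    /* power-of-two table */
    #define LRO_BUCKETS_PRIME   127u    /* prime-sized table */

    static inline uint32_t
    bucket_pow2(uint32_t toeplitz_hash)
    {
        /* Equivalent to hash & 127: only the low 7 bits matter. */
        return (toeplitz_hash & (LRO_BUCKETS_POW2 - 1));
    }

    static inline uint32_t
    bucket_prime(uint32_t toeplitz_hash)
    {
        /* Every bit of the 32-bit hash affects the bucket index. */
        return (toeplitz_hash % LRO_BUCKETS_PRIME);
    }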

Further, when you're using a non-random hash, like summing the port number fields and IP addresses, it can easily be predicted to overflow a bucket. And then it is even more important to use the full size of the hash.

Well, that can be improved; I think we have many efficient hash functions in base, like Jenkins, and we can just pick one.

With the "tcp_lro_queue_mbuf()" approach you have 64-bits of hash. That means you could possibly store the IPv4 source and destination address into a 32-bit variable, and even the port numbers into an 8-bit field. That would give much better results than the hash-bucket approach, and will also protect against many low-speed streams drowning among high-speed ones.

Well, as I said the VM does not have the luxury to do sorting :)

Only in the single TCP stream case I've seen minor degradation.

--HPS

Well, as I said the VM does not have the luxury to do sorting :)

Can you test the existing "tcp_lro_flush_all()" in combination with "tcp_lro_queue_mbuf()" and compare it to your version? Sorting is not as slow as you think.

--HPS

Well, as I said the VM does not have the luxury to do sorting :)

Can you test the existing "tcp_lro_flush_all()" in combination with "tcp_lro_queue_mbuf()" and compare it to your version? Sorting is not as slow as you think.

--HPS

Sure, I can do it. I will make the number of mbufs a sysctl and let you know the result. But a number-of-mbufs sysctl is more difficult to tune than the current setting, since in Hyper-V we absolutely need to aggregate for each connection, but we can't aggregate too much for each connection. Aggregation is kind of a double-edged sword here :)
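A minimal sketch of what such a knob could look like, assuming a driver-local tunable; the node, variable name, and default below are made up and are not part of this diff:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    /* Hypothetical tunable: mbufs queued per RX ring before a sorted LRO flush. */
    static int lro_mbuf_depth = 512;

    SYSCTL_NODE(_net, OID_AUTO, lro_example, CTLFLAG_RW, 0,
        "example LRO tunables");
    SYSCTL_INT(_net_lro_example, OID_AUTO, mbuf_depth, CTLFLAG_RWTUN,
        &lro_mbuf_depth, 0,
        "mbufs queued per RX ring before a sorted LRO flush");

The depth could then be swept from userland (e.g. 128/256/512/...) with sysctl while the nginx+wrk runs are repeated.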

sepherosa_gmail.com edited edge metadata.

Address hps's cache-pollution concern for the queue_mbuf path. A similar approach was in the patch obtained from rrs.

Well, as I said the VM does not have the luxury to do sorting :)

Can you test the existing "tcp_lro_flush_all()" in combination with "tcp_lro_queue_mbuf()" and compare it to your version? Sorting is not as slow as you think.

--HPS

I will use the updated patch to test the hash and the queue_mbuf in Hyper-V. Thanks!

gallatin edited edge metadata.

Looks good in terms of not killing performance for the sorted case, and I'm fine with it as-is. However, maybe an else would be better than a goto?

Looks good in terms of not killing performance for the sorted case, and I'm fine with it as-is. However, maybe an else would be better than a goto?

Yeah, sure, I can move the bucket assignment into the braces and use if/else if/else to avoid the goto.
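In shape, the restructuring being discussed is roughly the following; the field names are illustrative and this is not the actual diff, which branches over its own three cases:

    /* Before: shared tail reached via a goto. */
        if (lc->lro_hash != NULL) {
            bucket = &lc->lro_hash[hash % lc->lro_hashsz];
            goto lookup;
        }
        bucket = &lc->lro_active;
    lookup:
        /* ... search/insert using "bucket" ... */

    /* After: bucket assignment moved into the branches, no label needed. */
        if (lc->lro_hash != NULL)
            bucket = &lc->lro_hash[hash % lc->lro_hashsz];
        else
            bucket = &lc->lro_active;
        /* ... search/insert using "bucket" ... */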

As for performance, I am currently running the test using the hash table. The result of the nginx+wrk test is quite promising (web object sizes range from 1KB~40KB; for 4 reqs/conn and 4~10K concurrent connections, I am getting a 12%~14% performance improvement :). The 14 reqs/conn nginx+wrk test is still running.

Hi,

Were you able to test the performance using tcp_lro_queue_mbuf() ?

Better name for function?? tcp_lro_rx2() -> tcp_lro_rx_sub()

--HPS

@sepherosa_gmail.com

Regarding performance: is it possible to get Hyper-V to sort the IP packets before they enter the FreeBSD network stack in the VM?

Hi,

Were you able to test the performance using tcp_lro_queue_mbuf() ?

Better name for function?? tcp_lro_rx2() -> tcp_lro_rx_sub()

--HPS

Yeah, it's planned this week :)

@sepherosa_gmail.com

Regarding performance: is it possible to get Hyper-V to sort the IP packets before they enter the FreeBSD network stack in the VM?

Not sure about that; I don't think it is delivering packets in sorted order currently.

sepherosa_gmail.com edited edge metadata.

Avoid goto, suggested by gallatin

rrs edited edge metadata.

These look fine and familiar ;-)

I think a hash table as an option to sorting is probably a good thing :D

In D6689#148252, @rrs wrote:

These look fine and familiar ;-)

I think a hash table as an option to sorting is probably a good thing :D

Yeah, and I got a pretty exciting performance improvement and a noticeable latency reduction with both methods in Azure for nginx 1KB~40KB web objects and 4/14 reqs/conn :)). I will post the results once the 128/256 mbuf queue depths for the sorting method are done (512/1024/2048/4096 have been completed so far).

In D6689#148252, @rrs wrote:

These look fine and familiar ;-)

I think a hash table as an option to sorting is probably a good thing :D

Yeah, and I got a pretty exciting performance improvement and a noticeable latency reduction with both methods in Azure for nginx 1KB~40KB web objects and 4/14 reqs/conn :)). I will post the results once the 128/256 mbuf queue depths for the sorting method are done (512/1024/2048/4096 have been completed so far).

If you are interested, I have posted the measurement results here:
results

For the nginx workloads, both latency and performance were measured. As you can see, both improved for almost all nginx workloads we have tested so far.

Both methods provide significant improvement for some kinds of workload:
1KB obj 14reqs/conn
8KB obj 14reqs/conn

This revision was automatically updated to reflect the committed changes.