Differential D6689
tcp/lro: Implement hash table for LRO entries.
Authored by sepherosa_gmail.com on Jun 2 2016, 3:10 AM.

This helps a lot when you have a large number of LRO entries configured.

Obtained from: rrs, gallatin (Netflix)
Hi Sepherosa,

There is no need to create the hash table; it will just lead to more cache misses, I think. The hashing should be put into "tcp_lro_queue_mbuf()". Instead, you should use the existing "tcp_lro_flush_all()" to get the mbufs sorted by hash, and then re-use the existing lro_head structure to input them all. Can you test that?

--HPS
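For reference, a minimal sketch of the sorting path described above, assuming the lro_ctrl was set up with tcp_lro_init_args() and each mbuf carries an RSS hash in m_pkthdr.flowid; the ring-handler name and loop are hypothetical, and only tcp_lro_queue_mbuf() and tcp_lro_flush_all() are the API calls under discussion:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <netinet/tcp_lro.h>

/* Hypothetical RX completion handler for one ring. */
static void
rx_ring_input(struct lro_ctrl *lc, struct mbuf **pkts, int npkts)
{
	int i;

	/* Queue each packet; its hash is recorded for later sorting. */
	for (i = 0; i < npkts; i++)
		tcp_lro_queue_mbuf(lc, pkts[i]);

	/*
	 * Sort the queued mbufs by hash and feed them through the normal
	 * LRO aggregation, re-using the existing lro_head as suggested.
	 */
	tcp_lro_flush_all(lc);
}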
It will benefit all drivers.

Well, to be frank, as far as we have tested in Hyper-V, choosing a proper number of mbufs to be used by the sorting is difficult. That's also why for Hyper-V we chose to aggregate fewer data segments and fewer ACKs. And a VM's CPU is slow compared with physical machines, so the extra sorting is probably going to hurt.
Hi Sepherosa,
It will hurt the existing tcp_lro_queue_mbuf() path, which only needs one hash bucket: now multiple hash buckets, and therefore more memory, will be used for no good reason.
Can you explain this a bit more? I don't understand. If you need to handle 64 streams per "ring", you should have at least 64 mbufs in the sorting array.
Please keep in mind that the hash-bucket approach is only accurate up to the number of hash bits you set up, which directly influences memory usage, while the sorting approach is far more accurate and keeps memory usage very low. Further, when you're using a non-random hash, like summing the port-number fields and IP addresses, it can easily be predicted to overflow a bucket, and then it is even more important to use the full size of the hash. With the "tcp_lro_queue_mbuf()" approach you have 64 bits of hash. That means you could possibly store the IPv4 source and destination addresses into a 32-bit variable, and even the port numbers into an 8-bit field. That would give much better results than the hash-bucket approach, and would also protect against many low-speed streams drowning among high-speed ones. Only in the single-TCP-stream case have I seen minor degradation.

--HPS
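To make the 64-bit key idea concrete, here is one purely illustrative way to pack the connection identity into a sort key; the field layout and function name are assumptions, not the code in this review:

#include <sys/types.h>

/*
 * Illustrative only: fold the IPv4 addresses into the high 32 bits and
 * keep the port numbers verbatim in the low 32 bits, so sorting groups
 * packets of the same connection together without a bucket hash.
 * Assumes all values are in a consistent byte order.
 */
static __inline uint64_t
lro_sort_key(uint32_t ip_src, uint32_t ip_dst, uint16_t th_sport, uint16_t th_dport)
{

	return (((uint64_t)(ip_src ^ ip_dst) << 32) |
	    ((uint64_t)th_sport << 16) | th_dport);
}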
Yo :)

I see your point. I believe that can be easily fixed.
The trouble is that we don't know how many streams will be used by users of Azure or Hyper-V; even our local testing only covers 1~10K concurrent-connection tests. Setting the size of the mbuf list too large hurt performance the last time I tried (data/ACKs were trapped in the driver for too long), but setting it too small hurts performance too, since a certain amount of LRO aggregation (4~8 data segments) is required to achieve good performance on a 10GbE/40GbE physical network.
I'm not sure what you mean by "hash number of bits". Toeplitz results are pretty random, and the hash is 32 bits. And the hash table is intentionally created with a prime number of buckets to avoid the 'hash & 127' side effect (see the bucket-lookup sketch after this comment). rrs' testing at Netflix shows a good enough result: 95% of lookups take one check.
Well, that can be improved; I think we have many efficient hash functions in base, like jenkins, and we can just pick one.
Well, as I said, the VM does not have the luxury of doing sorting :)
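For readers following the bucket discussion, here is a minimal, hypothetical sketch of a prime-sized bucket lookup; the constant and names are invented for illustration and are not taken from the patch:

#include <sys/types.h>
#include <sys/queue.h>

/*
 * Illustrative only: with a power-of-two table, "hash & (nbuckets - 1)"
 * keeps just the low bits of the 32-bit Toeplitz/RSS hash (the
 * "hash & 127" effect), while a prime modulus lets every bit of the
 * hash influence the bucket index.
 */
#define	LRO_HASH_BUCKETS	127u	/* assumed prime bucket count */

struct lro_entry;			/* real definition is in netinet/tcp_lro.h */
LIST_HEAD(lro_head, lro_entry);

static __inline struct lro_head *
lro_bucket(struct lro_head *tbl, uint32_t rss_hash)
{

	return (&tbl[rss_hash % LRO_HASH_BUCKETS]);
}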
Can you test the existing "tcp_lro_flush_all()" in combination with "tcp_lro_queue_mbuf()" and compare it to your version? Sorting is not as slow as you think.

--HPS

Sure, I can do it. I will make the # of mbufs a sysctl and let you know the result (a rough sketch of such a knob is at the end of this thread). But a # of mbufs sysctl is more difficult to tune than the current setting, since in Hyper-V we absolutely need to aggregate for each connection, yet we can't aggregate too much per connection. Aggregation is kind of a double-edged sword here :)

Address hps's cache pollution concern for the queue_mbuf path. A similar approach was in the patch obtained from rrs.

Looks good in terms of not killing performance for the sorted case, and I'm fine with it as-is. However, maybe an else would be better than a goto?

Yeah, sure, I can move the bucket assignment into the {} and use if-else if-else to avoid the goto. As for performance, I am currently running the test using the hash table. The result of the nginx+wrk test is quite promising (web object sizes range from 1KB~40KB; for 4 reqs/conn and 4~10K concurrent connections, I am getting a 12%~14% performance improvement :). The 14 reqs/conn nginx+wrk test is still running.

Hi,

Were you able to test the performance using tcp_lro_queue_mbuf()?

Better name for the function? tcp_lro_rx2() -> tcp_lro_rx_sub()

--HPS

Regarding performance: is it possible to get Hyper-V to sort the IP packets before they enter the FreeBSD network stack in the VM?

These look fine and familiar ;-) I think a hash table as an option to sorting is probably a good thing :D

Yeah, and I got a pretty exciting performance improvement and a noticeable latency reduction with both methods in Azure, for nginx 1KB~40KB web objects and 4/14 reqs/conn :)). I will post the results once the 128/256 mbuf queue depths for the sorting method are done (512/1024/2048/4096 have been completed so far).

If you are interested, I have the measurement results posted here:

For the nginx workload, both latency and performance are measured. As you can see, both performance and latency improved for almost all nginx workloads we have tested so far. Both methods provide significant improvement for some kinds of workload:
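As a hedged illustration of the "# of mbufs" sysctl mentioned above, the knob could look roughly like this; the node name, variable, default, and description are all assumptions, not the tunable that was actually added:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/* Hypothetical tunable: how many mbufs a ring queues before flushing. */
static int lro_mbufq_depth = 128;	/* assumed default */

SYSCTL_DECL(_net_inet_tcp);
SYSCTL_NODE(_net_inet_tcp, OID_AUTO, lro, CTLFLAG_RW, 0, "TCP LRO");
SYSCTL_INT(_net_inet_tcp_lro, OID_AUTO, mbufq_depth, CTLFLAG_RWTUN,
    &lro_mbufq_depth, 0,
    "Number of mbufs queued per RX ring before tcp_lro_flush_all() is called");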