User Details
- User Since
- Jun 22 2015, 5:21 PM (296 w, 3 d)
Tue, Feb 23
Mon, Feb 22
Fri, Feb 19
Thu, Feb 18
Wed, Feb 17
Maybe add a comment for the else cases saying that they are ChaCha?
I've tested this on a Netflix server, and can confirm that it works fine, and eliminates spinlock contention on the gic cmd queue.
Tue, Feb 16
I've tested this on Netflix servers, and it seems to work.
Are legacy interrupts supported on this platform? Is this code also used for legacy interrupts?
I'm shocked that this feature was not present before.
This is nothing short of amazing. It cuts CPU use on an original Netflix 100G server (16 core Broadwell Xeon 2697A v4) from ~62% to ~52% at roughly 92Gb/s of a 100% ktls Netflix video serving workload.
Fri, Feb 12
Thu, Feb 11
Wed, Feb 10
Tue, Feb 9
This is pretty exciting. I can try this on amd64 and let you know how much it helps. It won't be until next week, though, as I have a few things going on this week, and it looks like I may need to change ktls_isa-l_crypto-kmod to handle iniovcnt != outiovcnt.
Sun, Feb 7
The producer and consumer certainly may be on different CPUs.
Fri, Feb 5
Wed, Jan 27
This probably needs to be MFC'ed to 13.
Jan 22 2021
Note that we (Netflix) have been running this patch in evolving state since 2017 in production.
Jan 21 2021
Jan 18 2021
Jan 15 2021
Jan 14 2021
Thank you for catching this.
Jan 13 2021
OK, thanks, that makes sense. I was not considering a multi-frag case.
Jan 12 2021
What's the path to having a null pointer as pf_rv?
Jan 11 2021
Jan 8 2021
Jan 7 2021
There is already a mechanism drivers can use to build an array of packets, which is used by LRO, e.g. tcp_lro_queue_mbuf(). It was quite helpful, not so much for the batching, but because it sorts the array to put packets from the same TCP connection adjacent to each other.
After talking with our AMD rep, AMD's advice for this is:
Might it be better to just make a new if_input method which takes an array of mbufs? List processing for a batch of packets seems like a recipe for cache misses.
Jan 6 2021
I tested the kernel part of this on a Netflix server with an EPYC 7502P using TSC-low. It saves roughly 2% CPU and tsc disappears entirely from profiling output, making profiling output match that from our Xeons much more closely.
Dec 19 2020
Thanks so much for reviewing. I've made the changes you requested in the patch I'm about to commit.
Dec 18 2020
Rebase patch to r368767
Rebase patch as of r368767
Dec 14 2020
Dec 13 2020
Dec 6 2020
Dec 4 2020
This looks correct to me.
Nov 25 2020
Nov 24 2020
A few more thoughts based on a conversation I had with somebody: in order for mbuf sorting to be effective, you need to have a large number of packets received per irq. Sorting a dozen mbufs is not useful when you're already aggregating 64 different connections, for example. So AIM (or very large irq coalescing timer/packet settings) is required to get enough packets per irq to make mbuf sorting worthwhile.
Approving for just myself for now until the discussion regarding AIM / LRO is wrapped up.
Nov 20 2020
Nov 19 2020
After this change, perl freaks out like this:
Nov 18 2020
Nov 16 2020
Nov 12 2020
Nov 11 2020
Oct 30 2020
Isn't errno different in Linux on different architectures? I know that at least alpha had very different errno values than x86.
Oct 29 2020
It's in my branch, but not in our main branch.
Oct 22 2020
Oct 19 2020
Oct 12 2020
Oct 10 2020
Awesome. Thank you!
Oct 9 2020
Can we add some __predict_false() annotations to let the compiler (and the reader) know these are rare?
Oct 8 2020
Oct 6 2020
Oct 5 2020
Oct 1 2020
Thank you for making these changes.
Sep 25 2020
Sep 21 2020
Sep 11 2020
Is there a file which is already perfectly style(9) compliant? Is it this file?
Sep 7 2020
What command-line arguments, or other clang-format configuration, were used here?