emeric.poupon_stormshield.eu (Emeric POUPON)
User

Projects

User does not belong to any projects.

User Details

User Since
Oct 21 2015, 3:01 PM (116 w, 6 d)

Recent Activity

Nov 3 2017

sheda_fsfe.org awarded D10384: Make crypto(9) multi thread a Hungry Hippo token.
Nov 3 2017, 3:03 PM
emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

Moved function to simplify diff

Nov 3 2017, 8:31 AM

Oct 23 2017

emeric.poupon_stormshield.eu updated the summary of D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
Oct 23 2017, 7:15 AM

Oct 19 2017

emeric.poupon_stormshield.eu added a comment to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
In D10680#264139, @ae wrote:

I finally played a bit with this patch. I used if_ipsec(4) tunnel between two hosts and iperf TCP test. With disabled async_crypto I have ~720Mbit/s, with enabled async_crypto it is 5.2Gbit/s.

Oct 19 2017, 4:18 PM
emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

Fixed a typo in a comment and the 32-bit build

Oct 19 2017, 4:16 PM

Oct 17 2017

emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

Changed sysctl name to "net.inet.ipsec.async_crypto"

Oct 17 2017, 12:20 PM

Oct 2 2017

emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

Style remarks

Oct 2 2017, 7:59 AM
emeric.poupon_stormshield.eu added inline comments to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
Oct 2 2017, 7:53 AM

Sep 29 2017

emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

Updated following jmg's remarks. I hope this is easier to understand now.

Sep 29 2017, 11:43 AM
emeric.poupon_stormshield.eu added a comment to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

@jmg, I will post a new review soon with the changes you suggest.

Sep 29 2017, 7:47 AM

Sep 25 2017

emeric.poupon_stormshield.eu added inline comments to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
Sep 25 2017, 12:40 PM

Sep 12 2017

emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

Updated the crypto(9) man page to reflect the changes

Sep 12 2017, 11:36 AM

Aug 29 2017

emeric.poupon_stormshield.eu added inline comments to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
Aug 29 2017, 6:27 AM

Aug 25 2017

emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

As jhb suggested, I added a flag for drivers that may benefit from multi-CPU dispatch.
This provides a way to keep using direct dispatch for hardware crypto drivers.

Aug 25 2017, 4:08 PM

Aug 21 2017

emeric.poupon_stormshield.eu added inline comments to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
Aug 21 2017, 3:47 PM

Jul 28 2017

emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
Jul 28 2017, 8:18 AM
emeric.poupon_stormshield.eu added inline comments to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
Jul 28 2017, 7:52 AM

Jun 12 2017

emeric.poupon_stormshield.eu updated the diff for D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

Now deleting the taskqueue in crypto_destroy()

Jun 12 2017, 11:11 AM

May 30 2017

emeric.poupon_stormshield.eu added inline comments to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
May 30 2017, 11:45 AM

May 19 2017

emeric.poupon_stormshield.eu added a comment to D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.

Any thoughts on this review?

May 19 2017, 7:04 AM

May 11 2017

emeric.poupon_stormshield.eu created D10680: IPSec performance increase in single flow mode by making crypto(9) multi thread.
May 11 2017, 12:15 PM

Apr 20 2017

emeric.poupon_stormshield.eu added a comment to D10384: Make crypto(9) multi thread.

The taskqueue API could possibly be expanded to support this, but I don't think it could easily be expanded to cover the use case I mentioned in the previous comment: a single invocation handling multiple streaming requests.

I ran some further tests. A single queue shared between several worker threads appears far more efficient than round-robin dispatch to several workers that each have their own queue (none of the kernel threads is pinned to a CPU).
The problem is that on a dual-socket system (2*6 cores) performance is really bad: all the time is spent in the mutex protecting the single job queue.
What would you suggest to get decent performance? Round robin over several queues, each served by several worker threads?

Are you making sure you're only waking one thread when adding to the queue (and not waking up all 12 threads to hammer on the mutex)? Also, a wake-one is only effective if the worker thread, once woken, processes one item before going back to sleep; otherwise you'll be waking threads that do nothing (but hammer the mutex) before going back to sleep. If your worker threads poll the queue for work, you need an active flag which, when set, suppresses wakeups when an item is enqueued. It is then up to the worker thread to detect the backlog and wake additional worker threads once the backlog holds an average of more than one item per woken worker thread. This average should probably be kept over multiple iterations unless you care more about latency than CPU usage.
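
The wakeup policy described in this comment can be modelled without any real threads. The following is a minimal single-threaded sketch in C, not code from the review: the `wq_model` structure and the `wq_enqueue`, `wq_dequeue` and `wq_sleep` functions are hypothetical names, and the instantaneous backlog is used where the comment suggests an average kept over several iterations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the shared job queue state (names are illustrative). */
struct wq_model {
	size_t backlog;   /* items waiting in the shared queue */
	size_t awake;     /* worker threads currently awake */
	size_t nworkers;  /* total number of worker threads */
	bool   active;    /* set while at least one worker is polling the queue */
};

/*
 * Producer side: add one item. Returns the number of wakeups to issue:
 * one if no worker is active, zero otherwise (never a broadcast).
 */
static size_t
wq_enqueue(struct wq_model *q)
{
	q->backlog++;
	if (q->active)
		return (0);
	q->active = true;
	q->awake = 1;
	return (1);
}

/*
 * Worker side: called by an awake worker when it takes one item.
 * Returns one if an additional worker should be woken, i.e. the backlog
 * holds more than one item per awake worker, and zero otherwise.
 */
static size_t
wq_dequeue(struct wq_model *q)
{
	assert(q->backlog > 0 && q->awake > 0);
	q->backlog--;
	if (q->backlog > q->awake && q->awake < q->nworkers) {
		q->awake++;
		return (1);
	}
	return (0);
}

/* Worker side: called when a worker finds the queue empty and goes to sleep. */
static void
wq_sleep(struct wq_model *q)
{
	assert(q->awake > 0);
	q->awake--;
	if (q->awake == 0)
		q->active = false;
}
```

In a real implementation each of these functions would run with the queue mutex held, and a returned wakeup would map to a wake-one primitive rather than a broadcast.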

Apr 20 2017, 10:02 AM

Apr 18 2017

emeric.poupon_stormshield.eu added a comment to D10384: Make crypto(9) multi thread.

Our eventual goal is to use crypto(9) from IPsec to accelerate single-flow processing. IPsec does not guarantee packet ordering (neither does IP), but reordered packets would certainly be harmful for some end-user applications.
The same crypto session may be used for several flows coming from the NIC on several CPUs. Packets need to stay ordered within each flow on each CPU, but losing the ESP packet order on output does not really matter, since the anti-replay window handles that on the remote host.
That's why I think it would be nice for crypto(9) users to keep ordering when dispatching the jobs.

No, this is a requirement of the IPsec layer; not all users of crypto(9) require this. For example, disk encryption does not need this, as the upper layers ensure that writes are ordered correctly (ZFS and UFS both do this). And by forcing order, you are increasing latency unnecessarily for other consumers.

This isn't hard to handle at the IPsec layer. You use a TAILQ to enqueue the packets, each in a simple data structure with a flag that gets set when the packet is completed. Then, each time a packet completes, you send packets from the head of the TAILQ for as long as the head is ready. It's not hard, and it keeps the ordering logic where it belongs. Alternatively, you add a flag to crypto(9) and put the logic there, but you need to allow non-ordered operation.
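
The reordering scheme suggested here can be sketched with the TAILQ macros from `<sys/queue.h>`. This is only an illustration under stated assumptions: `struct pkt`, `reorder_submit` and `reorder_complete` are hypothetical names, writing a sequence number to `out[]` stands in for actually sending the packet, and the locking a kernel implementation would need is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical per-packet record; field and function names are illustrative. */
struct pkt {
	TAILQ_ENTRY(pkt) link;
	int  seq;    /* submission order, used here only for checking */
	bool done;   /* set when the crypto operation has completed */
};

TAILQ_HEAD(pktq, pkt);

/* Record the packet in submission order before handing it to crypto. */
static void
reorder_submit(struct pktq *q, struct pkt *p)
{
	p->done = false;
	TAILQ_INSERT_TAIL(q, p, link);
}

/*
 * Completion path: mark the packet done, then release every completed
 * packet found at the head of the queue. Returns the number of packets
 * released; their sequence numbers are written to out[] (at most max).
 */
static size_t
reorder_complete(struct pktq *q, struct pkt *p, int *out, size_t max)
{
	struct pkt *head;
	size_t n = 0;

	p->done = true;
	while (n < max && (head = TAILQ_FIRST(q)) != NULL && head->done) {
		TAILQ_REMOVE(q, head, link);
		out[n++] = head->seq;   /* stand-in for sending the packet */
	}
	return (n);
}
```

A packet that completes early simply waits in the queue until everything submitted before it has completed, so the output order matches the submission order regardless of the completion order.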

This is maybe a bit more difficult, since we would need to reorder packets only within the flows that may share the same crypto session, but I get your idea. Maybe a reordering queue per CPU would do the job, since we expect each flow to be pinned to the same CPU.

Apr 18 2017, 1:53 PM

Apr 14 2017

emeric.poupon_stormshield.eu added a comment to D10384: Make crypto(9) multi thread.
In D10384#215403, @jmg wrote:

As per other comments in the code, ordering does not have to be maintained. With the async nature of callbacks, it is already assumed that the callers can handle this.

Apr 14 2017, 9:44 PM

Apr 13 2017

emeric.poupon_stormshield.eu created D10384: Make crypto(9) multi thread.
Apr 13 2017, 3:06 PM
emeric.poupon_stormshield.eu accepted D10375: Add large replay window support to setkey(8) and improve setkey's debugging.

Thanks for completing the job!

Apr 13 2017, 7:42 AM

Nov 17 2016

emeric.poupon_stormshield.eu added a reviewer for D8468: IPSec: support for large replay windows: bz.
Nov 17 2016, 1:06 PM
emeric.poupon_stormshield.eu updated the diff for D8468: IPSec: support for large replay windows.

Added missing m_cat call.

Nov 17 2016, 10:13 AM
emeric.poupon_stormshield.eu added inline comments to D8468: IPSec: support for large replay windows.
Nov 17 2016, 9:04 AM

Nov 16 2016

emeric.poupon_stormshield.eu updated the diff for D8468: IPSec: support for large replay windows.

Hello,
Please find an updated version using extension messages.
I also checked the new behavior with a patched version of charon.

Nov 16 2016, 2:45 PM

Nov 10 2016

emeric.poupon_stormshield.eu added inline comments to D8468: IPSec: support for large replay windows.
Nov 10 2016, 7:40 AM

Nov 9 2016

emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.

delphij: is that OK for you?

Nov 9 2016, 10:41 AM · Core Team

Nov 8 2016

emeric.poupon_stormshield.eu added inline comments to D8468: IPSec: support for large replay windows.
Nov 8 2016, 4:58 PM
emeric.poupon_stormshield.eu retitled D8468 from an empty title to IPSec: support for large replay windows.
Nov 8 2016, 9:14 AM

Nov 7 2016

emeric.poupon_stormshield.eu updated the diff for D8130: Split arc4random mutexes to improve performance on IPSec traffic.

Updated diff using full context, changed code according to delphij's comment

Nov 7 2016, 10:01 AM · Core Team
emeric.poupon_stormshield.eu added inline comments to D8130: Split arc4random mutexes to improve performance on IPSec traffic.
Nov 7 2016, 9:57 AM · Core Team

Oct 7 2016

emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.

Is that OK for you?

Oct 7 2016, 3:43 PM · Core Team
emeric.poupon_stormshield.eu updated the diff for D8130: Split arc4random mutexes to improve performance on IPSec traffic.

Updated diff according to ache's remarks

Oct 7 2016, 8:11 AM · Core Team
emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.
if (atomic_cmpset_int(&arc4rand_iniseed_state, ARC4_ENTR_HAVE, ARC4_ENTR_SEED) ||
    reseed) {
        ARC4_FOREACH(arc4)
                arc4_randomstir(arc4);
}
...
if (arc4->numruns > ARC4_RESEED_BYTES || 
    tv.tv_sec > arc4->t_reseed)
        arc4_randomstir(arc4);

This is nearly what I proposed before, but why would you reseed all the instances in case the user calls arc4rand with reseed = 1?
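
The condition in the snippet above combines a one-shot state transition with the caller's `reseed` flag. As a rough userland illustration of that logic, assuming C11 `<stdatomic.h>` in place of the kernel's `atomic_cmpset_int()`, the sketch below uses hypothetical names (`ENTR_*`, `iniseed_state`, `claim_initial_reseed`, `must_stir_all`) that are not taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the ARC4_ENTR_* states in the snippet. */
enum { ENTR_NONE, ENTR_HAVE, ENTR_SEED };

static atomic_int iniseed_state = ENTR_NONE;

/*
 * Try to move the state from HAVE to SEED. The compare-and-swap succeeds
 * for exactly one caller after good entropy becomes available, so only
 * that caller sees true.
 */
static bool
claim_initial_reseed(void)
{
	int expected = ENTR_HAVE;

	return (atomic_compare_exchange_strong(&iniseed_state, &expected,
	    ENTR_SEED));
}

/*
 * Mirrors the condition in the snippet: stir every instance when this
 * caller wins the HAVE -> SEED transition, or when it passes reseed.
 */
static bool
must_stir_all(bool reseed)
{
	return (claim_initial_reseed() || reseed);
}
```

With this condition, any caller passing `reseed` triggers a stir of every instance, which is the behavior the question above is about. The sketch also shows the window discussed in this thread: once the state reads SEED, other callers proceed even if the winner has not finished stirring yet.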

Oct 7 2016, 7:21 AM · Core Team

Oct 6 2016

emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.
In D8130#169388, @ache wrote:

There can be a small window when arc4rand_iniseed_state is already set to SEED but arc4_randomstir() has not been called yet, and right at that time another thread calls the code. Well, we miss only a single reinitialization per CPU (more would be time-consuming and can't fit between the check and the immediate function call); on the next call they will already be reinitialized. Do you mean another scenario?

Yes, that is what I meant. I am not sure whether this is a big security concern.

Oct 6 2016, 4:13 PM · Core Team
emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.
In D8130#169387, @ache wrote:

The problem is that another thread may call the arc4rand() function and get some unsafe random bytes before the stir actually occurs and after the state was set to SEED.

Please show how it can happen, step by step. In the old code, if the state was SEED, it was surely reinitialized, and reinitialization itself holds a lock, so the RNG can't move further and waits for reinitialization (from any thread).

Oct 6 2016, 4:10 PM · Core Team
emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.
In D8130#169376, @ache wrote:

I meant the arc4rand_iniseed_state variable.

Me too. Without locking or atomics, either one additional seeding can happen or no seeding can happen at all (a CPU writing to the other half of a word that has already been checked).

I do agree with that, but on the very first call of arc4rand(), we can make sure arc4_randomstir() is called,

Not all stirring is equal :) Non-random stirring is done just to avoid blocking arc4 during the early boot phase.
When good randomness becomes available (which arc4rand_iniseed_state indicates), it must be reinitialized immediately on the next call.

Oct 6 2016, 12:11 PM · Core Team
emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.
In D8130#169374, @ache wrote:

But I am not sure to see the actual benefit of this atomic?

It avoids having to play with locking.

I meant the arc4rand_iniseed_state variable.

Oct 6 2016, 10:47 AM · Core Team
emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.

That's a good point indeed.

Oct 6 2016, 10:28 AM · Core Team
emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.

Do you have other remarks on this patch?

Oct 6 2016, 9:13 AM · Core Team

Oct 5 2016

emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.

Thanks for working on this. I have some different ideas that I don't have time to sit down and implement right now, but that I feel like sharing and that may be useful:

I think we could take a different approach: now that we have the pseudo-random state per-CPU, perhaps we can make them CPU-bound and completely eliminate the locks and use critical sections instead?

Oct 5 2016, 7:07 AM · Core Team

Oct 4 2016

emeric.poupon_stormshield.eu updated the diff for D8130: Split arc4random mutexes to improve performance on IPSec traffic.

Changed malloc test to KASSERT

Oct 4 2016, 1:29 PM · Core Team
emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.
In D8130#168678, @ache wrote:

Sorry, my previous comment was sent unedited. Please forget it if you got it through the mail and re-read here instead.

Oct 4 2016, 8:47 AM · Core Team
emeric.poupon_stormshield.eu added a comment to D8130: Split arc4random mutexes to improve performance on IPSec traffic.
In D8130#168667, @ache wrote:

It would be better to use the non-failing malloc flag. arc4_init() can't fail.

Oct 4 2016, 7:41 AM · Core Team

Oct 3 2016

emeric.poupon_stormshield.eu updated the diff for D8130: Split arc4random mutexes to improve performance on IPSec traffic.

Sorry, forgot to remove a commented line

Oct 3 2016, 12:10 PM · Core Team
emeric.poupon_stormshield.eu retitled D8130 from an empty title to Split arc4random mutexes to improve performance on IPSec traffic.
Oct 3 2016, 12:02 PM · Core Team