Looks good and useful!
Jun 28 2018
May 25 2018
In D15389#328341, @cem wrote:
> In D15389#328317, @emeric.poupon_stormshield.eu wrote:
>> Nobody wants this? It is definitely better than the current code though.
Sure, it improves performance (I haven't tested, but I'm quite willing to take your word for it) of a single OCF driver in the case that some consumer is allocating a lot of sessions (or maybe a reasonable number of sessions on a high core-count CPU). But that doesn't make it definitely better.
It adds complexity and perpetuates the problem that every OCF driver needs to copy and paste crappy session management. I think a general solution that moves the session management into OCF and out of drivers is a better way to fix this. You don't need the RB tree if sessions are just pointers.
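The "sessions are just pointers" idea can be sketched as follows. This is a hypothetical userspace illustration, not the actual OCF or driver code; all names are invented:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * If the framework hands a driver back an opaque pointer instead of an
 * integer session ID, the per-driver lookup structure (RB tree, hash
 * table) disappears entirely.
 */
struct drv_session {
	uint32_t alg;
	/* ... expanded key schedule, per-session state ... */
};

/* ID-based driver: every request pays a tree/hash walk, e.g.
 *   struct drv_session *drv_lookup(uint32_t sid);
 */

/* Pointer-based driver: the framework stores the opaque handle and
 * passes it back with each request, so the "lookup" is just a cast. */
static void *
drv_newsession(uint32_t alg)
{
	struct drv_session *ses = calloc(1, sizeof(*ses));

	if (ses != NULL)
		ses->alg = alg;
	return (ses);
}

static struct drv_session *
drv_session(void *handle)
{
	return ((struct drv_session *)handle);	/* O(1), no shared state */
}
```

With this scheme the framework owns allocation policy and the driver only sees its own state, which is the point of lifting session management out of the drivers.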
May 24 2018
I have two questions:
- is it OK to leave the old SP in place in case of error? I guess it is, but in the previous implementation the SP was removed if it could not be updated, which I think is really questionable.
- is it OK to hold a mutex during a printf? I guess not, but there are already some examples doing that in this file.
Nobody wants this? It is definitely better than the current code though.
May 16 2018
Remove bad comment
May 15 2018
ae's remarks
May 14 2018
Remarks
In D15389#324522, @cem wrote:
> I agree that the current system sucks and that it does not scale with multiple sessions. I'm not sure a binary tree is the right replacement, though. And probably session management should be lifted out into OCF (OpenCrypto Framework) so that all drivers can benefit from it without copy-pasting.
OCF has a number of shortcomings which John detailed here: https://lists.freebsd.org/pipermail/freebsd-arch/2018-January/018835.html
May 11 2018
Apr 23 2018
ae's remarks
Apr 20 2018
Update the maxentries sysctl once the hashtable has been initialized, in order to reflect the actual size used.
Apr 13 2018
In D15050#317087, @ae wrote:
> Can you provide some numbers on what benefits you get with this patch?
Apr 12 2018
Nov 3 2017
Moved function to simplify diff
Oct 23 2017
Oct 19 2017
In D10680#264139, @ae wrote:
> I finally played a bit with this patch. I used an if_ipsec(4) tunnel between two hosts and an iperf TCP test. With async_crypto disabled I get ~720Mbit/s; with async_crypto enabled it is 5.2Gbit/s.
Fixed typo in comment + fixed the 32-bit build
Oct 17 2017
Changed sysctl name to "net.inet.ipsec.async_crypto"
Oct 2 2017
Style remarks
Sep 29 2017
Updates done regarding jmg's remarks. I hope this is easier to understand now.
@jmg, I will post a new review soon with the changes you suggest.
Sep 25 2017
Sep 12 2017
Updated the crypto(9) man page to reflect the changes
Aug 29 2017
Aug 25 2017
As jhb suggested, I added a flag on drivers that may benefit from multi-CPU dispatch.
This provides a way to simply do direct dispatch for hardware crypto drivers.
Aug 21 2017
Jul 28 2017
Jun 12 2017
Now deleting the taskqueue in crypto_destroy()
May 30 2017
May 19 2017
Any thoughts on this review?
May 11 2017
Apr 20 2017
The taskqueue API could possibly be expanded to support this. But I don't think it could easily be expanded to support the use case I mentioned in the previous comment: a single invocation serving multiple streaming requests.
I have made some further tests; it looks like a single queue shared between several worker threads is far more efficient than round-robin over several workers each having its own queue (none of the kernel threads are pinned to a CPU).
The problem is that on a dual-socket system (2*6 cores) performance is really bad: all the time is spent in the mutex protecting the single job queue.
What would you suggest to get decent performance? Round robin on several queues, each one having several worker threads?

Are you making sure you're only waking one thread when adding to the queue (and not waking up all 12 threads to hammer on the mutex)? Also, even waking one thread is only effective if the worker, once awake, processes at least one item before going back to sleep; otherwise you'll be waking threads that do nothing (but hammer the mutex) before going back to sleep. If you have your worker threads polling the queue for work, you need an active flag which, when set, suppresses wakeups when an item is enqueued. It is then up to the worker thread to detect the backlog and wake additional worker threads once the backlog averages more than one item per awake worker. This average should probably be kept over multiple iterations, unless you care more about latency than CPU usage.
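A minimal userspace sketch of this wake-one discipline, using POSIX threads in place of the kernel primitives (names and structure are illustrative, not the actual crypto(9) code):

```c
#include <pthread.h>
#include <sys/queue.h>

/*
 * A single shared job queue with wake-one semantics: the producer
 * signals at most one sleeping worker per enqueue, so idle workers do
 * not all pile onto the queue mutex.
 */
struct job {
	TAILQ_ENTRY(job) entries;
};

static TAILQ_HEAD(, job) jobq = TAILQ_HEAD_INITIALIZER(jobq);
static pthread_mutex_t jobq_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t jobq_cv = PTHREAD_COND_INITIALIZER;
static int sleepers;	/* workers currently blocked in cond_wait() */

/* Producer: enqueue one job and wake at most ONE sleeping worker. */
static void
job_enqueue(struct job *j)
{
	pthread_mutex_lock(&jobq_mtx);
	TAILQ_INSERT_TAIL(&jobq, j, entries);
	if (sleepers > 0)
		pthread_cond_signal(&jobq_cv);	/* signal, not broadcast */
	pthread_mutex_unlock(&jobq_mtx);
}

/* Worker: take one item per wakeup before possibly sleeping again. */
static struct job *
job_dequeue(void)
{
	struct job *j;

	pthread_mutex_lock(&jobq_mtx);
	while (TAILQ_EMPTY(&jobq)) {
		sleepers++;
		pthread_cond_wait(&jobq_cv, &jobq_mtx);
		sleepers--;
	}
	j = TAILQ_FIRST(&jobq);
	TAILQ_REMOVE(&jobq, j, entries);
	pthread_mutex_unlock(&jobq_mtx);
	return (j);
}
```

Because the dequeue loop always removes an item before returning, a woken worker never goes back to sleep without draining work, which is the property jmg's comment asks for.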
Apr 18 2017
Eventually the goal for us is to use crypto(9) from IPsec to accelerate processing of single flows. Indeed, IPsec does not guarantee packet ordering (neither does IP), but it would certainly be quite harmful for some end-user applications if packets arrived out of order.
The same crypto session may be used for several flows coming from the NIC on several CPUs. We would need to keep the packets of each flow ordered on each CPU, but it does not really matter if the ESP packet order is lost on output, as the anti-replay window handles that on the remote host.
That's why I think it would be nice for crypto(9) users to keep ordering when dispatching the jobs.

No, this is a requirement of the IPsec layer; not all users of crypto(9) require it. For example, disk encryption does not need it, as the upper layers ensure that writes are ordered correctly (both ZFS and UFS do this). And by forcing ordering, you increase latency unnecessarily for the other consumers.
This isn't hard to handle at the IPsec layer. You use a TAILQ to enqueue the packets with a simple data structure carrying a flag that gets set when the packet completes; then, on each completion, while the head of the tailq is ready, you send it. It's not hard, and it keeps the ordering logic where it belongs. Alternatively, you add a flag to crypto(9) and put the logic there, but you need to allow non-ordered operation.
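The TAILQ scheme described above can be sketched roughly like this. Locking is omitted (real code would hold the SA lock around both operations) and all names are invented for the sketch:

```c
#include <sys/queue.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Packets are queued in dispatch order; the crypto completion callback
 * marks one done, and completed packets are sent only while they sit
 * at the head of the queue, preserving submission order.
 */
struct pending_pkt {
	TAILQ_ENTRY(pending_pkt) entries;
	bool done;		/* set by the completion callback */
	void *pkt;		/* the packet itself (e.g. an mbuf) */
};

TAILQ_HEAD(pendq, pending_pkt);

/* On submission: remember the packet in dispatch order. */
static void
pend_insert(struct pendq *q, struct pending_pkt *p)
{
	p->done = false;
	TAILQ_INSERT_TAIL(q, p, entries);
}

/*
 * On completion: mark the packet done, then send every ready packet at
 * the head of the queue. Returns the number of packets sent.
 */
static int
pend_complete(struct pendq *q, struct pending_pkt *p,
    void (*send)(void *))
{
	struct pending_pkt *head;
	int sent = 0;

	p->done = true;
	while ((head = TAILQ_FIRST(q)) != NULL && head->done) {
		TAILQ_REMOVE(q, head, entries);
		if (send != NULL)
			send(head->pkt);
		sent++;
	}
	return (sent);
}
```

A packet that completes out of order simply waits in the queue until everything dispatched before it has also completed, so ordering costs nothing on the fast path.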
This is maybe a bit more difficult, since we would need to reorder packets only within the flows that may share the same crypto session, but I get your idea. Maybe a reordering queue per CPU would do the job, since we expect each flow to be pinned to the same CPU.
Apr 14 2017
In D10384#215403, @jmg wrote:
> As per other comments in the code, ordering does not have to be maintained. With the async nature of callbacks, it is already assumed that the callers can handle this.
Apr 13 2017
Thanks for completing the job!
Nov 17 2016
Added missing m_cat call.
Nov 16 2016
Hello,
Please find an updated version using extension messages.
I also checked the new behavior with a patched version of charon.
Nov 10 2016
Nov 9 2016
delphij: is that OK for you?
Nov 8 2016
Nov 7 2016
Updated diff using full context, changed code according to delphij's comment
Oct 7 2016
Is that OK for you?
Updated diff according to ache's remarks
```c
if (atomic_cmpset_int(&arc4rand_iniseed_state, ARC4_ENTR_HAVE,
    ARC4_ENTR_SEED) || reseed) {
        ARC4_FOREACH(arc4)
                arc4_randomstir(arc4);
}
...
if (arc4->numruns > ARC4_RESEED_BYTES || tv.tv_sec > arc4->t_reseed)
        arc4_randomstir(arc4);
```
This is nearly what I proposed before, but why would you reseed all the instances when the user calls arc4rand() with reseed = 1?
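For illustration, here is a userspace analogue of the atomic_cmpset_int() pattern under discussion, using C11 atomics. Only the caller that wins the HAVE-to-SEED transition performs the expensive stir; all names are invented and the actual kernel code differs:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { ENTR_NONE, ENTR_HAVE, ENTR_SEED };

static _Atomic int iniseed_state = ENTR_HAVE;
static int stir_count;		/* stands in for rekeying the cipher */

static void
randomstir(void)
{
	stir_count++;
}

static void
rand_bytes(bool reseed)
{
	int expected = ENTR_HAVE;

	/*
	 * The compare-exchange succeeds for exactly one caller: the
	 * first one to observe ENTR_HAVE. Later callers see ENTR_SEED
	 * and skip the stir unless they explicitly asked to reseed.
	 */
	if (atomic_compare_exchange_strong(&iniseed_state, &expected,
	    ENTR_SEED) || reseed)
		randomstir();
	/* ... generate output bytes here ... */
}
```

The compare-exchange replaces a lock around the state check: losing callers pay only a failed atomic operation rather than contending on a mutex.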
Oct 6 2016
In D8130#169388, @ache wrote:
> There can be a small window when arc4rand_iniseed_state is already set to SEED but arc4_randomstir() has not been called yet, and right at that time another thread calls the code. Well, we only miss a single per-CPU reinitialization (more would be time-consuming and can't fit between the check and the immediate function call); on the next call they will already be reinitialized. Do you mean another scenario?

Yes, that is what I meant. I am not sure whether this is a big security concern or not.
In D8130#169387, @ache wrote:
> The problem is that another thread may call the arc4rand() function and get some unsafe random bytes before the stir actually occurs and after the state was set to SEED.

Please show, step by step, how that can happen. In the old code, if the state was SEED it was surely reinitialized, and the reinitialization itself holds a lock, so the RNG can't move further and waits for the reinitialization (from any thread).
In D8130#169376, @ache wrote:
> I meant the arc4rand_iniseed_state variable.

So did I. Without locking or atomics, either one additional seeding can happen, or no seeding happens at all (a CPU writing to the other half of the word, which has already been checked).
I do agree with that, but on the very first call of arc4rand() we can make sure arc4_randomstir() is called.
Not all stirring is equal :) Non-random stirring is done just so that arc4 does not block during the early boot phase.
When good randomness becomes available (which arc4rand_iniseed_state indicates), it must be reinitialized immediately on the next call.
In D8130#169374, @ache wrote:
> But I am not sure I see the actual benefit of this atomic?

It allows us to avoid playing with locking.
I meant the arc4rand_iniseed_state variable.
That's a good point indeed.
Do you have other remarks on this patch?
Oct 5 2016
In D8130#168807, @delphij wrote:
> Thanks for working on this, but I have some different ideas which I don't really have the time to sit down and implement right now, and which I feel may be useful to share:
> I think we could take a different approach: now that we have per-CPU pseudo-random state, perhaps we can make the states CPU-bound, completely eliminate the locks, and use critical sections instead?
Oct 4 2016
Changed malloc test to KASSERT
In D8130#168678, @ache wrote:
> Sorry, my previous comment was sent unedited. Please ignore it if you received it through the mail, and re-read it here instead.
In D8130#168667, @ache wrote:
> It would be better to use a non-failing malloc flag. arc4_init() can't fail.
Oct 3 2016
Sorry, forgot to remove a commented line