Page MenuHomeFreeBSD

IPSec performance increase in single flow mode by making crypto(9) multi thread
ClosedPublic

Authored by emeric.poupon_stormshield.eu on May 11 2017, 12:15 PM.
Tags
None
Referenced Files
F107806536: D10680.id28237.diff
Sat, Jan 18, 8:07 AM
F107794585: D10680.id32386.diff
Sat, Jan 18, 6:14 AM
F107794541: D10680.id32958.diff
Sat, Jan 18, 6:14 AM
F107740469: D10680.diff
Fri, Jan 17, 9:55 PM
F107673572: D10680.id31302.diff
Fri, Jan 17, 12:08 PM
Unknown Object (File)
Wed, Jan 8, 10:04 AM
Unknown Object (File)
Sun, Jan 5, 8:56 AM
Unknown Object (File)
Sun, Jan 5, 8:43 AM

Details

Summary

crypto(9) is called from ipsec in CRYPTO_F_CBIFSYNC mode. This is working fine when a lot of different flows to be ciphered/deciphered are involved.
However, when a software crypto driver is used, there are situations where we could benefit from making crypto(9) multi threaded:

  • a single flow is to be ciphered: only one thread is used to cipher it,
  • a single ESP flow is to be deciphered: only one thread is used to decipher it.

The idea here is to call crypto(9) using a new mode (CRYPTO_F_ASYNC) to dispatch the crypto jobs on multiple threads, if the underlying crypto driver is working in synchronous mode.
Another flag is added (CRYPTO_F_ASYNC_KEEPORDER) to make crypto(9) dispatch the crypto jobs in the order they are received (an additional queue/thread is used), so that the packets are reinjected in the network using the same order they were posted.

A new sysctl net.inet.ipsec.async_crypto can be used to activate this new behavior (disabled by default)

A previous attempt was made in D10384, but to keep things clear I created this new revision with a slightly different solution.
Please feel free to comment and/or add relevant people.

Test Plan

Tests done on a Xeon E3-1125@3.2Ghz (4 cores) gateway with small packet sizes (128 bytes).

Results:

  • best use cases (one flow to be ciphered and/or one flow to be deciphered): +60% performance increase
  • worst use cases (250 flows to be ciphered, 24 ESP flows to be deciphered): -7.5% performance decrease

Different packet sizes have been mixed to check the packets are re-injected ordered in the network

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
manu added inline comments.
opencrypto/crypto.c
393 ↗(On Diff #28237)

if (crypto_tq != NULL)

taskqueue_free(crypto_tq);

No ?

emeric.poupon_stormshield.eu added inline comments.
opencrypto/crypto.c
393 ↗(On Diff #28237)

Yes indeed!

emeric.poupon_stormshield.eu marked an inline comment as done.

Now deleting the taskqueue in crypto_destroy()

Changed the threads of the taskqueue from PRI_MAX_TIMESHARE to PI_NET.
Not sure of the value to be set (kproc_create call uses level PVM).
Any thoughts on this?

fabient added inline comments.
opencrypto/crypto.c
173 ↗(On Diff #29482)

a little complex to retrieve the id ?

183 ↗(On Diff #29482)

add a readonly sysctl?

275 ↗(On Diff #29482)

cannot find it in the new code ? is it replaced by something else?

282 ↗(On Diff #29482)

is there a maximum ?

880 ↗(On Diff #29482)

empty line

963 ↗(On Diff #29482)

is it possible to scale krp also ?

1334 ↗(On Diff #29482)

is it necessary to check all queue here ?

emeric.poupon_stormshield.eu added inline comments.
opencrypto/crypto.c
173 ↗(On Diff #29482)

It is used only for debugging purpose. Do you suggest another approach?

183 ↗(On Diff #29482)

yes, this is a good idea

275 ↗(On Diff #29482)

Your are right, this is indeed a merge error

282 ↗(On Diff #29482)

Ok, I changed to set the maximum to mp_ncpus

880 ↗(On Diff #29482)

Ok

963 ↗(On Diff #29482)

It is possible indeed, but it requires some changes in the code and unfortunately I cannot test this now.

1334 ↗(On Diff #29482)

Yes, since the same return worker is used to dispatch all of them.
I guess it is cheaper to check these pointers rather than triggering a potential useless wake up

I'm planning to commit the dev on behalf of emeric 08/21, please make any comments before. :)

This revision is now accepted and ready to land.Jul 28 2017, 8:31 AM
opencrypto/crypto.c
910 ↗(On Diff #31302)

My only thought here is to not add an extra context switch for actual hardware drivers. Currently you can't use the crypto_getcaps(hid) to figure out what a driver is because aesni0 is incorrectly tagged as hardware (it should be tagged as software, but in such a way that it has priority over the "plain" software, this is also important for avoiding the use of the in-kernel aesni0 driver for user requests from /dev/crypto as any user requests are better off doing AESNI or other accelerated software directly in userland).

173 ↗(On Diff #29482)

What about just '((w) - crypto_ret_workers)'?

emeric.poupon_stormshield.eu added inline comments.
opencrypto/crypto.c
910 ↗(On Diff #31302)

You are right, the current proposed patch works fine only for the software driver.

Actually we already have patches to integrate the aesni and padlock drivers in the software driver. We plan to submit these patches in future reviews.
For now, as it does not make much sense, we could just prevent the crypto jobs using hardware accelerated drivers to be dispatched over multi CPU?
What do you think?

173 ↗(On Diff #29482)

Yes indeed!

opencrypto/crypto.c
910 ↗(On Diff #31302)

We should provide a way to just do direct dispatch for hardware crypto drivers, that is all I really care about, but I'm not sure how to do that cleanly with the current aesni driver without some sort of extra "I'm accelerated software" flag, that is all. I'm open to any solution that does direct dispatch for hardware drivers and uses multiple threads for software (plain or accelerated)

emeric.poupon_stormshield.eu edited edge metadata.
emeric.poupon_stormshield.eu marked an inline comment as done.

As jhb suggested, I added a flag on drivers that may benefit from a multi cpu dispatch.
This provides a way to just do direct dispatch for hardware crypto drivers.

This revision now requires review to proceed.Aug 25 2017, 4:08 PM
netipsec/xform_ah.c
660 ↗(On Diff #32386)

I feel like I'd rather do the F_PERCPU check in the crypto code itself (e.g. in crypto_dispatch) rather than in consumers.

netipsec/xform_ah.c
660 ↗(On Diff #32386)

That is what I did first.
Unfortunately, if you request a MP dispatch and the underlying crypto driver does not handle it, you lose the CBIFSYNC flag and get a performance penalty.

I don't see any man page updates for this code. Please make sure any new capabilities, flags and features are properly documented in crypto(9). I cannot provide review on the patch till documentation is written.

Updated the crypto(9) man page to reflect the changes

Only got part way through the patch. I'll continue reviewing it another time.

share/man/man9/crypto.9
334

Maybe rename this to _ASYNC, to counteract the CAP_F_SYNC flag on a driver.

337

I was confused, as I thought this flag allowed things to be reordered, instead of keeping things ordered. Maybe rename the flag to _F_KEEPORDER. Maybe add some text that the caller is required to ensure that proper ordering is maintained via holding a lock or some other synchronization primitive.

542

I'm not sure this flag should be added. The code should be generic to apply to all _SYNC drivers, so that special code isn't added to drivers.

sys/netipsec/xform_ah.c
659

Maybe make this into a macro or in-line function, as it's repeated a number of times, and involves a bit of logic.

sys/opencrypto/crypto.c
155

This is implementation defined behavior per 6.3.1.3 of c99 and should be fixed (different compliers may output different results when running this code). Something like:
((a)-(b) <= INT_MAX)

is probably best, as that is what you end up doing w/ the cast, and results in well defined behavior.

share/man/man9/crypto.9
334

Yes why not, but the user has no clue there is a kind of acceleration to be expected.
Not sure it is a problem since the explanation is below.

337

You are right, it is more a "keep order" behavior. I will add what you suggested.

542

I see what you mean; it indeed applies well on SYNC drivers. I just wanted to make sure things are well separated to avoid surprising effects, but it could be added later only if we really need to have such a distinction.

sys/netipsec/xform_ah.c
659

Well I have the feeling that CRYPTO_F_CBIFSYNC should be done automatically

Anyway, with your suggestions, the resulting condition would be something like that:

if (CRYPTO_SESID2CAPS(cryptoid) & CRYPTOCAP_F_SYNC) {

if (V_crypto_mp_dispatch)
    crp->crp_flags |= CRYPTO_F_MPDISPATCH | CRYPTO_F_KEEPORDER;
else
    crp->crp_flags |= CRYPTO_F_CBIFSYNC;

}

That raises another question: why not using CRYPTO_F_CBIMM instead of CRYPTO_F_CBIFSYNC then?

sys/opencrypto/crypto.c
155

Actually I reused the logic used by SEQ_GT(a,b) defined in netinet/tcp_seq.h

Maybe we have some problems there too?

@jmg, I will post a new review soon with the changes you suggest.

Updates done regarding jmg remarks. I hope this is easier to understand now.

I also lowered the priority used by the taskqueue (that caused scheduling problems on very high loads)

lidl added inline comments.
sys/opencrypto/crypto.c
1698–1710

Please check the whitespace on this block of changes. It looks to be using a combination of spaces and tabs for the indentation.

Thanks.

sys/opencrypto/cryptodev.h
399

Please check the whitespace for the additions to this file.
It looks like you're using spaces for indentation, and
the existing code uses tabs.

Thanks.

emeric.poupon_stormshield.eu added inline comments.
sys/opencrypto/crypto.c
1698–1710

I use the same indentation style as before: tabs are used to indent, and 4 extra spaces are used on multi lines instructions.
Is that no longer valid?

sys/opencrypto/cryptodev.h
399

Fixed!

Changed sysctl name to "net.inet.inet.ipsec.async_crypto"

I finally played a bit with this patch. I used if_ipsec(4) tunnel between two hosts and iperf TCP test. With disabled async_crypto I have ~720Mbit/s, with enabled async_crypto it is 5.2Gbit/s.

In D10680#264139, @ae wrote:

I finally played a bit with this patch. I used if_ipsec(4) tunnel between two hosts and iperf TCP test. With disabled async_crypto I have ~720Mbit/s, with enabled async_crypto it is 5.2Gbit/s.

Thanks for testing! Actually you get a far better improvement than I observed!
Just curious, which hardware have you used for the test?

Thanks for testing! Actually you get a far better improvement than I observed!
Just curious, which hardware have you used for the test?

This is due to I used large MTU value, for generic ~1500 MTU the results is not so much impressive. E.g. 720Mbit/s vs 1.2 Gbit/s, but this is also good result.

This revision is now accepted and ready to land.Oct 23 2017, 11:58 AM
sys/netipsec/xform_ah.c
659

There is a slight difference between _CBIMM and _CBIFSYNC... in the case of _CBIMM, it is ALWAYS called immediately. This can cause a lock reversal as the driver may be holding a lock when calling crypto_done. In the case of _CBIFSYNC, the callback is only called when the code is processed immediately. I don't know all the locking context to know what is safe, but the above makes some sense to keep it this way.

sys/opencrypto/crypto.c
155

likely. an email to bde would help figure out what the best solution is.

881

If this function was moved after the following one, it would reduce the diff by wrapping the tail of the crypto_dispatch w/ this new function. It also makes tracing history easier.

This revision now requires review to proceed.Nov 3 2017, 8:31 AM
This revision was automatically updated to reflect the committed changes.
bjk added inline comments.
head/share/man/man9/crypto.9
333 ↗(On Diff #34708)

please remember to update .Dd before the final commit.

336 ↗(On Diff #34708)

Is this supposed to be "(that is, if the..."?

338 ↗(On Diff #34708)

Start the new sentence on a new line, per mdoc style.