
Make crypto(9) multithreaded
AbandonedPublic

Authored by emeric.poupon_stormshield.eu on Apr 13 2017, 3:06 PM.

Details

Summary

Hello,

This is an attempt to make crypto(9) multithreaded: it is now possible to spawn multiple crypto workers and return ("ret") workers.
As before, each worker has its own job queue.

The goal is to speed up non-hardware-accelerated crypto operations using the BATCH mode. In that case, you can spread crypto jobs across several workers and benefit from parallel execution.
Currently the choice of the worker is based on the session ID. This is the easiest way to keep the jobs ordered.

The second step would be to spread the crypto jobs across several workers and use the return workers to reorder them.
The goal there is to boost operations when only a few sessions are involved.

Here is what I thought:

  • spawn as many workers as there are CPUs
  • pin each worker/ret worker to a single CPU
  • keep a global per-CPU counter, and use it to tag each crypto job with a tuple { crp_worker_id (= curcpu), crp_seq }
  • load-balance the jobs across the workers
  • once the crypto is processed, put the job back in the original worker's queue, at the correct place (thanks to the worker_id/crp_seq info)
  • the return worker makes sure to redispatch the jobs in the correct order (no "holes")

What do you think?

(Please feel free to add relevant people)

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

emaste added a subscriber: emaste.Apr 13 2017, 5:34 PM
delphij added a reviewer: gnn.Apr 14 2017, 2:10 AM
delphij edited edge metadata.

I guess George would have an interest in this one.

gnn accepted this revision.Apr 14 2017, 1:34 PM

Other than the two minor nits above this looks OK to me.

opencrypto/crypto.c
285

Add parens to make this match the code above.

1308

Remove commented out line before commit.

This revision is now accepted and ready to land.Apr 14 2017, 1:34 PM
jmg edited edge metadata.Apr 14 2017, 4:04 PM

as per other comments in the code, ordering does not have to be maintained.. w/ the async nature of callbacks, it is already assumed that the callers can handle this.

In reality, we need a proper library in the kernel to handle multi-CPU queueing. We keep upgrading each subsystem to do this MP work, which means each one will have its own set of unique bugs, instead of properly writing a library that handles this for the consumer. That allows better tuning of the number of threads, number of queues per thread, etc. Though it's simple to do a queue per thread, with the high cost of the work each thread does, it is better to have m threads per work queue, which will automatically help balance the work out. This will also differ depending upon the workload.

For example, because the crypto framework didn't have MP, geli created its own handling of this problem; now you've done this, but haven't addressed the fact that geli has duplicated code.

opencrypto/crypto.c
198

default should be 0, and when zero, set it to the number of CPUs so this gets used by default.

214

why is this writable?

218

why is this writable?

275

just remove the if; any unsigned num % 1 will always return 0.

276

have you profiled the session allocations to make sure that this makes sense and gives a good distribution? what happens when sessions collide heavily?

398

this change is unrelated to this.. please break out

944

this should be made into a PCPU option, and then aggregated as needed; if you have lots of threads working, this will bounce the stats cache line around CPUs a lot.

1097

why is this code deleted as part of this change?

1241

why has this limit been introduced? we didn't have it before, and now it adds a new way for requests to fail.

1537

these should be _32 instead.

1665

why is this header printed for each worker?

1705

again, why is this header printed for each worker?

In D10384#215403, @jmg wrote:

as per other comments in the code, ordering does not have to be maintained.. w/ the async nature of callbacks, it is already assumed that the callers can handle this.

Thanks for your comments!

Eventually the goal for us is to use crypto(9) from IPsec to accelerate single-flow processing. IPsec does not guarantee packet ordering (neither does IP), but it would certainly be quite harmful for some end-user applications if packets were not ordered.
The same crypto session may be used for several flows coming from the NIC on several CPUs. We would need to keep the packets ordered for each flow on each CPU, but it does not really matter if the ESP packet order is lost on output, as the anti-replay window handles that on the remote host.
That's why I think it would be nice for crypto(9) users to keep ordering when dispatching the jobs.

About the queue limit: if you receive more packets than you can actually process in the crypto layer, you will accumulate a huge number of packets and create a lot of latency.

In reality, we need a proper library in the kernel to handle multi-CPU queueing. We keep upgrading each subsystem to do this MP work, which means each one will have its own set of unique bugs, instead of properly writing a library that handles this for the consumer. That allows better tuning of the number of threads, number of queues per thread, etc. Though it's simple to do a queue per thread, with the high cost of the work each thread does, it is better to have m threads per work queue, which will automatically help balance the work out. This will also differ depending upon the workload.

For example, because the crypto framework didn't have MP, geli created its own handling of this problem; now you've done this, but haven't addressed the fact that geli has duplicated code.

I do agree with that. About geli, I can't see the code right now, but I guess it would have been better to make crypto(9) multithreaded at the time geli handled the problem?
What kind of API would you suggest for that? Do you have some examples?

jmg added a comment.Apr 15 2017, 6:01 AM
In D10384#215403, @jmg wrote:

as per other comments in the code, ordering does not have to be maintained.. w/ the async nature of callbacks, it is already assumed that the callers can handle this.

Thanks for your comments!

Eventually the goal for us is to use crypto(9) from IPsec to accelerate single-flow processing. IPsec does not guarantee packet ordering (neither does IP), but it would certainly be quite harmful for some end-user applications if packets were not ordered.
The same crypto session may be used for several flows coming from the NIC on several CPUs. We would need to keep the packets ordered for each flow on each CPU, but it does not really matter if the ESP packet order is lost on output, as the anti-replay window handles that on the remote host.
That's why I think it would be nice for crypto(9) users to keep ordering when dispatching the jobs.

No, this is a requirement of the IPsec layer; not all users of crypto(9) require this. For example, disk encryption does not need it, as the upper layers ensure that writes are ordered correctly (ZFS and UFS both do this). And by forcing order, you are increasing latency unnecessarily for other consumers.

This isn't hard to handle at the IPsec layer. You use a TAILQ to enqueue the packets w/ a simple data structure w/ a flag that gets set when the packet is completed; then, for each completed packet, while the head of the tailq is ready, send it. It's not hard, and it keeps the ordering logic where it belongs. Or you add a flag to crypto(9) and put the logic there, but you need to allow non-ordered operation.

About the queue limit: if you receive more packets than you can actually process in the crypto layer, you will accumulate a huge number of packets and create a lot of latency.

Again, this should be handled at the higher layer. If the upper layer cares about latency, it should decide how to handle this. Or at least, if you want it here, ensure that it is configurable per session, and only enforce it on sessions that want it. For example, on my relatively small system, I have 14 geli volumes (2 for root, 2 for log, 2 for arc2 cache and 8 for data volumes); on larger systems, it's likely to be easy to hit the 256 limit just for disk encryption.

Luckily, this won't affect geli by default because of its own code to handle this, but it does so poorly. On my system, if I left it w/ the defaults, I would have 84 geli threads bouncing around (6 CPU cores * 14 geli volumes), with most/all 84 threads trying to run at once... I currently run w/ kern.geom.eli.threads=1 since my volumes provide the parallelism needed.

In reality, we need a proper library in the kernel to handle multi-CPU queueing. We keep upgrading each subsystem to do this MP work, which means each one will have its own set of unique bugs, instead of properly writing a library that handles this for the consumer. That allows better tuning of the number of threads, number of queues per thread, etc. Though it's simple to do a queue per thread, with the high cost of the work each thread does, it is better to have m threads per work queue, which will automatically help balance the work out. This will also differ depending upon the workload.

For example, because the crypto framework didn't have MP, geli created its own handling of this problem; now you've done this, but haven't addressed the fact that geli has duplicated code.

I do agree with that. About geli, I can't see the code right now, but I guess it would have been better to make crypto(9) multithreaded at the time geli handled the problem?

Correct.

What kind of API would you suggest for that? Do you have some examples?

For the API it should be fairly simple:
create_queue(nthreads /* 0 for ncpu */, threadsperqueue, queueselectorfun)
enqueue(funtorun, arg)

The crypto interface isn't the only place that can use this; others, such as multi-queue NICs, geom, etc., could too.

The hard part is writing the code, ensuring that the locking is correct, etc., and then of course converting the various subsystems to use it. Then you can build smarts into this system, like queue length and latency monitoring, such that if you notice one queue running long, you can either rebalance queues (drain, change the selector key, start again) or notice that there are not enough threads and inform the developer/sysadmin.

Even threadsperqueue may be (and probably should be) dynamically adjusted. If the average length of a job is long, then having more threads per queue helps latency, but if it's short, you'll need fewer threads per queue to ensure that the locking overhead doesn't block the threads... Even w/ crypto, the time to encrypt blocks will vary largely depending on whether you have AES-NI or not, and on the type of crypto operation (some queues may be slower because they use AES-CBC+HMAC-SHA256 vs AES-GCM).

I did some work on adding support for multistreaming in a single thread. It turns out that to get good performance w/ AES-CBC encryption, you need to do 4-8 streams (requests) per thread, which is another type of queueing that would be useful to support (where the funtorun does not necessarily complete the work item given). In that case, it might be easier to have the worker thread call a dequeue function, something like:
workerthread(struct workqueue *wq, int qnum)
{
    for (;;) {
        item = dequeue(wq, qnum, 1 /* wait */);
        processitem(item);
        completeitem(wq, item);
    }
}

As the completion is separate, multiple dequeued items can be processed at once via the wait parameter; 0 means return before an item is available, so that you can do:
workerthread(struct workqueue *wq, int qnum)
{
    int avail = 0xff;

    for (;;) {
        if (avail) {
            item = dequeue(wq, qnum, 0 /* don't wait */);
            if (item == NULL && avail == 0xff)
                item = dequeue(wq, qnum, 1 /* wait */);
            if (item != NULL)
                clearavailbitandallocateplace(item);
        }
        item = processitems();
        completeitem(wq, item);
        setitemcompletebit(item);
    }
}

Eventually the goal for us is to use crypto(9) from IPsec to accelerate single-flow processing. IPsec does not guarantee packet ordering (neither does IP), but it would certainly be quite harmful for some end-user applications if packets were not ordered.
The same crypto session may be used for several flows coming from the NIC on several CPUs. We would need to keep the packets ordered for each flow on each CPU, but it does not really matter if the ESP packet order is lost on output, as the anti-replay window handles that on the remote host.
That's why I think it would be nice for crypto(9) users to keep ordering when dispatching the jobs.

No, this is a requirement of the IPsec layer; not all users of crypto(9) require this. For example, disk encryption does not need it, as the upper layers ensure that writes are ordered correctly (ZFS and UFS both do this). And by forcing order, you are increasing latency unnecessarily for other consumers.

This isn't hard to handle at the IPsec layer. You use a TAILQ to enqueue the packets w/ a simple data structure w/ a flag that gets set when the packet is completed; then, for each completed packet, while the head of the tailq is ready, send it. It's not hard, and it keeps the ordering logic where it belongs. Or you add a flag to crypto(9) and put the logic there, but you need to allow non-ordered operation.

This is maybe a bit more difficult, since we would need to reorder packets only within the flows that may share the same crypto session, but I get your idea. Maybe a reordering queue per CPU would do the job, since we expect each flow to be pinned to the same CPU.

What kind of API would you suggest for that? Do you have some examples?

For the API it should be fairly simple:
create_queue(nthreads /* 0 for ncpu */, threadsperqueue, queueselectorfun)
enqueue(funtorun, arg)

The crypto interface isn't the only place that can use this; others, such as multi-queue NICs, geom, etc., could too.

I have just had a look at OpenBSD's crypto implementation, and they seem to use something close to the basic part of the API you suggest:
https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/crypto/crypto.c?rev=1.79&content-type=text/x-cvsweb-markup
http://man.openbsd.org/task_add.9

And what about the existing taskqueue(9) API?

I have made some further tests; it looks like a single queue shared between several worker threads is far more efficient than a round robin over several workers, each having its own queue (none of the kernel threads are pinned to a CPU).
The problem is that on a dual-socket system (2*6 cores) performance is really bad: all the time is spent in the mutex protecting the single job queue.
What would you suggest to get decent performance? A round robin over several queues, each with several worker threads?

jmg added a comment.Apr 19 2017, 4:27 PM

Eventually the goal for us is to use crypto(9) from IPsec to accelerate single-flow processing. IPsec does not guarantee packet ordering (neither does IP), but it would certainly be quite harmful for some end-user applications if packets were not ordered.
The same crypto session may be used for several flows coming from the NIC on several CPUs. We would need to keep the packets ordered for each flow on each CPU, but it does not really matter if the ESP packet order is lost on output, as the anti-replay window handles that on the remote host.
That's why I think it would be nice for crypto(9) users to keep ordering when dispatching the jobs.

No, this is a requirement of the IPsec layer; not all users of crypto(9) require this. For example, disk encryption does not need it, as the upper layers ensure that writes are ordered correctly (ZFS and UFS both do this). And by forcing order, you are increasing latency unnecessarily for other consumers.

This isn't hard to handle at the IPsec layer. You use a TAILQ to enqueue the packets w/ a simple data structure w/ a flag that gets set when the packet is completed; then, for each completed packet, while the head of the tailq is ready, send it. It's not hard, and it keeps the ordering logic where it belongs. Or you add a flag to crypto(9) and put the logic there, but you need to allow non-ordered operation.

This is maybe a bit more difficult, since we would need to reorder packets only within the flows that may share the same crypto session, but I get your idea. Maybe a reordering queue per CPU would do the job, since we expect each flow to be pinned to the same CPU.

I didn't think about this, but we are adding flow information to packets if requested, so you can combine the session ID and the flow ID to prevent reordering of packets.

What kind of API would you suggest for that? Do you have some examples?

For the API it should be fairly simple:
create_queue(nthreads /* 0 for ncpu */, threadsperqueue, queueselectorfun)
enqueue(funtorun, arg)

The crypto interface isn't the only place that can use this; others, such as multi-queue NICs, geom, etc., could too.

I have just had a look at OpenBSD's crypto implementation, and they seem to use something close to the basic part of the API you suggest:
https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/crypto/crypto.c?rev=1.79&content-type=text/x-cvsweb-markup
http://man.openbsd.org/task_add.9

And what about the existing taskqueue(9) API?

The taskqueue API could possibly be expanded to support this, but I don't think it could easily be expanded to support the use case I mentioned in the previous comment, where a single invocation handles multiple streaming requests.

I have made some further tests; it looks like a single queue shared between several worker threads is far more efficient than a round robin over several workers, each having its own queue (none of the kernel threads are pinned to a CPU).
The problem is that on a dual-socket system (2*6 cores) performance is really bad: all the time is spent in the mutex protecting the single job queue.
What would you suggest to get decent performance? A round robin over several queues, each with several worker threads?

Are you making sure you're only waking one thread when adding to the queue (and not waking up all 12 threads to hammer on the mutex)? Also, even doing a wake-one is only effective if, when the worker thread wakes up, it processes one item before going to sleep; otherwise you'll be waking threads that don't do anything (but hammer the mutex) before going back to sleep. If you have your worker threads polling the queue for work, you need to have an active flag which, when set, causes no wakeups to happen when an item is enqueued. Then it is up to the worker thread to detect the backlog and wake additional worker threads once the backlog averages more than one item per woken worker thread. This average should probably be kept over multiple iterations unless you care more about latency than CPU usage.

The taskqueue API could possibly be expanded to support this, but I don't think it could easily be expanded to support the use case I mentioned in the previous comment, where a single invocation handles multiple streaming requests.

I have made some further tests; it looks like a single queue shared between several worker threads is far more efficient than a round robin over several workers, each having its own queue (none of the kernel threads are pinned to a CPU).
The problem is that on a dual-socket system (2*6 cores) performance is really bad: all the time is spent in the mutex protecting the single job queue.
What would you suggest to get decent performance? A round robin over several queues, each with several worker threads?

Are you making sure you're only waking one thread when adding to the queue (and not waking up all 12 threads to hammer on the mutex)? Also, even doing a wake-one is only effective if, when the worker thread wakes up, it processes one item before going to sleep; otherwise you'll be waking threads that don't do anything (but hammer the mutex) before going back to sleep. If you have your worker threads polling the queue for work, you need to have an active flag which, when set, causes no wakeups to happen when an item is enqueued. Then it is up to the worker thread to detect the backlog and wake additional worker threads once the backlog averages more than one item per woken worker thread. This average should probably be kept over multiple iterations unless you care more about latency than CPU usage.

My test case uses two crypto sessions on a dual-socket (2*6-core) system. A lot of jobs are submitted for each session (>300k jobs per session per second).
I am already waking up only one thread each time I dispatch a crypto job.
Due to the time spent in the lock, performance in batch mode is lower than with direct calls (I do not benefit from the other cores to accelerate the crypto jobs).
This is not the case on a single-socket (4-core) system, where I get a 1.7x performance boost.

I quickly tried something like you suggest: I added a counter of active workers, and I only wake one up if there is no active worker. When processing a crypto job, if the hint is that more jobs are to be processed and the current number of active workers is lower than the number of workers, I wake up another one.
It seems I have a mean of 5-6 workers active at the same time (out of 12), but I still lose a lot of time in the mutex.

I am wondering if the solution would be to first migrate the existing code to use taskqueues, and then use the new base code to better handle the multi-socket issues.
Unfortunately it is not that simple to use taskqueues and get code close to the OpenBSD implementation, due to the "block" logic that has been added to make crypto workers pause.
I think I can handle that using a separate queue (per driver?) to store the paused jobs, and later restore them to the taskqueue once the driver is unblocked.

The hint that more crypto operations for the driver are available seems quite complicated to implement too. I can't see a way to do it in a multithreaded model: there can be no guarantee that the flag is always correct if two threads call CRYPTODEV_PROCESS in parallel with no lock.
There seems to be a real problem here: some underlying hardware drivers expect the hint not to be set in order to flush the crypto operations; it could create huge latencies if the hint is badly set (a new job would "deliver" the previous one).

What do you think?

Edit: I have migrated the basic part to use taskqueues.
Here is what I got using pmcstat in top mode:

PMC: [CPU_CLK_UNHALTED.THREAD_P] Samples: 84816 (100.0%) , 1 unresolved

%SAMP IMAGE FUNCTION CALLERS

59.3 kernel     __mtx_lock_sleep     taskqueue_enqueue:29.7 taskqueue_run_locked:27.8 igb_mq_start:1.3
 6.8 kernel     taskqueue_enqueue    swcr_process
 5.2 kernel     taskqueue_run_locked taskqueue_thread_loop
 3.1 kernel     sha256_SSSE3_block_d
 1.5 kernel     bcopy                m_copydata

PMC: [CPU_CLK_UNHALTED.THREAD_P] Samples: 85013 (100.0%) , 0 unresolved

%SAMP CALLTREE

73.6 fork_exit@kernel taskqueue_thread_loop taskqueue_run_locked(5.2%) esp_input_cb ipsec4_common_input_cb netisr_dispatch_src ip_input(0.6%) ip_forward ip_output_ex ip_ipsec_output
     ipsec4_process_packet esp_output crypto_dispatch taskqueue_enqueue(6.7%) __mtx_lock_sleep(59.1%) 
72.4 fork_exit@kernel taskqueue_thread_loop taskqueue_run_locked(5.2%) esp_output_cb ipsec_process_done ip_output ip_output_ex ip_ipsec_output ipsec4_process_packet esp_output
     crypto_dispatch taskqueue_enqueue(6.7%) __mtx_lock_sleep(59.1%) 
71.6 fork_exit@kernel taskqueue_thread_loop taskqueue_run_locked(5.2%) swcr_process(0.6%) taskqueue_enqueue(6.7%) __mtx_lock_sleep(59.1%)

As you can see, we get serious contention here (no idle time is left on any CPU).