geli: enable direct dispatch
ClosedPublic
Actions

Authored by asomers on Jul 7 2020, 9:57 PM.

Details

Reviewers

markj
pjd
jmg

Commits

rS363486: MFC r363014:
rS363014: geli: enable direct dispatch

Summary

geli: enable direct dispatch

geli does all of its crypto operations in a separate thread pool, so
g_eli_start, g_eli_read_done, and g_eli_write_done don't actually do very
much work. Enabling direct dispatch eliminates the g_up/g_down bottlenecks,
doubling IOPs on my system. This change does not affect the thread pool.

Test Plan

Ran the geli test suite, added a test case for geli-on-geli
setups, and benchmarked sequential IOPs on md(4) providers.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

asomers created this revision.Jul 7 2020, 9:57 PM

Herald added a subscriber: imp. · View Herald TranscriptJul 7 2020, 9:57 PM

asomers requested review of this revision.Jul 7 2020, 9:57 PM

Harbormaster completed remote builds in B32190: Diff 74174.Jul 7 2020, 9:57 PM

This diff is not ready to commit, because I haven't been able to test it with a hardware crypto accelerator.

sys/geom/eli/g_eli.c
92 ↗	(On Diff #74174)	I intend to remove this sysctl once testing is complete.

markj added a subscriber: markj.Jul 7 2020, 11:24 PM

markj added inline comments.

tests/sys/geom/class/eli/reentrancy_test.sh
34 ↗	(On Diff #74174)	This splits "status".
44 ↗	(On Diff #74174)	I had to increase the timeout. This took 315s on my espressobin, and the default timeout is 300s, so 600s or 900s seems reasonable.

asomers added inline comments.Jul 8 2020, 3:37 PM

tests/sys/geom/class/eli/reentrancy_test.sh
34 ↗	(On Diff #74174)	Curses! I must've done some last minute reformatting and not rerun the test.

Fix typo in reentrancy_test
Increase timeout in reentrancy_test
Remove kern.geom.eli.direct sysctl

Harbormaster completed remote builds in B32202: Diff 74201.Jul 8 2020, 3:38 PM

Ok, ready for review now.

Hmm. Suppose we are using a hardware crypto driver, so crypto processing is completed in a separate thread (presumably the driver's ithread). For decryption requests, the cryptop completion callback is g_eli_crypto_read_done(), which calls g_io_deliver to handle the completed BIO_READ up the stack. With direct dispatch enabled in that direction (i.e., G_CF_DIRECT_RECEIVE on the GELI consumer), it means that driver ithreads will be completing I/O requests, which could hurt latency if other kernel crypto consumers like IPSec are in use. Similarly, when writing, the cryptop callback will dispatch the BIO_WRITE request to the provider below GELI, and with direct dispatch enabled in GELI's consumer the crypto callback will call directly into lower GEOMs.

With software crypto, completion is done synchronously in the GELI worker threads, so I see no problem with using direct dispatch there. With hardware crypto, it might make sense to leave direct dispatch disabled in the consumer. Any thoughts?

tests/sys/geom/class/eli/reentrancy_test.sh
8 ↗	(On Diff #74201)	Style nit: the opening brace is on a separate line elsewhere in this file.

In D25587#566124, @markj wrote:

Hmm. Suppose we are using a hardware crypto driver, so crypto processing is completed in a separate thread (presumably the driver's ithread). For decryption requests, the cryptop completion callback is g_eli_crypto_read_done(), which calls g_io_deliver to handle the completed BIO_READ up the stack. With direct dispatch enabled in that direction (i.e., G_CF_DIRECT_RECEIVE on the GELI consumer), it means that driver ithreads will be completing I/O requests, which could hurt latency if other kernel crypto consumers like IPSec are in use. Similarly, when writing, the cryptop callback will dispatch the BIO_WRITE request to the provider below GELI, and with direct dispatch enabled in GELI's consumer the crypto callback will call directly into lower GEOMs.

With software crypto, completion is done synchronously in the GELI worker threads, so I see no problem with using direct dispatch there. With hardware crypto, it might make sense to leave direct dispatch disabled in the consumer. Any thoughts?

So your concern is not that the CPU consumption of geli would be too high, but that the CPU consumption of whatever is stacked on top of GELI might be too high? If that's true, then shouldn't the thing on top not be doing direct dispatch?

In D25587#565988, @asomers wrote:

I haven't been able to test it with a hardware crypto accelerator.

Do you have any AMD Ryzen machines? You can kldload ccp to access AMD's built-in PCIe hardware crypto "card".

In D25587#566130, @asomers wrote:

In D25587#566124, @markj wrote:

Hmm. Suppose we are using a hardware crypto driver, so crypto processing is completed in a separate thread (presumably the driver's ithread). For decryption requests, the cryptop completion callback is g_eli_crypto_read_done(), which calls g_io_deliver to handle the completed BIO_READ up the stack. With direct dispatch enabled in that direction (i.e., G_CF_DIRECT_RECEIVE on the GELI consumer), it means that driver ithreads will be completing I/O requests, which could hurt latency if other kernel crypto consumers like IPSec are in use. Similarly, when writing, the cryptop callback will dispatch the BIO_WRITE request to the provider below GELI, and with direct dispatch enabled in GELI's consumer the crypto callback will call directly into lower GEOMs.

With software crypto, completion is done synchronously in the GELI worker threads, so I see no problem with using direct dispatch there. With hardware crypto, it might make sense to leave direct dispatch disabled in the consumer. Any thoughts?

So your concern is not that the CPU consumption of geli would be too high, but that the CPU consumption of whatever is stacked on top of GELI might be too high? If that's true, then shouldn't the thing on top not be doing direct dispatch?

Well, some thread has to do the work, so it's not so much CPU consumption as it is latency. If a crypto driver ithread is passing requests through GEOM, it won't be able to handle pending cryptop completions, which can affect other crypto(9) consumers. When GEOM up/down threads are doing that work, they can be preempted by ithreads.

I'm not sure that it is a significant problem in practice. You'd probably have to have a large number of layers above or below GELI for the latency to be noticeable. And as you note, GEOMs that require significant CPU resources to perform their transforms (e.g., memory allocations, like g_mirror requires) probably should not enable direct dispatch in the first place, or should use a dedicated worker thread for expensive operations. I just wanted to point it out in case anyone else has thoughts on this tradeoff.

In any case, the diff itself looks ok to me.

This revision is now accepted and ready to land.Jul 8 2020, 4:46 PM

In D25587#566132, @greg_unrelenting.technology wrote:

In D25587#565988, @asomers wrote:

I haven't been able to test it with a hardware crypto accelerator.

Do you have any AMD Ryzen machines? You can kldload ccp to access AMD's built-in PCIe hardware crypto "card".