
geli: enable direct dispatch
ClosedPublic

Authored by asomers on Jul 7 2020, 9:57 PM.

Details

Summary

geli: enable direct dispatch

geli does all of its crypto operations in a separate thread pool, so
g_eli_start, g_eli_read_done, and g_eli_write_done don't actually do very
much work. Enabling direct dispatch eliminates the g_up/g_down bottlenecks,
doubling IOPs on my system. This change does not affect the thread pool.
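
For reference, enabling direct dispatch in a GEOM class boils down to setting flag bits on the provider the class exposes and on the consumer it attaches to the layer below. The following is a minimal sketch only; the function name and its exact placement in g_eli.c are illustrative assumptions, not taken from the diff:

    /*
     * Illustrative sketch, not the committed change.  A GEOM class opts in
     * to direct dispatch when it creates its provider and consumer.
     */
    static int
    example_enable_direct_dispatch(struct g_geom *gp, struct g_provider *below)
    {
        struct g_provider *pp;
        struct g_consumer *cp;

        pp = g_new_providerf(gp, "%s%s", below->name, G_ELI_SUFFIX);
        /*
         * G_PF_DIRECT_RECEIVE: requests from the layer above may be handed
         * to the start routine in the issuing thread instead of via g_down.
         * G_PF_DIRECT_SEND: completions may be delivered to the layer above
         * without queueing to g_up.
         */
        pp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE;

        cp = g_new_consumer(gp);
        /*
         * G_CF_DIRECT_SEND: requests may be passed straight to the provider
         * below.  G_CF_DIRECT_RECEIVE: completions from below may be handled
         * in the completing thread.
         */
        cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE;
        return (g_attach(cp, below));
    }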

Test Plan

Ran the geli test suite, added a test case for geli-on-geli
setups, and benchmarked sequential IOPs on md(4) providers.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

This diff is not ready to commit, because I haven't been able to test it with a hardware crypto accelerator.

sys/geom/eli/g_eli.c
92 ↗(On Diff #74174)

I intend to remove this sysctl once testing is complete.
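
For reference, a temporary testing knob like the kern.geom.eli.direct sysctl discussed here would typically look something like the following; the variable name and exact placement are assumptions, since the diff body is not reproduced on this page:

    /* Sketch of a temporary tunable gating direct dispatch (to be removed). */
    static int g_eli_direct = 1;
    SYSCTL_INT(_kern_geom_eli, OID_AUTO, direct, CTLFLAG_RWTUN,
        &g_eli_direct, 0, "Enable direct GEOM dispatch (testing aid)");

    /* ...and in the create path, set the flags only when it is enabled: */
    if (g_eli_direct) {
        pp->flags |= G_PF_DIRECT_SEND | G_PF_DIRECT_RECEIVE;
        cp->flags |= G_CF_DIRECT_SEND | G_CF_DIRECT_RECEIVE;
    }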

markj added inline comments.
tests/sys/geom/class/eli/reentrancy_test.sh
34 ↗(On Diff #74174)

This splits "status".

44 ↗(On Diff #74174)

I had to increase the timeout. This took 315s on my espressobin, and the default timeout is 300s, so 600s or 900s seems reasonable.

tests/sys/geom/class/eli/reentrancy_test.sh
34 ↗(On Diff #74174)

Curses! I must've done some last-minute reformatting and not rerun the test.

  • Fix typo in reentrancy_test
  • Increase timeout in reentrancy_test
  • Remove kern.geom.eli.direct sysctl

Hmm. Suppose we are using a hardware crypto driver, so crypto processing completes in a separate thread (presumably the driver's ithread). For decryption requests, the cryptop completion callback is g_eli_crypto_read_done(), which calls g_io_deliver() to hand the completed BIO_READ up the stack. With direct dispatch enabled in that direction (i.e., G_CF_DIRECT_RECEIVE on the consumer attached above GELI), driver ithreads will end up completing I/O requests, which could hurt latency if other kernel crypto consumers such as IPSec are in use. Similarly, when writing, the cryptop callback dispatches the BIO_WRITE request to the provider below GELI, and with direct dispatch enabled on GELI's consumer the crypto callback will call directly into lower GEOMs.

With software crypto, completion is done synchronously in the GELI worker threads, so I see no problem with using direct dispatch there. With hardware crypto, it might make sense to leave direct dispatch disabled in the consumer. Any thoughts?
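
For reference, whether a completion runs in the calling thread or gets queued to g_up depends on flags on both sides of the provider/consumer edge. A rough paraphrase of the check g_io_deliver() makes (helper name hypothetical; additional conditions in the real code are omitted):

    /*
     * Rough paraphrase, not a verbatim copy of g_io_deliver(): if both sides
     * allow it, the completion is handled in the current thread -- e.g. a
     * crypto driver ithread calling g_eli_crypto_read_done() -- otherwise it
     * is queued for the g_up kernel thread.
     */
    static bool
    delivers_directly(const struct g_provider *pp, const struct g_consumer *cp)
    {
        return ((pp->flags & G_PF_DIRECT_SEND) != 0 &&
            (cp->flags & G_CF_DIRECT_RECEIVE) != 0);
    }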

tests/sys/geom/class/eli/reentrancy_test.sh
8 ↗(On Diff #74201)

Style nit: the opening brace is on a separate line elsewhere in this file.

In D25587, @markj wrote:

With hardware crypto, it might make sense to leave direct dispatch disabled in the consumer. Any thoughts?

So your concern is not that the CPU consumption of geli would be too high, but that the CPU consumption of whatever is stacked on top of GELI might be too high? If that's true, then shouldn't the layer on top be the one that avoids direct dispatch?

I haven't been able to test it with a hardware crypto accelerator.

Do you have any AMD Ryzen machines? You can kldload ccp to access AMD's built-in PCIe hardware crypto "card".

In D25587, @asomers wrote:

So your concern is not that the CPU consumption of geli would be too high, but that the CPU consumption of whatever is stacked on top of GELI might be too high? If that's true, then shouldn't the layer on top be the one that avoids direct dispatch?

Well, some thread has to do the work, so it's not so much CPU consumption as it is latency. If a crypto driver ithread is passing requests through GEOM, it won't be able to handle pending cryptop completions, which can affect other crypto(9) consumers. When GEOM up/down threads are doing that work, they can be preempted by ithreads.

I'm not sure that it is a significant problem in practice. You'd probably have to have a large number of layers above or below GELI for the latency to be noticeable. And as you note, GEOMs that require significant CPU resources to perform their transforms (e.g., memory allocations, like g_mirror requires) probably should not enable direct dispatch in the first place, or should use a dedicated worker thread for expensive operations. I just wanted to point it out in case anyone else has thoughts on this tradeoff.

In any case, the diff itself looks ok to me.

This revision is now accepted and ready to land. Jul 8 2020, 4:46 PM
In D25587#566132, @greg_unrelenting.technology wrote:

Do you have any AMD Ryzen machines? You can kldload ccp to access AMD's built-in PCIe hardware crypto "card".

Nope. No AMD machines in my house. I might have an old VIA Nano lying around, though. Is PadLock a hardware device or an instruction extension?

This revision was automatically updated to reflect the committed changes.