
Fortuna: allow increased concurrency
Closed, Public

Authored by cem on May 18 2019, 11:04 PM.

Details

Summary

Add experimental feature to increase concurrency in Fortuna. As this
diverges slightly from canonical Fortuna, and due to the security
sensitivity of random(4), it is off by default. To enable it, set the
tunable kern.random.fortuna.concurrent_read="1". The rest of this commit
message describes the behavior when enabled.

Readers continue to update shared Fortuna state under the global mutex, as
they do in the status quo implementation of the algorithm, but shift the
actual PRF generation out from under the global lock. This massively reduces
the CPU time readers spend holding the global lock, allowing for increased
concurrency on SMP systems and less bullying of the harvestq kthread.
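
As a rough sketch of the shape of the unlocked read path (hedged: every
identifier below is an illustrative stand-in, not an actual fortuna.c
symbol):

    static void
    read_concurrent(uint8_t *buf, size_t len)
    {
            struct prf_key key;     /* hypothetical local key copy */
            uint128_t ctr;

            mtx_lock(&fortuna_mtx);
            /*
             * Under the global lock, do only the cheap state updates:
             * derive a fresh, forward-secret key for this reader and
             * reserve a counter range, so that concurrent readers
             * never share key stream.
             */
            derive_reader_key(&key);
            ctr = reserve_counter_range(len);
            mtx_unlock(&fortuna_mtx);

            /* The CPU-intensive PRF generation runs with no lock held. */
            generate_keystream(&key, &ctr, buf, len);
            explicit_bzero(&key, sizeof(key));
    }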

It is somewhat of a deviation from FS&K. I think the primary difference is
that the specific sequence of AES keys will differ if READ_RANDOM_UIO is
accessed concurrently (as the second thread to take the mutex will no longer
receive a key derived from rekeying the first thread). However, I believe
the goals of rekeying AES are maintained: trivially, we continue to rekey
every 1MB for the statistical property; and each consumer gets a
forward-secret, independent AES key for their PRF.

Since Chacha doesn't need to rekey for sequences of any length, this change
makes no difference to the sequence of Chacha keys and PRF generated when
Chacha is used in place of AES.
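
To make the counter handling concrete, here is a hedged sketch of the
reservation step such an unlocked path depends on (CHACHA_BLOCKSIZE,
fortuna_mtx, and shared_ctr are illustrative names):

    /* Claim [ctr, ctr + nblocks) of key stream for the calling reader. */
    static uint128_t
    reserve_counter_range(size_t len)
    {
            uint128_t ctr;

            mtx_assert(&fortuna_mtx, MA_OWNED);
            ctr = shared_ctr;
            uint128_add64(&shared_ctr, howmany(len, CHACHA_BLOCKSIZE));
            return (ctr);
    }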

On a GENERIC 4-thread VM (so, with INVARIANTS/WITNESS, numbers not
necessarily representative), concurrent AES performance at 3 threads jumped
from roughly 55 MB/s per thread to 197 MB/s per thread. Concurrent Chacha20
at 3 threads went from roughly 113 MB/s per thread to 430 MB/s per thread.

Prior to this change, the system was extremely unresponsive with 3-4
concurrent random readers; each thread had high variance in latency and
throughput, depending on who got lucky and won the lock. The rand_harvestq
thread's CPU use was high (double-digit percentages), seemingly due to
spinning on the global lock.

After the change, concurrent random readers and the system in general are
much more responsive, and rand_harvestq CPU use dropped to basically zero.

Tests are added to the devrandom suite to ensure that the uint128_add
primitive used by the unlocked read path functions to specification.
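
For reference, the carry logic under test looks roughly like the following
on platforms without a native __uint128_t (the struct layout and field names
here are guesses, not the actual uint128.h definitions):

    typedef struct {
            uint64_t lo;    /* guessed field names */
            uint64_t hi;
    } uint128_t;

    static inline void
    uint128_add64(uint128_t *a, uint64_t b)
    {
            uint64_t lo = a->lo + b;

            if (lo < a->lo) /* unsigned wraparound means a carry out */
                    a->hi++;
            a->lo = lo;
    }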

Test Plan
$ make -C tests/sys/devrandom
$ $(make -V .OBJDIR -C tests/sys/devrandom)/uint128_test.full uint128_inc
passed
$ $(make -V .OBJDIR -C tests/sys/devrandom)/uint128_test.full uint128_add64
passed
$ $(make -V .OBJDIR -C tests/sys/devrandom)/uint128_test.full uint128_chacha_ctr
passed

Event Timeline

This is a step in a direction that I've been wanting to do for quite some time; pin multiple output generators to processors. There is more, but I need to write it in detail first.

This revision is now accepted and ready to land. (May 19 2019, 10:11 AM)

Thanks Mark for taking a look at both of these differentials!

> This is a step in a direction that I've been wanting to do for quite some time; pin multiple output generators to processors. There is more, but I need to write it in detail first.

I remember you had proposed work in this direction! I know I said I didn't intend to work on it at the time, but I sort of fell into this idea accidentally. I was rereading FS&K for the parent differential and this idea occurred to me. I had to implement at least a proof of concept. :-) (And I still have one TODO in my scribbled notes on my printed copy of FS&K chapter 9…)

I think this differential is mostly a conservative step: we still have a canonical, single-instance, global Fortuna (modulo slight AES rekeying differences); it just shifts the CPU work of PRF generation out from under the global mutex. (This is especially beneficial because FreeBSD mutexes assume a lock holder runs for only a short time, so waiters spin indefinitely while the holder is on-CPU; that works poorly for long, CPU-intensive PRF generation under the lock. Additionally, our mutexes are not fair, so concurrent readers (and rand_harvestq) win the lock unevenly and unpredictably.)

I have a nice screenshot demonstrating the result (in Chacha mode; with AES the readers see about 190 MB/s):

[Screenshot: rand_harvestq.png]

I think to actually land this, it probably makes sense to hide this behavior behind a non-default tunable. I don't think that complicates things too much, but please discuss. It's certainly a "Relnotes: yes". Is that acceptable from a secteam perspective, @delphij?

Another alternative would be to fork the fortuna module, but I don't like that solution much.

Add default-off (for now) knob to enable/disable the feature

This revision now requires review to proceed. (May 19 2019, 9:21 PM)
  • Rebase on newer version of D20312
  • Refactor somewhat to improve clarity
  • Continue to artificially limit lock hold duration to 4kB in "locked" == "!concurrent" mode (see the sketch after this list)
  • Continue to rekey AES at 1MB intervals
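
Roughly, the chunking the last two bullets describe has this shape (hedged
sketch; the identifiers are not the actual READ_RANDOM_UIO code):

    static void
    read_chunked(uint8_t *buf, size_t bufsize, bool concurrent)
    {
            size_t len;

            while (bufsize > 0) {
                    /* Locked mode caps each lock hold at one page (4kB). */
                    len = concurrent ? bufsize : MIN(bufsize, PAGE_SIZE);
                    if (!concurrent)
                            mtx_lock(&fortuna_mtx);
                    generate_output(buf, len);      /* rekeys AES at 1MB */
                    if (!concurrent)
                            mtx_unlock(&fortuna_mtx);
                    buf += len;
                    bufsize -= len;
            }
    }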

I think this is more or less ready to commit (behavior while disabled is
satisfactory), barring any review concerns. Thanks!

cem retitled this revision from "EXPERIMENTAL Fortuna: allow increased concurrency" to "Fortuna: allow increased concurrency". (Jun 7 2019, 8:35 PM)
cem edited the summary of this revision.
cem edited the test plan for this revision.
cem edited the summary of this revision.

Rebase on D20312 changes

delphij requested changes to this revision. (Jun 17 2019, 5:55 PM)
delphij added inline comments.
sys/dev/random/fortuna.c:487

The mutex would be held for too long for huge blocks (it seems that we all missed this in D20312), and it shouldn't be, because that would enable userland to effectively DoS the kernel: waiters would spin on the mutex while the holder does computationally intensive work. The synchronization model must be revisited here.

I think that in order to make this work, the code needs to be refactored into something more MP-friendly, e.g., fortuna_state should be per-CPU, and reads done in critical sections.
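
Purely to illustrate the per-CPU direction suggested here (this is not what
the revision implements; apart from the DPCPU_* and critical-section
primitives, the symbols are hypothetical):

    DPCPU_DEFINE(struct fortuna_state, fortuna_pcpu);

    static void
    read_percpu(uint8_t *buf, size_t len)
    {
            struct fortuna_state *fs;

            critical_enter();       /* stay on this CPU, no migration */
            fs = DPCPU_PTR(fortuna_pcpu);
            generate_from_state(fs, buf, len);      /* hypothetical */
            critical_exit();
    }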

sys/dev/random/uint128.h:69

It seems that this is really uint128_add64?

This revision now requires changes to proceed. (Jun 17 2019, 5:55 PM)
sys/dev/random/fortuna.c:487

I noticed and fixed this in D20312, actually. In this revision, please see line 477 above (if locked ⇒ chunk_size = PAGE_SIZE;) and the corresponding comment.

The user can already DoS the kernel today (although less severely), due to PAGE_SIZE generation under the lock. There's no good way to fix that without exactly the change in this revision: moving PRF generation out from under the mutex.

This change does make the code substantially more MP-friendly; please see the proposed commit message.

sys/dev/random/uint128.h:69

Yes, I can change the name if you like.

delphij added inline comments.
sys/dev/random/uint128.h:69

Yes, please.

This revision is now accepted and ready to land. (Jun 17 2019, 6:33 PM)

Rename uint128_add to the more appropriate uint128_add64

This revision now requires review to proceed. (Jun 17 2019, 7:00 PM)
This revision is now accepted and ready to land. (Jun 17 2019, 7:36 PM)
This revision was automatically updated to reflect the committed changes.