random/ivy: Provide mechanism to read independent seed values from rdrand
ClosedPublic

Authored by cem on Nov 20 2019, 2:40 AM.

Details

Summary

On x86 platforms with the intrinsic, rdrand is a deterministic bit generator
(AES-CTR) seeded from an entropic source. On x86 platforms with rdseed, it
is something closer to the upstream entropic source. (There is more nuance;
a block diagram is provided in [1].)

On devices with rdrand and without rdseed, there is no good intrinsic for
accessing the good entropic source directly. However, the DRBG is guaranteed
to reseed every 8 kB on these platforms. As a conservative option, on such
hardware we can just read an extra 7.99 kB of samples every time we want a
sample from an independent seed.
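
A minimal userland sketch of the "read past the reseed interval" idea, not the code in this diff. The names `word_source_fn`, `read_independent_seed`, and `demo_source` are hypothetical, and the 8 kB interval is taken from the paragraph above. The word source is passed in as a callback so the loop can be exercised without RDRAND hardware; `demo_source` is a deterministic stand-in, not an entropy source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: stores one 64-bit word, returns false on failure. */
typedef bool (*word_source_fn)(uint64_t *out);

/* Assumption from the summary: the DRBG reseeds at least every 8 kB. */
#define RESEED_INTERVAL_BYTES	8192u
#define WORDS_PER_INTERVAL	(RESEED_INTERVAL_BYTES / sizeof(uint64_t))

/*
 * Read a full reseed interval of words and keep only the last one, so that
 * consecutive returned samples are separated by at least one DRBG reseed.
 */
static bool
read_independent_seed(word_source_fn read_word, uint64_t *out)
{
	uint64_t word = 0;
	size_t i;

	assert(read_word != NULL && out != NULL);

	for (i = 0; i < WORDS_PER_INTERVAL; i++) {
		if (!read_word(&word))
			return (false);
	}
	*out = word;
	return (true);
}

/* Deterministic stand-in for RDRAND, for demonstration only. */
static unsigned long demo_calls;

static bool
demo_source(uint64_t *out)
{
	*out = 0x0123456789abcdefULL + demo_calls;
	demo_calls++;
	return (true);
}
```

With 8-byte words this is 1024 reads per returned sample: the sample itself plus 1023 discarded words, or 8184 bytes (about 7.99 kB) of extra output.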

Because there is some performance penalty to this more conservative option,
a knob is provided to disable (and enable) the change. The change does not
affect platforms with RDSEED.
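
The gating described above can be sketched as a small decision function. This is illustrative only: `seed_read_iterations` and its parameters are hypothetical names, not the identifiers or sysctl used in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static_assert(sizeof(uint64_t) == 8, "sketch assumes 8-byte samples");

/*
 * How many words to read per returned sample.  Only hardware without RDSEED
 * pays the extra cost, and only when the (hypothetical) knob is enabled.
 */
static size_t
seed_read_iterations(bool has_rdseed, bool independent_seed_knob)
{
	if (!has_rdseed && independent_seed_knob)
		return (8192 / sizeof(uint64_t));	/* one 8 kB reseed interval */
	return (1);					/* status quo: one read */
}
```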

[1]: https://software.intel.com/en-us/articles/intel-digital-random-number-generator-drng-software-implementation-guide#inpage-nav-4-2

Test Plan

rdrand microbench here, if someone wants to replicate or check my work: https://reviews.freebsd.org/P336

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/dev/random/ivy.c
65 ↗(On Diff #64604)

I am open to defaulting to the opposite (status quo). I don't know how bad the performance penalty will be. It seems like the worst case might be something like 114 microseconds additional per sample.

No objection to the logic, just the user-facing wording needs to be "de-geeked" a bit! :-)

sys/dev/random/ivy.c
65 ↗(On Diff #64604)

A few microbenchmarks will help here.

69 ↗(On Diff #64604)

This string may be confusing to the reader. It will either need to be clarified in a man page, or get a bit more explicit about the option turning on potentially expensive reseed delays.

84 ↗(On Diff #64604)

*OUCH* :-)

No objection to the logic, just the user-facing wording needs to be "de-geeked" a bit! :-)

Sure! I struggled to find good wording for this.

sys/dev/random/ivy.c
65 ↗(On Diff #64604)

I get about 25 MB/s out of RDRAND on a single AMD Zen thread. 8 KiB works out to about 300 microseconds per sample here. (But this machine supports RDSEED, so it wouldn't need this workaround either.)

The other option from §5.2.5 is:

Iteratively execute 32 RDRAND invocations with a 10 us wait period per iteration.

That would have less CPU overhead, but similar or worse latency, depending on platform RDRAND speed.
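
For comparison, a rough userland sketch of that wait-based alternative. Assumptions to flag: `rdrand_fn`, `read_seed_with_waits`, and `waits_demo_source` are hypothetical names; the wait uses POSIX `nanosleep` (a kernel implementation would use a different delay primitive); and taking the sample with one further read after the 32 iterations is my reading of the quoted sentence, not something the guide text above spells out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical callback: stores one 64-bit word, returns false on failure. */
typedef bool (*rdrand_fn)(uint64_t *out);

#define RESEED_WAIT_ITERATIONS	32
#define RESEED_WAIT_NS		10000L	/* 10 us */

static bool
read_seed_with_waits(rdrand_fn read_word, uint64_t *out)
{
	const struct timespec wait = { .tv_sec = 0, .tv_nsec = RESEED_WAIT_NS };
	uint64_t scratch;
	int i;

	assert(read_word != NULL && out != NULL);

	/* 32 discarded reads, each followed by a short wait. */
	for (i = 0; i < RESEED_WAIT_ITERATIONS; i++) {
		if (!read_word(&scratch))
			return (false);
		/* An early wakeup is ignored in this sketch. */
		(void)nanosleep(&wait, NULL);
	}
	/* Assumption: the sample is one further read after the waits. */
	return (read_word(out));
}

/* Deterministic stand-in for RDRAND, for demonstration only. */
static unsigned waits_demo_calls;

static bool
waits_demo_source(uint64_t *out)
{
	*out = ++waits_demo_calls;
	return (true);
}
```

The waits alone put a floor of 32 x 10 us = 320 us on each sample, which is where the "similar or worse latency" comes from; the CPU is idle for most of that time rather than spinning on RDRAND.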

I'll plan to default it off, status quo, until we can better explain why it might be preferable and better understand the performance impact to real systems without rdseed.

69 ↗(On Diff #64604)

I agree. I'm not sure how to bridge the gap, though.

84 ↗(On Diff #64604)

Yes, adding RDSEED was good (for the concerns that this change would reflect).

sys/dev/random/ivy.c
69 ↗(On Diff #64604)

Maybe something like "If non-zero, use more expensive and slow, but safer, seeded samples where RDSEED is not present"?

cem marked 2 inline comments as done.

Update default and sysctl language.

sys/dev/random/ivy.c
65 ↗(On Diff #64604)

We try to harvest 4-32 bytes from each pure entropy source per pool. At 32 pools, that's 128-1024 bytes. That's 5-40 milliseconds of CPU time per random_sources_feed() on my platform, if I'm doing the math right. That's probably too expensive to default "on."

For comparison, on a Haswell-era Intel I have lying around (which actually lacks RDSEED), RDRAND achieves more like 173 MB/s, reducing these costs somewhat: 43 us per sample, 700 us to 6 ms per feed. Still too high. I suppose we could also implement the sleep-based approach, although latency might still be a problem.
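
The back-of-the-envelope math above can be reproduced with a couple of helper functions. This is only a sketch of the arithmetic: the function names are hypothetical, 1 MB is taken as 10^6 bytes, each sample is assumed to be 8 bytes, and each sample is assumed to cost one full 8 KiB read.

```c
#include <assert.h>

#define RESEED_INTERVAL_BYTES	8192.0	/* bytes read per independent sample */
#define SAMPLE_BYTES		8.0	/* bytes delivered per sample */
#define POOLS			32.0	/* pool count quoted in the comment */

/* Microseconds to read one reseed interval at the given RDRAND throughput. */
static double
seed_sample_cost_us(double rdrand_mb_per_s)
{
	assert(rdrand_mb_per_s > 0.0);
	/* bytes / (MB/s) yields microseconds when 1 MB = 10^6 bytes. */
	return (RESEED_INTERVAL_BYTES / rdrand_mb_per_s);
}

/* Microseconds per feed when harvesting bytes_per_pool bytes for each pool. */
static double
feed_cost_us(double bytes_per_pool, double rdrand_mb_per_s)
{
	double samples = bytes_per_pool * POOLS / SAMPLE_BYTES;

	return (samples * seed_sample_cost_us(rdrand_mb_per_s));
}
```

At 25 MB/s this gives roughly 330 us per sample and roughly 5 to 42 ms per feed for 4 to 32 bytes per pool, consistent with the Zen estimate; at 173 MB/s the results land in the same range as the Haswell figures.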

One other amusing approach might be to take advantage of the independent generators on each CPU in SMP systems: you could pull 8 bytes per core, then run all of the 32x10us sleeps in parallel, then pull another 8 bytes per core. I think this approach is way too complicated and not worth it.

This revision is now accepted and ready to land. Nov 21 2019, 11:40 PM