Page MenuHomeFreeBSD

random(4): Generalize algorithm-independent APIs

Authored by cem on May 18 2019, 9:46 PM.



At a basic level, remove assumptions about the underlying algorithm (such as
output block size and reseeding requirements) from the algorithm-independent
logic in randomdev.c. Chacha20 does not have many of the restrictions that
AES-ICM does as a PRF (Pseudo-Random Function), because it has a cipher
block size of 256 bits. The motivation is that by generalizing the API,
Chacha is not penalized by the limitations of AES.

In READ_RANDOM_UIO, first attempt to NOWAIT allocate a large enough buffer
for the entire user request, or the maximal input we'll accept between
signal checking, whichever is smaller. The idea is that the implementation
of any randomdev algorithm is then free to divide up large requests in
whatever fashion it sees fit.

As part of this, two responsibilities from the "algorithm-generic" randomdev
code are pushed down into the Fortuna ra_read implementation (and any other
future or out-of-tree ra_read implementations):

  1. If an algorithm needs to rekey every N bytes, it is responsible for handling that in ra_read(). (I.e., Fortuna's 1MB rekey interval for AES block generation.)
  2. If an algorithm uses a block cipher that doesn't tolerate partial-block requests (again, e.g., AES), it is also responsible for handling that in ra_read().

Several APIs are changed from u_int buffer length to the more canonical
size_t. Several APIs are changed from taking a blockcount to a bytecount,
to permit PRFs like Chacha20 to directly generate quantities of output that
are not multiples of RANDOM_BLOCKSIZE (AES block size).

The Fortuna algorithm is changed to NOT rekey every 1MiB when in Chacha20
mode (kern.random.use_chacha20_cipher="1"). This is explicitly supported by
the math in FS&K §9.4 (Ferguson, Schneier, and Kohno; "Cryptography
Engineering"), as well as by their conclusion: "If we had a block cipher
with a 256-bit block size, then the collisions would not have been an issue
at all."

Test Plan

Manual dd/getrandom(2) exerciser tests with various block sizes on a
GENERIC-NODEBUG VM to exercise different paths for performance; single thread
benefit is pretty minimal but it may reduce CPU time of rand_harvestq. I
should probably test on a single CPU VM to determine that.

Similar tests on GENERIC VM to verify assertions hold.

Test with both chacha20 mode and AES mode.

Diff Detail

rS FreeBSD src repository - subversion
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

cem planned changes to this revision.May 19 2019, 5:54 AM

Note that the fast path doesn't work on real systems with SMAP. (It worked in a bhyve VM, of course.) I'm not sure vslock provides sufficient protection against concurrent modification of the UVA to copy directly into it, and the actual difference was not large (2-3%). So I will probably revert that portion. The rest is useful for the follow-up patch, see child revision.

markm requested changes to this revision.May 19 2019, 9:47 AM
markm added inline comments.
312 ↗(On Diff #57553)

I don't like this. Chacha may be good, but as a significant departure from Fortuna/FS&K, this needs to be very conservative indeed. Drop the "multiple of RANDOM_BLOCKSIZE" by all means, but the reseed every 1MB is sacred, for an appropriate value of "1MB". Unlimited output from one seeding is asking for trouble IMVHO. That model may be OK for arcfour*() but not here.

312 ↗(On Diff #57553)

I've re-read FS&K 9.4, and no longer believe that I'm right in the review above.

Please instead mention in comments a (brief) summary of why chacha20 doesn't need the 1MB reseed, referencing the 256-bit key.

312 ↗(On Diff #57553)

Yep, this change is actually canonical FS&K (§9.4, p.144 in my copy):

If we had a block cipher with a 256-bit block size, then the collisions would not have been an issue at all.

It's not the 256 bit key size (which AES also has in the mode we use), but instead the 256 bit block output size which is crucial to not requiring the 1MB reseed. 9.4 goes on to describe why the reseed is needed for a 128-bit block cipher (and implicitly, not needed for a 256-bit block cipher).

I have a comment about this referencing FS&K §9.4 at line 466 below, but should probably include another one here.

cem marked 2 inline comments as done.
  • Add comment to genrandom justifying not rekeying Chacha (caveat, this whole

function is removed in the following differential, but I did try to flesh out a
similar comment in the relevant place there)

  • Drop broken/bogus "fast path" and just malloc a larger buffer (if NOWAIT

succeeds) instead

  • Notably, explicit_bzero() the malloc'd bounce buffer to prevent potential

disclosure of generated material. Impact of the leak is nominally larger with
the larger buffer size (since reuse of the page would mean only the last 4k of
output was leaked), but in practice keys are much smaller than 4kB and the leak
is bad, in and of itself. Probably that part should be MFC'd and maybe SA/ENed
by someone who cares about the stable branches.

The commit message is reworded in my git stack, I'll update it in phabricator

cem retitled this revision from random(4): Reduce locking and memcpy overhead to random(4): Generalize algorithm-independent APIs.May 19 2019, 7:37 PM
cem edited the summary of this revision. (Show Details)
215–226 ↗(On Diff #57567)

One caveat of (and argument against) the change to a larger bufsize is that with the child differential (concurrent Fortuna read) OFF (default), larger random_buf takes longer to fill, which exacerbates the global Fortuna mutex hogging and spinning problems.

This concern could be mitigated by modifying Fortuna's ra_read implementation to periodically drop the lock (if more general concurrent read feature is OFF). It could be dropped with the exact same frequency (PAGE_SIZE).

The more trivial fix would be to drop the bufsize increase entirely for now. This reduces some, but not all, of the benefit of the concurrent read feature.

So I think this is one change needed before the patch can land.


Restore previous PAGE_SIZE lock yielding behavior.

Adjust test to pass bytes, rather than block count.

LGTM but please amend the wmsg message to not exceed 6 characters (see comment inline) prior to commit.

492 ↗(On Diff #58341)

wait message should not exceed 6 characters (see sleep(9)). This would appear as 'frread' with 'yield' truncated.

This revision is now accepted and ready to land.Jun 7 2019, 8:11 PM
492 ↗(On Diff #58341)

I intend to replace this line with DELAY(1) instead. Any sleep seems to penalize single thread performance too much. Prior to this change, status quo is essentially RANDOM_RESEED_UNLOCK(); RANDOM_RESEED_LOCK(); (no attempt at fairness). DELAY(1) doesn't hurt single thread too much (4kB/1µs ⇒ 3.81 GiB/s upper limit) but may help provide some fairness to other lock waiters.

Does that seem ok? I'll upload the updated version of this patch.

cem marked an inline comment as done.

Replace tsleep with DELAY().

This revision now requires review to proceed.Jun 7 2019, 8:27 PM

Actually, the delay/sleep is probably not necessary (releasing the lock is sufficient for another thread to race in).

492 ↗(On Diff #58341)

Or maybe just remove the DELAY()? What's the point of sleeping/busy waiting here?

500 ↗(On Diff #58341)

OPTIONAL: Note that memcpy/explicit_bzero here don't really need the lock being held, it's probably worth releasing the lock and simply return afterward.

cem marked an inline comment as not done.Jun 7 2019, 9:20 PM
cem added inline comments.
492 ↗(On Diff #58341)

The idea is that the previous lock holder is more likely to win the race, unless the delay is injected. But it is a crude mechanism and I am fine removing it.

500 ↗(On Diff #58341)

Sure, will fix.

cem marked an inline comment as done.
  • Drop DELAY() entirely
  • Minor optimization: drop global lock over 1-15 bytes of memcpy and 16 bytes explicit_bzero.

For the microoptimization: Note that it's possible to avoid the additional branching by doing unlock in both cases, e.g.:

	if (bytecount > 0) {
		random_fortuna_genrandom(remainder_buf, sizeof(remainder_buf));

		memcpy(buf, remainder_buf, bytecount);
		explicit_bzero(remainder_buf, sizeof(remainder_buf));
	} else

It's probably worthy to do some benchmarks to find out which way to go, but since the buffer is sufficiently small it's probably not worth the effort.

This revision is now accepted and ready to land.Jun 7 2019, 10:07 PM

For the microoptimization: Note that it's possible to avoid the additional branching by doing unlock in both cases, e.g.:

Sure; the compiler may make the same transformation anyway.

It's probably worthy to do some benchmarks to find out which way to go, but since the buffer is sufficiently small it's probably not worth the effort.

Probably the cost of one additional branch is de minimis compared to a single block of AES-256 in software without any SIMD acceleration.

This revision was automatically updated to reflect the committed changes.