⚙ D10048 Replace the kernel RC4 with Chacha20.

Add an SVN-friendly commit message template.

Harbormaster completed remote builds in B8122: Diff 26397.Mar 18 2017, 9:00 PM

Reply to review comment.

Reply to reviewer.

sys/libkern/arc4random.c
84 ↗	(On Diff #26395)	No, sorry; files are WAY too late in the game. But it occurs to me I may be able to do something with boot-time stored entropy.

LGTM overall but please consider using explicit_bzero when removing sensitive data from memory.

BTW. do you think it's worth to implement a wrapper around arc4rand() to emulate arc4random_buf? (no action requested here)

sys/libkern/arc4random.c
98 ↗	(On Diff #26397)	Please use explicit_bzero() here.
118 ↗	(On Diff #26397)	Please use explicit_bzero()
119 ↗	(On Diff #26397)	Ditto.

This revision now requires changes to proceed.Mar 19 2017, 7:33 AM

delphij added inline comments.Mar 19 2017, 8:17 AM

sys/libkern/arc4random.c
84 ↗	(On Diff #26395)	Just an idea -- I wonder if it's possible that we could do something like, if the read_random() failed, we fill CHACHA20_KEYBYTES with some data derived from the loader supplied /boot/entropy plus attach entropy already available to harvest queue? It's not perfect but would be better than just crossing fingers with random data on stack...

Fix reviewer concerns.
Add boot-time entropy usage for first (re)seed.

Respond to reviewer remarks.

Add 'bootverbose' printfs.

Harbormaster completed remote builds in B8131: Diff 26410.Mar 19 2017, 12:06 PM

Final tweeks; tidy up some stuff and add bootverbose startup
reports.

Harbormaster completed remote builds in B8133: Diff 26413.Mar 19 2017, 12:37 PM

This will cause issues on platforms that do not use loader. We do not require loader on all of our platforms, and those that don't will have issues w/ the way chacha is started. As there is not an error (continues), this creates divergent behavior.

We should ensure that chacha is seeded from the random infrastructure so that people who work to ensure that it is properly seeded will gain the benefits instead of creating this hidden and undocumented dependency on /boot/entropy.

Also, /boot/entropy is not documented in random(4).

In D10048#207864, @jmg wrote:

This will cause issues on platforms that do not use loader. We do not require loader on all of our platforms, and those that don't will have issues w/ the way chacha is started. As there is not an error (continues), this creates divergent behavior.

Incorrect. If the preload_search_by_type() cannot find a preloaded chunk (perhaps because there was no loader to load one), then it will simply return NULL, and no harm is done. This fall back to the way arc4random has been working until now.

We should ensure that chacha is seeded from the random infrastructure so that people who work to ensure that it is properly seeded will gain the benefits instead of creating this hidden and undocumented dependency on /boot/entropy.

Its not a dependency; it is an attempt to improve on the /status quo/ in a safe, backwards compatible way. The current situation is to grasp at straws; this fix gives it a chance IF there is a stash of random gunk given. If there is none, no change (apart from the RNG algorithm).

"Best effort" is currently being done, this is an improvement.

Also, /boot/entropy is not documented in random(4).

True, thanks. I will fix this in a separate commit.

I can do a pkg exp-run with this patch on HardenedBSD's infrastructure tomorrow if desired.

In D10048#207892, @lattera-gmail.com wrote:

I can do a pkg exp-run with this patch on HardenedBSD's infrastructure tomorrow if desired.

Yes please!

M

markm, I've pointed out where the issue is.

sys/dev/random/random_harvestq.c
355 ↗	(On Diff #26413)	the name _ENTROPY_FILE is misnamed and should be named _ENTROPY_MODULE as this is the name of the module, not a name of a file.
361 ↗	(On Diff #26413)	this is a hack and should be solved in a better manner, like having chacha pull it's data from the random subsystem like arc4 did instead of seeding it this way.
sys/libkern/arc4random.c
73 ↗	(On Diff #26413)	this is where arc4 used to initalize it randomness, this should be carried over to chacha, or at least a description why WHY it is safe to change this behavior should be included.
100 ↗	(On Diff #26413)	no initialization is done in the case that there is no entropy module loaded.

Could you please take a look at the arc4rand() portion of my comment?

sys/dev/random/random_harvestq.c
355 ↗	(On Diff #26413)	I agree. I'd like to point out that the choose of value is somewhat confusing (e.g. if someone change the value, they may do a s,/boot/entropy,/my/entropy, which will render this configuration useless), by the way, as all other loader file types are named in "module_type" style. Perhaps we should change this to something like "random_cachedentropy", and add support to the old value?
sys/libkern/arc4random.c
78 ↗	(On Diff #26413)	Note that c_buffer is only used when BURN_OUTPUT_LIKE_RC4.
90 ↗	(On Diff #26413)	[no action requested] architecturally, this data really "belong" to random_harvestq, should we move the code there?
112 ↗	(On Diff #26413)	This seems to be unused code, and the reasoning for old code was to address RC4 key scheduling weakness, should we just remove the chunk so reader won't get confused by it?
119 ↗	(On Diff #26413)	If we preserve BURN_OUTPUT_LIKE_RC4, please move this zeroing to the ifdef'ed block.
185 ↗	(On Diff #26413)	I think there is a slight chance that, if chacha20->numruns < CHACHA20_RESEED_BYTES, but len + chacha20->numruns > CHACHA20_RESEED_BYTES, we would be owing a reseed in the middle. This reminds me https://reviews.freebsd.org/D8130#168807 and I think the principle still applies here, perhaps we should restructure the code block (line 176-191) like this: getmicrouptime(&tv); /* XXX / p = ptr; while (len > 0) { critical_enter(); chacha20 = &chacha20inst[curcpu]; if ((chacha20->numruns > CHACHA20_RESEED_BYTES) \|\| (tv.tv_sec > chacha20->t_reseed)) { / randomstir() may block. Reenter the loop after it as we may be on a different CPU and needs to operate a different chacha20 context at the time. */ critical_exit(); chacha20_randomstir(chacha20); getmicrouptime(&tv); continue; } length = MIN(CHACHA20_BUFFER_SIZE, len); chacha20->numruns += len; chacha_encrypt_bytes(&chacha20->ctx, chacha20->m_buffer, p, length); p += length; len -= length; critical_exit(); } And get rid of the per-chacha20 context mutex.

This revision now requires changes to proceed.Mar 20 2017, 7:04 AM

Address review comment.

Reply to reviewers.

sys/libkern/arc4random.c
73 ↗	(On Diff #26413)	The initialisation is present. Please look again below the preload block.
90 ↗	(On Diff #26413)	I want to keep it independent of the harvestq stuff, as arc4random is now capable of starting quite safely without involving random(9).
100 ↗	(On Diff #26413)	See below. look for the ... chacha_keysetup(&chacha20->ctx, key, CHACHA20_KEYBYTES8); chacha_ivsetup(&chacha20->ctx, (u_char )&tv_now.tv_sec, ... lines.

Reply to reviewer.

Address review comments.

markm marked 11 inline comments as done.Mar 20 2017, 9:25 AM

Hi,

Could you please also address https://reviews.freebsd.org/D10048?id=26438#inline-59236 ?

Could you also make the existing type ('/boot/entropy') the fallback type when boot_entropy_cache do not exist?

sys/dev/random/random_harvestq.c
355 ↗	(On Diff #26438)	I think we should do something like this in addition to the existing code: #ifndef NO_BACKWARD_COMPATIBILITY if (keyfile != NULL) keyfile = preload_search_by_type(RANDOM_CACHED_BOOT_LEGACY_ENTROPY_MODULE); #endif So the kernel knows to try loading it if the "new" cached type do not exist? Without it, users upgrading may end up with unseeded pool.

Respond to reviewer.

sys/dev/random/random_harvestq.c
355 ↗	(On Diff #26438)	This says "if the preload_search_by_type() search worked, then replace it by another search that might not work". Is this what you meant to say?
sys/libkern/arc4random.c
185 ↗	(On Diff #26413)	(My previous reply got lost somehow) Are you sure about this? Last time I went near critical sections, I got very strong pushback. Removing the mutexes here also means removing them in other places where there may be data structure conflict. Surely that means putting those in critical sections too? Folks won't like that.

delphij added inline comments.Mar 20 2017, 6:13 PM

sys/dev/random/random_harvestq.c
355 ↗	(On Diff #26438)	Oh good catch! Yes, you are correct that this should be if (keyfile == NULL) i.e. if we can't find the new type, fallback to old type.
sys/libkern/arc4random.c
185 ↗	(On Diff #26413)	Yeah, that's why I asked for a feedback even if it's a reject :) But TL;DR really -- I'm fine if we stick using mutex, and we can continue revising the existing status quo at a later time after this change is landed. My main concern is that we could be consuming more bytes than we are supposed to, because there is no restir in the middle of while loop. My reasoning for critical section: Basically the idea is to prevent preemption or CPU migrations when we are working on the Chacha20 context (because we are already using a per-CPU context now) and there isn't much point to use a full mutex. The only reason we didn't held a mutex or enter a critical section is because read_random() could potentially block. Perhaps we should not do a direct restir in line 167 (as it might not belong to us), but set numruns to CHACHA20_RESEED_BYTES + 1 with an atomic store, and all reads to an atomic read [atomic_load_acq_int(&chacha20->numruns)]? This way, the stir part of chacha20_randomstir() would become something like: getmicrouptime(&tv_now); atomic_store_rel_int(&chacha20->numruns, CHACHA20_RESEED_BYTES + 1); critical_enter(); /* * We can ONLY do the restir if we are still on the same CPU. * Caller is expected to re-check because they have left critical * section. / if (chacha20inst[curcpu] == chacha20) { chacha_keysetup(&chacha20->ctx, key, CHACHA20_KEYBYTES8); chacha_ivsetup(&chacha20->ctx, (u_char )&tv_now.tv_sec, (u_char )&tv_now.tv_usec); /* Reset for next reseed cycle. */ chacha20->t_reseed = tv_now.tv_sec + CHACHA20_RESEED_SECONDS; chacha20->numruns = 0; } critical_exit();

Address review comments.

Harbormaster completed remote builds in B8182: Diff 26489.Mar 21 2017, 8:12 AM

emaste added a subscriber: emaste.Mar 21 2017, 4:49 PM

I'm not comfortable with critical sections for now. Back to mutexes
as before. I'm happy to revisit this later.

In D10048#208539, @markm wrote:

I'm not comfortable with critical sections for now. Back to mutexes
as before. I'm happy to revisit this later.

Ok let's take that at a later time.

Could you please fix arc4rand() so that the while() loop in line 179 will restir as needed though?

Just a few quick comments:

I find it confusing that the new code is in a file named arc4random.c.
I feel that using sleepable mutexes here is fine -- the difference in performance is negligible on most contemporary microarchitectures, and there is an argument for moving some of our other critical sections to being mutexes (e.g., per-CPU UMA caches).
Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

This revision now requires changes to proceed.Mar 23 2017, 11:06 AM

In D10048#207924, @markm wrote:

In D10048#207892, @lattera-gmail.com wrote:

I can do a pkg exp-run with this patch on HardenedBSD's infrastructure tomorrow if desired.

Yes please!

pkg exp-run completed successfully with an earlier version of the patch on HardenedBSD. It did take longer to perform the pkg build: 40 hours versus an average 32 hours.

In D10048#209125, @rwatson wrote:

Just a few quick comments:

I find it confusing that the new code is in a file named arc4random.c.

I can fix that!

I feel that using sleepable mutexes here is fine -- the difference in performance is negligible on most contemporary microarchitectures, and there is an argument for moving some of our other critical sections to being mutexes (e.g., per-CPU UMA caches).

I'm concerned about cpu migration. Mutexes don't guarantee that a thread will stay on the same cpu, right?

Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

Sounds nifty! Where can I read about this?

In D10048#209275, @markm wrote:

I feel that using sleepable mutexes here is fine -- the difference in performance is negligible on most contemporary microarchitectures, and there is an argument for moving some of our other critical sections to being mutexes (e.g., per-CPU UMA caches).

I'm concerned about cpu migration. Mutexes don't guarantee that a thread will stay on the same cpu, right?

This is correct: you must make sure that you continue to access state on the CPU for which you acquired a mutex -- e.g., by caching a pointer to the per-CPU state you are accessing, in case migration takes place.

Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

Sounds nifty! Where can I read about this?

I'm not sure it has a man page, unfortunately -- and it should do. But take a look at use of the DPCPU* macros in src/sys/kern. E.g., in kern_switch.c for per-CPU schedule statistics.

In D10048#209623, @rwatson wrote:

In D10048#209275, @markm wrote:

I feel that using sleepable mutexes here is fine -- the difference in performance is negligible on most contemporary microarchitectures, and there is an argument for moving some of our other critical sections to being mutexes (e.g., per-CPU UMA caches).

I'm concerned about cpu migration. Mutexes don't guarantee that a thread will stay on the same cpu, right?

This is correct: you must make sure that you continue to access state on the CPU for which you acquired a mutex -- e.g., by caching a pointer to the per-CPU state you are accessing, in case migration takes place.

But that is racey. Preemption can in theory occur straight after I have verified that it hasn't. Looks like I need to use critical regions for now. I can live with that if you can?

Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

Sounds nifty! Where can I read about this?

I'm not sure it has a man page, unfortunately -- and it should do. But take a look at use of the DPCPU* macros in src/sys/kern. E.g., in kern_switch.c for per-CPU schedule statistics.

I'm bailing out of this, sorry. The original intent of my commit is in danger of being swamped by side-issues. They may be important, but I feel that is a reason to return, not encumber this job further.

This is correct: you must make sure that you continue to access state on the CPU for which you acquired a mutex -- e.g., by caching a pointer to the per-CPU state you are accessing, in case migration takes place.

But that is racey. Preemption can in theory occur straight after I have verified that it hasn't. Looks like I need to use critical regions for now. I can live with that if you can?

If using mutexes to protect access, then there is no race -- preemption and migration while holding the mutex is fine, as other attempts to access the same data will wait until the first thread is done, regardless of migration. This would make CPU indexing of the data structure a strong affinity rather than a strict requirement, while allowing migration to take place if the scheduler feels it is important (i.e., if preemption is preferable to continuing to run on the current CPU, and migration is better than waiting for the CPU to become available again).

Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

Sounds nifty! Where can I read about this?

I'm not sure it has a man page, unfortunately -- and it should do. But take a look at use of the DPCPU* macros in src/sys/kern. E.g., in kern_switch.c for per-CPU schedule statistics.

I'm bailing out of this, sorry. The original intent of my commit is in danger of being swamped by side-issues. They may be important, but I feel that is a reason to return, not encumber this job further.

Your commit introduces per-CPU memory allocation. My comment is that if you're going to do that, you should use the proper infrastructure we provide for the purpose, which was designed to solve the same problem that you are trying to solve, but does it better.

In D10048#209647, @rwatson wrote:

Your commit introduces per-CPU memory allocation.

Where?

M

In D10048#209659, @markm wrote:

In D10048#209647, @rwatson wrote:

Your commit introduces per-CPU memory allocation.

Where?

M

Here, was my reading:

chacha20inst = malloc((mp_maxid + 1) * sizeof(struct chacha20_s),
                M_CHACHA20RANDOM, M_NOWAIT | M_ZERO);

I.e., you allocate memory intended to be used per-CPU. DPCPU is a facility to support that more efficiently that avoids NUMA problems, etc. I'm currently working on a man page since, embarrassingly, there isn't one. But it is quite easy to use, especially when usage is limited to a single .c file.

RWatson: Not to get picky or anything, but there was already a malloc() in that place.

sys/libkern/arc4random.c
106 ↗	(On Diff #26520)	RWatson: Look here. I didn't introduce per-cpu allocation - it was already there. I'm just modifying the /status quo/.

In D10048#209661, @markm wrote:

RWatson: Not to get picky or anything, but there was already a malloc() in that place.

I've misread the patch; I'm happy for this to be fixed in a separate commit. I'll continue to sort out a DPCPU man page, however.

In D10048#209666, @rwatson wrote:

In D10048#209661, @markm wrote:

RWatson: Not to get picky or anything, but there was already a malloc() in that place.

I've misread the patch; I'm happy for this to be fixed in a separate commit. I'll continue to sort out a DPCPU man page, however.

Thanks! :-)

FYI, I have now committed a man page for DPCPU(9) in r316003. It includes some (safe?) synchronisation patterns in its example code.

Ping? :)

If this needs additional time, would you mind if I integrate part of your change (arc4random_buf) for now?

In D10048#209666, @rwatson wrote:

In D10048#209661, @markm wrote:

RWatson: Not to get picky or anything, but there was already a malloc() in that place.

I've misread the patch; I'm happy for this to be fixed in a separate commit. I'll continue to sort out a DPCPU man page, however.

In D10048#213241, @delphij wrote:

Ping? :)

If this needs additional time, would you mind if I integrate part of your change (arc4random_buf) for now?

I lost momentum a bit - I'll finish off this weekend.

M

Address reviewer comments.

I have not renamed the arc4*.c file to chacha*.c.

delphij accepted this revision.Apr 14 2017, 1:58 AM

This revision is now accepted and ready to land.Apr 14 2017, 1:58 AM

Please allow me some time to commit my Chacha20 implementation first so we can use that instead of the legally dubious version which is included in this patch. I hit a snag that I haven't had time to debug, but I'm hoping to have it done by Tuesday.

Turns out the snag was that I was loading the wrong version of the module. I have committed it now (r316982). If anyone is interested, I have a version that includes test vectors and runs self-tests when loaded, but I removed them from the final version as they are about six times larger than the actual code.

In D10048#215609, @des wrote:

Please allow me some time to commit my Chacha20 implementation first so we can use that instead of the legally dubious version which is included in this patch. I hit a snag that I haven't had time to debug, but I'm hoping to have it done by Tuesday.

Next time, give me more time than the few hours that you did, and give me a chance to review your code, which for a couple of reasons is unsuitable. This can be fixed, but your "under-my-feet" commit was MOST unwelcome.

Closed by commit rS317015: Replace the RC4 algorithm for generating in-kernel secure random (authored by markm). · Explain WhyApr 16 2017, 9:11 AM

This revision was automatically updated to reflect the committed changes.

Replace the kernel RC4 with Chacha20.
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 27475

head/sys/boot/forth/loader.conf

head/sys/conf/files

head/sys/crypto/chacha20/chacha.h

head/sys/crypto/chacha20/chacha.c

head/sys/dev/random/random_harvestq.h

head/sys/dev/random/random_harvestq.c

head/sys/libkern/arc4random.c

head/sys/sys/libkern.h

head/sys/sys/random.h

Replace the kernel RC4 with Chacha20.ClosedPublicActions