Page MenuHomeFreeBSD

Replace the kernel RC4 with Chacha20.
ClosedPublic

Authored by markm on Mar 18 2017, 6:55 PM.
Tags
None
Referenced Files
F108099265: D10048.id27475.diff
Tue, Jan 21, 9:19 AM
Unknown Object (File)
Sun, Jan 19, 9:59 PM
Unknown Object (File)
Sat, Jan 18, 5:46 PM
Unknown Object (File)
Sat, Jan 18, 5:23 PM
Unknown Object (File)
Sat, Jan 18, 5:11 PM
Unknown Object (File)
Sat, Jan 18, 12:19 AM
Unknown Object (File)
Fri, Jan 17, 3:41 PM
Unknown Object (File)
Fri, Jan 17, 12:32 PM

Details

Summary

Replace the kernel RC4 algorithm with Chacha20.
Keep the API, though, as that is what the other *BSD's have done.

Use boot-time entropy cache to bootstrap the generator at
first (re)seed.

Test Plan

Lots of printf's in the code, not visible here.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 8201
Build 8435: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Add an SVN-friendly commit message template.

Reply to reviewer.

sys/libkern/arc4random.c
89

No, sorry; files are WAY too late in the game. But it occurs to me I may be able to do something with boot-time stored entropy.

delphij requested changes to this revision.Mar 19 2017, 7:33 AM

LGTM overall but please consider using explicit_bzero when removing sensitive data from memory.

BTW. do you think it's worth to implement a wrapper around arc4rand() to emulate arc4random_buf? (no action requested here)

sys/libkern/arc4random.c
118

Please use explicit_bzero() here.

133

Please use explicit_bzero()

134

Ditto.

This revision now requires changes to proceed.Mar 19 2017, 7:33 AM
sys/libkern/arc4random.c
89

Just an idea -- I wonder if it's possible that we could do something like, if the read_random() failed, we fill CHACHA20_KEYBYTES with some data derived from the loader supplied /boot/entropy plus attach entropy already available to harvest queue? It's not perfect but would be better than just crossing fingers with random data on stack...

markm edited edge metadata.
markm edited the summary of this revision. (Show Details)

Fix reviewer concerns.
Add boot-time entropy usage for first (re)seed.

Respond to reviewer remarks.

Add 'bootverbose' printfs.

Final tweeks; tidy up some stuff and add bootverbose startup
reports.

This will cause issues on platforms that do not use loader. We do not require loader on all of our platforms, and those that don't will have issues w/ the way chacha is started. As there is not an error (continues), this creates divergent behavior.

We should ensure that chacha is seeded from the random infrastructure so that people who work to ensure that it is properly seeded will gain the benefits instead of creating this hidden and undocumented dependency on /boot/entropy.

Also, /boot/entropy is not documented in random(4).

In D10048#207864, @jmg wrote:

This will cause issues on platforms that do not use loader. We do not require loader on all of our platforms, and those that don't will have issues w/ the way chacha is started. As there is not an error (continues), this creates divergent behavior.

Incorrect. If the preload_search_by_type() cannot find a preloaded chunk (perhaps because there was no loader to load one), then it will simply return NULL, and no harm is done. This fall back to the way arc4random has been working until now.

We should ensure that chacha is seeded from the random infrastructure so that people who work to ensure that it is properly seeded will gain the benefits instead of creating this hidden and undocumented dependency on /boot/entropy.

Its not a dependency; it is an attempt to improve on the /status quo/ in a safe, backwards compatible way. The current situation is to grasp at straws; this fix gives it a chance IF there is a stash of random gunk given. If there is none, no change (apart from the RNG algorithm).

"Best effort" is currently being done, this is an improvement.

Also, /boot/entropy is not documented in random(4).

True, thanks. I will fix this in a separate commit.

I can do a pkg exp-run with this patch on HardenedBSD's infrastructure tomorrow if desired.

I can do a pkg exp-run with this patch on HardenedBSD's infrastructure tomorrow if desired.

Yes please!

M

markm, I've pointed out where the issue is.

sys/dev/random/random_harvestq.c
355–359

the name _ENTROPY_FILE is misnamed and should be named _ENTROPY_MODULE as this is the name of the module, not a name of a file.

365

this is a hack and should be solved in a better manner, like having chacha pull it's data from the random subsystem like arc4 did instead of seeding it this way.

sys/libkern/arc4random.c
73

this is where arc4 used to initalize it randomness, this should be carried over to chacha, or at least a description why WHY it is safe to change this behavior should be included.

102

no initialization is done in the case that there is no entropy module loaded.

delphij requested changes to this revision.Mar 20 2017, 7:04 AM

Could you please take a look at the arc4rand() portion of my comment?

sys/dev/random/random_harvestq.c
355–359

I agree.

I'd like to point out that the choose of value is somewhat confusing (e.g. if someone change the value, they may do a s,/boot/entropy,/my/entropy, which will render this configuration useless), by the way, as all other loader file types are named in "module_type" style. Perhaps we should change this to something like "random_cachedentropy", and add support to the old value?

sys/libkern/arc4random.c
80

Note that c_buffer is only used when BURN_OUTPUT_LIKE_RC4.

92

[no action requested] architecturally, this data really "belong" to random_harvestq, should we move the code there?

114

This seems to be unused code, and the reasoning for old code was to address RC4 key scheduling weakness, should we just remove the chunk so reader won't get confused by it?

121

If we preserve BURN_OUTPUT_LIKE_RC4, please move this zeroing to the ifdef'ed block.

179–185

I think there is a slight chance that, if chacha20->numruns < CHACHA20_RESEED_BYTES, but len + chacha20->numruns > CHACHA20_RESEED_BYTES, we would be owing a reseed in the middle.

This reminds me https://reviews.freebsd.org/D8130#168807 and I think the principle still applies here, perhaps we should restructure the code block (line 176-191) like this:

getmicrouptime(&tv); /* XXX */
p = ptr;

while (len > 0) {
    critical_enter();
    chacha20 = &chacha20inst[curcpu];
    if ((chacha20->numruns > CHACHA20_RESEED_BYTES) ||
        (tv.tv_sec > chacha20->t_reseed)) {
             /* randomstir() may block.  Reenter the loop after it as we may be on a different CPU and needs to operate a different chacha20 context at the time. */
             critical_exit();
             chacha20_randomstir(chacha20);
             getmicrouptime(&tv);
             continue;
    }

    length = MIN(CHACHA20_BUFFER_SIZE, len);
    chacha20->numruns += len;
    chacha_encrypt_bytes(&chacha20->ctx, chacha20->m_buffer, p, length);
    p += length;
    len -= length;
    critical_exit();
}

And get rid of the per-chacha20 context mutex.

This revision now requires changes to proceed.Mar 20 2017, 7:04 AM

Reply to reviewers.

sys/libkern/arc4random.c
73

The initialisation is present. Please look again below the preload block.

92

I want to keep it independent of the harvestq stuff, as arc4random is now capable of starting quite safely without involving random(9).

102

See below. look for the ...

chacha_keysetup(&chacha20->ctx, key, CHACHA20_KEYBYTES*8);
chacha_ivsetup(&chacha20->ctx, (u_char *)&tv_now.tv_sec,

... lines.

markm edited edge metadata.

Address review comments.

markm marked 11 inline comments as done.Mar 20 2017, 9:25 AM

Hi,

Could you please also address https://reviews.freebsd.org/D10048?id=26438#inline-59236 ?

Could you also make the existing type ('/boot/entropy') the fallback type when boot_entropy_cache do not exist?

sys/dev/random/random_harvestq.c
355–359

I think we should do something like this in addition to the existing code:

#ifndef NO_BACKWARD_COMPATIBILITY
        if (keyfile != NULL)
            keyfile = preload_search_by_type(RANDOM_CACHED_BOOT_LEGACY_ENTROPY_MODULE);
#endif

So the kernel knows to try loading it if the "new" cached type do not exist? Without it, users upgrading may end up with unseeded pool.

Respond to reviewer.

sys/dev/random/random_harvestq.c
355–359

This says "if the preload_search_by_type() search worked, then replace it by another search that might not work". Is this what you meant to say?

sys/libkern/arc4random.c
179–185

(My previous reply got lost somehow)

Are you sure about this? Last time I went near critical sections, I got very strong pushback.

Removing the mutexes here also means removing them in other places where there may be data structure conflict. Surely that means putting those in critical sections too? Folks won't like that.

sys/dev/random/random_harvestq.c
355–359

Oh good catch! Yes, you are correct that this should be if (keyfile == NULL) i.e. if we can't find the new type, fallback to old type.

sys/libkern/arc4random.c
179–185

Yeah, that's why I asked for a feedback even if it's a reject :)

But TL;DR really -- I'm fine if we stick using mutex, and we can continue revising the existing status quo at a later time after this change is landed. My main concern is that we could be consuming more bytes than we are supposed to, because there is no restir in the middle of while loop.

My reasoning for critical section:

Basically the idea is to prevent preemption or CPU migrations when we are working on the Chacha20 context (because we are already using a per-CPU context now) and there isn't much point to use a full mutex. The only reason we didn't held a mutex or enter a critical section is because read_random() could potentially block.

Perhaps we should not do a direct restir in line 167 (as it might not belong to us), but set numruns to CHACHA20_RESEED_BYTES + 1 with an atomic store, and all reads to an atomic read [atomic_load_acq_int(&chacha20->numruns)]? This way, the stir part of chacha20_randomstir() would become something like:

getmicrouptime(&tv_now);
atomic_store_rel_int(&chacha20->numruns, CHACHA20_RESEED_BYTES + 1);
critical_enter();
/*
  * We can ONLY do the restir if we are still on the same CPU.
  * Caller is expected to re-check because they have left critical
  * section.
  */
if (chacha20inst[curcpu] == chacha20) {
    chacha_keysetup(&chacha20->ctx, key, CHACHA20_KEYBYTES*8);
    chacha_ivsetup(&chacha20->ctx, (u_char *)&tv_now.tv_sec,
        (u_char *)&tv_now.tv_usec);
    /* Reset for next reseed cycle. */
    chacha20->t_reseed = tv_now.tv_sec + CHACHA20_RESEED_SECONDS;
    chacha20->numruns = 0;
}
critical_exit();

I'm not comfortable with critical sections for now. Back to mutexes
as before. I'm happy to revisit this later.

I'm not comfortable with critical sections for now. Back to mutexes
as before. I'm happy to revisit this later.

Ok let's take that at a later time.

Could you please fix arc4rand() so that the while() loop in line 179 will restir as needed though?

rwatson added a subscriber: rwatson.

Just a few quick comments:

  • I find it confusing that the new code is in a file named arc4random.c.
  • I feel that using sleepable mutexes here is fine -- the difference in performance is negligible on most contemporary microarchitectures, and there is an argument for moving some of our other critical sections to being mutexes (e.g., per-CPU UMA caches).
  • Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.
This revision now requires changes to proceed.Mar 23 2017, 11:06 AM

I can do a pkg exp-run with this patch on HardenedBSD's infrastructure tomorrow if desired.

Yes please!

pkg exp-run completed successfully with an earlier version of the patch on HardenedBSD. It did take longer to perform the pkg build: 40 hours versus an average 32 hours.

Just a few quick comments:

  • I find it confusing that the new code is in a file named arc4random.c.

I can fix that!

  • I feel that using sleepable mutexes here is fine -- the difference in performance is negligible on most contemporary microarchitectures, and there is an argument for moving some of our other critical sections to being mutexes (e.g., per-CPU UMA caches).

I'm concerned about cpu migration. Mutexes don't guarantee that a thread will stay on the same cpu, right?

  • Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

Sounds nifty! Where can I read about this?

  • I feel that using sleepable mutexes here is fine -- the difference in performance is negligible on most contemporary microarchitectures, and there is an argument for moving some of our other critical sections to being mutexes (e.g., per-CPU UMA caches).

I'm concerned about cpu migration. Mutexes don't guarantee that a thread will stay on the same cpu, right?

This is correct: you must make sure that you continue to access state on the CPU for which you acquired a mutex -- e.g., by caching a pointer to the per-CPU state you are accessing, in case migration takes place.

  • Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

Sounds nifty! Where can I read about this?

I'm not sure it has a man page, unfortunately -- and it should do. But take a look at use of the DPCPU* macros in src/sys/kern. E.g., in kern_switch.c for per-CPU schedule statistics.

  • I feel that using sleepable mutexes here is fine -- the difference in performance is negligible on most contemporary microarchitectures, and there is an argument for moving some of our other critical sections to being mutexes (e.g., per-CPU UMA caches).

I'm concerned about cpu migration. Mutexes don't guarantee that a thread will stay on the same cpu, right?

This is correct: you must make sure that you continue to access state on the CPU for which you acquired a mutex -- e.g., by caching a pointer to the per-CPU state you are accessing, in case migration takes place.

But that is racey. Preemption can in theory occur straight after I have verified that it hasn't. Looks like I need to use critical regions for now. I can live with that if you can?

  • Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

Sounds nifty! Where can I read about this?

I'm not sure it has a man page, unfortunately -- and it should do. But take a look at use of the DPCPU* macros in src/sys/kern. E.g., in kern_switch.c for per-CPU schedule statistics.

I'm bailing out of this, sorry. The original intent of my commit is in danger of being swamped by side-issues. They may be important, but I feel that is a reason to return, not encumber this job further.

This is correct: you must make sure that you continue to access state on the CPU for which you acquired a mutex -- e.g., by caching a pointer to the per-CPU state you are accessing, in case migration takes place.

But that is racey. Preemption can in theory occur straight after I have verified that it hasn't. Looks like I need to use critical regions for now. I can live with that if you can?

If using mutexes to protect access, then there is no race -- preemption and migration while holding the mutex is fine, as other attempts to access the same data will wait until the first thread is done, regardless of migration. This would make CPU indexing of the data structure a strong affinity rather than a strict requirement, while allowing migration to take place if the scheduler feels it is important (i.e., if preemption is preferable to continuing to run on the current CPU, and migration is better than waiting for the CPU to become available again).

  • Rather than using malloc() and an array of per-CPU instances, have you considered using DPCPU allocation? This will, if not in the short term, then in the long term, give better NUMA behaviour, as malloc() will return memory from a single NUMA domain close to one of the packages, rather than arranging for memory appropriate to a specific package being used for each per-CPU instance.

Sounds nifty! Where can I read about this?

I'm not sure it has a man page, unfortunately -- and it should do. But take a look at use of the DPCPU* macros in src/sys/kern. E.g., in kern_switch.c for per-CPU schedule statistics.

I'm bailing out of this, sorry. The original intent of my commit is in danger of being swamped by side-issues. They may be important, but I feel that is a reason to return, not encumber this job further.

Your commit introduces per-CPU memory allocation. My comment is that if you're going to do that, you should use the proper infrastructure we provide for the purpose, which was designed to solve the same problem that you are trying to solve, but does it better.

Your commit introduces per-CPU memory allocation.

Where?

M

Your commit introduces per-CPU memory allocation.

Where?

M

Here, was my reading:

chacha20inst = malloc((mp_maxid + 1) * sizeof(struct chacha20_s),
                M_CHACHA20RANDOM, M_NOWAIT | M_ZERO);

I.e., you allocate memory intended to be used per-CPU. DPCPU is a facility to support that more efficiently that avoids NUMA problems, etc. I'm currently working on a man page since, embarrassingly, there isn't one. But it is quite easy to use, especially when usage is limited to a single .c file.

RWatson: Not to get picky or anything, but there was already a malloc() in that place.

sys/libkern/arc4random.c
106

RWatson: Look here. I didn't *introduce* per-cpu allocation - it was already there. I'm just modifying the /status quo/.

RWatson: Not to get picky or anything, but there was already a malloc() in that place.

I've misread the patch; I'm happy for this to be fixed in a separate commit. I'll continue to sort out a DPCPU man page, however.

RWatson: Not to get picky or anything, but there was already a malloc() in that place.

I've misread the patch; I'm happy for this to be fixed in a separate commit. I'll continue to sort out a DPCPU man page, however.

Thanks! :-)

FYI, I have now committed a man page for DPCPU(9) in r316003. It includes some (safe?) synchronisation patterns in its example code.

Ping? :)

If this needs additional time, would you mind if I integrate part of your change (arc4random_buf) for now?

markm marked 7 inline comments as done.Apr 7 2017, 7:26 AM

RWatson: Not to get picky or anything, but there was already a malloc() in that place.

I've misread the patch; I'm happy for this to be fixed in a separate commit. I'll continue to sort out a DPCPU man page, however.

Ping? :)

If this needs additional time, would you mind if I integrate part of your change (arc4random_buf) for now?

I lost momentum a bit - I'll finish off this weekend.

M

markm edited edge metadata.

Address reviewer comments.

I have not renamed the arc4*.c file to chacha*.c.

This revision is now accepted and ready to land.Apr 14 2017, 1:58 AM

Please allow me some time to commit my Chacha20 implementation first so we can use that instead of the legally dubious version which is included in this patch. I hit a snag that I haven't had time to debug, but I'm hoping to have it done by Tuesday.

Turns out the snag was that I was loading the wrong version of the module. I have committed it now (r316982). If anyone is interested, I have a version that includes test vectors and runs self-tests when loaded, but I removed them from the final version as they are about six times larger than the actual code.

In D10048#215609, @des wrote:

Please allow me some time to commit my Chacha20 implementation first so we can use that instead of the legally dubious version which is included in this patch. I hit a snag that I haven't had time to debug, but I'm hoping to have it done by Tuesday.

Next time, give me more time than the few hours that you did, and give me a chance to review your code, which for a couple of reasons is unsuitable. This can be fixed, but your "under-my-feet" commit was MOST unwelcome.

This revision was automatically updated to reflect the committed changes.