Page MenuHomeFreeBSD

reduce overhead of entropy collection
ClosedPublic

Authored by mmacy on May 23 2018, 12:45 AM.

Details

Summary
  • move mask to hot primarily read only cache line
  • check mask first to avoid overhead of a function call
  • don't collect entropy on ethernet by default

This is rx of ~2.9Mpps on a 2x8x2 SKL
https://people.freebsd.org/~mmacy/2018.05.11/udprx1.svg

This is rx of ~4Mpps
https://people.freebsd.org/~mmacy/2018.05.11/udprx_norandom.svg

We're suffering major performance degradation doing this.

However, there's more to unpack here:

random_harvest_queue(m, sizeof(*m), 2, RANDOM_NET_ETHER);

We're sampling harvest size (8 bytes) from the start of the mbuf. So we're literally only ever getting the m_next pointer (the first field of the mbuf). Assuming this were actually set this would be a poor source of entropy. There are only a few thousands of mbufs allocated at low activity. However, unless you're running with jumbo frames, m_next will _always_ be null on ether_input. And, anyone running in production using jumbo frames will have turned this off. So, in the entire time since this change went in we've never gotten any entropy from this source.

Correction: it will be non-NULL on LRO packets. Nonetheless, the point holds.

More generally we need to see a lot more rigor in the justifications surrounding random's design, entropy collection needs, and entropy collection points.

  • How much entropy does random actually need?
  • How much is it actually getting from a given source?
  • Why aren't we using hardware entropy sources? When can we?

There are statistical tests to qualify these things. I think it's quite reasonable that we require a more substantive justification for the overhead than any that have been given to date. We're allowing very poor optics for FreeBSD on the basis of what appears to at best be conjecture.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mmacy created this revision.May 23 2018, 12:45 AM
mmacy edited the summary of this revision. (Show Details)May 23 2018, 12:47 AM
cem added inline comments.May 23 2018, 12:54 AM
sys/dev/random/random_harvestq.c
490–494 ↗(On Diff #42865)

KASSERT references removed parameter

sys/sys/random.h
119 ↗(On Diff #42865)

I don't think the predict_false's make sense here.

158 ↗(On Diff #42865)

comment mismatch

imp added a comment.May 23 2018, 1:38 AM

We only need to generate about 8 bytes of entropy per second total for the spiking of the algorithms...

mmacy edited the summary of this revision. (Show Details)May 23 2018, 5:14 PM
sbruno accepted this revision.May 23 2018, 11:15 PM

Pending trivial cleanups pointed out by @cem , approved.

This revision is now accepted and ready to land.May 23 2018, 11:15 PM
seanc added a subscriber: seanc.May 23 2018, 11:29 PM
mmacy updated this revision to Diff 43179.May 30 2018, 9:13 PM

Update with cem's comments
Will commit 5/31 barring further comments.

This revision now requires review to proceed.May 30 2018, 9:13 PM
mjg added inline comments.May 30 2018, 9:16 PM
sys/sys/random.h
119 ↗(On Diff #42865)

why do you think it does not make sense? it helps the compiler *avoid* a forward jump in the common case of the feature not being enabled - instead it will be more likely to insert a jmp to a section which calls random_harvest_queue_ and then jumps back. a case which arguably is already heavily pessimized. so i would argue it makes excellent sense.

sbruno accepted this revision.May 30 2018, 9:19 PM
This revision is now accepted and ready to land.May 30 2018, 9:19 PM
gallatin accepted this revision.May 30 2018, 9:22 PM

If people want to use ethernet entropy harvesting, can we do something to make it more useful in a separate change? I was going to initially suggest the ether dst addr, but that's not very random either.

In any cases, I think these changes are a huge win. Especially moving the collection mask outside a cacheline that is written in the hot path.

cem accepted this revision.May 30 2018, 10:36 PM

LGTM modulo nit about only disabling on amd64.

sys/sys/random.h
119 ↗(On Diff #42865)

Why do you think that features being disabled is the common case? E.g., I would expect pure sources to be both enabled and sampled reasonably often. The __predict_false will mispredict those.

Additionally, CPU branch predictors have been smart enough to predict the branch correctly for disabled sources since the 90s. I think predict_false/true should only be used when we're actually going to be smarter than the runtime branch predictor.

102–105 ↗(On Diff #43179)

This special casing is a little dubious — e.g., PPC64 has TRNG since POWER7 and is more than capable of driving the same high-throughput NICs as amd64.

I'm leaning towards just disable it for all arch. *Maybe* add a runtime test to detect other sources of entropy and enable it if there's nothing else. Then again, the data is not very random (or not at all random for non-jumbo frames? Not sure we ever determined that), so maybe it's useless there too.

mmacy updated this revision to Diff 43185.EditedMay 30 2018, 10:55 PM
  • disable ethernet entropy collection by default
This revision now requires review to proceed.May 30 2018, 10:55 PM
jeff added a comment.May 30 2018, 11:05 PM

Given that there is trivially little if any entropy coming from mbufs is there a reason we're leaving this callsite at all? has anyone from secteam commented?

cem added a comment.May 30 2018, 11:08 PM
In D15526#330095, @jeff wrote:

Given that there is trivially little if any entropy coming from mbufs is there a reason we're leaving this callsite at all? has anyone from secteam commented?

That might be the right next step, or maybe it can be improved to be useful. I don't think we need to go that far in the first change.

sys/sys/random.h
159 ↗(On Diff #43185)

comment mismatch cropped up again

delphij requested changes to this revision.May 31 2018, 2:08 AM
delphij added a subscriber: delphij.

The change LGTM in principal but I'd like to request a few minor (cosmetic) change.

  1. Please add RANDOM_ENABLE_ETHER to sys/conf/NOTES; this makes sure we have build code either way in tinderbox builds.
  1. Please fix the RANDOM_ENABLE_ETHER / RANDOM_ENABLE_UMA mismatch in sys/sys/random.h, as pointed out by @cem .

Optionally, if I was you I'd probably define random_harvest_queue_ether differently by omitting the RANDOM_NET_ETHER parameter to the macro (since the the macro's name already give a clear hint that it's for ethernet entropy, and the sole purpose of its existence is to provide a way to fully bypass the collection code).

This revision now requires changes to proceed.May 31 2018, 2:08 AM
delphij accepted this revision as: secteam.May 31 2018, 2:09 AM

Accept as secteam@ -- talked with so@ and we have no objection to the change in principal.

mmacy updated this revision to Diff 43191.May 31 2018, 3:23 AM

Apply changes requested by @delphij and @cem

mmacy updated this revision to Diff 43192.May 31 2018, 3:25 AM

fix typo in NOTES

mmacy updated this revision to Diff 43193.May 31 2018, 3:31 AM

Further clarify the intent of RANDOM_ENABLE_ETHER

cem accepted this revision.May 31 2018, 3:42 AM
delphij accepted this revision.May 31 2018, 4:25 AM
This revision is now accepted and ready to land.May 31 2018, 4:25 AM
This revision was automatically updated to reflect the committed changes.
jmg added a comment.Jun 2 2018, 5:45 PM

If people want to use ethernet entropy harvesting, can we do something to make it more useful in a separate change? I was going to initially suggest the ether dst addr, but that's not very random either.
In any cases, I think these changes are a huge win. Especially moving the collection mask outside a cacheline that is written in the hot path.

as I have suggested, doing a sampling is a good choice for those that want/need it.. i.e. use 10 packets/second (or less). Use the previous second's pps to calculate the n for selecting every nth packet to use such that it comes out to something like 10pps. Then you use the entire packet, not just the header w/o packet data. Gives us good entropy, makes minimal impact on system overhead, and a huge improvement.

cem added a comment.Jun 2 2018, 6:20 PM
In D15526#330700, @jmg wrote:

as I have suggested, doing a sampling is a good choice for those that want/need it.. i.e. use 10 packets/second (or less). Use the previous second's pps to calculate the n for selecting every nth packet to use such that it comes out to something like 10pps.

Sure, this seems plausible. The existing pps primitive we have probably doesn't work well on many-core systems — I think it would ping pong around with many RX queues — but one can easily imagine enhancing it to be suitable for this goal.

Then you use the entire packet, not just the header w/o packet data.

random_harvest_queue() truncates to the first 8 bytes of input due to constraint on the size of a hc_entropy_ring entry (harvest_event::he_entropy[HARVESTSIZE]). We would have to change that behavior (with potentially wide-reaching implications), or add a new primitive that does not truncate, or pre-mix packet bytes together (simple xor probably fine).

And after all that work, I'm still not sure packet contents are an especially good source of random data. It's harmless to mix in crappy random data after Fortuna is seeded, but spamming the entropy pool with crappy entropy pre-seed may unblock random before it is truly unpredictable, since our unblock heuristic is basically just a byte count of seed data.

mmacy added a comment.Jun 2 2018, 6:22 PM

While certainly better than what we were doing, global pps accounting would not be efficient multi-core. I'd much rather have it collect the first 10 packets or so every second.

jmg added a comment.Jun 2 2018, 6:45 PM

While certainly better than what we were doing, global pps accounting would not be efficient multi-core. I'd much rather have it collect the first 10 packets or so every second.

That is an implementation detail that I did not specify. If you have a 8 queue nic, maybe pick 1 or 2 packets per queue, or something.. or look at traffic per nic and balance it between each, etc. There are MANY solutions that will get you close to 10pps (or maybe even less) that are perfectly workable for efficient multi-core machines.

Collecting the first 10 packets or so every second is not a great idea. With things like tcp off-load which bursts packets, it's common for bursts of packets to be correlated.

mmacy added a comment.Jun 2 2018, 7:24 PM

Collecting the first 10 packets or so every second is not a great idea. With things like tcp off-load which bursts packets, it's common for bursts of packets to be correlated.

Yes. True. Nonetheless, I'm much more interested in the questions I raised in the review. All of these implementation details are secondary to those.

jmg added a comment.Jun 2 2018, 8:42 PM

Collecting the first 10 packets or so every second is not a great idea. With things like tcp off-load which bursts packets, it's common for bursts of packets to be correlated.

Yes. True. Nonetheless, I'm much more interested in the questions I raised in the review. All of these implementation details are secondary to those.

I've written answers to most of those, or at least attempted to discuss it before. But short answers are:

  • How much entropy does random actually need?

"None". Once a 256-bit seed has been generated at install time and no further entropy is needed. I'm fine w/ small amounts of additional entropy to prevent evil maid compromises (e.g. machine is powered off, maid clones hard drive, next boot, rng is "predictable"). Reference: https://blog.cr.yp.to/20140205-entropy.html

  • How much is it actually getting from a given source?

Due to above, it doesn't/shouldn't mater. Depending upon kernel sources other than the keyboard for generating initial seed is likely to be very broken.

  • Why aren't we using hardware entropy sources? When can we?

We are in a few cases. But there is also the case where people confuses PRNGs with TRNGs. Only the later are useful. RDRAND is now being used, I am working on setting up a lab (for everyone to access) that has various hardware so that drivers can be more easily be written and tested (no ETA though).

jmg added a comment.Jun 2 2018, 8:45 PM

Oh, note that the above are my opinions, and they differ significantly from most of the rest of the FreeBSD team. Mainly why I haven't spent much time working on the rng system,

cem added a comment.Jun 2 2018, 8:54 PM
In D15526#330724, @jmg wrote:

Oh, note that the above are my opinions, and they differ significantly from most of the rest of the FreeBSD team. Mainly why I haven't spent much time working on the rng system,

That mostly looks reasonable to me — the big caveat is around initial seeding. Especially on embedded systems with readonly media, but also "evil maid" scenarios, as you point out. I'm not part of secteam, though.