Avoid spinning in random_harvest_queue
Accepted · Public

Authored by shurd on Aug 26 2017, 7:06 PM.

Details

Summary

Using random_harvest_fast() in ether_input_internal() increases small
packet forwarding rates by about 25%. In the interest of harvesting
entropy on low packet rate systems, though, it is still desirable to
harvest that data when reasonable.

Avoiding the spin wait should address both concerns, and generally
collect as much entropy as possible without seriously impacting
performance.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 11258
Build 11632: arc lint + arc unit
shurd created this revision. Aug 26 2017, 7:06 PM
shurd added reviewers: scottl, gallatin, np, jtl.
kmacy added a subscriber: kmacy. Aug 26 2017, 8:29 PM
This comment was removed by kmacy.
markm added a comment. Aug 27 2017, 9:43 AM

I see an opportunity for documentation improvement here. :-)

Your change will lose LOTS of valuable entropy in a setup where there is modest network traffic.

*HOWEVER*, a 25% speed improvement is not to be sneered at, so I'd rather make this an option. Run-time if possible, or build-time if that is not viable.

M
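
As a purely illustrative sketch of the run-time option suggested above (the sysctl name, the variable, and the exact harvest arguments are hypothetical, not part of the patch):

/* Hypothetical knob under the existing kern.random sysctl tree;
 * CTLFLAG_RWTUN makes it settable at run time and as a boot tunable. */
static int random_ether_use_fast = 0;
SYSCTL_INT(_kern_random, OID_AUTO, ether_use_fast, CTLFLAG_RWTUN,
    &random_ether_use_fast, 0,
    "Use random_harvest_fast() for ethernet packets");

/* ...then in ether_input_internal(): */
if (random_ether_use_fast)
        random_harvest_fast(m, sizeof(*m), 2, RANDOM_NET_ETHER);
else
        random_harvest_queue(m, sizeof(*m), 2, RANDOM_NET_ETHER);

Since CTLFLAG_RWTUN covers both the sysctl and the loader-tunable case, one knob would serve both the run-time and boot-time configurations.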

markm added a comment. Aug 27 2017, 3:48 PM

Hi

I'm having a bit of a problem parsing this; could I please ask you to rephrase?

M

shurd added a comment. Aug 27 2017, 5:04 PM

Hrm, I'll look into dynamically switching based on current packet rate.

If that's not cleanly doable, it doesn't seem like it would fit into the mask concept unless a second, "fast" mask were added. Going with the second mask, random_harvest_queue() would look something like this:

/* Drop the event entirely if the source isn't enabled. */
if (!(harvest_context.hc_source_mask & (1 << origin)))
        return;
/* Sources flagged in the (new) fast mask skip the queue and its lock. */
if (harvest_context.fast_source_mask & (1 << origin)) {
        random_harvest_fast(entropy, size, bits, origin);
        return;
}

Which seems cleaner than the alternative of a special check just for RANDOM_NET_ETHER.

Another option would be to always fall back on lock contention:

/* Never spin: if the harvest lock is already held, degrade to the fast path. */
if (!mtx_trylock(&harvest_context.hc_mtx)) {
        random_harvest_fast(entropy, size, bits, origin);
        return;
}

Which would basically eliminate the spinning for all sources.

kmacy added a comment. Aug 27 2017, 5:33 PM
This comment was removed by kmacy.
kmacy added a comment. Aug 27 2017, 5:44 PM
This comment was removed by kmacy.
kmacy added a comment. Aug 27 2017, 6:49 PM

It occurred to me that we can both have our way by enabling the driver to disable entropy collection on packets.

shurd updated this revision to Diff 32428. Aug 27 2017, 6:49 PM

Avoid lock contention in random_harvest_queue

Fall back to random_harvest_fast on contention. I haven't tested
this yet; I will update with the test results.

shurd retitled this revision from Treat ethernet packets as a fast entropy source to Avoid spinning in random_harvest_queue. Aug 27 2017, 6:56 PM
shurd edited the summary of this revision. Aug 27 2017, 7:03 PM

How do you measure the 25% performance increase?

shurd added a comment (edited). Aug 28 2017, 4:53 PM

How do you measure the 25% performance increase?

I used your scripts with the new iflib driver. With the current head driver, the difference is a less pronounced 8.7%:

x kernel.stock.default.64.pps
+ kernel.stock.fastr.default.64.pps
+-------------------------------------------------------------------------+
| x                                                                    +  |
| x                                                                    +  |
|xxx                                                                   +++|
||A|                                                                      |
|                                                                      MA||
+-------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       3131676       3141161     3134644.5     3135430.1     3506.6568
+   5       3404869       3413299       3406825     3408434.8     3470.5472
Difference at 95.0% confidence
	273005 +/- 5087.99
	8.70709% +/- 0.169559%
	(Student's t, pooled s = 3488.65)
markm added a comment. Aug 28 2017, 7:26 PM

Hi

Do you mean not harvesting the packet contents?

M

markm added a comment. Aug 28 2017, 7:30 PM

That might be an idea. The actual PPS rate could be tunable, as optimal choices may not be universal. I'm leery of turning off harvesting altogether as that would be a big POLA violation, but lessening the impact makes sense.

shurd added a comment. Aug 28 2017, 7:42 PM

Do you mean not harvesting the packet contents?

In the updated patch, it continues to call random_harvest_queue(), but random_harvest_queue() has been modified to fall back to random_harvest_fast() on lock contention instead of spinning. If entropy is being collected at a high rate, random_harvest_queue() will still be doing the queuing most of the time; rather than spinning to wait for the mutex, other threads will now fall back to random_harvest_fast() if the mutex is held when random_harvest_queue() is called.

With this change, the argument to even have a separate random_harvest_fast() is weaker since the spin on the lock is avoided for all harvesting.

In the updated patch, it continues to call random_harvest_queue(), but random_harvest_queue() has been modified to fall back to random_harvest_fast() on lock contention instead of spinning. [...]

Full disclosure: I don't know much about the entropy gathering framework in FreeBSD.

It sounds like this might be a weaker solution than what's currently implemented. If that's true, is there any way to quantify how much weaker? If that's not true, please forgive my question.

shurd added a comment. Aug 28 2017, 8:44 PM

It sounds like this might be a weaker solution than what's currently implemented. If that's true, is there any way to quantify how much weaker? If that's not true, please forgive my question.

Yes, it is generally weaker. Instead of spinning to collect the largest amount of entropy possible, it collects little to no entropy from the sources causing lock contention.

How much weaker depends directly on the amount of lock contention, so it is very load dependent. In general, harvesting is very fast and the default sources aren't high rate (assuming lower ethernet traffic levels), so there is minimal contention in that use case. If there is lock contention, entropy is being provided faster than it can be added to the pool anyway, so sustained lock contention is of little concern.

The biggest issue would be if multiple low rate sources all harvested at the same time and caused contention. In that case, it could be significantly weaker. The default sources don't look likely to combine in that manner, though.

Profiling contention on the lock under the expected load would provide the best approximation at quantifying the difference. Low rates with high contention would be a serious concern with this change. High rates with high contention would be of low concern.

mjg added a subscriber: mjg (edited). Aug 28 2017, 10:47 PM

since this is a spin mutex, even a failed trylock adds a trip through disabling/enabling preemption + interrupts, plus a cacheline fetch. "fortunately" it so happens that spin trylock does not dirty the line if it sees a taken lock, so this bit is not extremely slow. the sum is definitely significantly slower than necessary.

I think a much better approach would be to have this rate-limited instead. use a per-cpu counter and grab every n'th packet.

longer term this should probably gather per-cpu and rollup once a second or similar
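
As a minimal sketch of the every-n'th-packet idea (the counter name and divisor are hypothetical, and the harvest arguments simply mirror the existing ether input call):

static DPCPU_DEFINE(u_int, eth_harvest_count);
#define ETH_HARVEST_INTERVAL    64      /* hypothetical: harvest 1 packet in 64 */

        u_int *cnt;

        critical_enter();               /* stay on this CPU while touching DPCPU state */
        cnt = DPCPU_PTR(eth_harvest_count);
        if ((*cnt)++ % ETH_HARVEST_INTERVAL == 0) {
                critical_exit();
                random_harvest_queue(m, sizeof(*m), 2, RANDOM_NET_ETHER);
        } else
                critical_exit();

The counter never leaves its CPU, so nothing is shared and nothing bounces; only every n'th packet pays for the queue path.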

kmacy added a comment. Aug 29 2017, 4:02 AM
In D12132#252205, @mjg wrote:

since this is a spin mutex, even a failed trylock adds a trip through disabling/enabling preemption + interrupts, plus a cacheline fetch. "fortunately" it so happens that spin trylock does not dirty the line if it sees a taken lock, so this bit is not extremely slow. the sum is definitely significantly slower than necessary.

I think a much better approach would be to have this rate-limited instead. use a per-cpu counter and grab every n'th packet.

longer term this should probably gather per-cpu and rollup once a second or similar

YES. Thank you.

The random situation has been bugging me since Fortuna went in; the default mask is WAY too aggressive. Schneier himself says "Slowing down the prng by a perceptible factor to get a few bits more security is counterproductive" in his article describing it. I think the right choice is to always use random_harvest_fast in the network paths.

@markm are you able to provide guidance on this or may we proceed in running the performance issues to ground?

markm added a comment. Sep 4 2017, 1:16 PM

I will be able to help in a week or two when I get out of hospital - I'm about to have an operation.

Please go ahead and experiment in the meanwhile; the general direction of the discussion is good so far!

M

jmg added a comment. Sep 7 2017, 3:46 PM

So you'll still suffer terribly with this: random_harvest_fast is not per-CPU, so you'll have the cache line bouncing around.

IMO, as we feed in the mbuf (and not the mbuf cluster), we are not getting great entropy from the current ethernet harvester. One solution would be to turn off all ethernet harvesting and enable INTR_ENTROPY on all the ethernet drivers. This will provide timing data for each interrupt, which is good for a couple of bits of entropy. This will automatically scale down the rate, as most high-end ethernet cards do interrupt coalescing.

We don't need to collect THAT much entropy from the system. We should assume we have a good seed (and an uncompromised seed). If it does get compromised, it's more about time to recovery. We do not need to be "collecting" thousands of bits of entropy every second; even just 10 bits per second, not including RDRAND (we poll RDRAND a minimum of 10 times per second, it increases 2x+𝛆 per byte read from /dev/random, and it provides 64 bits EACH poll, or 640 bits per second), would give a recovery time of under 30 seconds (256 bits of fresh entropy at 10 bits per second takes about 26 seconds).
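
For illustration, a driver opts in by adding INTR_ENTROPY to its usual interrupt setup flags (the flag and bus_setup_intr() are real; the softc fields and handler name below are hypothetical):

/* Hypothetical NIC attach path: harvest timing entropy on each interrupt. */
error = bus_setup_intr(dev, sc->irq_res,
    INTR_TYPE_NET | INTR_MPSAFE | INTR_ENTROPY,
    NULL, example_nic_intr, sc, &sc->intr_cookie);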

markm added a comment. Sep 8 2017, 9:47 AM

(From hospital bed)

I like the general sound of this!

M

des added a subscriber: des. Sep 21 2017, 10:02 AM
delphij added a subscriber: delphij. Oct 6 2017, 6:32 PM
In D12132#254430, @jmg wrote:

So you'll still suffer terribly with this: random_harvest_fast is not per-CPU, so you'll have the cache line bouncing around.

IMO, as we feed in the mbuf (and not the mbuf cluster), we are not getting great entropy from the current ethernet harvester. One solution would be to turn off all ethernet harvesting and enable INTR_ENTROPY on all the ethernet drivers. This will provide timing data for each interrupt, which is good for a couple of bits of entropy. This will automatically scale down the rate, as most high-end ethernet cards do interrupt coalescing.

We don't need to collect THAT much entropy from the system. We should assume we have a good seed (and an uncompromised seed). If it does get compromised, it's more about time to recovery. We do not need to be "collecting" thousands of bits of entropy every second; even just 10 bits per second, not including RDRAND (we poll RDRAND a minimum of 10 times per second, it increases 2x+𝛆 per byte read from /dev/random, and it provides 64 bits EACH poll, or 640 bits per second), would give a recovery time of under 30 seconds (256 bits of fresh entropy at 10 bits per second takes about 26 seconds).

+1.

des added a comment. Oct 9 2017, 10:03 AM

I second @jmg's suggestion in D12132#254430.

What I take away from the conversation in this review is that the proposed solution does what it's supposed to do, and there is even more room for improvement.

Can we commit this change first and move on to a discussion of how to change this for the future of FreeBSD 12?

markm accepted this revision. Nov 17 2017, 5:57 PM

I'm happy with this.

delphij accepted this revision. Nov 17 2017, 6:00 PM
This revision is now accepted and ready to land. Nov 17 2017, 6:00 PM
mjg added a comment (edited). Nov 17 2017, 6:09 PM

Does this patch really make sense to go in?

Not sure I read it right - is it that the 25% win was from doing random_harvest_fast only, and that it drops significantly with trylock added? A reduction with trylock is not surprising given the already-mentioned interrupt enable/disable trips + lock bouncing.

The patch indeed does look like an improvement over the current state, but it only makes sense to go in if a significantly better fix is not trivial.

Can someone explain what the loss is from just doing random_harvest_fast in this codepath?

As for random_harvest_fast itself:

/* Pick the next slot in the single, shared accumulator. */
pos = harvest_context.hc_entropy_fast_accumulator.pos;
/* XOR-mix the event in; these shared writes are what bounce the cacheline. */
harvest_context.hc_entropy_fast_accumulator.buf[pos] ^= jenkins_hash(entropy, size, (uint32_t)get_cyclecount());
harvest_context.hc_entropy_fast_accumulator.pos = (pos + 1) % RANDOM_ACCUM_MAX;

this bounces everything during concurrent access. is there any problem just making this per-cpu?

A side note is that hc_source_mask should probably be moved out of the written-to areas.
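
As a sketch of the per-cpu variant suggested above (assuming the DPCPU machinery; the struct and symbol names are hypothetical and this is untested):

struct harvest_fast_pcpu {
        uint32_t buf[RANDOM_ACCUM_MAX];
        u_int pos;
};
static DPCPU_DEFINE(struct harvest_fast_pcpu, hc_fast_pcpu);

void
random_harvest_fast(const void *entropy, u_int size, u_int bits,
    enum random_entropy_source origin)
{
        struct harvest_fast_pcpu *acc;
        u_int pos;

        critical_enter();       /* pin to this CPU; no shared lock or cacheline */
        acc = DPCPU_PTR(hc_fast_pcpu);
        pos = acc->pos;
        acc->buf[pos] ^= jenkins_hash(entropy, size, (uint32_t)get_cyclecount());
        acc->pos = (pos + 1) % RANDOM_ACCUM_MAX;
        critical_exit();
}

The once-a-second rollup mentioned above would then fold each CPU's buffer into the pools, e.g. from the existing harvest kproc.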