
Avoid spinning in random_harvest_queue
Accepted · Public

Authored by shurd on Aug 26 2017, 7:06 PM.

Details

Summary

Using random_harvest_fast() in ether_input_internal() increases small
packet forwarding rates by about 25%. In the interest of harvesting
entropy on low packet rate systems, though, it's desirable to harvest
that data when reasonable.

Avoiding the spin wait should address both concerns, and generally
collect as much entropy as possible without seriously impacting
performance.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 11248
Build 11624: arc lint + arc unit

Event Timeline

This comment was removed by kmacy.

I see an opportunity for documentation improvement here. :-)

Your change will lose LOTS of valuable entropy in a setup where there is modest network traffic.

*HOWEVER*, 25% speed improvement is not to be sneered at, so I'd rather make this an option. Run-time if possible, or build-time if that is not viable.

M

Hi

I'm having a bit of a problem parsing this; could I please ask you to rephrase?

M

Hrm, I'll look into dynamically switching based on current packet rate.

If that's not cleanly doable, it doesn't seem like it would fit into the mask concept unless a second, "fast" mask were added. Going with the second mask, random_harvest_queue() would look something like this:

/* Source not enabled for harvesting at all: drop the event. */
if (!(harvest_context.hc_source_mask & (1 << origin)))
        return;
/* Source flagged as "fast": use the cheap accumulator instead. */
if ((harvest_context.fast_source_mask & (1 << origin))) {
        random_harvest_fast(entropy, size, bits, origin);
        return;
}

Which seems cleaner than the alternative of a special check just for RANDOM_NET_ETHER.

Another option would be to always fall back to random_harvest_fast() on lock contention:

/* Lock already held: don't spin, degrade to the fast accumulator. */
if (!mtx_trylock(&harvest_context.hc_mtx)) {
        random_harvest_fast(entropy, size, bits, origin);
        return;
}

Which would basically eliminate the spinning for all sources.
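
As a rough sketch, the whole function with the trylock fallback might look like this (the signature and names follow the snippets above; the enqueue body is elided, so this is illustrative rather than the actual patch):

void
random_harvest_queue(const void *entropy, u_int size, u_int bits,
    enum random_entropy_source origin)
{

        if (!(harvest_context.hc_source_mask & (1 << origin)))
                return;
        /*
         * Never spin: if another CPU holds the harvest lock, degrade
         * to the cheap accumulator instead of waiting.
         */
        if (!mtx_trylock(&harvest_context.hc_mtx)) {
                random_harvest_fast(entropy, size, bits, origin);
                return;
        }
        /* ... queue the full event for the harvest thread ... */
        mtx_unlock(&harvest_context.hc_mtx);
}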

This comment was removed by kmacy.
This comment was removed by kmacy.

It occurred to me that we can both have our way by enabling the driver to disable entropy collection on packets.

Avoid lock contention in random_harvest_queue

Fall back to random_harvest_fast on contention. I haven't tested
this yet, will update with the test results.

shurd retitled this revision from "Treat ethernet packets as a fast entropy source" to "Avoid spinning in random_harvest_queue". Aug 27 2017, 6:56 PM

How do you measure the 25% performance increase ?

How do you measure the 25% performance increase ?

I used your scripts with the new iflib driver. With the current head driver, the difference is a less pronounced 8.7%:

x kernel.stock.default.64.pps
+ kernel.stock.fastr.default.64.pps
+-------------------------------------------------------------------------+
| x                                                                    +  |
| x                                                                    +  |
|xxx                                                                   +++|
||A|                                                                      |
|                                                                      MA||
+-------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       3131676       3141161     3134644.5     3135430.1     3506.6568
+   5       3404869       3413299       3406825     3408434.8     3470.5472
Difference at 95.0% confidence
	273005 +/- 5087.99
	8.70709% +/- 0.169559%
	(Student's t, pooled s = 3488.65)

Hi

Do you mean not harvesting the packet contents?

M

That might be an idea. The actual PPS rate could be tunable, as optimal choices may not be universal. I'm leery of turning off harvesting altogether, as that would be a big POLA violation, but lessening the impact makes sense.

Do you mean not harvesting the packet contents?

In the updated patch, it continues to call random_harvest_queue(), but random_harvest_queue() has been modified to fall back to random_harvest_fast() on lock contention instead of spinning. If entropy is being collected at a high rate, the queue path still runs most of the time; however, instead of spinning to wait for the mutex, threads now fall back to random_harvest_fast() whenever the mutex is already held.

With this change, the argument to even have a separate random_harvest_fast() is weaker since the spin on the lock is avoided for all harvesting.

Full disclosure: I don't know much about the entropy gathering framework in FreeBSD.

It sounds like this might be a weaker solution than what's currently implemented. If that's true, is there any way to quantify how much weaker? If that's not true, please forgive my question.

It sounds like this might be a weaker solution than what's currently implemented. If that's true, is there any way to quantify how much weaker? If that's not true, please forgive my question.

Yes, it is generally weaker. Instead of spinning to collect the largest amount of entropy possible, it collects little to nothing from the sources causing lock contention.

How much weaker depends directly on the amount of lock contention, so it would be very load dependent. In general, harvesting is very fast and the default sources aren't high rate (assuming lower ethernet traffic levels), so there is minimal contention in that use case. And if there is lock contention, entropy is being provided faster than it can be added to the pool anyway, so sustained lock contention is of little concern.

The biggest issue would be if multiple low rate sources all harvested at the same time and caused contention. In that case, it could be significantly weaker. The default sources don't look likely to combine in that manner, though.

Profiling contention on the lock under the expected load would be the best way to quantify the difference. Low rates with high contention would be a serious concern with this change; high rates with high contention would be of little concern.

Since this is a spin mutex, even a failed trylock adds a trip through disabling/enabling preemption + interrupts, plus a cacheline fetch. "Fortunately", it so happens that spin trylock does not dirty the line if it sees a taken lock, so this bit is not extremely slow. The sum is definitely significantly slower than necessary, though.

I think a much better approach would be to have this rate-limited instead: use a per-CPU counter and grab every nth packet.

Longer term, this should probably gather per-CPU and roll up once a second or similar.
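
A rough sketch of that per-CPU sampling idea (the DPCPU counter, the critical section, and the 1-in-64 interval are illustrative assumptions, not anything proposed in the patch):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/pcpu.h>

#define	HARVEST_INTERVAL	64	/* assumed: harvest 1 packet in 64 */

DPCPU_DEFINE(u_int, harvest_skip);

static bool
harvest_this_packet(void)
{
        u_int *cnt;
        bool take;

        critical_enter();               /* stay on this CPU's counter */
        cnt = DPCPU_PTR(harvest_skip);
        take = (++(*cnt) >= HARVEST_INTERVAL);
        if (take)
                *cnt = 0;
        critical_exit();
        return (take);
}

ether_input_internal() would then only call into the harvester when harvest_this_packet() returns true, bounding the harvest rate per CPU regardless of traffic level.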

In D12132#252205, @mjg wrote:

Since this is a spin mutex, even a failed trylock adds a trip through disabling/enabling preemption + interrupts, plus a cacheline fetch. "Fortunately", it so happens that spin trylock does not dirty the line if it sees a taken lock, so this bit is not extremely slow. The sum is definitely significantly slower than necessary, though.

I think a much better approach would be to have this rate-limited instead: use a per-CPU counter and grab every nth packet.

Longer term, this should probably gather per-CPU and roll up once a second or similar.

YES. Thank you.

The random situation has been bugging me since Fortuna went in; the default mask is WAY too aggressive. Schneier himself says "Slowing down the prng by a perceptible factor to get a few bits more security is counterproductive" in his article describing it. I think the right choice is to always use random_harvest_fast in the network paths.

@markm, are you able to provide guidance on this, or may we proceed in running the performance issues to ground?

I will be able to help in a week or two when I get out of hospital - I'm about to have an operation.

Please go ahead and experiment in the meanwhile; the general direction of the discussion is good so far!

M

So, you'll still suffer terribly with this: as random_harvest_fast is not PCPU, you'll have the cache line bouncing around.

IMO, as we feed in the mbuf (and not the mbuf cluster), we are not getting great entropy from the current ethernet harvester. One solution would be to turn off all ethernet harvesting and enable INTR_ENTROPY on all the ethernet drivers. This will provide timing data for each interrupt, which is good for a couple of bits of entropy. This will automatically scale the rate down, as most high-end ethernet cards do interrupt coalescing.
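
For reference, that is a one-flag change at interrupt setup time. INTR_ENTROPY is the real flag; the driver function and softc fields below are hypothetical:

/* Hypothetical NIC attach path: request timing entropy per interrupt. */
error = bus_setup_intr(dev, sc->irq_res,
    INTR_TYPE_NET | INTR_MPSAFE | INTR_ENTROPY,
    NULL, foo_nic_intr, sc, &sc->intr_cookie);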

We don't need to collect THAT much entropy from the system. We should assume we have a good seed (and an uncompromised seed). If it does get compromised, it's more about time to recovery. We do not need to be "collecting" thousands of bits of entropy every second; even just 10 bits per second, not including RDRAND (we poll RDRAND a minimum of 10 times per second, it increases 2x+𝛆 per byte read from /dev/random, and it provides 64 bits EACH poll, or 640 bits per second), would give a recovery time of under 30 seconds.
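
As a back-of-the-envelope check of that claim, assuming Fortuna needs on the order of 256 bits of fresh entropy to recover from a compromised state (that threshold is my assumption, not stated above):

$$ t_{\text{recover}} \approx \frac{256\ \text{bits}}{10\ \text{bits/s}} \approx 26\ \text{s} \quad (\text{without RDRAND}), \qquad \frac{256\ \text{bits}}{640\ \text{bits/s}} < 1\ \text{s} \quad (\text{with RDRAND}). $$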

(From hospital bed)

I like the general sound of this!

M

In D12132#254430, @jmg wrote:

So, you'll still suffer terribly with this: as random_harvest_fast is not PCPU, you'll have the cache line bouncing around.

IMO, as we feed in the mbuf (and not the mbuf cluster), we are not getting great entropy from the current ethernet harvester. One solution would be to turn off all ethernet harvesting and enable INTR_ENTROPY on all the ethernet drivers. This will provide timing data for each interrupt, which is good for a couple of bits of entropy. This will automatically scale the rate down, as most high-end ethernet cards do interrupt coalescing.

We don't need to collect THAT much entropy from the system. We should assume we have a good seed (and an uncompromised seed). If it does get compromised, it's more about time to recovery. We do not need to be "collecting" thousands of bits of entropy every second; even just 10 bits per second, not including RDRAND (we poll RDRAND a minimum of 10 times per second, it increases 2x+𝛆 per byte read from /dev/random, and it provides 64 bits EACH poll, or 640 bits per second), would give a recovery time of under 30 seconds.

+1.

What I take away from the conversation in this review is that the proposed solution does what it's supposed to do, and that there is even more room for improvement.

Can we commit this change first and move on to a discussion of how to change this for the future of FreeBSD 12?

This revision is now accepted and ready to land. Nov 17 2017, 6:00 PM

Does this patch really make sense to go in?

Not sure I read it right - is it that the 25% win was from doing random_harvest_fast only, and it drops significantly with trylock added? Reduction with trylock is not surprising due to the already mentioned interrupt enable/disable trips + lock bouncing.

The patch indeed does look like an improvement over the current state, but it only makes sense to go in if a significantly better fix is not trivial.

Can someone explain what the loss is from just doing random_harvest_fast in this codepath?

As for random_harvest_fast itself:

/* A single global accumulator: every caller read-modify-writes the same data. */
pos = harvest_context.hc_entropy_fast_accumulator.pos;
harvest_context.hc_entropy_fast_accumulator.buf[pos] ^=
    jenkins_hash(entropy, size, (uint32_t)get_cyclecount());
harvest_context.hc_entropy_fast_accumulator.pos = (pos + 1) % RANDOM_ACCUM_MAX;

This bounces everything during concurrent access. Is there any problem with just making this per-CPU?
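
A sketch of what that per-CPU variant might look like — the struct, the DPCPU usage, and the rollup note are assumptions; only jenkins_hash(), get_cyclecount(), and RANDOM_ACCUM_MAX come from the snippet above:

/* Hypothetical per-CPU replacement for the global fast accumulator. */
struct fast_accum {
        uint32_t        buf[RANDOM_ACCUM_MAX];
        u_int           pos;
};

DPCPU_DEFINE(struct fast_accum, fast_accum_pcpu);

static void
random_harvest_fast_pcpu(const void *entropy, u_int size)
{
        struct fast_accum *fa;

        critical_enter();               /* keep us on one CPU's buffer */
        fa = DPCPU_PTR(fast_accum_pcpu);
        fa->buf[fa->pos] ^= jenkins_hash(entropy, size,
            (uint32_t)get_cyclecount());
        fa->pos = (fa->pos + 1) % RANDOM_ACCUM_MAX;
        critical_exit();
}

A periodic rollup (e.g. a once-a-second callout) would then XOR each CPU's buffer into the global pool, per the earlier "rollup once a second" suggestion.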

A side note: hc_source_mask should probably be moved out of frequently written-to areas, since it is read on every harvest call.