Page MenuHomeFreeBSD

Split arc4random mutexes to improve performance on IPSec traffic
ClosedPublic

Authored by emeric.poupon_stormshield.eu on Oct 3 2016, 12:02 PM.

Details

Summary

In a dual processor system (2*6 cores) during IPSec throughput tests, we see a lot of contention on the arc4 lock, used to generate the IV of the ESP output packets.

The idea of this patch is to split this mutex in order to reduce/kill the contention on this lock.

Test Plan

2*6Gps flows to be ciphered on a 2*6cores system.
We see an improvement up to +400% thanks to the patch: no more contention on the arc4 lock.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

emeric.poupon_stormshield.eu retitled this revision from to Split arc4random mutexes to improve performance on IPSec traffic.
emeric.poupon_stormshield.eu updated this object.
emeric.poupon_stormshield.eu edited the test plan for this revision. (Show Details)
emeric.poupon_stormshield.eu edited edge metadata.

Sorry, forgot to remove a commented line

It will be better to use non-failing malloc flag. arc4_init() can't fail.

In D8130#168667, @ache wrote:

It will be better to use non-failing malloc flag. arc4_init() can't fail.

You mean a M_WAITOK flag? I was wondering: are you sure it is really safe to sleep in this context?

In D8130#168667, @ache wrote:

It will be better to use non-failing malloc flag. arc4_init() can't fail.

You mean a M_WAITOK flag? I was wondering: are you sure it is really safe to sleep in this context?

I am not sure about sleeping allowed or not so early. But I am sure we can't left RNG non-initialized:

if (arc4inst == NULL)
        return;

Sorry, my prev. comment was sent non-edited. Please forget it if you got it through the mail and re-read here instead.

In D8130#168678, @ache wrote:

Sorry, my prev. comment was sent non-edited. Please forget it if you got it through the mail and re-read here instead.

No problem!
Since the call is very unlikely to fail, what about a KASSERT on failure then? That seems to be an acceptable compromise.

In D8130#168678, @ache wrote:

Sorry, my prev. comment was sent non-edited. Please forget it if you got it through the mail and re-read here instead.

No problem!
Since the call is very unlikely to fail, what about a KASSERT on failure then? That seems to be an acceptable compromise.

Yes, I think KASSERT will be better than nothing.

Thanks for working on this, but I have some different ideas which I don't really have the time to sit down and implement at this time that I feel like sharing and may be useful:

I think we could take a different approach: now that we have the pseudo-random state per-CPU, perhaps we can make them CPU-bound and completely eliminate the locks and use critical sections instead?

(To this extent, arc4rand would become something like:

{

getmicrouptime(&tv);

while (len > 0) {
    critical_enter();
    struct arc4_s *arc4 = &arc4inst[curcpu];
    int count=128; // XXX a number that one CPU can comfortably compute without leaving critical section
    if (atomic_cmpset_int(&arc4rand_iniseed_state, ARC4_ENTR_HAVE,
        ARC4_ENTR_SEED) || reseed ||
       (arc4->numruns > ARC4_RESEED_BYTES) ||
       (tv.tv_sec > arc4->t_reseed))
            arc4_randomstir(arc4);

    arc4->numruns += len;
    p = ptr;
    while (len-- > 0 && count-- > 0)
            *p++ = arc4_randbyte(arc4);
    critical_exit();
}

}

)

Additionally, to avoid duplicated work, it may be worthwhile to take Chacha20 before this or do the conversion at the same time.

Thanks for working on this, but I have some different ideas which I don't really have the time to sit down and implement at this time that I feel like sharing and may be useful:

I think we could take a different approach: now that we have the pseudo-random state per-CPU, perhaps we can make them CPU-bound and completely eliminate the locks and use critical sections instead?

Thanks for your comment.
I was already thinking about a critical section, but unfortunately it is not that simple since we may sleep while harvesting some random numbers in the underlying functions.

Then I made that simple implementation and I found there are now a lot of functions that are more time consuming than the IV generation.

The problem is that arc4rand_iniseed_state which is reset on the first use with atomic currently asssume that all RNGs are stired, but really only first one.

reseed arg without specifying particular CPU means nothing but stir all CPUs too. And specifying that CPU is impossible on API level - nobody know from which CPU this code will be called.

That's a good point indeed.

But I am not sure to see the actual benefit of this atomic?
I guess the read_random() call is blocking until the underlying entropy processor becomes secure?

In that case we could as well initialize the t_reseed variable to -1 to make sure randomstir is called before pulling any random bytes.
Otherwise, we could just reseed of all the arc4 instances when the state is set from ARC4_ENTR_HAVE to ARC4_ENTR_SEED.

What do you think?

But I am not sure to see the actual benefit of this atomic?

It allows to not play with locking.

I guess the read_random() call is blocking until the underlying entropy processor becomes secure?

Yes. Then it allows arc4 reseed. There is no point to reseed it before it happens with non random data.

We can separate arc4rand_iniseed_state and reseed arg to reseed all RNGs while other conditions left per-RNG as it currrently is.

In D8130#169374, @ache wrote:

But I am not sure to see the actual benefit of this atomic?

It allows to not play with locking.

I meant the arc4rand_iniseed_state variable.

I guess the read_random() call is blocking until the underlying entropy processor becomes secure?

Yes. Then it allows arc4 reseed. There is no point to reseed it before it happens with non random data.

I do agree with that, but on the very first call of arc4rand(), we can make sure readomstir() is called, blocking/waiting for the underlying generator to be safe, then we are good. No need for a arc4rand_iniseed_state variable in that scenario?

I meant the arc4rand_iniseed_state variable.

I too. Without locking or atomic either yet one additional seeding can happens or no seeding can happens at all (CPU writing to another half of word, which is already checked).

I do agree with that, but on the very first call of arc4rand(), we can make sure readomstir() is called,

Not all stiring are equal) Non random stiring is done to just not block arc4 on early boot phase.
When good randomnes becomes available (which arc4rand_iniseed_state indicates) it must be reinitialized immediately on the next call.

In D8130#169376, @ache wrote:

I meant the arc4rand_iniseed_state variable.

I too. Without locking or atomic either yet one additional seeding can happens or no seeding can happens at all (CPU writing to another half of word, which is already checked).

I do agree with that, but on the very first call of arc4rand(), we can make sure readomstir() is called,

Not all stiring are equal) Non random stiring is done to just not block arc4 on early boot phase.
When good randomnes becomes available (which arc4rand_iniseed_state indicates) it must be reinitialized immediately on the next call.

Ok, I understand your point.
However, in the existing code, we have a problem: arc4rand_iniseed_state is set from HAVE to SEED, and after that the stir() function is called .
The problem is that another thread may call the arc4rand() function and get some unsafe random bytes before the stir actually occurs and after the state was set to SEED.

Here is what I propose:

void
arc4rand(void *ptr, u_int len, int reseed)                                                  
{       
        u_char *p;
        struct timeval tv;
        struct arc4_s *arc4;                                                                
        
        if (atomic_cmpset_int(&arc4rand_iniseed_state,
                        ARC4_ENTR_HAVE, ARC4_ENTR_SEED)) {                                  
                ARC4_FOREACH(arc4)
                        arc4_randomstir(arc4);                                              
        }                                                                                   
        
        arc4 = &arc4inst[curcpu];                                                           
        getmicrouptime(&tv);
        if (reseed || (arc4->numruns > ARC4_RESEED_BYTES)                                   
                || (tv.tv_sec > arc4->t_reseed))                                            
                arc4_randomstir(arc4);                                                      
        
        mtx_lock(&arc4->mtx);                                                               
        arc4->numruns += len;                                                               
        p = ptr;
        while (len--)
                *p++ = arc4_randbyte(arc4);                                                 
        mtx_unlock(&arc4->mtx);                                                             
}

The same issue is still here but at least it solves the contention problem.

What do you think?

The problem is that another thread may call the arc4rand() function and get some unsafe random bytes before the stir actually occurs and after the state was set to SEED.

Please show how it can happen in steps by steps. In old code if the state was SEED, it surely reinitialized, and reinitialization itself keeps a lock, so RNG can't move further and wait for reinitialization (from any thread).

Here is what I propose:

In your code 'reseed' variable should be move to the same 'if' as 'arc4rand_iniseed_state' placed, we never know what we cpu we reseed otherwise.
The next thing I don't like much is possible double initialization, if this condition additionaly meet in the same time 'if ((arc4->numruns > ARC4_RESEED_BYTES) || (tv.tv_sec > arc4->t_reseed))
All you need is simple local 'reseedall' variable and set it where ARC4_FOREACH(arc4) is.

There can be small window when arc4rand_iniseed_state is already set to SEED, but arc4_randomstir() is not called yet. And right in this time another thread calls the code. Well, we miss only single reinitialization per-CPU (more are time-consuming and can't fit between the check and immediate function call), on the next call they will be already reinitialized. Dou mean another scenario?
We can move both check and function call under lock, but it will slow things down, i.e. the thing you fight against.

In D8130#169387, @ache wrote:

The problem is that another thread may call the arc4rand() function and get some unsafe random bytes before the stir actually occurs and after the state was set to SEED.

Please show how it can happen in steps by steps. In old code if the state was SEED, it surely reinitialized, and reinitialization itself keeps a lock, so RNG can't move further and wait for reinitialization (from any thread).

1/ thread T1 enters in arc4rand()
2/ As arc4rand_iniseed_state = ARC4_ENTR_HAVE => arc4rand_iniseed_state is set to ARC4_ENTR_SEED
3/ T1 enters in arc4_randomstir(), and read_random is called. note the arc4_mtx is released in that state
4/ Another thread T2 enters in arc4rand()
5/ All the conditions are false (arc4rand_iniseed_state already set to SEED, tv.tv_sec = 0, etc.
6/ T2 takes the lock arc4_mtx
7/ T2 pulls some unsafe bytes and releases the lock
8/ T1 takes the lock and actually reseeds the SBOX => too late

Here is what I propose:

In your code 'reseed' variable should be move to the same 'if' as 'arc4rand_iniseed_state' placed, we never know what we cpu we reseed otherwise.
The next thing I don't like much is possible double initialization, if this condition additionaly meet in the same time 'if ((arc4->numruns > ARC4_RESEED_BYTES) || (tv.tv_sec > arc4->t_reseed))
All you need is simple local 'reseedall' variable and set it where ARC4_FOREACH(arc4) is.

I see what you mean but could you please paste some code? Do you agree that if the state is set externaly to HAVE, we have to reseed all the instances?

In D8130#169388, @ache wrote:

There can be small window when arc4rand_iniseed_state is already set to SEED, but arc4_randomstir() is not called yet. And right in this time another thread calls the code. Well, we miss only single reinitialization per-CPU (more are time-consuming and can't fit between the check and immediate function call), on the next call they will be already reinitialized. Dou mean another scenario?

Yes that is what I meant. Not sure this is a big security concern or not?

Not sure this is a big security concern or not?

Maybe I am wrong, but I don't see big deal here - the moment of good randomness available is unpredictable in the time intervals we consider, and if some threads got it a bit later (on the next call), I don't see the problem.

Initially I mean that:

if (atomic_cmpset_int(&arc4rand_iniseed_state,
                        ARC4_ENTR_HAVE, ARC4_ENTR_SEED) || reseed) {
                ARC4_FOREACH(arc4)
                        arc4_randomstir(arc4);
                allstired = 1;
}
...
if (!allstired) {
    if ((arc4->numruns > ARC4_RESEED_BYTES) ||
        (tv.tv_sec > arc4->t_reseed))
                arc4_randomstir(arc4);
}

But think about second condition will be always false since reset by arc4_randomstir(), so playing witl 'allstired' is not needed:

if (atomic_cmpset_int(&arc4rand_iniseed_state, ARC4_ENTR_HAVE, ARC4_ENTR_SEED) ||
    reseed) {
        ARC4_FOREACH(arc4)
                arc4_randomstir(arc4);
}
...
if (arc4->numruns > ARC4_RESEED_BYTES || 
    tv.tv_sec > arc4->t_reseed)
        arc4_randomstir(arc4);
if (atomic_cmpset_int(&arc4rand_iniseed_state, ARC4_ENTR_HAVE, ARC4_ENTR_SEED) ||
    reseed) {
        ARC4_FOREACH(arc4)
                arc4_randomstir(arc4);
}
...
if (arc4->numruns > ARC4_RESEED_BYTES || 
    tv.tv_sec > arc4->t_reseed)
        arc4_randomstir(arc4);

This is nearly what I proposed before, but why would you reseed all the instances in case the user calls arc4rand with reseed = 1?

This is nearly what I proposed before, but why would you reseed all the instances in case the user calls arc4rand with reseed = 1?

Because from current API point of view there is no separate CPU argument, so the call with reseed == 1 can come from any random CPU.
In general any random CPU may be not equal to unknown CPU for which this operation was made.

emeric.poupon_stormshield.eu edited edge metadata.

Updated diff according to ache's remarks

I have some comments inline -- please fix the style change.

I'm not 100% certain with the other two and would like to get a second opinion from someone else.

By the way, will it be possible that you can do a svn diff --diff-cmd diff -x -U99999 instead? (the current diff lacks a lot of context, which would make reviewing harder)

libkern/arc4random.c
174 ↗(On Diff #21136)

Note that this is an unlocked read of arc4->numruns, wouldn't this lead to a race window? (Not necessarily a problem in practice, but it's worrying me). @markm and/or @ache -- what's your opinion?

175 ↗(On Diff #21136)

style: || operator should be at the end of line.

179 ↗(On Diff #21136)

Because the mtx is released and re-acquired, we may no longer be guaranteed to have a satisfactory numruns at this point (and may be running on a different CPU)?

unlocked read of arc4->numruns

Will lead to several arc4 stiring in worst case (which IMHO is non-practical case). Any number of stiring does not harm arc4, at least one stiring should be enough. But all this for one CPU case, I don't know about per-CPU locking in details so can't answer the next note.

libkern/arc4random.c
179 ↗(On Diff #21136)

Unfortunately this is a problem that is already present in the current version.

The goal is not to bind the running thread to a given CPU, but just to split the arc4rand instance in several instances in order to reduce contention. I used curcpu to choose an instance but I could as well have done a round robin on a pool of instances which size is set using a tunable.

Updated diff using full context, changed code according to delphij's comment

Unfortunately this is a problem that is already present in the current version.

In current version it is not a problem (see my previous comment about multiple stiring above), but in case each arc4 instance can run on different arc4 data set, it can be a problem.

By the second thought, no problem:) Since you mention that per-CPU locking is not your goal, in worst case the same harmless multiple stiring can happens since numruns belongs to the same arc4 data set to which stiring applied.

delphij edited edge metadata.

Sorry for the delay (des@ have pinged me today). Consider this as a so@ approval as we haven't heard other objections so far.

We should make further improvements after this change lands, I'll file a ticket for it.

Thanks for your work by the way!

This revision is now accepted and ready to land.Nov 15 2016, 6:07 PM
This revision was automatically updated to reflect the committed changes.