Page MenuHomeFreeBSD

Restructure cache_handle_range to avoid unnecessary barriers

Authored by alc on Jul 4 2019, 3:00 AM.



In order to reduce the number of "dsb ish" operations, restructure cache_handle_range so that all of the "dc" operations are performed before any "ic" operation. Then, as I interpret the ARM ARM, a single "dsb ish" between the "dc" and "ic" operations and a single "dsb ish" after the "ic" operations suffice.

Linux 5.1.16's flush_icache_range() in arch/arm64/mm/cache.S and its macro invalidate_icache_by_line in arch/arm64/include/asm/assembler.h are similarly structured.

Test Plan

With this change, the time that it takes to do a "make -j8 buildworld" on an Amazon EC2 a1.2xlarge (8 cores, 16 GB RAM) is reduced by 9%.

Since the a1.2xlarge cores are Cortex-A72's with PIPT L1 I-Caches, I also tested this patch with arm64_icache_sync_range()'s implementation changed from

         * XXX Temporary solution - I-cache flush should be range based for                                                                                                         
         * PIPT cache or IALLUIS for VIVT or VIPT caches                                                                                                                            
/*      cache_handle_range      dcop = cvau, ic = 1, icop = ivau */
        cache_handle_range      dcop = cvau
        ic      ialluis
        dsb     ish


        cache_handle_range      dcop = cvau, ic = 1, icop = ivau

Diff Detail

rS FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

alc created this revision.Jul 4 2019, 3:00 AM
alc edited the test plan for this revision. (Show Details)Jul 4 2019, 3:03 AM
andrew accepted this revision.Jul 4 2019, 1:31 PM
This revision is now accepted and ready to land.Jul 4 2019, 1:31 PM

Just an FYI, when I modified arm64_icache_sync_range() to do the range-based "ic" operations, instead of flushing the entire L1 I-cache, performance got very slightly worse. I was somewhat surprised by this result.

My guess is we are hitting the crossing point where the instruction cost of the loop is greater than the cost of invalidating the entire instruction cache. We might need to have an invalidate all version of arm64_icache_sync_range then switch between them in the macro based on the length of the memory.