Restructure cache_handle_range to avoid unnecessary barriers
ClosedPublic
Actions

Authored by alc on Jul 4 2019, 3:00 AM.

Details

Reviewers

andrew
markj
manu

Commits

rS349768: Restructure cache_handle_range to avoid repeated barriers. Specifically,

Summary

In order to reduce the number of "dsb ish" operations, restructure cache_handle_range so that all of the "dc" operations are performed before any "ic" operation. Then, as I interpret the ARM ARM, a single "dsb ish" between the "dc" and "ic" operations and a single "dsb ish" after the "ic" operations suffice.

Linux 5.1.16's flush_icache_range() in arch/arm64/mm/cache.S and its macro invalidate_icache_by_line in arch/arm64/include/asm/assembler.h are similarly structured.

Test Plan

With this change, the time that it takes to do a "make -j8 buildworld" on an Amazon EC2 a1.2xlarge (8 cores, 16 GB RAM) is reduced by 9%.

Since the a1.2xlarge cores are Cortex-A72's with PIPT L1 I-Caches, I also tested this patch with arm64_icache_sync_range()'s implementation changed from

ENTRY(arm64_icache_sync_range)
        /*                                                                                                                                                                          
         * XXX Temporary solution - I-cache flush should be range based for                                                                                                         
         * PIPT cache or IALLUIS for VIVT or VIPT caches                                                                                                                            
         */
/*      cache_handle_range      dcop = cvau, ic = 1, icop = ivau */
        cache_handle_range      dcop = cvau
        ic      ialluis
        dsb     ish
        isb
        ret
END(arm64_icache_sync_range)

ENTRY(arm64_icache_sync_range)
        cache_handle_range      dcop = cvau, ic = 1, icop = ivau
        ret
END(arm64_icache_sync_range)

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

alc created this revision.Jul 4 2019, 3:00 AM

Herald added a subscriber: emaste. · View Herald TranscriptJul 4 2019, 3:00 AM

alc edited the test plan for this revision. (Show Details)Jul 4 2019, 3:03 AM

andrew accepted this revision.Jul 4 2019, 1:31 PM

This revision is now accepted and ready to land.Jul 4 2019, 1:31 PM

Just an FYI, when I modified arm64_icache_sync_range() to do the range-based "ic" operations, instead of flushing the entire L1 I-cache, performance got very slightly worse. I was somewhat surprised by this result.

Closed by commit rS349768: Restructure cache_handle_range to avoid repeated barriers. Specifically, (authored by alc). · Explain WhyJul 5 2019, 8:01 PM

This revision was automatically updated to reflect the committed changes.

alc added a commit: rS349768: Restructure cache_handle_range to avoid repeated barriers. Specifically,.

Herald added a reviewer: manu. · View Herald TranscriptJul 5 2019, 8:01 PM

Herald added a subscriber: imp. · View Herald Transcript

My guess is we are hitting the crossing point where the instruction cost of the loop is greater than the cost of invalidating the entire instruction cache. We might need to have an invalidate all version of arm64_icache_sync_range then switch between them in the macro based on the length of the memory.

Revision Contents
Changeset List

Path

Size

head/

sys/

arm64/

cpufunc_asm.S

18 lines

Diff 59451

View Options

head/sys/arm64/arm64/cpufunc_asm.S

Restructure cache_handle_range to avoid unnecessary barriersClosedPublicActions