
A preliminary patch for using ASIDs in the arm64 pmap
Needs Review (Public)

Authored by alc on Mon, Oct 7, 6:40 PM.

Details

Reviewers
andrew
markj
Summary

There are still loose ends to be dealt with before this patch is committed, but I'm posting it now because it seems to work and I would like to see it tested on other hardware besides Amazon EC2 Cortex-A72-based machines.

Test Plan

Results obtained from lmbench's lat_ctx microbenchmark, specifically, "lat_ctx 20 20 20 ...":

x /tmp/withoutASID
+ /tmp/withASID
+------------------------------------------------------------------------------+
|                        +  +  +                  xx       x     xx     xx     |
|+    +  + +     +   + +++  +  +  *++ + +       + xx      xx   x xx  x xxx    x|
|            |____________A___________|            |__________A__M_______|     |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  20           6.9          8.87         8.285        8.1605    0.48485999
+  20          5.42          7.51         6.555         6.519    0.54039849
Difference at 95.0% confidence
        -1.6415 +/- 0.328587
        -20.1152% +/- 4.02655%
        (Student's t, pooled s = 0.513381)

Event Timeline

alc created this revision. Mon, Oct 7, 6:40 PM
mhorne added a subscriber: mhorne. Mon, Oct 7, 11:02 PM
andrew added a comment. Tue, Oct 8, 7:13 AM

We have access to ThunderX servers in Sentex with 48 and 96 cores, and @emaste has a ThunderX2 and an eMAG server in his office.

For the original ThunderX we'll need to handle erratum 27456. It looks like we'll need to invalidate the local icache after setting a new ttb with a different ASID.

arm64/arm64/pmap.c
939

You'll need to check ID_AA64MMFR0_EL1 to know if the HW supports 8 or 16 bit ASIDs.

1513–1514

Shouldn't we be doing something smarter here when this fails? If the hardware only supports an 8 bit ASID then this is likely.

alc added inline comments. Tue, Oct 8, 6:01 PM
arm64/arm64/pmap.c
939

Currently, start_mmu in locore.S unconditionally sets TCR_ASID_16 in tcr_el1. I'm going to change that to read ASIDBits from ID_AA64MMFR0_EL1 and set tcr_el1 accordingly. Then, here I will read back the value set in tcr_el1.
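As a rough illustration of the plan described above, the following C sketch decodes the ASIDBits field and derives the TCR_EL1.AS setting from it. The field position (bits [7:4], where 0b0000 means 8 bits and 0b0010 means 16 bits) and the AS bit (bit 36) follow the Arm architecture. The macro and function names are hypothetical and are not taken from the patch, and the real code in locore.S is assembly rather than C. Treating reserved ASIDBits encodings as 8 bits is a conservative assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* ID_AA64MMFR0_EL1.ASIDBits occupies bits [7:4]. */
#define MMFR0_ASIDBITS_SHIFT	4
#define MMFR0_ASIDBITS_MASK	UINT64_C(0xf)
#define MMFR0_ASIDBITS_16	UINT64_C(0x2)	/* 0x0 means 8-bit ASIDs */

/*
 * TCR_EL1.AS is bit 36.  The constant must be 64 bits wide: shifting a
 * plain int by 36 overflows, so UINT64_C is used here.
 */
#define TCR_AS_16		(UINT64_C(1) << 36)

/* Number of ASID bits the hardware implements (hypothetical helper). */
static inline unsigned
asid_bits_from_mmfr0(uint64_t mmfr0)
{
	uint64_t field;

	field = (mmfr0 >> MMFR0_ASIDBITS_SHIFT) & MMFR0_ASIDBITS_MASK;
	/* Assumption: anything other than the 16-bit encoding is 8 bits. */
	return (field == MMFR0_ASIDBITS_16 ? 16 : 8);
}

/* Set or clear TCR_EL1.AS to match what the hardware supports. */
static inline uint64_t
tcr_with_asid_size(uint64_t tcr, uint64_t mmfr0)
{
	if (asid_bits_from_mmfr0(mmfr0) == 16)
		return (tcr | TCR_AS_16);
	return (tcr & ~TCR_AS_16);
}

/* What the pmap would later read back from the TCR value. */
static inline unsigned
asid_bits_from_tcr(uint64_t tcr)
{
	return ((tcr & TCR_AS_16) != 0 ? 16 : 8);
}

/* Size of the ASID space for a given width. */
static inline unsigned
asid_count(unsigned bits)
{
	assert(bits == 8 || bits == 16);
	return (1u << bits);
}
```

With this split, the early boot code decides the width once, and later code only needs the TCR value to learn how many ASIDs are available.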

1489

This one-line change is arguably (1) a bug fix and (2) orthogonal to ASID support, and could and should be committed now. While developing the ASID support, I discovered that a context switch to PID 0 loads the identity map created in locore.S into TTBR0, but a context switch to any other kernel process, e.g., idle, loads the kernel page table, i.e., the page table that TTBR1 points to, into TTBR0. In other words, TTBR0 and TTBR1 wind up pointing to the same page table. This one-line change ensures that all kernel processes use the identity map, and thereby avoids some spurious page table switches in pmap_switch().

1513–1514

Yes, this is the biggest "loose end" that remains, but probably the last one that I will deal with. Based on seeing TCR_ASID_16 set unconditionally in tcr_el1, I have assumed, perhaps incorrectly, that all of the hardware that we currently run on supports 16-bit ASIDs. And, I didn't want to have to remotely debug a complex ASID allocator at the same time as everything else here. :-)

alc updated this revision to Diff 63074. Wed, Oct 9, 3:53 AM

Change start_mmu in locore.S to set TCR_EL1.AS based on the ASIDBits field from ID_AA64MMFR0_EL1.

Two changes to armreg.h: Fix an integer overflow issue in the definition of TCR_ASID_16. Define TCR_A1.

alc marked an inline comment as done. Wed, Oct 9, 3:54 AM
alc added a comment. Wed, Oct 9, 4:21 AM

For the original ThunderX we'll need to handle erratum 27456. It looks like we'll need to invalidate the local icache after setting a new ttb with a different ASID.

How should I test for the affected CPUs?
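One possible approach, sketched here under stated assumptions, is to match on the fields of MIDR_EL1. The field layout (implementer in bits [31:24], variant in [23:20], part number in [15:4], revision in [3:0]) is architectural, and 0x43 and 0x0a1 are the implementer and part values commonly used for Cavium and the original ThunderX. The macro and function names are hypothetical. This thread does not state which revisions are affected, so the sketch deliberately matches every revision of the part rather than guessing a range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural MIDR_EL1 field extraction. */
#define MIDR_IMPL(midr)		(((midr) >> 24) & 0xff)
#define MIDR_VARIANT(midr)	(((midr) >> 20) & 0xf)
#define MIDR_PARTNUM(midr)	(((midr) >> 4) & 0xfff)
#define MIDR_REVISION(midr)	((midr) & 0xf)

/* Values assumed for this sketch: Cavium and the original ThunderX. */
#define IMPL_CAVIUM		0x43
#define PART_THUNDERX		0x0a1

/*
 * Returns true when the MIDR value identifies an original ThunderX.
 * Assumption: no revision filtering, because the affected revision
 * range is not given in this review.
 */
static inline bool
cpu_is_thunderx(uint64_t midr)
{
	/* Bits [63:32] of MIDR_EL1 are reserved and read as zero. */
	assert((midr >> 32) == 0);
	return (MIDR_IMPL(midr) == IMPL_CAVIUM &&
	    MIDR_PARTNUM(midr) == PART_THUNDERX);
}
```

A workaround flag could then be set once at boot if any CPU matches, and consulted on each switch to a different ASID.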

andrew added inline comments. Wed, Oct 9, 6:36 AM
arm64/arm64/pmap.c
1513–1514

I don't know of any hardware with an 8-bit ASID, and a search through dmesgs in https://dmesgd.nycbug.org/ didn't find any so it may be safe to assume all usable cores have a 16-bit ASID.

mjg added a subscriber: mjg. Thu, Oct 10, 7:01 PM
mjg added inline comments.
arm64/arm64/pmap.c
941

This is going to be a performance problem. The unr API insists on using a small amount of memory and to this end incurs a lot of overhead to manage the space; see, e.g., free_unr starting with 2 mallocs just in case it goes ahead and compacts the space. With the range up to 64k, I don't think just having a static bitmap should be considered a problem.

At the very least this should be rewritten as a simple mutex-protected bitmap. Preferably this would be made scalable by partitioning it, but I suspect this can wait.
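For concreteness, a minimal userland sketch of the "simple mutex-protected bitmap" idea follows. It is not the allocator from the patch: it uses a pthread mutex in place of a kernel mutex, all names are hypothetical, and reserving ASID 0 is an assumption made only for this example. A 64k-entry bitmap costs 8 KB of static storage.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>

#define ASID_COUNT	65536	/* 16-bit ASID space */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define ASID_WORDS	(ASID_COUNT / WORD_BITS)

static unsigned long asid_map[ASID_WORDS];	/* one bit per ASID */
static pthread_mutex_t asid_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned asid_hint = 1;	/* assumption: ASID 0 is never handed out */

/* Returns an unused ASID, or -1 when every ASID is in use. */
int
asid_alloc(void)
{
	unsigned cand, n;
	int asid = -1;

	pthread_mutex_lock(&asid_lock);
	/* Visit each of the ASIDs 1..ASID_COUNT-1 at most once. */
	for (n = 0; n < ASID_COUNT - 1; n++) {
		cand = asid_hint;
		asid_hint = (asid_hint + 1 < ASID_COUNT) ? asid_hint + 1 : 1;
		if ((asid_map[cand / WORD_BITS] &
		    (1UL << (cand % WORD_BITS))) == 0) {
			asid_map[cand / WORD_BITS] |= 1UL << (cand % WORD_BITS);
			asid = (int)cand;
			break;
		}
	}
	pthread_mutex_unlock(&asid_lock);
	return (asid);
}

/* Releases an ASID previously returned by asid_alloc(). */
void
asid_free(int asid)
{
	assert(asid > 0 && asid < ASID_COUNT);
	pthread_mutex_lock(&asid_lock);
	/* Freeing an ASID that is not allocated is a caller bug. */
	assert((asid_map[asid / WORD_BITS] &
	    (1UL << (asid % WORD_BITS))) != 0);
	asid_map[asid / WORD_BITS] &= ~(1UL << (asid % WORD_BITS));
	pthread_mutex_unlock(&asid_lock);
}
```

The rotating hint avoids rescanning the low end of the bitmap on every allocation and delays reuse of a just-freed ASID. Partitioning the bitmap per CPU or per domain, as suggested above, would be a later refinement.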

alc updated this revision to Diff 63188. Sat, Oct 12, 5:31 PM

Work around Cavium erratum 27456.

alc edited the test plan for this revision. Sun, Oct 13, 2:49 AM
alc edited the test plan for this revision.
alc added inline comments. Sun, Oct 13, 3:29 AM
arm64/arm64/pmap.c
5782–5785

@andrew, is this comment related to the Cavium erratum?

alc updated this revision to Diff 63235. Sun, Oct 13, 7:14 PM

Tidy up locore.S. Specifically, allow callers of build_l1_block_pagetable to control other lower attributes besides cacheability.

alc added a comment. Mon, Oct 14, 8:37 PM

I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines. I want to know if there are issues with this patch, e.g., mysterious program crashes, before I rewrite the ASID allocator.

markj added a comment. Tue, Oct 15, 2:00 AM
In D21922#481116, @alc wrote:

I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines. I want to know if there are issues with this patch, e.g., mysterious program crashes, before I rewrite the ASID allocator.

I can give it a spin on a thunderx this week.

@emaste Any chance we could try it on your thunderx2 and emag?

I can boot with it on the dual package ThunderX in Sentex (2 x 48 cores). Unfortunately I was unable to test it as I hit an unrelated nfs locking issue.

In D21922#481116, @alc wrote:

I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines.

I will apply the patch to my work tree and try it on ThunderX2 and eMAG. I've been using both for Poudriere builds of the full pkg set. Unfortunately, ThunderX2 encounters some sort of lock UAF that needs to be addressed, so the best I'll be able to do there is suggest it seems no worse.

alc added a comment. Wed, Oct 16, 6:36 PM

I can boot with it on the dual package ThunderX in Sentex (2 x 48 cores). Unfortunately I was unable to test it as I hit an unrelated nfs locking issue.

Can you please check the output of dmesg to see if the printf that I placed in pmap_pinit0() shows that the broadcast TLBI workaround is enabled, i.e., it's non-zero?

markj added a comment. Fri, Oct 18, 5:03 PM
In D21922#481116, @alc wrote:

I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines. I want to know if there are issues with this patch, e.g., mysterious program crashes, before I rewrite the ASID allocator.

I can give it a spin on a thunderx this week.

Unfortunately I haven't been able to get a ThunderX from packet.net this week - they have been consistently unavailable. I can't do much to help unless that changes.

emaste added a subscriber: gnn. Fri, Oct 18, 5:19 PM

Unfortunately I haven't been able to get a ThunderX from packet.net this week

We have a couple of 1S ThunderX and one 2S ThunderX hosted at Sentex, coordinate with @gnn for use of them

markj added a comment. Tue, Oct 22, 3:09 PM

Unfortunately I haven't been able to get a ThunderX from packet.net this week

We have a couple of 1S ThunderX and one 2S ThunderX hosted at Sentex, coordinate with @gnn for use of them

I have access to one now, I will be testing today.

alc added a comment. Tue, Oct 22, 3:32 PM

Unfortunately I haven't been able to get a ThunderX from packet.net this week

We have a couple of 1S ThunderX and one 2S ThunderX hosted at Sentex, coordinate with @gnn for use of them

I have access to one now, I will be testing today.

Please verify that the printf in pmap_init() shows that bcast_tlbi_workaround is non-zero.

In D21922#483157, @alc wrote:

Unfortunately I haven't been able to get a ThunderX from packet.net this week

We have a couple of 1S ThunderX and one 2S ThunderX hosted at Sentex, coordinate with @gnn for use of them

I have access to one now, I will be testing today.

Please verify that the printf in pmap_init() shows that bcast_tlbi_workaround is non-zero.

I see:

FreeBSD/SMP: Multiprocessor System Detected: 96 CPUs
random: unblocking device.
pmap_kextract(kernel_pmap->pm_l0) = 10fea605000
ttbr0 = 10fea607000
bcast_tlbi_workaround = 1

and a bit later:

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <INTEL SSDSCKHB340G4 G2010150> ACS-2 ATA SATA 3.x device
ada0: Serial Number BTWM609507D2340C
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 324322MB (664212528 512 byte sectors)
Release APs...panic: data abort with spinlock held
cpuid = 19
time = 1
KDB: stack backtrace:
db_trace_self() at db_trace_self_wrapper+0x28
         pc = 0xffff0000007262bc  lr = 0xffff000000103bf8
         sp = 0xffff000000ddbc20  fp = 0xffff000000ddbe30

db_trace_self_wrapper() at vpanic+0x18c
         pc = 0xffff000000103bf8  lr = 0xffff0000003fd61c
         sp = 0xffff000000ddbe40  fp = 0xffff000000ddbef0

vpanic() at panic+0x44
         pc = 0xffff0000003fd61c  lr = 0xffff0000003fd3cc
         sp = 0xffff000000ddbf00  fp = 0xffff000000ddbf80

panic() at data_abort+0x254
         pc = 0xffff0000003fd3cc  lr = 0xffff0000007422bc
         sp = 0xffff000000ddbf90  fp = 0xffff000000ddc050

data_abort() at do_el1h_sync+0x128
         pc = 0xffff0000007422bc  lr = 0xffff000000741f64
         sp = 0xffff000000ddc060  fp = 0xffff000000ddc090

do_el1h_sync() at handle_el1h_sync+0x74
         pc = 0xffff000000741f64  lr = 0xffff000000728874
         sp = 0xffff000000ddc0a0  fp = 0xffff000000ddc1b0

handle_el1h_sync() at sched_clock+0x4c
         pc = 0xffff000000728874  lr = 0xffff00000042ae60
         sp = 0xffff000000ddc1c0  fp = 0xffff000000ddc360

sched_clock() at statclock+0x138
         pc = 0xffff00000042ae60  lr = 0xffff000000399078
         sp = 0xffff000000ddc370  fp = 0xffff000000ddc390

statclock() at handleevents+0x108
         pc = 0xffff000000399078  lr = 0xffff000000775cb4
         sp = 0xffff000000ddc3a0  fp = 0xffff000000ddc3e0

handleevents() at timercb+0x1b0
         pc = 0xffff000000775cb4  lr = 0xffff0000007765b4
         sp = 0xffff000000ddc3f0  fp = 0xffff000000ddc450

timercb() at arm_tmr_intr+0x58
         pc = 0xffff0000007765b4  lr = 0xffff00000070a6e0
         sp = 0xffff000000ddc460  fp = 0xffff000000ddc460

arm_tmr_intr() at intr_event_handle+0xc8
         pc = 0xffff00000070a6e0  lr = 0xffff0000003c0934
         sp = 0xffff000000ddc470  fp = 0xffff000000ddc4b0

intr_event_handle() at intr_isrc_dispatch+0x34
         pc = 0xffff0000003c0934  lr = 0xffff000000778068
         sp = 0xffff000000ddc4c0  fp = 0xffff000000ddc4d0

intr_isrc_dispatch() at arm_gic_v3_intr+0x138
         pc = 0xffff000000778068  lr = 0xffff00000072ccb4
         sp = 0xffff000000ddc4e0  fp = 0xffff000000ddc530

arm_gic_v3_intr() at intr_irq_handler+0x74
         pc = 0xffff00000072ccb4  lr = 0xffff000000777ec8
         sp = 0xffff000000ddc540  fp = 0xffff000000ddc560

intr_irq_handler() at handle_el1h_irq+0x70
         pc = 0xffff000000777ec8  lr = 0xffff000000728930
         sp = 0xffff000000ddc570  fp = 0xffff000000ddc680

handle_el1h_irq() at init_secondary+0xf4
         pc = 0xffff000000728930  lr = 0xffff000000732b3c
         sp = 0xffff000000ddc690  fp = 0xffff000000ddc720

init_secondary() at init_secondary+0xf4
         pc = 0xffff000000732b3c  lr = 0xffff000000732b3c
         sp = 0xffff000000ddc730  fp = 0xffff000000ddc730

init_secondary() at 0x10fea6010bc
         pc = 0xffff000000732b3c  lr = 0x0000010fea6010bc
         sp = 0xffff000000ddc740  fp = 0x0000000000000000

A subsequent boot succeeded. Perhaps there is some unrelated problem. I'll try some builds.