A patch for using ASIDs in the arm64 pmap
ClosedPublic

Authored by alc on Oct 7 2019, 6:40 PM.

Details

Summary

Utilize ASIDs to reduce both the direct and indirect costs of context switching. The indirect costs are the unnecessary TLB misses that are incurred when ASIDs are not used. In fact, currently, when we perform a context switch on one processor, we issue a broadcast TLB invalidation that flushes the TLB contents on every processor.

Mark all user-space ("ttbr0") page table entries with the non-global flag so that they are cached in the TLB under their ASID.

Correct an error in pmap_pinit0(). The pointer to the root of the page table was being initialized to the root of the kernel-space page table rather than a user-space page table. However, the root of the page table that was being cached in process 0's md_l0addr field correctly pointed to a user-space page table. As long as ASIDs weren't being used, this was harmless, except that it led to some unnecessary page table switches in pmap_switch(). Specifically, other kernel processes besides process 0 would have their md_l0addr field set to the root of the kernel-space page table, and so pmap_switch() would actually change page tables when switching between process 0 and other kernel processes.

Implement a workaround for Cavium erratum 27456 affecting ThunderX machines. (I would like to thank andrew@ for providing the code to detect the affected machines.)

Address integer overflow in the definition of TCR_ASID_16.

Set up TCR according to the PARange and ASIDBits fields from ID_AA64MMFR0_EL1. Previously, TCR_ASID_16 was unconditionally set.

Modify build_l1_block_pagetable so that lower attributes, such as ATTR_nG, can be specified.

Eliminate some unused code.

Test Plan

Results obtained from lmbench's lat_ctx microbenchmark, specifically, "cpuset -l <n> lat_ctx 20 20 20 ..." on Amazon EC2 Cortex-A72-based machines:

x ASIDbefore
+ ASIDafter
+------------------------------------------------------------------------------+
| +                                                                            |
| +                                                                            |
| +                                                                            |
| +                                                                           x|
| +                                                                          xx|
| +                                                                          xx|
|++                                                                          xx|
|++                                                                         xxx|
||A                                                                         |A||
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10          3.57          3.59         3.585         3.584   0.006992059
+  10           2.9          2.91          2.91         2.908  0.0042163702
Difference at 95.0% confidence
        -0.676 +/- 0.00542476
        -18.8616% +/- 0.15136%
        (Student's t, pooled s = 0.0057735)

Event Timeline

We have access to ThunderX servers in Sentex with 48 and 96 cores, and @emaste has a ThunderX2 and an eMAG server in his office.

For the original ThunderX we'll need to handle erratum 27456. It looks like we'll need to invalidate the local icache after setting a new ttb with a different ASID.

arm64/arm64/pmap.c
939

You'll need to check ID_AA64MMFR0_EL1 to know if the HW supports 8 or 16 bit ASIDs.
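As a rough illustration of that check, the sketch below decodes the ASIDBits field from a raw ID_AA64MMFR0_EL1 value. The field position (bits [7:4]) and the two encodings follow the Arm architecture manual; the macro and function names are hypothetical and are not taken from the patch, and falling back to 8 bits for reserved encodings is an assumption made here for safety.

```c
#include <stdint.h>

/* Hypothetical names; ASIDBits occupies bits [7:4] of ID_AA64MMFR0_EL1. */
#define MMFR0_ASIDBITS_SHIFT	4
#define MMFR0_ASIDBITS_MASK	(UINT64_C(0xf) << MMFR0_ASIDBITS_SHIFT)
#define MMFR0_ASIDBITS_8	(UINT64_C(0x0) << MMFR0_ASIDBITS_SHIFT)
#define MMFR0_ASIDBITS_16	(UINT64_C(0x2) << MMFR0_ASIDBITS_SHIFT)

/*
 * Return the ASID width in bits for a given ID_AA64MMFR0_EL1 value.
 * Reserved encodings fall back to 8 bits (an assumption of this sketch).
 */
static int
asid_bits_from_mmfr0(uint64_t mmfr0)
{
	switch (mmfr0 & MMFR0_ASIDBITS_MASK) {
	case MMFR0_ASIDBITS_16:
		return (16);
	case MMFR0_ASIDBITS_8:
	default:
		return (8);
	}
}
```

In a kernel the argument would come from reading the system register; here it is passed in so the decoding can be exercised on any host.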

1513–1514

Shouldn't we be doing something smarter here when this fails? If the hardware only supports an 8 bit ASID then this is likely.

arm64/arm64/pmap.c
939

Currently, start_mmu in locore.S unconditionally sets TCR_ASID_16 in tcr_el1. I'm going to change that to read ASIDBits from ID_AA64MMFR0_EL1 and set tcr_el1 accordingly. Then, here I will read back the value set in tcr_el1.

1489

This one-line change is arguably (1) a bug fix and (2) orthogonal to ASID support, and could and should be committed now. While developing the ASID support, I discovered that a context switch to PID 0 loads the identity map created in locore.S into TTBR0, but a context switch to any other kernel process, e.g., idle, loads the kernel page table, i.e., the page table that TTBR1 points to, into TTBR0. In other words, TTBR0 and TTBR1 wind up pointing to the same page table. This one-line change ensures that all kernel processes use the identity map, and thereby avoids some spurious page table switches in pmap_switch().

1513–1514

Yes, this is the biggest "loose end" that remains, but probably the last one that I will deal with. Based on seeing TCR_ASID_16 set unconditionally in tcr_el1, I have assumed, perhaps incorrectly, that all of the hardware that we currently run on supports 16-bit ASIDs. And, I didn't want to have to remotely debug a complex ASID allocator at the same time as everything else here. :-)

Change start_mmu in locore.S to set TCR_EL1.AS based on the ASIDBits field from ID_AA64MMFR0_EL1.

Two changes to armreg.h: Fix an integer overflow issue in the definition of TCR_ASID_16. Define TCR_A1.
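For readers unfamiliar with this class of bug, here is a minimal sketch of how such an overflow typically arises and is avoided. It assumes the problem was the common one of shifting a plain int constant; the exact original definition is not shown on this page, so the commented-out form is a guess. TCR_EL1.AS is bit 36, which does not fit in a 32-bit int.

```c
#include <stdint.h>

#define TCR_ASID_SHIFT	36	/* TCR_EL1.AS selects 16-bit ASIDs */

/*
 * Overflow-prone form (assumed, for illustration only): the constant 1
 * has type int, and shifting a 32-bit int by 36 is undefined behavior.
 *
 *	#define TCR_ASID_16	(1 << TCR_ASID_SHIFT)
 *
 * Giving the constant a 64-bit unsigned type avoids the problem.
 */
#define TCR_ASID_16	(UINT64_C(1) << TCR_ASID_SHIFT)

/* Hypothetical helper: return tcr with the 16-bit ASID size selected. */
static uint64_t
tcr_set_asid16(uint64_t tcr)
{
	return (tcr | TCR_ASID_16);
}
```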

alc marked an inline comment as done. Oct 9 2019, 3:54 AM

For the original ThunderX we'll need to handle erratum 27456. It looks like we'll need to invalidate the local icache after setting a new ttb with a different ASID.

How should I test for the affected CPUs?

arm64/arm64/pmap.c
1513–1514

I don't know of any hardware with an 8-bit ASID, and a search through dmesgs in https://dmesgd.nycbug.org/ didn't find any so it may be safe to assume all usable cores have a 16-bit ASID.

mjg added inline comments.
arm64/arm64/pmap.c
941

This is going to be a performance problem. The unr API insists on using a small amount of memory, and to this end incurs a lot of overhead to manage the space. See, e.g., free_unr, which starts with 2 mallocs just in case it goes ahead and compacts the space. With the range going up to 64k, I don't think a static bitmap should be considered a problem.

At the very least this should be rewritten as a simple mutex-protected bitmap. Preferably this would be made scalable by partitioning it, but I suspect that can wait.
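A minimal userland sketch of the suggested mutex-protected bitmap follows. It is not the allocator from the patch: it uses a pthread mutex in place of a kernel mutex, the names asid_alloc and asid_free are hypothetical, reserving ASID 0 is an assumption, and the linear scan is kept only for brevity.

```c
#include <limits.h>
#include <pthread.h>
#include <stdint.h>

#define ASID_COUNT	65536	/* full 16-bit ASID space */
#define ASID_FIRST	1	/* assumption: ASID 0 stays reserved */
#define WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

/* One bit per ASID; 8 KB in total for 64k ASIDs. */
static unsigned long asid_map[ASID_COUNT / (sizeof(unsigned long) * CHAR_BIT)];
static pthread_mutex_t asid_lock = PTHREAD_MUTEX_INITIALIZER;

/* Return the lowest free ASID, or -1 if the space is exhausted. */
static int
asid_alloc(void)
{
	int asid = -1;

	pthread_mutex_lock(&asid_lock);
	for (int i = ASID_FIRST; i < ASID_COUNT; i++) {
		unsigned long bit = 1UL << (i % WORD_BITS);

		if ((asid_map[i / WORD_BITS] & bit) == 0) {
			asid_map[i / WORD_BITS] |= bit;
			asid = i;
			break;
		}
	}
	pthread_mutex_unlock(&asid_lock);
	return (asid);
}

/* Mark a previously allocated ASID as free again. */
static void
asid_free(int asid)
{
	pthread_mutex_lock(&asid_lock);
	asid_map[asid / WORD_BITS] &= ~(1UL << (asid % WORD_BITS));
	pthread_mutex_unlock(&asid_lock);
}
```

A real implementation would likely remember where the last search ended rather than rescanning from the start, and would need a policy for what happens when asid_alloc fails.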

Workaround Cavium erratum 27456.

alc edited the test plan for this revision.
arm64/arm64/pmap.c
5782–5785

@andrew, is this comment related to the Cavium erratum?

Tidy up locore.S. Specifically, allow callers of build_l1_block_pagetable to control other lower attributes besides cacheability.

I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines. I want to know if there are issues with this patch, e.g., mysterious program crashes, before I rewrite the ASID allocator.

In D21922#481116, @alc wrote:

I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines. I want to know if there are issues with this patch, e.g., mysterious program crashes, before I rewrite the ASID allocator.

I can give it a spin on a thunderx this week.

@emaste Any chance we could try it on your thunderx2 and emag?

I can boot with it on the dual package ThunderX in Sentex (2 x 48 cores). Unfortunately I was unable to test it as I hit an unrelated nfs locking issue.

In D21922#481116, @alc wrote:

I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines.

I will apply the patch to my work tree and try it on ThunderX2 and eMAG. I've been using both for Poudriere builds of the full pkg set. Unfortunately, ThunderX2 encounters some sort of lock UAF that needs to be addressed, so the best I'll be able to do there is say whether it seems no worse.

I can boot with it on the dual package ThunderX in Sentex (2 x 48 cores). Unfortunately I was unable to test it as I hit an unrelated nfs locking issue.

Can you please check the output of dmesg to see if the printf that I placed in pmap_pinit0() shows that the broadcast TLBI workaround is enabled, i.e., it's non-zero?

In D21922#481116, @alc wrote:

I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines. I want to know if there are issues with this patch, e.g., mysterious program crashes, before I rewrite the ASID allocator.

I can give it a spin on a thunderx this week.

Unfortunately I haven't been able to get a ThunderX from packet.net this week - they have been consistently unavailable. I can't do much to help unless that changes.

Unfortunately I haven't been able to get a ThunderX from packet.net this week

We have a couple of 1S ThunderX and one 2S ThunderX hosted at Sentex, coordinate with @gnn for use of them

I have access to one now, I will be testing today.

Please verify that the printf in pmap_init() shows that bcast_tlbi_workaround is non-zero.

I see:

FreeBSD/SMP: Multiprocessor System Detected: 96 CPUs                                                                                                          
random: unblocking device.                                                                                                                                    
pmap_kextract(kernel_pmap->pm_l0) = 10fea605000                                                                                                               
ttbr0 = 10fea607000
bcast_tlbi_workaround = 1

and a bit later:

ada0 at ahcich0 bus 0 scbus0 target 0 lun 0                                                                                                                   
ada0: <INTEL SSDSCKHB340G4 G2010150> ACS-2 ATA SATA 3.x device                                                                                                
ada0: Serial Number BTWM609507D2340C                                                                                                                          
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)                                                                                                   
ada0: Command Queueing enabled                                                                                                                                
ada0: 324322MB (664212528 512 byte sectors)                                                                                                                   
Release APs...panic: data abort with spinlock held                                                                                                            
cpuid = 19                                                                                                                                                    
time = 1                                                                                                                                                      
KDB: stack backtrace:                                                                                                                                         
db_trace_self() at db_trace_self_wrapper+0x28                                                                                                                 
         pc = 0xffff0000007262bc  lr = 0xffff000000103bf8                                                                                                     
         sp = 0xffff000000ddbc20  fp = 0xffff000000ddbe30                                                                                                     
                                                                                                                                                              
db_trace_self_wrapper() at vpanic+0x18c                                                                                                                       
         pc = 0xffff000000103bf8  lr = 0xffff0000003fd61c                                                                                                     
         sp = 0xffff000000ddbe40  fp = 0xffff000000ddbef0                                                                                                     
                                                                                                                                                              
vpanic() at panic+0x44                                                                                                                                        
         pc = 0xffff0000003fd61c  lr = 0xffff0000003fd3cc                                                                                                     
         sp = 0xffff000000ddbf00  fp = 0xffff000000ddbf80                                                                                                     
                                                                                                                                                              
panic() at data_abort+0x254                                                                                                                                   
         pc = 0xffff0000003fd3cc  lr = 0xffff0000007422bc                                                                                                     
         sp = 0xffff000000ddbf90  fp = 0xffff000000ddc050                                                                                                     
                                                                                                                                                              
data_abort() at do_el1h_sync+0x128                                                                                                                            
         pc = 0xffff0000007422bc  lr = 0xffff000000741f64                                                                                                     
         sp = 0xffff000000ddc060  fp = 0xffff000000ddc090                                                                                                     
                                                                                                                                                              
do_el1h_sync() at handle_el1h_sync+0x74                                                                                                                       
         pc = 0xffff000000741f64  lr = 0xffff000000728874                                                                                                     
         sp = 0xffff000000ddc0a0  fp = 0xffff000000ddc1b0                                                                                                     
                                                                                                                                                              
handle_el1h_sync() at sched_clock+0x4c                                                                                                                        
         pc = 0xffff000000728874  lr = 0xffff00000042ae60                                                                                                     
         sp = 0xffff000000ddc1c0  fp = 0xffff000000ddc360                                                                                                     
                                                                                                                                                              
sched_clock() at statclock+0x138                                                                                                                              
         pc = 0xffff00000042ae60  lr = 0xffff000000399078                                                                                                     
         sp = 0xffff000000ddc370  fp = 0xffff000000ddc390                                                                                                     
                                                                                                                                                              
statclock() at handleevents+0x108                                                                                                                             
         pc = 0xffff000000399078  lr = 0xffff000000775cb4                                                                                                     
         sp = 0xffff000000ddc3a0  fp = 0xffff000000ddc3e0                                                                                                     
                                                                                                                                                              
handleevents() at timercb+0x1b0                                                                                                                               
         pc = 0xffff000000775cb4  lr = 0xffff0000007765b4                                                                                                     
         sp = 0xffff000000ddc3f0  fp = 0xffff000000ddc450                                                                                                     
                                                                                                                                                              
timercb() at arm_tmr_intr+0x58                                                                                                                                
         pc = 0xffff0000007765b4  lr = 0xffff00000070a6e0                                                                                                     
         sp = 0xffff000000ddc460  fp = 0xffff000000ddc460
                                                                                                     
arm_tmr_intr() at intr_event_handle+0xc8                                                                                                                      
         pc = 0xffff00000070a6e0  lr = 0xffff0000003c0934                                                                                                     
         sp = 0xffff000000ddc470  fp = 0xffff000000ddc4b0                                                                                                     
                                                                                                                                                              
intr_event_handle() at intr_isrc_dispatch+0x34                                                                                                                
         pc = 0xffff0000003c0934  lr = 0xffff000000778068                                                                                                     
         sp = 0xffff000000ddc4c0  fp = 0xffff000000ddc4d0                                                                                                     
                                                                                                                                                              
intr_isrc_dispatch() at arm_gic_v3_intr+0x138                                                                                                                 
         pc = 0xffff000000778068  lr = 0xffff00000072ccb4                                                                                                     
         sp = 0xffff000000ddc4e0  fp = 0xffff000000ddc530                                                                                                     
                                                                                                                                                              
arm_gic_v3_intr() at intr_irq_handler+0x74                                                                                                                    
         pc = 0xffff00000072ccb4  lr = 0xffff000000777ec8                                                                                                     
         sp = 0xffff000000ddc540  fp = 0xffff000000ddc560                                                                                                     
                                                                                                                                                              
intr_irq_handler() at handle_el1h_irq+0x70                                                                                                                    
         pc = 0xffff000000777ec8  lr = 0xffff000000728930                                                                                                     
         sp = 0xffff000000ddc570  fp = 0xffff000000ddc680                                                                                                     
                                                                                                                                                              
handle_el1h_irq() at init_secondary+0xf4                                                                                                                      
         pc = 0xffff000000728930  lr = 0xffff000000732b3c                                                                                                     
         sp = 0xffff000000ddc690  fp = 0xffff000000ddc720                                                                                                     
                                                                                                                                                              
init_secondary() at init_secondary+0xf4                                                                                                                       
         pc = 0xffff000000732b3c  lr = 0xffff000000732b3c                                                                                                     
         sp = 0xffff000000ddc730  fp = 0xffff000000ddc730                                                                                                     
                                                                                                                                                              
init_secondary() at 0x10fea6010bc                                                                                                                             
         pc = 0xffff000000732b3c  lr = 0x0000010fea6010bc                                                                                                     
         sp = 0xffff000000ddc740  fp = 0x0000000000000000

A subsequent boot succeeded. Perhaps there is some unrelated problem. I'll try some builds.

I haven't seen any issues with an overnight build loop. I'm trying a reboot loop now to see if I can trigger the original panic again.

On my eMAG bcast_tlbi_workaround is 0. I'm now installing a new kernel to start Poudriere runs.

A Poudriere build is now running on a kernel with this patch on the Ampere eMAG in Kitchener.

On ThunderX2 bcast_tlbi_workaround is 0. I successfully netbooted with the patch but haven't done anything significant.

I haven't been able to trigger any problems after the initial panic that I reported yesterday.

I tried running lat_ctx with -NODEBUG kernels with and without the patch. I see a pretty substantial reduction in context switch time with the patch applied.

With:

# cpuset -l 10 ./lat_ctx 20 20 20 20 20 20 20 20 20 20 20

"size=0k ovr=2.02
20 10.13
20 10.10
20 10.12
20 10.13
20 10.11
20 10.13
20 10.15
20 10.14
20 10.10
20 10.12
20 10.11

Without:

# cpuset -l 10 ./lat_ctx 20 20 20 20 20 20 20 20 20 20 20

"size=0k ovr=2.01
20 13.66
20 13.62
20 13.71
20 13.63
20 13.60
20 13.61
20 13.66
20 13.68
20 13.59
20 13.60
20 13.60

I also found that this patch very slightly reduced the time for a "make -j16 buildkernel" on a 16-core machine. Specifically, the reduction was about 6 seconds out of about 4 minutes 35 seconds. I suspect that this is because a context switch by any core currently triggers a global TLB invalidation that affects all cores (as opposed to a local invalidation that just affects the context switching core).
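To put a single number on the lat_ctx runs listed above, the following sketch averages the two sets of reported times and computes the relative reduction. The data arrays are copied from the output above; the function names are hypothetical. It is only a convenience calculation, not a substitute for a proper statistical comparison such as the ministat output in the test plan.

```c
#include <stddef.h>

/* Context switch times reported above, with and without the patch. */
static const double with_patch[] = {
	10.13, 10.10, 10.12, 10.13, 10.11, 10.13,
	10.15, 10.14, 10.10, 10.12, 10.11
};
static const double without_patch[] = {
	13.66, 13.62, 13.71, 13.63, 13.60, 13.61,
	13.66, 13.68, 13.59, 13.60, 13.60
};

/* Arithmetic mean of n samples. */
static double
mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return (sum / (double)n);
}

/* Relative reduction in mean time, as a percentage of the unpatched mean. */
static double
reduction_percent(void)
{
	double after = mean(with_patch,
	    sizeof(with_patch) / sizeof(with_patch[0]));
	double before = mean(without_patch,
	    sizeof(without_patch) / sizeof(without_patch[0]));

	return (100.0 * (before - after) / before);
}
```

With these samples the reduction works out to roughly a quarter of the unpatched context switch time.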

alc retitled this revision from A preliminary patch for using ASIDs in the arm64 pmap to A patch for using ASIDs in the arm64 pmap.
alc edited the summary of this revision. (Show Details)

Implement a realistic ASID allocator. This should work with either 8- or 16-bit ASIDs.

I've tested generation changes by disabling the explicit freeing of the pmap's ASID within pmap_release().

alc marked an inline comment as done. Oct 24 2019, 5:08 AM
alc marked 2 inline comments as done. Oct 24 2019, 5:11 AM

If you don't see any mysterious application (or kernel) crashes with the previous patch, then please test this new version. I believe that this new version could reasonably be committed, so also consider this message a request for review.

I only see two hypothetical issues with this new version. First, we may want to layer a small per-CPU cache of ASID "cookies" on top of the ASID allocator to avoid potential lock contention on large-scale machines. Second, if a process/pmap is assigned an ASID in generation G and then remains idle/blocked until the ASID allocator's generation number wraps around to generation G again, we could hypothetically wind up with two distinct processes/pmaps using the same ASID. In practice, I don't foresee that happening on machines with 16-bit ASIDs.
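The second issue can be illustrated with a toy model. The sketch below is not the patch's data structure: the cookie layout, the deliberately narrow 8-bit generation counter, and the function names are all assumptions chosen to make the wraparound easy to see. A pmap's ASID is treated as valid only while its recorded generation matches the allocator's current one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical cookie: an ASID plus the generation it was allocated in. */
struct asid_cookie {
	uint16_t	asid;
	uint8_t		gen;	/* deliberately narrow to show the wrap */
};

static uint8_t cur_gen;		/* allocator's current generation */

/* An ASID may be reused as-is only if it comes from this generation. */
static bool
cookie_is_current(const struct asid_cookie *c)
{
	return (c->gen == cur_gen);
}

/* Models exhaustion of the ASID space; the counter wraps modulo 256. */
static void
gen_advance(void)
{
	cur_gen++;
}
```

In this model, a cookie that sits untouched for exactly 256 generation changes compares as current again even though its ASID may since have been handed to another pmap. A wider generation counter makes the window correspondingly harder to hit.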

I saw no issues during about 24 hours of Poudriere builds with this patch on eMAG (still running, 12795 packages built so far with 17486 remaining). I won't be able to try the new patch until the end of next week. If this gets committed to head before then (after @markj's testing, perhaps), I will update and try again at the beginning of November.

Tidy up PCPU_MD_FIELDS.

alc edited the test plan for this revision.
alc edited the summary of this revision.

No functional changes: Add two comments about memory ordering on context switches. Add a KASSERT that the given pmap is the curpmap in pmap_remove_pages(), just like amd64.

Add some comments.

Replace "generation" by "epoch" throughout. The word "epoch" is equally appropriate and shorter.

No functional changes.

Add a comment to efi_arch_enter() explaining why we don't update curpmap.

Use curpmap in efi_arch_leave(), simplifying the code.

This revision was not accepted when it landed; it landed in state Needs Review. Nov 3 2019, 5:45 PM
This revision was automatically updated to reflect the committed changes.

I'm now running a Poudriere build on my eMAG with the committed version of this patch. I arbitrarily chose some packages that have finished building (3231 built so far of 32947 queued), and they seem to have built in about 5% less time than with a kernel from a month or so ago.