There are still loose ends to be dealt with before this patch is committed, but I'm posting it now because it seems to work and I would like to see it tested on other hardware besides Amazon EC2 Cortex-A72-based machines.
Results obtained from lmbench's lat_ctx microbenchmark, specifically, "lat_ctx 20 20 20 ...":
x /tmp/withoutASID
+ /tmp/withASID
[ministat distribution plot: column alignment lost]
    N           Min           Max        Median           Avg        Stddev
x  20           6.9          8.87         8.285        8.1605    0.48485999
+  20          5.42          7.51         6.555         6.519    0.54039849
Difference at 95.0% confidence
	-1.6415 +/- 0.328587
	-20.1152% +/- 4.02655%
	(Student's t, pooled s = 0.513381)
We have access to ThunderX servers at Sentex with 48 and 96 cores, and @emaste has a ThunderX2 and an eMAG server in his office.
For the original ThunderX we'll need to handle erratum 27456. It looks like we'll need to invalidate the local icache after setting a new ttb with a different ASID.
You'll need to check ID_AA64MMFR0_EL1 to know whether the HW supports 8- or 16-bit ASIDs.
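For reference, a minimal sketch of that check, written as plain C that decodes a register value passed in as an argument so it can be tried on any host. The macro and function names are hypothetical, not the ones in armreg.h. The field position (ASIDBits in bits [7:4], with 0b0000 meaning 8 bits and 0b0010 meaning 16 bits) is taken from the Arm architecture reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; ASIDBits occupies bits [7:4] of ID_AA64MMFR0_EL1. */
#define EX_MMFR0_ASID_BITS_SHIFT	4
#define EX_MMFR0_ASID_BITS_MASK		UINT64_C(0xf)
#define EX_MMFR0_ASID_BITS_8		UINT64_C(0x0)
#define EX_MMFR0_ASID_BITS_16		UINT64_C(0x2)

/*
 * Decode the ASID width from a raw ID_AA64MMFR0_EL1 value.
 * Returns 8 or 16, or 0 for a reserved encoding.
 */
static int
ex_asid_bits_from_mmfr0(uint64_t mmfr0)
{
	switch ((mmfr0 >> EX_MMFR0_ASID_BITS_SHIFT) & EX_MMFR0_ASID_BITS_MASK) {
	case EX_MMFR0_ASID_BITS_8:
		return (8);
	case EX_MMFR0_ASID_BITS_16:
		return (16);
	default:
		return (0);
	}
}

/* Number of distinct ASIDs for a given width. */
static uint32_t
ex_asid_space_size(int asid_bits)
{
	assert(asid_bits == 8 || asid_bits == 16);
	return (UINT32_C(1) << asid_bits);
}
```

In the kernel the argument would come from reading the system register; here it is left to the caller so the decoding logic can be checked in isolation.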
Shouldn't we be doing something smarter here when this fails? If the hardware only supports an 8-bit ASID then this is likely.
Currently, start_mmu in locore.S unconditionally sets TCR_ASID_16 in tcr_el1. I'm going to change that to read ASIDBits from ID_AA64MMFR0_EL1 and set tcr_el1 accordingly. Then, here I will read back the value set in tcr_el1.
This one-line change is arguably (1) a bug fix and (2) orthogonal to ASID support, and could and should be committed now. While developing the ASID support, I discovered that a context switch to PID 0 loads the identity map created in locore.S into TTBR0, but a context switch to any other kernel process, e.g., idle, loads the kernel page table, i.e., the page table that TTBR1 points to, into TTBR0. In other words, TTBR0 and TTBR1 wind up pointing to the same page table. This one-line change ensures that all kernel processes use the identity map, and thereby avoids some spurious page table switches in pmap_switch().
Yes, this is the biggest "loose end" that remains, but probably the last one that I will deal with. Based on seeing TCR_ASID_16 set unconditionally in tcr_el1, I have assumed, perhaps incorrectly, that all of the hardware that we currently run on supports 16-bit ASIDs. And, I didn't want to have to remotely debug a complex ASID allocator at the same time as everything else here. :-)
Change start_mmu in locore.S to set TCR_EL1.AS based on the ASIDBits field from ID_AA64MMFR0_EL1.
Two changes to armreg.h: Fix an integer overflow issue in the definition of TCR_ASID_16. Define TCR_A1.
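To illustrate the kind of overflow involved, here is a sketch under assumptions, not the actual armreg.h change. TCR_EL1.AS is bit 36 and TCR_EL1.A1 is bit 22, so shifting a plain int constant by 36 overflows; the constant needs a 64-bit type. The EX_-prefixed names and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical definitions.  Writing (1 << 36) would shift a 32-bit int
 * past its width, so the constant is given a 64-bit type first.
 */
#define EX_TCR_A1_SHIFT		22
#define EX_TCR_A1		(UINT64_C(1) << EX_TCR_A1_SHIFT)
#define EX_TCR_AS_SHIFT		36
#define EX_TCR_ASID_16		(UINT64_C(1) << EX_TCR_AS_SHIFT)

/* Set or clear TCR_EL1.AS in a TCR value according to the ASID width. */
static uint64_t
ex_tcr_select_asid_size(uint64_t tcr, int asid_bits)
{
	assert(asid_bits == 8 || asid_bits == 16);
	if (asid_bits == 16)
		return (tcr | EX_TCR_ASID_16);
	return (tcr & ~EX_TCR_ASID_16);
}

/* Read the configured ASID width back out of a TCR value. */
static int
ex_tcr_asid_bits(uint64_t tcr)
{
	return ((tcr & EX_TCR_ASID_16) != 0 ? 16 : 8);
}
```

The second helper mirrors the "read back the value set in tcr_el1" step: the pmap code can learn the ASID width from the register instead of repeating the ID register check.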
This is going to be a performance problem. The unr API insists on using a small amount of memory and to this end incurs a lot of overhead to manage the space -- see, e.g., free_unr starting with 2 mallocs just in case it will go ahead and compact the space. With the range up to 64k, I don't think just having a static bitmap should be considered a problem.
At the very least this should be rewritten as a simple mutex-protected bitmap. Preferably this would be made scalable by partitioning it, but I suspect this can wait.
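A rough userspace sketch of what such a mutex-protected bitmap could look like. This is not the proposed kernel code: it uses pthread mutexes in place of kernel mutexes, every ex_ name is hypothetical, and reserving ASID 0 is an assumption made only for the example. The bitmap for 64k ASIDs is 8 KB of static storage.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define EX_ASID_MAX	(1u << 16)
#define EX_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define EX_NWORDS	(EX_ASID_MAX / EX_WORD_BITS)

struct ex_asid_set {
	pthread_mutex_t	lock;
	unsigned long	map[EX_NWORDS];	/* one bit per ASID, 1 = in use */
	uint32_t	limit;		/* 256 or 65536 */
	uint32_t	hint;		/* where the next search starts */
};

static void
ex_asid_init(struct ex_asid_set *s, int asid_bits)
{
	assert(asid_bits == 8 || asid_bits == 16);
	pthread_mutex_init(&s->lock, NULL);
	memset(s->map, 0, sizeof(s->map));
	s->limit = 1u << asid_bits;
	s->map[0] = 1;	/* assumption: ASID 0 is reserved */
	s->hint = 1;
}

/* Returns a free ASID, or -1 if every ASID is in use. */
static int
ex_asid_alloc(struct ex_asid_set *s)
{
	int asid = -1;

	pthread_mutex_lock(&s->lock);
	for (uint32_t n = 0; n < s->limit; n++) {
		uint32_t i = (s->hint + n) % s->limit;
		unsigned long bit = 1UL << (i % EX_WORD_BITS);

		if ((s->map[i / EX_WORD_BITS] & bit) == 0) {
			s->map[i / EX_WORD_BITS] |= bit;
			s->hint = (i + 1) % s->limit;
			asid = (int)i;
			break;
		}
	}
	pthread_mutex_unlock(&s->lock);
	return (asid);
}

static void
ex_asid_free(struct ex_asid_set *s, int asid)
{
	unsigned long bit;

	assert(asid > 0 && (uint32_t)asid < s->limit);
	bit = 1UL << ((uint32_t)asid % EX_WORD_BITS);
	pthread_mutex_lock(&s->lock);
	assert((s->map[(uint32_t)asid / EX_WORD_BITS] & bit) != 0);
	s->map[(uint32_t)asid / EX_WORD_BITS] &= ~bit;
	pthread_mutex_unlock(&s->lock);
}
```

The search hint keeps allocation from rescanning the low end of the bitmap each time. What to do when the allocator returns -1, which is likely with 8-bit ASIDs, is deliberately left out of the sketch.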
I would appreciate it if folks would exercise this patch a bit, particularly on ThunderX, ThunderX2, and eMAG machines. I want to know if there are issues with this patch, e.g., mysterious program crashes, before I rewrite the ASID allocator.
I will apply the patch to my work tree and try it on ThunderX2 and eMAG. I've been using both for Poudriere builds of the full pkg set. Unfortunately, ThunderX2 encounters some sort of lock UAF that needs to be addressed, so the best I'll be able to do there is suggest that it seems no worse.
FreeBSD/SMP: Multiprocessor System Detected: 96 CPUs
random: unblocking device.
pmap_kextract(kernel_pmap->pm_l0) = 10fea605000
ttbr0 = 10fea607000
bcast_tlbi_workaround = 1
and a bit later:
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <INTEL SSDSCKHB340G4 G2010150> ACS-2 ATA SATA 3.x device
ada0: Serial Number BTWM609507D2340C
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 512bytes)
ada0: Command Queueing enabled
ada0: 324322MB (664212528 512 byte sectors)
Release APs...panic: data abort with spinlock held
cpuid = 19
time = 1
KDB: stack backtrace:
db_trace_self() at db_trace_self_wrapper+0x28
	 pc = 0xffff0000007262bc  lr = 0xffff000000103bf8
	 sp = 0xffff000000ddbc20  fp = 0xffff000000ddbe30
db_trace_self_wrapper() at vpanic+0x18c
	 pc = 0xffff000000103bf8  lr = 0xffff0000003fd61c
	 sp = 0xffff000000ddbe40  fp = 0xffff000000ddbef0
vpanic() at panic+0x44
	 pc = 0xffff0000003fd61c  lr = 0xffff0000003fd3cc
	 sp = 0xffff000000ddbf00  fp = 0xffff000000ddbf80
panic() at data_abort+0x254
	 pc = 0xffff0000003fd3cc  lr = 0xffff0000007422bc
	 sp = 0xffff000000ddbf90  fp = 0xffff000000ddc050
data_abort() at do_el1h_sync+0x128
	 pc = 0xffff0000007422bc  lr = 0xffff000000741f64
	 sp = 0xffff000000ddc060  fp = 0xffff000000ddc090
do_el1h_sync() at handle_el1h_sync+0x74
	 pc = 0xffff000000741f64  lr = 0xffff000000728874
	 sp = 0xffff000000ddc0a0  fp = 0xffff000000ddc1b0
handle_el1h_sync() at sched_clock+0x4c
	 pc = 0xffff000000728874  lr = 0xffff00000042ae60
	 sp = 0xffff000000ddc1c0  fp = 0xffff000000ddc360
sched_clock() at statclock+0x138
	 pc = 0xffff00000042ae60  lr = 0xffff000000399078
	 sp = 0xffff000000ddc370  fp = 0xffff000000ddc390
statclock() at handleevents+0x108
	 pc = 0xffff000000399078  lr = 0xffff000000775cb4
	 sp = 0xffff000000ddc3a0  fp = 0xffff000000ddc3e0
handleevents() at timercb+0x1b0
	 pc = 0xffff000000775cb4  lr = 0xffff0000007765b4
	 sp = 0xffff000000ddc3f0  fp = 0xffff000000ddc450
timercb() at arm_tmr_intr+0x58
	 pc = 0xffff0000007765b4  lr = 0xffff00000070a6e0
	 sp = 0xffff000000ddc460  fp = 0xffff000000ddc460
arm_tmr_intr() at intr_event_handle+0xc8
	 pc = 0xffff00000070a6e0  lr = 0xffff0000003c0934
	 sp = 0xffff000000ddc470  fp = 0xffff000000ddc4b0
intr_event_handle() at intr_isrc_dispatch+0x34
	 pc = 0xffff0000003c0934  lr = 0xffff000000778068
	 sp = 0xffff000000ddc4c0  fp = 0xffff000000ddc4d0
intr_isrc_dispatch() at arm_gic_v3_intr+0x138
	 pc = 0xffff000000778068  lr = 0xffff00000072ccb4
	 sp = 0xffff000000ddc4e0  fp = 0xffff000000ddc530
arm_gic_v3_intr() at intr_irq_handler+0x74
	 pc = 0xffff00000072ccb4  lr = 0xffff000000777ec8
	 sp = 0xffff000000ddc540  fp = 0xffff000000ddc560
intr_irq_handler() at handle_el1h_irq+0x70
	 pc = 0xffff000000777ec8  lr = 0xffff000000728930
	 sp = 0xffff000000ddc570  fp = 0xffff000000ddc680
handle_el1h_irq() at init_secondary+0xf4
	 pc = 0xffff000000728930  lr = 0xffff000000732b3c
	 sp = 0xffff000000ddc690  fp = 0xffff000000ddc720
init_secondary() at init_secondary+0xf4
	 pc = 0xffff000000732b3c  lr = 0xffff000000732b3c
	 sp = 0xffff000000ddc730  fp = 0xffff000000ddc730
init_secondary() at 0x10fea6010bc
	 pc = 0xffff000000732b3c  lr = 0x0000010fea6010bc
	 sp = 0xffff000000ddc740  fp = 0x0000000000000000