Right now code first flushes all local TLB entries that needs to be flushed, then signals IPI to remote cores, and then waits for acknowledgements while spinning idle. In the VMWare article 'Don’t shoot down TLB shootdowns!' it was noted that the time spent spinning is lost, and can be more usefully used doing local TLB invalidation.
We could use the same invalidation handler for local TLB as for remote, but typically for pmap == curpmap we can use INVLPG for locals instead of INVPCID on remotes, since we cannot control context switches on them. Due to that, keep the local code and provide the callbacks to be called from smp_targeted_tlb_shootdown() after IPIs are fired but before spin wait starts.
Tested by: pho