The newest AMD CPUs have a broadcast variant of invlpg, the INVLPGB instruction. It performs the invalidation locally and broadcasts a request for the same invalidation to all logical CPUs. I tried to apply it to our pmap in general, but it is not convenient because of our sliding PCID algorithm: the PCID assigned to a user pmap differs from CPU to CPU, so a single broadcast request cannot name it.
But I do think that one use of the instruction could be very profitable, especially on large machines: kernel pmap invalidations can be done without IPIs. These invalidations always use PCID 0 and are always broadcast to all CPUs, so this seems to be an ideal application.
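To make the idea concrete, here is a rough sketch of how the register operands for a kernel-range INVLPGB might be built. It is not the code from this change. The names (struct invlpgb_args, invlpgb_encode, invlpgb_issue, tlbsync_issue) are made up for illustration, and the bit layout is my reading of the AMD APM: flags in the low bits of rAX, PCID in EDX bits 27:16, extra page count in ECX bits 15:0. The choice of flags for kernel mappings is also an assumption. The caller is assumed to keep npages within the limit the CPU reports through CPUID.

```c
#include <stdint.h>

/* rAX flag bits (assumed layout, per my reading of the AMD APM). */
#define INVLPGB_VA	(1ULL << 0)	/* rAX holds a valid virtual address */
#define INVLPGB_PCID	(1ULL << 1)	/* EDX[27:16] holds a valid PCID */
#define INVLPGB_GLOBAL	(1ULL << 3)	/* also hit global translations */

/* Hypothetical container for the three register operands. */
struct invlpgb_args {
	uint64_t rax;	/* page-aligned VA | flags */
	uint32_t ecx;	/* number of additional 4K pages */
	uint32_t edx;	/* PCID in 27:16, ASID in 15:0 */
};

/*
 * Build operands to invalidate npages 4K pages starting at va for the
 * given PCID.  npages must be at least 1 and must not exceed the
 * CPU-reported maximum; that check is left to the caller here.
 */
static inline struct invlpgb_args
invlpgb_encode(uint64_t va, uint32_t npages, uint32_t pcid)
{
	struct invlpgb_args a;

	a.rax = (va & ~0xfffULL) | INVLPGB_VA | INVLPGB_PCID | INVLPGB_GLOBAL;
	a.ecx = (npages - 1) & 0xffff;
	a.edx = (pcid & 0xfff) << 16;	/* ASID left as 0 */
	return (a);
}

#ifdef _KERNEL
/* Privileged; raw opcode bytes in case the assembler lacks the mnemonics. */
static inline void
invlpgb_issue(struct invlpgb_args a)
{
	__asm __volatile(".byte 0x0f,0x01,0xfe"	/* invlpgb */
	    : : "a" (a.rax), "c" (a.ecx), "d" (a.edx) : "memory");
}

static inline void
tlbsync_issue(void)
{
	__asm __volatile(".byte 0x0f,0x01,0xff" : : : "memory"); /* tlbsync */
}
#endif
```

For a kernel range the caller would pass pcid 0, issue one or more INVLPGB requests, and then wait with TLBSYNC on the same CPU before treating the invalidation as complete; no IPI is involved at any point.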
I suspect that intense kmalloc/malloc(9) loads such as ZFS might get a significant speedup from the elimination of IPIs.