For machines that have the cmpxchg16b instruction, i.e. everything but very early Athlons, provide a lockless implementation of delayed invalidation (DI).
The implementation maintains a lock-less singly-linked list, using the trick from the T.L. Harris article about a volatile mark on the elements being removed. Double-CAS is used to atomically update both the link and the generation. A new thread starting DI appends itself to the end of the queue, setting its generation to the generation of the last element + 1. On DI finish, the thread donates its generation to the previous element. The generation of the fake head of the list is the last passed DI generation.
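The queue protocol described above can be illustrated with a small userspace sketch. It is not the pmap code. To stay portable it packs a 32-bit generation and a 32-bit slot index into one 64-bit word and uses a plain C11 64-bit CAS as a stand-in for cmpxchg16b. All names (di_slot, di_start, di_finish, di_passed_gen, DI_MARK, DI_NIL) are hypothetical, and slot reuse by concurrent threads and memory reclamation are not addressed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DI_MARK  0x80000000u	/* hypothetical: element is being removed */
#define DI_NIL   0x7fffffffu	/* hypothetical: end-of-list index */
#define DI_HEAD  0u		/* slot 0 plays the fake head */
#define DI_SLOTS 8u

/* Each slot holds (generation << 32 | next), updated by a single CAS. */
static _Atomic uint64_t di_slot[DI_SLOTS];

static uint64_t di_pack(uint32_t gen, uint32_t next)
{
	return (((uint64_t)gen << 32) | next);
}
static uint32_t di_gen(uint64_t v) { return ((uint32_t)(v >> 32)); }
static uint32_t di_next(uint64_t v) { return ((uint32_t)v); }

static void di_init(void)
{
	for (uint32_t i = 0; i < DI_SLOTS; i++)
		atomic_store(&di_slot[i], di_pack(0, DI_NIL));
}

/* Append slot `self` to the tail; returns the generation assigned to it. */
static uint32_t di_start(uint32_t self)
{
	assert(self != DI_HEAD && self < DI_SLOTS);
	for (;;) {
		uint32_t cur = DI_HEAD;
		uint64_t v = atomic_load(&di_slot[cur]);
		/* Walk to the last element, following marked links too. */
		while ((di_next(v) & ~DI_MARK) != DI_NIL) {
			cur = di_next(v) & ~DI_MARK;
			v = atomic_load(&di_slot[cur]);
		}
		if (di_next(v) & DI_MARK)
			continue;	/* tail is being removed, restart */
		uint32_t gen = di_gen(v) + 1;
		atomic_store(&di_slot[self], di_pack(gen, DI_NIL));
		uint64_t want = di_pack(di_gen(v), self);
		if (atomic_compare_exchange_strong(&di_slot[cur], &v, want))
			return (gen);
	}
}

/* Unlink slot `self`, donating its generation to the previous element. */
static void di_finish(uint32_t self)
{
	/* Mark our own link so that nobody can append after or update us. */
	uint64_t mine = atomic_load(&di_slot[self]);
	while (!atomic_compare_exchange_weak(&di_slot[self], &mine,
	    mine | DI_MARK))
		;
	for (;;) {
		/* Search from the head for the element that links to us. */
		uint32_t prev = DI_HEAD;
		uint64_t pv = atomic_load(&di_slot[prev]);
		while ((di_next(pv) & ~DI_MARK) != self) {
			prev = di_next(pv) & ~DI_MARK;
			if (prev == DI_NIL)
				break;
			pv = atomic_load(&di_slot[prev]);
		}
		if (prev == DI_NIL || (di_next(pv) & DI_MARK) != 0)
			continue;	/* predecessor is in flux, restart */
		/* One CAS replaces both the generation and the link of prev. */
		uint64_t want = di_pack(di_gen(mine), di_next(mine));
		if (atomic_compare_exchange_strong(&di_slot[prev], &pv, want))
			return;
	}
}

/* The generation stored in the fake head is the last passed generation. */
static uint32_t di_passed_gen(void)
{
	return (di_gen(atomic_load(&di_slot[DI_HEAD])));
}
```

In this sketch, if slots 1, 2 and 3 start DI in that order and slot 2 finishes first, the head generation stays at 0 because generation 1 is still active. It advances to 2 only once slot 1 finishes as well, since slot 1 by then carries the donated generation.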
Basically, the implementation is a queued spinlock without the spinlock.
On a 32-thread machine, running make -j 32 buildworld on tmpfs gives the following numbers:
root@r-freeb43:~ # sysctl vm.pmap | grep invl
vm.pmap.invl_wait: 6
vm.pmap.invl_max_qlen: 8
vm.pmap.invl_finish_restart: 15209
vm.pmap.invl_start_restart: 77830