This splits the buffer cache from a single managed entity into multiple 'buf domains'. A bufobj is associated with a buf domain. Each domain manages its own buf space and buf count limits, has its own wait channels, and runs its own bufspace daemon. Each domain additionally has an optional per-cpu clean queue to alleviate pressure on the clean queue lock. A single global bufspace is no longer supportable on large machines; it becomes a point of significant contention, whether for the atomic bufspace variable or for the locks synchronizing sleeps and wakeups.
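A rough sketch of the per-domain state described above; the field names here are hypothetical and do not match the actual definition in the kernel sources:

```c
/*
 * Hypothetical sketch of per-domain buffer cache state.  Field names are
 * illustrative only; the real structure lives in the kernel sources.
 */
struct bufqueue;			/* per-queue lock + list of bufs */
struct bufdomain_sketch {
	long		 bd_bufspace;	 /* buf space charged to this domain */
	long		 bd_maxbufspace; /* per-domain buf space limit */
	int		 bd_freebuffers; /* per-domain buf count accounting */
	void		*bd_wait;	 /* domain-private wait channel */
	struct bufqueue	*bd_cleanq;	 /* shared clean queue */
	struct bufqueue	*bd_pcpu_cleanq; /* optional per-CPU clean queues */
	/* each domain also runs its own bufspace daemon */
};
```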
The KVA space is actually still managed by a single vmem that uses per-cpu quantum caches. The buf headers themselves are still managed in per-cpu uma caches. All global counters have been switched from atomics to counter(9).
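The two KPIs mentioned above look roughly like this in use; the arena name, sizes, and counter are made up for illustration and are not the buffer cache's actual initialization:

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/counter.h>
#include <sys/malloc.h>
#include <sys/vmem.h>

static vmem_t *example_arena;
static counter_u64_t example_bufs_created;

static void
example_init(vmem_addr_t base, vmem_size_t size)
{
	/*
	 * A nonzero qcache_max (second-to-last argument) gives the arena
	 * per-CPU quantum caches, so most allocations up to that size
	 * avoid taking the arena lock.
	 */
	example_arena = vmem_create("example kva", base, size,
	    PAGE_SIZE, 16 * PAGE_SIZE, M_WAITOK);

	/* counter(9) keeps per-CPU counters; updates need no atomics. */
	example_bufs_created = counter_u64_alloc(M_WAITOK);
}

static void
example_account_one(void)
{
	counter_u64_add(example_bufs_created, 1);
}
```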
There is a second level of lock avoidance with B_REUSE. This favors a second-chance algorithm in buf_recycle() rather than directly adjusting queue position on every access to a buf. It means frequently read bufs may linger slightly longer and buf_recycle() may have to visit more bufs before it completes successfully. However, for things like indirect blocks, which are read very frequently, this was a substantial reduction in lock contention.
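A self-contained model of the second-chance idea: readers only set a flag, and the recycle scan requeues flagged bufs instead of evicting them. Names here are illustrative; this is not the kernel's buf_recycle().

```c
#include <sys/queue.h>

#define	BM_REUSE	0x01		/* models B_REUSE */

struct buf_model {
	int			flags;
	TAILQ_ENTRY(buf_model)	link;
};
TAILQ_HEAD(bufq_model, buf_model);

/* On access: only set a flag, so no queue lock is taken. */
static void
buf_access(struct buf_model *bp)
{
	bp->flags |= BM_REUSE;
}

/* On recycle: recently used bufs get a second chance at the tail. */
static struct buf_model *
buf_recycle_model(struct bufq_model *q)
{
	struct buf_model *bp;

	while ((bp = TAILQ_FIRST(q)) != NULL) {
		if (bp->flags & BM_REUSE) {
			bp->flags &= ~BM_REUSE;
			TAILQ_REMOVE(q, bp, link);
			TAILQ_INSERT_TAIL(q, bp, link);
			continue;
		}
		TAILQ_REMOVE(q, bp, link);
		return (bp);		/* victim to reclaim */
	}
	return (NULL);
}
```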
I don't really like bufqueue() and bufqueue_acquire(). They work, but they are not elegant. I would welcome suggestions for other mechanisms. The rest is fairly typical divide-and-conquer locking work.
I changed bufspace_reserve() to use only an atomic add. I believe there are no races in which it misses wakeups, but I could be mistaken.
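A minimal model of the atomic-add reservation pattern, using C11 atomics rather than the kernel's atomic(9); the names and the exact limit/back-out handling are assumptions, not the actual bufspace_reserve() logic. The open question in the text is whether this add-only scheme can race with sleepers and miss a wakeup.

```c
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic long space_used;
static long space_limit = 1L << 20;

/*
 * Reserve 'size' bytes with a single atomic add.  If the add pushes us
 * over the limit, back the reservation out and fail; the caller would
 * then wake the bufspace daemon and sleep.
 */
static bool
space_reserve(long size)
{
	long prev;

	prev = atomic_fetch_add(&space_used, size);
	if (prev + size > space_limit) {
		atomic_fetch_sub(&space_used, size);
		return (false);
	}
	return (true);
}
```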