Parallelize the buffer cache and rewrite getnewbuf(). This results in an
8x performance improvement in a microbenchmark on a 4-socket machine.
- Get buffer headers from a per-cpu uma cache that sits in front of the free queue.
- Use a per-cpu quantum cache in vmem to eliminate contention for kva.
- Use multiple clean queues according to buffer cache size to eliminate clean queue lock contention.
- Introduce a bufspace daemon that attempts to prevent getnewbuf() callers from blocking or doing direct recycling.
- Close some bufspace allocation races that could lead to endless recycling.
- Further the transition to a more modern style of small functions grouped by prefix in order to manage growing complexity.
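
For illustration only, the multiple-clean-queue idea in the list above can be
sketched in userland C roughly as follows. This is not the committed kernel
code: NQUEUES, struct clean_queue, queues_init(), buf_release() and buf_get()
are hypothetical names, pthread mutexes stand in for kernel locks, the "hint"
argument stands in for the current CPU id, and a simple LIFO list stands in
for the real queue ordering.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NQUEUES 4	/* hypothetical fixed count; the commit sizes this from the cache */

struct buf {
	struct buf *next;
	int id;
};

/* One independently locked queue, so callers on different queues never contend. */
struct clean_queue {
	pthread_mutex_t lock;
	struct buf *head;
	int len;
};

static struct clean_queue queues[NQUEUES];

void
queues_init(void)
{
	for (int i = 0; i < NQUEUES; i++) {
		pthread_mutex_init(&queues[i].lock, NULL);
		queues[i].head = NULL;
		queues[i].len = 0;
	}
}

/* Return a buffer to the queue selected by hint (assumed: a CPU id). */
void
buf_release(struct buf *bp, unsigned hint)
{
	struct clean_queue *q = &queues[hint % NQUEUES];

	pthread_mutex_lock(&q->lock);
	bp->next = q->head;
	q->head = bp;
	q->len++;
	pthread_mutex_unlock(&q->lock);
}

/*
 * Take a buffer, trying the caller's own queue first and then the
 * remaining queues in order.  Returns NULL if every queue is empty.
 */
struct buf *
buf_get(unsigned hint)
{
	for (unsigned i = 0; i < NQUEUES; i++) {
		struct clean_queue *q = &queues[(hint + i) % NQUEUES];
		struct buf *bp;

		pthread_mutex_lock(&q->lock);
		bp = q->head;
		if (bp != NULL) {
			q->head = bp->next;
			q->len--;
			assert(q->len >= 0);
		}
		pthread_mutex_unlock(&q->lock);
		if (bp != NULL)
			return (bp);
	}
	return (NULL);
}
```

In this sketch only one queue lock is held at a time, so the common case
touches a single lock that few other callers share; other queues are visited
only when the preferred one is empty.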
Sponsored by: EMC / Isilon
Reviewed by: kib
Tested by: pho