I apologize for the length and complexity of this review. I wrote it in a flurry of commits over two days that were not well isolated and churned each other. I don't think it's worth taking them apart now.
The main thrust is providing independent locking for the domains in the zone layer. This improved allocator throughput on a machine with 50 domains by 2x.
This moves all of the per-zone domain logic into zone_fetch_bucket() and zone_put_bucket(). This really simplifies callers and makes it easier to do other restructuring.
It moves the drain into bucket_free() so that callers no longer have to do both.
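As a rough illustration of folding the drain into the free path, here is a minimal userland sketch. The `sketch_` names, the fixed-size item array, and the use of malloc()/free() are all assumptions made for the example; this is not the real bucket layout or the code in the patch.

```c
#include <stdlib.h>

#define	SKETCH_BUCKET_MAX	4

/* Hypothetical bucket: a small array of cached item pointers. */
struct sketch_bucket {
	int	sb_cnt;				/* number of valid entries */
	void	*sb_items[SKETCH_BUCKET_MAX];	/* cached items */
};

/* Release every cached item and return how many were released. */
int
sketch_bucket_drain(struct sketch_bucket *b)
{
	int i, n;

	n = b->sb_cnt;
	for (i = 0; i < n; i++) {
		free(b->sb_items[i]);
		b->sb_items[i] = NULL;
	}
	b->sb_cnt = 0;
	return (n);
}

/*
 * Free a bucket.  The drain happens here, so callers do not need a
 * separate drain call before freeing.
 */
int
sketch_bucket_free(struct sketch_bucket *b)
{
	int n;

	n = sketch_bucket_drain(b);
	free(b);
	return (n);
}
```

With this shape a caller holding a possibly non-empty bucket makes a single call, and the free path cannot be reached with items still cached.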
It attempts to avoid taking the lock on the alloc path if we don't think we'll succeed.
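The general pattern can be sketched in userland as an unlocked check of a counter before taking the per-domain lock. Everything here is hypothetical: the `sketch_` names, the pthread mutex, and the C11 atomic counter stand in for whatever the patch actually uses, and the real check may look at different state.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define	SKETCH_DOMAIN_MAX	8

/* Hypothetical per-domain cache with its own lock. */
struct sketch_domain {
	pthread_mutex_t	sd_lock;	/* protects sd_items */
	atomic_int	sd_nitems;	/* written under sd_lock, read as a hint */
	void		*sd_items[SKETCH_DOMAIN_MAX];
};

void
sketch_domain_init(struct sketch_domain *sd)
{
	pthread_mutex_init(&sd->sd_lock, NULL);
	atomic_init(&sd->sd_nitems, 0);
}

/* Cache an item; returns 1 on success, 0 if the domain is full. */
int
sketch_put(struct sketch_domain *sd, void *item)
{
	int n, ok;

	ok = 0;
	pthread_mutex_lock(&sd->sd_lock);
	n = atomic_load_explicit(&sd->sd_nitems, memory_order_relaxed);
	if (n < SKETCH_DOMAIN_MAX) {
		sd->sd_items[n] = item;
		atomic_store_explicit(&sd->sd_nitems, n + 1,
		    memory_order_relaxed);
		ok = 1;
	}
	pthread_mutex_unlock(&sd->sd_lock);
	return (ok);
}

/* Fetch a cached item, or NULL if none is available. */
void *
sketch_fetch(struct sketch_domain *sd)
{
	void *item;
	int n;

	/*
	 * Unlocked hint: if the cache looks empty, skip the lock.  The
	 * value may be stale, so it is rechecked under the lock below.
	 */
	if (atomic_load_explicit(&sd->sd_nitems, memory_order_relaxed) == 0)
		return (NULL);

	item = NULL;
	pthread_mutex_lock(&sd->sd_lock);
	n = atomic_load_explicit(&sd->sd_nitems, memory_order_relaxed);
	if (n > 0) {
		item = sd->sd_items[n - 1];
		atomic_store_explicit(&sd->sd_nitems, n - 1,
		    memory_order_relaxed);
	}
	pthread_mutex_unlock(&sd->sd_lock);
	return (item);
}
```

The trade-off is that a stale hint can cause an occasional spurious miss, which the caller handles the same way as a genuine miss, in exchange for not contending on the lock when the cache is probably empty.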
I restructured the zone and moved fields around where I thought it was appropriate and to reduce gaps.
I now calculate the zone domain address rather than fetching a pointer.
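One plausible reading of this, sketched under assumptions: if the per-domain structures sit in an array directly after the zone header, their address can be computed from the zone pointer and the domain index with no extra pointer load. The struct layout and the `sketch_` names below are hypothetical and are not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical per-domain state. */
struct sketch_zone_domain {
	int	szd_nitems;
	long	szd_pad[7];
};

/* Hypothetical zone header followed directly by the per-domain array. */
struct sketch_zone {
	int				sz_flags;
	struct sketch_zone_domain	sz_domain[];	/* flexible array member */
};

/*
 * Compute the address of a domain from the zone pointer: base plus the
 * array offset plus index times element size.  Nothing is loaded from
 * memory to find it.
 */
struct sketch_zone_domain *
sketch_zdom(struct sketch_zone *zone, int domain)
{
	char *base;

	base = (char *)zone + offsetof(struct sketch_zone, sz_domain);
	return ((struct sketch_zone_domain *)
	    (base + (size_t)domain * sizeof(struct sketch_zone_domain)));
}
```

This is equivalent to `&zone->sz_domain[domain]`; the explicit arithmetic is only spelled out to show that the result depends on the zone address and the index alone.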