This is a working prototype, or perhaps a Request For Comments. It's implemented for da(4) and ada(4), but should be applicable to other protocols and periphs.
The end result is that CCBs issued via da(4) take ~512B (size of ccb_scsiio) instead of the usual 2kB (size of union ccb, ~1.5kB, rounded up by malloc(9)). We waste less memory, we avoid zeroing the unused 1kB, and it should be easier to allocate those CCBs in low memory conditions.
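To make the arithmetic above concrete, here is a small userspace sketch. It assumes a simplified allocator model in which requests are rounded up to the next power of two; the real malloc(9) bucket sizes may differ. The 512 and 1536 byte figures are just the approximate sizes quoted above, not values taken from the headers, and both function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model (assumption): the allocator rounds each request up
 * to the next power of two. Real malloc(9) buckets may differ.
 */
static size_t
round_up_pow2(size_t n)
{
	size_t p = 1;

	assert(n > 0);
	while (p < n)
		p <<= 1;
	return (p);
}

/*
 * Bytes allocated but never touched by the consumer, given how many
 * bytes it actually uses and how many were requested.
 */
static size_t
wasted_bytes(size_t used, size_t requested)
{
	return (round_up_pow2(requested) - used);
}
```

Under this model, round_up_pow2(1536) is 2048 while round_up_pow2(512) stays at 512, so a consumer that only touches 512 bytes of a 1536 byte request leaves wasted_bytes(512, 1536) = 1536 bytes unused.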
Note that this does not change the size, or the layout, of CCBs as such. CCBs get allocated in various different ways, in particular on the stack, and I don't want to redo all that. Instead, this provides an opt-in mechanism for the periph to declare "my start() callback is fine with receiving a CCB allocated from this UMA zone", and makes dastart() use it. In other words, most of the code works exactly as it used to; the change only happens to IOs issued by xpt_run_allocq(), which is - conveniently - pretty much all that matters for performance. In the case of dastart(), the routine only ever casts the received CCB pointer to ccb_scsiio, so it doesn't require any special changes to make it work; I believe most periphs follow this pattern.
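The opt-in idea can be modelled in userspace roughly as follows. This is only a sketch of the pattern, not the actual CAM code: every name here (model_hdr, model_scsiio, MODEL_PERIPH_SMALL_CCB, model_alloc_ccb, model_start) is hypothetical, the structure sizes are arbitrary, and calloc(3) stands in for the UMA zone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; fields and sizes are illustrative, not CAM's. */
struct model_hdr    { int func_code; size_t alloc_size; };
struct model_scsiio { struct model_hdr hdr; unsigned char cdb[16]; };
struct model_other  { struct model_hdr hdr; unsigned char big[1024]; };
union model_ccb {
	struct model_hdr    hdr;
	struct model_scsiio csio;
	struct model_other  other;
};

/* Hypothetical flag: "my start() only needs the small CCB type". */
#define MODEL_PERIPH_SMALL_CCB	0x1

struct model_periph {
	unsigned flags;
	size_t   small_size;	/* size of the one CCB type start() uses */
};

/*
 * Allocate a zeroed CCB for the periph. Periphs that opted in get the
 * small size; everyone else gets the full union, as before.
 */
static struct model_hdr *
model_alloc_ccb(const struct model_periph *p)
{
	size_t size = sizeof(union model_ccb);
	struct model_hdr *ccb;

	if (p->flags & MODEL_PERIPH_SMALL_CCB)
		size = p->small_size;
	assert(size >= sizeof(struct model_hdr));
	ccb = calloc(1, size);
	if (ccb != NULL)
		ccb->alloc_size = size;
	return (ccb);
}

/* A start() routine that only ever treats the CCB as the small type. */
static void
model_start(struct model_hdr *ccb)
{
	struct model_scsiio *csio = (struct model_scsiio *)ccb;

	csio->cdb[0] = 0x28;	/* READ(10) opcode */
}
```

A periph that sets MODEL_PERIPH_SMALL_CCB with small_size = sizeof(struct model_scsiio) gets the smaller allocation, while one that leaves the flag clear still receives a full sizeof(union model_ccb) buffer. The opt-in is only safe because model_start() never touches anything outside struct model_scsiio.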
The reason for doing it this way is that it's a pretty small, localized change, and can be implemented gradually and iteratively: take a periph, make sure its start() callback only casts the CCBs it takes to a particular type of CCB, add a UMA zone for that size, and declare it safe to XPT. Because it's UMA, there's no alignment overhead, and it makes it possible to use uma_zone_reserve(9) to improve behaviour in low memory conditions even further.
I've considered making the UMA zone internal to periphs, and making xpt_run_allocq() pass NULL to the start() routine instead. The reason I haven't done that is that I don't understand the interaction between xpt_run_allocq(), priorities, and requeueing to the ccb_list.