We want to allocate scheduler state dynamically so that it can be placed in the correct NUMA domain. This also eliminates a global, statically sized MAXCPU state array.
I added a per-cpu variable that points at the dpcpu-allocated memory to keep instruction bloat down. This actually makes TDQ_SELF() more efficient than it was before, but TDQ_CPU() is less efficient.
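
A rough user-space sketch of the layout described above, for illustration only. Every name here (tdq_sketch, pcpu_sketch, pc_tdq, sketch_curcpu, SKETCH_NCPU) is hypothetical and is not the kernel's actual structure or macro; calloc() stands in for a domain-aware kernel allocator, and the current CPU is simulated with a plain variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_NCPU 4	/* hypothetical; a kernel learns the CPU count at boot */

/* Hypothetical stand-in for the per-CPU scheduler state. */
struct tdq_sketch {
	int cpu_id;
	int load;
};

/* Hypothetical stand-in for a per-CPU data area holding a cached pointer. */
struct pcpu_sketch {
	struct tdq_sketch *pc_tdq;
};

struct pcpu_sketch pcpu_area[SKETCH_NCPU];
int sketch_curcpu;	/* simulated "current CPU" index */

/*
 * Allocate each CPU's state separately instead of using one static array.
 * A kernel would request memory from that CPU's NUMA domain at this point;
 * calloc() is only a placeholder for that step.
 */
int
tdq_sketch_setup(void)
{
	for (int i = 0; i < SKETCH_NCPU; i++) {
		struct tdq_sketch *t = calloc(1, sizeof(*t));

		if (t == NULL)
			return (-1);
		t->cpu_id = i;
		pcpu_area[i].pc_tdq = t;
	}
	return (0);
}

/* Analogue of a "self" lookup: follow the current CPU's cached pointer. */
struct tdq_sketch *
tdq_sketch_self(void)
{
	return (pcpu_area[sketch_curcpu].pc_tdq);
}

/* Analogue of a lookup by CPU id: find that CPU's area, then its pointer. */
struct tdq_sketch *
tdq_sketch_cpu(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NCPU);
	return (pcpu_area[cpu].pc_tdq);
}
```

In this simulation both lookups cost the same, so it only shows the shape of the data, not the instruction counts. The efficiency difference noted above depends on how the kernel reaches its own per-cpu area compared with another CPU's, which this sketch does not model.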