This eliminates the following:
- multiplication from zpcpu_get. zpcpu_get is used all the time, for example by malloc
- subtraction from counter_u64_add on amd64
Changes are simple:
1. instead of recomputing the offset every time for current cpu, store it
2. instead of subtracting __pcpu every time in counter code, store the already subtracted pointer. There is only one way to both alloc and free them, and that place can convert as necessary
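
A rough userland sketch of both changes, for illustration only. This is not the kernel code: `fake_pcpu`, `fake_zone`, `SLOT_SIZE`, `NCPU` and all the `*_old`/`*_new`, `counter_encode` and `counter_decode` helpers are made-up stand-ins. It assumes a flat address space and that the per-CPU structure stride equals the per-CPU allocation stride, which is what the amd64 trick relies on.

```c
#include <stddef.h>
#include <stdint.h>

#define NCPU      4	/* hypothetical CPU count */
#define SLOT_SIZE 64	/* hypothetical per-CPU stride, in bytes */

/* Stand-in for __pcpu[]: one SLOT_SIZE-sized record per CPU. */
union fake_pcpu {
	struct {
		unsigned  cpuid;
		uintptr_t zpcpu_offset;	/* change 1: cpuid * SLOT_SIZE, cached */
	} f;
	char pad[SLOT_SIZE];
};
static union fake_pcpu fake_pcpu[NCPU];

/* Stand-in for a per-CPU zone allocation: one slot per CPU. */
static union { uint64_t val; char pad[SLOT_SIZE]; } fake_zone[NCPU];

/* Compute the per-CPU offset once, at "boot". */
static void
fake_pcpu_init(void)
{
	for (unsigned i = 0; i < NCPU; i++) {
		fake_pcpu[i].f.cpuid = i;
		fake_pcpu[i].f.zpcpu_offset = (uintptr_t)i * SLOT_SIZE;
	}
}

/* Before change 1: multiply on every lookup. */
static void *
zpcpu_get_old(void *base, const union fake_pcpu *pc)
{
	return ((char *)base + (size_t)pc->f.cpuid * SLOT_SIZE);
}

/* After change 1: just add the stored offset. */
static void *
zpcpu_get_new(void *base, const union fake_pcpu *pc)
{
	return ((char *)base + pc->f.zpcpu_offset);
}

/* Before change 2: subtract the pcpu base on every add. */
static void
counter_add_old(uint64_t *c, const union fake_pcpu *pc, uint64_t inc)
{
	uintptr_t off = (uintptr_t)c - (uintptr_t)&fake_pcpu[0];

	*(uint64_t *)((uintptr_t)pc + off) += inc;
}

/* After change 2: alloc converts once and hands out the encoded value. */
static uintptr_t
counter_encode(uint64_t *c)
{
	return ((uintptr_t)c - (uintptr_t)&fake_pcpu[0]);
}

/* Free converts back to the real pointer. */
static uint64_t *
counter_decode(uintptr_t enc)
{
	return ((uint64_t *)(enc + (uintptr_t)&fake_pcpu[0]));
}

/* The hot path no longer subtracts anything. */
static void
counter_add_new(uintptr_t enc, const union fake_pcpu *pc, uint64_t inc)
{
	*(uint64_t *)((uintptr_t)pc + enc) += inc;
}
```

Here `pc` plays the role of the current CPU's pcpu pointer (the %gs base on amd64). Only the alloc and free paths ever see the real pointer, so the conversion lives in one place.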
The add routine should be removed from counter and instead reimplemented as zpcpu_add. Similarly, someone should add zpcpu_set. Perhaps I'll do it later.
I'm not fond of names used for uma, but I think the concept is sound.