Collapse three loops over the vm_page array into one, following a
suggestion from alc@. Testing on EC2 and a few desktop CPUs (Intel and
AMD) shows a significant improvement in initialization time.
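
For illustration only, the general shape of the loop collapse can be
sketched in userland C. The struct page type, its fields, SKETCH_PAGE_SHIFT
and both function names are hypothetical stand-ins; this is not the actual
vm_page code or the patch itself, only an assumed minimal model of it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct vm_page; fields are illustrative only. */
struct page {
	unsigned long phys_addr;
	int flags;
	char pad[48];
};

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KB pages */

/* Three passes: the whole array is walked three times. */
static void
init_three_passes(struct page *pages, size_t n, unsigned long base)
{
	size_t i;

	assert(pages != NULL);
	for (i = 0; i < n; i++)
		memset(&pages[i], 0, sizeof(pages[i]));
	for (i = 0; i < n; i++)
		pages[i].phys_addr = base +
		    ((unsigned long)i << SKETCH_PAGE_SHIFT);
	for (i = 0; i < n; i++)
		pages[i].flags = 1;
}

/* One pass: each element is fully initialized on a single visit. */
static void
init_one_pass(struct page *pages, size_t n, unsigned long base)
{
	size_t i;

	assert(pages != NULL);
	for (i = 0; i < n; i++) {
		memset(&pages[i], 0, sizeof(pages[i]));
		pages[i].phys_addr = base +
		    ((unsigned long)i << SKETCH_PAGE_SHIFT);
		pages[i].flags = 1;
	}
}
```

Both functions leave the array in the same state. The single-pass form
touches each element only once, which is the presumed source of the
speedup when the array is much larger than the caches.
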
One exception is pig1 in the netperf cluster, which contains Westmere-EX
CPUs. This patch makes vm_page_array initialization about 10% slower
there. I haven't yet figured out how to address that. Some
experimentation showed that zeroing vm_pages in strides smaller than
the L1 cache size gives a good improvement, but the effect depends on
the alignment of the strides: my initial patch didn't align them, and
when I aligned them to the cache line size, initialization slowed down
again.
I managed to reproduce what I think is the same problem using a userland
program, but the issue is difficult to analyze since many of the PMCs
I've tried to use don't appear to work (pmcstat returns EINVAL upon
attempting to enable them). I haven't yet dug into this further.
However, since the patch is straightforward and gives a good
improvement for cperciva's case, I'd like to propose its inclusion now.