These allow one to non-destructively iterate over the set or clear bits
in a bitset. The motivation is that we have several code fragments
which iterate over a CPU set like this:
```
while ((cpu = CPU_FFS(&cpus)) != 0) {
cpu--;
CPU_CLR(cpu, &cpus);
<do something>;
}
```
This is slow because CPU_FFS begins the search at the beginning of the
bitset each time. On amd64 and arm64, CPU sets have size 256, so there
are four legs in the bitset and we do a lot of unnecessary scanning.
A second problem is that this is destructive, so code which needs to
preserve the original set has to make a copy. In particular, we have
quite a few functions which take a cpuset_t parameter by value, meaning
that each call has to copy the 32 byte cpuset_t.
An alternate implementation would be something like this:
```
for (i = 0; i < BITSET_SIZE; i++)
if (BIT_ISSET(i, bitset))
<do something>;
```
Some benchmarking on a new-ish i7 CPU and on a Haswell Xeon indicates
that this alternate implementation is much slower, though. I tried
several microbenchmarks which iterate over a set and store a value to
memory in the loop body. When the set is zero-filled, BIT_FOREACH_SET
is at least an order of magnitude faster; when it is all-ones
BIT_FOREACH_SET is about twice as fast; when it is filled with random
bytes BIT_FOREACH_SET is 5-6x faster. The destructive loop is even
worse than the other two on this hardware.
I get similar results on an EC2 Graviton (arm64) instance, but the
differences are even more pronounced.