Eliminate the loop nesting and reimplement the macro, following a
suggestion from rlibby.

It does seem possible to implement the macro with nested loops in a way
that lets "break" work, but the result is kind of ugly and, with gcc -O2
at least, compiles to pessimal code. clang manages to generate slightly
faster code for the nested-loop approach than for this one, but the
difference doesn't seem worth it.

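For illustration only, here is a minimal sketch of the general single-loop idea, not the code in this change. Every name in it (struct demo_bitset, struct demo_iter, demo_advance(), DEMO_FOREACH_ISSET, and the two helpers) is hypothetical, and the fixed four-word set is an assumption made to keep the example short. All iteration state lives in one for statement, so a "break" in the body leaves the whole iteration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical fixed-size bitset used only for this sketch. */
#define DEMO_WORDS	4
#define DEMO_WORD_BITS	(sizeof(unsigned long) * CHAR_BIT)

struct demo_bitset {
	unsigned long bits[DEMO_WORDS];
};

/* Iteration state: current word index and its not-yet-visited bits. */
struct demo_iter {
	size_t word;
	unsigned long pending;
};

/*
 * Store the index of the next set bit in *i and return 1, or return 0
 * once every word has been scanned.
 */
static inline int
demo_advance(const struct demo_bitset *p, struct demo_iter *it, size_t *i)
{
	for (;;) {
		if (it->pending != 0) {
			size_t bit = 0;

			/* Find the lowest set bit, then clear it. */
			while ((it->pending & (1UL << bit)) == 0)
				bit++;
			it->pending &= it->pending - 1;
			*i = it->word * DEMO_WORD_BITS + bit;
			return (1);
		}
		/* word starts at (size_t)-1, so the first step wraps to 0. */
		if (++it->word == DEMO_WORDS)
			return (0);
		it->pending = p->bits[it->word];
	}
}

/* A single for statement, so "break" and "continue" behave as expected. */
#define DEMO_FOREACH_ISSET(i, p)					\
	for (struct demo_iter demo_it_ = { (size_t)-1, 0 };		\
	    demo_advance((p), &demo_it_, &(i)); )

/* Return the first set bit with index >= min, or (size_t)-1 if none. */
static size_t
demo_first_at_least(const struct demo_bitset *p, size_t min)
{
	size_t i, found = (size_t)-1;

	DEMO_FOREACH_ISSET(i, p) {
		if (i >= min) {
			found = i;
			break;	/* only one loop to leave */
		}
	}
	return (found);
}

/* Count the set bits by walking all of them. */
static size_t
demo_count(const struct demo_bitset *p)
{
	size_t i, n = 0;

	DEMO_FOREACH_ISSET(i, p)
		n++;
	return (n);
}
```

The sketch uses a helper function and a C99 loop-scoped declaration rather than compiler extensions. A production macro would likely find the lowest set bit with a primitive such as ffsl() instead of the linear scan shown here.
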
Add some simple regression tests.