Provide workaround for a performance issue with the popcnt instruction
on Intel processors. Clear spurious dependency by explicitely xoring
the destination register of popcnt.
Use bitcount64() instead of re-implementing SWAR locally, for
processors without popcnt instruction.
Reviewed by: jhb
Discussed with: jilles (previous version)
Sponsored by: The FreeBSD Foundation