As alc pointed out, Clang does not generate optimal code for popcnt_pc_map_elem_pq().
http://docs.freebsd.org/cgi/mid.cgi?552BFEB2.8040407
I tried many implementations but could not get optimal code from both Clang and GCC. This patch simply implements the function in inline assembly, without a loop or an unrolled loop. In addition, the newly added bit_count() is used instead of bitcount64() for the non-POPCNT case.
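
For readers unfamiliar with the approach, here is a minimal userland sketch of the idea, not the code in this patch. It assumes a three-word bitmap and x86-64 with GCC or Clang. The names popcount_map3(), popcount_map3_asm() and popcount64_fallback() are hypothetical, the SWAR routine merely stands in for the non-POPCNT path, and __builtin_cpu_supports() stands in for whatever CPU feature check the kernel performs.

```c
#include <stdint.h>

/*
 * Stand-in for the non-POPCNT path (hypothetical helper): classic SWAR
 * population count of one 64-bit word.
 */
static inline int
popcount64_fallback(uint64_t x)
{
	x = x - ((x >> 1) & 0x5555555555555555ULL);
	x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
	x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
	return ((int)((x * 0x0101010101010101ULL) >> 56));
}

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
/*
 * Straight-line POPCNT over an assumed three-word bitmap.  Because the
 * sequence is written out by hand, neither compiler gets to choose
 * between a loop and an unrolled loop.  The outputs are early-clobber
 * ("&") because they are written before all inputs have been read.
 */
static inline int
popcount_map3_asm(const uint64_t *map)
{
	uint64_t r0, r1, r2;

	__asm__("popcntq %3, %0\n\t"
		"popcntq %4, %1\n\t"
		"popcntq %5, %2\n\t"
		"addq %1, %0\n\t"
		"addq %2, %0"
		: "=&r" (r0), "=&r" (r1), "=&r" (r2)
		: "m" (map[0]), "m" (map[1]), "m" (map[2])
		: "cc");
	return ((int)r0);
}
#endif

/* Dispatch on CPU support; fall back to the portable count otherwise. */
static int
popcount_map3(const uint64_t *map)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	if (__builtin_cpu_supports("popcnt"))
		return (popcount_map3_asm(map));
#endif
	return (popcount64_fallback(map[0]) + popcount64_fallback(map[1]) +
	    popcount64_fallback(map[2]));
}
```

Comparing the output of `cc -O2 -S` from both compilers for popcount_map3_asm() is one way to check that the instruction sequence no longer depends on the compiler.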