Specifically, if we're waking up some value n > BATCH_SIZE, then the copyin(9) is wrong on the second iteration due to upp being the wrong type. upp is currently a uint32_t**, so upp + pos advances it twice as far as it should: each step moves by the host pointer size rather than the compat32 pointer size.
Fix it by just making upp a uint32_t*; it's still technically a double pointer, but the distinction doesn't matter all that much here since we're just doing arithmetic on it.
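To make the stride mismatch concrete, here is a minimal sketch of the arithmetic only, not the kernel code itself. The helper names offset_as_host_ptrs() and offset_as_compat32_ptrs() are hypothetical, and the batch size of 128 is assumed from the test description in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A compat32 process stores each userland pointer in 4 bytes. */
static_assert(sizeof(uint32_t) == 4, "compat32 pointers are 4 bytes");

/*
 * Byte offset reached by `upp + pos` when upp is declared uint32_t **:
 * each element is a host pointer, so the stride is sizeof(uint32_t *).
 * (Hypothetical helper, for illustration only.)
 */
size_t
offset_as_host_ptrs(size_t pos)
{
	return (pos * sizeof(uint32_t *));
}

/*
 * Byte offset reached by `upp + pos` when upp is declared uint32_t *:
 * the stride is 4 bytes, matching the layout of the 32-bit pointer array.
 * (Hypothetical helper, for illustration only.)
 */
size_t
offset_as_compat32_ptrs(size_t pos)
{
	return (pos * sizeof(uint32_t));
}
```

On a host with 8-byte pointers, offset_as_host_ptrs(128) lands 1024 bytes into the array, where the second batch actually starts at the 512 bytes given by offset_as_compat32_ptrs(128).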
Add a test case that demonstrates the problem, placed with the libthr tests since anyone messing with _umtx_op should be running these tests. Running under compat32, the new test case will hang, as threads after the first 128 get missed in the wake. It's not immediately clear how to hit it in practice, since pthread_cond_broadcast() uses a smaller (sleepq batch?) size observed to be around 50 -- I did not spend much time digging into it.
The uintptr_t change makes no functional difference, but I've tossed it in since it's more accurate (semantically).
Reported by: Andrew Gierth