It was showing up on a name resolution-heavy benchmark (4% as reported by pmcstat, 0.7% after the patch).
The main reason was the stac/clac combo around each byte copy.
The patch reorders labels to remove a common-case forward jump (the nosmap routine still suffers from it) and enlarges the smap-disabled region.
Note that the current byte copy loop is still slow and may be optimized later.
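
For illustration only, the effect of enlarging the smap-disabled region can be sketched in C. This is not the actual assembly routine: user_access_begin() and user_access_end() are hypothetical stand-ins for stac and clac (the real instructions are privileged and cannot run in userland), and the counter exists only to make the number of toggles visible.

```c
#include <stddef.h>

/*
 * Hypothetical stand-ins for stac/clac. They only count how often the
 * user-access window is toggled; they do not touch any CPU state.
 */
static unsigned long smap_toggles;

static void user_access_begin(void) { smap_toggles++; }
static void user_access_end(void) { smap_toggles++; }

/*
 * Window opened and closed around every byte copied.
 * Returns bytes copied including the NUL, or 0 if none was found in max bytes.
 */
static size_t
copystr_per_byte(char *dst, const char *src, size_t max)
{
	size_t i;
	char c;

	for (i = 0; i < max; i++) {
		user_access_begin();
		c = src[i];
		user_access_end();
		dst[i] = c;
		if (c == '\0')
			return (i + 1);
	}
	return (0);
}

/*
 * Window opened once before the loop and closed once after it.
 * Same return convention as above.
 */
static size_t
copystr_hoisted(char *dst, const char *src, size_t max)
{
	size_t i, copied = 0;

	user_access_begin();
	for (i = 0; i < max; i++) {
		dst[i] = src[i];
		if (src[i] == '\0') {
			copied = i + 1;
			break;
		}
	}
	user_access_end();
	return (copied);
}
```

In this sketch the per-byte variant toggles the window twice for every byte, while the hoisted variant toggles it twice per call regardless of string length.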