The -w option checks word boundaries before and after each potential
match by decoding the adjacent character. This was done via the
heavyweight sscanf(3) with "%lc", which goes through the full scanf
parser and locale-aware mbrtowc(3) machinery even for simple ASCII.
Replace with a three-tier fast path:
- ASCII bytes (< 0x80): simple isalnum(3) / '_' comparison
- UTF-8 continuation bytes (0x80-0xBF): interior bytes of a multi-byte character are always word characters -> no further decoding needed
- Multi-byte start bytes (>= 0xC0): decode with mbrtowc(3) directly instead of sscanf(3)/%lc, avoiding scanf parser overhead
Benchmark with ministat(1) (10 runs each):
Worst-case ASCII (100k lines of 100 'a' chars, -w 'a'):
Difference at 95.0% confidence: -15.3% +/- 3.1%
Worst-case Unicode (50k lines of 100 accented 'e', -w 'e'):
Difference at 95.0% confidence: -11.2% +/- 4.7%
Normal -w (500k lines, -w 'the'):
Difference at 95.0% confidence: -18.1% +/- 3.6%
French text (100k lines, -w accented 'ete'):
Difference at 95.0% confidence: -18.0% +/- 4.1%
Non -w case shows no regression.