Page MenuHomeFreeBSD

grep(1): optimize -w/--word-regexp word boundary check
Needs ReviewPublic

Authored by bapt on Wed, Jun 10, 2:54 PM.
Tags
None
Referenced Files
F159386082: D57528.diff
Sat, Jun 13, 1:36 PM
Unknown Object (File)
Thu, Jun 11, 4:36 AM
Subscribers

Details

Reviewers
kevans
Summary

The -w option checks word boundaries before and after each potential
match by decoding the adjacent character. This was done via the
heavyweight sscanf(3) with "%lc", which goes through the full scanf
parser and locale-aware mbrtowc(3) machinery even for simple ASCII.

Replace with a three-tier fast path:

  1. ASCII bytes (< 0x80): simple isalnum(3) / '_' comparison
  2. UTF-8 continuation bytes (0x80-0xBF): interior bytes of a multi-byte character are always word characters -> no further decoding needed
  3. Multi-byte start bytes (>= 0xC0): decode with mbrtowc(3) directly instead of sscanf(3)/%lc, avoiding scanf parser overhead

Benchmark with ministat(1) (10 runs each):

Worst-case ASCII (100k lines of 100 'a' chars, -w 'a'):

Difference at 95.0% confidence: -15.3% +/- 3.1%

Worst-case Unicode (50k lines of 100 accented 'e', -w 'e'):

Difference at 95.0% confidence: -11.2% +/- 4.7%

Normal -w (500k lines, -w 'the'):

Difference at 95.0% confidence: -18.1% +/- 3.6%

French text (100k lines, -w accented 'ete'):

Difference at 95.0% confidence: -18.0% +/- 4.1%

Non -w case shows no regression.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 73806
Build 70689: arc lint + arc unit