Page MenuHomeFreeBSD

grep(1): optimize -w/--word-regexp word boundary check
ClosedPublic

Authored by bapt on Wed, Jun 10, 2:54 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jun 26, 6:09 AM
Unknown Object (File)
Fri, Jun 26, 5:49 AM
Unknown Object (File)
Fri, Jun 26, 2:24 AM
Unknown Object (File)
Thu, Jun 25, 8:55 PM
Unknown Object (File)
Wed, Jun 24, 11:46 PM
Unknown Object (File)
Wed, Jun 24, 7:23 PM
Unknown Object (File)
Wed, Jun 24, 6:23 AM
Unknown Object (File)
Tue, Jun 23, 3:10 PM
Subscribers

Details

Summary

The -w option checks word boundaries before and after each potential
match by decoding the adjacent character. This was done via the
heavyweight sscanf(3) with "%lc", which goes through the full scanf
parser and locale-aware mbrtowc(3) machinery even for simple ASCII.

Replace with a three-tier fast path:

  1. ASCII bytes (< 0x80): simple isalnum(3) / '_' comparison
  2. UTF-8 continuation bytes (0x80-0xBF): interior bytes of a multi-byte character are always word characters -> no further decoding needed
  3. Multi-byte start bytes (>= 0xC0): decode with mbrtowc(3) directly instead of sscanf(3)/%lc, avoiding scanf parser overhead

Benchmark with ministat(1) (10 runs each):

Worst-case ASCII (100k lines of 100 'a' chars, -w 'a'):

Difference at 95.0% confidence: -15.3% +/- 3.1%

Worst-case Unicode (50k lines of 100 accented 'e', -w 'e'):

Difference at 95.0% confidence: -11.2% +/- 4.7%

Normal -w (500k lines, -w 'the'):

Difference at 95.0% confidence: -18.1% +/- 3.6%

French text (100k lines, -w accented 'ete'):

Difference at 95.0% confidence: -18.0% +/- 4.1%

Non -w case shows no regression.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable