Page MenuHomeFreeBSD

grep(1): optimize -w/--word-regexp word boundary check
ClosedPublic

Authored by bapt on Sun, Jun 14, 2:36 PM.
Tags
None
Referenced Files
F159758255: D57587.diff
Wed, Jun 17, 10:32 PM
F159758217: D57587.diff
Wed, Jun 17, 10:31 PM
Unknown Object (File)
Mon, Jun 15, 6:11 PM
Subscribers

Details

Summary

The -w option checks word boundaries before and after each potential
match by decoding the adjacent character. This was done via the
heavyweight sscanf(3) with "%lc", which goes through the full scanf
parser and locale-aware mbrtowc(3) machinery even for simple ASCII.

Replace with a three-tier fast path:

  1. ASCII bytes (< 0x80): simple isalnum(3) / '_' comparison
  2. UTF-8 continuation bytes (0x80-0xBF): interior bytes of a multi-byte character are always word characters -> no further decoding needed
  3. Multi-byte start bytes (>= 0xC0): decode with mbrtowc(3) directly instead of sscanf(3)/%lc, avoiding scanf parser overhead

Benchmark with ministat(1) (10 runs each):

Worst-case ASCII (100k lines of 100 'a' chars, -w 'a'):

Difference at 95.0% confidence: -15.3% +/- 3.1%

Worst-case Unicode (50k lines of 100 accented 'e', -w 'e'):

Difference at 95.0% confidence: -11.2% +/- 4.7%

Normal -w (500k lines, -w 'the'):

Difference at 95.0% confidence: -18.1% +/- 3.6%

French text (100k lines, -w accented 'ete'):

Difference at 95.0% confidence: -18.0% +/- 4.1%

Non -w case shows no regression.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

bapt requested review of this revision.Sun, Jun 14, 2:36 PM
usr.bin/grep/util.c
610–631

Since we use this sequence twice, we should probably pull it out into a helper function that takes pc->ln.dat, pc->ln.len and an offset.

bapt marked an inline comment as done.Sun, Jun 14, 5:02 PM
kevans added inline comments.
usr.bin/grep/util.c
519

This could just be a single return of that condition, but I do not insist.

This revision is now accepted and ready to land.Sun, Jun 14, 5:08 PM
This revision was automatically updated to reflect the committed changes.
bapt marked an inline comment as done.