grep(1): optimize -w/--word-regexp word boundary check
Closed, Public

Authored by bapt on Jun 10 2026, 2:54 PM.

Details

Reviewers

Commits

rGffe47c424e0a: grep: periodic timer-based fflush instead of unconditional per-line flush

Summary

The -w option checks word boundaries before and after each potential
match by decoding the adjacent character. This was done via the
heavyweight sscanf(3) with "%lc", which goes through the full scanf
parser and locale-aware mbrtowc(3) machinery even for simple ASCII.
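
For illustration, a decode of one adjacent character via sscanf(3) and "%lc" might look roughly like the sketch below. The helper name is_word_char_sscanf and its exact shape are assumptions made for this example, not code taken from util.c; it also assumes the input is NUL-terminated.

```c
#include <stdbool.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (not from util.c): decode the character starting at
 * p with sscanf(3) and report whether it is a word character.
 * Every call runs the scanf format parser and the locale's multi-byte
 * conversion, even when p[0] is plain ASCII.
 */
static bool
is_word_char_sscanf(const char *p)
{
	wchar_t wc = L' ';

	/* "%lc" reads one (possibly multi-byte) character without skipping whitespace. */
	if (sscanf(p, "%lc", &wc) != 1)
		return (false);
	return (iswalnum((wint_t)wc) || wc == L'_');
}
```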

Replace this with a three-tier fast path:

- ASCII bytes (< 0x80): a simple isalnum(3) / '_' comparison.
- UTF-8 continuation bytes (0x80-0xBF): interior bytes of a multi-byte character are always treated as word characters, so no further decoding is needed.
- Multi-byte start bytes (>= 0xC0): decode with mbrtowc(3) directly instead of sscanf(3)/%lc, avoiding the scanf parser overhead.
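
The three tiers above can be sketched as a single byte-classifying helper. This is a minimal illustration under stated assumptions, not the committed code: the name is_word_byte, its signature, the choice to treat a failed decode as a non-word character, and the assumption of a UTF-8 locale are all hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper (name and signature are assumptions): report whether
 * the byte at p[0] belongs to a word character. len is the number of bytes
 * available starting at p. Assumes UTF-8 encoded input.
 */
static bool
is_word_byte(const char *p, size_t len)
{
	unsigned char c;
	mbstate_t mbs;
	wchar_t wc;
	size_t n;

	if (len == 0)
		return (false);
	c = (unsigned char)p[0];

	/* Tier 1: ASCII, no multi-byte decoding at all. */
	if (c < 0x80)
		return (isalnum(c) || c == '_');

	/* Tier 2: UTF-8 continuation byte, treated as a word character. */
	if (c < 0xC0)
		return (true);

	/* Tier 3: multi-byte start byte, decode with mbrtowc(3) directly. */
	memset(&mbs, 0, sizeof(mbs));
	n = mbrtowc(&wc, p, len, &mbs);
	if (n == (size_t)-1 || n == (size_t)-2)
		return (false);	/* assumption: invalid or truncated input is not a word character */
	return (iswalnum((wint_t)wc) || wc == L'_');
}
```

A caller would pass the byte just before the match start and the byte just after the match end; only the third tier depends on the current locale.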

Benchmark with ministat(1) (10 runs each):

Worst-case ASCII (100k lines of 100 'a' chars, -w 'a'):

Difference at 95.0% confidence: -15.3% +/- 3.1%

Worst-case Unicode (50k lines of 100 accented 'e', -w 'e'):

Difference at 95.0% confidence: -11.2% +/- 4.7%

Normal -w (500k lines, -w 'the'):

Difference at 95.0% confidence: -18.1% +/- 3.6%

French text (100k lines, -w accented 'ete'):

Difference at 95.0% confidence: -18.0% +/- 4.1%

The non -w case shows no regression.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

bapt created this revision. Jun 10 2026, 2:54 PM

Herald added subscribers: kevans, imp. Jun 10 2026, 2:54 PM

bapt requested review of this revision. Jun 10 2026, 2:54 PM

Harbormaster completed remote builds in B73806: Diff 179574. Jun 10 2026, 2:54 PM

bapt added a reviewer: kevans. Jun 10 2026, 2:54 PM

This revision was not accepted when it landed; it landed in state Needs Review. Jun 14 2026, 2:27 PM

Closed by commit rGffe47c424e0a: grep: periodic timer-based fflush instead of unconditional per-line flush (authored by bapt).

This revision was automatically updated to reflect the committed changes.

bapt added a commit: rGffe47c424e0a: grep: periodic timer-based fflush instead of unconditional per-line flush.

Revision Contents
Changeset List

Path                   Size
usr.bin/grep/util.c    33 lines

Diff 179734
