Page MenuHomeFreeBSD

bsdgrep(1): Fix -w -v matching improperly with certain patterns
ClosedPublic

Authored by kevans on Apr 9 2017, 4:27 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Apr 19, 11:13 AM
Unknown Object (File)
Dec 31 2023, 5:29 PM
Unknown Object (File)
Dec 22 2023, 10:06 PM
Unknown Object (File)
Dec 10 2023, 7:20 PM
Unknown Object (File)
Dec 8 2023, 7:37 PM
Unknown Object (File)
Dec 1 2023, 10:02 PM
Unknown Object (File)
Dec 1 2023, 6:16 AM
Unknown Object (File)
Nov 27 2023, 7:31 AM
Subscribers
None

Details

Summary

-w and -v flag matching is mostly functional, but with some minor problems. In this case, there were two problems:

1.) -w flag processing only allowed one iteration through pattern matching on a line. This is problematic if one pattern could match more than once, or if there were multiple patterns and the earliest/longest match was not the most ideal, and
2.) I "fixed" things to not further process a line if the first iteration through patterns didn't produce any matches. This is clearly wrong if we're dealing with the more restrictive -w matching.

#2 breakage could have also occurred before my broad rewrites, but it would be more arbitrary based on input patterns as to whether or not it actually affected things.

Ultimately, we can fix both of these by forcing a retry of the patterns after advancing just past the start of the first match if we're doing more restrictive -w matching and we didn't get any hits to start with. We can also go ahead and move -v flag processing outside of the loop it's in so that we have as much chance as possible to match in the more restrictive cases. This wasn't strictly wrong, but it could be a little more error prone.

While here, I introduced some regressions tests for this behavior and went ahead and fixed some excessive wrapping nearby that actually hindered my ability to parse through these if statements. gnugrep(1) does continue to pass these new tests.

PR: 218467, 218811

Test Plan

Ensure no regressions in Kyua tests, to include newly introduced tests

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

usr.bin/grep/util.c
319–322 ↗(On Diff #27250)

These changes are unrelated, right? If so, please leave them out of this patch for now.

324 ↗(On Diff #27250)

Ditto

327 ↗(On Diff #27250)

There's no corresponding check for whole word matching. Seems like we'll retry in any case?

368 ↗(On Diff #27250)

style(9) nit: no space between st and ++.

usr.bin/grep/util.c
319–322 ↗(On Diff #27250)

Alright, will revise in the morning.

327 ↗(On Diff #27250)

Right, so I didn't want to limit retry to just the wflag case, and there is only one path to set retry at this point- through the wflag. It's entirely possible that I might later find a reason to allow retry in other paths, such as -x flag (for some reason).

LGTM modulo nits which you'll fix.

This revision is now accepted and ready to land.Apr 9 2017, 5:47 AM
kevans edited edge metadata.
  • Remove unrelated style changes, fix style(9) nit
This revision now requires review to proceed.Apr 9 2017, 1:36 PM
This revision is now accepted and ready to land.Apr 9 2017, 3:35 PM
kevans edited edge metadata.

Rebase to account for recent commits, stop polluting repo with trivial test inputs that can be generated on the fly

This revision now requires review to proceed.Apr 17 2017, 4:54 PM
usr.bin/grep/util.c
355–357 ↗(On Diff #27487)

overly-long lines

usr.bin/grep/util.c
355–357 ↗(On Diff #27487)

Ah, drats. I forgot to take into account tab width.

  • Re-wrap long lines, take into account tab widths
  • Slight optimization; advance to the end of the earliest match rather than one character at a time
  • Wrapping, use start of match instead of end of match due to bug found and add test case

Alright, sorry, I think I'm done tweaking this. Advancing one character at a time could've been pretty bad for long line lengths and one or two patterns matching not at the beginning of the string.

Advancing to just part the start of the shortest match is pretty safe and can reduce time spent here by quite a bit.

I've also added a test case for a pattern that can match in whole or subset, but only subset matches -w, to make sure we test the case that I tried to break in one of these middle revisions.

This revision is now accepted and ready to land.Apr 18 2017, 4:49 PM

Also note that the -w flag in general could probably be optimized by actually embedding pattern in :<: and [[:>::] if !(cflags&REG_NOSPEC); we could then avoid doing the after-match check for word boundaries in all but the literal (fgrep/-F) case, and all matches by regexec(3) could be considered in !REG_NOSPEC case.

ngie added inline comments.
usr.bin/grep/util.c
362–363 ↗(On Diff #27525)

I'm not 100% caught up with regcomp and friends, but can .rm_so ever be -1 here?

usr.bin/grep/util.c
362–363 ↗(On Diff #27525)

Should not be possible; we evacuate further up if we didn't actually match. We've requested exactly one match and we matched, so that match should be filled to a valid substring of the subject.

Ping @emaste, anything else on this one? =)

This revision was automatically updated to reflect the committed changes.