Page MenuHomeFreeBSD

regex(3): Consistently handle invalid {} constructs between BREs and EREs
ClosedPublic

Authored by kevans on Apr 8 2017, 4:53 AM.

Details

Summary

As of right now, invalid {} constructs are not handled consistently between BREs and EREs. Observe:

$ echo "a{1,2,3}b" | sed -r "s/{/_/"
# Works!
$ echo "a{1,2,3}b" | sed "s/\{/_/"
# Throws an error - correct behavior

$ echo "a{1,2,3}b" | sed -r "s/}/_/"
# Works!
$ echo "a{1,2,3}b" | sed "s/\}/_/"
# Throws an error! - wrong behavior

$ echo "a{1,2,3}b" | sed -r "s/{}/_/"
# Works!
$ echo "a{1,2,3}b" | sed "s/\{\}/_/"
# Throws an error! - correct behavior

Where correct behavior/wrong behavior is based on:

  • Strict interpretation of re_format(7)
  • How modern GNU regex(3) operates, not counting some GNU-specific weirdness I found (such as a{,} being valid) that will be ignored for the time being.
  • Which behavior seemed to be more reasonable in terms of 'this could go really badly if someone wrote these expecting valid repetitions'

While we're here, adjust tests to meet future expectations:

  • { should never be interpreted literally in EREs, invalidating ERE expressions like {,2}, {,}, and {b} because they relied on the literal interpretation of { in EREs.
  • Add a couple of cases to ensure that \{ and \} are handled similarly in BREs to { and } in EREs so that we're actually consistent going forward

PR: 166861

Test Plan

Run test suite to ensure no regressions in any other tests than the ones adjusted

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kevans retitled this revision from regex(3): Consistently handle {} constructs between BREs and EREs to regex(3): Consistently handle invalid {} constructs between BREs and EREs.Apr 8 2017, 6:02 AM
$ echo "a{1,2,3}b" | sed -r "s/{}/_/"
# Works!

I'm not sure I follow why this is correct.

This also should be given an exp- run to make sure we don't have ports that expect broken behavior.

In D10315#213651, @ngie wrote:
$ echo "a{1,2,3}b" | sed -r "s/{}/_/"
# Works!

I'm not sure I follow why this is correct.

It's not, as noted in the second part of that example. =) On second thought, I should have clearly denoted that as incorrectly, my apologies.

To be perfectly explicit, the *new* behavior:

$ echo "a{1,2,3}b" | sed -r "s/{/_/"
# Throws an error (previously worked)
$ echo "a{1,2,3}b" | sed "s/\{/_/"
# Throws an error
l
$ echo "a{1,2,3}b" | sed -r "s/}/_/"
# Works!
$ echo "a{1,2,3}b" | sed "s/\}/_/"
# Works! (previously threw an error)

$ echo "a{1,2,3}b" | sed -r "s/{}/_/"
# Throws an error! (previously worked)
$ echo "a{1,2,3}b" | sed "s/\{\}/_/"
# Throws an error!
In D10315#213652, @ngie wrote:

This also should be given an exp- run to make sure we don't have ports that expect broken behavior.

Am I able to request an exp-run, or is this something a committer should do on my/this review's behalf?

pfg added a subscriber: pfg.

The change looks good to me but given the history I've had with regex and sed, I do feel an exp-run is needed.

By request of @pfg

root@ghost:/usr/src/usr.bin/sed/tests# make check
sed2_test:inplace_hardlink_src  ->  passed  [0.071s]
sed2_test:inplace_symlink_src  ->  passed  [0.043s]
sed_test:c2048  ->  passed  [0.045s]
sed_test:emptybackref  ->  passed  [0.054s]
sed_test:longlines  ->  passed  [0.061s]
sed_test:preserve_leading_ws_ia  ->  passed  [0.040s]
sed_test:rangeselection  ->  passed  [0.204s]
legacy_test:main  ->  passed  [0.657s]
multi_test:main  ->  passed  [1.402s]
inplace_race_test:main  ->  passed  [3.772s]

Results file id is usr_tests_usr.bin_sed.20170408-150131-450096
Results saved to /root/.kyua/store/results.usr_tests_usr.bin_sed.20170

10/10 passed (0 failed)
In D10315#213703, @pfg wrote:

The change looks good to me but given the history I've had with regex and sed, I do feel an exp-run is needed.

Alright then, I'll go ahead and request a exp-run with this patch (minus the test bits?) and respond here with the associated PR.

In D10315#213705, @kevans91_ksu.edu wrote:

By request of @pfg

root@ghost:/usr/src/usr.bin/sed/tests# make check
sed2_test:inplace_hardlink_src  ->  passed  [0.071s]
sed2_test:inplace_symlink_src  ->  passed  [0.043s]
sed_test:c2048  ->  passed  [0.045s]
sed_test:emptybackref  ->  passed  [0.054s]
sed_test:longlines  ->  passed  [0.061s]
sed_test:preserve_leading_ws_ia  ->  passed  [0.040s]
sed_test:rangeselection  ->  passed  [0.204s]
legacy_test:main  ->  passed  [0.657s]
multi_test:main  ->  passed  [1.402s]
inplace_race_test:main  ->  passed  [3.772s]

Results file id is usr_tests_usr.bin_sed.20170408-150131-450096
Results saved to /root/.kyua/store/results.usr_tests_usr.bin_sed.20170

10/10 passed (0 failed)

Actually... I meant the GNU sed port tests: we can't import them but our sed in base passes the GNU tests.

In D10315#213703, @pfg wrote:

The change looks good to me but given the history I've had with regex and sed, I do feel an exp-run is needed.

Alright then, I'll go ahead and request a exp-run with this patch (minus the test bits?) and respond here with the associated PR.

Send me the patch in private and I'll request the exp-run.

For the record...

exp-run is here:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=218495
(Some changes required in the ports framework/Makefiles)

$ echo "a{1,2,3}b" | sed "s/\{/_/"
# Throws an error

Shouldn't this be OK?

$ echo "a{1,2,3}b" | sed "s/\{/_/"
# Throws an error

Shouldn't this be OK?

Indeed, I revised the original review after a comment by ngie@ to clarify which behaviors are correct and incorrect. This one was the correct interpretation, and EREs were not throwing an error as expected.

exp-run issues (from PR 218495) have been resolved now.

This revision is now accepted and ready to land.May 25 2017, 8:33 PM
contrib/netbsd-tests/lib/libc/regex/data/repet_bounded.in
5 ↗(On Diff #27220)

Should this fail with BREs, but work with EREs (because it's escaped)? - seems to suggest that it's a placeholder... not imply EREs?

9 ↗(On Diff #27220)

Same question as above.

36 ↗(On Diff #27220)

Why is {,2} invalid -- shouldn't it be synonymous with "up to 2 repeated chars" in this case?

contrib/netbsd-tests/lib/libc/regex/data/repet_bounded.in
5 ↗(On Diff #27220)

Indeed. - is a placeholder/no flags, and matching it as an ERE-only is the default. b switches it to BRE-only, and & is "both BRE and ERE". So this line means "As an ERE, \{ matches {"

9 ↗(On Diff #27220)

Ditto

36 ↗(On Diff #27220)

I agree with you here, but unfortunately POSIX does not agree with either of us. See [1] and the grammar [2] -- a lower bound is actually required by POSIX standards. =( GNU extensions relax this requirement, though, and thus makes valid both a{,2} and a{,} -- the latter clearly being equivalent to a*, but is still something we'll have to take into account when GNU extensions come along.

[1] http://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd_chap09.html#tag_09_04_06
[2] http://pubs.opengroup.org/onlinepubs/009696899/basedefs/xbd_chap09.html#tag_09_05_03

Ping @emaste, anything else on this one?

Ping @emaste, anything else on this one?

Well, since this change points out issues that the historic regex never did, I think it is convenient to avoid MFC'ing this change to -stable to avoid surprises for old-time users.
I'd also suggest committing the other regex cleanup first.

I have noticed you are now a committer (congratulations, punishment well deserved), so you can do the honors (mentor approval pending of course).

LGTM

contrib/netbsd-tests/lib/libc/regex/data/repet_bounded.in
36 ↗(On Diff #27220)

"Someone" should push for a fixed specification with the Austin Group here...

@ngie your concerns were answered?

Yes -- thank you for following up.

This revision was automatically updated to reflect the committed changes.

This should have been MFC-ed into 11. Now `sed``` behavior is different across systems.
Some common utilities, like `poudriere`` make expensive use of `sed``, and it works differently.

In D10315#415052, @yuri wrote:

This should have been MFC-ed into 11. Now `sed``` behavior is different across systems.
Some common utilities, like `poudriere`` make expensive use of `sed``, and it works differently.

I respectfully disagree- this change was potentially disruptive and better to be left as a thing that consumers of the base system have to trip over across major upgrades rather than a 'minor' upgrade.

That said, it's fundamentally broken according to re_format(7) -- I'm re-evaluating it and resolving some of my own inner conflict. This is all UB territory by POSIX standards.