Page MenuHomeFreeBSD

regex(3): Interpret escaped ordinary characters as EESCAPE
ClosedPublic

Authored by kevans on Apr 26 2017, 3:56 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Mar 25, 9:41 PM
Unknown Object (File)
Feb 12 2024, 1:55 AM
Unknown Object (File)
Jan 28 2024, 8:06 AM
Unknown Object (File)
Jan 28 2024, 8:06 AM
Unknown Object (File)
Jan 28 2024, 8:06 AM
Unknown Object (File)
Jan 28 2024, 8:05 AM
Unknown Object (File)
Jan 28 2024, 8:05 AM
Unknown Object (File)
Jan 27 2024, 8:03 AM
Subscribers

Details

Summary

In IEEE 1003.1-2008 [1] and earlier revisions, BRE/ERE grammar allows for
any character to be escaped, but "ORD_CHAR preceded by an unescaped <backslash> character [gives undefined results]".

Historically, we've interpreted an escaped ordinary character as the ordinary character
itself. This becomes problematic when some extensions give special meanings to an
otherwise ordinary character (e.g. GNU's \b, \s, \w), meaning we may have two different
valid interpretations of the same sequence.

To make this easier to deal with and given that the standard calls this undefined,
we should throw an error (EESCAPE) if we run into this scenario to ease transition
into a state where some escaped ordinaries are blessed with a special meaning --
it will either error out or have extended behavior, rather than have two entirely
different versions of undefined behavior that leave the consumer of regex(3) guessing
as to what behavior will be used or leaving them with false impressions.

This is on the tail end of D10315, which should hopefully be wrapping up soon.

[1] http://pubs.opengroup.org/onlinepubs/9699919799.2016edition/

Test Plan

Run sed tests, gsed tests, bsdgrep tests, exp-run

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 8918
Build 9306: arc lint + arc unit

Event Timeline

This is not really an objection but my preference for undefined behavior is to leave it unchanged for as long as we can.
Even if it's undefined behavior, a lot fo things can break,and until the GNU compatible behavior is implemented such "transitional" change brings no benefit.

In D10510#217775, @pfg wrote:

This is not really an objection but my preference for undefined behavior is to leave it unchanged for as long as we can.
Even if it's undefined behavior, a lot fo things can break,and until the GNU compatible behavior is implemented such "transitional" change brings no benefit.

My next couple of patches that I've already prepared will be introducing libregex with the GNU compatible behavior- this transitional change is to fix the much less obvious fallout that comes from \b suddenly being interpreted as word-bound rather than 'b', especially in cases where 'b' is *also* a word-bound and it works by coincidence rather than correctness.

Immediately following the resolution of this review, I will be submitting these other patches -- bsdgrep really needs to re-grow GNU-compatible behavior in a way that's slightly less broken as it previously was.

In D10510#217778, @kevans91_ksu.edu wrote:
In D10510#217775, @pfg wrote:

This is not really an objection but my preference for undefined behavior is to leave it unchanged for as long as we can.
Even if it's undefined behavior, a lot fo things can break,and until the GNU compatible behavior is implemented such "transitional" change brings no benefit.

My next couple of patches that I've already prepared will be introducing libregex with the GNU compatible behavior- this transitional change is to fix the much less obvious fallout that comes from \b suddenly being interpreted as word-bound rather than 'b', especially in cases where 'b' is *also* a word-bound and it works by coincidence rather than correctness.

As I wrote, it's not an objection, just that I don't believe in transitional changes :).

Immediately following the resolution of this review, I will be submitting these other patches -- bsdgrep really needs to re-grow GNU-compatible behavior in a way that's slightly less broken as it previously was.

I certainly agree to the last sentence.

In D10510#217784, @pfg wrote:

As I wrote, it's not an objection, just that I don't believe in transitional changes :).

Indeed, I just don't like causing even this kind of discomfort. =) Would it be helpful if I go ahead and submit some reviews for libregex bits so that this becomes less of a transitional change and more of a blocker for other things under review? (I consider it 'blocking' since it's a really unfortunate interpretation of undefined behavior that could be really really confusing if you don't immediately observe whether extensions are available or not)

In D10510#217785, @kevans91_ksu.edu wrote:
In D10510#217784, @pfg wrote:

As I wrote, it's not an objection, just that I don't believe in transitional changes :).

Indeed, I just don't like causing even this kind of discomfort. =) Would it be helpful if I go ahead and submit some reviews for libregex bits so that this becomes less of a transitional change and more of a blocker for other things under review? (I consider it 'blocking' since it's a really unfortunate interpretation of undefined behavior that could be really really confusing if you don't immediately observe whether extensions are available or not)

I think you could just include this change along with the corresponding GNU-compatible feature so we see why the change is needed.

I'm reviving this with the creation of D12935; add functionality to libregex. This will need to go in before-hand, and will need an exp-run of its own to correct bogus usage in the ports tree.

@kevans: does this revision need review?

In D10510#397581, @ngie wrote:

@kevans: does this revision need review?

Aye- I've revised the set of characters that may be escaped and started crunching through an exp-run [1] with this so I can get started on pounding out some of the bogus usage throughout the various ports. It's also been quite helpful in catching places where ports erroneously used \t, \r, and \n.

Review on the approach and idea would be appreciated- the main motivation for this change is so we can safely grant some of these ordinary characters meaning in libregex and have them error out in libc to indicate that it's bogus. It's all undefined behavior by the standard, so we're really just flipping out interpretation on it.

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=229925

Update based on what's going through exp-run; I've left in the NOTYET bit as I intend to use this block instead after working out where it conflicts with bmake/ports framework bits.

I have not reviewed this change in detail yet but am happy with the approach, now that the exp-run is in progress and the follow-on review is up.

Exp-run is fine now; updating to version with old regcomp compat prior to commit.

This revision was not accepted when it landed; it landed in state Needs Review.Jul 29 2020, 11:22 PM
This revision was automatically updated to reflect the committed changes.