sed(1) / regex(3): correctly identify word boundaries when pattern is matched more than once
Abandoned, Public

Authored by avg on Jun 12 2015, 1:30 PM.

Details

Reviewers

Summary

Currently there is a problem with matching word boundaries at the start
of the pattern when the pattern is applied to the string more than once.
For example, in the sed substitution command with the 'g' flag and a pattern
starting with '[[:<:]]'.

Here is a concrete example of the problem:
http://thread.gmane.org/gmane.os.freebsd.current/152226/focus=152228

With this change, sed makes the whole string available to regexec(3) and passes
the offsets of the start and the end of the substring to work on.
This allows regex(3) to correctly identify a word boundary at the start
of the substring, because now we can examine 'lastc' (previous character)
when operating on the beginning of the substring which is not at the very
start of the string.
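
The boundary check described above can be sketched in isolation. This is only an illustration, not the code in lib/libc/regex/engine.c: the helper names and the `whole_string` parameter are hypothetical, and "word character" is assumed to mean an alphanumeric or '_'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumption: word characters are alphanumerics and '_'. */
static int is_word_char(int c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/*
 * Hypothetical helper: does a word begin at s[start]?
 * whole_string != 0 models a caller that passed the entire string, so the
 * previous character s[start - 1] can be examined.
 * whole_string == 0 models a caller that passed only the substring, so the
 * start of the substring is indistinguishable from the start of the string.
 */
static int word_starts_at(const char *s, size_t len, size_t start,
    int whole_string)
{
    assert(start <= len);
    if (start == len || !is_word_char((unsigned char)s[start]))
        return 0;               /* nothing here that could begin a word */
    if (start == 0 || !whole_string)
        return 1;               /* no previous character is visible */
    return !is_word_char((unsigned char)s[start - 1]);
}
```

For the string "xx" and offset 1, the whole-string variant reports no word start, while the substring-only variant reports one. That false boundary is the kind of mismatch the summary describes.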

Note that other utilities that use regex(3) would also have to pass
the whole string to avoid the same problem. The regex(3) change alone is not
enough to fix all the consumers.
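
As a rough sketch of what passing the whole string means for a consumer, the loop below always hands regexec(3) the full buffer and narrows the search window with REG_STARTEND. The function name `count_matches` is made up for illustration and this is not the sed code from the diff. It assumes a libc that provides the REG_STARTEND extension and reports match offsets relative to the start of the full string.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: count non-overlapping matches of an extended regex.
 * The full string is passed on every call; only the window in m[0] moves.
 * Returns -1 if the pattern does not compile.
 */
static int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    size_t len, pos = 0;
    int count = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    len = strlen(text);

    while (pos <= len) {
        /* With REG_STARTEND, m[0] is an input: the window to search. */
        m[0].rm_so = (regoff_t)pos;
        m[0].rm_eo = (regoff_t)len;
        /* REG_STARTEND does not imply REG_NOTBOL, so set it explicitly. */
        int eflags = REG_STARTEND | (pos > 0 ? REG_NOTBOL : 0);
        if (regexec(&re, text, 1, m, eflags) != 0)
            break;
        count++;
        /* Resume after the match; step one byte past an empty match. */
        if (m[0].rm_eo > m[0].rm_so)
            pos = (size_t)m[0].rm_eo;
        else
            pos = (size_t)m[0].rm_eo + 1;
    }
    regfree(&re);
    return count;
}
```

Because the buffer pointer never changes, the matcher is in a position to look at the character before the window, which is not possible when the caller passes `text + pos` instead.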

This change also includes a hack to allow libc/regex/grot to compile
after xlocale changes.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Event Timeline

avg updated this revision to Diff 6138. Jun 12 2015, 1:30 PM

avg retitled this revision from to sed(1) / regex(3): correctly identify word boundaries when pattern is matched more than once.

avg updated this object.

avg edited the test plan for this revision.

Pedro, I've noticed your recent activity in this area. Maybe you'll have some feedback. Thank you.

pfg added a parent revision: D6257: Regex improvement. May 17 2016, 2:21 PM

Interestingly .. the OpenBSD guys have been working on a similar issue that breaks Mesa.

I almost committed a similar patch, but it is incorrect.
Check:
https://www.mail-archive.com/tech@openbsd.org/msg31625.html

lib/libc/regex/engine.c
792	This causes a regression in regexec(3) REG_STARTEND.
897	As in the previous line, this causes a regression. I have an improved patch but it's not ready either.
lib/libc/regex/regcomp.c
774	Are you disabling collation? This seems unrelated and should be an independent revision.

This revision now requires changes to proceed. May 17 2016, 2:33 PM

In D2792#136077, @pfg wrote:

Interestingly .. the OpenBSD guys have been working on a similar issue that breaks Mesa.

I almost committed a similar patch, but it is incorrect.
Check:
https://www.mail-archive.com/tech@openbsd.org/msg31625.html

Thank you for the pointer.

lib/libc/regex/regcomp.c
774	I don't recall many details now. I think that I did this to get some test program to compile. Please note that for normal compilation `REDEBUG` is never defined and so the code was enabled.