Details

Reviewers

kevans
pfg

Commits

rS341745: MFC r340835:
rS340835: regexec: fix processing multibyte strings.

Summary

findmust() in regcomp.c correctly processed the pattern as a multibyte string (so I'm leaving it alone this time); however, when we used its results in matcher(), we stepped back g->moffset *bytes* instead of *characters*, which produced inconsistent results.

To fix this, introduce a stepback() function in engine.c that takes a short path for single-byte locales and, for multibyte ones, goes back byte by byte, checking whether we have a legal character sequence.
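
A minimal sketch of the step-back idea described above, not the committed code. The name stepback_chars(), its signature, and its NULL-on-failure convention are assumptions made for illustration; it relies only on the standard mbrtowc() and MB_CUR_MAX and uses the current LC_CTYPE locale.

```c
#include <assert.h>
#include <locale.h>   /* setlocale, for callers that select a UTF-8 locale */
#include <stddef.h>
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* mbrtowc, mbstate_t */

/*
 * Hypothetical helper (illustration only): step back `nchar` characters
 * from `cur` without going before `start`. Returns the new position, or
 * NULL if fewer than `nchar` valid characters precede `cur`.
 * Embedded NUL bytes are not handled.
 */
static const char *
stepback_chars(const char *start, const char *cur, int nchar)
{
	assert(start <= cur && nchar >= 0);

	/* Short path: in a single-byte locale one character is one byte. */
	if (MB_CUR_MAX == 1)
		return ((cur - start >= nchar) ? cur - nchar : NULL);

	while (nchar-- > 0) {
		size_t len, maxlen = MB_CUR_MAX;
		int found = 0;

		/* Try 1, 2, ... bytes back until they decode as one character. */
		for (len = 1; len <= maxlen && (size_t)(cur - start) >= len; len++) {
			mbstate_t mbs;
			size_t r;

			memset(&mbs, 0, sizeof(mbs));
			r = mbrtowc(NULL, cur - len, len, &mbs);
			if (r == len) {
				found = 1;
				break;
			}
		}
		if (!found)
			return (NULL);
		cur -= len;
	}
	return (cur);
}
```

With a UTF-8 locale selected via setlocale(), stepping back two characters from the end of the four-byte string "a", "é", "b" should land on the first byte of "é" (offset 1), whereas stepping back two bytes would land in the middle of that character.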

Test Plan

Running lib/libc/regex test cases, including the newly added one.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

yuripv created this revision. Nov 22 2018, 1:46 PM

Herald added a subscriber: imp. Nov 22 2018, 1:46 PM

I am not an expert on the area but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

The Boyer-Moore fast search algorithm can be used with UTF-8 data.

redo.

In D18297#388056, @pfg wrote:

I am not an expert on the area but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

The Boyer-Moore fast search algorithm can be used with UTF-8 data.

I spent a bit more time on this, and finally found the real issue here, description and diff updated.

typos.

yuripv updated this revision to Diff 50940. Nov 22 2018, 10:09 PM

Aha... so nice to find a bug, thanks!

This revision is now accepted and ready to land. Nov 23 2018, 2:11 AM

Looks good to me. =) ++testcases

Closed by commit rS340835: regexec: fix processing multibyte strings. (authored by yuripv). Nov 23 2018, 3:49 PM

This revision was automatically updated to reflect the committed changes.

PR153502: [libc] regex(3) bug with UTF-8 locale
Closed, Public

Changeset List

Diff 50939

lib/libc/regex/engine.c

lib/libc/tests/regex/Makefile.inc

lib/libc/tests/regex/multibyte.sh
