
PR153502: [libc] regex(3) bug with UTF-8 locale
ClosedPublic

Authored by yuripv on Nov 22 2018, 1:46 PM.

Details

Summary

findmust() in regcomp.c correctly processed the pattern as a multibyte string (so I'm leaving it alone this time); however, when we used its results in matcher(), we stepped back g->moffset *bytes* instead of *characters*, which produced inconsistent results.

To fix this, introduce a stepback() function in engine.c that takes a short path for single-byte locales and, for multibyte ones, goes back byte by byte, checking whether we have a legal character sequence.

Test Plan

Running lib/libc/regex test cases, including the newly added one.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

yuripv created this revision. Nov 22 2018, 1:46 PM
pfg added a comment. Nov 22 2018, 3:10 PM

I am not an expert in this area, but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.
The Boyer-Moore fast search algorithm can be used with UTF-8 data.
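To illustrate the RFC's point, here is a small sketch that uses plain `strstr()` as a stand-in for Boyer-Moore (both compare bytes only). The helper names `find_bytes` and `utf8_chars` are hypothetical, and the code assumes valid UTF-8 input. Because no UTF-8 character's encoding can appear inside another's, a byte-wise match always starts on a character boundary; the byte offset of the match, however, is generally not the character offset.

```c
#include <stddef.h>
#include <string.h>

/* Count UTF-8 characters in the first nbytes of s (assumes valid UTF-8). */
static size_t
utf8_chars(const char *s, size_t nbytes)
{
	size_t i, n = 0;

	for (i = 0; i < nbytes; i++) {
		/* Continuation bytes look like 10xxxxxx; skip them. */
		if (((unsigned char)s[i] & 0xC0) != 0x80)
			n++;
	}
	return (n);
}

/* Byte-wise search: byte offset of needle in hay, or -1 if absent. */
static long
find_bytes(const char *hay, const char *needle)
{
	const char *p = strstr(hay, needle);

	return (p == NULL ? -1 : (long)(p - hay));
}
```

For the haystack "ééxy" (each "é" is two bytes) and the needle "xy", `find_bytes` reports byte offset 4, while `utf8_chars` over those 4 bytes gives 2 characters. Any code that mixes the two units will step to the wrong place.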

yuripv edited the summary of this revision. Nov 22 2018, 10:01 PM
yuripv updated this revision to Diff 50938.

redo.

yuripv added a comment. Edited Nov 22 2018, 10:02 PM
In D18297#388056, @pfg wrote:

I am not an expert in this area, but I see where the comment comes from:
RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.
The Boyer-Moore fast search algorithm can be used with UTF-8 data.

I spent a bit more time on this and finally found the real issue; the description and diff have been updated.

yuripv updated this revision to Diff 50940. Nov 22 2018, 10:09 PM
pfg accepted this revision. Nov 23 2018, 2:11 AM

Aha... so nice to find a bug, thanks!

This revision is now accepted and ready to land. Nov 23 2018, 2:11 AM
kevans accepted this revision. Nov 23 2018, 2:16 AM

Looks good to me. =) ++testcases

This revision was automatically updated to reflect the committed changes.