PR153502: [libc] regex(3) bug with UTF-8 locale
ClosedPublic

Authored by yuripv on Nov 22 2018, 1:46 PM.

Details

Summary

findmust() in regcomp.c correctly processes the pattern as a multibyte string (so I'm leaving it alone this time); however, when its results were used in matcher(), we stepped back g->moffset *bytes* instead of *characters*, which produced inconsistent results.

To fix this, introduce a stepback() function in engine.c that takes a short path for single-byte locales and, for multibyte ones, goes back byte by byte, checking whether we have a legal character sequence.
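For readers unfamiliar with the problem, here is a minimal sketch of stepping back by characters rather than bytes. It is not the code from this diff: it assumes the input is UTF-8 and recognizes character boundaries by skipping continuation bytes (bit pattern 10xxxxxx), where the committed stepback() is described as validating character sequences for whatever multibyte locale is active. The name utf8_stepback and its signature are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical helper, UTF-8 only: return a pointer to the position
 * nchar characters before cur, or NULL if that would move before
 * start or the bytes at start do not begin a character.
 */
static const char *
utf8_stepback(const char *start, const char *cur, size_t nchar)
{
	const char *p = cur;

	while (nchar > 0) {
		if (p <= start)
			return (NULL);
		/* Move back one byte, then skip continuation bytes. */
		p--;
		while (p > start && ((unsigned char)*p & 0xC0) == 0x80)
			p--;
		/* Ran into start while still on a continuation byte. */
		if (((unsigned char)*p & 0xC0) == 0x80)
			return (NULL);
		nchar--;
	}
	return (p);
}
```

With the 7-byte string "a", U+00E9, U+20AC, "b", stepping back two characters from the end moves four bytes, whereas subtracting 2 from the pointer would land in the middle of the three-byte sequence for U+20AC.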

Test Plan

Ran the lib/libc/regex test cases, including the newly added one.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

I am not an expert on the area but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

The Boyer-Moore fast search algorithm can be used with UTF-8 data.

yuripv edited the summary of this revision.

redo.

In D18297#388056, @pfg wrote:

I am not an expert on the area but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

The Boyer-Moore fast search algorithm can be used with UTF-8 data.

I spent a bit more time on this, and finally found the real issue here, description and diff updated.

Aha... so nice to find a bug, thanks!

This revision is now accepted and ready to land. Nov 23 2018, 2:11 AM

Looks good to me. =) ++testcases

This revision was automatically updated to reflect the committed changes.