PR153502: [libc] regex(3) bug with UTF-8 locale
ClosedPublic

Authored by yuripv on Nov 22 2018, 1:46 PM.

Details

Summary

findmust() in regcomp.c correctly processed the pattern as a multibyte string (so I'm leaving it alone this time). However, when we used its results in matcher(), we stepped back g->moffset *bytes* instead of *characters*, which produced inconsistent results.

To fix this, introduce a stepback() function in engine.c. It takes a short path for single-byte locales; for multibyte ones it goes back byte by byte, checking whether we have a legal character sequence.
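
For readers following along, here is a rough sketch of the idea, not the code in the diff. It assumes UTF-8 specifically and validates sequences by hand, whereas the real change works with whatever the current locale's encoding is. The names stepback_sketch() and utf8_seq_len(), and the single_byte flag, are made up for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return the length of the UTF-8 sequence that
 * starts at p and fits entirely in [p, end), or 0 if the bytes at p do
 * not form a structurally valid sequence.  Overlong and surrogate
 * encodings are not rejected; this is only a sketch.
 */
static size_t
utf8_seq_len(const unsigned char *p, const unsigned char *end)
{
	size_t i, len;

	if (p >= end)
		return (0);
	if (p[0] < 0x80)
		len = 1;
	else if ((p[0] & 0xE0) == 0xC0)
		len = 2;
	else if ((p[0] & 0xF0) == 0xE0)
		len = 3;
	else if ((p[0] & 0xF8) == 0xF0)
		len = 4;
	else
		return (0);	/* continuation or invalid lead byte */
	if ((size_t)(end - p) < len)
		return (0);
	for (i = 1; i < len; i++)
		if ((p[i] & 0xC0) != 0x80)
			return (0);
	return (len);
}

/*
 * Hypothetical sketch: step back nchars *characters* from cur without
 * going past start.  Returns the new position, or NULL if that many
 * characters are not available.
 */
static const char *
stepback_sketch(const char *start, const char *cur, size_t nchars,
    int single_byte)
{
	size_t back;
	int found;

	/* Short path: one byte per character, plain pointer arithmetic. */
	if (single_byte)
		return ((size_t)(cur - start) < nchars ? NULL : cur - nchars);

	while (nchars-- > 0) {
		found = 0;
		/* Try 1..4 bytes back until they form exactly one character. */
		for (back = 1; back <= 4 && (size_t)(cur - start) >= back;
		    back++) {
			if (utf8_seq_len((const unsigned char *)cur - back,
			    (const unsigned char *)cur) == back) {
				cur -= back;
				found = 1;
				break;
			}
		}
		if (!found)
			return (NULL);
	}
	return (cur);
}
```

With the 7-byte string "aé€b", stepping back 2 characters from the end lands on the first byte of "€" (offset 3), while the old byte-based arithmetic would land at offset 5, in the middle of that character.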

Test Plan

Ran the lib/libc/regex test cases, including the newly added one.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

I am not an expert on the area but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

The Boyer-Moore fast search algorithm can be used with UTF-8 data.

yuripv edited the summary of this revision. (Show Details)

redo.

In D18297#388056, @pfg wrote:

I am not an expert on the area but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

The Boyer-Moore fast search algorithm can be used with UTF-8 data.

I spent a bit more time on this, and finally found the real issue here, description and diff updated.

Aha... so nice to find a bug, thanks!

This revision is now accepted and ready to land. Nov 23 2018, 2:11 AM

Looks good to me. =) ++testcases

This revision was automatically updated to reflect the committed changes.