
PR153502: [libc] regex(3) bug with UTF-8 locale
Closed, Public

Authored by yuripv on Nov 22 2018, 1:46 PM.

Details

Summary

findmust() in regcomp.c correctly processed the pattern as a multibyte string (so I'm leaving it alone this time); however, when we used its results in matcher(), we stepped back g->moffset *bytes* instead of *characters*, which produced inconsistent results.

To fix this, introduce a stepback() function in engine.c that takes a short path for single-byte locales and, for multibyte ones, goes back byte by byte, checking whether we have a legal character sequence.
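
For illustration only, here is a minimal sketch of the "step back N characters, not N bytes" idea. It is not the stepback() from this diff: the name stepback_utf8 and its signature are hypothetical, and it assumes the input is UTF-8, so it only skips continuation bytes (10xxxxxx) instead of validating sequences through the locale's multibyte functions as the description above says the real code does.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not the committed code): move `cur` back by
 * `nchar` UTF-8 characters, never going before `start`.
 * Returns the new position, or NULL if fewer than `nchar` characters
 * precede `cur`.
 * Assumption: the buffer is well-formed UTF-8; no validation is done.
 */
static const char *
stepback_utf8(const char *start, const char *cur, size_t nchar)
{
	assert(start <= cur);

	while (nchar > 0) {
		if (cur == start)
			return (NULL);	/* ran out of input */
		cur--;
		/* Skip continuation bytes until we reach a lead byte. */
		while (cur > start && ((unsigned char)*cur & 0xC0) == 0x80)
			cur--;
		nchar--;
	}
	return (cur);
}
```

With the 4-byte string "a", U+00E9 (two bytes), "b", stepping back 2 characters from the end lands on byte offset 1, whereas stepping back 2 bytes would land on offset 2, in the middle of the two-byte character. That mismatch is the kind of inconsistency the summary describes.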

Test Plan

Running lib/libc/regex test cases, including the newly added one.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

I am not an expert on the area but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

The Boyer-Moore fast search algorithm can be used with UTF-8 data.

yuripv edited the summary of this revision.

redo.

In D18297#388056, @pfg wrote:

I am not an expert on the area but I see where the comment comes from:

RFC3629 states on page 2:

The byte-value lexicographic sorting order of UTF-8 strings is the
same as if ordered by character numbers. Of course this is of
limited interest since a sort order based on character numbers is
almost never culturally valid.

The Boyer-Moore fast search algorithm can be used with UTF-8 data.

I spent a bit more time on this, and finally found the real issue here, description and diff updated.

Aha, so nice to find a bug, thanks!

This revision is now accepted and ready to land. Nov 23 2018, 2:11 AM

Looks good to me. =) ++testcases

This revision was automatically updated to reflect the committed changes.