
use ctype data from UnicodeData.txt
ClosedPublic

Authored by yuripv on Nov 5 2018, 9:35 AM.

Details

Summary

This is something I have wanted to do for a long time, the goal being a *complete* ctype map for UTF-8. I had simply overlooked the fact that we already have a definitive source of ctype information.

The only issue is that there is no direct mapping between the General Category values defined in UnicodeData.txt and the character classes defined by POSIX, so I used my best judgement.

The format is described at: http://www.unicode.org/reports/tr44/#UnicodeData.txt

Categories are described at: http://www.unicode.org/reports/tr44/#General_Category_Values
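
To illustrate the kind of translation described above, here is a minimal sketch in Python. It relies only on the documented UnicodeData.txt layout (semicolon-separated fields, with the code point in field 0, the name in field 1, and the General Category in field 2). The names `CATEGORY_TO_CLASSES`, `MAJOR_TO_CLASSES`, `parse_line`, and `posix_classes` are hypothetical, and the mapping itself is an assumption made for illustration; it is not the mapping chosen in this revision.

```python
# Hypothetical General_Category -> POSIX class mapping, for illustration only.
# It is NOT the mapping used by the revision under review.
CATEGORY_TO_CLASSES = {
    "Lu": frozenset({"upper", "alpha", "graph", "print"}),
    "Ll": frozenset({"lower", "alpha", "graph", "print"}),
    "Lt": frozenset({"alpha", "graph", "print"}),
    "Lm": frozenset({"alpha", "graph", "print"}),
    "Lo": frozenset({"alpha", "graph", "print"}),
    "Nd": frozenset({"digit", "graph", "print"}),
    "Zs": frozenset({"space", "print"}),
    "Cc": frozenset({"cntrl"}),
}

# Fallback keyed on the major class (first letter of the category),
# again an assumption: all punctuation and symbols are treated as punct.
MAJOR_TO_CLASSES = {
    "P": frozenset({"punct", "graph", "print"}),
    "S": frozenset({"punct", "graph", "print"}),
}


def parse_line(line):
    """Return (code point, name, General_Category) from one UnicodeData.txt line."""
    fields = line.rstrip("\n").split(";")
    return int(fields[0], 16), fields[1], fields[2]


def posix_classes(category):
    """Look up the POSIX classes assumed for a General_Category value."""
    if category in CATEGORY_TO_CLASSES:
        return CATEGORY_TO_CLASSES[category]
    # Unmapped categories (e.g. Cn) get no class at all.
    return MAJOR_TO_CLASSES.get(category[:1], frozenset())


# A few lines in UnicodeData.txt format.
SAMPLE = [
    "0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;",
    "0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;",
    "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
]

for line in SAMPLE:
    code_point, name, category = parse_line(line)
    print(f"U+{code_point:04X} {category} {sorted(posix_classes(category))}")
```

A real implementation would also have to handle the `<..., First>` / `<..., Last>` range entries in UnicodeData.txt, which this sketch ignores.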

Test Plan
type    orig    new
alnum   94229   126029
alpha   93557   125419
blank   4       2
cntrl   73      137685
digit   469     622
graph   109615  137203
lower   1478    2145
print   109641  137222
punct   3428    797
rune    110481  274907
space   33      24
upper   983     1781
xdigit  469     622
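
As a reading aid for the table above, the following sketch computes the per-class change (new minus orig) and lists the classes that shrank. The figures are copied verbatim from the test plan; the names `TEST_PLAN` and `class_deltas` are hypothetical, and this is not how the counts were originally produced.

```python
# Counts copied from the test plan: class name, original count, new count.
TEST_PLAN = """\
alnum   94229   126029
alpha   93557   125419
blank   4       2
cntrl   73      137685
digit   469     622
graph   109615  137203
lower   1478    2145
print   109641  137222
punct   3428    797
rune    110481  274907
space   33      24
upper   983     1781
xdigit  469     622
"""


def class_deltas(table):
    """Map each class name to (new count - original count)."""
    deltas = {}
    for row in table.splitlines():
        name, orig, new = row.split()
        deltas[name] = int(new) - int(orig)
    return deltas


deltas = class_deltas(TEST_PLAN)
shrunk = sorted(name for name, delta in deltas.items() if delta < 0)
print(shrunk)  # ['blank', 'punct', 'space']
```

Only blank, punct, and space lose members; every other class grows.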

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

yuripv created this revision. Nov 5 2018, 9:35 AM
yuripv edited the summary of this revision. Nov 5 2018, 9:41 AM
bapt added a comment. Nov 5 2018, 9:54 AM

I haven't looked into the details, but I do like this idea. Also note that if you do not like Perl you can replace the code; I inherited it :)
I had been thinking about replacing it with some awk, but I gave up in the meantime.

yuripv updated this revision to Diff 50097. Nov 7 2018, 12:12 AM
yuripv edited the summary of this revision.
yuripv edited the test plan for this revision.

cleanup done separately; rebase

yuripv edited the summary of this revision. Nov 7 2018, 12:16 AM
bapt added a comment. Nov 9 2018, 2:06 PM

That looks sane to me; the thing is, I wonder how hard it would be to maintain.

yuripv added a comment. Nov 9 2018, 3:15 PM
In D17842#382775, @bapt wrote:

That looks sane to me; the thing is, I wonder how hard it would be to maintain.

This should not need any maintenance, as the definitions now come directly from UnicodeData.txt: once there is a new CLDR/UNIDATA release, all it takes is to run the utf8-rollup.pl script. That assumes, of course, that we find the translation of Unicode character categories to POSIX character classes suitable for us (I think it is).

Baptiste, anything else you want to see done/answered for this to proceed?

bapt accepted this revision. Nov 26 2018, 8:33 AM

LGTM

This revision is now accepted and ready to land. Nov 26 2018, 8:33 AM