use ctype data from UnicodeData.txt
Closed · Public

Authored by yuripv on Nov 5 2018, 9:35 AM.

Details

Reviewers

Commits

rS341629: MFC r340491, r340492:
rS340491: Use UnicodeData.txt to create UTF-8 ctype map.

Summary

This is something I have wanted to do for a long time, the goal being a *complete* ctype map for UTF-8; I was just missing the fact that we already have a definitive source of ctype information.

The only issue here is that there's no direct mapping between the categories defined in UnicodeData.txt and the ones defined by POSIX, so I used my best judgement here.

The format is described at: http://www.unicode.org/reports/tr44/#UnicodeData.txt

Categories are described at: http://www.unicode.org/reports/tr44/#General_Category_Values
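
To make the category question concrete, here is a minimal sketch of reading one UnicodeData.txt record and looking up POSIX classes for its General_Category field. The `CATEGORY_TO_CLASSES` table and both function names are hypothetical and cover only a few categories; they illustrate the kind of judgement call involved and are not the mapping used in this revision.

```python
# Hypothetical mapping from Unicode General_Category values to POSIX
# character classes. Illustrative assumption only, NOT the mapping
# chosen in this revision.
CATEGORY_TO_CLASSES = {
    "Lu": {"upper", "alpha", "alnum", "graph", "print"},
    "Ll": {"lower", "alpha", "alnum", "graph", "print"},
    "Nd": {"digit", "alnum", "graph", "print"},
    "Po": {"punct", "graph", "print"},
    "Cc": {"cntrl"},
}


def parse_line(line):
    """Split one UnicodeData.txt record into (code point, name, category).

    Fields are semicolon-separated: field 0 is the hexadecimal code
    point, field 1 the character name, field 2 the General_Category.
    """
    fields = line.rstrip("\n").split(";")
    return int(fields[0], 16), fields[1], fields[2]


def posix_classes(category):
    """Return the POSIX classes for a category, or an empty set if unmapped."""
    return CATEGORY_TO_CLASSES.get(category, set())


if __name__ == "__main__":
    record = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
    codepoint, name, category = parse_line(record)
    print(hex(codepoint), name, sorted(posix_classes(category)))
```

Categories without an obvious POSIX counterpart (for example the various mark and symbol categories) are exactly where a table like this needs a decision, so the sketch leaves them unmapped.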

Test Plan

type    orig    new
alnum   94229   126029
alpha   93557   125419
blank   4       2
cntrl   73      137685
digit   469     622
graph   109615  137203
lower   1478    2145
print   109641  137222
punct   3428    797
rune    110481  274907
space   33      24
upper   983     1781
xdigit  469     622
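
The review does not say how the counts above were obtained. As a rough sketch of one way to tally class membership straight from UnicodeData.txt, the code below counts code points per POSIX class, given a caller-supplied category-to-class table. The function names and the `class_map` parameter are hypothetical. The sketch assumes that ranges are written as a pair of records whose names end in `, First>` and `, Last>`, as TR44 describes, and counts every code point in such a range.

```python
from collections import Counter


def iter_entries(lines):
    """Yield (number_of_code_points, category) for each record or range.

    A pair of records named "<..., First>" and "<..., Last>" is treated
    as one range covering every code point between them, inclusive.
    """
    range_start = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        code, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = code
        elif name.endswith(", Last>") and range_start is not None:
            yield code - range_start + 1, category
            range_start = None
        else:
            yield 1, category


def count_by_class(lines, class_map):
    """Count code points per POSIX class.

    class_map is a hypothetical dict from General_Category value to an
    iterable of POSIX class names; unmapped categories are ignored.
    """
    counts = Counter()
    for n, category in iter_entries(lines):
        for cls in class_map.get(category, ()):
            counts[cls] += n
    return counts


if __name__ == "__main__":
    example_map = {"Lu": ("alpha", "upper"), "Nd": ("digit",)}
    records = [
        "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
        "0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;",
    ]
    for cls, n in sorted(count_by_class(records, example_map).items()):
        print(f"{cls:<8}{n}")
```

Running the same tally against the old and new ctype sources would give a side-by-side comparison in the shape of the table above.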

Diff Detail

Repository

rS FreeBSD src repository - subversion

Event Timeline

yuripv created this revision. Nov 5 2018, 9:35 AM

Herald added a subscriber: imp. Nov 5 2018, 9:35 AM

yuripv edited the summary of this revision. Nov 5 2018, 9:41 AM

I haven't looked into the details, but I do like this idea. Also note that if you do not like Perl, you can replace the code; I inherited it :)
I have been thinking about replacing it with some awk, but I gave up in the meantime.

cleanup done separately; rebase

yuripv edited the summary of this revision. Nov 7 2018, 12:16 AM

That looks sane to me; the thing is, I wonder how hard it would be to maintain.

In D17842#382775, @bapt wrote:

That looks sane to me; the thing is, I wonder how hard it would be to maintain.

This should not need any maintenance, as the definitions now come directly from UnicodeData.txt; once there's a new CLDR/UNIDATA release, all it takes is to run the utf8-rollup.pl script. That holds, of course, only if we find the translation of Unicode character categories to POSIX character classes suitable for us (I think it is).

Baptiste, anything else you want to see done/answered for this to proceed?

LGTM

This revision is now accepted and ready to land. Nov 26 2018, 8:33 AM

Done in rS340491.