(This is still a WIP; I'd like to know whether the idea looks sane first.)
This is something I've wanted to do for a long time: the goal is a *complete* ctype map for UTF-8. What I had been missing is that we already have a definitive source of ctype information.
This includes a bit of cleanup to make things easier and cleaner, but the main change is in utf8-rollup.pl: it no longer uses manually assembled definitions and instead parses UnicodeData.txt directly. The only issue is that there's no direct mapping between the general categories defined in UnicodeData.txt and the character classes defined by POSIX, so I used my best judgement.
The format is described at: http://www.unicode.org/reports/tr44/#UnicodeData.txt
Categories are described at: http://www.unicode.org/reports/tr44/#General_Category_Values
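To make the idea concrete, here is a rough sketch of the approach in Python (not the actual Perl in utf8-rollup.pl). It reads the semicolon-separated records, takes the code point from field 0 and the General_Category from field 2, and buckets code points into POSIX-style classes. The `CATEGORY_TO_CTYPE` table and the function names are made up for illustration, and the mapping is only a plausible guess; the choices made in the script may differ. The sketch also ignores the `<..., First>` / `<..., Last>` range pairs that UnicodeData.txt uses for large blocks.

```python
# Hypothetical mapping from General_Category to POSIX-style class names.
# Illustration only: the real script's judgement calls may differ.
CATEGORY_TO_CTYPE = {
    "Lu": ("alpha", "upper"),
    "Ll": ("alpha", "lower"),
    "Lt": ("alpha",),
    "Lm": ("alpha",),
    "Lo": ("alpha",),
    "Po": ("punct",),
    "Ps": ("punct",),
    "Pe": ("punct",),
    "Zs": ("space",),
    "Cc": ("cntrl",),
}


def parse_unicode_data(lines):
    """Yield (code_point, general_category) for each UnicodeData.txt record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Field 0 is the code point in hex, field 2 is the General_Category.
        yield int(fields[0], 16), fields[2]


def classify(lines):
    """Group code points by class name; unmapped categories are skipped."""
    classes = {}
    for code_point, category in parse_unicode_data(lines):
        for name in CATEGORY_TO_CTYPE.get(category, ()):
            classes.setdefault(name, []).append(code_point)
    return classes


# A few records in the UnicodeData.txt layout, used as sample input.
SAMPLE = [
    "0020;SPACE;Zs;0;WS;;;;;N;;;;;",
    "0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
]

if __name__ == "__main__":
    for name, points in sorted(classify(SAMPLE).items()):
        print(name, [f"U+{cp:04X}" for cp in points])
```

Categories with no entry in the table are silently dropped here, which is exactly where the judgement calls come in.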
I did NOT do any range compression in utf8-rollup.pl: I'm no Perl wizard, and localedef does it for us, so it doesn't look worth spending the time on. The resulting file has grown somewhat, though; if that's an issue, I'll look into it.
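For reference, if the file size does turn out to matter, range compression could look roughly like the sketch below (again Python rather than Perl, with hypothetical function names). It collapses runs of consecutive code points into (first, last) pairs. The `<UXXXX>...<UXXXX>` output notation is my assumption, modelled on the ellipsis form in POSIX locale source files; it would have to be checked against what our localedef actually accepts.

```python
def compress_ranges(code_points):
    """Collapse code points into sorted, inclusive (first, last) ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            # Extends the current run by one code point.
            ranges[-1][1] = cp
        else:
            # Starts a new run.
            ranges.append([cp, cp])
    return [(first, last) for first, last in ranges]


def format_ranges(ranges):
    """Render ranges in an assumed <UXXXX>...<UXXXX> locale-source style."""
    parts = []
    for first, last in ranges:
        if first == last:
            parts.append(f"<U{first:04X}>")
        else:
            parts.append(f"<U{first:04X}>...<U{last:04X}>")
    return ";".join(parts)


if __name__ == "__main__":
    upper = list(range(0x41, 0x5B)) + [0xC0, 0xC1, 0xC2]
    print(format_ranges(compress_ranges(upper)))
```

Input order and duplicates don't matter, since the code points are de-duplicated and sorted before the runs are built.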