Page MenuHomeFreeBSD

Unicode collation support
AbandonedPublic

Authored by bapt on Aug 8 2015, 7:52 PM.

Details

Summary

This is the needed changes in order to have the new collation support. this part only concern the libc

This has been merged from Illumos and Dragonfly work + small fixes.

Please note that the whole LC_COLLATE has been changed to a new format generated from CLDR unicode db using tools written by edwin@ and extended by marino@ and an Illumos localedef(1) tool which will be added into base along with this.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

bapt retitled this revision from to Unicode collation support.Aug 8 2015, 7:52 PM
bapt updated this object.
bapt edited the test plan for this revision. (Show Details)
bapt added reviewers: jilles, theraven, pfg.
bapt set the repository for this revision to rS FreeBSD src repository.
bapt added a subscriber: marino.
bapt updated this revision to Diff 7788.
pfg added inline comments.Aug 8 2015, 9:07 PM
lib/libc/locale/collate.h
2

Systems

jilles added inline comments.Aug 8 2015, 11:28 PM
lib/libc/locale/Symbol.map
218 ↗(On Diff #7788)

Why do these need to be exported?

Note that exports at version FBSDprivate_1.0 are for situations where a symbol is needed by a different DSO or executable in base but not outside base (the declaration should be in a header file that is not installed). If a symbol is only needed in libc itself, don't list it in a Symbol.map at all and make it static or add __hidden in the declaration for best efficiency.

lib/libc/locale/collate.c
67–68

This #include should go below the other #includes.

338

Hmm, is it possible that *t < 0? This would access the array out of bounds.

lib/libc/locale/collate.h
45

whose

lib/libc/locale/utf8.c
150

This is a good change in principle but should be a separate commit since it's not related to collation.

Given the common use of UTF-16, it is unlikely that the 5- and 6-byte sequences will ever be used. Therefore, support for them can be removed entirely instead of adding #if 0.

Additionally, sequences with value greater than 0x10ffff should be rejected (these are some 4-byte sequences).

200

This code should not be removed. UTF-16 surrogates are invalid in UTF-8.

There are some encodings very similar to but different from UTF-8 that do permit UTF-16 surrogates (CESU-8 and Java Modified UTF-8) but people should not use this code to decode those.

333

As above, the conditions for 5- and 6-byte sequences can be removed and the condition for 4-byte sequences should be adjusted (wc >= 0 && wc <= 0x10ffff).

lib/libc/string/strcoll.c
45

easiest

48

Hmm, this means that strcoll_l() is not transitive, and therefore not suitable for use as a comparison function for qsort(), for example.

One fix is to make strings that cannot be converted to wide characters compare greater than strings that cannot be converted, ordering two strings that cannot be converted by strcmp(). A corresponding change (prepending a byte indicating whether the string could be converted) is needed in strxfrm_l().

113

free(NULL) does nothing and I think it is OK to rely on that. Unless Illumos code style is different, I think the ifs should be removed.

The ifs in the normal path are always true, even.

bapt marked 7 inline comments as done.Aug 9 2015, 10:16 AM

Mark already addressed comments as done (update will come with when all comments will be addressed)

bapt marked an inline comment as done.Aug 9 2015, 10:24 AM
bapt marked an inline comment as done.Aug 9 2015, 10:36 AM
bapt added inline comments.
lib/libc/locale/collate.c
338

Reading where it is called, I can't find how it would be possible to have *t < 0

theraven added inline comments.Aug 9 2015, 11:43 AM
lib/libc/locale/collate.c
104–105

Why does ret exist as a variable now that we don't have anything in between its definition and use?

128

Given the cost of the printf and the file I/O to load the table relative to the cost of a heap allocation, asprintf() / free() would probably be cleaner here.

lib/libc/locale/euc.c
24

We have permission from UCB to remove this clause, do we have permission from the other copyright holders (or has all of their code been removed?)

lib/libc/locale/setrunelocale.c
113–120

Same comment here re asprintf() / free().

bapt marked 3 inline comments as done.Aug 9 2015, 12:16 PM
bapt added inline comments.
lib/libc/locale/euc.c
24
lib/libc/locale/setrunelocale.c
113–120

I have addressed both with your suggestion even if I prefer the snprintf version :)

bapt marked an inline comment as done.Aug 9 2015, 1:54 PM
bapt added inline comments.
lib/libc/string/strcoll.c
48

Sorry but I do not really understand what you mean here? Would it be possible for you to show me an example code that would trigger the issue or even better to directly make that change in the collation branch?

pfg added inline comments.Aug 9 2015, 2:18 PM
lib/libc/locale/euc.c
24

We don't have permission from Tim J Robbins and email to him bounces back.

bapt added inline comments.Aug 9 2015, 4:33 PM
lib/libc/locale/euc.c
24

Garrett believes he got the permssion, but he does not have access to the mail anymore
What should I do readd the close?

bapt updated this revision to Diff 9367.Oct 14 2015, 12:54 PM
bapt marked an inline comment as done.Nov 7 2015, 1:09 PM
bapt abandoned this revision.

committed as rS290494