Page MenuHomeFreeBSD

Unicode collation support
AbandonedPublic

Authored by bapt on Aug 8 2015, 7:52 PM.
Tags
None
Referenced Files
Unknown Object (File)
Feb 29 2024, 2:17 PM
Unknown Object (File)
Feb 22 2024, 10:12 AM
Unknown Object (File)
Dec 20 2023, 12:48 AM
Unknown Object (File)
Nov 18 2023, 11:35 AM
Unknown Object (File)
Nov 17 2023, 10:52 PM
Unknown Object (File)
Nov 17 2023, 10:47 PM
Unknown Object (File)
Aug 12 2023, 10:06 AM
Unknown Object (File)
Jun 28 2023, 10:04 PM
Subscribers

Details

Summary

This is the needed changes in order to have the new collation support. this part only concern the libc

This has been merged from Illumos and Dragonfly work + small fixes.

Please note that the whole LC_COLLATE has been changed to a new format generated from CLDR unicode db using tools written by edwin@ and extended by marino@ and an Illumos localedef(1) tool which will be added into base along with this.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

bapt retitled this revision from to Unicode collation support.
bapt updated this object.
bapt edited the test plan for this revision. (Show Details)
bapt added reviewers: jilles, theraven, pfg.
bapt set the repository for this revision to rS FreeBSD src repository - subversion.
bapt added a subscriber: marino.
lib/libc/locale/collate.h
2

Systems

lib/libc/locale/Symbol.map
218

Why do these need to be exported?

Note that exports at version FBSDprivate_1.0 are for situations where a symbol is needed by a different DSO or executable in base but not outside base (the declaration should be in a header file that is not installed). If a symbol is only needed in libc itself, don't list it in a Symbol.map at all and make it static or add __hidden in the declaration for best efficiency.

lib/libc/locale/collate.c
65

This #include should go below the other #includes.

333

Hmm, is it possible that *t < 0? This would access the array out of bounds.

lib/libc/locale/collate.h
45

whose

lib/libc/locale/utf8.c
150

This is a good change in principle but should be a separate commit since it's not related to collation.

Given the common use of UTF-16, it is unlikely that the 5- and 6-byte sequences will ever be used. Therefore, support for them can be removed entirely instead of adding #if 0.

Additionally, sequences with value greater than 0x10ffff should be rejected (these are some 4-byte sequences).

208

This code should not be removed. UTF-16 surrogates are invalid in UTF-8.

There are some encodings very similar to but different from UTF-8 that do permit UTF-16 surrogates (CESU-8 and Java Modified UTF-8) but people should not use this code to decode those.

334

As above, the conditions for 5- and 6-byte sequences can be removed and the condition for 4-byte sequences should be adjusted (wc >= 0 && wc <= 0x10ffff).

lib/libc/string/strcoll.c
45

easiest

48

Hmm, this means that strcoll_l() is not transitive, and therefore not suitable for use as a comparison function for qsort(), for example.

One fix is to make strings that cannot be converted to wide characters compare greater than strings that cannot be converted, ordering two strings that cannot be converted by strcmp(). A corresponding change (prepending a byte indicating whether the string could be converted) is needed in strxfrm_l().

113

free(NULL) does nothing and I think it is OK to rely on that. Unless Illumos code style is different, I think the ifs should be removed.

The ifs in the normal path are always true, even.

bapt marked 7 inline comments as done.Aug 9 2015, 10:16 AM

Mark already addressed comments as done (update will come with when all comments will be addressed)

bapt marked an inline comment as done.Aug 9 2015, 10:24 AM
bapt marked an inline comment as done.Aug 9 2015, 10:36 AM
bapt added inline comments.
lib/libc/locale/collate.c
333

Reading where it is called, I can't find how it would be possible to have *t < 0

lib/libc/locale/collate.c
104

Why does ret exist as a variable now that we don't have anything in between its definition and use?

125

Given the cost of the printf and the file I/O to load the table relative to the cost of a heap allocation, asprintf() / free() would probably be cleaner here.

lib/libc/locale/euc.c
24

We have permission from UCB to remove this clause, do we have permission from the other copyright holders (or has all of their code been removed?)

lib/libc/locale/setrunelocale.c
113

Same comment here re asprintf() / free().

bapt marked 3 inline comments as done.Aug 9 2015, 12:16 PM
bapt added inline comments.
lib/libc/locale/euc.c
24
lib/libc/locale/setrunelocale.c
113

I have addressed both with your suggestion even if I prefer the snprintf version :)

bapt marked an inline comment as done.Aug 9 2015, 1:54 PM
bapt added inline comments.
lib/libc/string/strcoll.c
48

Sorry but I do not really understand what you mean here? Would it be possible for you to show me an example code that would trigger the issue or even better to directly make that change in the collation branch?

lib/libc/locale/euc.c
24

We don't have permission from Tim J Robbins and email to him bounces back.

lib/libc/locale/euc.c
24

Garrett believes he got the permssion, but he does not have access to the mail anymore
What should I do readd the close?

bapt marked an inline comment as done.

committed as rS290494