Page MenuHomeFreeBSD

Initial collation support.
AbandonedPublic

Authored by pfg on Jun 4 2015, 9:28 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Jan 1, 3:20 PM
Unknown Object (File)
Sun, Dec 29, 9:51 PM
Unknown Object (File)
Sat, Dec 28, 11:30 PM
Unknown Object (File)
Sat, Dec 28, 11:30 PM
Unknown Object (File)
Sat, Dec 28, 2:14 PM
Unknown Object (File)
Sat, Dec 28, 1:04 PM
Unknown Object (File)
Sat, Dec 28, 10:31 AM
Unknown Object (File)
Fri, Dec 27, 12:12 PM

Details

Reviewers
theraven
Group Reviewers
manpages
Summary

Developed for GSoC 2014 by ghostmansd@
Mentored by pfg@

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

pfg retitled this revision from to Initial collation support..
pfg updated this object.
pfg edited the test plan for this revision. (Show Details)

There are still build issues, but I am uploading this in the hope that it will be easier to fix/review.

pfg edited edge metadata.

Update build: with help from bapt@

I had to change the third parameter in __collate_lookup to
avoid a dumb cast in strxfrm_l(). This also triggered the
corresponding type update in strcoll_l().

I think with these changes the code builds on all platforms
(tinderbox still running).

Also fixed the Python lint issues while here.

wblock added inline comments.
lib/libcolldb/colldb.3
108

Needs a serial comma after "key". Also, start new sentences on new lines.

109

New sentence on new line. Usually we would say "already be" rather than the other way around.

110

Needs an article: "a value"? "the value"? Also, s/database/the database/. Finally, start a new sentence on a new line.

115

Needs a comma after "number".

117

Start a new sentence on a new line. "In order to" can usually just be "To"

118

This sentence is kind of confusing. "may" implies permission, but "might" in that same place is still confusing. It's not clear how the second half of the sentence relates to the first. Not clear what "exotic databases" means, either. Does this have the right meaning:

For some exotic databases, the weights buffer might be too small.
In this case,
.Fn colldb_get
will return -1 and set errno to
.Er ERANGE
to indicate that struct colldb_value must allocate a larger weights buffer.

123

Serial comma needed after the first "key". Also, start a new sentence on a new line.

129

s/handle/a handle/
s/collation/the collation/
Also, start a new sentence on a new line.

130

Typo: s/Berkley/Berkeley/.
Also, start a new sentence on a new line.
Also, s/caller/the caller/

usr.sbin/colldb/colldb.1
40

Needs articles: s/Unicode/a Unicode/
s/to/to a/

41

s/database/databases/

42

s/Berkley/Berkeley/
probably should have a "the" before that, too.

43

Not sure what this is saying. It's missing some articles, but seems to be saying both that root collation is the fallback method, and also here is an example. "is expected" implies that it might not work that way. Maybe better as two sentences, one explaining that root collation is the fallback (collation method?), and a second that introduces the example.

This is still not handling endianness portably.

lib/libc/string/strcoll.c
82

This is ugly. Do we have to initialize all this?

148

Gratuitous space.

lib/libc/string/strxfrm.c
136–137

Failure on MIPS:

/scratch/tmp/pfg/head/lib/libc/string/strxfrm.c: In function 'strxfrm_l':
/scratch/tmp/pfg/head/lib/libc/string/strxfrm.c:141: warning: passing argument 3 of '__collate_lookup' from incompatible pointer type

lib/libc/string/wcsxfrm.c
88

Breaks SPARC64 after changing 3rd parameter to size_t

cc1: warnings being treated as errors
/scratch/tmp/pfg/head/lib/libc/string/strcoll.c: In function 'strcoll_l':
/scratch/tmp/pfg/head/lib/libc/string/strcoll.c:127: warning: passing argument 3 of 'collate_lookup' from incompatible pointer type
/scratch/tmp/pfg/head/lib/libc/string/strcoll.c:131: warning: passing argument 3 of '
collate_lookup' from incompatible pointer type

lib/libcolldb/colldb.c
243

Build broken in ARM:

/scratch/tmp/pfg/head/lib/libcolldb/colldb.c: In function 'colldb_get':
/scratch/tmp/pfg/head/lib/libcolldb/colldb.c:242: error: 'BYTE_ORDER' undeclared (first use in this function)
/scratch/tmp/pfg/head/lib/libcolldb/colldb.c:242: error: (Each undeclared identifier is reported only once
/scratch/tmp/pfg/head/lib/libcolldb/colldb.c:242: error: for each function it appears in.)
/scratch/tmp/pfg/head/lib/libcolldb/colldb.c:242: error: 'ORDER_BIG_ENDIAN' undeclared (first use in this function)
/scratch/tmp/pfg/head/lib/libcolldb/colldb.c: In function 'colldb_put':
/scratch/tmp/pfg/head/lib/libcolldb/colldb.c:311: error: 'BYTE_ORDER' undeclared (first use in this function)
/scratch/tmp/pfg/head/lib/libcolldb/colldb.c:311: error: 'ORDER_BIG_ENDIAN' undeclared (first use in this function)

pfg marked an inline comment as done.

Update manpages. Thanks Warren for the review!
Add some $FreeBSD$ tags.
This is all still pretty much WIP.

pfg marked 7 inline comments as done.

Fix endinanness issue using <sys/endian.h>.
It's not clear to me if we are DTRT for the big endian case.
Clean some initializations in the new code.

pfg marked an inline comment as done.

Fix remaining build issues.

This passes the tinderbox build so it's basically ready for review/testing.

Phabricator is "helping" by putting previous comments on the wrong lines. Please ignore that.

I still think the last sentence of colldb.1 is confusing, but maybe it makes sense in context.

lib/libcolldb/colldb.3
107

Remove semicolon.

124

No semicolon, new sentence goes on new line. Originally, I read these as accepting three parameters, but it appears they really only accept two, key and value. It might be clearer to leave out the word "database". So:

accepts a key and value.
If necessary, it converts the key and value

125

Not really "into", just "to". And "the" is not needed here, so:

to network byte order before writing.

131

Three "e"s in "Berkeley". Database should probably not be capitalized:

All other functions behave exactly like their Berkeley database ancestors.

usr.sbin/colldb/colldb.1
39

utility converts Unicode collation files to *BSD collation databases.

40

s/database/databases/

Update the man pages according to Warren's feedback.
I think we still need some technical review on the manpages
(along with the code). It may be convenient to add a reference to
the Berkeley database documentation dbopen(3).

shouldn't there be modifications to share/colldef/Makefile and maybe an addition to share/colldef ? It seems to be after this commit is entered, the collate locate for UTF-8 codesets are still going to be pointing at la_LN.US-ASCII.

or are the UTF-8 codeset collations going to be fixed in a way I don't see?

In D2736#62123, @marino wrote:

shouldn't there be modifications to share/colldef/Makefile and maybe an addition to share/colldef ? It seems to be after this commit is entered, the collate locate for UTF-8 codesets are still going to be pointing at la_LN.US-ASCII.

or are the UTF-8 codeset collations going to be fixed in a way I don't see?

I asked Dmitry to document this but he hasn't done it yet.

I recall a new set of collation databases has to be downloaded from the unicode site, generated with colldef(1), and loaded in /share/colldb/. The pre-existing code is only kept as a fallback method in case the collation databases are not available.

Baptiste is working on the Illumos/Dragonfly collation so leave this aside, at least for now, and hope the best for that effort.