Page MenuHomeFreeBSD

Add a way to get version information from querylocale(3).
ClosedPublic

Authored by munro_ip9.org on Sep 13 2018, 11:16 PM.

Details

Summary

Allow user programs to ask for a version string for the components of a locale_t. The structure of the version string is undefined, but can be used to detect changes in the locale's definition.

The intended use-case is databases that use libc collations to implement indexes. When collation definitions change, database indexes are frequently corrupted. By exposing the CLDR version string, databases can detect the change and issue an error or force an index rebuild. Without such versioning, database projects (such as PostgreSQL) are forced to move to other collation implementations (eg libicu) just to protect themselves from silent collation definition changes. Motivation discussed in slightly more detail here: https://lists.freebsd.org/pipermail/freebsd-hackers/2018-September/053289.html

For now only LC_COLLATE components have a way to report their version. The LC_COLLATE file format is backwards and forwards compatible. Where previously 24 bytes held "BSD 1.0\n" followed by NUL characters, there are now 12 bytes for the file format identifier and then 12 bytes for the data version string.

I plan to file a POSIX defect report that proposes this new interface, but that process requires an existing implementation in a real operating system.

Test Plan

Run though the complete locale import procedure from tools/tools/locale/README, and then you will finish up with collation definitions that have the CLDR version extracted from the files you pulled down from unicode.org. Or, a quick and dirty way to get a single versioned collation without that whole procedure is to make a fake locale xx_XX.UTF-8:

# cp -r /usr/share/locale/en_US.UTF-8 /usr/share/locale/xx_XX.UTF-8

Now create a new LC_COLLATE file that has a version string stamped on it, and install it:

# localedef \
  -i /usr/src/share/colldef/en_US.UTF-8.src \
  -f /usr/src/tools/tools/locale/etc/final-maps/map.UTF-8 \
  -V '30.0.2' \
  xx_XX
# cp xx_XX/LC_COLLATE /usr/share/locale/xx_XX.UTF-8/

You can then see the version string with querylocaleversion:

# cat test.c 
#include <locale.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
        const char *name;
        locale_t loc;

        if (argc < 2)
                return (1);
        name = argv[1];

        loc = newlocale(LC_ALL_MASK, name, NULL);
        if (!loc) {
                printf("failed to open locale\n");
                return (1);
        }
        printf("name = [%s], version = [%s]\n",
                querylocale(LC_COLLATE_MASK, loc),
                querylocale(LC_COLLATE_MASK | LC_VERSION_MASK, loc));
        freelocale(loc);

        return (0);
}
# cc -Wall test.c
# ./a.out xx_XX.UTF-8
name = [xx_XX.UTF-8], version = [0:30.0.2]

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

munro_ip9.org retitled this revision from Add querylocaleversion(). to Add querylocaleversion(3) to libc..Sep 14 2018, 1:06 AM

Please provide the diff with full context, e.g. by using -U with number of lines big enough to include full file contents if you are using git diff or git log -p.

lib/libc/locale/querylocaleversion.3
45

Please run mandoc -Tlint on the man pages you are changing, e.g. it would tell you to start new sentences from new lines for this one :-) Also, it's likely should be .Dv LC_COLLATE_MASK.

49

Again, mandoc -Tlint would tell you that these needs to be sorted correctly, first by section, then alphabetically.

usr.bin/localedef/localedef.1
135

New sentence, new line; use .Xr querylocaleversion 3.

203

Sort.

Man page tidying + more context, based on feedback from yuripv.

munro_ip9.org marked an inline comment as not done.

New version because I missed one detail from yuripv's feedback (.Xr querylocaleversion 3).

0mp added a subscriber: 0mp.
0mp added inline comments.
lib/libc/locale/querylocaleversion.3
2

Historically, the phrase 'All Rights Reserved.' was included in all copyright notices. The BSD releases had it to comply with the Buenos Aires Convention of 1910 in the Americas. With the ratification of the Berne Convention in 2000, it became obsolete. As such, the FreeBSD project recommends that new code omit the phrase and encourages existing copyright holders to remove it. In 2018, the project updated its templates to remove it.

https://www.freebsd.org/internal/software-license.html

51

Could you cross reference this new manual page from querylocale(3) as well?

usr.bin/localedef/localedef.1
36

Remember to dump Dd.

This revision is now accepted and ready to land.Sep 14 2018, 8:12 AM

Adjust man pages based on feedback from 0mp. While removing "All rights reserved", and after checking some nearby files and the documentation, it seemed that the thing to do with new files was to use my own name on the copyright message. I also added a new line to the querylocaleversion.3 page to mention the purpose of the version string.

This revision now requires review to proceed.Sep 14 2018, 9:53 AM
lib/libc/locale/collate.c
158

You likely want strlcpy() here to make sure it's NUL-terminated.

usr.bin/localedef/collate.c
1161

Just a nit: I know that current localedef code does that a lot, but still version != NULL would be a proper comparison here.

strncpy->strlcpy, as suggested by yuripv.

munro_ip9.org added inline comments.
usr.bin/localedef/collate.c
1161

Hmm well it'd look odd next to all the nearby uses of implicit NULL comparison in this file, so I think it's probably better to keep it that way.

I am not sure is it a good idea at all, but still cannot resist explaining it.

You could add a bit to the LC_XXX_MASK and make querylocale() returning either string or numeric representation of the version when the bit is set together with the real locale category bit.
BTW, how does querylocaleversion() behaves WRT locales created dynamically by newlocale(3) ?

In D17166#366077, @kib wrote:

You could add a bit to the LC_XXX_MASK and make querylocale() returning either string or numeric representation of the version when the bit is set together with the real locale category bit.

That is a nice idea. querylocale(LC_COLLATE_MASK | LC_VERSION_MASK, my_locale). It's attractive because it avoids introducing a new libc function, and yet it still makes a lot of sense as an interface. Note that unlike uselocale(3) etc, querylocale(3) is not part of POSIX 1003.1-2018 (we apparently borrowed it from macOS because libc++ needed it, see r227753). If we go this way my proposal to POSIX will be slightly wider in scope. Maybe that's a good thing. I will try this idea out tomorrow and report back.

BTW, how does querylocaleversion() behaves WRT locales created dynamically by newlocale(3) ?

It works. You are allowed to mix and match components from different locales, which is why my proposal requires you to ask for the version for one specific category. So if you start with en_US and replace its LC_COLLATE with the one from fr_FR, you'll have a locale that does American style money formatting but French style sorting. That is: querylocale(LC_MONETARY_MASK, loc) -> "en_US", querylocale(LC_COLLATE_MASK, loc) -> "fr_FR", querylocaleversion(LC_COLLATE_MASK, loc) -> [the version from fr_FR's LC_COLLATE file].

munro_ip9.org retitled this revision from Add querylocaleversion(3) to libc. to Add a way to get version information from querylocale(3)..
munro_ip9.org edited the test plan for this revision. (Show Details)

Instead of introducing a new function querylocaleversion(3), teach the existing querylocale(3) to return versions, as suggested by kib. I think I prefer it this way, but I'm not sure if it's more or less likely to survive contact with the standards process.

lib/libc/locale/querylocale.3
52 ↗(On Diff #48068)

Do you intent to only allow to detect change, or also to define a 'newer than' relation ?

lib/libc/locale/querylocale.3
52 ↗(On Diff #48068)

I don't think 'newer than' would be very useful on its own, and it would require a complicated specification of the version string which wouldn't work well with my goal of convincing other operating systems to adopt this via a the standards process. Consider that not everyone uses CLDR, and sometimes OSes change to a different source of collation data (as we did). (A much more ambitious project would be to support multiple versions installed at the same time, and then you might want to be able to order them, but I don't have the resources to propose that; I merely want to fix the problem of silent changes that result in corrupted persistent data, because that is very arguably a "defect" in the standard).

This revision is now accepted and ready to land.Sep 18 2018, 9:02 AM
munro_ip9.org edited the test plan for this revision. (Show Details)

Sorry for the long delay. Here is a new version that addresses a theoretical problem not handled before: what if our code changes for some reason, but the CLDR version doesn't? So I added an extra part to the version string. Changes:

  • If the collation algorithm changes (because we fix a major bug in libc that affects collating in a user-visible way), the value of the macro COLLATE_ALG_VERSION in collate.c should be incremented, though I expect that would be rare. The value returned by querylocale() is <algorithm version> + ":" + <definition version>, where that last thing is taken directly from CLDR. The initial algorithm version is "000". So the complete version string is "000:34.0" currently.
  • I updated the locale generation scripts to extract the CLDR version and put it into all generated collation files.
  • The CLDR version finishes up in a variable in the generated-but-present-in-the-tree file share/colldef/Makefile alongside the generated .src files, so you can see that in the patch. I thought about putting the version into the .src files instead, but I wasn't sure that would be OK, since localedef's input file syntax is controlled by the standard.
This revision now requires review to proceed.Oct 3 2019, 10:45 AM
munro_ip9.org edited the test plan for this revision. (Show Details)

Instead of "000", just use "0" as the algorithm version. So the value you get is "0:34.0" currently. No other changes. I'd like to commit this soon so that we can use it in PostgreSQL 13.

Rebased on top of the recent commit "tools/tools/locale: allow POSIX target to be built in parallel".

I decided the "algorithm" part was unmaintainable overkill, and I went back to the earlier version that was previously accepted by reviewers, that just reports the plain CLDR version, eg, "34.0" initially. Aside from a couple of cosmetic changes, the only change to the accepted version was that I now need a couple of lines in bootstrap_xlocale_private.h, since the commit "Allow bootstrapping localdef on non-FreeBSD systems". If there are no objections, I plan to commit this tomorrow.

This revision was not accepted when it landed; it landed in state Needs Review.Nov 8 2020, 2:50 AM
This revision was automatically updated to reflect the committed changes.