Page MenuHomeFreeBSD

sort: test against all month formats in month-sort
ClosedPublic

Authored by christos on Nov 30 2023, 9:06 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Apr 28, 3:02 PM
Unknown Object (File)
Fri, Apr 26, 9:46 PM
Unknown Object (File)
Fri, Apr 26, 2:29 AM
Unknown Object (File)
Thu, Apr 25, 4:43 PM
Unknown Object (File)
Thu, Apr 25, 4:42 PM
Unknown Object (File)
Feb 11 2024, 1:26 PM
Unknown Object (File)
Feb 11 2024, 9:50 AM
Unknown Object (File)
Feb 11 2024, 9:49 AM

Details

Summary

The CLDR specification [1] defines three possible month formats:

  • Abbreviation (e.g Jan, Ιαν)
  • Full (e.g January, Ιανουαρίου)
  • Standalone (e.g January, Ιανουάριος)

Many languages use different case endings depending on whether the month
is referenced as a standalone word (nominative case), or in date context
(genitive, partitive, etc.). sort(1)'s -M option currently sorts months
by testing input against only the abbrevation format, which is
essentially a substring of the full format. While this works fine for
languages like English, where there are no cases, for languages where
there is a different case ending between the abbreviation/full and
standalone formats, this is not sufficient.

For example, in Greek, "May" can take the following forms:

Abbreviation: Μαΐ (genitive case)
Full: Μαΐου (genitive case)
Standalone: Μάιος (nominative case)

If we use the standalone format in Greek, sort(1) will not able to match
"Μαΐ" to "Μάιος" and the sort will fail.

This change makes sort(1) test against all three formats. It also works
when the input contains mixed formats.

[1] https://cldr.unicode.org/translation/date-time/date-time-patterns

Test Plan
$ kyua test
sort_monthsort_test:monthsort_all_formats  ->  passed  [0.068s]
sort_monthsort_test:monthsort_mixed_formats  ->  passed  [0.010s]
...

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

christos created this revision.
christos edited the test plan for this revision. (Show Details)

Neat. :)

Could you please add a regression test based on your manual tests? Currently we use the tests from upstream, but you can add new ones to usr.bin/sort/tests.

Is it worth clarifying the description of -M in the sort man page?

usr.bin/sort/bwstring.c
52–105

I'd delete this comment. :)

126–127

I think m can just be a char *?

I'd suggest declaring local variables inside the macro body. Historically we prefer to keep all variable declarations at the beginning of the function, but that's no longer required by style(9). Moving local variables into the macro would make it easier to see what it's doing:

#define POPULATE(field) do {
    char *m, *tmp;
    size_t len;
    ...
}

I think it could be written as a C function instead, but I don't have any particular preference.

Could you please add a regression test based on your manual tests? Currently we use the tests from upstream, but you can add new ones to usr.bin/sort/tests.

Sure. Will update the diff later today with the other comments addressed.

Is it worth clarifying the description of -M in the sort man page?

I don't really have a serious objection to this, other than that I think it is not really worth mentioning, since this is what I'd expect the -M option to do in the first place. The only change I would make to the man page description is simply replace "Sort by month abbreviations" to "Sort by month".

christos marked 2 inline comments as done.

Address Mark's comments, add regression tests.

This looks good to me. Thanks for adding tests.

Is it worth clarifying the description of -M in the sort man page?

I don't really have a serious objection to this, other than that I think it is not really worth mentioning, since this is what I'd expect the -M option to do in the first place. The only change I would make to the man page description is simply replace "Sort by month abbreviations" to "Sort by month".

Fair enough!

usr.bin/sort/tests/sort_monthsort_test.sh
2

Typically you'd add a copyright line here - did you omit it deliberately?

5

This is best placed inside the individual tests which assume a particular locale. This script may also be invoked simply to list the tests contained within (e.g. with ./sort_monthsort_test -l), and it's weird to set LC_TIME in that case.

You might also note in the test names that they're testing the Greek locale; one might conceivably add additional tests for

124

While here I'd suggest adding a test with English month names, just for good measure.

This revision is now accepted and ready to land.Nov 30 2023, 9:04 PM
christos marked 3 inline comments as done.

Address Mark's comments.

This revision now requires review to proceed.Dec 1 2023, 12:14 AM
This revision is now accepted and ready to land.Dec 1 2023, 12:23 AM

Thank you for this nice addition! It does, however, fail when cross-building on macOS:
https://pipelinesghubeus2.actions.githubusercontent.com/8Zobszzx1fzKxF8BHSJLzmYvAIfV7kuYLyhz5o35SqgHcKvvbE/_apis/pipelines/1/runs/19723/signedlogcontent/7?urlExpires=2023-12-01T17%3A06%3A05.6816198Z&urlSigningMethod=HMACV1&urlSignature=SNFGWVRwQUneUXeKKdbEWuKQ64ngWnBLEXpcY8%2FCEI0%3D
The reason being, Darwin does not have those constants (defined in a250649bd8d6999d8511cb826085deb34236f3f3 ):
https://github.com/apple-oss-distributions/Libc/blob/main/include/langinfo.h
On Linux, this is not a problem (https://www.gnu.org/software/libc/manual/html_node/The-Elegant-and-Fast-Way.html).
I trust you would know how best to proceed, whether manually adding the constants to the cross-build tools or #ifdef-ing APPLE or else.
Thank you again!

Thank you for this nice addition! It does, however, fail when cross-building on macOS:

Hum. Thanks for the report. So we're missing definitions of ALTMON_*. I don't see a preprocessor constant that we can use to gate that, and I think #ifdef APPLE is too specific, quite a few other platforms don't have them. So I'd just wrap the corresponding code with #ifdef ALTMON_1. @christos are you able to take a look at this?

Thank you for this nice addition! It does, however, fail when cross-building on macOS:

Hum. Thanks for the report. So we're missing definitions of ALTMON_*. I don't see a preprocessor constant that we can use to gate that

I was thinking about adding an extra header file to tools/build/cross-build/include/mac (with these definitions) and using it to cross-build... but yes, let's gather other opinions.

Thank you for this nice addition! It does, however, fail when cross-building on macOS:

Hum. Thanks for the report. So we're missing definitions of ALTMON_*. I don't see a preprocessor constant that we can use to gate that

I was thinking about adding an extra header file to tools/build/cross-build/include/mac (with these definitions) and using it to cross-build... but yes, let's gather other opinions.

I think that doesn't make sense unless the corresponding locale files (e.g., /usr/share/local/el_GR.UTF-8/LC_TIME) are also installed on the host?

I fixed it here with a #ifdef ATLMON_1 ... #endif around the relevant bits.
I suggest we do that. #ifdef APPLE is a last resort always, and we don't need it here: Linux would break too.

Thank you for this nice addition! It does, however, fail when cross-building on macOS:

Hum. Thanks for the report. So we're missing definitions of ALTMON_*. I don't see a preprocessor constant that we can use to gate that

I was thinking about adding an extra header file to tools/build/cross-build/include/mac (with these definitions) and using it to cross-build... but yes, let's gather other opinions.

I think that doesn't make sense unless the corresponding locale files (e.g., /usr/share/local/el_GR.UTF-8/LC_TIME) are also installed on the host?

We don't need this functionality for BOOTSTRAP, which is why sort is being built here. Let's keep things simple and omit it in the bootstrap version of sort. The build is almost certainly never going to need this.

Thank you for this nice addition! It does, however, fail when cross-building on macOS:

Hum. Thanks for the report. So we're missing definitions of ALTMON_*. I don't see a preprocessor constant that we can use to gate that

I was thinking about adding an extra header file to tools/build/cross-build/include/mac (with these definitions) and using it to cross-build... but yes, let's gather other opinions.

I think that doesn't make sense unless the corresponding locale files (e.g., /usr/share/local/el_GR.UTF-8/LC_TIME) are also installed on the host?

Hum... Maybe I'm missing something:

➜  uname -a                                      
Darwin macbook-pro.home.arpa 23.1.0 Darwin Kernel Version 23.1.0: Mon Oct  9 21:28:12 PDT 2023; root:xnu-10002.41.9~6/RELEASE_ARM64_T8103 arm64
➜  cat /usr/share/locale/el_GR.UTF-8/LC_TIME
Ιαν
Φεβ
Μαρ
Απρ
Μαϊ
Ιον
Ιολ
Αυγ
Σεπ
Οκτ
Νοε
Δεκ
Ιανουαρίου
Φεβρουαρίου
Μαρτίου
Απριλίου
Μαΐου
Ιουνίου
Ιουλίου
Αυγούστου
Σεπτεμβρίου
Οκτωβρίου
Νοεμβρίου
Δεκεμβρίου
Κυρ
Δευ
Τρι
Τετ
Πεμ
Παρ
Σαβ
Κυριακή
Δευτέρα
Τρίτη
Τετάρτη
Πέμπτη
Παρασκευή
Σάββατο
%H:%M:%S
%d/%m/%Y
%a %e %b %X %Y
πμ
μμ
%a %e %b %Y %X %Z
Ιανουάριος
Φεβρουάριος
Μάρτιος
Απρίλιος
Μάϊος
Ιούνιος
Ιούλιος
Αύγουστος
Σεπτέμβριος
Οκτώβριος
Νοέμβριος
Δεκέμβριος
dm
%I:%M:%S %p

quite a few other platforms don't have them.

I had missed this important bit. Thank you!