
printf(1): add \uHHHH and \UHHHHHHHH universal character name escapes
Needs ReviewPublic

Authored by olivier on Sun, Jun 14, 7:09 PM.
Tags
None
Referenced Files
F161164612: D57591.id179749.diff
Wed, Jul 1, 4:05 AM
Subscribers

Details

Summary

Emit the named Unicode code point as a multibyte sequence in the current locale.
Matches the extension in bash(1), ksh93(1), and GNU coreutils printf(1); not specified by POSIX.

Emit '?' instead for surrogates, values above 0x10FFFF, characters not representable in the current locale, and encodings longer than the escape sequence they replace (escape() rewrites the string in place, so the output must not outgrow the input consumed).
Double '%' bytes when processing the format string so \u0025 et al. behave like \045.
Use uint32_t for hex parsing to avoid signed-overflow UB on \UFFFFFFFF.
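
As a rough illustration of the parsing and rejection rules listed above (not the code in the diff), a minimal sketch might look like the following. The helper names parse_ucn_hex() and ucn_is_valid() are hypothetical, and the caller is assumed to pass at most 8 digits.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: parse up to maxdigits (at most 8) hex digits
 * from s into *cp and return the number of digits consumed (0 if none).
 * Accumulating in uint32_t means "FFFFFFFF" fits exactly; with a signed
 * int the final shift would overflow, which is undefined behaviour.
 */
static size_t
parse_ucn_hex(const char *s, size_t maxdigits, uint32_t *cp)
{
	uint32_t v = 0;
	size_t n = 0;

	while (n < maxdigits) {
		char c = s[n];
		uint32_t d;

		if (c >= '0' && c <= '9')
			d = (uint32_t)(c - '0');
		else if (c >= 'a' && c <= 'f')
			d = (uint32_t)(c - 'a') + 10;
		else if (c >= 'A' && c <= 'F')
			d = (uint32_t)(c - 'A') + 10;
		else
			break;
		v = (v << 4) | d;
		n++;
	}
	*cp = v;
	return (n);
}

/*
 * Hypothetical helper: return 1 if cp is a Unicode scalar value,
 * 0 for surrogates and for values above 0x10FFFF.
 */
static int
ucn_is_valid(uint32_t cp)
{
	if (cp > 0x10FFFF)
		return (0);
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return (0);
	return (1);
}
```

Whether a valid code point is representable in the current locale is a separate check that this sketch does not attempt.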

Test Plan

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

Am I missing something or does POSIX at https://pubs.opengroup.org/onlinepubs/9799919799/utilities/printf.html not list this feature? This version of POSIX does list dollar-single quotes at https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V3_chap02.html#tag_19_02_04 but even there without \u and \U. I'm not really opposed to this feature but we shouldn't say it's specified by POSIX when it isn't.

printf/printf.c
551–563

In most cases, POSIX allows messages to standard error ("diagnostic messages") only if the utility's exit status is also non-zero. For example, printf %@ writes a message to standard error and returns status 1, whereas all existing unrecognized or invalid backslash sequences are handled without a message.

557–569

Octal sequences above make sure printf "\045" writes a percent sign without errors. I would say the same should apply to \u0025 and \U00000025 and that should also be tested.
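
To make the '%' point concrete, here is a minimal sketch of the doubling rule, assuming a hypothetical helper store_byte() and a flag in_format that tells it whether the string being rewritten is the format string. It is an illustration of the idea, not the logic in printf.c.

```c
#include <stddef.h>

/*
 * Hypothetical helper: store one decoded byte at dst, which has room
 * bytes available. When the byte is '%' and the string is the format
 * string, store "%%" so the later format pass prints a literal percent
 * sign instead of treating it as a conversion.
 * Returns the number of bytes written, or 0 if there is not enough room.
 */
static size_t
store_byte(char *dst, size_t room, char c, int in_format)
{
	if (c == '%' && in_format) {
		if (room < 2)
			return (0);
		dst[0] = '%';
		dst[1] = '%';
		return (2);
	}
	if (room < 1)
		return (0);
	dst[0] = c;
	return (1);
}
```

Under this scheme \045, \u0025 and \U00000025 all reach the same helper with c == '%', so they would behave identically.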

558–559

I'm curious what this does for historical locales like en_US.ISO8859-1 and en_US.ISO8859-15. For example, ideally, \u20AC should expand to a question mark in en_US.ISO8859-1 and to a byte A4 in en_US.ISO8859-15.

567–569

Perhaps use memcpy(store, mb, n); and store += n - 1; instead of a loop.

567–569

If you have exotic character encodings, could it be possible that something like \uA1 expands to more than four bytes? This would cause data corruption and/or buffer overflow.
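
One way to guard against that case is to compare the encoded length with the number of input bytes the escape consumed before copying. The sketch below assumes a hypothetical helper store_mb(); the names store, mb and n follow the comments above, while consumed is an assumed parameter.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: overwrite an escape sequence that occupied
 * consumed bytes starting at store with its n-byte multibyte
 * encoding mb. If the encoding is empty or longer than the escape it
 * replaces, store a single '?' instead, so the in-place rewrite never
 * runs past the input already consumed.
 * Returns the number of bytes written.
 */
static size_t
store_mb(char *store, size_t consumed, const char *mb, size_t n)
{
	if (n == 0 || n > consumed) {
		if (consumed == 0)
			return (0);
		store[0] = '?';
		return (1);
	}
	memcpy(store, mb, n);
	return (n);
}
```

The caller would then advance its write pointer by the returned count rather than by a fixed amount.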

printf/tests/regress.sh
30

Apart from \u0025 and \U00000025 mentioned above, we should also test \U00002A7D, \U2A7D, \u alone, \U alone, \u25 and \U25.

olivier retitled this revision from printf(1): implement POSIX 2024 \u and \U escape sequences to printf(1): add \uHHHH and \UHHHHHHHH universal character name escapes.
olivier edited the summary of this revision.

I think the major comments have been addressed.
The remaining problem is the comment about "historical locales like en_US.ISO8859-1 and en_US.ISO8859-15. For example, ideally, \u20AC should expand to a question mark in en_US.ISO8859-1 and to a byte A4 in en_US.ISO8859-15."
Indeed, a warning will be generated with both en_US.ISO8859-1 and en_US.ISO8859-15. A proper fix would route Unicode -> UTF-8 -> mbrtowc() -> wctomb(), but that is non-trivial and seems heavy.
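
For reference, the first hop of that route (code point to UTF-8) is small on its own. The sketch below uses a hypothetical utf8_encode() helper and covers only that step; the locale-dependent conversion that follows is the non-trivial part and is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode the Unicode scalar value cp as UTF-8
 * into out. Returns the number of bytes written (1 to 4), or 0 for
 * surrogates and values above 0x10FFFF.
 */
static size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return (0);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return (1);
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return (2);
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return (3);
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return (4);
	}
	return (0);
}
```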