printf(1): add \uHHHH and \UHHHHHHHH universal character name escapes
Needs Review · Public
Authored by olivier on Sun, Jun 14, 7:09 PM.

Details

Reviewers

emaste
imp
jhb
jilles

Summary

Emit the named Unicode code point as a multibyte sequence in the current locale.
Matches the extension in bash(1), ksh93(1), and GNU coreutils printf(1); not specified by POSIX.

Reject (as '?') surrogates, values above 0x10FFFF, characters not representable in the current locale, and encodings that would exceed the input length consumed (escape() rewrites in place).
Double '%' bytes when processing the format string so \u0025 et al. behave like \045.
Use uint32_t for hex parsing to avoid signed-overflow UB on \UFFFFFFFF.
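
The range checks and the unsigned accumulation described above could look roughly like the following. This is a minimal sketch, not the code from the diff: the helper names `hexval`, `parse_ucn`, and `ucn_is_valid` are hypothetical, and accepting fewer than the maximum number of digits is an assumption about the intended behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: value of one hex digit, or -1 if c is not one. */
static int
hexval(char c)
{
	if (c >= '0' && c <= '9')
		return (c - '0');
	if (c >= 'a' && c <= 'f')
		return (c - 'a' + 10);
	if (c >= 'A' && c <= 'F')
		return (c - 'A' + 10);
	return (-1);
}

/*
 * Hypothetical helper: parse up to maxdigits hex digits (4 for \u,
 * 8 for \U) from s into *cp.  Returns the number of digits consumed.
 * The accumulator is unsigned, so eight 'F' digits wrap-free fill all
 * 32 bits without any signed-overflow undefined behavior.
 */
static int
parse_ucn(const char *s, int maxdigits, uint32_t *cp)
{
	uint32_t v = 0;
	int d, n = 0;

	while (n < maxdigits && (d = hexval(s[n])) >= 0) {
		v = (v << 4) | (uint32_t)d;
		n++;
	}
	*cp = v;
	return (n);
}

/* Reject surrogates and values beyond the Unicode range. */
static bool
ucn_is_valid(uint32_t cp)
{
	if (cp > 0x10FFFF)
		return (false);
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return (false);
	return (true);
}
```

With these helpers, `\UFFFFFFFF` parses to 0xFFFFFFFF and is then rejected by `ucn_is_valid()`, so the caller can emit '?' instead.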

Test Plan

Regression test included (cf. https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263210).

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

olivier created this revision. Sun, Jun 14, 7:09 PM

Herald added a subscriber: ziaee. Sun, Jun 14, 7:09 PM

olivier requested review of this revision. Sun, Jun 14, 7:09 PM

Am I missing something or does POSIX at https://pubs.opengroup.org/onlinepubs/9799919799/utilities/printf.html not list this feature? This version of POSIX does list dollar-single quotes at https://pubs.opengroup.org/onlinepubs/9799919799/utilities/V3_chap02.html#tag_19_02_04 but even there without \u and \U. I'm not really opposed to this feature but we shouldn't say it's specified by POSIX when it isn't.

printf/printf.c
551–563	In most cases, POSIX allows messages to standard error ("diagnostic messages") only if the utility's exit status is also non-zero. For example, `printf %@` writes a message to standard error and returns status 1, whereas all existing unrecognized or invalid backslash sequences are handled without a message.
557–569	Octal sequences above make sure `printf "\045"` writes a percent sign without errors. I would say the same should apply to `\u0025` and `\U00000025` and that should also be tested.
558–559	I'm curious what this does for historical locales like `en_US.ISO8859-1` and `en_US.ISO8859-15`. For example, ideally, `\u20AC` should expand to a question mark in `en_US.ISO8859-1` and to a byte A4 in `en_US.ISO8859-15`.
567–569	Perhaps use `memcpy(store, mb, n);` and `store += n - 1;` instead of a loop.
567–569	If you have exotic character encodings, could it be possible that something like `\uA1` expands to more than four bytes? This would cause data corruption and/or buffer overflow.
printf/tests/regress.sh
30	Apart from `\u0025` and `\U00000025` mentioned above, we should also test `\U00002A7D`, `\U2A7D`, `\u` alone, `\U` alone, `\u25` and `\U25`.
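
The `memcpy` suggestion and the overflow concern raised on lines 567–569 can be combined in one place. The following is only a sketch under stated assumptions: `emit_codepoint` and its `avail` parameter are hypothetical names, `avail` is assumed to be at least 1 (the length of the escape being rewritten in place), and the code point is assumed to be already validated and directly usable as a `wchar_t` in the current locale.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: write the multibyte form of cp at store, using
 * at most avail bytes (avail >= 1).  Returns the number of bytes
 * written.  Falls back to a single '?' when the character cannot be
 * represented in the current locale or would not fit in avail bytes.
 */
static size_t
emit_codepoint(char *store, size_t avail, uint32_t cp)
{
	char mb[MB_LEN_MAX];
	mbstate_t ps;
	size_t n;

	memset(&ps, 0, sizeof(ps));
	n = wcrtomb(mb, (wchar_t)cp, &ps);
	if (n == (size_t)-1 || n == 0 || n > avail) {
		store[0] = '?';
		return (1);
	}
	/* Single copy instead of a byte-by-byte loop. */
	memcpy(store, mb, n);
	return (n);
}
```

Because the result is staged in a `MB_LEN_MAX`-sized buffer and compared against `avail` before copying, an encoding longer than the consumed escape degrades to '?' rather than writing past the in-place buffer.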

olivier updated this revision to Diff 180178. Sat, Jun 20, 6:03 PM

olivier retitled this revision from printf(1): implement POSIX 2024 \u and \U escape sequences to printf(1): add \uHHHH and \UHHHHHHHH universal character name escapes.

olivier edited the summary of this revision.

I think the major comments have been addressed.
The open problem is the comment about "historical locales like `en_US.ISO8859-1` and `en_US.ISO8859-15`. For example, ideally, `\u20AC` should expand to a question mark in `en_US.ISO8859-1` and to a byte A4 in `en_US.ISO8859-15`."
Indeed, a warning will be generated with en_US.ISO8859-1 and en_US.ISO8859-15. A proper fix would route Unicode -> UTF-8 -> mbrtowc() -> wctomb(), but that is non-trivial and seems heavy.
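
For reference, only the first hop of that route (Unicode code point to UTF-8 bytes) is locale-independent and small. A sketch of it follows; `utf8_encode` is a hypothetical helper that is not part of the diff, and the locale-dependent steps after it are deliberately not shown.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode cp as UTF-8 into out.  Returns the
 * number of bytes written (1 to 4), or 0 for surrogates and values
 * above 0x10FFFF.
 */
static size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return (0);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return (1);
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return (2);
	}
	if (cp < 0x10000) {
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return (3);
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return (4);
	}
	return (0);
}
```

For example, `\u20AC` encodes to the three bytes E2 82 AC. Converting those bytes into the target locale's encoding is the part that carries the cost mentioned above.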