
tty/teken: fix UTF8 sequence validation logic

Authored by bnovkov on Oct 10 2023, 5:56 PM.



This patch fixes the UTF8 sequence validation logic in teken_utf8_bytes_to_codepoint and fixes the fallback behaviour in ttydisc_rubchar when an invalid UTF8 sequence is encountered. The code previously used bitcount to extract sequence length information from the leading byte. However, this approach breaks for certain code points that have additional bits set in the first half of the leading byte (e.g. Cyrillic characters). This led to incorrect behaviour when deleting those characters with backspace. The code now counts the consecutive set bits in the leading byte starting from the MSB, as per RFC 3629.
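The difference between the two approaches can be illustrated with a minimal sketch. It is not the committed code: `utf8_seq_len` is a hypothetical helper name, and the use of `__builtin_clz` on the inverted byte is an assumption based on the review discussion. For the leading byte 0xD0 (binary 11010000, the first byte of Cyrillic letters such as U+0411), a plain population count yields 3, while the number of consecutive set bits from the MSB is 2, which is the actual sequence length.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the committed code): return the sequence
 * length implied by a UTF-8 leading byte, or 0 if the byte cannot
 * start a sequence.
 */
static int
utf8_seq_len(uint8_t lead)
{
	/*
	 * Invert the byte and move it to the top of a 32-bit word, so
	 * that counting leading zeros of the result counts the leading
	 * ones of the original byte.
	 */
	uint32_t inv = (uint32_t)(uint8_t)~lead << 24;
	/* __builtin_clz(0) is undefined, so handle 0xFF separately. */
	int ones = (inv == 0) ? 8 : __builtin_clz(inv);

	switch (ones) {
	case 0:
		return (1);	/* 0xxxxxxx: single-byte (ASCII) */
	case 2:
		return (2);	/* 110xxxxx */
	case 3:
		return (3);	/* 1110xxxx */
	case 4:
		return (4);	/* 11110xxx */
	default:
		return (0);	/* 10xxxxxx continuation byte or invalid */
	}
}
```

With this sketch, `utf8_seq_len(0xD0)` returns 2, whereas `__builtin_popcount(0xD0)` evaluates to 3.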

Diff Detail

rG FreeBSD src repository
Lint Not Applicable
Tests Not Applicable

Event Timeline

The commit message should say RFC 3629 instead of 2629.

bnovkov edited the summary of this revision.

Address @christos's comments.

  • Add more detailed explanation of the use of __builtin_clz

Other fixes:

  • Codepoint calculation for two-byte sequences was missing one bit in the mask used for the leading character, fixed now
  • ttydisc_rubchar now falls back to non-UTF8 behaviour if teken_wcwidth returns an error
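As a rough illustration of the two-byte mask fix listed above: per RFC 3629, a two-byte sequence has the form 110xxxxx 10xxxxxx, so the leading byte carries five payload bits and needs the mask 0x1F. The sketch below is an assumption-laden example, not the committed code; `utf8_decode2` is a hypothetical name, and the review does not state what the previous mask was.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not the committed code): decode a two-byte
 * UTF-8 sequence of the form 110xxxxx 10xxxxxx. The caller is assumed
 * to have validated both bytes already.
 */
static uint32_t
utf8_decode2(uint8_t lead, uint8_t cont)
{
	/* Five payload bits from the leading byte, six from the second. */
	return (((uint32_t)(lead & 0x1F) << 6) | (uint32_t)(cont & 0x3F));
}
```

For the Cyrillic sequence 0xD0 0x91 this yields 0x0411. The top payload bit of the leading byte (0x10) contributes 0x400 to the result, so a mask that omits it cannot produce the correct code point for such characters.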

Fix formatting for multiline comment in teken_utf8_bytes_to_codepoint.

This revision is now accepted and ready to land. Oct 11 2023, 9:58 PM
This revision was automatically updated to reflect the committed changes.