
tty/teken: fix UTF8 sequence validation logic
ClosedPublic

Authored by bnovkov on Oct 10 2023, 5:56 PM.

Details

Summary

This patch fixes the UTF-8 sequence validation logic in teken_utf8_bytes_to_codepoint and fixes the fallback behaviour in ttydisc_rubchar when an invalid UTF-8 sequence is encountered. The code previously used bitcount to extract the sequence length from the leading byte. However, this breaks for code points that have additional bits set in the upper half of the leading byte (e.g. Cyrillic characters, whose leading bytes include 0xD0). This led to incorrect behaviour when deleting those characters with backspace. The code now counts the consecutive set bits in the leading byte starting from the MSB, as per RFC 3629.
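To illustrate the difference between the two approaches, here is a minimal sketch. It is not the code from this revision: the helper names naive_seq_len and utf8_seq_len are hypothetical, the bit-counting variant is only a guess at the failure mode described above, and the sketch assumes a compiler that provides __builtin_clz and __builtin_popcount with a 32-bit unsigned int.

```c
#include <stdint.h>

/*
 * Hypothetical illustration of the failure mode: counting every set
 * bit in the upper nibble of the leading byte. This is an assumption
 * about the general approach, not the previous teken code.
 */
static int
naive_seq_len(uint8_t lead)
{
	return (__builtin_popcount(lead & 0xf0));
}

/*
 * Hypothetical helper: derive the sequence length from the number of
 * consecutive set bits starting at the MSB (RFC 3629 layout).
 * Returns -1 for bytes that cannot start a sequence.
 */
static int
utf8_seq_len(uint8_t lead)
{
	uint32_t inv;
	int ones;

	if ((lead & 0x80) == 0)
		return (1);		/* 0xxxxxxx: single-byte (ASCII) */

	/*
	 * Invert the byte and move it to the top of a 32-bit word, so
	 * that counting leading zeros of the result counts the leading
	 * ones of the original byte.
	 */
	inv = (uint32_t)(uint8_t)~lead << 24;
	if (inv == 0)
		return (-1);		/* 0xff; __builtin_clz(0) is undefined */

	ones = __builtin_clz(inv);
	if (ones < 2 || ones > 4)
		return (-1);		/* continuation byte or over-long prefix */
	return (ones);
}
```

For the Cyrillic leading byte 0xD0 (binary 11010000), the bit-counting variant reports three bytes because of the extra payload bit, while counting from the MSB stops at the first zero and reports two.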

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

The commit message should say RFC 3629 instead of 2629.

sys/teken/teken_wcwidth.h
131–150
bnovkov edited the summary of this revision.

Address @christos 's comments.

  • Add more detailed explanation of the use of __builtin_clz

Other fixes:

  • Codepoint calculation for two-byte sequences was missing one bit in the mask used for the leading character, fixed now
  • ttydisc_rubchar now falls back to non-UTF8 behaviour if teken_wcwidth returns an error
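As a sketch of why the leading-byte mask matters for two-byte sequences: the leading byte has the form 110xxxxx, so it carries five payload bits and needs the mask 0x1f. The function names below are hypothetical, and the narrower 0x0f mask is only an assumed example of a mask that is one bit short; the review does not state the old value.

```c
#include <stdint.h>

/*
 * Hypothetical helper: decode a two-byte UTF-8 sequence.
 * Leading byte 110xxxxx contributes 5 bits, continuation byte
 * 10xxxxxx contributes 6 bits. No validation is done here.
 */
static uint32_t
utf8_decode2(uint8_t lead, uint8_t cont)
{
	return (((uint32_t)(lead & 0x1f) << 6) | (uint32_t)(cont & 0x3f));
}

/*
 * Same decode with a mask that is one bit too narrow (assumed value,
 * for illustration only). The top payload bit of the leading byte is
 * silently dropped.
 */
static uint32_t
utf8_decode2_narrow(uint8_t lead, uint8_t cont)
{
	return (((uint32_t)(lead & 0x0f) << 6) | (uint32_t)(cont & 0x3f));
}
```

With the bytes 0xD0 0x96 (U+0416, a Cyrillic capital letter), the full mask yields 0x416, whereas the narrow mask drops the 0x10 bit of the leading byte and yields 0x16.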

Fix formatting for multiline comment in teken_utf8_bytes_to_codepoint.

This revision is now accepted and ready to land. Oct 11 2023, 9:58 PM
This revision was automatically updated to reflect the committed changes.