tty/teken: fix UTF8 sequence validation logic
Closed, Public

Authored by bnovkov on Oct 10 2023, 5:56 PM.

Details

Reviewers

Commits

rGb80d4b2f93e5: tty/teken: fix UTF8 sequence validation logic
rG376c2ff8981a: tty/teken: fix UTF8 sequence validation logic
rG72a8e373f2d1: tty/teken: fix UTF8 sequence validation logic
rG2fed1c579c52: tty/teken: fix UTF8 sequence validation logic

Summary

This patch fixes the UTF8 sequence validation logic in teken_utf8_bytes_to_codepoint and fixes the fallback behaviour in ttydisc_rubchar when an invalid UTF8 sequence is encountered. The code previously used a bit count of the leading byte to determine the sequence length. However, this approach breaks for code points that have additional bits set in the first half of the leading byte (e.g. Cyrillic characters), because those payload bits are counted as if they were length bits. This led to incorrect behaviour when deleting those characters with backspace. The code now checks the number of consecutive set bits in the leading byte, starting from the MSB, as per RFC 3629.
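The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the code from the patch: `utf8_seq_len` and `old_len_by_popcount` are hypothetical names, the popcount version is only a reconstruction of the behaviour described above, and the use of the GCC/Clang builtins `__builtin_clz` and `__builtin_popcount` with a 32-bit `unsigned int` is an assumption.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive the UTF-8 sequence length from the number
 * of consecutive set bits at the top of the leading byte (RFC 3629).
 * Returns -1 for bytes that cannot start a sequence.
 */
static int
utf8_seq_len(uint8_t lead)
{
	uint8_t inv;
	int ones;

	if ((lead & 0x80) == 0)
		return (1);		/* 0xxxxxxx: single-byte (ASCII) */

	/* Inverting turns leading ones into leading zeros. */
	inv = (uint8_t)~lead;
	if (inv == 0)
		return (-1);		/* 0xFF: clz(0) is undefined, reject */

	/* Assumes a 32-bit unsigned int: move the byte to the top bits. */
	ones = __builtin_clz((unsigned int)inv << 24);

	/* One leading 1 is a continuation byte; more than four is invalid. */
	if (ones < 2 || ones > 4)
		return (-1);
	return (ones);
}

/*
 * Reconstruction of the flawed idea, for comparison only: counting all
 * set bits also counts payload bits in the leading byte.
 */
static int
old_len_by_popcount(uint8_t lead)
{
	return (__builtin_popcount(lead));
}
```

For the Cyrillic letter U+0416, encoded as 0xD0 0x96, the leading byte is 1101 0000. It has two leading set bits, so the sequence is two bytes long, but it has three set bits in total, so a plain bit count suggests a three-byte sequence.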

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

bnovkov created this revision. Oct 10 2023, 5:56 PM

Herald added a subscriber: imp. Oct 10 2023, 5:56 PM

bnovkov requested review of this revision. Oct 10 2023, 5:56 PM

The commit message should say RFC 3629 instead of 2629.

sys/teken/teken_wcwidth.h
131–150

Address @christos's comments.

Add more detailed explanation of the use of __builtin_clz

Other fixes:

Codepoint calculation for two-byte sequences was missing one bit in the mask used for the leading byte; this is now fixed.
ttydisc_rubchar now falls back to non-UTF8 behaviour if teken_wcwidth returns an error.
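To illustrate why the width of the leading-byte mask matters for two-byte sequences, here is a minimal sketch. It is not the code from the patch: `utf8_decode2` is a hypothetical helper, and the narrower mask mentioned afterwards is an illustrative value, since the notes above do not state what the old mask was. A two-byte leading byte has the form 110xxxxx, so five payload bits (mask 0x1F) must be kept.

```c
#include <stdint.h>

/*
 * Hypothetical helper: decode a two-byte UTF-8 sequence.
 * Returns the code point, or -1 if the sequence is malformed.
 */
static int32_t
utf8_decode2(uint8_t b0, uint8_t b1)
{
	int32_t cp;

	if ((b0 & 0xE0) != 0xC0)	/* leading byte must be 110xxxxx */
		return (-1);
	if ((b1 & 0xC0) != 0x80)	/* continuation must be 10xxxxxx */
		return (-1);

	/* Five payload bits from the leading byte, six from the second. */
	cp = ((int32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);

	/* Values below 0x80 fit in one byte; reject overlong encodings. */
	if (cp < 0x80)
		return (-1);
	return (cp);
}
```

With the bytes 0xD0 0x96 this yields 0x416 (U+0416). If the leading byte were masked with a narrower value such as 0x0F, the top payload bit of 0xD0 would be dropped and the result would be 0x16 instead.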