Page MenuHomeFreeBSD

Fix MSDOSFS to support UTF-16 surrogate pairs
ClosedPublic

Authored by se on Thu, May 28, 8:06 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Jun 23, 8:30 AM
Unknown Object (File)
Fri, Jun 19, 10:17 PM
Unknown Object (File)
Thu, Jun 18, 10:31 PM
Unknown Object (File)
Wed, Jun 17, 2:25 AM
Unknown Object (File)
Mon, Jun 15, 5:27 PM
Unknown Object (File)
Sat, Jun 13, 7:56 PM
Unknown Object (File)
Fri, Jun 12, 2:04 AM
Unknown Object (File)
Thu, Jun 11, 12:05 PM

Details

Summary

The VFAT file system uses UTF-16 to encode long filenames.
The code in FreeBSD lacks support for characters that cannot be represented by a single UTF-16 symbol (e.g. emojis, U+1F600 and above).
Support of code points >= U+10000 uses 2 UTF-16 symbols in the range U+D800 to U+DBFF (low surrogate) and 0xDC00 to 0xDFFF (high surrogate), both providing 10 bits to be combined into a 20 bit value that covers the range U+10000 to U+FFFFF.
Such files can be created on Windows, Linux, and MacOS by adding, e.g., an emoji to the file name.

This patch adds surrogate pair support to all file system operations in MSDOSFS.

Test Plan

Apply the patch and create, read, rename, delete etc. files with names that require surrogate pairs, e.g. which include emojis in their names.
An example name could be "123456789012🀐😎31234567890123🀐A😎B😎C😍DEFG🀐A😎B😎C😍DEFG123456789012345.dat".
Test with files created using other operating systems (WIndows, Linix, MacOS), too.
No regressions should exist for filenames without surrogates (e.g., execute the msdos stress tests in the tools/test/stress2/misc/ directory).

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

se requested review of this revision.Thu, May 28, 8:06 PM

Have you run your suggested tests?

Works for me.

Manually tested on an ElectroBSD amd64 system based on stable/15 71ec93ff69f.

Using the patch, the msdosfs issue reported on freebsd-hackers@ is no longer reproducible.

I'll probably be able to run the suggested stress tests tomorrow.

sys/fs/msdosfs/msdosfs_conv.c
691

Typo.

Have you run your suggested tests?

Yes, all tests gave identical results to a previous baseline run without the patches applied.

In msdos12.sh the mount operation needs to either use a mask of 777 or "-u $testuser" due to permission problems with the original code (also without these patches applied).
IMHO the script should be fixed to not mount with access allowed only for root, but then trying to run the test as a non-privileged user.
There are also some error messages from the all.debug.sh in the leak tests, since some of the labels looked for do not necessarily exist (e.g., vmstat -m does not return a line for "workfree", possibly since my system is ZFS-only and workfree is related to soft-updates).
AFAICT, all "errors" reported by the tests are due to test script issues and neither the original msdosfs nor the patched one.

But these tests are not sufficient: they are not using -L and thus msdosfs_iconv.ko will not be loaded and there is no support for UTF-8 userland filenames.
In fact, most of the patched code is only executed if msdosfs_iconv is loaded ((mp->pm_flags & MSDOSFSMNT_KICONV) != 0) and therefore by none of the stress tests.

I plan to add a test case that specifically tests operations on files named with surrogate pairs; this test will mount the file system with -L using an UTF-8 locale. (Currently there are no tests that test msdosfs with ICONV support loaded at all.)

I have committed 3 new tests in tools/test/stress2/misc (msdos22.sh, msdos23.sh, msdos24.sh), since I found that msdosfs with the patches in this review failed "under load" when running these tests (originally the script that has been committed as msdos24.sh).

But I found that the tests do also fail when run using the MSDOSFS in -CURRENT without these patches (and in the same way).

Since msdos24.sh tested surrogate support, which is not available in -CURRENT, I have added modified tests that use only ASCII characters (msdos22.sh) or only single UTF-16 symbols (msdos23.sh) in filenames.

As long as msdos22.sh and msdos23.sh fail on the baseline, I'd rather not commit the surrogare pair support, which further complicates the directory operations.

sys/fs/msdosfs/msdosfs_lookup.c
281

Should surrogate be reinitialized there? Or put it differently, why not embed the surrogate state into struct mbnambuf?

I have found that the test added to the stress2/misc directory to specifically test filenames with surrogate pairs (msdos24.sh) reports errors.
When restricted to just ASCII file names or UTF-16 symbols without surrogate pairs, the test succeeds.

I'll debug the code and will update the patch when it fully supports msdos24.sh.

sys/fs/msdosfs/msdosfs_lookup.c
281

Yes, I'll update the code to add a surrogate element to struct mbnambufand reinitialize it at more places (even though the current code should not require this, at least for wellformed directories).

The previous version failed for names that required exactly n*13+1 UTF-16 symbols (e.g. 123456🀐1234🀐) and had a surrogate pair at the end of the name.
This problam was detected by running the msdos24.sh test script recently added to the stress2/misc directory.
The cause was that the end of the unix name had been seen, which made unix2winfn() return end == true, but there still was a surrogate value that required another invocation of unix2winfn() to be written to the next directory slot.

This new version survived 1000s of runs of msdos24.sh (which creates some 1000 files of very different length and with surrogate pairs at random positions), copies them, replaces the source file of the copy by the destination file, then removes the source file.

This version has also been tested using the other msdos*.sh scripts in stress2/misc (after fixing msdos12.sh and with some warnings and script errors like [: -a: unexpected operator that are not caused by msdosfs).

se marked an inline comment as done.Sat, May 30, 11:30 AM

This now looks ready to commit.

This revision is now accepted and ready to land.Sat, May 30, 9:01 PM