Detect valid UTF-8 as UTF-8 #572
I've added some basic tests. However, the tests need to know if Aegisub was compiled with uchardet, but the
Thanks! And sorry for the delay; I ended up postponing absolutely everything else until I finished #594 since I probably would have never gotten it done otherwise. The first commit looks good to me and fixes the wrongly detected files I've been sent in the past. For the second commit (thanks for adding tests!):
Thanks for the update! I pushed to your branch to fix the compilation on Windows and remove the no longer needed
Also remove unneeded setup script parameter and fix quoting in the unix shell version. Co-Authored-By: arch1t3cht <arch1t3cht@gmail.com>
Thanks for fixing this.
Actually it was already unused even before these changes.
I guess it's good. Some thoughts:
Thanks!
Yeah, I only realized that afterwards. But, either way, I think it's better to simplify the script as much as possible, especially since it runs on every
Probably, but that can be done later on.
This PR modifies Aegisub's charset detection function so that if the input is valid UTF-8, it will always be detected as UTF-8 instead of using uchardet's heuristics (which sometimes fail to correctly detect UTF-8).
The ICU charset conversion API is used for UTF-8 validation. Because this API is designed for conversion, not validation, a dummy output buffer has to be provided to receive the converted output, which is then discarded.
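To make "valid UTF-8" concrete, here is a minimal sketch of the check itself. It is not the PR's code: the PR validates through ICU's converter, whereas this stand-in is a hand-written, standard-library-only validator, and the name `IsValidUtf8` is hypothetical. It rejects the same classes of input a strict decoder would: stray or missing continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper, not Aegisub's implementation (which goes through ICU).
// Returns true if `data` is well-formed UTF-8 as defined by RFC 3629.
bool IsValidUtf8(std::string_view data) {
    const auto *p = reinterpret_cast<const unsigned char *>(data.data());
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        if (b < 0x80) { ++i; continue; }  // ASCII byte

        std::size_t len;          // total length of this sequence in bytes
        std::uint32_t cp;         // code point being decoded
        std::uint32_t min_cp;     // smallest code point allowed for this length
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min_cp = 0x10000; }
        else return false;        // stray continuation byte or invalid lead byte

        if (n - i < len) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

For example, the two-byte sequence `C3 A9` ("é" in UTF-8) passes, while the single byte `E9` ("é" in Latin-1) fails because it announces a three-byte sequence that never arrives. This is why short legacy-encoded files rarely validate as UTF-8 by accident.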
A charset detection test suite has been added. For testing purposes, an additional "detect reason" output parameter has been added to the charset detection function, which makes it possible to distinguish between, e.g., detecting UTF-8 from a byte order mark, from UTF-8 validation, or from uchardet heuristics.
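The detection order and the "detect reason" output described above could look roughly like the sketch below. All names here (`DetectReason`, `DetectCharset`, the two callable types) are hypothetical and the PR's actual signature may differ. The validity check and the heuristic detector are passed in as callables so the sketch depends on neither ICU nor uchardet, and only the UTF-8 byte order mark is handled.

```cpp
#include <functional>
#include <string>
#include <string_view>

// Hypothetical names; the real enum and function in the PR may differ.
enum class DetectReason { ByteOrderMark, Utf8Validation, Heuristics };

using Utf8Validator = std::function<bool(std::string_view)>;
using HeuristicDetector = std::function<std::string(std::string_view)>;

// Tries the cheap, certain checks first and falls back to heuristics last.
// If `reason` is non-null, it receives the rule that produced the result.
std::string DetectCharset(std::string_view data,
                          const Utf8Validator &is_valid_utf8,
                          const HeuristicDetector &guess,
                          DetectReason *reason = nullptr) {
    auto report = [reason](DetectReason r) { if (reason) *reason = r; };

    // 1. A UTF-8 byte order mark (EF BB BF) settles the question immediately.
    if (data.substr(0, 3) == "\xEF\xBB\xBF") {
        report(DetectReason::ByteOrderMark);
        return "utf-8";
    }
    // 2. Input that validates as UTF-8 is always reported as UTF-8.
    if (is_valid_utf8(data)) {
        report(DetectReason::Utf8Validation);
        return "utf-8";
    }
    // 3. Only now consult the heuristic detector (uchardet in the PR).
    report(DetectReason::Heuristics);
    return guess(data);
}
```

A test can then assert on the reason as well as the charset, so a file that happens to be detected as UTF-8 by the heuristic fallback does not mask a regression in the validation step.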
Remaining issues:
The WITH_UCHARDET macro is not available when compiling tests.
Fixes #370