Detect valid UTF-8 as UTF-8 by filip-hejsek · Pull Request #572 · TypesettingTools/Aegisub

filip-hejsek · 2026-03-03T19:06:53Z

This PR modifies Aegisub's charset detection function so that if the input is valid UTF-8, it will always be detected as UTF-8 instead of using uchardet's heuristics (which sometimes fail to correctly detect UTF-8).

ICU charset conversion API is used for UTF-8 validation. Because this API is designed for conversion, not validation, a dummy output buffer has to be provided.

A charset detection test suite has been added. For testing purposes, an additional "detect reason" output parameter has been added to the charset detection function, which allows distinguishing between e.g. detecting utf-8 from byte order mark, UTF-8 validation or uchardet heuristics.

Remaining issues:

The added tests need to know if libaegisub was compiled with or without uchardet, but the WITH_UCHARDET macro is not available when compiling tests.

Fixes #370

filip-hejsek · 2026-03-24T18:32:14Z

I've added some basic tests. However, the tests need to know if Aegisub was compiled with uchardet, but the WITH_UCHARDET macro is not available when compiling the tests.

arch1t3cht · 2026-05-09T21:49:34Z

Thanks! And sorry for the delay; I ended up postponing absolutely everything else until I finished #594 since I probably would have never gotten it done otherwise.

The first commit looks good to me and fixes the wrongly detected files I've been sent in the past. For the second commit (thanks for adding tests!):

I like the added TEST_DATA_DIR logic in principle (though maybe it'd be nice to prefix the env variable with AEGISUB_) but as it is it's inconsistent with the way test files are handled for the other test variants. The other tests run tests/setup.{sh,bat} to copy the test data to the build directory so that they can be found via a relative path from the test binary's CWD.

Now, these dedicated setup scripts are probably not actually needed any more. They used to set up some extra files with different permissions, but that was removed in 5da48ec . So in principle they could just be removed entirely and be replaced with the AEGISUB_TEST_DATA_DIR logic everywhere.

If you agree I can do this myself sometime soon unless you want to do it yourself.
The configuration macros not being available to the tests is unfortunate, yeah. I think the simplest solution might be to move the logic in meson.build that's generating aegisub_defines outside of the if/else branch and pass those flags to the tests, as in the attached patch: testflags.patch

arch1t3cht · 2026-06-09T19:03:45Z

Thanks for the update!

I pushed to your branch to fix the compilation on Windows and remove the no longer needed test_data_des argument for setup-test-data. From my side, this is good to merge now, unless you disagree.

Also remove unneeded setup script parameter and fix quoting in the unix shell version. Co-Authored-By: arch1t3cht <arch1t3cht@gmail.com>

Co-Authored-By: arch1t3cht <arch1t3cht@gmail.com>

filip-hejsek · 2026-06-11T08:53:50Z

I pushed to your branch to fix the compilation on Windows

Thanks for fixing this.

and remove the no longer needed test_data_des argument for setup-test-data.

Actually it was already unused even before these changes.

From my side, this is good to merge now, unless you disagree.

I guess it's good. Some thoughts:

You suggested removing the setup scripts, but that couldn't be done because they also setup some output directories (and the one file that the tests touch and then check the mod time of).
mru_ok.json, mru_invalid.json and ten_bytes are still created by the setup script, it might be better to change these too?
If a bug were to be introduced in the agi::Options::FLUSH_SKIP implementation, the tests would now end up overwriting the original files, but I don't think that's a problem.

arch1t3cht · 2026-06-11T10:38:40Z

Thanks!

You suggested removing the setup scripts, but that couldn't be done because they also setup some output directories (and the one file that the tests touch and then check the mod time of).

Yeah, I only realized that afterwards. But, either way, I think it's better to simplify the script as much as possible, especially since it runs on every ninja invocation.

mru_ok.json, mru_invalid.json and ten_bytes are still created by the setup script, it might be better to change these too?

Probably, but that can be done later on.

filip-hejsek force-pushed the utf8_detect branch from 330ddb6 to f399fed Compare March 3, 2026 19:42

filip-hejsek force-pushed the utf8_detect branch 8 times, most recently from 6861c91 to 9276cc8 Compare March 24, 2026 17:59

filip-hejsek force-pushed the utf8_detect branch 2 times, most recently from 1b69fc3 to ca587b2 Compare March 24, 2026 19:26

filip-hejsek force-pushed the utf8_detect branch 2 times, most recently from f7665fa to c5b7500 Compare June 3, 2026 17:58

arch1t3cht force-pushed the utf8_detect branch 2 times, most recently from c3d45e8 to 6db1db7 Compare June 9, 2026 19:03

filip-hejsek and others added 4 commits June 11, 2026 09:27

Detect valid UTF-8 as UTF-8

422116e

Locate test data files using env var instead of copying to build dir

581b4f9

Also remove unneeded setup script parameter and fix quoting in the unix shell version. Co-Authored-By: arch1t3cht <arch1t3cht@gmail.com>

Pass config macros to the compiler when compiling tests

f9c3056

Co-Authored-By: arch1t3cht <arch1t3cht@gmail.com>

Add tests for charset detection

859ac2f

filip-hejsek force-pushed the utf8_detect branch from 6db1db7 to 859ac2f Compare June 11, 2026 08:39

filip-hejsek marked this pull request as ready for review June 11, 2026 08:53

arch1t3cht merged commit 369872e into TypesettingTools:master Jun 11, 2026
7 checks passed

filip-hejsek deleted the utf8_detect branch June 11, 2026 10:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Detect valid UTF-8 as UTF-8#572

Detect valid UTF-8 as UTF-8#572
arch1t3cht merged 4 commits into
TypesettingTools:masterfrom
filip-hejsek:utf8_detect

filip-hejsek commented Mar 3, 2026 •

edited

Loading

Uh oh!

filip-hejsek commented Mar 24, 2026

Uh oh!

arch1t3cht commented May 9, 2026

Uh oh!

arch1t3cht commented Jun 9, 2026

Uh oh!

filip-hejsek commented Jun 11, 2026

Uh oh!

arch1t3cht commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

filip-hejsek commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

filip-hejsek commented Mar 24, 2026

Uh oh!

arch1t3cht commented May 9, 2026

Uh oh!

arch1t3cht commented Jun 9, 2026

Uh oh!

filip-hejsek commented Jun 11, 2026

Uh oh!

arch1t3cht commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

filip-hejsek commented Mar 3, 2026 •

edited

Loading