Conversation

@ben-schwen ben-schwen commented Oct 25, 2025

Closes #7336
Closes #1343
Closes #7333

codecov bot commented Oct 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.97%. Comparing base (291a711) to head (5568eb5).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7388   +/-   ##
=======================================
  Coverage   98.97%   98.97%           
=======================================
  Files          87       87           
  Lines       16733    16737    +4     
=======================================
+ Hits        16561    16565    +4     
  Misses        172      172           


github-actions bot commented Oct 25, 2025

  • HEAD=tests_requires_utf8 slower P<0.001 for memrecycle regression fixed in #5463
    Comparison Plot

Generated via commit 5568eb5

Download link for the artifact containing the test results: ↓ atime-results.zip

Task durations:
  • R setup and installing dependencies: 4 minutes and 37 seconds
  • Installing different package versions: 10 minutes and 42 seconds
  • Running and plotting the test cases: 3 minutes and 28 seconds

aitap commented Oct 25, 2025

Sorry if I'm late to note this, but wouldn't a more reliable test for this be the same check we currently use for ñ in test 2266? A test may require some symbols (ñ, ü, ん) to be representable in the native encoding. The symbols may be represented using Unicode escapes (\uXXXX), as they currently are. If !identical(foo, enc2native(foo)), then the test must be skipped.

@ben-schwen (Member, Author):
> The symbols may be represented using Unicode escapes (\uXXXX) as they currently do. If !identical(foo, enc2native(foo)), then the test must be skipped.

Good point. I have integrated this for the utf8_check.
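For illustration, a minimal sketch of what a helper built on aitap's enc2native() idea might look like. The name utf8_check and its exact signature are assumptions here; the PR's actual implementation may differ. enc2native() and vapply() are base R.

```r
# Hypothetical sketch (names are illustrative, not necessarily the
# PR's actual code): a test is only meaningful if every required
# character survives translation to the native encoding.
utf8_check = function(chars) {
  # enc2native() converts a string to the native encoding; in a locale
  # that cannot represent a character, the result differs from the
  # input (e.g. it contains a <U+XXXX> substitution sequence), so the
  # identical() comparison fails and the test should be skipped.
  all(vapply(chars, function(x) identical(x, enc2native(x)), logical(1L)))
}
```

In a UTF-8 locale, utf8_check("\u00E4") returns TRUE; under LC_ALL=C the translation mangles the string and the check returns FALSE, so the dependent tests can be skipped cleanly.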

@ben-schwen ben-schwen requested review from MichaelChirico and removed request for MichaelChirico December 28, 2025 15:03
x1 = c("al\u00E4", "ala", "\u00E4allc", "coep")
x2 = c("ala", "al\u00E4")
tstc = function(y) unlist(lapply(y, function(x) as.character(as.name(x))), use.names=FALSE)
test(1088.1, requires_utf8="\u00E4", chmatch(x1, x2), match(x1, x2)) # should not fallback to "match"
Review comment (Member):
maybe easier to understand/write tests as requires_utf8=c(x1, x2)?

also, maybe better as a local() check here too, to reduce the visual noise of all the identical requires_utf8= inputs?

# for completeness, include test from #2528 of non-ASCII LHS of := (it could feasibly fail in future due to something other than chmatch)

local(if (utf8_check("\u00E4")) {
eval(parse(text='
@MichaelChirico commented Dec 29, 2025:
why do we need eval(parse())?

also, parse(keep.source=FALSE) for micro-improvement

@aitap commented Dec 29, 2025:
When the parser sees constructs like

data.table("\u00FCber" = c(1, 0, 0, 0, 0))

...it needs to construct a "language" (LANGSXP) call object where TAG(CADR(call)) is a symbol whose PRINTNAME is a CHARSXP saying über. That CHARSXP must be in the native encoding: requiring a single encoding makes it possible to compare pointers to SYMSXP values for equality (unlike CHARSXP where we have to test NEED2UTF8 and so on), and the native encoding was the default back before R had string encodings.

So when the parser tries to translate that Unicode string into the native encoding, it fails, emits a warning, and probably fails the following test, because the resulting string contains a substitution sequence:

LC_ALL=C R -q -s -e 'parse(text = r"{data.table("\u00FCber" = c(1, 0, 0, 0, 0))}")'
expression(data.table(`<U+00FC>ber` = c(1, 0, 0, 0, 0)))
Warning message:
In parse(text = "data.table(\"\\u00FCber\" = c(1, 0, 0, 0, 0))") :
  unable to translate '<U+00FC>ber' to native encoding

Without the runtime eval(parse(...)), this warning happens during source() with no way to avoid it.
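To make the pattern aitap is defending concrete, here is an illustrative sketch of the runtime guard (the guard condition is written inline here; the test suite uses its own helper for this). The UTF-8-dependent code lives inside a string, so the symbol über is only created when parse() runs at runtime, and only in locales whose native encoding can represent it; source() itself never has to translate it.

```r
# Sketch of the guard-plus-deferred-parse pattern (illustrative only).
# The \u00FC escape in the guard is harmless at source() time: strings
# may be UTF-8. Only creating a *symbol* (here, the column name) forces
# translation to the native encoding, and that happens inside parse(),
# which we only call when the translation is known to succeed.
local(if (identical("\u00FC", enc2native("\u00FC"))) {
  eval(parse(text = '
    DT = data.table::data.table("\u00FCber" = c(1, 0, 0, 0, 0))
    print(names(DT))
  ', keep.source = FALSE))
} else {
  cat("Skipped: native encoding cannot represent \u00FC\n")
})
```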

Review comment (Member):
it would be good to write that down somewhere as a reminder, but I'm not sure the best place to do it while being (1) discoverable and (2) not repetitive.

maybe (1) document it near require_utf8 in R/test.data.table and (2) add a comment by each eval(parse()) like "see require_utf8 description"?

eval(parse(text='
DT = data.table(pas = c(1:5, NA, 6:10), good = c(1:10, NA))
setnames(DT, "pas", "p\u00E4s")
test(1092, requires_utf8="\u00E4", eval(parse(text="DT[is.na(p\u00E4s), p\u00E4s := 99L]")), data.table("p\u00E4s" = c(1:5, 99L, 6:10), good = c(1:10,NA)))
Review comment (Member):
nested eval(parse())... gnarly

} else {
cat("Test 2194.7 skipped because it needs a UTF-8 locale.\n")
})
needed_chars = "\u0105\u017E\u016B\u012F\u0173\u0117\u0161\u0119"
Review comment (Member):
maybe easier to read in vector form?

Suggested change
needed_chars = "\u0105\u017E\u016B\u012F\u0173\u0117\u0161\u0119"
needed_chars = c("\u0105", "\u017E", "\u016B", "\u012F", "\u0173", "\u0117", "\u0161", "\u0119")

ja_ni = "\u4E8C"
ja_ko = "\u3053"
ja_n = "\u3093"
nc = paste0(accented_a, ja_ichi, ja_ni, ja_ko, ja_n)
Review comment (Member):
Suggested change
nc = paste0(accented_a, ja_ichi, ja_ni, ja_ko, ja_n)
nc = c(accented_a, ja_ichi, ja_ni, ja_ko, ja_n)


3. Vignettes are now built using `litedown` instead of `knitr`, [#6394](https://github.com/Rdatatable/data.table/issues/6394). Thanks @jangorecki for the suggestion and @ben-schwen and @aitap for the implementation.

4. `test()` gains new argument `requires_utf8` to skip tests when UTF-8 support is not available, [#7336](https://github.com/Rdatatable/data.table/issues/7336). Thanks @MichaelChirico for the suggestion and @ben-schwen for the implementation.
Review comment (Member):
test() is not exported, so I'm not sure this is the right framing. Maybe

The data.table test suite is a bit more robust to lacking UTF-8 support [...]



Development

Successfully merging this pull request may close these issues:
  • Add argument like 'requires_utf8' to test()
  • Escape UTF-8 dependent tests
  • Finish DtNonAsciiTests package and submit to CRAN

3 participants