Conversation

@ben-schwen ben-schwen commented Oct 25, 2025

Closes #7336
Closes #1343
Closes #7333

codecov bot commented Oct 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.97%. Comparing base (291a711) to head (5568eb5).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #7388   +/-   ##
=======================================
  Coverage   98.97%   98.97%           
=======================================
  Files          87       87           
  Lines       16733    16737    +4     
=======================================
+ Hits        16561    16565    +4     
  Misses        172      172           


github-actions bot commented Oct 25, 2025

  • HEAD=tests_requires_utf8 slower P<0.001 for memrecycle regression fixed in #5463
    Comparison Plot

Generated via commit 5568eb5

Download link for the artifact containing the test results: ↓ atime-results.zip

Task durations:
  • R setup and installing dependencies: 4 minutes and 37 seconds
  • Installing different package versions: 10 minutes and 42 seconds
  • Running and plotting the test cases: 3 minutes and 28 seconds

aitap commented Oct 25, 2025

Sorry if I'm late to note this, but wouldn't a more reliable test for this be the same check we currently use for ñ in test 2266? A test may require some symbols (ñ, ü, ん) to be representable in the native encoding. The symbols may be represented using Unicode escapes (\uXXXX), as they currently are. If !identical(foo, enc2native(foo)), then the test must be skipped.

@ben-schwen (Member, Author):
> The symbols may be represented using Unicode escapes (\uXXXX) as they currently do. If !identical(foo, enc2native(foo)), then the test must be skipped.

Good point. I have integrated this for the utf8_check.
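For illustration, a minimal sketch of what a helper built on aitap's enc2native() idea might look like. The name utf8_check and its exact signature are assumptions here; the PR's actual implementation may differ. enc2native() and vapply() are base R.

```r
# Hypothetical sketch (names are illustrative, not necessarily the
# PR's actual code): a test is only meaningful if every required
# character survives translation to the native encoding.
utf8_check = function(chars) {
  # enc2native() converts a string to the native encoding; in a locale
  # that cannot represent a character, the result differs from the
  # input (e.g. it contains a <U+XXXX> substitution sequence), so the
  # identical() comparison fails and the test should be skipped.
  all(vapply(chars, function(x) identical(x, enc2native(x)), logical(1L)))
}
```

In a UTF-8 locale, utf8_check("\u00E4") returns TRUE; under LC_ALL=C the translation mangles the string and the check returns FALSE, so the dependent tests can be skipped cleanly.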

@ben-schwen ben-schwen requested review from MichaelChirico and removed request for MichaelChirico December 28, 2025 15:03
x1 = c("al\u00E4", "ala", "\u00E4allc", "coep")
x2 = c("ala", "al\u00E4")
tstc = function(y) unlist(lapply(y, function(x) as.character(as.name(x))), use.names=FALSE)
test(1088.1, requires_utf8="\u00E4", chmatch(x1, x2), match(x1, x2)) # should not fallback to "match"
Review comment (Member):
maybe easier to understand/write tests as requires_utf8=c(x1, x2)?

also, maybe better as a local() check here too, to reduce the visual noise of all the identical requires_utf8= inputs?

# for completeness, include test from #2528 of non-ASCII LHS of := (it could feasibly fail in future due to something other than chmatch)

local(if (utf8_check("\u00E4")) {
eval(parse(text='
@MichaelChirico commented Dec 29, 2025:
why do we need eval(parse())?

also, parse(keep.source=FALSE) for micro-improvement

@aitap commented Dec 29, 2025:
When the parser sees constructs like

data.table("\u00FCber" = c(1, 0, 0, 0, 0))

...it needs to construct a "language" (LANGSXP) call object where TAG(CADR(call)) is a symbol whose PRINTNAME is a CHARSXP saying über. That CHARSXP must be in the native encoding: requiring a single encoding makes it possible to compare pointers to SYMSXP values for equality (unlike CHARSXP where we have to test NEED2UTF8 and so on), and the native encoding was the default back before R had string encodings.

So when the parser tries to translate that Unicode string into the native encoding, it fails, emits a warning, and probably fails the following test, because the resulting string contains a substitution sequence:

LC_ALL=C R -q -s -e 'parse(text = r"{data.table("\u00FCber" = c(1, 0, 0, 0, 0))}")'
expression(data.table(`<U+00FC>ber` = c(1, 0, 0, 0, 0)))
Warning message:
In parse(text = "data.table(\"\\u00FCber\" = c(1, 0, 0, 0, 0))") :
  unable to translate '<U+00FC>ber' to native encoding

Without the runtime eval(parse(...)), this warning happens during source() with no way to avoid it.
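To make the pattern aitap is defending concrete, here is an illustrative sketch of the runtime guard (the guard condition is written inline here; the test suite uses its own helper for this). The UTF-8-dependent code lives inside a string, so the symbol über is only created when parse() runs at runtime, and only in locales whose native encoding can represent it; source() itself never has to translate it.

```r
# Sketch of the guard-plus-deferred-parse pattern (illustrative only).
# The \u00FC escape in the guard is harmless at source() time: strings
# may be UTF-8. Only creating a *symbol* (here, the column name) forces
# translation to the native encoding, and that happens inside parse(),
# which we only call when the translation is known to succeed.
local(if (identical("\u00FC", enc2native("\u00FC"))) {
  eval(parse(text = '
    DT = data.table::data.table("\u00FCber" = c(1, 0, 0, 0, 0))
    print(names(DT))
  ', keep.source = FALSE))
} else {
  cat("Skipped: native encoding cannot represent \u00FC\n")
})
```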

Review comment (Member):
it would be good to write that down somewhere as a reminder, but I'm not sure the best place to do it while being (1) discoverable and (2) not repetitive.

maybe (1) document it near require_utf8 in R/test.data.table and (2) add a comment by each eval(parse()) like "see require_utf8 description"?

eval(parse(text='
DT = data.table(pas = c(1:5, NA, 6:10), good = c(1:10, NA))
setnames(DT, "pas", "p\u00E4s")
test(1092, requires_utf8="\u00E4", eval(parse(text="DT[is.na(p\u00E4s), p\u00E4s := 99L]")), data.table("p\u00E4s" = c(1:5, 99L, 6:10), good = c(1:10,NA)))
Review comment (Member):
nested eval(parse())... gnarly

} else {
cat("Test 2194.7 skipped because it needs a UTF-8 locale.\n")
})
needed_chars = "\u0105\u017E\u016B\u012F\u0173\u0117\u0161\u0119"
Review comment (Member):
maybe easier to read in vector form?

Suggested change
needed_chars = "\u0105\u017E\u016B\u012F\u0173\u0117\u0161\u0119"
needed_chars = c("\u0105", "\u017E", "\u016B", "\u012F", "\u0173", "\u0117", "\u0161", "\u0119")

ja_ni = "\u4E8C"
ja_ko = "\u3053"
ja_n = "\u3093"
nc = paste0(accented_a, ja_ichi, ja_ni, ja_ko, ja_n)
Review comment (Member):
Suggested change
nc = paste0(accented_a, ja_ichi, ja_ni, ja_ko, ja_n)
nc = c(accented_a, ja_ichi, ja_ni, ja_ko, ja_n)


3. Vignettes are now built using `litedown` instead of `knitr`, [#6394](https://github.com/Rdatatable/data.table/issues/6394). Thanks @jangorecki for the suggestion and @ben-schwen and @aitap for the implementation.

4. `test()` gains new argument `requires_utf8` to skip tests when UTF-8 support is not available, [#7336](https://github.com/Rdatatable/data.table/issues/7336). Thanks @MichaelChirico for the suggestion and @ben-schwen for the implementation.
Review comment (Member):
test() is not exported, so I'm not sure this is the right framing. Maybe

The data.table test suite is a bit more robust to lacking UTF-8 support [...]



Development

Successfully merging this pull request may close these issues:
  • Add argument like 'requires_utf8' to test()
  • Escape UTF-8 dependent tests
  • Finish DtNonAsciiTests package and submit to CRAN

3 participants