-
Notifications
You must be signed in to change notification settings - Fork 1k
add requires_utf8 argument to tests #7388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #7388 +/- ##
=======================================
Coverage 98.97% 98.97%
=======================================
Files 87 87
Lines 16733 16737 +4
=======================================
+ Hits 16561 16565 +4
Misses 172 172 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Generated via commit 5568eb5 Download link for the artifact containing the test results: ↓ atime-results.zip
|
|
Sorry if I'm late to note this, but wouldn't a more reliable test for this be the same thing as we currently use for ñ in test 2266? A test may require some symbols (ñ, ü, ん) to be representable in the native encoding. The symbols may be represented using Unicode escapes ( |
Good point. I have integrated this for the utf8_check. |
| x1 = c("al\u00E4", "ala", "\u00E4allc", "coep") | ||
| x2 = c("ala", "al\u00E4") | ||
| tstc = function(y) unlist(lapply(y, function(x) as.character(as.name(x))), use.names=FALSE) | ||
| test(1088.1, requires_utf8="\u00E4", chmatch(x1, x2), match(x1, x2)) # should not fallback to "match" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe easier to understand/write tests as requires_utf8=c(x1, x2)?
also, maybe better as a local() check here too, to reduce the visual noise of all the identical requires_utf8= inputs?
| # for completness, include test from #2528 of non ascii LHS of := (it could feasibly fail in future due to something other than chmatch) | ||
|
|
||
| local(if (utf8_check("\u00E4")) { | ||
| eval(parse(text=' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need eval(parse())?
also, parse(keep.source=FALSE) for micro-improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the parser sees constructs like
data.table("\u00FCber" = c(1, 0, 0, 0, 0))...it needs to construct a "language" (LANGSXP) call object where TAG(CADR(call)) is a symbol whose PRINTNAME is a CHARSXP saying über. That CHARSXP must be in the native encoding: requiring a single encoding makes it possible to compare pointers to SYMSXP values for equality (unlike CHARSXP where we have to test NEED2UTF8 and so on), and the native encoding was the default back before R had string encodings.
So when the parser tries to translate that Unicode string into the native encoding, it fails, emits a warning, and probably fails the following test, because the resulting string contains a substitution sequence:
LC_ALL=C R -q -s -e 'parse(text = r"{data.table("\u00FCber" = c(1, 0, 0, 0, 0))}")'expression(data.table(`<U+00FC>ber` = c(1, 0, 0, 0, 0)))
Warning message:
In parse(text = "data.table(\"\\u00FCber\" = c(1, 0, 0, 0, 0))") :
unable to translate '<U+00FC>ber' to native encoding
Without the runtime eval(parse(...)), this warning happens during source() with no way to avoid it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be good to write that down somewhere as a reminder, but I'm not sure the best place to do it while being (1) discoverable and (2) not repetitive.
maybe (1) document it near require_utf8 in R/test.data.table and (2) add a comment by each eval(parse()) like "see require_utf8 description"?
| eval(parse(text=' | ||
| DT = data.table(pas = c(1:5, NA, 6:10), good = c(1:10, NA)) | ||
| setnames(DT, "pas", "p\u00E4s") | ||
| test(1092, requires_utf8="\u00E4", eval(parse(text="DT[is.na(p\u00E4s), p\u00E4s := 99L]")), data.table("p\u00E4s" = c(1:5, 99L, 6:10), good = c(1:10,NA))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nested eval(parse())... gnarly
| } else { | ||
| cat("Test 2194.7 skipped because it needs a UTF-8 locale.\n") | ||
| }) | ||
| needed_chars = "\u0105\u017E\u016B\u012F\u0173\u0117\u0161\u0119" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe easier to read in vector form?
| needed_chars = "\u0105\u017E\u016B\u012F\u0173\u0117\u0161\u0119" | |
| needed_chars = c("\u0105", "\u017E", "\u016B", "\u012F", "\u0173", "\u0117", "\u0161", "\u0119") |
| ja_ni = "\u4E8C" | ||
| ja_ko = "\u3053" | ||
| ja_n = "\u3093" | ||
| nc = paste0(accented_a, ja_ichi, ja_ni, ja_ko, ja_n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| nc = paste0(accented_a, ja_ichi, ja_ni, ja_ko, ja_n) | |
| nc = c(accented_a, ja_ichi, ja_ni, ja_ko, ja_n) |
|
|
||
| 3. Vignettes are now built using `litedown` instead of `knitr`, [#6394](https://github.com/Rdatatable/data.table/issues/6394). Thanks @jangorecki for the suggestion and @ben-schwen and @aitap for the implementation. | ||
|
|
||
| 4. `test()` gains new argument `requires_utf8` to skip tests when UTF-8 support is not available, [#7336](https://github.com/Rdatatable/data.table/issues/7336). Thanks @MichaelChirico for the suggestion and @ben-schwen for the implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test() is not exported, so I'm not sure this is the right framing. Maybe
The data.table test suite is a bit more robust to lacking UTF-8 support [...]

Closes #7336
Closes #1343
Closes #7333