[feature](function) support is_valid_utf8 function#62515
[feature](function) support is_valid_utf8 function#62515Mryange merged 3 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Found one blocking correctness issue.
Critical checkpoints:
- Goal of the task: The PR adds / plus a helper. The scalar function path is wired on FE/BE and covered by unit/regression tests, but the new column-level helper does not actually validate each row independently, so this goal is not fully met.
- Modification size/focus: The change is small and focused.
- Concurrency: Not applicable; this code path is single-threaded and does not introduce new shared-state synchronization.
- Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
- Configuration: No new configuration items.
- Compatibility: No storage-format, protocol, or rolling-upgrade compatibility changes found.
- Parallel code paths: FE Nereids metadata and BE execution registration were both updated for the SQL function path. No additional parallel execution path issue found beyond the helper bug below.
- Special conditions/checks: stores row bytes back-to-back, so UTF-8 validation must respect row boundaries explicitly.
- Test coverage: BE unit tests, function tests, and regression coverage were added, but they miss the critical negative case where an invalid UTF-8 sequence is split across adjacent rows.
- Test result files: The added file is consistent with the covered cases.
- Observability: No additional observability appears necessary for this feature.
- Transaction/persistence/data writes/FE-BE variable passing: Not applicable.
- Performance: The scalar function is a straightforward linear scan. No blocking performance issue found.
- Other issues: No other blocking issue found during this review.
Please fix the row-boundary bug in and add a unit test covering a split multibyte sequence across two rows.
| template <typename T> | ||
| bool ColumnStr<T>::is_valid_utf8() const { | ||
| return validate_utf8(reinterpret_cast<const char*>(chars.data()), chars.size()); | ||
| } |
There was a problem hiding this comment.
stores row payloads contiguously without any delimiter, so UTF-8 validity is not compositional here. Validating the whole buffer can return even when an individual row is invalid. A concrete case is rows "\xE4" and "\xB8\x96": each row is invalid by itself, but the concatenated buffer forms the valid UTF-8 sequence for , so this helper would incorrectly accept the column. Since this API is introduced as column-level UTF-8 validation, it needs to walk and validate each row independently.
|
Readable review summary for the previously submitted Critical checkpoints:
Blocking issue:
Please fix that helper and add a unit test covering a split multibyte sequence across two rows. |
|
/review |
There was a problem hiding this comment.
No blocking findings. The BE runtime implementation, FE/BE function registration, alias wiring, and regression/unit coverage are consistent with the stated goal of adding is_valid_utf8 / isValidUTF8 support.
Critical checkpoint conclusions:
- Goal / correctness: Met. The PR adds the scalar function on both FE and BE, and the runtime logic validates each row independently, including the cross-row boundary case covered by the added
ColumnStringunit test. - Scope / minimality: Good. The change is focused on one new string predicate plus targeted tests.
- Concurrency: Not applicable. The touched paths are scalar expression evaluation and column helpers with no new shared-state or lock interactions.
- Lifecycle / static initialization: No special lifecycle or static-init risks introduced.
- Configuration: None added.
- Compatibility: Additive only; no incompatible storage, protocol, or symbol changes observed.
- Parallel code paths: Applicable FE/BE registration paths are both updated, and the alias is registered on both sides.
- Special conditions: The per-row validation behavior is the correct special case here, and the regression/unit tests now cover it.
- Test coverage: Sufficient for runtime behavior. Added BE unit tests plus regression coverage for valid, invalid, NULL, alias, table data, and constant-expression equivalence. Residual gap: there is no FE executable implementation for
is_valid_utf8, so the newtestFoldConst(...)cases verify semantic equivalence rather than proving FE-side constant folding of this function. - Test result files: The new
.outcontent matches the added regression cases. - Observability: No extra logging/metrics needed for this simple scalar predicate.
- Transaction / persistence / data-write safety: Not applicable to the feature implementation.
- FE-BE argument passing: Not applicable beyond normal scalar function binding, and no missing path was identified.
- Performance: Expected O(total input bytes) validation cost; no obvious redundant work beyond the function’s required scan.
- Other issues: None blocking.
Overall opinion: acceptable to merge as-is; the only notable follow-up is optional FE executable support if we want is_valid_utf8 to be truly folded on FE rather than just validated by the existing fold-equivalence test.
|
run buildall |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
1 similar comment
|
run buildall |
FE Regression Coverage ReportIncrement line coverage |
7 similar comments
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
What problem does this PR solve?
Add
is_valid_utf8(s)/isValidUTF8(s)function that returnstrueif the input is valid UTF-8,falseotherwise. Also addsis_valid_utf8()method toColumnStrfor column-level UTF-8 validation.Release note
Support
is_valid_utf8/isValidUTF8scalar function to check whether a string is valid UTF-8 encoded.apache/doris-website#3551
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)