Skip to content

[feature](function) support is_valid_utf8 function#62515

Merged
Mryange merged 3 commits intoapache:masterfrom
Mryange:fea-is_valid_utf8
Apr 16, 2026
Merged

[feature](function) support is_valid_utf8 function#62515
Mryange merged 3 commits intoapache:masterfrom
Mryange:fea-is_valid_utf8

Conversation

@Mryange
Copy link
Copy Markdown
Contributor

@Mryange Mryange commented Apr 15, 2026

What problem does this PR solve?

Add is_valid_utf8(s) / isValidUTF8(s) function that returns true if the input is valid UTF-8, false otherwise. Also adds is_valid_utf8() method to ColumnStr for column-level UTF-8 validation.

Release note

Support is_valid_utf8 / isValidUTF8 scalar function to check whether a string is valid UTF-8 encoded.

apache/doris-website#3551

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 15, 2026

run buildall

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 15, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found one blocking correctness issue.

Critical checkpoints:

  • Goal of the task: The PR adds / plus a helper. The scalar function path is wired on FE/BE and covered by unit/regression tests, but the new column-level helper does not actually validate each row independently, so this goal is not fully met.
  • Modification size/focus: The change is small and focused.
  • Concurrency: Not applicable; this code path is single-threaded and does not introduce new shared-state synchronization.
  • Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
  • Configuration: No new configuration items.
  • Compatibility: No storage-format, protocol, or rolling-upgrade compatibility changes found.
  • Parallel code paths: FE Nereids metadata and BE execution registration were both updated for the SQL function path. No additional parallel execution path issue found beyond the helper bug below.
  • Special conditions/checks: stores row bytes back-to-back, so UTF-8 validation must respect row boundaries explicitly.
  • Test coverage: BE unit tests, function tests, and regression coverage were added, but they miss the critical negative case where an invalid UTF-8 sequence is split across adjacent rows.
  • Test result files: The added file is consistent with the covered cases.
  • Observability: No additional observability appears necessary for this feature.
  • Transaction/persistence/data writes/FE-BE variable passing: Not applicable.
  • Performance: The scalar function is a straightforward linear scan. No blocking performance issue found.
  • Other issues: No other blocking issue found during this review.

Please fix the row-boundary bug in and add a unit test covering a split multibyte sequence across two rows.

template <typename T>
bool ColumnStr<T>::is_valid_utf8() const {
return validate_utf8(reinterpret_cast<const char*>(chars.data()), chars.size());
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stores row payloads contiguously without any delimiter, so UTF-8 validity is not compositional here. Validating the whole buffer can return even when an individual row is invalid. A concrete case is rows "\xE4" and "\xB8\x96": each row is invalid by itself, but the concatenated buffer forms the valid UTF-8 sequence for , so this helper would incorrectly accept the column. Since this API is introduced as column-level UTF-8 validation, it needs to walk and validate each row independently.

@github-actions
Copy link
Copy Markdown
Contributor

Readable review summary for the previously submitted REQUEST_CHANGES review:

Critical checkpoints:

  • Goal of the task: The PR adds is_valid_utf8 / isValidUTF8 plus a ColumnStr::is_valid_utf8() helper. The scalar function path is wired on FE/BE and covered by unit/regression tests, but the new column-level helper does not actually validate each row independently, so this goal is not fully met.
  • Modification size/focus: The change is small and focused.
  • Concurrency: Not applicable; this code path is single-threaded and does not introduce new shared-state synchronization.
  • Lifecycle/static initialization: No special lifecycle or static initialization concerns found.
  • Configuration: No new configuration items.
  • Compatibility: No storage-format, protocol, or rolling-upgrade compatibility changes found.
  • Parallel code paths: FE Nereids metadata and BE execution registration were both updated for the SQL function path. No additional parallel execution path issue found beyond the helper bug below.
  • Special conditions/checks: ColumnStr stores row bytes back-to-back, so UTF-8 validation must respect row boundaries explicitly.
  • Test coverage: BE unit tests, function tests, and regression coverage were added, but they miss the critical negative case where an invalid UTF-8 sequence is split across adjacent rows.
  • Test result files: The added regression .out file is consistent with the covered cases.
  • Observability: No additional observability appears necessary for this feature.
  • Transaction/persistence/data writes/FE-BE variable passing: Not applicable.
  • Performance: The scalar function is a straightforward linear scan. No blocking performance issue found.
  • Other issues: No other blocking issue found during this review.

Blocking issue:

  • be/src/core/column/column_string.cpp: ColumnStr::is_valid_utf8() validates the concatenated chars buffer, which can incorrectly treat invalid per-row data as valid when a multibyte sequence is split across adjacent rows. The helper must validate each row using offsets.

Please fix that helper and add a unit test covering a split multibyte sequence across two rows.

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 15, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking findings. The BE runtime implementation, FE/BE function registration, alias wiring, and regression/unit coverage are consistent with the stated goal of adding is_valid_utf8 / isValidUTF8 support.

Critical checkpoint conclusions:

  • Goal / correctness: Met. The PR adds the scalar function on both FE and BE, and the runtime logic validates each row independently, including the cross-row boundary case covered by the added ColumnString unit test.
  • Scope / minimality: Good. The change is focused on one new string predicate plus targeted tests.
  • Concurrency: Not applicable. The touched paths are scalar expression evaluation and column helpers with no new shared-state or lock interactions.
  • Lifecycle / static initialization: No special lifecycle or static-init risks introduced.
  • Configuration: None added.
  • Compatibility: Additive only; no incompatible storage, protocol, or symbol changes observed.
  • Parallel code paths: Applicable FE/BE registration paths are both updated, and the alias is registered on both sides.
  • Special conditions: The per-row validation behavior is the correct special case here, and the regression/unit tests now cover it.
  • Test coverage: Sufficient for runtime behavior. Added BE unit tests plus regression coverage for valid, invalid, NULL, alias, table data, and constant-expression equivalence. Residual gap: there is no FE executable implementation for is_valid_utf8, so the new testFoldConst(...) cases verify semantic equivalence rather than proving FE-side constant folding of this function.
  • Test result files: The new .out content matches the added regression cases.
  • Observability: No extra logging/metrics needed for this simple scalar predicate.
  • Transaction / persistence / data-write safety: Not applicable to the feature implementation.
  • FE-BE argument passing: Not applicable beyond normal scalar function binding, and no missing path was identified.
  • Performance: Expected O(total input bytes) validation cost; no obvious redundant work beyond the function’s required scan.
  • Other issues: None blocking.

Overall opinion: acceptable to merge as-is; the only notable follow-up is optional FE executable support if we want is_valid_utf8 to be truly folded on FE rather than just validated by the existing fold-equivalence test.

@Mryange Mryange dismissed github-actions[bot]’s stale review April 15, 2026 06:57

第二次review过了

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 15, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 15, 2026

run buildall

1 similar comment
@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 15, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

7 similar comments
@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

FE Regression Coverage Report

Increment line coverage 92.31% (12/13) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (24/24) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.55% (27366/37205)
Line Coverage 57.23% (295672/516637)
Region Coverage 54.28% (245480/452289)
Branch Coverage 56.03% (106608/190256)

@Mryange Mryange merged commit 08b6ecd into apache:master Apr 16, 2026
29 of 30 checks passed
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 16, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

PR approved by anyone and no changes requested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants