[refine](column) replace field-based nested default with column-level operation in ColumnVariant#62547
[refine](column) replace field-based nested default with column-level operation in ColumnVariant#62547Mryange wants to merge 2 commits intoapache:masterfrom
Conversation
|
/review |
|
OpenCode automated review failed and did not complete. Error: Review step was skipped (possibly timeout or cancelled) Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
|
/review |
There was a problem hiding this comment.
I found a blocking behavioral regression in ColumnVariant::try_insert_default_from_nested.
Critical checkpoints:
- Goal / correctness: The PR aims to refactor nested default synthesis to a column-level implementation. It does not fully preserve the previous behavior: missing nested siblings now use a synthesized row even when the donor sibling row is
NULL, which changes the inserted value from the oldinsert_default()fallback to a copied/synthesized nested row. No test in this PR proves the null-row case still behaves correctly. - Change size / focus: The code deletion is small and focused, but it also removed the only direct coverage of the old transformation logic without replacing it with assertions for the changed path.
- Concurrency: I did not find new concurrency or lock-order issues in the touched code.
- Lifecycle / config / compatibility: No lifecycle, config, or wire/storage compatibility changes here.
- Parallel paths: The batch helper
try_insert_many_defaults_from_nested()already used the column-level clone path, but the single-row helper previously had an extra null guard. This PR made the two paths look similar by dropping that guard, which is the source of the regression. - Conditional checks: The removed
last_field.is_null()check was behaviorally significant; removing it is not a mechanical cleanup. - Test coverage: Existing tests in
column_variant_test.cppdo not assert the actual inserted row shape/nullability fortry_insert_default_from_nested, and this PR deletes the visitor-only test without adding a replacement for the null-donor case. - Test results / observability / transactionality / FE-BE propagation: Not applicable for this change.
- Performance: The new path is plausibly cheaper, but correctness regressed in the null-donor case.
- Other issues: none beyond the above blocker and the missing targeted regression test.
Please restore the old null-row behavior (or add an explicit justification if semantics are intentionally changing) and add a targeted unit/regression test that covers a missing nested sibling when the donor nested leaf row is NULL.
| if (last_field.is_null()) { | ||
| size_t leaf_size = leaf->data.size(); | ||
| if (leaf_size == 0) { | ||
| return false; |
There was a problem hiding this comment.
This changes behavior for nullable nested siblings. Before this patch, try_insert_default_from_nested() called get_last_field() and returned false when the donor leaf row was NULL, so the caller fell back to entry->data.insert_default(). After the refactor we always cut the last donor row and rebuild a default-shaped subcolumn from it, which means a row like items.tags = null can now synthesize a nested sibling value for a missing items.id instead of keeping that sibling NULL too. That is a user-visible semantic change in the JSON/variant parse path, not just a refactor. Please preserve the old null guard here or add an explicit test showing the new behavior is intended.
| @@ -3082,30 +3082,6 @@ TEST_F(ColumnVariantTest, get_field_info_all_types) { | |||
| } | |||
There was a problem hiding this comment.
The patch removes the only direct test around the deleted transformation logic, but it does not add any replacement coverage for the behavior that changed in try_insert_default_from_nested(). The existing try_insert_default_from_nested test only exercises the call and never asserts the inserted row contents/nullability. Please add a targeted test for the blocking case above: donor nested sibling row is NULL, target sibling path is missing, and we must still fall back to insert_default() rather than materializing a synthesized nested row.
|
/review |
1 similar comment
|
/review |
There was a problem hiding this comment.
Findings
be/src/core/column/column_variant.cpp:try_insert_default_from_nested()now rebuilds the synthetic row throughclone_with_default_values(FieldInfo), but the localFieldInfoonly carriesscalar_type_idandnum_dimensions. For metadata-carrying nested types, especially decimal arrays copied throughColumnVariant::try_insert()/insert_from(), that loses the entry type's precision and scale and can fail while materializing the missing sibling row. The old single-row path inserted a typed defaultFieldinto the existing column and preserved that metadata.
Critical Checkpoints
- Goal / correctness: The PR aims to replace the field round-trip with a column-level operation. It improves the fast path idea, but it does not preserve behavior for metadata-carrying nested types, and the current tests do not prove correctness for the changed path.
- Scope / minimality: The change is localized, but it also changes the single-row nested-default semantics by routing it through a helper with different type-metadata behavior.
- Concurrency: No new concurrency, locking, or lifecycle issues were introduced in the touched code.
- Configuration / compatibility: No config, persistence, or FE/BE protocol compatibility changes are involved here.
- Parallel code paths:
try_insert_many_defaults_from_nested()already uses the column-level helper; this PR makes the single-row path share the same metadata-loss risk instead of keeping its previous typed-field behavior. - Tests: Coverage is insufficient for this behavior change. The remaining
try_insert_default_from_nestedtest only smoke-calls the helper and does not assert the produced nested default values or typed metadata. - Observability / persistence / transactions: Not applicable for this refactor.
- Performance: The intended optimization is reasonable, but correctness needs to be preserved before the single-row path switches implementations.
I did not find additional blocking issues beyond this regression after tracing the insert, parse, finalize, and copy paths around ColumnVariant nested defaults.
| } | ||
|
|
||
| auto default_scalar = entry->data.get_least_common_type()->get_default(); | ||
| FieldInfo field_info = { |
There was a problem hiding this comment.
This changes the single-row path from inserting a typed default Field into the existing entry column to rebuilding a temporary subcolumn from FieldInfo. That is not equivalent for metadata-carrying types. Here field_info only keeps scalar_type_id and num_dimensions; precision and scale stay at their defaults. If a missing nested sibling is ARRAY<DECIMAL(10,2)> and we hit this path through ColumnVariant::try_insert() or insert_from(), clone_with_default_values() reaches create_array_of_type(TYPE_DECIMAL..., 0, 0) and throws from create_decimal(). The old implementation used entry->data.get_least_common_type()->get_default() and inserted that typed field directly, so it preserved the decimal metadata. Please preserve the full type metadata here before switching the single-row path over to clone_with_default_values().
There was a problem hiding this comment.
No blocking issues found in this PR.
Critical checkpoint conclusions:
- Goal / correctness: The change does achieve its stated goal of removing the per-row column -> Field -> visitor -> column round-trip and aligning
try_insert_default_from_nested()with the existing column-level batch path intry_insert_many_defaults_from_nested(). - Change size / focus: The modification is small and focused to the nested-default helper plus dead-code/test cleanup.
- Concurrency: Not applicable. This path is synchronous column manipulation with no new shared-state or lock behavior.
- Lifecycle / static init: Not applicable.
- Config / compatibility / FE-BE protocol: Not applicable. No new config, protocol, or persisted format changes.
- Parallel code paths: The main parallel path here is the batch helper; this PR improves consistency by making the single-row helper use the same column-level approach.
- Conditional checks: The previous null-row guard is preserved with
is_null_at(leaf_size - 1). - Tests: Existing unit coverage does touch
try_insert_default_from_nested()andclone_with_default_values(), but the dedicated assertions around exact defaulted nested shapes are still fairly weak, so mixed-dimension and null-donor cases remain the main residual testing gap. - Observability / persistence / transactionality: Not applicable.
- Performance: Likely improved, since the new path stays at the column level and removes unnecessary Field materialization and visitor traversal.
Overall summary: no critical blocking issues identified during review; residual risk is limited to test depth rather than an obvious correctness problem in the implementation.
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run beut |
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
2cc10c2 to
1048ac1
Compare
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
What problem does this PR solve?
Issue Number: N/A
Problem Summary:
ColumnVariant::try_insert_default_from_nestedwas filling missing nested subcolumn rows byconverting the last row of a sibling leaf column to a
Field, recursively walking theFieldtree via
FieldVisitorReplaceScalarsto replace scalar values with the type default, and theninserting the resulting
Fieldback into the column.This approach was unnecessarily expensive (column → Field → Field tree traversal → column) and
inconsistent with the batch counterpart
try_insert_many_defaults_from_nested, which operatesentirely at the column level using
cut+clone_with_default_values.The scalar visitor (
FieldVisitorReplaceScalars) existed as duplicate dead code in bothcolumn_variant.cppandvariant_util.cpp.Changes
try_insert_default_from_nested: rewrite to use the same column-level approach astry_insert_many_defaults_from_nested—leaf->data.cut(leaf_size - 1, 1).clone_with_default_values(field_info)— eliminating the Field round-trip and the dependency onget_default()/FieldVisitorReplaceScalars.FieldVisitorReplaceScalarsfrom bothcolumn_variant.cppandvariant_util.cpp; neither file has any remaining callers.field_visitorunit test incolumn_variant_test.cppthat directly tested the now-deleted visitor class.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)