[WIP] Fixing PySpark benchmark build & install #55954
Draft
sven-weber-db wants to merge 1 commit into
davidm-db pushed a commit to davidm-db/spark that referenced this pull request on May 18, 2026
### What changes were proposed in this pull request?

This PR extends metric-view support to **DS v2 catalogs** by routing `CREATE VIEW ... WITH METRICS` through the `ViewCatalog` / `TableViewCatalog` APIs introduced by [SPARK-52729](apache#51419) and finalized by [SPARK-56655](apache#55954). Third-party v2 catalogs that implement `ViewCatalog` can now host metric views with the same metadata fidelity as session-catalog metric views.

**1. V2 metric-view CREATE path -- shared with `CreateV2ViewExec`.** A new `CreateV2MetricViewExec` and `CreateV2ViewExec` both extend a new `V2CreateViewPreparation` trait (which itself extends `V2ViewPreparation`). The trait owns the shared CREATE-side `run()`: `viewExists` short-circuit on `IF NOT EXISTS`, `createOrReplaceView` for `OR REPLACE`, and cross-type collision decoding (`ViewAlreadyExistsException` -> `tableExists` -> `EXPECT_VIEW_NOT_TABLE.NO_ALTERNATIVE`). The metric-view subclass only supplies the metric-view-specific bits (no collation, schema-mode `UNSUPPORTED`, typed `viewDependencies`, `PROP_TABLE_TYPE = METRIC_VIEW`, `retainColumnMetadata = true`) via optional hooks on `V2ViewPreparation`. `DataSourceV2Strategy` intercepts `CreateMetricViewCommand` on a non-session catalog and routes to the new exec; the v1 session-catalog path stays in `CreateMetricViewCommand.run`.

**2. First-class `METRIC_VIEW` table type.**

- `CatalogTableType.METRIC_VIEW` is added alongside `EXTERNAL` / `MANAGED` / `VIEW`.
- `TableSummary.METRIC_VIEW_TABLE_TYPE = "METRIC_VIEW"` constant for the V2 surface.
- The previous `view.viewWithMetrics` property hack is removed; `CatalogTable.isMetricView` checks `tableType == METRIC_VIEW` directly.
- `V1Table.summarizeTableType` and `V1Table.toCatalogTable(catalog, ident, ViewInfo)` translate between the V2 property form and the V1 enum.
- HMS round-trip support: `HiveTableType` has no `METRIC_VIEW` variant (both regular views and metric views serialize as `VIRTUAL_VIEW`). `HiveExternalCatalog` now persists a `view.subType = METRIC_VIEW` property on write and lifts `tableType` back to `METRIC_VIEW` on read, so HMS-backed metric views survive the round trip.

**3. Repo-wide `tableType == VIEW` audit + `CatalogTable.isViewLike` helper.** Promoting metric views to a distinct `CatalogTableType` opens silent regressions wherever existing code branches on `VIEW`. To consolidate the audit and reduce divergence with the Databricks Runtime (which has the same helper), this PR introduces:

- `CatalogTable.isViewLike` instance method (DBR parity: today returns `tableType == VIEW || tableType == METRIC_VIEW`; forks may extend the set).
- `CatalogTable.isViewLike(t: CatalogTableType)` companion form for the few sites that have a `CatalogTableType` but no `CatalogTable` (e.g. `SessionCatalog.isView`, `verifyAlterTableType`, `HiveClientImpl.toHiveTableType`).

All 18 sites in `catalyst` / `core` / `hive` that previously did inline `tableType == VIEW || tableType == METRIC_VIEW` (or the `CatalogTableType.VIEW | CatalogTableType.METRIC_VIEW` pattern alternation) are now routed through these helpers, so adding a new view-like type in the future is a one-line change in the helper body. Notable touched call sites: `CatalogTable.toJsonLinkedHashMap` (DESCRIBE EXTENDED rows), `HiveExternalCatalog.{createTable, alterTable, restoreTableMetadata}`, `HiveClientImpl.toHiveTableType`, `SessionCatalog.isView`, `InMemoryCatalog.listViews`, `RelationResolution`, `Analyzer.lookupTableOrView`, `rules.scala`, `DataStreamWriter`, `DescribeRelationJsonCommand`, `AnalyzeColumnCommand`, `AnalyzePartitionCommand`, `CommandUtils.analyzeTable`, `V2SessionCatalog.dropTableInternal`, `verifyAlterTableType` in `ddl.scala`, and 3 sites in `tables.scala`.

**Explicit rejection (uniform error class):** `SHOW CREATE TABLE` on a metric view has no round-trippable `CREATE VIEW ... WITH METRICS` form, so it's rejected explicitly with the dedicated `UNSUPPORTED_SHOW_CREATE_TABLE.ON_METRIC_VIEW` error class on **both** the v1 session-catalog path (in `tables.scala`) and the v2 catalog path (in `DataSourceV2Strategy`), so users see the same actionable message regardless of catalog kind.

**4. Drop-command parity.**

- `DropTableCommand` (v1 path) treats both `VIEW` and `METRIC_VIEW` as views: `DROP TABLE` rejects either with `wrongCommandForObjectTypeError`, and `DROP VIEW` accepts either.
- `V2SessionCatalog.dropTableInternal` extends the existing "view rejected from `DROP TABLE`" guard to cover `METRIC_VIEW`.
- For non-session v2: `DropTableExec` (post-SPARK-56655) actively rejects with `WRONG_COMMAND_FOR_OBJECT_TYPE` ("Use DROP VIEW instead") when a view sits at the ident -- works unchanged for metric views since `TableViewCatalog`'s default `viewExists` derives from `loadTableOrView` and recognizes `MetadataTable + ViewInfo`.
- `ResolveSessionCatalog`'s `DropView` routing comment is clarified: v2 metric views fall through to `DataSourceV2Strategy` and `ViewCatalog.dropView`.

**5. Typed view dependencies (`ViewInfo.viewDependencies`).**

- New public DTOs in `org.apache.spark.sql.connector.catalog`: `Dependency` (sealed interface with `Dependency.table(String[])` / `Dependency.function(String[])` non-vararg factories), `TableDependency`, `FunctionDependency`, `DependencyList(Dependency[])`.
- `TableDependency` and `FunctionDependency` carry the dependency identifier as **structural multi-part name parts** (`record TableDependency(String[] nameParts)`), not a single dot-flattened string. Arity is preserved per source so multi-level-namespace V2 catalogs (e.g. Iceberg `cat.db1.db2.tbl` -> 4 parts) round-trip without ambiguity against quoted identifiers containing literal `.`. v1 sources resolved through the session catalog are normalized by a new `MetricViewHelper.qualifyV1` to a stable 3-part `[spark_catalog, db, table]` shape so consumers see deterministic arity per source kind (otherwise `TableIdentifier.nameParts` could return 1, 2, or 3 parts depending on what the analyzer captured).
- All three records (`TableDependency`, `FunctionDependency`, `DependencyList`) override `equals` / `hashCode` / `toString` using `Arrays.equals` / `Arrays.hashCode` / `Arrays.toString` to give value semantics on their array fields. Without the overrides, records' auto-generated methods on array fields fall through to `Object.equals` (reference equality), which would make structural multi-part names unusable as Map keys / for dedup. Each record also overrides the canonical accessor to return a defensive `clone()` so callers cannot mutate the record's internal array.
- `ViewInfo` gains a `viewDependencies` field and a `ViewInfo.Builder.withViewDependencies(...)` setter. Per the field's contract, `null` means "no dependency list was supplied" while an empty `DependencyList.of(new Dependency[0])` means "supplied but the object has none" -- metric-view CREATE always emits the latter, never the former, even when `collectTableDependencies` returns empty.
- `MetricViewHelper.collectTableDependencies` walks the analyzed plan and emits structural `Seq[Seq[String]]` parts; the v2 source arm preserves full namespace arity, the v1 source arms (`View`, `HiveTableRelation`, `LogicalRelation`) all route through `qualifyV1` for the stable 3-part shape.

**6. Multi-level-namespace targets for v2 metric views.** `MetricViewHelper.analyzeMetricViewText` previously required a `TableIdentifier`, capping the metric-view target at 3 name parts. v2 metric views with multi-level-namespace targets (e.g. `cat.db1.db2.mv`) failed at `ident.asTableIdentifier` with `requiresSinglePartNamespaceError`. The helper now takes `nameParts: Seq[String]` directly; call sites in both the v1 path (`CreateMetricViewCommand`) and the v2 path (`DataSourceV2Strategy`) updated. The helper now also returns `(LogicalPlan, MetricView)` so callers don't have to re-parse the YAML body just to read descriptor properties.

**7. `metric_view.*` descriptor properties (v1/v2 parity).** `MetricView.getProperties` produces canonical descriptive properties (`metric_view.from.type`, `metric_view.from.name` / `metric_view.from.sql`, `metric_view.where`) that **both** the v1 path (`CreateMetricViewCommand.createMetricViewInSessionCatalog`) and the v2 path (`DataSourceV2Strategy`) merge into the view's properties bag, so catalog browsers and tooling see the same descriptor rows in `DESCRIBE TABLE EXTENDED` regardless of catalog kind. Long values are truncated to `Constants.MAXIMUM_PROPERTY_SIZE`; the Scaladoc on `getProperties` calls out that `metric_view.from.sql` is therefore a descriptive value, not a round-trippable representation -- consumers should re-read the YAML body for the full SQL.

**8. `ViewInfo` constructor cleanup.** The metric-view-specific `PROP_TABLE_TYPE = METRIC_VIEW` special case is dropped from the generic `ViewInfo` constructor in favor of `properties().putIfAbsent(...)`. Callers that want a more specific kind (e.g. `METRIC_VIEW`) call `BaseBuilder.withTableType(...)` before `build()` -- exercised by `CreateV2MetricViewExec` via the new `V2ViewPreparation.tableType` hook.

**9. `ViewHelper.aliasPlan(retainMetadata)`.** The user-specified-column-with-comment branch in `aliasPlan` previously dropped existing column metadata. A new `retainMetadata: Boolean = false` parameter merges the analyzed attribute's metadata into the new comment metadata. `ViewHelper.prepareTable` passes `retainMetadata = isMetricView` (v1 path); `V2ViewPreparation` exposes a `retainColumnMetadata` hook that `CreateV2MetricViewExec` overrides to `true` (v2 path). Both preserve the per-column `metric_view.type` / `metric_view.expr` keys that the analyzer attaches to dimensions and measures even when the user renames columns and adds comments.

**10. Error classes.**

- New `INVALID_METRIC_VIEW_YAML` (sqlState 42K0L). `MetricViewPlanner.parseYAML`'s catch blocks now route through `QueryCompilationErrors.invalidMetricViewYamlError` instead of `SparkException.internalError`, so a typo in the user's YAML body surfaces as a user-correctable `AnalysisException` rather than "please contact support".
- New `UNSUPPORTED_SHOW_CREATE_TABLE.ON_METRIC_VIEW` (sqlState 0A000), used by both the v1 session-catalog path and the v2 catalog path so `SHOW CREATE TABLE` on a metric view produces the same actionable message regardless of catalog kind.

**11. Misc.**

- `MetricViewCanonical.parseSource` accepts multipart identifiers (`parseMultipartIdentifier`) so 3-part `catalog.schema.table` source references work as `AssetSource`.

### Why are the changes needed?

Before this PR, metric-view DDL only worked against the session catalog: the create path called `SessionCatalog.createTable` directly, and there was no way for a third-party v2 catalog (Unity Catalog, Hive Metastore catalog, custom REST catalogs, etc.) to own a metric view's lifecycle. SPARK-52729 / SPARK-56655 shipped `ViewCatalog` and `TableViewCatalog` as the public v2 surface for catalog-managed views; metric views are a kind of view and naturally belong on this surface.

Once metric views can live on a v2 catalog, two more constraints surface:

1. **Type discriminator.** A consumer reading a row through `ViewCatalog.loadView` needs to know it's a metric view, not a plain SQL view, so it can render the right UI / planner output. Encoding this in `PROP_TABLE_TYPE = METRIC_VIEW` keeps the distinction wire-compatible and lets `V1Table.toCatalogTable` reconstruct `CatalogTableType.METRIC_VIEW` on the read path.
2. **Structured dependency lineage.** Metric views always reference at least one source table; cataloging that lineage as flat string properties or single dot-joined strings loses arity for multi-level namespaces and is ambiguous against quoted identifiers. A typed `DependencyList` of `TableDependency` / `FunctionDependency` with structural `String[] nameParts` lets catalogs persist the lineage as a first-class field with full fidelity.

The remaining changes (drop-command parity, `aliasPlan` metadata retention, `metric_view.*` properties, `parseMultipartIdentifier`, `tableType == VIEW` audit + `isViewLike` helper, multi-level-namespace lift, HMS round-trip marker) are mechanical follow-ups that fall out of supporting metric views as a real `CatalogTableType` and as a v2 catalog citizen -- without them, basic operations like `DROP VIEW`, `DESCRIBE TABLE EXTENDED`, `CREATE VIEW (a COMMENT 'c') WITH METRICS ...`, or `CREATE VIEW cat.db1.db2.mv WITH METRICS ...` would silently degrade.

### Does this PR introduce _any_ user-facing change?

Yes, both for end users and for catalog plugin developers:

**End users:**

- `CREATE VIEW <ident> WITH METRICS ...` now works against any v2 catalog that implements `ViewCatalog`, including catalogs with multi-level namespace targets. Previously it was rejected with `MISSING_CATALOG_ABILITY.VIEWS` for non-session catalogs, and capped at single-level namespaces. `IF NOT EXISTS` and `OR REPLACE` are honored on the v2 path (regression vs. v1 fixed).
- A v2 metric view can be queried with `SELECT region, measure(count_sum) FROM <mv> ...`, dropped with `DROP VIEW`, listed via `SHOW VIEWS` (and via `SHOW TABLES` on a `TableViewCatalog`, matching v1 SHOW TABLES output), and described with `DESCRIBE TABLE` / `DESCRIBE TABLE EXTENDED`. `DROP TABLE` on a metric view throws `WRONG_COMMAND_FOR_OBJECT_TYPE` ("Use DROP VIEW instead").
- `ALTER VIEW <metric_view> RENAME TO ...` is wired through `RenameV2ViewExec` and preserves the metric-view kind across the rename.
- `SHOW CREATE TABLE` on a metric view throws `UNSUPPORTED_SHOW_CREATE_TABLE.ON_METRIC_VIEW` (no round-trippable form yet) on both the v1 and v2 paths -- same error class regardless of catalog kind.
- Session-catalog metric views are now stored as `CatalogTableType.METRIC_VIEW` instead of `CatalogTableType.VIEW + view.viewWithMetrics=true`. Observable in `DESCRIBE TABLE EXTENDED`'s `Type` row and the `tableType` column of the `tables` system table. SQL behavior is unchanged. Hive-metastore-backed metric views also round-trip through HMS via a `view.subType = METRIC_VIEW` property marker.
- `DESCRIBE TABLE EXTENDED` on metric views (v1 and v2) now consistently surfaces `metric_view.from.type` / `metric_view.from.name` / `metric_view.from.sql` / `metric_view.where` descriptor rows.
- Error messages from `DROP TABLE` / `DROP VIEW` mismatch now mention `METRIC_VIEW` alongside `VIEW`.
- Malformed metric-view YAML now surfaces as `INVALID_METRIC_VIEW_YAML` (user-correctable) instead of "Spark internal error, please contact support".

**Catalog plugin developers:**

- New public API surface in `org.apache.spark.sql.connector.catalog`: sealed interface `Dependency` with `permits TableDependency, FunctionDependency`, both records carrying `String[] nameParts`. `Dependency.table(String[])` / `Dependency.function(String[])` static factories (non-vararg per review; callers pass an existing array directly). `DependencyList(Dependency[])` with `DependencyList.of(Dependency[])` factory. All three records override `equals` / `hashCode` / `toString` to give value semantics on their array fields, and the canonical accessors return a defensive `clone()` so internal state is not mutable through the public API. All `Evolving`, `since 4.2.0`. Note: today's only producer in Spark itself is metric-view dependency extraction, which emits `TableDependency` only; `FunctionDependency` and `Dependency.function(...)` are exposed as groundwork for future producers (e.g. SQL UDF dep tracking).
- `ViewInfo` gains a typed `viewDependencies()` accessor and `ViewInfo.Builder.withViewDependencies(...)` setter. `viewDependencies` is populated only on the non-session v2 CREATE path; v1 metric views (and v2 metric views read back through `V1Table.toCatalogTable`) carry `null`. Catalog plugin authors persisting dependency lineage should treat the field as v2-only for now -- broadening to v1 is a tracked follow-up.
- `TableSummary.METRIC_VIEW_TABLE_TYPE = "METRIC_VIEW"` constant.
- `CatalogTableType.METRIC_VIEW` enum value (v1 surface).
- `CatalogTable.isViewLike` instance + `CatalogTable.isViewLike(CatalogTableType)` companion helpers (DBR parity helpers for "does this table behave like a view at resolution / DDL time?"). Forks that add their own view-like types (e.g. DBR's `MATERIALIZED_VIEW`, `STREAMING_TABLE`) only need to extend the helper body.
- `V2ViewPreparation` (private to `org.apache.spark.sql.execution.datasources.v2`) gains optional `viewDependencies` / `tableType` / `retainColumnMetadata` hooks.

### How was this patch tested?

`MetricViewV2CatalogSuite` -- 31 tests across 5 sections, all against an in-memory `ViewCatalog` test fixture (`MetricViewRecordingCatalog extends InMemoryTableCatalog with TableViewCatalog`):

**Section 1 -- CREATE-related (11 tests):**

- V2 catalog receives `METRIC_VIEW` table type and view text via `ViewInfo`.
- V2 catalog path populates `metric_view.*` descriptor properties + view context (`currentCatalog` / `currentNamespace`) + captured SQL configs.
- V2 catalog path captures `SQLSource` and comment.
- Metric view columns carry `metric_view.type` / `metric_view.expr` in column metadata.
- User-specified column names with comments preserve `metric_view.*` metadata (pins the `aliasPlan(retainMetadata = true)` fix).
- `CREATE OR REPLACE VIEW ... WITH METRICS` replaces an existing v2 metric view (asserts on the replacement's distinguishing fields: queryText, `metric_view.where`, dependencies).
- `CREATE VIEW IF NOT EXISTS ... WITH METRICS` is a no-op when the view exists (catalog never sees the second `createView` call).
- `CREATE VIEW ... WITH METRICS` over a v2 table at the ident throws `TABLE_OR_VIEW_ALREADY_EXISTS` (analyzer-time pre-check).
- `CREATE VIEW IF NOT EXISTS ... WITH METRICS` is a no-op when a v2 table sits at the ident (v1 parity).
- `CREATE VIEW ... WITH METRICS` on a non-`ViewCatalog` catalog fails with `MISSING_CATALOG_ABILITY.VIEWS`.
- `CREATE VIEW ... WITH METRICS` at a multi-level-namespace v2 target (`testcat.ns_a.ns_b.mv_deep`) succeeds (pins the `analyzeMetricViewText` lift to `Seq[String]`).

**Section 2 -- Dependency extraction (5 tests):**

- SQL source `JOIN` captures both tables as 3-part `nameParts`.
- SQL source subquery deduplicates same-table references.
- SQL source self-join deduplicates same-table references.
- V1 session-catalog source emits exactly 3 parts, normalized to `[spark_catalog, db, table]` by `qualifyV1`.
- Multi-level V2 namespace source (`testcat.ns_a.ns_b.events_deep`) emits 4-part `nameParts`.

**Section 3 -- SELECT cases (5 tests, modeled on `MetricViewSuite` patterns):**

- `SELECT measure(count_sum) FROM <mv> GROUP BY region` returns aggregated rows (exercises the full `loadTableOrView` -> `MetadataTable(ViewInfo)` -> `V1Table.toCatalogTable(ViewInfo)` -> `ResolveMetricView` round-trip).
- `SELECT measure(...) WHERE region = ...` -- query-layer filter on top of the view.
- View's pre-defined `where` clause is applied (`where = Some("count > 1")` filters at view-resolution time).
- Multiple measures with different aggregations (sum / sum / max).
- `ORDER BY measure(...) DESC LIMIT 1` over the metric view.

**Section 4 -- DESCRIBE cases (2 tests):**

- `DESCRIBE TABLE EXTENDED` round-trips through `loadTableOrView` and emits the `View Text` / `Type` rows (gated through `CatalogTable.isViewLike` in `toJsonLinkedHashMap`, which now recognizes `METRIC_VIEW`).
- `DESCRIBE TABLE` (non-EXTENDED) returns the aliased columns.

**Section 5 -- DROP / SHOW / RENAME cases (8 tests):**

- `DROP VIEW` succeeds on a v2 metric view.
- `DROP VIEW IF EXISTS` on a non-existent v2 metric view is a no-op.
- `DROP TABLE` on a v2 metric view throws `WRONG_COMMAND_FOR_OBJECT_TYPE` ("Use DROP VIEW instead", per SPARK-56655's `DropTableExec`) and asserts the metric view is **not** deleted.
- `DROP TABLE IF EXISTS` on a v2 metric view also throws (`IF EXISTS` doesn't silence the wrong-type error, v1 parity).
- `SHOW CREATE TABLE` on a v2 metric view throws `UNSUPPORTED_SHOW_CREATE_TABLE.ON_METRIC_VIEW` (same dedicated error class as the v1 path).
- `SHOW TABLES` on a `TableViewCatalog` lists both tables and metric views (matches v1 SHOW TABLES output per SPARK-56655).
- `SHOW VIEWS` lists v2 metric views.
- `ALTER VIEW <metric_view> RENAME TO ...` succeeds and preserves the metric-view kind across the rename (pins `RenameV2ViewExec` end-to-end against the fixture's `renameView`).

Existing session-catalog metric-view tests (`MetricViewSuite`, `SimpleMetricViewSuite`, `HiveMetricViewSuite`) and v1 path tests pass unchanged. `DDLSuite` and `HiveDDLSuite` had their `tableTypes` enumerations updated to include `METRIC_VIEW` in two assertion lists. `PlanResolutionSuite` test fixture was updated to stub the new `CatalogTable.isViewLike` method on the Mockito mock.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Sonnet 4.7)

Closes apache#55487 from chenwang-databricks/metric-view-on-51419.

Lead-authored-by: Chen Wang <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
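
The argument for structural name parts in the commit message above can be illustrated with a small, language-neutral sketch. The Python below is not the Spark implementation: `dot_join` and `qualify_v1` are hypothetical stand-ins, and the padding rule in `qualify_v1` (fill in a current database and `spark_catalog`) is an assumption about how a 1-, 2-, or 3-part v1 name might be normalized to the stable 3-part shape the commit describes.

```python
from typing import Sequence, Tuple

NameParts = Tuple[str, ...]


def dot_join(parts: Sequence[str]) -> str:
    """Flatten name parts into one dot-separated string (the lossy form)."""
    return ".".join(parts)


def qualify_v1(
    parts: Sequence[str],
    current_db: str = "default",          # assumed session default, for illustration
    session_catalog: str = "spark_catalog",
) -> NameParts:
    """Hypothetical normalizer: pad a 1-, 2-, or 3-part v1 name to 3 parts."""
    parts = tuple(parts)
    if len(parts) == 1:
        return (session_catalog, current_db, parts[0])
    if len(parts) == 2:
        return (session_catalog, parts[0], parts[1])
    if len(parts) == 3:
        return parts
    raise ValueError(f"expected 1 to 3 name parts, got {len(parts)}")


# A 4-level name and a 3-level name whose namespace contains a literal dot.
four_level = ("cat", "db1", "db2", "tbl")
quoted_dot = ("cat", "db1.db2", "tbl")

# Flattened, the two are indistinguishable; as tuples they stay distinct.
same_when_flattened = dot_join(four_level) == dot_join(quoted_dot)
distinct_as_parts = four_level != quoted_dot

# Tuples compare by value, so normalized names deduplicate in a set.
deduped = {
    qualify_v1(["t"]),
    qualify_v1(["default", "t"]),
    qualify_v1(["spark_catalog", "default", "t"]),
}
```

Python tuples already compare by value, which is the behavior the commit obtains in Java by overriding `equals` / `hashCode` on records that hold `String[]` fields.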
"html_dir": ".asv/html",
"build_command": [
    "python -m pip wheel --no-deps -w {build_cache_dir} {conf_dir}/../packaging/classic"
    "{build_dir}/build/sbt -batch -DskipTests clean package",
Contributor
I think we can differentiate two modes:
- Run with full Spark/PySpark, which requires building Spark (Scala).
- Run with Python only: so far, all asv benchmarks focus on Python-only changes, such as benchmarking pure-Python eval types and pyarrow methods. This mode does not require Spark.
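
One way to picture the two modes is a small helper that picks the `build_command` list per mode. This is only a sketch: the mode names `"full"` and `"python-only"`, the function `build_commands`, and the ordering of the sbt step before the wheel step are all assumptions for illustration; only the two command strings come from the diff above.

```python
from typing import List

# Command strings copied from the diff hunk under review.
PIP_WHEEL = (
    "python -m pip wheel --no-deps -w {build_cache_dir} "
    "{conf_dir}/../packaging/classic"
)
SBT_PACKAGE = "{build_dir}/build/sbt -batch -DskipTests clean package"


def build_commands(mode: str) -> List[str]:
    """Return the build steps for a hypothetical benchmark mode.

    "python-only": build just the PySpark wheel, no Scala build.
    "full":        build Spark with sbt first, then the wheel (assumed order).
    """
    if mode == "python-only":
        return [PIP_WHEEL]
    if mode == "full":
        return [SBT_PACKAGE, PIP_WHEEL]
    raise ValueError(f"unknown mode: {mode!r}")
```

Keeping the Python-only path free of the sbt step would let benchmarks that touch only Python code skip the Scala build entirely.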
What changes were proposed in this pull request?
This PR is work in progress.
Why are the changes needed?
Does this PR introduce any user-facing change?
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?