test: support fallback chain in CometPlanStabilitySuite, dedupe existing goldens #4129
andygrove wants to merge 2 commits into apache:main
Conversation
Adds a fallback mechanism so that when a Spark-version-specific plan-stability
directory does not contain a golden directory for a given query, the suite
walks the chain of older-version directories and ultimately the base directory
(approved-plans-vX_Y). This lets each version's directory store only the
queries whose plans actually changed against the previous version.
CometPlanStabilitySuite gains a fallbackGoldenFilePaths: Seq[String] hook.
getDirForTest is unchanged in regenerate mode (always writes to the primary
directory) and in the case where the primary already has a directory for the
query; otherwise it walks the fallback list and returns the first hit. Both
V1_4 and V2_7 subclasses build the chain via planNameChain(variant), which
assembles the correct sequence based on isSpark{35,40,41}Plus.
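The chain assembly could be sketched roughly as follows. This is a minimal sketch, not Comet's actual code: the boolean parameters stand in for the `isSpark{35,40,41}Plus` checks, and the real signature may differ.

```scala
// Hypothetical sketch of planNameChain: newest version directory first,
// ending at the unsuffixed base directory (approved-plans-<variant>).
def planNameChain(variant: String,
                  isSpark41Plus: Boolean,
                  isSpark40Plus: Boolean,
                  isSpark35Plus: Boolean): Seq[String] = {
  val versioned = Seq(
    isSpark41Plus -> s"approved-plans-$variant-spark4_1",
    isSpark40Plus -> s"approved-plans-$variant-spark4_0",
    isSpark35Plus -> s"approved-plans-$variant-spark3_5"
  ).collect { case (true, dir) => dir }
  // The fallback walk always bottoms out at the base directory.
  versioned :+ s"approved-plans-$variant"
}
```

The head of the resulting sequence would serve as the primary directory and the tail as the fallback list.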
When SPARK_GENERATE_GOLDEN_FILES=1, afterAll() walks the just-written primary
directory and drops any query directory whose contents match what the fallback
chain resolves to, mirroring the read-time logic. This keeps version-specific
directories sparse without any extra bash plumbing; dev/regenerate-golden-files.sh
just invokes the suites and now also accepts --spark-version 4.1 (now supported
on main).
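The afterAll() prune described above could look roughly like this. A hedged sketch, not Comet's implementation: the helper names `sameContents` and `pruneDuplicates` are assumptions, as is the flat one-level layout of each query directory.

```scala
import java.io.File
import java.nio.file.Files

// True when two flat golden directories hold the same file names
// with byte-identical contents.
def sameContents(a: File, b: File): Boolean = {
  val as = Option(a.listFiles()).getOrElse(Array.empty[File]).sortBy(_.getName)
  val bs = Option(b.listFiles()).getOrElse(Array.empty[File]).sortBy(_.getName)
  as.length == bs.length && as.zip(bs).forall { case (f, g) =>
    f.getName == g.getName &&
      java.util.Arrays.equals(Files.readAllBytes(f.toPath), Files.readAllBytes(g.toPath))
  }
}

// Drop every query directory in `primary` whose contents match what the
// fallback chain resolves to, mirroring the read-time lookup.
def pruneDuplicates(primary: File, fallbacks: Seq[File]): Unit =
  for (queryDir <- Option(primary.listFiles()).getOrElse(Array.empty[File])
       if queryDir.isDirectory) {
    val resolved = fallbacks.iterator
      .map(f => new File(f, queryDir.getName))
      .find(_.isDirectory)
    if (resolved.exists(sameContents(queryDir, _))) {
      queryDir.listFiles().foreach(_.delete())
      queryDir.delete()
    }
  }
```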
…chain
Drops 766 query directories from approved-plans-{v1_4,v2_7}-spark{3_5,4_0,4_1}
whose contents are identical to what the fallback chain resolves to (the
older-version directory or, ultimately, the base approved-plans-{v1_4,v2_7}).
Each version-specific directory now retains only the queries whose plans
actually diverge from the previous tier:
approved-plans-v1_4-spark3_5: 196 -> 10 dirs (10 diverge from base)
approved-plans-v1_4-spark4_0: 194 -> 12 dirs
approved-plans-v1_4-spark4_1: 196 -> 10 dirs
approved-plans-v2_7-spark3_5: 60 -> 4 dirs
approved-plans-v2_7-spark4_0: 60 -> 4 dirs
approved-plans-v2_7-spark4_1: 60 -> 4 dirs
Verified locally that CometTPCDSV1_4_PlanStabilitySuite (194 tests) and
CometTPCDSV2_7_PlanStabilitySuite (64 tests) pass with 0 failures against
-Pspark-3.5, -Pspark-4.0, and -Pspark-4.1. Spark 3.4 reads only the base
directory and is unaffected.
comphead
left a comment
Thanks @andygrove, this is a nice cleanup. Just thinking aloud about the dedup: can there be a situation where the same query has different plans between Spark versions?
Yes, some queries have different plans between Spark versions. There are some optimizer changes in 4.1 that affected a few queries and produced different plans than 4.0, but most queries were not affected.
I see, so if such a discrepancy happens, we would need to attach the query plans for all specific Spark versions like before?
kazuyukitanimura
left a comment
looks like there is a conflict
```scala
fallbackGoldenFilePaths.iterator
  .map(p => new File(p, goldenFileName))
  .find(_.isDirectory)
  .getOrElse(primary)
```
is it correct to return primary as default?
Yes, we will store all unique plans per query.
Which issue does this PR close?
Closes #.
Rationale for this change
Each Spark-version-specific TPC-DS plan-stability directory currently stores a full copy of every query's golden plan, even when that query's plan is byte-identical to the previous version's. After the recent Spark 4.1 enablement (#4106), the per-version `approved-plans-*-spark4_1` directories duplicated 196 of 206 (v1_4) and 60 of 64 (v2_7) plans against the 4.0 directory, which itself was largely a duplicate of 3.5, which was largely a duplicate of the base. This pattern grows quadratically as new versions are added. This PR teaches `CometPlanStabilitySuite` to fall back through the chain of older version directories at read time, so each version-specific directory only needs to contain the queries whose plans actually diverge.

What changes are included in this PR?
`CometPlanStabilitySuite` gains a `protected def fallbackGoldenFilePaths: Seq[String]` hook. `getDirForTest` is unchanged in regenerate mode (always writes to the primary directory). For read-time lookups, when the primary `goldenFilePath` does not contain a directory for the active query, the suite walks the fallback list and returns the first existing directory.

Both `CometTPCDSV1_4_PlanStabilitySuite` and `CometTPCDSV2_7_PlanStabilitySuite` build their fallback chain via `CometPlanStabilitySuite.planNameChain(variant)`, which assembles the right sequence based on `isSpark{35,40,41}Plus`. The chain ends at the unsuffixed base directory.
When `SPARK_GENERATE_GOLDEN_FILES=1`, the suite's `afterAll()` walks the just-written primary directory and drops any query directory whose contents match what the fallback chain resolves to. This mirrors the read-time logic and keeps version-specific directories sparse without any extra bash plumbing. `dev/regenerate-golden-files.sh` is now just a thin wrapper that invokes the suites for each Spark version and accepts `--spark-version 4.1` (now supported on `main`).

Applies the prune once across the existing tree, removing 766 duplicate query directories from:

- `approved-plans-v1_4-spark3_5`
- `approved-plans-v1_4-spark4_0`
- `approved-plans-v1_4-spark4_1`
- `approved-plans-v2_7-spark3_5`
- `approved-plans-v2_7-spark4_0`
- `approved-plans-v2_7-spark4_1`

How are these changes tested?
Locally, `CometTPCDSV1_4_PlanStabilitySuite` (194 tests) and `CometTPCDSV2_7_PlanStabilitySuite` (64 tests) pass with 0 failures against `-Pspark-3.5`, `-Pspark-4.0`, and `-Pspark-4.1` after the prune. Spark 3.4 reads only the base directory and is unaffected. CI will exercise all four matrix profiles.

Verified the auto-prune end-to-end: a full regen of `-Pspark-4.1` writes all 206 query directories to `approved-plans-v1_4-spark4_1/`, then `afterAll()` drops the 196 that match what the fallback chain (4.0 → 3.5 → base) would resolve to, leaving the 10 divergent directories. Repeat regens are idempotent.
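Putting the read-time pieces together, the lookup could be sketched like this. A minimal sketch under the descriptions above; the parameter names are illustrative, not Comet's actual `getDirForTest` signature.

```scala
import java.io.File

// In regenerate mode, always target the primary directory; otherwise use the
// primary if it already holds this query, else the first fallback hit
// (defaulting back to the primary so a genuinely missing plan still fails
// against the expected location).
def getDirForTest(primaryPath: String,
                  fallbackGoldenFilePaths: Seq[String],
                  goldenFileName: String,
                  regenerateGoldenFiles: Boolean): File = {
  val primary = new File(primaryPath, goldenFileName)
  if (regenerateGoldenFiles || primary.isDirectory) primary
  else
    fallbackGoldenFilePaths.iterator
      .map(p => new File(p, goldenFileName))
      .find(_.isDirectory)
      .getOrElse(primary)
}
```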