[SPARK-54812][SQL] Make executable commands not execute on resultDf.cache() #53572
+269
−39
What changes were proposed in this pull request?
Follow-up to #51032. That PR changed V2WriteCommand so that it no longer executes eagerly on df.cache(). However, a number of other commands still do.
Ideally, CacheManager would be skipped for every Command, because commands have already been executed eagerly by the time resultDf.cache() is called. The problem is that doing so may be a behavior change.
In some cases we are lucky: a command such as DescribeTableExec holds an in-memory reference to the Table object and keeps returning the old result despite repeated execution.
Others do not: ShowTables commands, for example, may show a different result if run later.
To minimize backward-compatibility issues, I add a new interface, UsesCachedData, that keeps the existing behavior for the commands that implement it. Going forward, all Commands bypass the CacheManager by default.
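The opt-in design described above can be illustrated with a small toy model. This is only a sketch in plain Python, not Spark code: the classes `Command`, `CacheManager`, and `ResultDf` are simplified stand-ins, `LegacyShowTables` is a hypothetical command invented for the example, and the assumption that caching re-runs the command's action is a simplification of what the real CacheManager does. Only the name `UsesCachedData` comes from this PR.

```python
class Command:
    """Toy stand-in for a plan node that is executed eagerly."""

    def __init__(self, name, action):
        self.name = name
        self.action = action  # callable that runs the command and returns rows


class UsesCachedData:
    """Marker mixin (name from the PR): opt back in to the legacy caching path."""


class LegacyShowTables(Command, UsesCachedData):
    """Hypothetical command that keeps the old behavior."""


class CacheManager:
    """Toy cache manager; assumes caching a plan re-executes it."""

    def __init__(self):
        self.cached = {}

    def cache_query(self, plan):
        # Commands already ran eagerly, so skip them unless they opt in.
        if isinstance(plan, Command) and not isinstance(plan, UsesCachedData):
            return False
        # Legacy path: building the cache entry runs the plan again.
        self.cached[plan.name] = plan.action()
        return True


class ResultDf:
    """Toy result DataFrame: the command runs once when the result is created."""

    def __init__(self, plan, manager):
        self.plan = plan
        self.manager = manager
        self.rows = plan.action()  # eager execution

    def cache(self):
        self.manager.cache_query(self.plan)
        return self


# Count how often each command body actually runs.
runs = {"insert": 0, "show": 0}


def insert_rows():
    runs["insert"] += 1  # side effect we do not want repeated
    return []


def show_tables():
    runs["show"] += 1
    return ["t1"]


manager = CacheManager()

# Default: a plain Command bypasses the cache manager, so cache() is a no-op.
ResultDf(Command("insert", insert_rows), manager).cache()

# Opt-in: a command marked UsesCachedData still goes through the legacy path.
ResultDf(LegacyShowTables("show", show_tables), manager).cache()
```

In this model the side-effecting command runs exactly once, while the command that opts in is executed a second time when its result is cached, which mirrors the trade-off between safety and backward compatibility that the description outlines.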
Why are the changes needed?
To prevent a command with side effects from being executed again when a user calls df.cache() on the command's result. Many such commands are dangerous to re-run, because the second execution happens without the user expecting it (df.cache() triggers another action on the table).
Does this PR introduce any user-facing change?
Calling resultDf.cache() on the result of a command with side effects (which used to fail or behave dangerously) should now be a no-op.
How was this patch tested?
Existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No