@szehon-ho szehon-ho commented Dec 23, 2025

What changes were proposed in this pull request?

Follow-up to #51032. That PR changed V2WriteCommand so that it no longer executes eagerly on df.cache(). However, many other commands still do.

val df = sql("CREATE TABLE...")
df.cache()  // executes again, fails with TableAlreadyExistsException

Ideally, we would skip CacheManager for every Command, because commands have already been executed eagerly by the time resultDf.cache() is called. The problem is that this may be a behavior change.

In some cases we are lucky: the command (DescribeTableExec, for example) holds an in-memory reference to the Table object and returns the old result even when executed repeatedly.

Others do not. ShowTables commands, for example, may show a different result if run later.

val df = sql("SHOW TABLES")
sql("CREATE TABLE foo")
df.cache()  // executes again and df now includes foo
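The difference between the two kinds of command can be illustrated with a toy model. This is a minimal sketch in plain Scala, not Spark's code: `ToyTable`, `ToyCatalog`, `ToyDescribe`, and `ToyShowTables` are hypothetical stand-ins. The only assumption is the one described above: a Describe-style command captures its table when it is planned, while a Show-style command consults the catalog every time it runs.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for illustration only; these are not Spark classes.
final case class ToyTable(name: String, columns: Seq[String])

final class ToyCatalog {
  private val tables = mutable.LinkedHashMap.empty[String, ToyTable]
  def create(table: ToyTable): Unit = tables(table.name) = table
  def listNames: Seq[String] = tables.keys.toSeq
}

// Holds the table resolved at planning time, so re-running returns the same rows.
final class ToyDescribe(table: ToyTable) {
  def run(): Seq[String] = table.columns
}

// Reads the catalog on every run, so a later run can see newly created tables.
final class ToyShowTables(catalog: ToyCatalog) {
  def run(): Seq[String] = catalog.listNames
}

object SnapshotVsLiveDemo {
  def main(args: Array[String]): Unit = {
    val catalog = new ToyCatalog
    val t = ToyTable("t", Seq("id", "name"))
    catalog.create(t)

    val describe = new ToyDescribe(t)
    val show = new ToyShowTables(catalog)
    val firstDescribe = describe.run()
    val firstShow = show.run()

    // A table is created between the first run and the re-run.
    catalog.create(ToyTable("foo", Seq("x")))

    println(describe.run() == firstDescribe) // true: snapshot is unchanged
    println(show.run() == firstShow)         // false: re-run now includes foo
  }
}
```

In this model, re-running the Describe-style command is harmless, whereas re-running the Show-style command changes what the user sees.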

To minimize backward-compatibility issues, I added a new interface, UsesCachedData, that preserves the existing behavior for the commands that implement it. Going forward, all Commands bypass the CacheManager by default.
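The opt-in design can be sketched roughly as follows. This is a toy model under stated assumptions, not the patch itself: only the name `UsesCachedData` comes from this PR, and its real shape is not shown here. `ToyPlan`, `ToyCommand`, `ToyCacheManager`, and the counting commands are hypothetical, and the model takes the behavior described above (caching a command's result runs the command again) as given.

```scala
import scala.collection.mutable

// Toy model for illustration; Spark's real CacheManager and plan classes differ.
trait ToyPlan { def execute(): Seq[String] }
trait ToyCommand extends ToyPlan
// Marker named after the PR's interface; modelling it as a marker trait is an assumption.
trait UsesCachedData extends ToyCommand

// A side-effecting command that counts how many times it has run.
final class CountingCreateTable extends ToyCommand {
  var runs = 0
  def execute(): Seq[String] = { runs += 1; Seq.empty }
}

// A Show-style command that opts in to the old caching behavior.
final class CountingShowTables extends UsesCachedData {
  var runs = 0
  def execute(): Seq[String] = { runs += 1; Seq("t") }
}

final class ToyCacheManager {
  private val cache = mutable.Map.empty[ToyPlan, Seq[String]]

  def cacheQuery(plan: ToyPlan): Unit = plan match {
    case legacy: UsesCachedData => cache(legacy) = legacy.execute() // old behavior: runs again
    case _: ToyCommand          => ()                               // new default: bypass
    case query                  => cache(query) = query.execute()   // ordinary queries
  }

  def isCached(plan: ToyPlan): Boolean = cache.contains(plan)
}

object BypassDemo {
  def main(args: Array[String]): Unit = {
    val manager = new ToyCacheManager
    val create = new CountingCreateTable
    val show = new CountingShowTables

    // Simulate the eager execution that happens when the command is first issued.
    create.execute()
    show.execute()

    manager.cacheQuery(create)
    manager.cacheQuery(show)

    println(create.runs) // 1: the side effect was not repeated
    println(show.runs)   // 2: the opted-in command ran again
  }
}
```

The order of the match cases matters: the opt-in marker has to be checked before the general command case, otherwise every command would take the bypass branch.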

Why are the changes needed?

To prevent a command with side effects from being executed again when a user calls df.cache() on its result. Many such commands are dangerous because they would run a second time without the user expecting it (df.cache() triggering another action on the table).

Does this PR introduce any user-facing change?

Calling resultDf.cache() on the result of a command with side effects, which used to fail or behave dangerously, should now be a no-op.

How was this patch tested?

Existing unit test

Was this patch authored or co-authored using generative AI tooling?

No

@szehon-ho
Member Author

@cloud-fan @anchovYu FYI. Any suggestions are welcome, thanks!

@szehon-ho
Member Author

Actually, I did some analysis and limited the "UsesCachedData" fallback to Show commands only.

All the Describe commands I tested are actually idempotent: in V2 they hold an in-memory reference to the DSV2 Table object, and in V1 they are stable because of RelationCache. So a user who triggers a second run by calling describeDf.cache() should not see any difference.

Also, Show output is more likely to be cached than Describe output, since the result is typically a list of entities and larger (though it's not that likely overall).
