Skip to content

[PLUGIN-1957] Validate PK Chunking for incremental loads#354

Merged
vikasrathee-cs merged 1 commit into
data-integrations:developfrom
cloudsufi:feature/prevent-empty-chunks-pk-chunking
Jun 9, 2026
Merged

[PLUGIN-1957] Validate PK Chunking for incremental loads#354
vikasrathee-cs merged 1 commit into
data-integrations:developfrom
cloudsufi:feature/prevent-empty-chunks-pk-chunking

Conversation

@harishhk107

@harishhk107 harishhk107 commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

PLUGIN-1957 Autodetect PK Chunking for incremental loads

What

Adds an autonomous pre-flight check to decide whether to enable PK Chunking in SalesforceBatchSource.getSplits().

  • If the estimated record count is below the user-configured chunk size (config.getChunkSize()) and the query is selective, PK chunking is skipped and executed as a standard un-chunked query to avoid "empty chunk" overhead on small datasets.
  • If the dataset volume exceeds the threshold, or is flagged as non-selective, PK Chunking is enabled.
  • This feature is specifically designed for automated Data Transfer Service.

Why

PK chunking is designed for very large datasets. Enabling it on small datasets (like highly restrictive incremental intervals where 0 or very few records changed) causes dozens of empty chunk boundaries to be spooled and executed, wasting daily Bulk API allocations and adding massive pipeline overhead. This change ensures chunking is only applied when operationally justified by the actual matching record volume.

Changes

  • SalesforceBatchSource.java: Configured PK chunking threshold dynamically using chunkSize.
  • SalesforceSplitUtil.java: Added REST Query Plan explain check and removed unused constants.
  • SalesforceQueryUtil.java: Replaced magic HTTP status codes with constants.
  • SalesforceSourceConstants.java: Cleaned up the unused threshold constant.
  • SalesforceQueryUtilTest.java: Added unit tests covering Query Plan and fallback paths.
  • SalesforceBatchSourceETLTest.java: Renamed ETL test case to remove UI naming.

@google-cla

google-cla Bot commented Apr 20, 2026

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch from faac75b to 5b4cf9e Compare April 20, 2026 05:53
Comment thread src/test/java/io/cdap/plugin/salesforce/etl/SalesforceBatchSourceETLTest.java Outdated
Comment thread src/test/java/io/cdap/plugin/salesforce/etl/SalesforceBatchSourceETLTest.java Outdated
@vikasrathee-cs vikasrathee-cs changed the title feat: validate PK Chunking for large queries [PLUGIN-1957] Autodetect PK Chunking for incremental loads Apr 21, 2026
@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch from db562e2 to 3f613ab Compare April 22, 2026 11:19
@vikasrathee-cs vikasrathee-cs force-pushed the feature/prevent-empty-chunks-pk-chunking branch from 3f613ab to d6c2a04 Compare April 22, 2026 11:31
@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch 2 times, most recently from d88d48d to fba1f3f Compare April 22, 2026 12:54
Comment thread src/test/java/io/cdap/plugin/salesforce/SalesforceQueryUtilTest.java Outdated
Comment thread src/test/java/io/cdap/plugin/salesforce/SalesforceQueryUtilTest.java Outdated
Comment thread src/test/java/io/cdap/plugin/salesforce/etl/SalesforceBatchSourceETLTest.java Outdated
Comment thread src/main/java/io/cdap/plugin/salesforce/SalesforceQueryUtil.java
@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch from ca92e80 to 73dacc9 Compare April 27, 2026 11:23
@harishhk107 harishhk107 changed the title [PLUGIN-1957] Autodetect PK Chunking for incremental loads [PLUGIN-1957] Validate PK Chunking for incremental loads Apr 27, 2026
@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch 4 times, most recently from dfd2083 to 62deabb Compare May 11, 2026 17:47
@vikasrathee-cs vikasrathee-cs force-pushed the feature/prevent-empty-chunks-pk-chunking branch from 62deabb to 303f515 Compare May 21, 2026 04:44

@Sunish-Dahiya Sunish-Dahiya left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a high level change looks fine but can you please test all these scenario

  1. When chunking is enabled for data less than <100K(old logic)
  2. Chunking should be disabled for data less than <100K - compare time/performance of #1 and #2.
  3. All other cases , adding few of them for reference: query is selective, query is expensive, query is selective but record count >100K

Comment thread src/test/java/io/cdap/plugin/salesforce/etl/SalesforceBatchSourceETLTest.java Outdated
@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch 3 times, most recently from a5cc3dd to e23a9c7 Compare June 1, 2026 09:41
@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch from a6058fc to fe54235 Compare June 8, 2026 09:25

@Sunish-Dahiya Sunish-Dahiya left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment regarding unit tests is still not addressed

@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch 2 times, most recently from 539a8ca to cf5c0b7 Compare June 8, 2026 13:45

@Sunish-Dahiya Sunish-Dahiya left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address the comments before merging.

Comment thread src/main/java/io/cdap/plugin/salesforce/SalesforceQueryUtil.java Outdated
@harishhk107

Copy link
Copy Markdown
Contributor Author

Add all the testcases in the SalesforceQueryUtilTest to reuse the existing PowerMockito class-level configuration for mocking query plan REST calls and

@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch from cf5c0b7 to ccd70b5 Compare June 8, 2026 18:41
@harishhk107 harishhk107 requested a review from Sunish-Dahiya June 9, 2026 05:41
Comment thread src/test/java/io/cdap/plugin/salesforce/etl/SalesforceBatchSourceETLTest.java Outdated
@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch from ccd70b5 to 4ce99f0 Compare June 9, 2026 09:34
@harishhk107 harishhk107 requested a review from Sunish-Dahiya June 9, 2026 09:49
if (enablePKChunk && pkChunkCountCheck) {
int chunkSize = config.getChunkSize();
enablePKChunk = SalesforceSplitUtil.hasRequiredCountForPkChunking(
query, authenticatorCredentials, chunkSize);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be inlined, we don't need a new variable

@harishhk107 harishhk107 force-pushed the feature/prevent-empty-chunks-pk-chunking branch from 4ce99f0 to edd067c Compare June 9, 2026 10:51
@vikasrathee-cs vikasrathee-cs merged commit b3521e9 into data-integrations:develop Jun 9, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants