[VL] Reduce Velox hash shuffle partition buffer memory by evicting large partitions after split#12089

Open
wankunde wants to merge 4 commits into apache:main from wankunde:hash_shuffle_writer

@wankunde wankunde commented May 13, 2026

What changes are proposed in this pull request?

Why this PR is needed?

When memory runs out inside doSplit(), VeloxHashShuffleWriter is asked to reclaim memory, but evictPartitionBuffersMinSize() only runs when splitState_ == kInit. During doSplit() the split state is kPreAlloc or kSplit, so that call is skipped and large shuffle partition buffers cannot be evicted even though they exist.

Changes in this PR:

This change adds a configurable threshold for the Velox hash shuffle writer to evict large partition buffers immediately after splitting an input batch.
When enabled, the hash shuffle writer estimates each partition buffer’s allocated capacity after split and evicts partitions whose buffered data exceeds the threshold.
This reduces retained partition buffer memory before processing the next input RowVector.
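
As a rough illustration of the post-split eviction described above, here is a minimal C++ sketch. It is not the code in this PR: `PartitionBuffer`, `allocatedBytes`, and `evictLargePartitions` are hypothetical names, and the real writer tracks per-column buffers rather than a single byte count.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for one partition's buffer.
// Only the estimated allocated capacity is modelled here.
struct PartitionBuffer {
  int64_t allocatedBytes = 0;
};

// After an input batch has been split, scan all partitions and "evict"
// those whose estimated size exceeds thresholdBytes. Returns the ids of
// the evicted partitions. In a real writer the payload would be written
// out before the buffer is released; this sketch only resets the counter.
std::vector<uint32_t> evictLargePartitions(
    std::vector<PartitionBuffer>& buffers, int64_t thresholdBytes) {
  std::vector<uint32_t> evicted;
  for (std::size_t pid = 0; pid < buffers.size(); ++pid) {
    if (buffers[pid].allocatedBytes > thresholdBytes) {
      evicted.push_back(static_cast<uint32_t>(pid));
      buffers[pid].allocatedBytes = 0;
    }
  }
  return evicted;
}
```

With a threshold of 256 * 1024 bytes, a partition holding 300 KiB would be evicted in this sketch, while partitions at or below 256 KiB would be kept until a later split or spill.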

The new config is:

spark.gluten.sql.columnar.shuffle.evictPartitionSize

Default value: 256 * 1024 bytes.

How was this patch tested?

Tested in our production environment, where this change fixed several OOM cases.

Was this patch authored or co-authored using generative AI tooling?

NO

@github-actions github-actions Bot added CORE works for Gluten Core VELOX labels May 13, 2026
@github-actions

Run Gluten Clickhouse CI on x86

@wankunde wankunde changed the title from "[VL] Reduce Velox hash shuffle partition buffer memory by evicting large partitions after split" to "[DRAFT][VL] Reduce Velox hash shuffle partition buffer memory by evicting large partitions after split" on May 13, 2026
@github-actions github-actions Bot added the DOCS label May 13, 2026
@wankunde wankunde changed the title from "[DRAFT][VL] Reduce Velox hash shuffle partition buffer memory by evicting large partitions after split" to "[VL] Reduce Velox hash shuffle partition buffer memory by evicting large partitions after split" on May 14, 2026
@marin-ma
Contributor

> but evictPartitionBuffersMinSize() method will be skipped due to splitState_ == kInit condition.

Can you explain a bit more about this? According to the code logic, evictPartitionBuffersMinSize() is used for reclaiming memory when splitState_ == kInit

https://github.com/apache/gluten/blob/main/cpp/velox/shuffle/VeloxHashShuffleWriter.cc#L1389-L1391

@wankunde
Contributor Author

> > but evictPartitionBuffersMinSize() method will be skipped due to splitState_ == kInit condition.
>
> Can you explain a bit more about this? According to the code logic, evictPartitionBuffersMinSize() is used for reclaiming memory when splitState_ == kInit
>
> https://github.com/apache/gluten/blob/main/cpp/velox/shuffle/VeloxHashShuffleWriter.cc#L1389-L1391

While the shuffle writer is inside doSplit(), the split state is kPreAlloc or kSplit, so the hash shuffle writer cannot reclaim memory from itself. For example:

| spill number | split state | memory used | comment |
| --- | --- | --- | --- |
| 0 | kInit | 0MB | |
| 0 | kPreAlloc | 100MB | |
| 0 | kSplit | 150MB | |
| 1 | kInit | 150MB | |
| 1 | kPreAlloc | 250MB | |
| 1 | kSplit | 300MB | |
| 2 | kInit | 300MB | |
| 2 | kPreAlloc | 450MB | case1: ShuffleWriter used + some other used > task offheap total |
| 2 | kSplit | 500MB | case2: ShuffleWriter used + some other used > task offheap total |
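
The progression in the table can be reproduced with a toy model. The sketch below is only an approximation of the behaviour discussed in this thread, not the actual VeloxHashShuffleWriter code: `ToyHashShuffleWriter`, its members, and the `pressure` flag are hypothetical, and the only rule it encodes is that reclaim does nothing unless the state is kInit.

```cpp
#include <cstdint>

// Split states named in the discussion; the real enum may have more values.
enum class SplitState { kInit, kPreAlloc, kSplit };

// Toy model that tracks only the split state and buffered size (in MB).
struct ToyHashShuffleWriter {
  SplitState splitState = SplitState::kInit;
  int64_t bufferedMb = 0;

  // Models the guard described above: eviction only runs in kInit.
  // Returns the amount reclaimed.
  int64_t reclaim() {
    if (splitState != SplitState::kInit) {
      return 0;  // mid-split: the writer cannot evict its own buffers
    }
    int64_t freed = bufferedMb;
    bufferedMb = 0;
    return freed;
  }

  // Models one doSplit() call: pre-allocate, then copy rows.
  // When pressure is true, a reclaim request arrives in each phase.
  int64_t doSplit(int64_t preAllocMb, int64_t splitMb, bool pressure) {
    int64_t reclaimed = 0;
    splitState = SplitState::kPreAlloc;
    bufferedMb += preAllocMb;
    if (pressure) reclaimed += reclaim();  // case1 in the table
    splitState = SplitState::kSplit;
    bufferedMb += splitMb;
    if (pressure) reclaimed += reclaim();  // case2 in the table
    splitState = SplitState::kInit;
    return reclaimed;
  }
};
```

Calling `doSplit(100, 50, false)`, `doSplit(100, 50, false)`, and then `doSplit(150, 50, true)` on this model walks through the 150MB, 300MB, and 500MB rows, and the third call reclaims nothing even though memory pressure is signalled in both phases.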

@marin-ma marin-ma left a comment

Thanks. Some minor comments on the configuration.

.createWithDefault(0.25)

val COLUMNAR_SHUFFLE_EVICT_PARTITION_SIZE =
buildConf("spark.gluten.sql.columnar.shuffle.evictPartitionSize")
Contributor

I would suggest using spark.gluten.sql.columnar.shuffle.partitionBufferEvictThreshold for this configuration.

Contributor Author

Done

"For Velox hash shuffle writer, evict partition buffers larger than this threshold " +
"after splitting an input batch.")
.intConf
.createWithDefault(256 * 1024)
Contributor

Can you set the default value to -1 to disable this eviction behaviour by default?

Contributor Author

Done
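
One way a -1 default could act as an off switch is sketched below. This is an assumption about the semantics, not the code merged in this PR: `kEvictionDisabled`, `postSplitEvictionEnabled`, and `shouldEvictAfterSplit` are hypothetical names, and treating every non-positive value as "disabled" is a choice made for this sketch.

```cpp
#include <cstdint>

// Assumed sentinel: the default requested in the review above.
constexpr int64_t kEvictionDisabled = -1;

// Assumption: any non-positive threshold turns the feature off.
inline bool postSplitEvictionEnabled(int64_t thresholdBytes) {
  return thresholdBytes > 0;
}

// A partition is evicted after split only when the feature is enabled
// and its estimated size is strictly larger than the threshold.
inline bool shouldEvictAfterSplit(int64_t partitionBytes,
                                  int64_t thresholdBytes) {
  return postSplitEvictionEnabled(thresholdBytes) &&
         partitionBytes > thresholdBytes;
}
```

Under these assumptions, the default leaves existing behaviour unchanged, and users who hit OOM can opt in by setting a positive threshold such as 262144.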

Comment thread docs/Configuration.md Outdated
| spark.gluten.sql.columnar.shuffle.codecBackend | <undefined> |
| spark.gluten.sql.columnar.shuffle.compression.threshold | 100 | If number of rows in a batch falls below this threshold, will copy all buffers into one buffer to compress. |
| spark.gluten.sql.columnar.shuffle.dictionary.enabled | false | Enable dictionary in hash-based shuffle. |
| spark.gluten.sql.columnar.shuffle.evictPartitionSize | 262144 | For Velox hash shuffle writer, evict partition buffers larger than this threshold after splitting an input batch. |
Contributor

Please also update the document.

Contributor Author

Done

@marin-ma
Contributor

LGTM. cc @FelixYBW Do you have any comments?

Labels

CORE works for Gluten Core DOCS VELOX

2 participants