[Integration] Expose length-aware batching in all ModelHandler subclasses #37945

Merged
damccorm merged 4 commits into apache:master from Eliaaazzz:users/elia/issue-37531-smart-bucketing-integration
Mar 31, 2026
@Eliaaazzz

Contributor

Summary

Addresses #37531.

This PR completes the smart bucketing integration for Python RunInference by exposing batch_length_fn and batch_bucket_boundaries on all concrete ModelHandler implementations.

The underlying batching support already exists in the base layer. The missing piece was that many user-facing handlers did not surface these options, which made length-aware batching effectively unavailable for a large part of the inference API surface. With this change, users can enable smart bucketing directly from the handler constructor across supported backends.
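For readers unfamiliar with the idea, here is a minimal pure-Python sketch of length-aware bucketing. It is not Beam's implementation: the function name `bucketed_batches`, the `max_batch_size` parameter, and the choice to treat each boundary as an inclusive upper bound are assumptions made for illustration only.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def bucketed_batches(
    elements: Iterable[T],
    length_fn: Callable[[T], int],
    boundaries: Sequence[int],
    max_batch_size: int,
) -> List[List[T]]:
  """Hypothetical helper: batch elements so similar lengths share a batch.

  `boundaries` must be sorted ascending. In this sketch an element whose
  length equals a boundary falls into the lower bucket (an assumption).
  """
  buckets: Dict[int, List[T]] = defaultdict(list)
  batches: List[List[T]] = []
  for element in elements:
    # Index of the first boundary that is >= the element's length.
    bucket = buckets[bisect_left(boundaries, length_fn(element))]
    bucket.append(element)
    if len(bucket) == max_batch_size:
      # Emit a full batch and start a fresh one for this bucket.
      batches.append(list(bucket))
      bucket.clear()
  # Flush partially filled buckets, shortest bucket first.
  for index in sorted(buckets):
    if buckets[index]:
      batches.append(buckets[index])
  return batches
```

With `length_fn=len` and `boundaries=[8, 32]`, short strings and long strings never share a batch, which avoids padding every short input up to the longest element in a mixed batch.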

What Changed

This change adds batch_length_fn and batch_bucket_boundaries to 16 concrete handlers across the following backends:

  • PyTorch
  • HuggingFace
  • scikit-learn
  • TensorFlow
  • ONNX
  • XGBoost
  • TensorRT
  • vLLM
  • Vertex AI
  • Gemini

Implementation details:

  • Handlers that inherit from ModelHandler now pass the new parameters through to super().__init__()
  • Remote handlers that manage batching kwargs directly (GeminiModelHandler and VertexAIModelHandlerJSON) now wire the values into _batching_kwargs
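The two wiring styles above can be sketched as follows. Every class name here is a hypothetical stand-in rather than a real Beam class, and the dictionary keys `length_fn` and `bucket_boundaries` are assumed for illustration; only the constructor parameter names `batch_length_fn` and `batch_bucket_boundaries`, the `_batching_kwargs` attribute, and `batch_elements_kwargs()` come from the description.

```python
from typing import Any, Callable, Dict, Optional, Sequence


def _bucketing_kwargs(
    batch_length_fn: Optional[Callable[[Any], int]],
    batch_bucket_boundaries: Optional[Sequence[int]],
) -> Dict[str, Any]:
  """Collect only the bucketing options that were actually set."""
  kwargs: Dict[str, Any] = {}
  if batch_length_fn is not None:
    kwargs["length_fn"] = batch_length_fn  # key name is an assumption
  if batch_bucket_boundaries is not None:
    kwargs["bucket_boundaries"] = batch_bucket_boundaries  # assumption
  return kwargs


class SketchBaseHandler:
  """Hypothetical base class that owns the batching kwargs."""
  def __init__(self, *, batch_length_fn=None, batch_bucket_boundaries=None):
    self._batching_kwargs = _bucketing_kwargs(
        batch_length_fn, batch_bucket_boundaries)

  def batch_elements_kwargs(self) -> Dict[str, Any]:
    return self._batching_kwargs


class SketchLocalHandler(SketchBaseHandler):
  """Style 1: pass the new parameters through to super().__init__()."""
  def __init__(
      self, model_uri, *, batch_length_fn=None, batch_bucket_boundaries=None):
    super().__init__(
        batch_length_fn=batch_length_fn,
        batch_bucket_boundaries=batch_bucket_boundaries)
    self._model_uri = model_uri


class SketchRemoteHandler:
  """Style 2: a handler that builds _batching_kwargs itself."""
  def __init__(
      self,
      endpoint,
      *,
      max_batch_size=None,
      batch_length_fn=None,
      batch_bucket_boundaries=None):
    self._endpoint = endpoint
    self._batching_kwargs: Dict[str, Any] = {}
    if max_batch_size is not None:
      self._batching_kwargs["max_batch_size"] = max_batch_size
    # Wire the bucketing options in next to the existing batching options.
    self._batching_kwargs.update(
        _bucketing_kwargs(batch_length_fn, batch_bucket_boundaries))

  def batch_elements_kwargs(self) -> Dict[str, Any]:
    return self._batching_kwargs
```

In both styles the options stay optional, so existing callers that pass neither argument see an unchanged `batch_elements_kwargs()` result.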

Testing

Added test coverage in base_test.py for both behavior and wiring:

  • an end-to-end RunInferenceLengthAwareBatchingTest that verifies short and long string inputs are bucketed into separate batches under FnApiRunner
  • a HandlerBucketingKwargsForwardingTest that checks each concrete handler forwards batch_length_fn and batch_bucket_boundaries into batch_elements_kwargs()
  • follow-up fixes to keep the forwarding tests hermetic, especially for HuggingFace pipeline validation and Vertex AI endpoint liveness checks
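One way to keep such a forwarding test hermetic is to patch out the network-dependent check before constructing the handler. The sketch below uses a hypothetical `SketchEndpointHandler` with a hypothetical `_check_endpoint_alive` method; it illustrates the general `unittest.mock` technique and is not the code added in base_test.py.

```python
import unittest
from unittest import mock


class SketchEndpointHandler:
  """Hypothetical remote handler that probes its endpoint on construction."""
  def __init__(
      self, endpoint_id, *, batch_length_fn=None,
      batch_bucket_boundaries=None):
    self._batching_kwargs = {}
    if batch_length_fn is not None:
      self._batching_kwargs["length_fn"] = batch_length_fn  # assumed key
    if batch_bucket_boundaries is not None:
      self._batching_kwargs["bucket_boundaries"] = batch_bucket_boundaries
    self._check_endpoint_alive(endpoint_id)

  def _check_endpoint_alive(self, endpoint_id):
    # Stands in for a real network call; always fails in this sketch.
    raise ConnectionError("no network access in tests")

  def batch_elements_kwargs(self):
    return self._batching_kwargs


class ForwardingSketchTest(unittest.TestCase):
  def test_bucketing_kwargs_are_forwarded_without_network(self):
    # Replace the liveness check so the constructor never touches the network.
    with mock.patch.object(
        SketchEndpointHandler, "_check_endpoint_alive",
        return_value=None) as check:
      handler = SketchEndpointHandler(
          "endpoint-123",
          batch_length_fn=len,
          batch_bucket_boundaries=[16, 64])
    check.assert_called_once_with("endpoint-123")
    kwargs = handler.batch_elements_kwargs()
    self.assertIs(kwargs["length_fn"], len)
    self.assertEqual(kwargs["bucket_boundaries"], [16, 64])
```

Patching at the class level keeps the test focused on argument wiring: it fails only if the bucketing options are dropped, not because an external service is unreachable.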

Context

This is the final integration piece for smart bucketing.

Together, these changes make length-aware batching usable through the public Python inference handlers rather than only at the base implementation layer.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
