[Integration] Expose length-aware batching in all ModelHandler subclasses #37945
Merged
damccorm merged 4 commits on Mar 31, 2026
Summary
Addresses #37531.
This PR completes the smart bucketing integration for Python `RunInference` by exposing `batch_length_fn` and `batch_bucket_boundaries` on all concrete `ModelHandler` implementations.
The underlying batching support already exists in the base layer. The missing piece was that many user-facing handlers did not surface these options, which made length-aware batching effectively unavailable for a large part of the inference API surface. With this change, users can enable smart bucketing directly from the handler constructor across supported backends.
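To make the idea concrete, here is a minimal pure-Python sketch of length-aware bucketing. It is not Beam's implementation: the helper names `bucket_index` and `batch_by_length`, the `max_batch_size` parameter, and the choice to treat each boundary as an inclusive upper bound are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, TypeVar

T = TypeVar("T")


def bucket_index(length: int, boundaries: Sequence[int]) -> int:
    """Map a length to a bucket number, given sorted boundaries.

    Assumption: each boundary is an inclusive upper bound, and lengths above
    the last boundary share one overflow bucket.
    """
    return bisect_left(boundaries, length)


def batch_by_length(
    elements: Iterable[T],
    length_fn: Callable[[T], int],
    boundaries: Sequence[int],
    max_batch_size: int,
) -> List[List[T]]:
    """Group elements so that each batch only holds items from one bucket."""
    buckets: Dict[int, List[T]] = defaultdict(list)
    batches: List[List[T]] = []
    for element in elements:
        idx = bucket_index(length_fn(element), boundaries)
        buckets[idx].append(element)
        # Emit a batch as soon as its bucket is full.
        if len(buckets[idx]) == max_batch_size:
            batches.append(buckets.pop(idx))
    # Flush the partially filled buckets in bucket order.
    for idx in sorted(buckets):
        batches.append(buckets[idx])
    return batches


inputs = ["hi", "hello", "a" * 40, "b" * 50, "ok"]
batches = batch_by_length(inputs, len, boundaries=[16], max_batch_size=2)
```

With these inputs, the short strings and the 40 to 50 character strings never share a batch, which is the property that keeps padding overhead low when a model pads every batch to its longest element.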
What Changed
This change adds `batch_length_fn` and `batch_bucket_boundaries` to 16 concrete handlers across the supported backends.
Implementation details:
- Handlers that subclass `ModelHandler` now pass the new parameters through to `super().__init__()`
- `GeminiModelHandler` and `VertexAIModelHandlerJSON` now wire the values into `_batching_kwargs`
Testing
Added test coverage in `base_test.py` for both behavior and wiring:
- `RunInferenceLengthAwareBatchingTest` that verifies short and long string inputs are bucketed into separate batches under `FnApiRunner`
- `HandlerBucketingKwargsForwardingTest` that checks each concrete handler forwards `batch_length_fn` and `batch_bucket_boundaries` into `batch_elements_kwargs()`
Context
This is the final integration piece for smart bucketing.
Together, these changes make length-aware batching usable through the public Python inference handlers rather than only at the base implementation layer.
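The two wiring styles mentioned under Implementation details, and the kind of check the forwarding test performs, can be sketched with toy classes. Everything here is hypothetical: `ToyBaseHandler`, `ToyLocalHandler`, `ToyRemoteBase`, `ToyRemoteHandler`, and `check_forwarding` are stand-ins rather than Beam classes, and the dictionary keys `length_fn` and `bucket_boundaries` are assumed names, not confirmed ones.

```python
from typing import Any, Callable, Dict, Optional, Sequence


class ToyBaseHandler:
    """Hypothetical base handler that owns the batching options."""

    def __init__(
        self,
        *,
        batch_length_fn: Optional[Callable[[Any], int]] = None,
        batch_bucket_boundaries: Optional[Sequence[int]] = None,
    ) -> None:
        self._batching_kwargs: Dict[str, Any] = {}
        # Assumed key names; the real batching keys may differ.
        if batch_length_fn is not None:
            self._batching_kwargs["length_fn"] = batch_length_fn
        if batch_bucket_boundaries is not None:
            self._batching_kwargs["bucket_boundaries"] = list(
                batch_bucket_boundaries)

    def batch_elements_kwargs(self) -> Dict[str, Any]:
        return dict(self._batching_kwargs)


class ToyLocalHandler(ToyBaseHandler):
    """Style 1: accept the options and pass them to super().__init__()."""

    def __init__(self, model_uri: str, *, batch_length_fn=None,
                 batch_bucket_boundaries=None) -> None:
        super().__init__(
            batch_length_fn=batch_length_fn,
            batch_bucket_boundaries=batch_bucket_boundaries)
        self.model_uri = model_uri


class ToyRemoteBase:
    """Hypothetical parent that only exposes a _batching_kwargs dict."""

    def __init__(self) -> None:
        self._batching_kwargs: Dict[str, Any] = {}

    def batch_elements_kwargs(self) -> Dict[str, Any]:
        return dict(self._batching_kwargs)


class ToyRemoteHandler(ToyRemoteBase):
    """Style 2: write the options into _batching_kwargs directly."""

    def __init__(self, endpoint: str, *, batch_length_fn=None,
                 batch_bucket_boundaries=None) -> None:
        super().__init__()
        self.endpoint = endpoint
        if batch_length_fn is not None:
            self._batching_kwargs["length_fn"] = batch_length_fn
        if batch_bucket_boundaries is not None:
            self._batching_kwargs["bucket_boundaries"] = list(
                batch_bucket_boundaries)


def check_forwarding(handler_cls) -> None:
    """Assert that a handler class surfaces both options in its kwargs."""
    handler = handler_cls(
        "demo", batch_length_fn=len, batch_bucket_boundaries=[16, 64])
    kwargs = handler.batch_elements_kwargs()
    assert kwargs["length_fn"] is len
    assert kwargs["bucket_boundaries"] == [16, 64]


for cls in (ToyLocalHandler, ToyRemoteHandler):
    check_forwarding(cls)
```

Running one shared check over every handler class is what catches a constructor that accepts the options but silently drops them.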