Summary of Changes
This pull request implements a "sticky session" routing strategy for LoRA adapters within the Ray Serve framework. The primary goal is to improve performance and resource efficiency by ensuring that subsequent requests for a specific LoRA adapter are consistently directed to the same model replica that initially loaded it. This is achieved through a custom request router that considers both existing multiplexed sessions and the available LoRA capacity on each replica, supported by backend state management and client-side header injection.
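The client-side header injection mentioned in the summary can be sketched in a few lines. Ray Serve's model multiplexing selects replicas by the `serve_multiplexed_model_id` request header, so a patched client only needs to attach the adapter ID to each outgoing request. The helper name below is illustrative, not the PR's actual code:

```python
def multiplexed_headers(lora_id: str) -> dict:
    """Build HTTP headers that let Ray Serve's multiplexed routing
    steer this request toward a replica hosting the given adapter."""
    return {
        "Content-Type": "application/json",
        # Ray Serve's multiplexed-routing key: replicas that already
        # have this model loaded are preferred by the router.
        "serve_multiplexed_model_id": lora_id,
    }

# Example: attach to any HTTP client call before sending the request.
headers = multiplexed_headers("my-lora-adapter")
```

With this in place, the server-side router can match the header against each replica's set of loaded adapters.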
Code Review
This pull request introduces a custom router for sticky sessions, which is a key feature for handling stateful LoRA adapters efficiently. The implementation is comprehensive, touching configuration, client-side patching, server-side routing, and state management. The logic for tracking replica capacity and using it for routing decisions is well-thought-out. My feedback focuses on improving the maintainability and robustness of the new code, specifically by replacing print statements with a proper logger and refactoring duplicated logic within the ModelManagement class. Overall, this is a solid implementation of a complex feature.
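The reviewer's suggestion to replace `print` statements with a proper logger could look like the minimal sketch below. The module and function names are assumptions; Ray Serve components conventionally log under the `ray.serve` logger, so a child logger keeps the router's messages filterable:

```python
import logging

# Assumed convention: hang the router's logger off "ray.serve" so its
# output flows through Serve's existing logging configuration.
logger = logging.getLogger("ray.serve").getChild("sticky_lora_router")

def log_routing_decision(model_id: str, replica_id: str) -> None:
    # Replaces ad-hoc print() calls: leveled, timestamped, and
    # controllable per-module via standard logging configuration.
    logger.info("Routing model %s to replica %s", model_id, replica_id)
```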
Pull request overview
This pull request implements sticky session routing for LoRA adapters by introducing a custom Ray Serve request router that tracks replica capacity and routes requests to replicas with available LoRA slots. The implementation adds replica capacity tracking to the server state management system and modifies the model management service to register replicas and associate models with their hosting replicas.
Changes:
- Adds `StickyLoraRequestRouter`, which routes requests based on both Ray Serve's multiplexing mechanism and replica LoRA capacity
- Implements replica registration and capacity tracking in `ServerState` and `ModelManager`
- Modifies model creation to track which replica hosts each model for improved routing decisions
- Refactors client-side tinker patching for better code organization
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `src/twinkle/server/tinker/common/router.py` | New custom router implementing capacity-aware sticky session routing for LoRA adapters |
| `src/twinkle/server/utils/state/server_state.py` | Adds replica management methods (register/unregister/query capacity) to `ServerState` and proxy |
| `src/twinkle/server/utils/state/models.py` | Adds `replica_id` field to `ModelRecord` for tracking model-replica associations |
| `src/twinkle/server/utils/state/model_manager.py` | Implements replica tracking with capacity limits and cleanup logic |
| `src/twinkle/server/tinker/model.py` | Registers replica on startup, tracks models per replica, extracts token via helper function, implements multiplexed sticky entry |
| `src/twinkle_client/utils/patch_tinker.py` | Refactors `ServiceClient` patching into a separate function for better maintainability |
| `src/twinkle/model/multi_lora.py` | Improves error message to include the `max_loras` limit |
| `cookbook/client/tinker/megatron/server_config_7b.yaml` | Updates config to scale to 2 replicas with 1 LoRA per replica and increases the per-token model limit |
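The configuration change in the last row might look roughly like the fragment below. The key names and values are illustrative; the actual layout of `server_config_7b.yaml` may differ:

```yaml
# Illustrative shape only; actual keys in server_config_7b.yaml may differ.
deployment:
  num_replicas: 2            # scale out to two model replicas
  max_loras_per_replica: 1   # one LoRA adapter slot per replica
  max_models_per_token: 8    # raised per-token model limit (value illustrative)
```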