fix: prevent prefill starvation under high decode load#4532

Open
grimoire wants to merge 4 commits into InternLM:main from grimoire:less-prefill-waiting
Conversation

@grimoire
Collaborator

Summary

  • Under high utilization (running + ready >= 50% of max_batches) with small waiting requests (total tokens < max_prefill_token_num), do_prefill_default would always choose decode, starving the waiting requests indefinitely
  • Add a consecutive-decode counter (_decode_count) that forces a prefill after prefill_interval (default 16) decode rounds when requests are waiting
  • The counter resets only when a prefill actually produces inputs, not merely when one is attempted, so the guard cannot be "burned" by failed allocation attempts
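The guard described above can be sketched as follows. This is a hypothetical simplification, not the actual lmdeploy code: the names `_decode_count`, `prefill_interval`, and the utilization condition mirror the PR description, but `SchedulerState` and the surrounding policy are assumed for illustration.

```python
# Hypothetical sketch of the starvation guard; names mirror the PR
# (_decode_count, prefill_interval) but the scheduler state and default
# policy are simplified assumptions, not the lmdeploy implementation.
from dataclasses import dataclass


@dataclass
class SchedulerState:
    num_running: int
    num_ready: int
    num_waiting: int
    max_batches: int


@dataclass
class InputsMaker:
    prefill_interval: int = 16  # forced-prefill threshold (PR default)
    _decode_count: int = 0      # consecutive decode rounds so far

    def do_prefill(self, state: SchedulerState) -> bool:
        """Return True if the next round should attempt prefill."""
        # Starvation guard: after `prefill_interval` consecutive decode
        # rounds with requests still waiting, force a prefill attempt.
        if state.num_waiting > 0 and self._decode_count >= self.prefill_interval:
            return True
        # Default policy (simplified): prefer decode under high utilization,
        # which is what previously starved small waiting requests.
        high_load = (state.num_running + state.num_ready) >= state.max_batches // 2
        return not high_load and state.num_waiting > 0

    def record_round(self, prefill_produced_inputs: bool) -> None:
        """Update the counter after a round completes."""
        if prefill_produced_inputs:
            # Reset only when prefill actually produced inputs, so a
            # failed allocation attempt does not "burn" the guard.
            self._decode_count = 0
        else:
            self._decode_count += 1
```

Under a high-load state the default branch keeps choosing decode, but once `_decode_count` reaches the threshold the guard overrides it until a prefill actually lands.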

Copilot AI review requested due to automatic review settings April 16, 2026 05:04
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the PyTorch engine input scheduling policy to prevent prefill starvation when decode load is high, by introducing a “max consecutive decode rounds” guard that forces a prefill attempt when requests are waiting.

Changes:

  • Add InputsMakerConfig.max_prefill_gap (sourced from engine.engine_config.prefill_interval) to control how many consecutive decode rounds are allowed before forcing prefill.
  • Track consecutive decode rounds via InputsMakerAsync._decode_count, forcing prefill once the threshold is reached while requests are waiting.
  • Reset the counter only when a prefill actually produces inputs (not merely when prefill is attempted).

@grimoire grimoire changed the title ix: prevent prefill starvation under high decode load fix: prevent prefill starvation under high decode load Apr 16, 2026
Contributor

Copilot AI left a comment

Pull request overview

Prevents waiting requests from being starved by continuous decode selection under high decode load in the PyTorch engine’s input-making/scheduling logic.

Changes:

  • Add prefill_interval to InputsMakerConfig (default 16) and plumb it from engine.engine_config.
  • Track consecutive decode decisions via _decode_count and force a prefill once the interval is reached while requests are waiting.
  • Reset _decode_count only when a non-decoding ModelInputs is actually produced.
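The reset rule in the last bullet is the subtle part: a prefill *attempt* that fails to allocate produces no inputs and must not reset the counter. A hypothetical illustration, assuming a `ModelInputs` type with an `is_decoding` flag (the real lmdeploy class differs):

```python
# Hypothetical illustration of the reset rule. `ModelInputs.is_decoding`
# is an assumed stand-in for the real lmdeploy type.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelInputs:
    is_decoding: bool


def update_decode_count(decode_count: int, inputs: Optional[ModelInputs]) -> int:
    """Return the new consecutive-decode count after one round."""
    # A failed prefill attempt yields no inputs (None) and must not
    # reset the guard; only an actual non-decoding batch resets it.
    if inputs is not None and not inputs.is_decoding:
        return 0
    return decode_count + 1
```

Tying the reset to produced, non-decoding inputs rather than to the attempt means the forced-prefill guard stays armed across allocation failures.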

Comment thread lmdeploy/pytorch/engine/inputs_maker.py Outdated
Comment thread lmdeploy/pytorch/engine/inputs_maker.py
