optimize get_sorted_idx in moe #4529

Open
grimoire wants to merge 2 commits into InternLM:main from grimoire:optimize-moe-expert-map

Conversation

Collaborator

@grimoire grimoire commented Apr 15, 2026

Optimize _get_sorted_idx for a large number of experts; the new implementation uses less memory.

Warning

The order of the sorted idx is not stable: tokens routed to the same expert may appear in arbitrary order.

@grimoire grimoire marked this pull request as ready for review April 16, 2026 04:18
Copilot AI review requested due to automatic review settings April 16, 2026 04:18
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the CUDA fused MoE routing index generation to reduce memory usage for large expert counts, and adjusts a PyTorch engine default related to prefill sizing.

Changes:

  • Replaces the previous mask/cumsum-based _get_sorted_idx implementation with a 2-phase Triton approach (atomic histogram + scatter).
  • Wires the new Triton implementation as the default _get_sorted_idx used by fused MoE kernels.
  • Increases PytorchEngineConfig.max_prefill_token_num default from 4096 to 8192.
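The 2-phase approach amounts to a counting sort over expert ids. A minimal pure-Python sketch of that structure (function and variable names are illustrative, not the kernel's API; the real Triton kernels run these loops in parallel, with phase 1 using atomic adds):

```python
def get_sorted_idx_sketch(topk_ids, num_experts):
    """Group token indices by expert id via histogram + scatter."""
    # Phase 1: histogram of expert ids; record each token's
    # local position within its expert (atomic_add in the kernel).
    counts = [0] * num_experts
    local_pos = [0] * len(topk_ids)
    for i, e in enumerate(topk_ids):
        local_pos[i] = counts[e]
        counts[e] += 1
    # Phase 2: prefix-sum the counts, then scatter each token index
    # into its expert's slot range.
    exp_end, total = [], 0
    for c in counts:
        total += c
        exp_end.append(total)            # inclusive prefix sum
    exp_start = [e - c for e, c in zip(exp_end, counts)]
    sorted_idx = [0] * len(topk_ids)
    for i, e in enumerate(topk_ids):
        sorted_idx[exp_start[e] + local_pos[i]] = i
    return sorted_idx, exp_start, exp_end
```

In the GPU version, phase 1's increment is an atomic add, so local_pos reflects thread arrival order rather than token order; that is why the resulting sort is not stable, as the PR description warns.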

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

  • lmdeploy/pytorch/kernels/cuda/fused_moe.py: Introduces new Triton kernels and replaces _get_sorted_idx with a 2-phase atomic/scatter approach.
  • lmdeploy/messages.py: Updates the default PyTorch engine prefill token budget.


return out_mask, out_k

def _get_sorted_idx_triton(topk_ids: torch.Tensor, num_experts: int):
"""Get sorted idx with 2-phase Triton kernels (4 kernel launches total)."""
Comment thread lmdeploy/pytorch/kernels/cuda/fused_moe.py
Comment on lines +341 to +343
counts = torch.zeros(num_experts, dtype=topk_ids.dtype, device=topk_ids.device)
local_pos = torch.empty(N, dtype=topk_ids.dtype, device=topk_ids.device)
_sorted_idx_phase1_kernel[grid](topk_ids, counts, local_pos, N, BLOCK_SIZE=BLOCK_SIZE)
Collaborator Author


Use int64 to avoid overflow when indexing the expert weights.
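A rough back-of-the-envelope check shows why int32 indices can overflow here. The hidden and intermediate sizes below are illustrative assumptions, not taken from the PR; only the expert count matches the one mentioned in this thread:

```python
# Illustrative sizes (assumptions, not from the PR): a flattened offset
# into the stacked expert weights can exceed the int32 range.
num_experts = 10240   # expert count mentioned in this thread
hidden_dim = 4096     # assumed hidden size
inter_dim = 1024      # assumed intermediate size
max_offset = num_experts * hidden_dim * inter_dim - 1
print(max_offset)             # ~4.3e10, well above int32 max (2147483647)
```

With offsets of that magnitude, int32 arithmetic would silently wrap, so the sorted indices are kept in int64.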

Comment thread lmdeploy/pytorch/kernels/cuda/fused_moe.py
Comment on lines +148 to +155
# Compute exp_start = exp_end - counts (only first block writes it)
if pid == 0:
e_offs = tl.arange(0, BLOCK_E)
e_mask = e_offs < num_experts
end_val = tl.load(ExpEnd + e_offs, mask=e_mask)
cnt_val = tl.load(Counts + e_offs, mask=e_mask)
tl.store(ExpStart + e_offs, end_val - cnt_val, mask=e_mask)
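The pid == 0 branch above recovers each expert's exclusive start offset by subtracting its count from its inclusive prefix sum, so no separate exclusive-scan pass is needed. The same arithmetic in plain Python (example data only):

```python
from itertools import accumulate

counts = [3, 0, 2, 5]                  # per-expert token counts (example data)
exp_end = list(accumulate(counts))     # inclusive prefix sum: [3, 3, 5, 10]
exp_start = [e - c for e, c in zip(exp_end, counts)]  # exclusive: [0, 3, 3, 5]
```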

Collaborator Author


Tested on H800 with 10240 experts.
