Mixed modality #4531

Open
CUHKSZzxy wants to merge 23 commits into InternLM:main from CUHKSZzxy:mixed-modality

Conversation

Collaborator

@CUHKSZzxy CUHKSZzxy commented Apr 16, 2026

Refactor the VLM preprocessing pipeline to support mixed modality (image + video in one request), and extend forward to handle per-modality multimodal masks.

For backward compatibility, the new pipeline is opt-in: models that override preprocess(self, messages) continue to use the old path unchanged. New-style models inherit the base implementation and are detected automatically via inspect.signature at init.

New-style models: Qwen3-VL, Qwen3.5-VL, InternS1-Pro, GLM4.1V
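
The dispatch described above can be sketched roughly as follows. This is a minimal illustration of the `inspect.signature`-based detection, not the actual implementation; the class names `BaseVLModel` and `LegacyModel` and the helper `uses_new_preprocess` are hypothetical stand-ins:

```python
import inspect

class BaseVLModel:
    """Stand-in for the shared base class in lmdeploy/vl/model/base.py."""

    def preprocess(self, messages, input_prompt=None, mm_processor_kwargs=None):
        ...

class LegacyModel(BaseVLModel):
    """Old-style model that overrides preprocess(self, messages)."""

    def preprocess(self, messages):
        ...

def uses_new_preprocess(model) -> bool:
    # New-style models inherit the base signature, which accepts input_prompt;
    # legacy overrides do not, so they are routed to the old path unchanged.
    return 'input_prompt' in inspect.signature(model.preprocess).parameters
```

Checking the signature once at init (rather than per request) keeps the dispatch cost out of the hot path.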


lmdeploy/vl/model/base.py

  • New preprocess(messages, input_prompt, mm_processor_kwargs) — collects all modalities from messages, calls HF processor once with images_kwargs/videos_kwargs to route per-modality size overrides independently (no cross-modality bleed), and returns one item per multimodal token
  • get_override_size(processor, mm_processor_kwargs, modality) — resolves min_pixels/max_pixels from mm_processor_kwargs['image'] or mm_processor_kwargs['video'] independently
  • get_expanded_input_ids / get_expanded_mm_items — expand placeholder tokens into per-token multimodal items for the PT engine
  • MultimodalSpecialTokens dataclass — centralises image/video/audio/time-series token IDs per model
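
A rough sketch of the two smaller pieces above, with simplified signatures (the real `get_override_size` also takes the HF processor; field names here are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MultimodalSpecialTokens:
    """Per-model special-token ids; None means the model lacks that modality."""
    image_token_id: Optional[int] = None
    video_token_id: Optional[int] = None
    audio_token_id: Optional[int] = None
    time_series_token_id: Optional[int] = None

def get_override_size(mm_processor_kwargs, modality):
    """Resolve min/max pixel overrides for one modality only (no cross-modality bleed)."""
    per_modality = (mm_processor_kwargs or {}).get(modality, {})
    return {k: per_modality[k] for k in ('min_pixels', 'max_pixels') if k in per_modality}
```

Because the lookup keys off `mm_processor_kwargs['image']` vs `mm_processor_kwargs['video']`, an image-only override cannot leak into video preprocessing.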

lmdeploy/serve/processors/multimodal.py

  • Signature-based dispatch: routes to new or legacy preprocess based on _uses_new_preprocess flag cached at init
  • apply_chat_template moved to engine layer so the new-style path tokenises after preprocessing

lmdeploy/pytorch/models/utils/model.py

  • get_multimodal_mask — builds a unified position mask across image, video, and time-series tokens for use in forward
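
The unified mask is essentially a union over the per-modality token ids. A minimal sketch (argument layout is assumed, not the actual signature):

```python
import torch

def get_multimodal_mask(input_ids, token_ids):
    """Union of positions whose token id is any multimodal id (image/video/time-series)."""
    mask = torch.zeros_like(input_ids, dtype=torch.bool)
    for tok in token_ids:
        if tok is not None:  # models without a given modality pass None
            mask |= input_ids == tok
    return mask
```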

lmdeploy/pytorch/models/ (Qwen3-VL, Qwen3.5, InternS1-Pro, GLM4.1V)

  • preprocess_input updated to unpack new-style per-modality items and build correct MultiModalData
  • forward uses get_multimodal_mask to scatter visual embeddings at the right positions

lmdeploy/vl/constants.py

  • Modality enum supports == comparison with plain strings for legacy compatibility
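
The string-comparison behavior can be achieved without inheriting from `str` by overriding `__eq__` and restoring `__hash__`. A minimal sketch (member set abbreviated):

```python
from enum import Enum

class Modality(Enum):
    IMAGE = 'image'
    VIDEO = 'video'

    def __eq__(self, other):
        if isinstance(other, str):  # legacy call sites compare against plain strings
            return self.value == other
        return super().__eq__(other)

    def __hash__(self):  # defining __eq__ discards the default hash, so restore it
        return hash(self.value)
```

Hashing by value also keeps dict lookups working whether the key is the enum member or the raw string.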

Docs

  • Add multimodal_inputs.md (EN + ZH) covering all modalities, local/base64 inputs, mm_processor_kwargs and media_io_kwargs

Tests

  • test_qwen3vl_processor.py: per-modality min_pixels/max_pixels override tests and mixed image+video independence test

CUHKSZzxy and others added 10 commits April 16, 2026 21:22
Add multimodal_inputs.md covering all modalities (text, image, video,
audio, time series, mixed) with OpenAI-style examples, local file /
base64 usage via lmdeploy.vl.utils helpers, and mm_processor_kwargs /
media_io_kwargs guidance. Link from vl_pipeline.md and index.rst.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@CUHKSZzxy CUHKSZzxy marked this pull request as ready for review April 17, 2026 07:09
Copilot AI review requested due to automatic review settings April 17, 2026 07:09
@CUHKSZzxy CUHKSZzxy removed the WIP label Apr 17, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR extends LMDeploy’s VLM preprocessing and PyTorch serving flow to support mixed-modality inputs (notably image+video) with per-modality processor kwargs, targeting specific new-style models (Qwen3-VL / Qwen3.5 / InternS1Pro / GLM4.1v) while keeping legacy behavior for others. It also adds comprehensive documentation for OpenAI-style multimodal message formats.

Changes:

  • Introduce a new “engine-aligned” preprocess path that uses apply_chat_template → preprocess(input_prompt, mm_processor_kwargs) and returns {input_ids, multimodal}.
  • Add mixed-modality handling in PyTorch models via a unified multimodal_mask and per-item offsets.
  • Add new multimodal input documentation (EN/ZH) and update related tests.

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.

Summary per file:

  • `tests/test_lmdeploy/test_vl/test_qwen3vl_processor.py`: Updates preprocessing test flow to match the engine pipeline; adds mixed image+video tests and per-modality kwargs usage.
  • `tests/test_lmdeploy/test_vl/test_hf_chat_template.py`: Adds/adjusts chat template tests, switching Qwen3-VL to the apply_chat_template API.
  • `lmdeploy/vl/model/qwen3_5.py`: Simplifies preprocessor build by deferring to superclass logic.
  • `lmdeploy/vl/model/qwen3.py`: Removes legacy per-modality preprocess/packing logic; keeps apply_chat_template and special-token setup for the new flow.
  • `lmdeploy/vl/model/interns1_pro.py`: Adds time-series special tokens and a time-series processor; updates to apply_chat_template.
  • `lmdeploy/vl/model/glm4_1v.py`: Introduces mm_tokens and switches to apply_chat_template.
  • `lmdeploy/vl/model/base.py`: Adds the new shared preprocess implementation returning {input_ids, multimodal} with per-item expansion/offset computation.
  • `lmdeploy/vl/engine.py`: Adds an apply_chat_template wrapper and signature-based preprocess kwargs passing (input_prompt/mm kwargs).
  • `lmdeploy/vl/constants.py`: Changes Modality enum behavior to support string comparisons/hashing without inheriting from str.
  • `lmdeploy/serve/processors/multimodal.py`: Detects the new preprocess API by signature; for the PyTorch backend uses apply_chat_template plus the new preprocess path.
  • `lmdeploy/serve/core/async_engine.py`: Makes prompt logging tolerant of a missing prompt in prompt_input.
  • `lmdeploy/pytorch/models/utils/model.py`: Adds the get_multimodal_mask helper to combine image/video/time-series token masks.
  • `lmdeploy/pytorch/models/qwen3_vl_moe.py`: Renames image_mask to multimodal_mask and uses it for embedding scatter.
  • `lmdeploy/pytorch/models/qwen3_vl.py`: Updates generation prep and multimodal packing to use offsets plus multimodal_mask; adjusts grid_thw stacking and the mRoPE helper.
  • `lmdeploy/pytorch/models/qwen3_5.py`: Same multimodal_mask refactor as Qwen3-VL.
  • `lmdeploy/pytorch/models/interns1_pro.py`: Updates image/video/time-series handling to use offsets and multimodal_mask; switches grid_thw to stack.
  • `lmdeploy/pytorch/models/glm4_1v.py`: Switches grid_thw concatenation to stacking; adds a dedicated Glm4vInputProcessor using offsets.
  • `lmdeploy/pytorch/messages.py`: Changes multimodal range filtering logic in HistoryMultiModals.get_datas.
  • `lmdeploy/pytorch/configurations/glm4_1v.py`: Adds a GLM4v config builder to align bos_token_id handling.
  • `docs/zh_cn/multi_modal/vl_pipeline.md`: Links to the new multimodal input reference doc.
  • `docs/zh_cn/multi_modal/multimodal_inputs.md`: Adds a detailed ZH multimodal message format reference and examples.
  • `docs/zh_cn/multi_modal/index.rst`: Adds the new guide to the ZH multi-modal docs toctree.
  • `docs/en/multi_modal/vl_pipeline.md`: Links to the new multimodal input reference doc.
  • `docs/en/multi_modal/multimodal_inputs.md`: Adds a detailed EN multimodal message format reference and examples.
  • `docs/en/multi_modal/index.rst`: Adds the new guide to the EN multi-modal docs toctree.

Comment on lines +127 to +129
start_positions = (mask & ~torch.roll(mask, 1)).nonzero(as_tuple=True)[0]
end_positions = (mask & ~torch.roll(mask, -1)).nonzero(as_tuple=True)[0]
end_positions += 1 # convert to exclusive end index, compatible with legacy pytorch implementation
Copilot AI Apr 17, 2026


get_mm_items_offset uses torch.roll to detect segment starts/ends. This breaks for edge cases (e.g., sequence length 1 where the only token is mm_token_id, or when the first/last positions are multimodal tokens) because roll wraps around. Consider using a non-wrapping boundary approach (e.g., diff with prepended/appended False) to compute start/end indices robustly.

Suggested change
- start_positions = (mask & ~torch.roll(mask, 1)).nonzero(as_tuple=True)[0]
- end_positions = (mask & ~torch.roll(mask, -1)).nonzero(as_tuple=True)[0]
- end_positions += 1  # convert to exclusive end index, compatible with legacy pytorch implementation
+ prev_mask = torch.cat((mask.new_zeros(1), mask[:-1]))
+ next_mask = torch.cat((mask[1:], mask.new_zeros(1)))
+ start_positions = (mask & ~prev_mask).nonzero(as_tuple=True)[0]
+ end_positions = (mask & ~next_mask).nonzero(as_tuple=True)[0]
+ end_positions += 1  # convert to exclusive end index, compatible with legacy pytorch implementation
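
The wrap-around failure is easy to reproduce with `torch`; with a length-1 sequence whose only token is multimodal, `torch.roll` shifts the lone True back onto itself, while padding with False does not:

```python
import torch

mask = torch.tensor([True])  # single token, and it is a multimodal token

# roll wraps: torch.roll(mask, 1) is still [True], so no segment start is found
starts = (mask & ~torch.roll(mask, 1)).nonzero(as_tuple=True)[0]

# non-wrapping variant: pad with False instead of wrapping
prev_mask = torch.cat((mask.new_zeros(1), mask[:-1]))
starts_fixed = (mask & ~prev_mask).nonzero(as_tuple=True)[0]
```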

Comment on lines 584 to 586
  for modal_data in modal_datas:
-     if (modal_data.start not in test_range and modal_data.end - 1 not in test_range):
+     if (modal_data.start not in test_range or modal_data.end - 1 not in test_range):
          continue
Copilot AI Apr 17, 2026


The range check in get_datas is incorrect for interval overlap. As written, it only includes multimodal data when both start and end-1 are inside [start, end), which drops partial overlaps and also fails the case where a multimodal span fully covers the query range. Use a proper overlap test (e.g., modal_data.start < end and modal_data.end > start).
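
The overlap test the reviewer describes is the standard half-open-interval check; a minimal sketch (the function name is illustrative):

```python
def overlaps(start_a, end_a, start_b, end_b):
    """True iff half-open intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a
```

Unlike the membership check, this also catches a multimodal span that fully covers the query range.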

- glm4_1v: guard chat_template_kwargs against None before ** expansion
- base: use local time_series_processor to avoid mutating self.processor
- base: fix preprocess return type annotation list[dict] -> dict[str, Any]
- base: lower valid size-override log from WARNING to INFO

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@CUHKSZzxy CUHKSZzxy mentioned this pull request Apr 17, 2026
4 tasks