
Feat: (model) qwen image vae checkpoint #9108

Open
Pfannkuchensack wants to merge 13 commits into invoke-ai:main from Pfannkuchensack:feat/qwen-image-vae-checkpoint

Conversation

Collaborator

Pfannkuchensack commented May 1, 2026

Summary

Adds standalone model support for Qwen Image so users no longer need the full ~40 GB Diffusers pipeline. A GGUF transformer can now be combined with a standalone VAE checkpoint and a standalone Qwen2.5-VL encoder (Diffusers folder or ComfyUI single-file fp8), and the Component Source (Diffusers) field becomes a fallback rather than a hard requirement. All standalone components are also exposed as installable starter models, so a fully working GGUF setup can be installed in one click.

Why: The Qwen Image PR (#9000) only allowed loading the VAE and text encoder from the full Diffusers pipeline. That meant ~40 GB on disk just to use a tiny VAE (~250 MB) plus the encoder (~16 GB), and re-downloading both for every model variant. The smallest fully-standalone setup with this PR drops to ~12 GB (GGUF transformer + ~250 MB VAE + ~7 GB ComfyUI fp8 encoder).

How:

Backend

  • VAE checkpoint: new VAE_Checkpoint_QwenImage_Config detects single-file Qwen Image VAEs via 5D conv weights + z_dim=16 and loads them via AutoencoderKLQwenImage (init_empty_weights + load_state_dict; see the VAE sketch after this list). The generic VAE checkpoint matcher now explicitly excludes Qwen Image VAEs so they aren't misclassified as FLUX.
  • Qwen2.5-VL encoder (Diffusers folder): new ModelType.QwenVLEncoder + ModelFormat.QwenVLEncoder with QwenVLEncoder_Diffusers_Config recognising directories that contain text_encoder/ (with Qwen2_5_VLForConditionalGeneration / Qwen2VLForConditionalGeneration) + tokenizer/. The new QwenVLEncoderLoader handles Tokenizer and TextEncoder submodel loading from the folder layout.
  • Qwen2.5-VL encoder (ComfyUI single-file): new QwenVLEncoder_Checkpoint_Config matches consolidated single-file checkpoints (e.g. qwen_2.5_vl_7b_fp8_scaled.safetensors) by detecting both LM keys (model.embed_tokens / model.layers.*) and visual tower keys (visual.patch_embed.* / visual.blocks.*). The new QwenVLEncoderCheckpointLoader loads the safetensors, dequantises ComfyUI fp8 weights via weight * weight_scale (with block-wise expansion, mirroring the Z-Image Qwen3 loader; see the dequantisation sketch after this list), strips comfy_quant / weight_scale / scaled_fp8 metadata, fetches the architecture config from Qwen/Qwen2.5-VL-7B-Instruct (offline-cache fallback), and instantiates Qwen2_5_VLForConditionalGeneration via init_empty_weights + assign load. Tokenizer comes from the same HF repo with offline fallback.
  • Text encoder invocation: qwen_image_text_encoder.py now branches on whether model_root is a file. Single-file checkpoints get tokenizer + image processor from HuggingFace (Qwen/Qwen2.5-VL-7B-Instruct, ~10 MB, cached); the existing folder layout path is unchanged. BnB-quantised loading falls back to the cached encoder for single-file checkpoints since BnB can't load from a bare safetensors and the file is already FP8.
  • Loader invocation: QwenImageModelLoaderInvocation gains optional vae_model and qwen_vl_encoder_model fields. Resolution priority for each component: standalone override → main model (if Diffusers) → Component Source (see the resolution sketch after this list). Bumped to v1.2.0.
  • Starter models: three new starter entries — Qwen Image VAE (single-file checkpoint, ~250 MB), Qwen2.5-VL Encoder (fp8 scaled) (ComfyUI single-file, ~7 GB), and Qwen2.5-VL Encoder (Diffusers) (multi-folder HF download text_encoder+tokenizer+processor, ~16 GB). All 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and txt2img) declare the VAE + fp8 encoder as dependencies, so installing any of them auto-installs a complete generation-ready setup. The Qwen Image starter bundle gets the VAE and fp8 encoder prepended too.
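
A minimal sketch of the VAE detect-and-load flow described above. The state-dict key name is an assumption for illustration, not the matcher's actual key; the default-config-plus-strict-load approach is the one the PR uses:

```python
from accelerate import init_empty_weights
from diffusers import AutoencoderKLQwenImage


def looks_like_qwen_image_vae(state_dict: dict) -> bool:
    # Qwen Image's VAE uses causal 3D convolutions, so its conv weights are
    # 5D (out, in, t, h, w) rather than the 4D convs of SD/FLUX VAEs.
    conv = state_dict.get("encoder.conv_in.weight")  # assumed key name
    return conv is not None and conv.ndim == 5


def load_qwen_image_vae(state_dict: dict) -> AutoencoderKLQwenImage:
    # The default config (z_dim=16) is assumed to match the standalone
    # checkpoint; strict loading surfaces any architecture mismatch at once.
    with init_empty_weights():
        vae = AutoencoderKLQwenImage()
    # assign=True materializes the meta tensors directly from the checkpoint.
    vae.load_state_dict(state_dict, strict=True, assign=True)
    return vae
```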
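
A hedged sketch of the fp8 dequantisation step; the block layout is an assumption mirroring the Z-Image Qwen3 loader named above, and the comfy_quant / weight_scale / scaled_fp8 keys are stripped from the state dict after this runs:

```python
import torch


def dequantize_comfy_fp8(weight: torch.Tensor, weight_scale: torch.Tensor) -> torch.Tensor:
    # fp8 tensors can't be used directly; upcast first, then undo the scale.
    w = weight.to(torch.bfloat16)
    s = weight_scale.to(torch.bfloat16)
    if s.numel() == 1:
        return w * s  # per-tensor scale
    # Block-wise scales (one scale per block of output rows, an assumed
    # layout): expand each scale over its block, then broadcast over columns.
    rows_per_block = w.shape[0] // s.shape[0]
    s = s.reshape(-1).repeat_interleave(rows_per_block)
    return w * s.view(-1, *([1] * (w.ndim - 1)))
```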
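
The loader's three-way resolution order reads as a plain fallback chain; the names here are illustrative, not the invocation's actual fields:

```python
def resolve_submodel(standalone_override, main_model, component_source):
    if standalone_override is not None:
        return standalone_override   # 1. standalone model wins
    if main_model is not None and main_model.format == "diffusers":
        return main_model            # 2. full Diffusers main model
    return component_source          # 3. Component Source fallback
```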

Frontend

  • New params state, selectors and reducers for qwenImageVaeModel and qwenImageQwenVLEncoderModel, plus a migration entry.
  • Combobox UI added alongside the existing Component Source picker (single component now renders all three: VAE / Qwen2.5-VL Encoder / Component Source).
  • New useQwenImageVAEModels / useQwenVLEncoderModels hooks, isQwenImageVAEModelConfig / isQwenVLEncoderModelConfig type guards, and Model Manager category + format badge entries.
  • Graph builder passes the new fields to the loader node.
  • Readiness check for GGUF Qwen Image now allows either a standalone source or a Component Source for the VAE and encoder independently.
  • schema.ts patched manually for the new ModelType / ModelFormat values, the QwenVLEncoder_Diffusers_Config and QwenVLEncoder_Checkpoint_Config schemas, the new loader fields, and the AnyModelConfig union.

Related Issues / Discussions

Follow-up to #9000 (Qwen Image full pipeline support). Closes the standalone-component gap that was called out for users with limited disk space.

QA Instructions

Quickest verification (recommended):
Install one of the GGUF starter models (e.g. Qwen Image Edit 2511 (Q4_K_M)) from the starter list. The VAE and fp8 encoder should be auto-installed as dependencies, and the model should generate without any further configuration.

Setup options for manual testing:

  1. VAE only: install Qwen Image VAE from the starter list (or download vae/diffusion_pytorch_model.safetensors from a Qwen Image HF repo manually, ~250 MB). Verify it's identified as a Qwen Image VAE checkpoint.
  2. Encoder folder: install Qwen2.5-VL Encoder (Diffusers) from the starter list (or download text_encoder/ + tokenizer/ (+ optionally processor/) from Qwen/Qwen-Image-Edit-2511 manually). Verify it's identified as qwen_vl_encoder / qwen_vl_encoder.
  3. Encoder single-file: install Qwen2.5-VL Encoder (fp8 scaled) from the starter list (or qwen_2.5_vl_7b_fp8_scaled.safetensors directly, ~7 GB). Verify it's identified as qwen_vl_encoder / checkpoint. First generation will fetch the tokenizer + processor configs from Qwen/Qwen2.5-VL-7B-Instruct (~10 MB) and cache them.
  4. Full standalone setup: GGUF transformer + standalone VAE + standalone Qwen2.5-VL encoder (folder or single-file), with no Component Source set — should generate successfully.

Cases to verify on the Qwen Image generation tab:

  • Diffusers main model, no overrides → all submodels come from main (existing behaviour).
  • Diffusers main model + standalone VAE → VAE override is used, encoder still from main.
  • GGUF main + Component Source only → unchanged behaviour, still works.
  • GGUF main + standalone VAE + Component Source → VAE from standalone, encoder from Component Source.
  • GGUF main + standalone VAE + standalone Encoder (folder), no Component Source → both come from the standalone models, no error.
  • GGUF main + standalone VAE + ComfyUI single-file Encoder, no Component Source → generates after first-time tokenizer/processor download.
  • GGUF main with neither standalone Encoder nor Component Source → readiness check blocks generation with a clear reason.
  • Quantized encoder (int8 / nf4) still works against a standalone encoder folder. Single-file Encoder + int8 / nf4 falls back to the cached non-BnB path (still works, no error).

Starter model checks:

  • Starter list shows three new entries under model components: Qwen Image VAE, Qwen2.5-VL Encoder (fp8 scaled), Qwen2.5-VL Encoder (Diffusers).
  • Installing any GGUF Qwen Image starter (Q2_K / Q4_K_M / Q6_K / Q8_0, Edit or txt2img) also auto-installs the VAE and fp8 encoder.

Automated checks:

  • pytest tests/app/invocations/test_qwen_image_model_loader.py tests/backend/model_manager/configs/ — 16 passed.
  • pytest -k "qwen_image" (excluding unrelated PIL get_flattened_data test) — 53 passed.
  • Frontend: pnpm lint:tsc / pnpm lint:eslint / pnpm lint:prettier / pnpm lint:knip all green.

Merge Plan

Standard merge.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

…pport

Add standalone model types so Qwen Image can be run without downloading the
full ~40 GB Diffusers pipeline. The VAE and Qwen2.5-VL encoder can now each
come from their own model, with the Component Source (Diffusers) acting as a
fallback for any submodel not provided separately.
Add a checkpoint loader for ComfyUI-style consolidated Qwen2.5-VL encoder
files (e.g. qwen_2.5_vl_7b_fp8_scaled.safetensors), which bundle the language
model and visual tower into one safetensors with FP8 + per-tensor weight_scale
quantization. This drops the standalone encoder footprint from ~16 GB
(Diffusers folder, FP16) to ~7 GB.
Add three new starter models so users can install a complete GGUF Qwen Image
setup in one click without ever touching the full ~40 GB Diffusers pipeline:

- "Qwen Image VAE" — single-file VAE checkpoint pulled from the Qwen-Image
  repo (~250 MB).
- "Qwen2.5-VL Encoder (fp8 scaled)" — ComfyUI single-file FP8 encoder
  (~7 GB).
- "Qwen2.5-VL Encoder (Diffusers)" — full-precision encoder via multi-folder
  HF download (text_encoder+tokenizer+processor, ~16 GB).

The 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and
txt2img) now declare the VAE + fp8 encoder as dependencies, so installing
any of them automatically pulls in everything needed to generate. The
fp8 encoder is preferred as the default dependency since it's smaller and
the on-the-fly dequantization is essentially free at runtime.

The Qwen Image starter bundle gets the VAE and fp8 encoder prepended so
the bundled Lightning LoRA variants also benefit.
github-actions bot added the python, invocations, backend, frontend labels May 1, 2026
lstein self-assigned this May 5, 2026
lstein added the v6.13.x label May 5, 2026
lstein moved this to 6.13.x Theme: MODELS in Invoke - Community Roadmap May 5, 2026
Pfannkuchensack marked this pull request as ready for review May 5, 2026 17:11
Collaborator

JPPhoto commented May 6, 2026

Findings:

  • invokeai/frontend/web/src/features/controlLayers/store/paramsSlice.ts:713
    The params slice schema adds required fields at invokeai/frontend/web/src/features/controlLayers/store/types.ts:818 and invokeai/frontend/web/src/features/controlLayers/store/types.ts:819, but the persisted state version remains _version: 2 and the migration only handles v1 -> v2 before calling zParamsState.parse(state) at invokeai/frontend/web/src/features/controlLayers/store/paramsSlice.ts:719. Any existing user with a v2 persisted params object will not have qwenImageVaeModel or qwenImageQwenVLEncoderModel, so parse throws during rehydration. The store catches migration failures by falling back to getInitialState() at invokeai/frontend/web/src/app/store/store.ts:159, which wipes persisted params such as selected model, prompt state, dimensions, and component selections on upgrade. To expose this issue, add a params migration test that feeds a valid pre-PR v2 persisted params object into paramsSliceConfig.persistConfig.migrate and asserts the new Qwen Image fields are backfilled to null while existing params are preserved.
  • invokeai/frontend/web/src/features/nodes/util/graph/generation/buildQwenImageGraph.ts:203
    The graph metadata records qwen_image_component_source, quantization, and shift, but not the new standalone qwenImageVaeModel or qwenImageQwenVLEncoderModel values that are actually passed into the loader at lines 75-76. The metadata recall path only knows about qwen_image_component_source in invokeai/frontend/web/src/features/metadata/parsing.tsx:708, and there are no handlers for the standalone VAE/encoder keys. Scenario: generate with a GGUF Qwen Image model using standalone VAE + standalone Qwen VL encoder and no component source. The image metadata cannot reconstruct those component selections, so recall loses the model sources needed to reproduce the generation. To expose this issue, add a metadata/build graph test that sets both standalone component params, asserts both are emitted into graph metadata, then asserts metadata recall dispatches both selectors.
  • invokeai/frontend/web/src/features/controlLayers/store/paramsSlice.ts: no migration test covers persisted v2 params. This is the test gap behind the migration finding.
  • invokeai/backend/model_manager/load/model_loaders/qwen_image.py: single-file Qwen VL encoder loading may download tokenizer/config from HuggingFace at runtime when not cached. That may be intentional, but it creates an offline-first behavior gap for a local model install path.
  • invokeai/backend/model_manager/configs/qwen_vl_encoder.py: model identification loads the checkpoint state dict to classify Qwen VL files. For a 7GB fp8 encoder this can make model scan/import expensive. Existing code has similar patterns, so I would treat it as performance risk unless users report scan stalls.
  • invokeai/frontend/web/src/app/store/middleware/listenerMiddleware/listeners/modelSelected.ts: switching away from Qwen Image clears only qwenImageComponentSource, not the two new standalone selections. That is probably harmless persistence, but it is inconsistent with the existing cleanup behavior.

…ll in metadata, optimize scan

- bump params slice persisted state to v3 with a v2→v3 migration that
  backfills qwenImageVaeModel and qwenImageQwenVLEncoderModel to null,
  preventing existing users from losing all persisted params on upgrade
- emit qwen_image_vae and qwen_image_qwen_vl_encoder into graph metadata
  and add recall handlers so generations using standalone components are
  reproducible
- clear the two new fields in the modelSelected listener when switching
  away from qwen-image, matching the existing cleanup pattern
- identify single-file Qwen VL encoder checkpoints by reading only the
  safetensors key index via safe_open (sketched below), instead of
  loading the full ~7GB state dict into RAM during model scan
- log a clear info message and raise an actionable RuntimeError when the
  first-time HuggingFace tokenizer/config download is needed but offline,
  pointing users to the diffusers folder layout as an offline alternative
- add unit tests for the migration, metadata recall, and identification
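
A minimal sketch of that key-index identification change, using the ComfyUI-style prefixes named earlier; only the safetensors header is parsed, no tensor data is read:

```python
from safetensors import safe_open


def is_single_file_qwen_vl_encoder(path: str) -> bool:
    # safe_open parses only the file's JSON header, so a ~7 GB checkpoint
    # is classified without loading any weights into RAM.
    with safe_open(path, framework="pt") as f:
        keys = list(f.keys())
    has_language_model = any(k.startswith("model.layers.") for k in keys)
    has_visual_tower = any(k.startswith("visual.blocks.") for k in keys)
    return has_language_model and has_visual_tower
```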
github-actions bot added the python-tests label May 6, 2026
Collaborator

JPPhoto commented May 7, 2026

Areas of concern worth noting:

  • invokeai/backend/model_manager/load/model_loaders/qwen_image.py
    The single-file Qwen VL encoder path depends on HuggingFace cache/network for tokenizer and config on first use. The errors are now clearer, but fully offline installs of only the .safetensors encoder still cannot run until those small HF assets are cached.
  • invokeai/backend/model_manager/load/model_loaders/qwen_image.py
    _load_text_encoder_from_singlefile() dequantizes fp8 weights into floating tensors before model construction. This may create a high transient RAM peak for the 7GB encoder. I did not prove a regression because this may be required by the current loading approach.
  • invokeai/backend/model_manager/configs/qwen_vl_encoder.py
    Qwen VL checkpoint detection relies on key prefixes like visual.patch_embed.* / visual.blocks.*. The added tests cover intended Comfy-style naming, but alternate repacks with different vision tower prefixes could be rejected.
  • invokeai/backend/model_manager/load/model_loaders/vae.py
    Qwen Image VAE loading uses AutoencoderKLQwenImage() default config with strict state dict loading. This is good for catching mismatches, but it assumes the standalone checkpoint exactly matches diffusers’ default Qwen Image VAE architecture.
  • invokeai/frontend/web/src/features/queue/store/readiness.ts
    The readiness message still uses the older noQwenImageComponentSourceSelected text even when the missing source is specifically only VAE or only encoder. Not a behavioral bug, but the user feedback is less precise than the new split controls.
  • Integration coverage gap
    Tests now cover migration, metadata recall, graph building, and config identification. They do not cover an end-to-end install -> model selection -> graph execution flow with the standalone VAE plus single-file Qwen VL encoder.

Collaborator

lstein commented May 7, 2026

QA Checklist:

Starting with an empty root and clearing out the HF cache each time:

  • Quick install of a GGUF starter model - FAILED
  • Install VAE only - WORKED
  • Install Encoder folder - WORKED
  • Install Encoder single-file - FAILED
  • Full standalone setup - WORKED, provided the encoder folder version was used.

Collaborator

lstein commented May 7, 2026

I am going through the QA steps, each time starting out with a virgin root and clearing out the HF cache.

Failure when installing a GGUF starter model

Issue #1 - Selecting VAE and Encoder not intuitive to new users

I installed the starter model `Qwen Image Edit 2511 (Q4_K_M)`. I got the transformer, the VAE and the encoder as expected. I then went to the linear view and selected the model, but I didn't get the yellow Invoke ready button. I had to go to Advanced and select the VAE and Qwen2.5-VL Encoder. Now, I know to do that, but will a new user? I think that Invoke should be able to autoselect the first working VAE and encoder that it finds as the default. I did something similar to this for FLUX.2 in #9108.

Issue #2 - VAE/Encoder Source tip

New users may also find it confusing that the VAE/Encoder Source (Diffusers) field says "Required for GGUF models". If you have a standalone VAE and encoder installed, you don't need to specify the diffusers source. The tip should read "GGUF models require this unless a standalone VAE & Encoder is installed"

Issue #3 - Generation crash

After selecting the VAE and Encoder I tried to generate, but got a stack trace:

[2026-05-06 21:42:23,119]::[QwenVLEncoderCheckpointLoader]::WARNING --> 1444 unexpected keys in checkpoint, first 5: ['visual.blocks.0.attn.proj.scale_input', 'visual.blocks.0.attn.proj.scale_weight', 'visual.blocks.0.attn.qkv.scale_input', 'visual.blocks.0.attn.qkv.scale_weight', 'visual.blocks.0.mlp.down_proj.scale_input']
[2026-05-06 21:42:23,122]::[InvokeAI]::ERROR --> Error while invoking session 7a866454-ffb8-439a-8438-86d870ee8589, invocation 97224537-6ade-43b4-960f-5be1165d4eca (qwen_image_text_encoder): Failed to load all parameters from checkpoint. Meta tensors remain: ['model.visual.patch_embed.proj.weight', 'model.visual.blocks.0.norm1.weight', 'model.visual.blocks.0.norm2.weight', 'model.visual.blocks.0.attn.qkv.weight', 'model.visual.blocks.0.attn.qkv.bias']
[2026-05-06 21:42:23,122]::[InvokeAI]::ERROR --> Traceback (most recent call last):
  File "/home/lstein/Projects/InvokeAI/invokeai/app/services/session_processor/session_processor_default.py", line 130, in run_node
    output = invocation.invoke_internal(context=context, services=self._services)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  File "/home/lstein/Projects/InvokeAI/invokeai/backend/model_manager/load/model_loaders/qwen_image.py", line 245, in _load_model
    return self._load_text_encoder_from_singlefile(config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lstein/Projects/InvokeAI/invokeai/backend/model_manager/load/model_loaders/qwen_image.py", line 396, in _load_text_encoder_from_singlefile
    raise RuntimeError(f"Failed to load all parameters from checkpoint. Meta tensors remain: {meta_params[:5]}")
RuntimeError: Failed to load all parameters from checkpoint. Meta tensors remain: ['model.visual.patch_embed.proj.weight', 'model.visual.blocks.0.norm1.weight', 'model.visual.blocks.0.norm2.weight', 'model.visual.blocks.0.attn.qkv.weight', 'model.visual.blocks.0.attn.qkv.bias']

Standalone components, full diffusers model present

The next tests were performed after installing the full diffusers model.

  1. The full diffusers model installs and generates. WORKS
  2. The standalone VAE starter model installs and generates (using the diffusers for the encoder source) WORKS
  3. The standalone Qwen2.5-VL Encoder (Diffusers) folder WORKS
  4. The standalone Qwen2.5-VL Encoder (fp8 scaled) single-file FAILS and gives me the stack trace described above.
  5. Full standalone install WORKS, as long as I use the folder version of the encoder. Note that installing any of the quantized GGUF models gives me the encoder file version, which breaks.

…ingle-file encoder crash

- Auto-select first available standalone VAE and Qwen2.5-VL encoder when
  switching to a Qwen Image model, so GGUF users are ready-to-go without
  digging into Advanced. Prefers the diffusers-folder encoder over the
  single-file checkpoint.
- Update the "Required for GGUF models" placeholder to clarify that
  the diffusers source is only required when a standalone VAE & encoder
  is not installed.
- Fix QwenVLEncoderCheckpointLoader crash on ComfyUI fp8_scaled
  single-file encoders. Two issues: (1) handle the `.scale_weight` /
  `.scale_input` quantization key scheme alongside `.weight_scale`,
  and (2) apply Qwen2_5_VLForConditionalGeneration's
  _checkpoint_conversion_mapping before load_state_dict (sketched
  below) so legacy `visual.*` / `model.*` keys map onto the new
  `model.visual.*` / `model.language_model.*` layout expected by
  transformers ≥4.50.
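
A hedged sketch of the remapping half of that fix, assuming _checkpoint_conversion_mapping is the regex → replacement dict that transformers exposes on the model class (its exact contents are transformers internals, not reproduced here):

```python
import re

from transformers import Qwen2_5_VLForConditionalGeneration


def remap_legacy_keys(state_dict: dict) -> dict:
    # transformers >= 4.50 moved Qwen2.5-VL weights under model.visual.* /
    # model.language_model.*; the class carries a mapping from the legacy
    # visual.* / model.* layout used by older single-file checkpoints.
    mapping = getattr(Qwen2_5_VLForConditionalGeneration, "_checkpoint_conversion_mapping", {})
    remapped = {}
    for key, tensor in state_dict.items():
        for pattern, replacement in mapping.items():
            key = re.sub(pattern, replacement, key)
        remapped[key] = tensor
    return remapped
```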
Collaborator

lstein commented May 7, 2026

QA Checklist:

Starting with an empty root and clearing out the HF cache each time:

  • Quick install of a GGUF starter model - WORKS
  • Install VAE only - WORKS
  • Install Encoder folder - WORKS
  • Install Encoder single-file - WORKS
  • Full standalone setup - WORKS

Summary: the most recent commit has addressed the usability and failure issues identified earlier.

Collaborator

JPPhoto left a comment


@lstein If you've tested this (I have only looked at the code), then it's good to merge.

