
Feat: (model) qwen image vae checkpoint #9108

Open
Pfannkuchensack wants to merge 13 commits into invoke-ai:main from Pfannkuchensack:feat/qwen-image-vae-checkpoint

Conversation

Collaborator

Pfannkuchensack commented May 1, 2026

Summary

Adds standalone model support for Qwen Image so users no longer need the full ~40 GB Diffusers pipeline. A GGUF transformer can now be combined with a standalone VAE checkpoint and a standalone Qwen2.5-VL encoder (Diffusers folder or ComfyUI single-file fp8), and the Component Source (Diffusers) field becomes a fallback rather than a hard requirement. All standalone components are also exposed as installable starter models, so a fully working GGUF setup can be installed in one click.

Why: The Qwen Image PR (#9000) only allowed loading the VAE and text encoder from the full Diffusers pipeline. That meant ~40 GB on disk just to use a tiny VAE (~250 MB) plus the encoder (~16 GB), and re-downloading both for every model variant. The smallest fully-standalone setup with this PR drops to ~12 GB (GGUF transformer + ~250 MB VAE + ~7 GB ComfyUI fp8 encoder).

How:

Backend

  • VAE checkpoint: new VAE_Checkpoint_QwenImage_Config detects single-file Qwen Image VAEs via 5D conv weights + z_dim=16 and loads them via AutoencoderKLQwenImage (init_empty_weights + load_state_dict; see the VAE sketch after this list). The generic VAE checkpoint matcher now explicitly excludes Qwen Image VAEs so they aren't misclassified as FLUX.
  • Qwen2.5-VL encoder (Diffusers folder): new ModelType.QwenVLEncoder + ModelFormat.QwenVLEncoder with QwenVLEncoder_Diffusers_Config recognising directories that contain text_encoder/ (with Qwen2_5_VLForConditionalGeneration / Qwen2VLForConditionalGeneration) + tokenizer/. The new QwenVLEncoderLoader handles Tokenizer and TextEncoder submodel loading from the folder layout.
  • Qwen2.5-VL encoder (ComfyUI single-file): new QwenVLEncoder_Checkpoint_Config matches consolidated single-file checkpoints (e.g. qwen_2.5_vl_7b_fp8_scaled.safetensors) by detecting both LM keys (model.embed_tokens / model.layers.*) and visual tower keys (visual.patch_embed.* / visual.blocks.*). The new QwenVLEncoderCheckpointLoader loads the safetensors, dequantises ComfyUI fp8 weights via weight * weight_scale (with block-wise expansion, mirroring the Z-Image Qwen3 loader; see the dequantisation sketch after this list), strips comfy_quant / weight_scale / scaled_fp8 metadata, fetches the architecture config from Qwen/Qwen2.5-VL-7B-Instruct (offline-cache fallback), and instantiates Qwen2_5_VLForConditionalGeneration via init_empty_weights + assign load. Tokenizer comes from the same HF repo with offline fallback.
  • Text encoder invocation: qwen_image_text_encoder.py now branches on whether model_root is a file. Single-file checkpoints get tokenizer + image processor from HuggingFace (Qwen/Qwen2.5-VL-7B-Instruct, ~10 MB, cached); the existing folder layout path is unchanged. BnB-quantised loading falls back to the cached encoder for single-file checkpoints since BnB can't load from a bare safetensors and the file is already FP8.
  • Loader invocation: QwenImageModelLoaderInvocation gains optional vae_model and qwen_vl_encoder_model fields. Resolution priority for each component: standalone override → main model (if Diffusers) → Component Source (see the resolution sketch after this list). Bumped to v1.2.0.
  • Starter models: three new starter entries — Qwen Image VAE (single-file checkpoint, ~250 MB), Qwen2.5-VL Encoder (fp8 scaled) (ComfyUI single-file, ~7 GB), and Qwen2.5-VL Encoder (Diffusers) (multi-folder HF download text_encoder+tokenizer+processor, ~16 GB). All 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and txt2img) declare the VAE + fp8 encoder as dependencies, so installing any of them auto-installs a complete generation-ready setup. The Qwen Image starter bundle gets the VAE and fp8 encoder prepended too.
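
A minimal sketch of the VAE detect-and-load flow described above. The state-dict key name is an assumption for illustration, not the matcher's actual key; the default-config-plus-strict-load approach is the one the PR uses:

```python
from accelerate import init_empty_weights
from diffusers import AutoencoderKLQwenImage


def looks_like_qwen_image_vae(state_dict: dict) -> bool:
    # Qwen Image's VAE uses causal 3D convolutions, so its conv weights are
    # 5D (out, in, t, h, w) rather than the 4D convs of SD/FLUX VAEs.
    conv = state_dict.get("encoder.conv_in.weight")  # assumed key name
    return conv is not None and conv.ndim == 5


def load_qwen_image_vae(state_dict: dict) -> AutoencoderKLQwenImage:
    # The default config (z_dim=16) is assumed to match the standalone
    # checkpoint; strict loading surfaces any architecture mismatch at once.
    with init_empty_weights():
        vae = AutoencoderKLQwenImage()
    # assign=True materializes the meta tensors directly from the checkpoint.
    vae.load_state_dict(state_dict, strict=True, assign=True)
    return vae
```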
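
A hedged sketch of the fp8 dequantisation step; the block layout is an assumption mirroring the Z-Image Qwen3 loader named above, and the comfy_quant / weight_scale / scaled_fp8 keys are stripped from the state dict after this runs:

```python
import torch


def dequantize_comfy_fp8(weight: torch.Tensor, weight_scale: torch.Tensor) -> torch.Tensor:
    # fp8 tensors can't be used directly; upcast first, then undo the scale.
    w = weight.to(torch.bfloat16)
    s = weight_scale.to(torch.bfloat16)
    if s.numel() == 1:
        return w * s  # per-tensor scale
    # Block-wise scales (one scale per block of output rows, an assumed
    # layout): expand each scale over its block, then broadcast over columns.
    rows_per_block = w.shape[0] // s.shape[0]
    s = s.reshape(-1).repeat_interleave(rows_per_block)
    return w * s.view(-1, *([1] * (w.ndim - 1)))
```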
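
The loader's three-way resolution order reads as a plain fallback chain; the names here are illustrative, not the invocation's actual fields:

```python
def resolve_submodel(standalone_override, main_model, component_source):
    if standalone_override is not None:
        return standalone_override   # 1. standalone model wins
    if main_model is not None and main_model.format == "diffusers":
        return main_model            # 2. full Diffusers main model
    return component_source          # 3. Component Source fallback
```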

Frontend

  • New params state, selectors and reducers for qwenImageVaeModel and qwenImageQwenVLEncoderModel, plus a migration entry.
  • Combobox UI added alongside the existing Component Source picker (single component now renders all three: VAE / Qwen2.5-VL Encoder / Component Source).
  • New useQwenImageVAEModels / useQwenVLEncoderModels hooks, isQwenImageVAEModelConfig / isQwenVLEncoderModelConfig type guards, and Model Manager category + format badge entries.
  • Graph builder passes the new fields to the loader node.
  • Readiness check for GGUF Qwen Image now allows either a standalone source or a Component Source for the VAE and encoder independently.
  • schema.ts patched manually for the new ModelType / ModelFormat values, the QwenVLEncoder_Diffusers_Config and QwenVLEncoder_Checkpoint_Config schemas, the new loader fields, and the AnyModelConfig union.

Related Issues / Discussions

Follow-up to #9000 (Qwen Image full pipeline support). Closes the standalone-component gap that was called out for users with limited disk space.

QA Instructions

Quickest verification (recommended):
Install one of the GGUF starter models (e.g. Qwen Image Edit 2511 (Q4_K_M)) from the starter list. The VAE and fp8 encoder should be auto-installed as dependencies, and the model should generate without any further configuration.

Setup options for manual testing:

  1. VAE only: install Qwen Image VAE from the starter list (or download vae/diffusion_pytorch_model.safetensors from a Qwen Image HF repo manually, ~250 MB). Verify it's identified as a Qwen Image VAE checkpoint.
  2. Encoder folder: install Qwen2.5-VL Encoder (Diffusers) from the starter list (or download text_encoder/ + tokenizer/ (+ optionally processor/) from Qwen/Qwen-Image-Edit-2511 manually). Verify it's identified as qwen_vl_encoder / qwen_vl_encoder.
  3. Encoder single-file: install Qwen2.5-VL Encoder (fp8 scaled) from the starter list (or qwen_2.5_vl_7b_fp8_scaled.safetensors directly, ~7 GB). Verify it's identified as qwen_vl_encoder / checkpoint. First generation will fetch the tokenizer + processor configs from Qwen/Qwen2.5-VL-7B-Instruct (~10 MB) and cache them.
  4. Full standalone setup: GGUF transformer + standalone VAE + standalone Qwen2.5-VL encoder (folder or single-file), with no Component Source set — should generate successfully.

Cases to verify on the Qwen Image generation tab:

  • Diffusers main model, no overrides → all submodels come from main (existing behaviour).
  • Diffusers main model + standalone VAE → VAE override is used, encoder still from main.
  • GGUF main + Component Source only → unchanged behaviour, still works.
  • GGUF main + standalone VAE + Component Source → VAE from standalone, encoder from Component Source.
  • GGUF main + standalone VAE + standalone Encoder (folder), no Component Source → both come from the standalone models, no error.
  • GGUF main + standalone VAE + ComfyUI single-file Encoder, no Component Source → generates after first-time tokenizer/processor download.
  • GGUF main with neither standalone Encoder nor Component Source → readiness check blocks generation with a clear reason.
  • Quantized encoder (int8 / nf4) still works against a standalone encoder folder. Single-file Encoder + int8 / nf4 falls back to the cached non-BnB path (still works, no error).

Starter model checks:

  • Starter list shows three new entries under model components: Qwen Image VAE, Qwen2.5-VL Encoder (fp8 scaled), Qwen2.5-VL Encoder (Diffusers).
  • Installing any GGUF Qwen Image starter (Q2_K / Q4_K_M / Q6_K / Q8_0, Edit or txt2img) also auto-installs the VAE and fp8 encoder.

Automated checks:

  • pytest tests/app/invocations/test_qwen_image_model_loader.py tests/backend/model_manager/configs/ — 16 passed.
  • pytest -k "qwen_image" (excluding unrelated PIL get_flattened_data test) — 53 passed.
  • Frontend: pnpm lint:tsc / pnpm lint:eslint / pnpm lint:prettier / pnpm lint:knip all green.

Merge Plan

Standard merge.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

…pport

Add standalone model types so Qwen Image can be run without downloading the
full ~40 GB Diffusers pipeline. The VAE and Qwen2.5-VL encoder can now each
come from their own model, with the Component Source (Diffusers) acting as a
fallback for any submodel not provided separately.
Add a checkpoint loader for ComfyUI-style consolidated Qwen2.5-VL encoder
files (e.g. qwen_2.5_vl_7b_fp8_scaled.safetensors), which bundle the language
model and visual tower into one safetensors with FP8 + per-tensor weight_scale
quantization. This drops the standalone encoder footprint from ~16 GB
(Diffusers folder, FP16) to ~7 GB.
Add three new starter models so users can install a complete GGUF Qwen Image
setup in one click without ever touching the full ~40 GB Diffusers pipeline:

- "Qwen Image VAE" — single-file VAE checkpoint pulled from the Qwen-Image
  repo (~250 MB).
- "Qwen2.5-VL Encoder (fp8 scaled)" — ComfyUI single-file FP8 encoder
  (~7 GB).
- "Qwen2.5-VL Encoder (Diffusers)" — full-precision encoder via multi-folder
  HF download (text_encoder+tokenizer+processor, ~16 GB).

The 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and
txt2img) now declare the VAE + fp8 encoder as dependencies, so installing
any of them automatically pulls in everything needed to generate. The
fp8 encoder is preferred as the default dependency since it's smaller and
the on-the-fly dequantization is essentially free at runtime.

The Qwen Image starter bundle gets the VAE and fp8 encoder prepended so
the bundled Lightning LoRA variants also benefit.
github-actions bot added the python, invocations, backend, frontend labels May 1, 2026
lstein self-assigned this May 5, 2026
lstein added the v6.13.x label May 5, 2026
lstein moved this to 6.13.x Theme: MODELS in Invoke - Community Roadmap May 5, 2026
Pfannkuchensack marked this pull request as ready for review May 5, 2026 17:11
Collaborator

JPPhoto commented May 6, 2026

Findings:

  • invokeai/frontend/web/src/features/controlLayers/store/paramsSlice.ts:713
    The params slice schema adds required fields at invokeai/frontend/web/src/features/controlLayers/store/types.ts:818 and invokeai/frontend/web/src/features/controlLayers/store/types.ts:819, but the persisted state version remains _version: 2 and the migration only handles v1 -> v2 before calling zParamsState.parse(state) at invokeai/frontend/web/src/features/controlLayers/store/paramsSlice.ts:719. Any existing user with a v2 persisted params object will not have qwenImageVaeModel or qwenImageQwenVLEncoderModel, so parse throws during rehydration. The store catches migration failures by falling back to getInitialState() at invokeai/frontend/web/src/app/store/store.ts:159, which wipes persisted params such as selected model, prompt state, dimensions, and component selections on upgrade. To expose this issue, add a params migration test that feeds a valid pre-PR v2 persisted params object into paramsSliceConfig.persistConfig.migrate and asserts the new Qwen Image fields are backfilled to null while existing params are preserved.
  • invokeai/frontend/web/src/features/nodes/util/graph/generation/buildQwenImageGraph.ts:203
    The graph metadata records qwen_image_component_source, quantization, and shift, but not the new standalone qwenImageVaeModel or qwenImageQwenVLEncoderModel values that are actually passed into the loader at lines 75-76. The metadata recall path only knows about qwen_image_component_source in invokeai/frontend/web/src/features/metadata/parsing.tsx:708, and there are no handlers for the standalone VAE/encoder keys. Scenario: generate with a GGUF Qwen Image model using standalone VAE + standalone Qwen VL encoder and no component source. The image metadata cannot reconstruct those component selections, so recall loses the model sources needed to reproduce the generation. To expose this issue, add a metadata/build graph test that sets both standalone component params, asserts both are emitted into graph metadata, then asserts metadata recall dispatches both selectors.
  • invokeai/frontend/web/src/features/controlLayers/store/paramsSlice.ts: no migration test covers persisted v2 params. This is the test gap behind the migration finding.
  • invokeai/backend/model_manager/load/model_loaders/qwen_image.py: single-file Qwen VL encoder loading may download tokenizer/config from HuggingFace at runtime when not cached. That may be intentional, but it creates an offline-first behavior gap for a local model install path.
  • invokeai/backend/model_manager/configs/qwen_vl_encoder.py: model identification loads the checkpoint state dict to classify Qwen VL files. For a 7GB fp8 encoder this can make model scan/import expensive. Existing code has similar patterns, so I would treat it as performance risk unless users report scan stalls.
  • invokeai/frontend/web/src/app/store/middleware/listenerMiddleware/listeners/modelSelected.ts: switching away from Qwen Image clears only qwenImageComponentSource, not the two new standalone selections. That is probably harmless persistence, but it is inconsistent with the existing cleanup behavior.

…ll in metadata, optimize scan

- bump params slice persisted state to v3 with a v2→v3 migration that
  backfills qwenImageVaeModel and qwenImageQwenVLEncoderModel to null,
  preventing existing users from losing all persisted params on upgrade
- emit qwen_image_vae and qwen_image_qwen_vl_encoder into graph metadata
  and add recall handlers so generations using standalone components are
  reproducible
- clear the two new fields in the modelSelected listener when switching
  away from qwen-image, matching the existing cleanup pattern
- identify single-file Qwen VL encoder checkpoints by reading only the
  safetensors key index via safe_open (sketched below), instead of
  loading the full ~7GB state dict into RAM during model scan
- log a clear info message and raise an actionable RuntimeError when the
  first-time HuggingFace tokenizer/config download is needed but offline,
  pointing users to the diffusers folder layout as an offline alternative
- add unit tests for the migration, metadata recall, and identification
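
A minimal sketch of that key-index identification change, using the ComfyUI-style prefixes named earlier; only the safetensors header is parsed, no tensor data is read:

```python
from safetensors import safe_open


def is_single_file_qwen_vl_encoder(path: str) -> bool:
    # safe_open parses only the file's JSON header, so a ~7 GB checkpoint
    # is classified without loading any weights into RAM.
    with safe_open(path, framework="pt") as f:
        keys = list(f.keys())
    has_language_model = any(k.startswith("model.layers.") for k in keys)
    has_visual_tower = any(k.startswith("visual.blocks.") for k in keys)
    return has_language_model and has_visual_tower
```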
github-actions bot added the python-tests label May 6, 2026
Collaborator

JPPhoto commented May 7, 2026

Areas of concern worth noting:

  • invokeai/backend/model_manager/load/model_loaders/qwen_image.py
    The single-file Qwen VL encoder path depends on HuggingFace cache/network for tokenizer and config on first use. The errors are now clearer, but fully offline installs of only the .safetensors encoder still cannot run until those small HF assets are cached.
  • invokeai/backend/model_manager/load/model_loaders/qwen_image.py
    _load_text_encoder_from_singlefile() dequantizes fp8 weights into floating tensors before model construction. This may create a high transient RAM peak for the 7GB encoder. I did not prove a regression because this may be required by the current loading approach.
  • invokeai/backend/model_manager/configs/qwen_vl_encoder.py
    Qwen VL checkpoint detection relies on key prefixes like visual.patch_embed.* / visual.blocks.*. The added tests cover intended Comfy-style naming, but alternate repacks with different vision tower prefixes could be rejected.
  • invokeai/backend/model_manager/load/model_loaders/vae.py
    Qwen Image VAE loading uses AutoencoderKLQwenImage() default config with strict state dict loading. This is good for catching mismatches, but it assumes the standalone checkpoint exactly matches diffusers’ default Qwen Image VAE architecture.
  • invokeai/frontend/web/src/features/queue/store/readiness.ts
    The readiness message still uses the older noQwenImageComponentSourceSelected text even when the missing source is specifically only VAE or only encoder. Not a behavioral bug, but the user feedback is less precise than the new split controls.
  • Integration coverage gap
    Tests now cover migration, metadata recall, graph building, and config identification. They do not cover an end-to-end install -> model selection -> graph execution flow with the standalone VAE plus single-file Qwen VL encoder.

Collaborator

lstein commented May 7, 2026

QA Checklist:

Starting with an empty root and clearing out the HF cache each time:

  • Quick install of a GGUF starter model - FAILED
  • Install VAE only - WORKED
  • Install Encoder folder - WORKED
  • Install Encoder single-file - FAILED
  • Full standalone setup - WORKED, provided the encoder folder version was used.

Collaborator

lstein commented May 7, 2026

I am going through the QA steps, each time starting out with a virgin root and clearing out the HF cache.

Failure when installing a GGUF starter model

Issue #1 - Selecting VAE and Encoder not intuitive to new users

I installed the starter model `Qwen Image Edit 2511 (Q4_K_M)`. I got the transformer, the VAE and the encoder as expected. I then went to the linear view and selected the model, but I didn't get the yellow Invoke ready button. I had to go to Advanced and select the VAE and Qwen2.5-VL Encoder. Now, I know to do that, but will a new user? I think that Invoke should be able to autoselect the first working VAE and encoder that it finds as the default. I did something similar to this for FLUX.2 in #9108.

Issue #2 - VAE/Encoder Source tip

New users may also find it confusing that the VAE/Encoder Source (Diffusers) field says "Required for GGUF models". If you have a standalone VAE and encoder installed, you don't need to specify the diffusers source. The tip should read "GGUF models require this unless a standalone VAE & Encoder is installed"

Issue #3 - Generation crash

After selecting the VAE and Encoder I tried to generate, but got a stack trace:

[2026-05-06 21:42:23,119]::[QwenVLEncoderCheckpointLoader]::WARNING --> 1444 unexpected keys in checkpoint, first 5: ['visual.blocks.0.attn.proj.scale_input', 'visual.blocks.0.attn.proj.scale_weight', 'visual.blocks.0.attn.qkv.scale_input', 'visual.blocks.0.attn.qkv.scale_weight', 'visual.blocks.0.mlp.down_proj.scale_input']
[2026-05-06 21:42:23,122]::[InvokeAI]::ERROR --> Error while invoking session 7a866454-ffb8-439a-8438-86d870ee8589, invocation 97224537-6ade-43b4-960f-5be1165d4eca (qwen_image_text_encoder): Failed to load all parameters from checkpoint. Meta tensors remain: ['model.visual.patch_embed.proj.weight', 'model.visual.blocks.0.norm1.weight', 'model.visual.blocks.0.norm2.weight', 'model.visual.blocks.0.attn.qkv.weight', 'model.visual.blocks.0.attn.qkv.bias']
[2026-05-06 21:42:23,122]::[InvokeAI]::ERROR --> Traceback (most recent call last):
  File "/home/lstein/Projects/InvokeAI/invokeai/app/services/session_processor/session_processor_default.py", line 130, in run_node
    output = invocation.invoke_internal(context=context, services=self._services)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  File "/home/lstein/Projects/InvokeAI/invokeai/backend/model_manager/load/model_loaders/qwen_image.py", line 245, in _load_model
    return self._load_text_encoder_from_singlefile(config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lstein/Projects/InvokeAI/invokeai/backend/model_manager/load/model_loaders/qwen_image.py", line 396, in _load_text_encoder_from_singlefile
    raise RuntimeError(f"Failed to load all parameters from checkpoint. Meta tensors remain: {meta_params[:5]}")
RuntimeError: Failed to load all parameters from checkpoint. Meta tensors remain: ['model.visual.patch_embed.proj.weight', 'model.visual.blocks.0.norm1.weight', 'model.visual.blocks.0.norm2.weight', 'model.visual.blocks.0.attn.qkv.weight', 'model.visual.blocks.0.attn.qkv.bias']

Standalone components, full diffusers model present

The next tests were performed after installing the full diffusers model.

  1. The full diffusers model installs and generates. WORKS
  2. The standalone VAE starter model installs and generates (using the diffusers for the encoder source) WORKS
  3. The standalone Qwen2.5-VL Encoder (Diffusers) folder WORKS
  4. The standalone Qwen2.5-VL Encoder (fp8 scaled) single-file FAILS and gives me the stack trace described above.
  5. Full standalone install WORKS, as long as I use the folder version of the encoder. Note that installing any of the quantized GGUF models gives me the encoder file version, which breaks.

…ingle-file encoder crash

- Auto-select first available standalone VAE and Qwen2.5-VL encoder when
  switching to a Qwen Image model, so GGUF users are ready-to-go without
  digging into Advanced. Prefers the diffusers-folder encoder over the
  single-file checkpoint.
- Update the "Required for GGUF models" placeholder to clarify that
  the diffusers source is only required when a standalone VAE & encoder
  is not installed.
- Fix QwenVLEncoderCheckpointLoader crash on ComfyUI fp8_scaled
  single-file encoders. Two issues: (1) handle the `.scale_weight` /
  `.scale_input` quantization key scheme alongside `.weight_scale`,
  and (2) apply Qwen2_5_VLForConditionalGeneration's
  _checkpoint_conversion_mapping before load_state_dict (sketched
  below) so legacy `visual.*` / `model.*` keys map onto the new
  `model.visual.*` / `model.language_model.*` layout expected by
  transformers ≥4.50.
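
A hedged sketch of the remapping half of that fix, assuming _checkpoint_conversion_mapping is the regex → replacement dict that transformers exposes on the model class (its exact contents are transformers internals, not reproduced here):

```python
import re

from transformers import Qwen2_5_VLForConditionalGeneration


def remap_legacy_keys(state_dict: dict) -> dict:
    # transformers >= 4.50 moved Qwen2.5-VL weights under model.visual.* /
    # model.language_model.*; the class carries a mapping from the legacy
    # visual.* / model.* layout used by older single-file checkpoints.
    mapping = getattr(Qwen2_5_VLForConditionalGeneration, "_checkpoint_conversion_mapping", {})
    remapped = {}
    for key, tensor in state_dict.items():
        for pattern, replacement in mapping.items():
            key = re.sub(pattern, replacement, key)
        remapped[key] = tensor
    return remapped
```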
Collaborator

lstein commented May 7, 2026

QA Checklist:

Starting with an empty root and clearing out the HF cache each time:

  • Quick install of a GGUF starter model - WORKS
  • Install VAE only - WORKS
  • Install Encoder folder - WORKS
  • Install Encoder single-file - WORKS
  • Full standalone setup - WORKS

Summary: the most recent commit has addressed the usability and failure issues identified earlier.

Collaborator

JPPhoto left a comment


@lstein If you've tested this (I have only looked at the code), then it's good to merge.

