Feat:(model) qwen image vae checkpoint#9108
Pfannkuchensack wants to merge 13 commits into invoke-ai:main from
Conversation
…pport Add standalone model types so Qwen Image can be run without downloading the full ~40 GB Diffusers pipeline. The VAE and Qwen2.5-VL encoder can now each come from their own model, with the Component Source (Diffusers) acting as a fallback for any submodel not provided separately.
Add a checkpoint loader for ComfyUI-style consolidated Qwen2.5-VL encoder files (e.g. qwen_2.5_vl_7b_fp8_scaled.safetensors), which bundle the language model and visual tower into one safetensors with FP8 + per-tensor weight_scale quantization. This drops the standalone encoder footprint from ~16 GB (Diffusers folder, FP16) to ~7 GB.
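The per-tensor scheme stores each FP8 weight tensor alongside a single scalar scale, so dequantization amounts to one multiply per tensor. A minimal sketch, with numpy's float16 standing in for FP8 storage (numpy has no e4m3 dtype) and illustrative key names, not the exact keys from the checkpoint:

```python
import numpy as np

def dequantize_per_tensor(weight_q: np.ndarray, weight_scale: float) -> np.ndarray:
    """Per-tensor dequantization: one scalar scale for the whole tensor.

    In the real checkpoint the weight is stored as FP8 (e4m3) and the scale
    lives in a sibling `<name>.weight_scale` tensor; float16 stands in for
    the quantized storage here.
    """
    return weight_q.astype(np.float32) * weight_scale

# A state dict shaped like a ComfyUI fp8_scaled checkpoint (names illustrative):
state = {
    "model.layers.0.mlp.up_proj.weight": np.array([[2.0, -1.0]], dtype=np.float16),
    "model.layers.0.mlp.up_proj.weight_scale": 0.5,
}

w = dequantize_per_tensor(
    state["model.layers.0.mlp.up_proj.weight"],
    state["model.layers.0.mlp.up_proj.weight_scale"],
)
print(w.tolist())  # [[1.0, -0.5]]
```

Because the multiply happens once per tensor at load time, the runtime cost of this dequantization is negligible compared to the disk-footprint savings.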
Add three new starter models so users can install a complete GGUF Qwen Image setup in one click without ever touching the full ~40 GB Diffusers pipeline:

- "Qwen Image VAE" — single-file VAE checkpoint pulled from the Qwen-Image repo (~250 MB).
- "Qwen2.5-VL Encoder (fp8 scaled)" — ComfyUI single-file FP8 encoder (~7 GB).
- "Qwen2.5-VL Encoder (Diffusers)" — full-precision encoder via multi-folder HF download (text_encoder + tokenizer + processor, ~16 GB).

The 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and txt2img) now declare the VAE + fp8 encoder as dependencies, so installing any of them automatically pulls in everything needed to generate. The fp8 encoder is preferred as the default dependency since it's smaller and the on-the-fly dequantization is essentially free at runtime. The Qwen Image starter bundle gets the VAE and fp8 encoder prepended so the bundled Lightning LoRA variants also benefit.
Findings:
…ll in metadata, optimize scan

- bump params slice persisted state to v3 with a v2→v3 migration that backfills qwenImageVaeModel and qwenImageQwenVLEncoderModel to null, preventing existing users from losing all persisted params on upgrade
- emit qwen_image_vae and qwen_image_qwen_vl_encoder into graph metadata and add recall handlers so generations using standalone components are reproducible
- clear the two new fields in the modelSelected listener when switching away from qwen-image, matching the existing cleanup pattern
- identify single-file Qwen VL encoder checkpoints by reading only the safetensors key index via safe_open, instead of loading the full ~7 GB state dict into RAM during model scan
- log a clear info message and raise an actionable RuntimeError when the first-time HuggingFace tokenizer/config download is needed but offline, pointing users to the diffusers folder layout as an offline alternative
- add unit tests for the migration, metadata recall, and identification
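The scan optimisation works because `safetensors.safe_open` parses only the file header, so the key names are available without pulling the ~7 GB of tensor data into RAM (i.e. `with safe_open(path, framework="pt") as f: keys = list(f.keys())`). The classification then needs nothing but those names. The helper below is an illustrative sketch of that idea, not InvokeAI's actual matcher:

```python
def looks_like_qwen_vl_encoder(keys: list[str]) -> bool:
    """A consolidated single-file Qwen2.5-VL encoder bundles the language
    model and the visual tower, so both key families must be present."""
    has_lm = any(k.startswith(("model.embed_tokens", "model.layers.")) for k in keys)
    has_visual = any(k.startswith(("visual.patch_embed", "visual.blocks.")) for k in keys)
    return has_lm and has_visual

bundled = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "visual.blocks.0.attn.qkv.weight",
]
print(looks_like_qwen_vl_encoder(bundled))                      # True
print(looks_like_qwen_vl_encoder(["model.embed_tokens.weight"]))  # False (LM only)
```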
Areas of concern worth noting:
QA Checklist: Starting with an empty root and clearing out the HF cache each time:
I am going through the QA steps, each time starting out with a virgin root and clearing out the HF cache.

Failure when installing a GGUF starter model

Issue #1 - Selecting VAE and Encoder not intuitive to new users

I installed the starter model `Qwen Image Edit 2511 (Q4_K_M)`. I got the transformer, the VAE and the encoder as expected. I then went to the linear view and selected the model, but I didn't get the yellow Invoke ready button. I had to go to Advanced and select the VAE and Qwen2.5-VL Encoder. Now, I know to do that, but will a new user? I think that Invoke should be able to autoselect the first working VAE and encoder that it finds as the default. I did something similar to this for FLUX.2 in #9108.

Issue #2 - VAE/Encoder Source tip

New users may also find it confusing that the

Issue #3 - Generation crash

After selecting the VAE and Encoder I tried to generate, but got a stack trace:

Standalone components, full diffusers model present

The next tests were performed after installing the full diffusers model.
…ingle-file encoder crash

- Auto-select the first available standalone VAE and Qwen2.5-VL encoder when switching to a Qwen Image model, so GGUF users are ready to go without digging into Advanced. Prefers the diffusers-folder encoder over the single-file checkpoint.
- Update the "Required for GGUF models" placeholder to clarify that the diffusers source is only required when a standalone VAE & encoder is not installed.
- Fix QwenVLEncoderCheckpointLoader crash on ComfyUI fp8_scaled single-file encoders. Two issues: (1) handle the `.scale_weight` / `.scale_input` quantization key scheme alongside `.weight_scale`, and (2) apply Qwen2_5_VLForConditionalGeneration's _checkpoint_conversion_mapping before load_state_dict so legacy `visual.*` / `model.*` keys map onto the new `model.visual.*` / `model.language_model.*` layout expected by transformers ≥ 4.50.
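The legacy-to-new key remapping that transformers drives via `_checkpoint_conversion_mapping` amounts to prefix rewrites on state-dict keys. A sketch of that step under assumed patterns (the authoritative mapping lives on `Qwen2_5_VLForConditionalGeneration` and its exact regexes may differ between transformers versions):

```python
import re

# Approximate shape of the >=4.50 conversion mapping: top-level `visual.*`
# keys move under `model.visual.*`, and remaining `model.*` keys (the
# language model) move under `model.language_model.*`. Keys already in the
# new layout are left alone by the negative lookahead.
LEGACY_KEY_MAPPING = {
    r"^visual\.": "model.visual.",
    r"^model\.(?!visual\.|language_model\.)": "model.language_model.",
}

def remap_legacy_keys(state_dict: dict) -> dict:
    remapped = {}
    for key, value in state_dict.items():
        for pattern, replacement in LEGACY_KEY_MAPPING.items():
            new_key, n = re.subn(pattern, replacement, key)
            if n:
                key = new_key
                break
        remapped[key] = value
    return remapped

old = {
    "visual.blocks.0.attn.qkv.weight": 1,
    "model.layers.0.mlp.gate_proj.weight": 2,
}
print(remap_legacy_keys(old))
# {'model.visual.blocks.0.attn.qkv.weight': 1,
#  'model.language_model.layers.0.mlp.gate_proj.weight': 2}
```

Applying this before `load_state_dict` is what lets a checkpoint saved in the old layout load into a model instantiated from the new class definition.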
QA Checklist: Starting with an empty root and clearing out the HF cache each time:
Summary: the most recent commit has addressed the usability and failure issues identified earlier.
Summary
Adds standalone model support for Qwen Image so users no longer need the full ~40 GB Diffusers pipeline. A GGUF transformer can now be combined with a standalone VAE checkpoint, a standalone Qwen2.5-VL encoder (Diffusers folder or ComfyUI single-file fp8), and the Component Source (Diffusers) field becomes a fallback rather than a hard requirement. All standalone components are also exposed as installable starter models, so a fully working GGUF setup can be installed in one click.
Why: The Qwen Image PR (#9000) only allowed loading the VAE and text encoder from the full Diffusers pipeline. That meant ~40 GB on disk just to use a tiny VAE (~250 MB) plus the encoder (~16 GB), and re-downloading both for every model variant. The smallest fully-standalone setup with this PR drops to ~12 GB (GGUF transformer + ~250 MB VAE + ~7 GB ComfyUI fp8 encoder).
How:
Backend
- `VAE_Checkpoint_QwenImage_Config` detects single-file Qwen Image VAEs via 5D conv weights + `z_dim=16` and loads them via `AutoencoderKLQwenImage` (`init_empty_weights` + `load_state_dict`). The generic VAE checkpoint matcher now explicitly excludes Qwen Image VAEs so they aren't misclassified as FLUX.
- `ModelType.QwenVLEncoder` + `ModelFormat.QwenVLEncoder`, with `QwenVLEncoder_Diffusers_Config` recognising directories that contain `text_encoder/` (with `Qwen2_5_VLForConditionalGeneration` / `Qwen2VLForConditionalGeneration`) + `tokenizer/`. The new `QwenVLEncoderLoader` handles `Tokenizer` and `TextEncoder` submodel loading from the folder layout.
- `QwenVLEncoder_Checkpoint_Config` matches consolidated single-file checkpoints (e.g. `qwen_2.5_vl_7b_fp8_scaled.safetensors`) by detecting both LM keys (`model.embed_tokens` / `model.layers.*`) and visual tower keys (`visual.patch_embed.*` / `visual.blocks.*`). The new `QwenVLEncoderCheckpointLoader` loads the safetensors, dequantises ComfyUI fp8 weights via `weight * weight_scale` (with block-wise expansion, mirroring the Z-Image Qwen3 loader), strips `comfy_quant` / `weight_scale` / `scaled_fp8` metadata, fetches the architecture config from `Qwen/Qwen2.5-VL-7B-Instruct` (offline-cache fallback), and instantiates `Qwen2_5_VLForConditionalGeneration` via `init_empty_weights` + `assign` load. The tokenizer comes from the same HF repo with offline fallback.
- `qwen_image_text_encoder.py` now branches on whether `model_root` is a file. Single-file checkpoints get the tokenizer + image processor from HuggingFace (`Qwen/Qwen2.5-VL-7B-Instruct`, ~10 MB, cached); the existing folder layout path is unchanged. BnB-quantised loading falls back to the cached encoder for single-file checkpoints, since BnB can't load from a bare safetensors and the file is already FP8.
- `QwenImageModelLoaderInvocation` gains optional `vae_model` and `qwen_vl_encoder_model` fields. Resolution priority for each component: standalone override → main model (if Diffusers) → Component Source.
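The resolution priority for each submodel is a first-match fallback chain. A hypothetical helper sketching that logic (not the invocation's actual code; names are illustrative):

```python
def resolve_component(standalone, main_model_component, component_source_component):
    """Pick the first available source for a submodel:
    standalone override -> main model (if it is a Diffusers pipeline)
    -> Component Source fallback."""
    for candidate in (standalone, main_model_component, component_source_component):
        if candidate is not None:
            return candidate
    raise ValueError(
        "No source provides this component; install a standalone model "
        "or set the Component Source."
    )

# GGUF main model (no bundled VAE) with a standalone VAE installed:
print(resolve_component("standalone_vae", None, "source_vae"))  # standalone_vae
# No standalone VAE: falls through to the Component Source:
print(resolve_component(None, None, "source_vae"))  # source_vae
```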
Bumped to v1.2.0. New starters: `Qwen Image VAE` (single-file checkpoint, ~250 MB), `Qwen2.5-VL Encoder (fp8 scaled)` (ComfyUI single-file, ~7 GB), and `Qwen2.5-VL Encoder (Diffusers)` (multi-folder HF download, text_encoder + tokenizer + processor, ~16 GB). All 8 GGUF main starters (Q2_K / Q4_K_M / Q6_K / Q8_0 for both Edit and txt2img) declare the VAE + fp8 encoder as dependencies, so installing any of them auto-installs a complete generation-ready setup. The Qwen Image starter bundle gets the VAE and fp8 encoder prepended too.

Frontend
- Params slice fields `qwenImageVaeModel` and `qwenImageQwenVLEncoderModel`, plus a migration entry.
- `useQwenImageVAEModels` / `useQwenVLEncoderModels` hooks, `isQwenImageVAEModelConfig` / `isQwenVLEncoderModelConfig` type guards, and Model Manager category + format badge entries.
- `schema.ts` patched manually for the new `ModelType` / `ModelFormat` values, the `QwenVLEncoder_Diffusers_Config` and `QwenVLEncoder_Checkpoint_Config` schemas, the new loader fields, and the `AnyModelConfig` union.

Related Issues / Discussions
Follow-up to #9000 (Qwen Image full pipeline support). Closes the standalone-component gap that was called out for users with limited disk space.
QA Instructions
Quickest verification (recommended):
Install one of the GGUF starter models (e.g. `Qwen Image Edit 2511 (Q4_K_M)`) from the starter list. The VAE and fp8 encoder should be auto-installed as dependencies, and the model should generate without any further configuration.

Setup options for manual testing:
- `Qwen Image VAE` from the starter list (or download `vae/diffusion_pytorch_model.safetensors` from a Qwen Image HF repo manually, ~250 MB). Verify it's identified as a Qwen Image VAE checkpoint.
- `Qwen2.5-VL Encoder (Diffusers)` from the starter list (or download `text_encoder/` + `tokenizer/` (+ optionally `processor/`) from `Qwen/Qwen-Image-Edit-2511` manually). Verify it's identified as `qwen_vl_encoder` / `qwen_vl_encoder`.
- `Qwen2.5-VL Encoder (fp8 scaled)` from the starter list (or `qwen_2.5_vl_7b_fp8_scaled.safetensors` directly, ~7 GB). Verify it's identified as `qwen_vl_encoder` / `checkpoint`. First generation will fetch the tokenizer + processor configs from `Qwen/Qwen2.5-VL-7B-Instruct` (~10 MB) and cache them.

Cases to verify on the Qwen Image generation tab:
`int8`/`nf4`) still works against a standalone encoder folder. Single-file encoder + `int8`/`nf4` falls back to the cached non-BnB path (still works, no error).

Starter model checks:
`Qwen Image VAE`, `Qwen2.5-VL Encoder (fp8 scaled)`, `Qwen2.5-VL Encoder (Diffusers)`.

Automated checks:
- `pytest tests/app/invocations/test_qwen_image_model_loader.py tests/backend/model_manager/configs/` — 16 passed.
- `pytest -k "qwen_image"` (excluding unrelated PIL `get_flattened_data` test) — 53 passed.
- `pnpm lint:tsc` / `pnpm lint:eslint` / `pnpm lint:prettier` / `pnpm lint:knip` all green.

Merge Plan
Standard merge.
Checklist
What's New copy (if doing a release after this PR)