
@dxqb dxqb commented Dec 21, 2025

What does this PR do?

Most recent models (Qwen, Chroma, Z-Image, ...) use variable-length captions and therefore require attention masking when a batch with size > 1 mixes captions of different lengths.
torch SDPA only uses its internal flash-attention kernel when no attention mask is passed. Otherwise it falls back to a different kernel that is significantly slower, especially at long sequence lengths.
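For illustration only (not part of the PR), a minimal sketch that makes the fallback visible: timing torch SDPA on a padded batch with and without a boolean key-padding mask. Shapes, dtypes, and sequence lengths are assumptions chosen for the example.

```python
# Sketch: compare SDPA with and without an attention mask on a padded batch.
# Assumed shapes/lengths; run on a CUDA GPU with bf16 support.
import torch
import torch.nn.functional as F

def time_sdpa(q, k, v, mask=None, iters=20):
    # crude CUDA timing helper (one warmup call, then averaged events)
    F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    torch.cuda.synchronize()
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

B, H, S, D = 2, 24, 8192, 128
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# boolean key-padding mask broadcast to (B, 1, 1, S): True = keep token
lengths = torch.tensor([8192, 2048], device="cuda")
keep = torch.arange(S, device="cuda")[None, :] < lengths[:, None]
mask = keep[:, None, None, :]

print("no mask   :", time_sdpa(q, k, v), "ms")         # flash kernel eligible
print("with mask :", time_sdpa(q, k, v, mask), "ms")   # falls back to a slower kernel
```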

This PR implements an attention backend that splits the attention batch into individual samples. Even though attention then has to be called once per sample, this is still faster than masked attention (tested up to batch size 8).
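A rough sketch of the idea (under assumed shapes, not the PR's actual code): run unmasked SDPA once per sample, trimmed to that sample's true sequence length, instead of one masked call over the whole batch.

```python
# Sketch: per-sample unmasked attention instead of one masked batched call.
import torch
import torch.nn.functional as F

def split_attention(q, k, v, seq_lens):
    # q, k, v: (batch, heads, seq, dim); seq_lens: per-sample valid lengths
    out = torch.zeros_like(q)
    for i, n in enumerate(seq_lens):
        out[i : i + 1, :, :n] = F.scaled_dot_product_attention(
            q[i : i + 1, :, :n], k[i : i + 1, :, :n], v[i : i + 1, :, :n]
        )  # no attn_mask -> SDPA is free to pick its flash kernel
    return out
```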

This PR also lays the groundwork for efficiently using "flash varlen" and other varlen attention backends, which are already implemented but not in an efficient way (see code comment).
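For context, a hedged sketch of the varlen calling convention that flash-attn exposes (`flash_attn.flash_attn_varlen_func`; the exact signature may vary by version): tokens from all samples are packed into one `(total_tokens, heads, dim)` tensor and boundaries are passed as cumulative sequence lengths, so no per-element mask is ever materialized. The helper below is purely illustrative.

```python
# Sketch: packing variable-length samples for a varlen attention call.
# Requires CUDA tensors in fp16/bf16 and the flash-attn package.
import torch
from flash_attn import flash_attn_varlen_func

def varlen_attention(q_list, k_list, v_list):
    # each list entry: (seq_i, heads, dim) for one sample; seq_i may differ
    lens = torch.tensor([q.shape[0] for q in q_list], dtype=torch.int32, device="cuda")
    cu_seqlens = torch.nn.functional.pad(lens.cumsum(0, dtype=torch.int32), (1, 0))
    q, k, v = (torch.cat(t, dim=0) for t in (q_list, k_list, v_list))
    return flash_attn_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=int(lens.max()), max_seqlen_k=int(lens.max()),
    )
```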

This PR is based on @kashif's and @cdutr's work in this PR: #12702
To see only my changes, relative to their PR, follow this link: https://github.com/dxqb/diffusers/pull/4/files

Benchmarks

Training benchmarks using OneTrainer; training at higher resolutions benefits in particular:
[training benchmark chart]

Inference benchmark using the diffusers Qwen example script (but with regional compilation):
Inference benefits when comparing apples to apples, i.e. batch size 2 for CFG. However, the current pipeline already avoids attention masks by calling the transformer twice at batch size 1, so the practical improvement for inference is small:
[inference benchmark chart]

Who can review?

@yiyixuxu and @sayakpaul
CC @kashif and @cdutr for feedback

Contains #12892.

kashif and others added 30 commits November 23, 2025 18:02
- Remove seq_lens parameter from dispatch_attention_fn
- Update varlen backends to extract seqlens from masks (see the seqlens sketch after this commit list)
- Update QwenImage to pass 2D joint_attention_mask
- Fix native backend to handle 2D boolean masks
- Fix sage_varlen seqlens_q to match seqlens_k for self-attention

Note: sage_varlen is still producing black images; needs further investigation
Enhances documentation with comprehensive performance insights for the QwenImage pipeline
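As context for the commit notes above, a hedged illustration of the seqlens-from-mask step they describe: given a 2D boolean padding mask of shape `(batch, seq)` with `True` marking valid tokens, per-sample lengths and varlen-style `cu_seqlens` follow directly. The values are made up for the example; this is not the PR's code.

```python
# Sketch: deriving per-sample lengths and cu_seqlens from a 2D boolean mask.
import torch

mask = torch.tensor([[True, True, True, False],
                     [True, True, False, False]])
seqlens = mask.sum(dim=1, dtype=torch.int32)                                  # tensor([3, 2])
cu_seqlens = torch.nn.functional.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
print(seqlens, cu_seqlens)  # [3, 2] and [0, 3, 5]
```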
dxqb changed the title from "Split attention backend" to "Split attention backends" on Dec 26, 2025
dxqb commented Dec 26, 2025

  • added backends to split flash attention (usage sketch below). This is especially useful for Windows users, because native torch SDPA does not use flash attention internally on Windows
  • tried to add Z-Image support but ran into this issue: Z-Image text sequence length issue #12893
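A hedged usage sketch of selecting an attention backend on a diffusers model: `set_attention_backend` exists in recent diffusers versions with the attention dispatcher, but the backend string added by this PR (shown here as "flash_split") is an assumption; check the PR diff for the actual name.

```python
# Sketch: picking an attention backend on the transformer of a diffusers pipeline.
import torch
from diffusers import QwenImagePipeline

pipe = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")
pipe.transformer.set_attention_backend("flash_split")  # hypothetical backend id
```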
