Conversation

@zhangtao0408

What does this PR do?

Moving the pos_embed computation from the CPU back to the NPU yields a 1.07x speedup in FLUX.1's end-to-end latency.

Since CANN 8.3.RC1, the previously poor performance of the torch.repeat_interleave operator has been optimized. Results are shown below:

| Model | pos_embed device | Resolution | Steps | E2E latency (s) |
| --- | --- | --- | --- | --- |
| FLUX.1-DEV | npu | 1024 x 1024 | 50 | 25.54 |
| FLUX.1-DEV | cpu | 1024 x 1024 | 50 | 27.41 |
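
For context on where torch.repeat_interleave enters the picture, here is a minimal sketch (not this PR's diff; `rope_freqs` is a simplified stand-in for diffusers' `get_1d_rotary_pos_embed`) of how the rotary positional embedding is computed on whatever device the position ids live on:

```python
import torch

def rope_freqs(ids: torch.Tensor, dim: int = 64, theta: float = 10000.0):
    # Simplified 1D rotary-frequency computation in the spirit of diffusers'
    # get_1d_rotary_pos_embed; repeat_interleave is the operator whose NPU
    # performance improved with CANN 8.3.RC1.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32, device=ids.device) / dim))
    freqs = torch.outer(ids.float(), freqs)        # (len(ids), dim // 2)
    cos = freqs.cos().repeat_interleave(2, dim=1)  # (len(ids), dim)
    sin = freqs.sin().repeat_interleave(2, dim=1)
    return cos, sin

ids = torch.arange(4096)    # on "npu" in the pipeline; on CPU here for illustration
cos, sin = rope_freqs(ids)  # runs on whatever device ids is on
```

Before this change, the ids were routed through the CPU to avoid the slow NPU kernel; with CANN 8.3.RC1 the on-device path is faster end to end.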

Tested Hardware

Ascend 910B3

Repro Code

FLUX.1-dev

```python
import time

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu  # noqa: F401  # patches torch so CUDA-targeted calls run on the NPU

from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("npu")

prompt = "A cat holding a sign that says hello world"

# Warmup
_ = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=2,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
)

# Inference
start_time = time.time()

image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-dev.png")

end_time = time.time()
print(f"Time: {end_time - start_time:.2f}s")
```
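
To isolate the operator-level improvement from the end-to-end numbers, a micro-benchmark along these lines can be used (a sketch, assuming the same torch_npu environment as above and that `torch.npu.synchronize()` is available):

```python
import time

import torch
import torch_npu  # noqa: F401  # registers the "npu" device and the torch.npu namespace

x = torch.randn(4096, 64)

for device in ("cpu", "npu"):
    t = x.to(device)
    for _ in range(5):  # warmup
        _ = t.repeat_interleave(2, dim=1)
    if device == "npu":
        torch.npu.synchronize()
    start = time.time()
    for _ in range(100):
        _ = t.repeat_interleave(2, dim=1)
    if device == "npu":
        torch.npu.synchronize()
    print(f"{device}: {time.time() - start:.4f}s / 100 iters")
```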

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zhangtao0408 (Author)

@sayakpaul Please review this PR, thanks.
