Conversation

@zhangtao0408

What does this PR do?

Moving the pos_embed computation from the CPU back to the NPU yields a 1.07x speedup in FLUX.1's end-to-end latency.

Since CANN 8.3.RC1, the previously poor performance of the torch.repeat_interleave operator has been optimized. Results are shown below:

| Model | pos_embed device | Resolution | Steps | E2E latency (s) |
| --- | --- | --- | --- | --- |
| FLUX.1-DEV | npu | 1024 x 1024 | 50 | 25.54 |
| FLUX.1-DEV | cpu | 1024 x 1024 | 50 | 27.41 |
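
For context on where torch.repeat_interleave enters the picture, here is a minimal sketch (not this PR's diff; `rope_freqs` is a simplified stand-in for diffusers' `get_1d_rotary_pos_embed`) of how the rotary positional embedding is computed on whatever device the position ids live on:

```python
import torch

def rope_freqs(ids: torch.Tensor, dim: int = 64, theta: float = 10000.0):
    # Simplified 1D rotary-frequency computation in the spirit of diffusers'
    # get_1d_rotary_pos_embed; repeat_interleave is the operator whose NPU
    # performance improved with CANN 8.3.RC1.
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32, device=ids.device) / dim))
    freqs = torch.outer(ids.float(), freqs)        # (len(ids), dim // 2)
    cos = freqs.cos().repeat_interleave(2, dim=1)  # (len(ids), dim)
    sin = freqs.sin().repeat_interleave(2, dim=1)
    return cos, sin

ids = torch.arange(4096)    # on "npu" in the pipeline; on CPU here for illustration
cos, sin = rope_freqs(ids)  # runs on whatever device ids is on
```

Before this change, the ids were routed through the CPU to avoid the slow NPU kernel; with CANN 8.3.RC1 the on-device path is faster end to end.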

Tested Hardware

Ascend 910B3

Repro Code

FLUX.1-dev

```python
import time

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu  # noqa: F401  # patches torch so CUDA-targeted calls run on the NPU

from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("npu")

prompt = "A cat holding a sign that says hello world"

# Warmup
_ = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=2,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
)

# Inference
start_time = time.time()

image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
    generator=torch.Generator("cpu").manual_seed(0),
).images[0]
image.save("flux-dev.png")

end_time = time.time()
print(f"Time: {end_time - start_time:.2f}s")
```
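
To isolate the operator-level improvement from the end-to-end numbers, a micro-benchmark along these lines can be used (a sketch, assuming the same torch_npu environment as above and that `torch.npu.synchronize()` is available):

```python
import time

import torch
import torch_npu  # noqa: F401  # registers the "npu" device and the torch.npu namespace

x = torch.randn(4096, 64)

for device in ("cpu", "npu"):
    t = x.to(device)
    for _ in range(5):  # warmup
        _ = t.repeat_interleave(2, dim=1)
    if device == "npu":
        torch.npu.synchronize()
    start = time.time()
    for _ in range(100):
        _ = t.repeat_interleave(2, dim=1)
    if device == "npu":
        torch.npu.synchronize()
    print(f"{device}: {time.time() - start:.4f}s / 100 iters")
```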

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zhangtao0408 (Author)

@sayakpaul Please review this PR, thanks.
