[Bug] Flux.2-Klein VAE Decode Artefacts

### Git commit

git: 636d3cb6ff25d1ffa7267e5f6dac9f2925945606

### Operating System & Version

Ubuntu 24.04.3 LTS

### GGML backends

CUDA

### Command-line arguments used

["--diffusion-model", "/home/charles/coding/models/flux2/flux-2-klein-4b.safetensors", "--vae", "/home/charles/coding/models/flux2/flux2-vae.safetensors", "--llm", "/home/charles/coding/models/flux2/Qwen3-4B-Q6_K.gguf", "--cfg-scale", "1.0", "--steps", "4", "-v", "--diffusion-fa", "--width", "1024", "--height", "1024", "-p", "\"A cat holding a beachball on the river bank.\""]

### Steps to reproduce

Download latents_3.zip (attached below) and extract it; it contains the latents exported by diffusers.

Add the following code to load the repro data:
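```
    LOG_INFO("decoding %zu latents", final_latents.size());

    // Repro hack: replace the first latent with the one exported by diffusers.
    FILE* f = fopen("/home/charles/coding/wishingwell/scripts/latents_3.raw", "rb");
    GGML_ASSERT(f != nullptr);
    fseek(f, 0, SEEK_END);
    long fsize = ftell(f);
    fseek(f, 0, SEEK_SET);

    // The dump must match the latent tensor byte for byte.
    GGML_ASSERT(fsize >= 0 && (size_t)fsize == ggml_nbytes(final_latents[0]));
    // ne[] is int64_t, so print it with PRId64 rather than %zu.
    LOG_INFO("latent: %" PRId64 " %" PRId64 " %" PRId64 " %" PRId64,
             final_latents[0]->ne[0], final_latents[0]->ne[1], final_latents[0]->ne[2], final_latents[0]->ne[3]);

    // Read straight into a fresh tensor of the same shape; no temporary buffer to leak.
    auto hack_latent = ggml_dup_tensor(work_ctx, final_latents[0]);
    size_t n_read    = fread(hack_latent->data, 1, (size_t)fsize, f);
    fclose(f);
    GGML_ASSERT(n_read == (size_t)fsize);
    final_latents[0] = hack_latent;
```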

Insert it after the following line in stable-diffusion.cpp:

```
    LOG_INFO("generating %" PRId64 " latent images completed, taking %.2fs", final_latents.size(), (t3 - t1) * 1.0f / 1000);
```
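
For anyone who wants to reuse the loading step outside this one-off hack, here is a minimal sketch of a standalone loader. The name `load_raw_f32` is hypothetical and not part of stable-diffusion.cpp; the sketch assumes the file is a headerless dump of float32 values in native byte order, as the repro expects.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of stable-diffusion.cpp): read a headerless
// float32 dump and verify that it is exactly expected_bytes long.
std::vector<float> load_raw_f32(const std::string& path, size_t expected_bytes) {
    if (expected_bytes % sizeof(float) != 0) {
        throw std::invalid_argument("expected_bytes is not a multiple of sizeof(float)");
    }
    FILE* f = std::fopen(path.c_str(), "rb");
    if (!f) {
        throw std::runtime_error("cannot open " + path);
    }
    // Determine the file size so a shape mismatch is caught before reading.
    std::fseek(f, 0, SEEK_END);
    long fsize = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    if (fsize < 0 || static_cast<size_t>(fsize) != expected_bytes) {
        std::fclose(f);
        throw std::runtime_error("unexpected file size for " + path);
    }
    std::vector<float> data(expected_bytes / sizeof(float));
    size_t n_read = std::fread(data.data(), sizeof(float), data.size(), f);
    std::fclose(f);
    if (n_read != data.size()) {
        throw std::runtime_error("short read from " + path);
    }
    return data;
}
```

In the repro this could be called as `load_raw_f32(path, ggml_nbytes(final_latents[0]))`, with the returned buffer copied into the duplicated tensor's `data`.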

The Python script used to generate this file is also attached:
[simple_inference.py](https://github.com/user-attachments/files/25261557/simple_inference.py)
[latents_3.zip](https://github.com/user-attachments/files/25261592/latents_3.zip)

### What you expected to happen

While working with Flux.2-Klein I noticed artifacts that I had not seen with the Python diffusers version. I suspected the VAE, so I made a quick repro: diffusers exports the raw latents (the VAE input), and stable-diffusion.cpp loads them for decoding. I expected the two decodes to match closely. Instead, the stable-diffusion.cpp result is visibly worse; note in particular the gradients on the ball.

This seems to go beyond the expected difference between the two implementations and makes the output quality unusable.
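
To put a number on "beyond the expected difference", one option is to compare the two decoded images pixel by pixel. The following is a minimal sketch of a PSNR computation over two 8-bit buffers of equal size; the function name `psnr_u8` is hypothetical, and decoding the PNGs into raw byte buffers is assumed to happen elsewhere. No measured values are claimed for the images in this report.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper: peak signal-to-noise ratio (in dB) between two 8-bit
// images stored as flat byte buffers of identical length.
double psnr_u8(const std::vector<uint8_t>& a, const std::vector<uint8_t>& b) {
    if (a.size() != b.size() || a.empty()) {
        throw std::invalid_argument("buffers must be non-empty and of equal size");
    }
    // Sum of squared differences over all channel values.
    double sse = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sse += d * d;
    }
    if (sse == 0.0) {
        // Identical images: PSNR is unbounded.
        return std::numeric_limits<double>::infinity();
    }
    double mse = sse / static_cast<double>(a.size());
    return 10.0 * std::log10(255.0 * 255.0 / mse);
}
```

Higher values mean the images are closer; a low PSNR concentrated in smooth regions would be consistent with the gradient artifacts described here.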

<img width="1024" height="1024" alt="Image" src="https://github.com/user-attachments/assets/ea93a8be-85a6-4b30-a425-bb0b877a370e" />

<img width="1024" height="1024" alt="Image" src="https://github.com/user-attachments/assets/95392eb0-2663-45aa-9268-86650edcd6de" />

### What actually happened

Noticeable quality degradation; see the images. The same kind of artifacts appears when generating images directly from prompts. They are most visible in areas with smooth gradients.

This one was generated with stock stable-diffusion.cpp:

<img width="1024" height="1024" alt="Image" src="https://github.com/user-attachments/assets/8726b656-664e-4726-ab87-5f7ff3ace355" />

### Logs / error messages / stack trace

None

### Additional context / environment details

NVIDIA GB10, driver version 580.95.05, CUDA version 13.0
