Add HWY-optimized push-pull hole filling by ssh4net · Pull Request #5186 · AcademySoftwareFoundation/OpenImageIO

ssh4net · 2026-05-05T09:22:58Z

Description

Highway implementation for ImageBufAlgo::fillholes_pushpull.

The new path is used for the common in-memory cases we can handle safely: 2-channel or 4-channel images, contiguous local pixels, and alpha in the last channel. It supports float, half, uint16, and uint8 inputs.
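The eligibility conditions above could be expressed as a simple predicate. This is only a minimal sketch: `BufferInfo`, `PixelType`, and `can_use_hwy_pushpull` are hypothetical names chosen for illustration, not identifiers taken from the PR or from the OIIO API.

```cpp
#include <cassert>

// Hypothetical pixel type tag; not an OIIO type.
enum class PixelType { UInt8, UInt16, Half, Float, Other };

// Hypothetical summary of an image buffer; not an OIIO type.
struct BufferInfo {
    PixelType type;
    int nchannels;
    int alpha_channel;      // index of the alpha channel, or -1 if none
    bool local_contiguous;  // pixels stored in one contiguous in-memory block
};

// Returns true when the buffer matches the cases listed in the description:
// 2 or 4 channels, alpha last, contiguous local pixels, supported data type.
inline bool can_use_hwy_pushpull(const BufferInfo& b)
{
    const bool channels_ok = (b.nchannels == 2 || b.nchannels == 4);
    const bool alpha_last  = (b.alpha_channel == b.nchannels - 1);
    const bool type_ok     = (b.type != PixelType::Other);
    return channels_ok && alpha_last && type_ok && b.local_contiguous;
}

// Usage example: an RGBA float buffer qualifies, a 3-channel buffer does not.
inline void eligibility_example()
{
    const BufferInfo rgba { PixelType::Float, 4, 3, true };
    const BufferInfo rgb { PixelType::Float, 3, 2, true };
    assert(can_use_hwy_pushpull(rgba));
    assert(!can_use_hwy_pushpull(rgb));
}
```

Anything that fails the check would presumably fall back to the existing implementation.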

The algorithm is still the same push-pull algorithm as the existing OIIO code. It builds the full pyramid, uses the same triangle filtering behavior, divides pulled levels by alpha, and composites the pyramid back down. The difference is that the HWY version does the expensive parts in tighter fused kernels: pull plus alpha divide, upsample plus over, and final conversion/write. This avoids several temporary image operations while keeping the result very close to the existing implementation.
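To illustrate what "fused" means here, the scalar sketch below combines downsampling with the alpha divide in one pass, and upsampling with the over composite in another. It is not the PR's code: the `Image` struct and both function names are hypothetical, a 2x2 box average stands in for the triangle filter, nearest-neighbor stands in for the filtered upsample, and no SIMD is used.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical RGBA float image, alpha in the last channel, premultiplied.
struct Image {
    int width = 0, height = 0;
    std::vector<float> px;  // width * height * 4 floats
    Image(int w, int h) : width(w), height(h), px(std::size_t(w) * h * 4, 0.0f) {}
    float* at(int x, int y) { return &px[(std::size_t(y) * width + x) * 4]; }
    const float* at(int x, int y) const { return &px[(std::size_t(y) * width + x) * 4]; }
};

// Downsample by 2 and normalize by alpha in a single pass, so no
// intermediate "downsampled but not yet divided" image is stored.
// Assumption: a 2x2 box average replaces the triangle filter.
inline Image pull_and_divide(const Image& src)
{
    Image dst((src.width + 1) / 2, (src.height + 1) / 2);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            float sum[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
            int n = 0;
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    const int sx = 2 * x + dx, sy = 2 * y + dy;
                    if (sx >= src.width || sy >= src.height)
                        continue;  // odd sizes: skip samples past the edge
                    const float* s = src.at(sx, sy);
                    for (int c = 0; c < 4; ++c)
                        sum[c] += s[c];
                    ++n;
                }
            }
            assert(n > 0);
            float* d = dst.at(x, y);
            const float a = sum[3] / n;  // averaged alpha (coverage)
            // Dividing by alpha makes partially covered pixels opaque;
            // fully empty pixels stay zero until a coarser level fills them.
            for (int c = 0; c < 4; ++c)
                d[c] = (a > 0.0f) ? (sum[c] / n) / a : 0.0f;
        }
    }
    return dst;
}

// Upsample the coarse level (nearest neighbor here) and composite the fine
// level over it in place: fine = fine + (1 - fine.alpha) * coarse.
inline void upsample_and_over(Image& fine, const Image& coarse)
{
    for (int y = 0; y < fine.height; ++y) {
        for (int x = 0; x < fine.width; ++x) {
            float* f = fine.at(x, y);
            const float* c = coarse.at(x / 2, y / 2);
            const float k = 1.0f - f[3];
            for (int ch = 0; ch < 4; ++ch)
                f[ch] += k * c[ch];
        }
    }
}
```

With a 2x2 image that has a single opaque pixel, `pull_and_divide` yields one opaque pixel of the same color, and `upsample_and_over` then fills the three empty pixels with it while leaving the original pixel untouched.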

When the destination is uninitialized and the source is uint8, the output is promoted to uint16 to avoid excessive rounding during push-pull. Preallocated destinations keep the format requested by the caller.
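The output-format rule can be sketched as a small selection function. The names `PixelType` and `choose_output_type` are hypothetical, and an empty `std::optional` is used here as a stand-in for "destination not yet allocated".

```cpp
#include <cassert>
#include <optional>

// Hypothetical pixel type tag; not an OIIO type.
enum class PixelType { UInt8, UInt16, Half, Float };

// Pick the destination type. `requested` is empty when the destination is
// uninitialized, in which case the type is derived from the source.
inline PixelType choose_output_type(PixelType src,
                                    std::optional<PixelType> requested)
{
    if (requested)
        return *requested;  // preallocated: honor the caller's format
    // Uninitialized: promote 8-bit to 16-bit to limit rounding error.
    return (src == PixelType::UInt8) ? PixelType::UInt16 : src;
}

// Usage example covering both branches.
inline void choose_output_type_example()
{
    assert(choose_output_type(PixelType::UInt8, std::nullopt)
           == PixelType::UInt16);
    assert(choose_output_type(PixelType::UInt8, PixelType::UInt8)
           == PixelType::UInt8);
}
```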

Tests

Built OIIO with HWY enabled and compared the HWY path against the existing implementation on synthetic and real image cases.
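The description does not say how the two paths were compared. One plausible metric, consistent with the "max differences" reported below, is the largest absolute per-value difference after converting both results to normalized float. The helper below is a sketch under that assumption, and `max_abs_diff` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute per-value difference between two buffers that have
// already been converted to normalized float.
inline float max_abs_diff(const std::vector<float>& a,
                          const std::vector<float>& b)
{
    assert(a.size() == b.size());
    float m = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        m = std::max(m, std::fabs(a[i] - b[i]));
    return m;
}
```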

Tested all input/output pairs across:

uint8
uint16
half
float

Images used:

synthetic 4092x4092 RGBA case
real odd-size 3001x1997 crop
full-resolution RGB and grayscale image/mask set

Observed max differences:

float outputs: about 4e-7
uint16 outputs: about one quantization step, 1.5e-5
half outputs: about one half step, 4.9e-4
uint8 outputs: about one 8-bit step, 0.00392
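For reference, the step sizes quoted above follow from the formats themselves: a normalized unsigned integer with n bits has a step of 1/(2^n - 1), and half precision has 10 explicit mantissa bits, so adjacent values in [0.5, 1) are 2^-11 apart. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// One quantization step of a normalized unsigned integer with `bits` bits.
inline double unorm_step(int bits)
{
    assert(bits > 0 && bits < 32);
    return 1.0 / double((1u << bits) - 1u);
}

// Spacing between adjacent half-precision values in [0.5, 1).
inline double half_step_below_one()
{
    return std::ldexp(1.0, -11);  // 2^-11
}
```

Here `unorm_step(8)` is 1/255 (about 0.00392), `unorm_step(16)` is 1/65535 (about 1.5e-5), and `half_step_below_one()` is 0.00048828125 (about 4.9e-4).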

Typical speedups:

synthetic: about 1.7x-2.7x
real odd-size images: about 2.0x-2.9x
full-resolution uint16: RGB 1.126s -> 0.479s, grayscale 0.822s -> 0.274s
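The full-resolution timings can be converted to the same ratio used for the other cases: 1.126 s / 0.479 s is about 2.35x for RGB, and 0.822 s / 0.274 s is 3.0x for grayscale. A trivial helper (the name `speedup` is hypothetical) makes the convention explicit:

```cpp
#include <cassert>

// Speedup expressed as old time divided by new time.
inline double speedup(double before_seconds, double after_seconds)
{
    assert(after_seconds > 0.0);
    return before_seconds / after_seconds;
}
```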

Benchmarks

Synthetic 4092x4092

uint8  -> uint8   2.637x   diff 0.003921598
uint8  -> uint16  2.312x   diff 0.000015318
uint8  -> half    2.366x   diff 0.000488281
uint8  -> float   1.940x   diff 0.000000477

uint16 -> uint8   2.647x   diff 0.003921568
uint16 -> uint16  2.516x   diff 0.000015318
uint16 -> half    2.660x   diff 0.000488281
uint16 -> float   1.905x   diff 0.000000417

half   -> uint8   2.711x   diff 0.003921568
half   -> uint16  2.730x   diff 0.000015318
half   -> half    2.521x   diff 0.000488281
half   -> float   1.743x   diff 0.000000417

float  -> uint8   2.570x   diff 0.003921628
float  -> uint16  2.607x   diff 0.000015318
float  -> half    2.514x   diff 0.000488281
float  -> float   1.687x   diff 0.000000417

Real Odd Crop 3001x1997

uint8  -> uint8   2.631x   diff 0.003921568
uint8  -> uint16  2.910x   diff 0.000015318
uint8  -> half    2.885x   diff 0.000488281
uint8  -> float   1.997x   diff 0.000000477

uint16 -> uint8   2.746x   diff 0.003921598
uint16 -> uint16  2.727x   diff 0.000015318
uint16 -> half    2.669x   diff 0.000488281
uint16 -> float   2.342x   diff 0.000000417

half   -> uint8   2.759x   diff 0.003921568
half   -> uint16  2.867x   diff 0.000015318
half   -> half    2.706x   diff 0.000488281
half   -> float   2.247x   diff 0.000000417

float  -> uint8   2.679x   diff 0.003921598
float  -> uint16  2.887x   diff 0.000015318
float  -> half    2.934x   diff 0.000488281
float  -> float   2.105x   diff 0.000000417

Float-output accuracy is still in the ~4e-7 range. Integer/half diffs are one output quantization step.

Checklist:

I have read the guidelines on contributions and code review procedures.
I have read the Policy on AI Coding Assistants, and if I used AI coding assistants, I have an Assisted-by: Codex GPT5.5 xHigh line in the pull request description above.
I have updated the documentation if my PR adds features or changes behavior.
I am sure that this PR's changes are tested in the testsuite.
I have run and passed the testsuite in CI before submitting the PR, by pushing the changes to my fork and seeing that the automated CI passed there. (Exceptions: If most tests pass and you can't figure out why the remaining ones fail, it's ok to submit the PR and ask for help. Or if any failures seem entirely unrelated to your change; sometimes things break on the GitHub runners.)
My code follows the prevailing code style of this project and I fixed any problems reported by the clang-format CI test.
If I added or modified a public C++ API call, I have also amended the corresponding Python bindings. If altering ImageBufAlgo functions, I also exposed the new functionality as oiiotool options.

HWY-accelerated push/pull implementation for hole filling (guarded by OIIO_USE_HWY). Adds SIMD-aware helpers, data structures (PushPullLevel, tiled views, weight structs), tiled pull/push routines, and finalization paths for float/half/uint16/uint8 and 2/4 channel images.

Signed-off-by: Vlad (Kuzmin) Erium <[email protected]>

ssh4net wants to merge 1 commit into AcademySoftwareFoundation:main from ssh4net:hwy_pushpull
