Add HWY-optimized push-pull hole filling#5186

Open
ssh4net wants to merge 1 commit into AcademySoftwareFoundation:main from ssh4net:hwy_pushpull

Contributor

ssh4net commented May 5, 2026

Description

Highway implementation for ImageBufAlgo::fillholes_pushpull.

The new path is used for the common in-memory cases we can handle safely: 2-channel or 4-channel images, contiguous local pixels, and alpha in the last channel. It supports float, half, uint16, and uint8 inputs.
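The eligibility conditions above can be sketched as a single predicate. This is only an illustration: `PixelType`, `BufferInfo`, and `can_use_fast_path` are hypothetical names, not identifiers from this PR or from the OIIO API, and the real check may look at more properties than these.

```cpp
// Hypothetical pixel-type tag (OIIO itself describes types with TypeDesc).
enum class PixelType { UInt8, UInt16, Half, Float, Other };

// Hypothetical summary of the buffer properties the fast path cares about.
struct BufferInfo {
    int nchannels;          // total channel count
    int alpha_channel;      // index of the alpha channel, or -1 if none
    PixelType type;         // storage type of the pixels
    bool local_contiguous;  // pixels live in one contiguous in-memory block
};

// Returns true when a buffer meets the conditions listed in the description:
// 2 or 4 channels, alpha in the last channel, a supported pixel type,
// and contiguous local storage.
inline bool can_use_fast_path(const BufferInfo& b) {
    const bool channels_ok = (b.nchannels == 2 || b.nchannels == 4);
    const bool alpha_last  = (b.alpha_channel == b.nchannels - 1);
    const bool type_ok     = (b.type != PixelType::Other);
    return channels_ok && alpha_last && type_ok && b.local_contiguous;
}
```

Anything that fails the predicate would presumably fall back to the existing implementation.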

The algorithm is still the same push-pull algorithm as the existing OIIO code. It builds the full pyramid, uses the same triangle filtering behavior, divides pulled levels by alpha, and composites the pyramid back down. The difference is that the HWY version does the expensive parts in tighter fused kernels: pull plus alpha divide, upsample plus over, and final conversion/write. This avoids several temporary image operations while keeping the result very close to the existing implementation.
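To make the two fused steps concrete, here is a minimal scalar sketch of push-pull on a 2-channel (value, alpha) premultiplied image. It is not the PR's code: it uses a 2x2 box average and nearest-neighbor upsampling in place of the triangle filter, has no SIMD, and the names `Level`, `pull_and_divide`, `upsample_and_over`, and `fill_holes` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// One pyramid level: interleaved (value, alpha) floats, premultiplied.
struct Level {
    int width = 0, height = 0;
    std::vector<float> px;
    float& at(int x, int y, int c) { return px[(std::size_t(y) * width + x) * 2 + c]; }
    float at(int x, int y, int c) const { return px[(std::size_t(y) * width + x) * 2 + c]; }
};

// Fused "pull + alpha divide": average each 2x2 block, then normalize by
// alpha so any texel with some coverage becomes fully opaque.
inline Level pull_and_divide(const Level& src) {
    Level dst;
    dst.width  = (src.width + 1) / 2;
    dst.height = (src.height + 1) / 2;
    dst.px.assign(std::size_t(dst.width) * dst.height * 2, 0.0f);
    for (int y = 0; y < dst.height; ++y) {
        for (int x = 0; x < dst.width; ++x) {
            float v = 0.0f, a = 0.0f;
            int n = 0;
            for (int dy = 0; dy < 2; ++dy)
                for (int dx = 0; dx < 2; ++dx) {
                    const int sx = 2 * x + dx, sy = 2 * y + dy;
                    if (sx < src.width && sy < src.height) {
                        v += src.at(sx, sy, 0);
                        a += src.at(sx, sy, 1);
                        ++n;
                    }
                }
            v /= n;
            a /= n;
            if (a > 0.0f) { v /= a; a = 1.0f; }  // divide by alpha
            dst.at(x, y, 0) = v;
            dst.at(x, y, 1) = a;
        }
    }
    return dst;
}

// Fused "upsample + over": composite the fine level over the coarse one,
// out = fine + (1 - fine.alpha) * coarse, sampling coarse by nearest texel.
inline void upsample_and_over(Level& fine, const Level& coarse) {
    for (int y = 0; y < fine.height; ++y)
        for (int x = 0; x < fine.width; ++x) {
            const float fa = fine.at(x, y, 1);
            for (int c = 0; c < 2; ++c)
                fine.at(x, y, c) += (1.0f - fa) * coarse.at(x / 2, y / 2, c);
        }
}

// Build the full pyramid down to 1x1, then composite back to full size.
inline void fill_holes(Level& img) {
    std::vector<Level> pyr{img};
    while (pyr.back().width > 1 || pyr.back().height > 1)
        pyr.push_back(pull_and_divide(pyr.back()));
    for (std::size_t i = pyr.size() - 1; i-- > 0;)
        upsample_and_over(pyr[i], pyr[i + 1]);
    img = pyr.front();
}
```

Fusing the divide into the pull, and the composite into the upsample, means each level is touched once per pass instead of once per operation, which is the kind of saving the description attributes to the fused kernels.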

When the destination is uninitialized and its natural format would be uint8, the result is promoted to uint16 to avoid excessive rounding during push-pull. Preallocated destinations keep the format requested by the caller.
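The output-format rule can be written as a small selection function. This is a sketch under assumptions: `Format` and `choose_output_format` are hypothetical names, and the idea that an uninitialized destination otherwise inherits the source format is an assumption, not something the description states.

```cpp
// Hypothetical format tag for illustration only.
enum class Format { UInt8, UInt16, Half, Float };

// Picks the destination format:
//  - a preallocated destination keeps the caller's format;
//  - an uninitialized destination that would naturally be uint8 is
//    promoted to uint16;
//  - otherwise (assumed) the source format is used unchanged.
inline Format choose_output_format(Format src, bool dst_preallocated, Format dst_format) {
    if (dst_preallocated)
        return dst_format;
    return src == Format::UInt8 ? Format::UInt16 : src;
}
```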

Tests

Built OIIO with HWY enabled and compared the HWY path against the existing implementation on synthetic and real image cases.

Tested all input/output pairs across:

  • uint8
  • uint16
  • half
  • float

Images used:

  • synthetic 4092x4092 RGBA case
  • real odd-size 3001x1997 crop
  • full-resolution RGB and grayscale image/mask set

Observed max differences:

  • float outputs: about 4e-7
  • uint16 outputs: about one quantization step, 1.5e-5
  • half outputs: about one half step, 4.9e-4
  • uint8 outputs: about one 8-bit step, 0.00392
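For reference, the step sizes these differences are compared against can be computed directly. The sketch below assumes values normalized to [0, 1] and, for half and float, measures the spacing between adjacent representable values in [0.5, 1); the names are illustrative only.

```cpp
#include <cmath>

// Size of one integer code step when values are normalized to [0, 1].
constexpr double kUint8Step  = 1.0 / 255.0;    // about 0.0039216
constexpr double kUint16Step = 1.0 / 65535.0;  // about 1.5259e-5

// Spacing between adjacent values in [0.5, 1) for IEEE binary16 (half),
// which has 10 fraction bits: 2^(-1 - 10) = 2^-11.
inline double half_spacing_below_one() { return std::ldexp(1.0, -11); }

// Same for IEEE binary32 (float), which has 23 fraction bits: 2^-24.
inline double float_spacing_below_one() { return std::ldexp(1.0, -24); }
```

The half spacing is exactly 0.00048828125, which matches the reported half difference, and the reported float difference of about 4e-7 is on the order of seven float spacings (each about 5.96e-8) in that range.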

Typical speedups:

  • synthetic: about 1.7x-2.7x
  • real odd-size images: about 2.0x-2.9x
  • full-resolution uint16: RGB 1.126s -> 0.479s (about 2.35x), grayscale 0.822s -> 0.274s (about 3.0x)

Benchmarks

Synthetic 4092x4092 (input -> output, speedup, max diff)

uint8  -> uint8   2.637x   diff 0.003921598
uint8  -> uint16  2.312x   diff 0.000015318
uint8  -> half    2.366x   diff 0.000488281
uint8  -> float   1.940x   diff 0.000000477

uint16 -> uint8   2.647x   diff 0.003921568
uint16 -> uint16  2.516x   diff 0.000015318
uint16 -> half    2.660x   diff 0.000488281
uint16 -> float   1.905x   diff 0.000000417

half   -> uint8   2.711x   diff 0.003921568
half   -> uint16  2.730x   diff 0.000015318
half   -> half    2.521x   diff 0.000488281
half   -> float   1.743x   diff 0.000000417

float  -> uint8   2.570x   diff 0.003921628
float  -> uint16  2.607x   diff 0.000015318
float  -> half    2.514x   diff 0.000488281
float  -> float   1.687x   diff 0.000000417

Real Odd Crop 3001x1997 (input -> output, speedup, max diff)

uint8  -> uint8   2.631x   diff 0.003921568
uint8  -> uint16  2.910x   diff 0.000015318
uint8  -> half    2.885x   diff 0.000488281
uint8  -> float   1.997x   diff 0.000000477

uint16 -> uint8   2.746x   diff 0.003921598
uint16 -> uint16  2.727x   diff 0.000015318
uint16 -> half    2.669x   diff 0.000488281
uint16 -> float   2.342x   diff 0.000000417

half   -> uint8   2.759x   diff 0.003921568
half   -> uint16  2.867x   diff 0.000015318
half   -> half    2.706x   diff 0.000488281
half   -> float   2.247x   diff 0.000000417

float  -> uint8   2.679x   diff 0.003921598
float  -> uint16  2.887x   diff 0.000015318
float  -> half    2.934x   diff 0.000488281
float  -> float   2.105x   diff 0.000000417

Float-output accuracy is still in the ~4e-7 range. Integer/half diffs are one output quantization step.
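A one-step difference in integer output is what a tiny float difference produces when the two values straddle a rounding boundary. The sketch below shows this with a hypothetical `to_uint8` helper that clamps and rounds to nearest; OIIO's actual conversion routine may differ in detail.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical float -> uint8 conversion: clamp to [0, 1], scale by 255,
// round to the nearest integer code.
inline std::uint8_t to_uint8(float v) {
    const float clamped = std::min(1.0f, std::max(0.0f, v));
    return static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
}
```

With this helper, `to_uint8(0.4999998f)` gives 127 and `to_uint8(0.5000002f)` gives 128: the inputs differ by about 4e-7, yet the outputs differ by one code, or 1/255 (about 0.00392) once normalized.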

Checklist:

  • I have read the guidelines on contributions and code review procedures.
  • I have read the Policy on AI Coding Assistants
    and if I used AI coding assistants, I have an Assisted-by: Codex GPT5.5 xHigh
    line in the pull request description above.
  • I have updated the documentation if my PR adds features or changes
    behavior.
  • I am sure that this PR's changes are tested in the testsuite.
  • I have run and passed the testsuite in CI before submitting the
    PR, by pushing the changes to my fork and seeing that the automated CI
    passed there. (Exceptions: If most tests pass and you can't figure out why
    the remaining ones fail, it's ok to submit the PR and ask for help. Or if
    any failures seem entirely unrelated to your change; sometimes things break
    on the GitHub runners.)
  • My code follows the prevailing code style of this project and I
    fixed any problems reported by the clang-format CI test.
  • If I added or modified a public C++ API call, I have also amended the
    corresponding Python bindings. If altering ImageBufAlgo functions, I also
    exposed the new functionality as oiiotool options.

HWY-accelerated push/pull implementation for hole filling (guarded by OIIO_USE_HWY). Adds SIMD-aware helpers, data structures (PushPullLevel, tiled views, weight structs), tiled pull/push routines, and finalization paths for float/half/uint16/uint8 and 2/4 channel images.
Signed-off-by: Vlad (Kuzmin) Erium <[email protected]>