
Prevent FP contractions to FMA in index/.../content.hpp to avoid numerical instability in rtree #1453

Merged
tinko92 merged 1 commit into boostorg:develop from tinko92:fix/index-content-fp-contract-off
Apr 14, 2026

Conversation

@tinko92
Collaborator

@tinko92 tinko92 commented Mar 27, 2026

This PR addresses #1452. The issue reports a performance regression on GCC with -march=x86-64-v3 over -march=x86-64-v2, presumably with optimizations enabled (the compilation flags are not fully specified). After instrumenting the code further, I noticed many more calls to distance_cross_track and a different tree structure in v3 vs. v2 with less efficient querying (~twice as many geographic distance computations). The issue no longer replicates after any one of the following changes: a) disabling inlining, b) disabling FMA, c) removing the points with duplicate coordinates from the data.csv in the issue.

This points to a numerical instability in tree insertion caused by the fusing of a multiply and an add/sub (from b) across function boundaries (from a), in some expression where cancellation to == 0.0 occurs without FMA but not with it (from c). After disallowing contraction for the content algorithm with fp-contract=off, as in the commit, the issue disappears. The content algorithm ends in a multiply, so when this small method is inlined, that multiply may be fused into the next addition/subtraction at the call site, which fits the findings.

Multiple call sites of content seem to be affected by this (changing call sites individually yielded different tree structures), so I think disallowing the fusing at the level of content.hpp is the cleanest solution. I consider this a conservative change because it keeps the behavior similar to how it was at x86-64-v2 and earlier (probably the most tested settings), where no FMA is available, and also similar to Clang and MSVC, which, unlike GCC, do not default to fp-contract=fast when optimizations are enabled.

It is guarded behind the BOOST_GCC macro to avoid warnings from compilers that do not support this pragma, and because GCC is the only compiler I am aware of that defaults to fp-contract=fast at common optimization levels without it being enabled specifically or something well-known to be unsafe, like -ffast-math, being set.

Edit: Added an explanatory comment. No test case added because the behaviour changes are compiler- and machine-dependent and I see no guarantees with respect to CPU architectures for the GitHub CI.

@tinko92 tinko92 force-pushed the fix/index-content-fp-contract-off branch from 92f2d52 to dd01797 on March 28, 2026 11:50
Comment thread on include/boost/geometry/index/detail/algorithms/content.hpp
Member

@vissarion vissarion left a comment


Thanks @tinko92, I am OK with the changes.

I have one question: can the issue be replicated (i.e. slow performance) with clang 14+ (which defaults to -ffp-contract=on)?

@barendgehrels
Collaborator

I'm still fine with the change - but I might have an alternative.

@tinko92 can you personally reproduce it?

I think you can:

After instrumenting the code further, I noticed many more calls to distance_cross_track

so that might help me as well. More news tonight.

@tinko92
Collaborator Author

tinko92 commented Mar 30, 2026

@vissarion Sorry, unintended edit earlier, I meant to answer. Here is the answer:

I have one question: can the issue be replicated (i.e. slow performance) with clang 14+ (which defaults to -ffp-contract=on)?

It does not replicate with clang for me. Here are different combinations of compilers and fp-contract settings (the line "2:" reports the count of internal nodes, which is interpretable to me only insofar as it differs between cases):

for command in "g++ -ffp-contract=fast" "g++ -ffp-contract=on" "g++ -ffp-contract=off" "clang++ -ffp-contract=fast" "clang++ -ffp-contract=on" "clang++ -ffp-contract=off"; do echo $command && ($command -I/home/bartels/dev/boost/ -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && ./a.out | grep -E "query all took|^2:"); done
g++ -ffp-contract=fast #<-- default
2: 6685 #<-- rtree structure that seems to hurt performance
query all took :   6726.298 ms
g++ -ffp-contract=on
2: 6763
query all took :   4316.571 ms
g++ -ffp-contract=off
2: 6763
query all took :   4555.714 ms
clang++ -ffp-contract=fast
2: 6763
query all took :   4734.149 ms
clang++ -ffp-contract=on
2: 6763
query all took :   4323.103 ms
clang++ -ffp-contract=off
2: 6763
query all took :   4801.359 ms

These runs used GCC 14 and Clang 19 on this machine. Clang contracts far fewer multiplications and additions:

bartels@DESKTOP-BPNV3G3:~/dev/bg_1452/test$ g++ -ffp-contract=fast -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && objdump -d a.out | grep -E "vfmadd|vfmsub" | wc -l
131
bartels@DESKTOP-BPNV3G3:~/dev/bg_1452/test$ clang++ -ffp-contract=fast -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && objdump -d a.out | grep -E "vfmadd|vfmsub" | wc -l
64
bartels@DESKTOP-BPNV3G3:~/dev/bg_1452/test$ clang++ -ffp-contract=on -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && objdump -d a.out | grep -E "vfmadd|vfmsub" | wc -l
50
bartels@DESKTOP-BPNV3G3:~/dev/bg_1452/test$ clang++ -ffp-contract=off -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && objdump -d a.out | grep -E "vfmadd|vfmsub" | wc -l
0

Apparently it emits none that trigger the problem of issue #1452 for that example program. I don't know what in Clang's optimizer causes this, or whether some unrelated changes (that lead to instruction reordering and suddenly make previously unfused operations candidates for fusing) could affect this.

@tinko92
Collaborator Author

tinko92 commented Mar 30, 2026

@barendgehrels

can you personally reproduce it?

Yes, on a server I have access to with an Intel Xeon CPU (every x86 CPU that supports x86-64-v3 should work, which is most since Intel Haswell, I think). The reporter of the issue has an AMD CPU.

It directly replicates with

g++ -I/home/bartels/dev/boost/ -DNDEBUG -O3 -march=x86-64-v3 main.cpp -std=c++20 && ./a.out for me, but many other combinations also work.

so that might help me as well

The instrumented code? It is rather messy/crude because it was never intended for sharing; it could probably be done better with a profiler (I have no IDE on the remote x86 system; my own working machine is ARM and cannot replicate the issue). I appended the edited main.cpp (as .txt due to GitHub file filters) and pushed the changed BG code for counting as-is to my fork (it can be pulled from https://github.com/tinko92/geometry/tree/experiment/issue_1452_instrumentation ).
main.cpp.txt

@barendgehrels
Collaborator

so that might help me as well

The instrumented code? It is rather messy/crude because it was never intended for sharing, probably could be done better with a profiler

Thanks, I actually meant that you could maybe check a PR if I create it... But good to know anyway. And it's good that you can reproduce it.

@barendgehrels
Collaborator

Now see:

It does not replicate with clang for me.

OK, clear. TBC.

@barendgehrels
Collaborator

See also #1455 (and #1454)

@sandman7920

I have tested #1455 and #1454 applied over boost-1.90.0

no patch, -ffp-contract=off: query all took 3897.396 ms
patch applied:               query all took 4996.639 ms
no patch:                    query all took 6947.749 ms

@barendgehrels
Collaborator

I have tested #1455 and #1454 applied over boost-1.90.0

no patch, -ffp-contract=off: query all took 3897.396 ms
patch applied:               query all took 4996.639 ms
no patch:                    query all took 6947.749 ms

Thanks! At least it helps.

There is no objection to applying both PRs. It would make sense.

@sandman7920 so we can also compare it with:

./gcc_boost_1.75_v2
query all took : 3626.386 ms

That means it's back to what it was, great.

One detail: no patch means just changing ffp-contract in the compiler, and not applying #1453 as well?

@sandman7920

One detail: no patch means just changing ffp-contract in the compiler, and not applying #1453 as well?

Yes

@tinko92
Collaborator Author

tinko92 commented Mar 31, 2026

I have tested #1455 and #1454 applied over boost-1.90.0

no patch -ffp-contract=off query all took: 3897.396 ms patch applied query all took : 4996.639 ms no patch query all took : 6947.749 ms

Just for future reference (because this discussion is now distributed across 1 issue and 2 PRs), here is a test including calls to distance and tree metadata. I find the geographic_cross_track call count helpful because it is exact, whereas the timings are a little noisy. Note that -ffp-contract=off applies to the entire compilation what this PR applies only within index::detail::content.

@tinko92 tinko92 marked this pull request as draft April 7, 2026 15:05
@tinko92 tinko92 marked this pull request as ready for review April 12, 2026 17:23
@tinko92
Collaborator Author

tinko92 commented Apr 12, 2026

An error-bound based approach I tested (in a similar direction to #1455) showed a lot of promise on arm64 but did not resolve this issue on x86 either. Because of that, I am marking this PR ready for review and I am in favour of merging it as the solution.

@barendgehrels
Collaborator

I'm certainly fine with it - thanks for your investigations!

So we merge this one, feel free to merge.

Should I close mine? (#1455) That's fine.

The #1454 is different and still valid.

@tinko92 tinko92 merged commit 2b47156 into boostorg:develop Apr 14, 2026
17 checks passed
@tinko92 tinko92 deleted the fix/index-content-fp-contract-off branch April 16, 2026 11:51