
Phoenix GPU: bump nvhpc to 25.5 and use system openmpi #1393

Closed
sbryngelson wants to merge 2 commits into MFlowCode:master from sbryngelson:phoenix-nvhpc-25.5

Conversation

@sbryngelson
Member

Summary

  • Update the Phoenix GPU module set: nvhpc/24.5 → nvhpc/25.5, drop hpcx/2.19-cuda, add the system openmpi module.

Test plan

  • source ./mfc.sh load -c p -m g succeeds on Phoenix
  • ./mfc.sh build --gpu acc -j 8 succeeds on Phoenix GPU node
  • Run a representative GPU case to confirm runtime

@sbryngelson sbryngelson marked this pull request as ready for review May 2, 2026 14:32
@qodo-code-review
Contributor

qodo-code-review Bot commented May 2, 2026

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0)



Remediation recommended

1. Unpinned OpenMPI for NVHPC 🐞 Bug ☼ Reliability
Description
p-gpu now loads an unversioned openmpi module while exporting FC=nvfortran, but MPI is enabled
by default and CMake requires MPI::MPI_Fortran, which depends on a Fortran MPI installation
compatible with the active compiler. Because openmpi is not versioned/compiler-qualified here
(unlike other NVHPC module sets), the resulting toolchain can become non-reproducible and may fail
to configure/build if the default OpenMPI is not NVHPC-compatible.
Code

toolchain/modules[40]

+p-gpu python/3.12.5 nvhpc/25.5 cuda/12.1.1 openmpi
Evidence
The load mechanism concatenates *-all and *-gpu tokens and passes them directly to `module load`. Phoenix GPU explicitly sets `FC=nvfortran`, while MFC enables MPI by default and the CMake
build unconditionally requires MPI Fortran when MPI is enabled. Therefore the module list must
reliably provide an MPI implementation whose Fortran modules/libs are compatible with NVHPC; using
an unpinned openmpi token makes that selection ambiguous and less reproducible than other NVHPC
GPU entries that encode the compiler/variant in the module selection.

toolchain/modules[37-42]
toolchain/bootstrap/modules.sh[106-115]
toolchain/mfc/state.py[13-23]
toolchain/mfc/cli/docs_gen.py[93-98]
toolchain/mfc/build.py[414-422]
CMakeLists.txt[562-571]
toolchain/modules[9-13]
toolchain/modules[105-113]
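As a quick sanity check of the concern above, one could inspect which compiler the MPI Fortran wrapper actually resolves to after loading the modules. This is a hypothetical diagnostic, not part of MFC's toolchain; the helper name `check_wrapper_line` is an assumption.

```shell
#!/bin/sh
# Hypothetical diagnostic: after `module load ... openmpi`, check which
# compiler backs the MPI Fortran wrapper. An unpinned openmpi module may
# resolve to a GCC-built MPI whose Fortran modules nvfortran cannot use.
check_wrapper_line() {
    # $1 is the first line of `mpifort --version`
    case "$1" in
        *nvfortran*) echo "compatible" ;;
        *)           echo "incompatible" ;;
    esac
}

if command -v mpifort >/dev/null 2>&1; then
    echo "mpifort backend: $(check_wrapper_line "$(mpifort --version 2>/dev/null | head -n 1)")"
else
    echo "mpifort not found; CMake's MPI::MPI_Fortran would fail to configure"
fi
```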

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Phoenix GPU now loads a generic `openmpi` module while forcing `FC=nvfortran`, but MPI is enabled by default and the build requires a compiler-compatible MPI Fortran installation. Leaving `openmpi` unversioned/compiler-agnostic makes the toolchain selection ambiguous and can break configuration/build depending on what `openmpi` resolves to.
### Issue Context
- The module loader concatenates `p-all` + `p-gpu` and calls `module load` on the resulting tokens.
- The build enables MPI by default and CMake requires `MPI::MPI_Fortran`.
- Other NVHPC GPU entries encode the MPI/compiler coupling (e.g., `openmpi/...-nvhpc...` or explicitly using NVHPC MPI wrapper binaries).
### Fix Focus Areas
- toolchain/modules[37-41]
### Suggested changes
- Replace `openmpi` with the exact Phoenix system module that is built for NVHPC 25.5 (e.g., `openmpi/<ver>-nvhpc25.5` if that’s how Phoenix names it), OR
- Switch Phoenix GPU to use the MPI wrapper compilers from the intended MPI stack (set `CC/CXX/FC` to `mpicc/mpicxx/mpifort` paths for the NVHPC-compatible MPI), similar to how `h-gpu` is pinned to NVHPC comm_libs wrappers.
- Keep the module list version-pinned where possible for reproducible CI and user builds.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
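For illustration, the first suggested change might look like the sketch below in toolchain/modules. The `<ver>` placeholder and the naming pattern are assumptions; the real module name must be taken from `module avail openmpi` on Phoenix.

```
# Hypothetical pinned entry (replace <ver> with the actual NVHPC-built
# OpenMPI version that `module avail` reports on Phoenix):
p-gpu     python/3.12.5 nvhpc/25.5 cuda/12.1.1 openmpi/<ver>-nvhpc25.5
```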



Advisory comments

2. Redundant python module token 🐞 Bug ⚙ Maintainability
Description
p-gpu repeats python/3.12.5 even though p-all already loads it, and the loader concatenates
both lists, so Python is redundantly passed to module load. This increases drift risk if one line
is updated later but the other is not.
Code

toolchain/modules[40]

+p-gpu python/3.12.5 nvhpc/25.5 cuda/12.1.1 openmpi
Evidence
Phoenix defines python/3.12.5 in p-all and again in p-gpu; the module loader concatenates
p-all and p-gpu tokens and loads them as a single list, so Python will be requested twice.

toolchain/modules[37-41]
toolchain/bootstrap/modules.sh[106-112]
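The concatenation behavior described above can be sketched as follows. This is a minimal, self-contained illustration with an assumed awk-based parser and sample data mirroring the PR's entry, not the actual modules.sh implementation.

```shell
#!/bin/sh
# Hypothetical sketch of the loader: gather the <slug>-all and <slug>-gpu
# tokens from a modules file and issue a single `module load` over both.
modules_file=$(mktemp)
cat > "$modules_file" <<'EOF'
p-all python/3.12.5
p-gpu python/3.12.5 nvhpc/25.5 cuda/12.1.1 openmpi
EOF

slug="p"
all_tokens=$(awk -v k="${slug}-all" '$1 == k { $1 = ""; print }' "$modules_file")
gpu_tokens=$(awk -v k="${slug}-gpu" '$1 == k { $1 = ""; print }' "$modules_file")
ELEMENTS="$all_tokens $gpu_tokens"

# Both lists carry python/3.12.5, so the duplicate token reaches
# `module load`, and the final token is the unpinned `openmpi`.
echo "module load$ELEMENTS"
rm -f "$modules_file"
```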

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Phoenix GPU (`p-gpu`) redundantly lists `python/3.12.5` even though it is already included in `p-all`. The loader combines `p-all` and `p-gpu`, so the duplicate is unnecessary and can lead to future drift.
### Issue Context
The loader builds `ELEMENTS` as `(<slug>-all) + (<slug>-gpu)` and then runs a single `module load` over the resulting module tokens.
### Fix Focus Areas
- toolchain/modules[37-41]
- toolchain/bootstrap/modules.sh[106-112]
### Suggested change
- Remove `python/3.12.5` from the `p-gpu` module list (keep it only in `p-all`).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Previous review results

Review updated until commit 717a8b3


@coderabbitai
Contributor

coderabbitai Bot commented May 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 4c23dd4f-670f-4f96-b510-b9b42ba9320b

📥 Commits

Reviewing files that changed from the base of the PR and between 1033ff6 and 717a8b3.

📒 Files selected for processing (1)
  • toolchain/modules

📝 Walkthrough

Walkthrough

The p-gpu module stack definition in toolchain/modules was updated to use NVHPC version 25.5 instead of 24.5. The hpcx/2.19-cuda component was removed from the module list and replaced with openmpi (without a specific version specified). The remaining components in the stack (python/3.12.5 and cuda/12.1.1) and subsequent configuration directives remain unchanged.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Description check (⚠️ Warning): The description includes a clear summary of changes and a test plan but lacks the required template structure (issue reference, type of change, and checklist items). Resolution: add the missing template sections as specified in the repository template: issue reference (Fixes #...), type-of-change checkbox, testing details, and completion checklist items.
✅ Passed checks (4 passed)
  • Title check: The title directly and concisely describes the main changes: updating nvhpc from 24.5 to 25.5 and switching to the system openmpi instead of hpcx.
  • Docstring Coverage: No functions found in the changed files; docstring coverage check skipped.
  • Linked Issues check: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Review rate limit: 0/1 reviews remaining; refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@qodo-code-review
Contributor

qodo-code-review Bot commented May 2, 2026

Persistent review updated to latest commit 717a8b3

@codecov

codecov Bot commented May 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.75%. Comparing base (02eb4ec) to head (f716f51).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1393   +/-   ##
=======================================
  Coverage   64.75%   64.75%           
=======================================
  Files          71       71           
  Lines       18721    18721           
  Branches     1551     1551           
=======================================
+ Hits        12122    12123    +1     
+ Misses       5641     5640    -1     
  Partials      958      958           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


@sbryngelson
Member Author

nvhpc 25.5 is working but about 15% slower than 24.5

@sbryngelson sbryngelson closed this May 2, 2026