Skip to content

fix: Infinite hang when MPI process crashes#4001

Draft
arng40 wants to merge 7 commits intodevelopfrom
bugfix/dudes/endless-wait-crash-mpi
Draft

fix: Infinite hang when MPI process crashes#4001
arng40 wants to merge 7 commits intodevelopfrom
bugfix/dudes/endless-wait-crash-mpi

Conversation

@arng40
Copy link
Contributor

@arng40 arng40 commented Mar 18, 2026

The bug occurs when you launch a simulation with multiple ranks and one of the ranks calls a GEOS_THROW. This results in an infinite hang

@arng40 arng40 self-assigned this Mar 18, 2026
@arng40 arng40 added type: bug Something isn't working flag: no rebaseline Does not require rebaseline labels Mar 18, 2026
@arng40 arng40 changed the title bugfix: Infinite hang when MPI process crashes fix: Infinite hang when MPI process crashes Mar 18, 2026
@jhuang2601 jhuang2601 added ci: run CUDA builds Allows to triggers (costly) CUDA jobs ci: run integrated tests Allows to run the integrated tests in GEOS CI ci: run code coverage enables running of the code coverage CI jobs labels Mar 18, 2026
@rrsettgast
Copy link
Member

rrsettgast commented Mar 18, 2026

The second call to basicCleanup() was done intentionally in:
#3332

to avoid cleanup issues. Are you sure this doesn't reintroduce the issues that it fixed?

@arng40 arng40 marked this pull request as draft March 19, 2026 08:33
Copy link
Contributor

@MelReyCG MelReyCG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so let's change the strategy:

  • separate anything that is not about cleaning up from basicCleanup() to a new function (finalizeRun()? something else?),
  • ensure to call basicCleanup() in any scenario, to ensure it is called, but any logging of a potencial error must be done before.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes GEOSX shutdown/cleanup behavior to avoid an infinite MPI hang when a rank crashes after throwing a GEOS exception, by threading an inError flag through cleanup/finalization and adjusting MPI finalization behavior.

Changes:

  • Add an inError parameter to basicCleanup, cleanupEnvironment, finalizeMPI, and MpiWrapper::finalize to distinguish normal vs. error shutdown.
  • Avoid normal MPI finalization/reporting on error paths (terminate via the error handler instead) to prevent collective/finalize hangs.
  • Update main entrypoints and a broad set of unit/integration tests to use the new cleanup signatures.

Reviewed changes

Copilot reviewed 88 out of 88 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/pygeosx/pygeosx.cpp Update Python finalize path to call basicCleanup(false).
src/main/main.cpp Add memory stats report on normal exit and pass inError to cleanup; adjust exception paths.
src/coreComponents/physicsSolvers/fluidFlow/unitTests/testFlowStatistics.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/physicsSolvers/fluidFlow/kernels/singlePhase/unitTests/testSinglePhaseMobilityKernel.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/mesh/unitTests/testGraphColoringMPI.cpp Update MPI finalize call signature to finalize(false).
src/coreComponents/mainInterface/initialization.hpp Change basicCleanup API to basicCleanup(bool inError) and document param.
src/coreComponents/mainInterface/initialization.cpp Wire basicCleanup(inError) to cleanupEnvironment(inError).
src/coreComponents/linearAlgebra/utilities/unitTests/testLAIHelperFunctions.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/linearAlgebra/unitTests/testLinearAlgebraUtils.hpp Update cleanupEnvironment(false) usage in test scope destructor.
src/coreComponents/integrationTests/xmlTests/testXMLFile.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/xmlTests/testXML.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testThermalReservoirCompositionalMultiphaseSSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testThermalReservoirCompositionalMultiphaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testReservoirThermalSinglePhaseMSWells_RateInj.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testReservoirThermalSinglePhaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testReservoirSinglePhaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testReservoirCompositionalMultiphaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testOpenClosePerf.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testIsothermalReservoirCompositionalMultiphaseSSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wellsTests/testIsothermalReservoirCompositionalMultiphaseMSWells.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationTaper.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationQ2.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticVTI.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticTTI.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticFirstOrder.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationDG.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationDAS.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationElastic.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationAcousticVTI.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationAcoustic.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAdjoint1.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAcousticFirstOrder.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/wavePropagationTests/testWavePropagation.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/tableFunctionsFileTests/testTableFunctionsOutput.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/solverStatisticsTests/testSolverStats.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/meshTests/testVTKImport.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/meshTests/testNeighborCommunicator.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/meshTests/testMeshGeneration.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/meshTests/testElementRegions.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testTransmissibility.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testThermalSinglePhaseFlow.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testThermalCompMultiphaseFlow.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testSinglePhaseReactiveTransport.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testSinglePhaseMFDPolyhedral.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testReactiveCompositionalMultiphaseOBL.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testImmiscibleMultiphaseFlow.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testFixedDimHydrostaticEquilibrium.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testCompositionalMultiPhaseMFDPolyhedral.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testCompMultiphaseFlowHybrid.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fluidFlowTests/testCompMultiphaseFlow.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fieldSpecificationTests/testRecursiveFieldApplication.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fieldSpecificationTests/testFieldSpecification.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/fieldSpecificationTests/testAquiferBoundaryCondition.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/dataRepositoryTests/testRestartExtended.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/dataRepositoryTests/testRestartBasic.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/dataRepositoryTests/testGroupPath.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/constitutiveTests/testTriaxial.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/constitutiveTests/testRelPermHysteresis.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/constitutiveTests/testReactiveFluid.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/integrationTests/constitutiveTests/testPVT.cpp Update integration test cleanup call to basicCleanup(false).
src/coreComponents/functions/unitTests/testFunctions.cpp Update commented-out cleanup call to new signature.
src/coreComponents/finiteVolume/mimeticInnerProducts/unitTests/testMimeticInnerProducts.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/finiteElement/elementFormulations/unitTests/testConformingVirtualElementOrder1.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/fileIO/timeHistory/unitTests/testHDFParallelFile.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/fileIO/timeHistory/unitTests/testHDFFile.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/fileIO/Outputs/unitTests/testMemoryStats.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/denseLinearAlgebra/unitTests/testBlasLapack.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/dataRepository/unitTests/testWrapperHelpers.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/dataRepository/unitTests/testObjectCatalog.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/dataRepository/unitTests/testErrorHandling.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testTwoPhaseImmiscibleFluid.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidTwoPhaseCompositionalMultiphase.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidThreePhaseCompositionalMultiphase.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidDeadOil.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidCO2Brine.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testMultiFluidBlackOil.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testInverseCapillaryPressure.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testCO2SpycherPruessModels.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/unitTests/testCO2BrinePVTModels.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/constitutive/relativePermeability/unitTests/testRelPerm.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/constitutive/capillaryPressure/unitTests/testCapillaryPressure.cpp Update test cleanup call to basicCleanup(false).
src/coreComponents/common/unitTests/testMpiWrapper.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/common/initializeEnvironment.hpp Update finalizeMPI/cleanupEnvironment APIs to accept inError.
src/coreComponents/common/initializeEnvironment.cpp Thread inError through environment cleanup and MPI finalize; remove memory report from cleanup.
src/coreComponents/common/format/table/unitTests/testMpiTable.cpp Update test environment cleanup call to cleanupEnvironment(false).
src/coreComponents/common/MpiWrapper.hpp Update MPI finalize API to finalize(bool inError) and document param.
src/coreComponents/common/MpiWrapper.cpp Implement finalize(bool) to skip MPI_Finalize() on error and terminate via error handler.
src/coreComponents/codingUtilities/tests/testParallelTestUtilities.cpp Update test cleanup call to basicCleanup(false).


/**
* @brief Cleanup/finalize the environment.
* @param inError indicate if an exception occured
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in parameter documentation: "occured" should be "occurred".

Suggested change
* @param inError indicate if an exception occured
* @param inError indicate if an exception occurred

Copilot uses AI. Check for mistakes.
* @brief Free MPI managed resources, then call MPI_Finalize().
* @brief Free MPI managed resources
* Please note that once called, MPI functions, communicators and resources can no longer be used.
* @param inError indicate if an exception occured
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in parameter documentation: "occured" should be "occurred".

Suggested change
* @param inError indicate if an exception occured
* @param inError indicate if an exception occurred

Copilot uses AI. Check for mistakes.
Comment on lines +199 to +201
* @brief Free MPI managed resources
* Please note that once called, MPI functions, communicators and resources can no longer be used.
* @param inError indicate if an exception occured
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated docstring for MpiWrapper::finalize says it only frees MPI-managed resources, but in the non-error case it still calls MPI_Finalize(), and in the error case it terminates via the error handler. Please update the comment to accurately describe these behaviors so callers understand the lifecycle/termination semantics.

Suggested change
* @brief Free MPI managed resources
* Please note that once called, MPI functions, communicators and resources can no longer be used.
* @param inError indicate if an exception occured
* @brief Finalize the MPI environment and free MPI-managed resources.
*
* In the normal (non-error) case, this calls MPI_Finalize() and releases all MPI-managed
* resources. After a successful return from this function, MPI functions, communicators,
* and other MPI resources can no longer be used.
*
* When called with @p inError set to true, this function invokes the MPI/GEOS error handler
* and will not return normally (it is expected to terminate the process / abort MPI).
*
* @param inError Indicates whether an error has occurred; if true, the error handler is invoked
* and the process is expected to terminate instead of returning.

Copilot uses AI. Check for mistakes.
Comment on lines 80 to 84
catch( geos::Exception & e )
{ // GEOS generated exceptions management
ErrorLogger::global().flushCurrentExceptionMessage();
basicCleanup();
// lvarray error handler is just program termination
LvArray::system::callErrorHandler();
basicCleanup( true );
}
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this catch( geos::Exception & ) path, the program no longer explicitly terminates after cleanup. If GEOS is built without MPI (or if MpiWrapper::finalize(true) doesn't abort), execution can fall through and return 0, effectively swallowing the error. Please ensure this path always terminates with a non-zero exit (e.g., call LvArray::system::callErrorHandler()/std::abort() after basicCleanup(true), or guarantee basicCleanup(true) never returns).

Copilot uses AI. Check for mistakes.
Comment on lines +165 to 169
else
{
LvArray::system::callErrorHandler();
}
#endif
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MpiWrapper::finalize(true) only calls LvArray::system::callErrorHandler() under #ifdef GEOS_USE_MPI. In non-MPI builds, the function becomes a no-op even when inError==true, which can let callers continue running after a fatal error. Consider handling the inError case outside the MPI guard (e.g., always call the error handler/abort when inError is true, regardless of MPI).

Copilot uses AI. Check for mistakes.
Comment on lines +165 to +166
MpiWrapper::commFree( MPI_COMM_GEOS );
MpiWrapper::finalize();
MpiWrapper::finalize( false );
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MpiWrapper::finalize(false) now frees MPI_COMM_GEOS internally. This test frees MPI_COMM_GEOS explicitly and then calls finalize(false), which will attempt to free the communicator a second time. Since MpiWrapper::commFree does not guard against MPI_COMM_NULL, this can trigger an MPI error. Remove the explicit commFree here (or adjust the finalize/free sequence to avoid double-free).

Copilot uses AI. Check for mistakes.

/**
* @brief Perform the basic GEOSX cleanup.
* @param inError indicate if an exception occured
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in parameter documentation: "occured" should be "occurred".

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci: run code coverage enables running of the code coverage CI jobs ci: run CUDA builds Allows to triggers (costly) CUDA jobs ci: run integrated tests Allows to run the integrated tests in GEOS CI flag: no rebaseline Does not require rebaseline type: bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants