fix: Infinite hang when MPI process crashes#4001
Conversation
|
The second call to to avoid cleanup issues. Are you sure this doesn't reintroduce the issues that it fixed? |
There was a problem hiding this comment.
Ok, so let's change the strategy:
- separate anything that is not about cleaning up from
basicCleanup()to a new function (finalizeRun()? something else?), - ensure to call
basicCleanup()in any scenario, to ensure it is called, but any logging of a potencial error must be done before.
There was a problem hiding this comment.
Pull request overview
This PR changes GEOSX shutdown/cleanup behavior to avoid an infinite MPI hang when a rank crashes after throwing a GEOS exception, by threading an inError flag through cleanup/finalization and adjusting MPI finalization behavior.
Changes:
- Add an
inErrorparameter tobasicCleanup,cleanupEnvironment,finalizeMPI, andMpiWrapper::finalizeto distinguish normal vs. error shutdown. - Avoid normal MPI finalization/reporting on error paths (terminate via the error handler instead) to prevent collective/finalize hangs.
- Update main entrypoints and a broad set of unit/integration tests to use the new cleanup signatures.
Reviewed changes
Copilot reviewed 88 out of 88 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/pygeosx/pygeosx.cpp | Update Python finalize path to call basicCleanup(false). |
| src/main/main.cpp | Add memory stats report on normal exit and pass inError to cleanup; adjust exception paths. |
| src/coreComponents/physicsSolvers/fluidFlow/unitTests/testFlowStatistics.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/physicsSolvers/fluidFlow/kernels/singlePhase/unitTests/testSinglePhaseMobilityKernel.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/mesh/unitTests/testGraphColoringMPI.cpp | Update MPI finalize call signature to finalize(false). |
| src/coreComponents/mainInterface/initialization.hpp | Change basicCleanup API to basicCleanup(bool inError) and document param. |
| src/coreComponents/mainInterface/initialization.cpp | Wire basicCleanup(inError) to cleanupEnvironment(inError). |
| src/coreComponents/linearAlgebra/utilities/unitTests/testLAIHelperFunctions.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/linearAlgebra/unitTests/testLinearAlgebraUtils.hpp | Update cleanupEnvironment(false) usage in test scope destructor. |
| src/coreComponents/integrationTests/xmlTests/testXMLFile.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/xmlTests/testXML.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testThermalReservoirCompositionalMultiphaseSSWells.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testThermalReservoirCompositionalMultiphaseMSWells.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testReservoirThermalSinglePhaseMSWells_RateInj.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testReservoirThermalSinglePhaseMSWells.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testReservoirSinglePhaseMSWells.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testReservoirCompositionalMultiphaseMSWells.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testOpenClosePerf.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testIsothermalReservoirCompositionalMultiphaseSSWells.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wellsTests/testIsothermalReservoirCompositionalMultiphaseMSWells.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationTaper.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationQ2.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticVTI.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticTTI.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationElasticFirstOrder.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationDG.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationDAS.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationElastic.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationAcousticVTI.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAttenuationAcoustic.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAdjoint1.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagationAcousticFirstOrder.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/wavePropagationTests/testWavePropagation.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/tableFunctionsFileTests/testTableFunctionsOutput.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/solverStatisticsTests/testSolverStats.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/meshTests/testVTKImport.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/meshTests/testNeighborCommunicator.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/meshTests/testMeshGeneration.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/meshTests/testElementRegions.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testTransmissibility.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testThermalSinglePhaseFlow.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testThermalCompMultiphaseFlow.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testSinglePhaseReactiveTransport.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testSinglePhaseMFDPolyhedral.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testReactiveCompositionalMultiphaseOBL.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testImmiscibleMultiphaseFlow.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testFixedDimHydrostaticEquilibrium.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testCompositionalMultiPhaseMFDPolyhedral.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testCompMultiphaseFlowHybrid.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fluidFlowTests/testCompMultiphaseFlow.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fieldSpecificationTests/testRecursiveFieldApplication.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fieldSpecificationTests/testFieldSpecification.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/fieldSpecificationTests/testAquiferBoundaryCondition.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/dataRepositoryTests/testRestartExtended.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/dataRepositoryTests/testRestartBasic.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/dataRepositoryTests/testGroupPath.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/constitutiveTests/testTriaxial.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/constitutiveTests/testRelPermHysteresis.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/constitutiveTests/testReactiveFluid.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/integrationTests/constitutiveTests/testPVT.cpp | Update integration test cleanup call to basicCleanup(false). |
| src/coreComponents/functions/unitTests/testFunctions.cpp | Update commented-out cleanup call to new signature. |
| src/coreComponents/finiteVolume/mimeticInnerProducts/unitTests/testMimeticInnerProducts.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/finiteElement/elementFormulations/unitTests/testConformingVirtualElementOrder1.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/fileIO/timeHistory/unitTests/testHDFParallelFile.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/fileIO/timeHistory/unitTests/testHDFFile.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/fileIO/Outputs/unitTests/testMemoryStats.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/denseLinearAlgebra/unitTests/testBlasLapack.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/dataRepository/unitTests/testWrapperHelpers.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/dataRepository/unitTests/testObjectCatalog.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/dataRepository/unitTests/testErrorHandling.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testTwoPhaseImmiscibleFluid.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testMultiFluidTwoPhaseCompositionalMultiphase.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testMultiFluidThreePhaseCompositionalMultiphase.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testMultiFluidDeadOil.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testMultiFluidCO2Brine.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testMultiFluidBlackOil.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testInverseCapillaryPressure.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testCO2SpycherPruessModels.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/unitTests/testCO2BrinePVTModels.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/constitutive/relativePermeability/unitTests/testRelPerm.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/constitutive/capillaryPressure/unitTests/testCapillaryPressure.cpp | Update test cleanup call to basicCleanup(false). |
| src/coreComponents/common/unitTests/testMpiWrapper.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/common/initializeEnvironment.hpp | Update finalizeMPI/cleanupEnvironment APIs to accept inError. |
| src/coreComponents/common/initializeEnvironment.cpp | Thread inError through environment cleanup and MPI finalize; remove memory report from cleanup. |
| src/coreComponents/common/format/table/unitTests/testMpiTable.cpp | Update test environment cleanup call to cleanupEnvironment(false). |
| src/coreComponents/common/MpiWrapper.hpp | Update MPI finalize API to finalize(bool inError) and document param. |
| src/coreComponents/common/MpiWrapper.cpp | Implement finalize(bool) to skip MPI_Finalize() on error and terminate via error handler. |
| src/coreComponents/codingUtilities/tests/testParallelTestUtilities.cpp | Update test cleanup call to basicCleanup(false). |
|
|
||
| /** | ||
| * @brief Cleanup/finalize the environment. | ||
| * @param inError indicate if an exception occured |
There was a problem hiding this comment.
Typo in parameter documentation: "occured" should be "occurred".
| * @param inError indicate if an exception occured | |
| * @param inError indicate if an exception occurred |
| * @brief Free MPI managed resources, then call MPI_Finalize(). | ||
| * @brief Free MPI managed resources | ||
| * Please note that once called, MPI functions, communicators and resources can no longer be used. | ||
| * @param inError indicate if an exception occured |
There was a problem hiding this comment.
Typo in parameter documentation: "occured" should be "occurred".
| * @param inError indicate if an exception occured | |
| * @param inError indicate if an exception occurred |
| * @brief Free MPI managed resources | ||
| * Please note that once called, MPI functions, communicators and resources can no longer be used. | ||
| * @param inError indicate if an exception occured |
There was a problem hiding this comment.
The updated docstring for MpiWrapper::finalize says it only frees MPI-managed resources, but in the non-error case it still calls MPI_Finalize(), and in the error case it terminates via the error handler. Please update the comment to accurately describe these behaviors so callers understand the lifecycle/termination semantics.
| * @brief Free MPI managed resources | |
| * Please note that once called, MPI functions, communicators and resources can no longer be used. | |
| * @param inError indicate if an exception occured | |
| * @brief Finalize the MPI environment and free MPI-managed resources. | |
| * | |
| * In the normal (non-error) case, this calls MPI_Finalize() and releases all MPI-managed | |
| * resources. After a successful return from this function, MPI functions, communicators, | |
| * and other MPI resources can no longer be used. | |
| * | |
| * When called with @p inError set to true, this function invokes the MPI/GEOS error handler | |
| * and will not return normally (it is expected to terminate the process / abort MPI). | |
| * | |
| * @param inError Indicates whether an error has occurred; if true, the error handler is invoked | |
| * and the process is expected to terminate instead of returning. |
| catch( geos::Exception & e ) | ||
| { // GEOS generated exceptions management | ||
| ErrorLogger::global().flushCurrentExceptionMessage(); | ||
| basicCleanup(); | ||
| // lvarray error handler is just program termination | ||
| LvArray::system::callErrorHandler(); | ||
| basicCleanup( true ); | ||
| } |
There was a problem hiding this comment.
In this catch( geos::Exception & ) path, the program no longer explicitly terminates after cleanup. If GEOS is built without MPI (or if MpiWrapper::finalize(true) doesn't abort), execution can fall through and return 0, effectively swallowing the error. Please ensure this path always terminates with a non-zero exit (e.g., call LvArray::system::callErrorHandler()/std::abort() after basicCleanup(true), or guarantee basicCleanup(true) never returns).
| else | ||
| { | ||
| LvArray::system::callErrorHandler(); | ||
| } | ||
| #endif |
There was a problem hiding this comment.
MpiWrapper::finalize(true) only calls LvArray::system::callErrorHandler() under #ifdef GEOS_USE_MPI. In non-MPI builds, the function becomes a no-op even when inError==true, which can let callers continue running after a fatal error. Consider handling the inError case outside the MPI guard (e.g., always call the error handler/abort when inError is true, regardless of MPI).
| MpiWrapper::commFree( MPI_COMM_GEOS ); | ||
| MpiWrapper::finalize(); | ||
| MpiWrapper::finalize( false ); |
There was a problem hiding this comment.
MpiWrapper::finalize(false) now frees MPI_COMM_GEOS internally. This test frees MPI_COMM_GEOS explicitly and then calls finalize(false), which will attempt to free the communicator a second time. Since MpiWrapper::commFree does not guard against MPI_COMM_NULL, this can trigger an MPI error. Remove the explicit commFree here (or adjust the finalize/free sequence to avoid double-free).
|
|
||
| /** | ||
| * @brief Perform the basic GEOSX cleanup. | ||
| * @param inError indicate if an exception occured |
There was a problem hiding this comment.
Typo in parameter documentation: "occured" should be "occurred".
The bug occurs when you launch a simulation with multiple ranks and one of the ranks calls a GEOS_THROW. This results in an infinite hang