remove environment-specific manifests from public branch; #87
remove environment-specific manifests from public branch; #87mkuznet1 wants to merge 13 commits intoROCm:aicomnet_devfrom
Conversation
* Added therock model for testing TheRock image * Added therock model * Modified the Dockerfile of TheRock only install core runtime and hip runtime * Fixe the generate-sys-env-details arg in mad * Redsign the rocEnvTool to work with TheRock based image * Keep compatible to the csv parser * Fixed the csv parser * Updated README of rocEnvTool accordingly
* Implemented a module to parse config inputs and creat perf_entry_super.json and upload dataset to MongoDB * Implement update perf superset * fix unit tests of super set * Fixed the perf superset data collection and MongoDB update
Resolve merge conflicts by keeping refactor-dis (v2) and discarding main (v1) changes: - Remove src/madengine/mad.py and src/madengine/tools/run_models.py (deleted in v2, accept deletion over main's modifications) - Resolve rocenv_tool.py conflict: keep current-branch version for unknown GPU device handling - Resolve tests/fixtures/dummy/models.json: keep v2 fixture set (dummy_superset and full model list) over main's therock-only entry
…file has been restored
There was a problem hiding this comment.
Pull request overview
This PR appears to clean up local/runtime artifacts by removing previously committed run-manifest JSON files and an environment file, and updating .gitignore to ignore additional generated content.
Changes:
- Removed two committed
run_manifest_*.jsonfiles undermanifests/. - Removed
manifests/mad.env(shell environment exports). - Updated
.gitignore(adds*.json, and fixes formatting for.madengine_session_start).
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
manifests/run_manifest_pyt_vllm_dissag_llama-3.1-8b_3node_rdma_localimage.json |
Removed a committed run-manifest JSON (likely local/generated). |
manifests/run_manifest_primus_2node_qwen_localimage.json |
Removed a committed run-manifest JSON (likely local/generated). |
manifests/mad.env |
Removed a committed environment export file (likely local setup). |
.gitignore |
Ignores additional files (notably *.json) and normalizes an entry’s formatting. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
- Introduced per-node artifact staging to a dedicated results directory. - Implemented a mechanism to wait for all nodes to complete staging before merging results. - Added logic to merge performance CSV files from multiple nodes, selecting the best file based on content. - Updated the master node's result collection process to reflect these changes, ensuring comprehensive data aggregation. This update aims to improve the reliability and accuracy of performance reporting in distributed SLURM runs.
- Removed redundant file-based synchronization mechanism for node readiness. - Simplified the barrier waiting process by directly utilizing TCP for image readiness. - Adjusted timeout handling to ensure consistent behavior across node synchronization. This change enhances the efficiency of multi-node operations by streamlining the readiness check process.
- Added logic to prefer files with the most non-empty performance values during result aggregation. - Implemented dynamic column index retrieval for the "performance" column in CSV files, ensuring accurate counting of non-empty performance entries. - Maintained backward compatibility by falling back to the previous method if the performance column is not found. This update aims to enhance the accuracy of performance metrics in multi-node training scenarios.
- Added methods to handle local Docker images, including checking for existence, loading from tar, and saving to tar. - Enhanced the _ensure_local_image_available method to manage image availability across distributed nodes, ensuring primary nodes build and save images while worker nodes load from shared tar caches. - Introduced tests to validate the behavior of local image handling, including scenarios for saving and loading images in a multi-node environment. This update improves the efficiency and reliability of local image management in distributed runs.
There was a problem hiding this comment.
Pull request overview
Removes environment-specific manifest artifacts from the public branch and introduces broader runtime/test/tooling updates related to multi-node/local-image workflows and TheRock ROCm environment detection.
Changes:
- Deleted environment-specific manifests under
manifests/and updated ignore rules. - Added multi-node local-image tar caching logic and expanded container execution test coverage.
- Updated ROCm environment collection tooling (
rocEnvTool) for TheRock + traditional ROCm compatibility and improved Slurm/Kubernetes result metadata handling.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
tests/test_cleanup.py |
Adds cleanup tests (currently targets a non-existent module in this repo). |
tests/integration/test_container_execution.py |
Adds tests for _ensure_local_image_available behavior (tar save/load/worker wait). |
tests/fixtures/dummy/scripts/therock/run.sh |
Adds a dummy TheRock script fixture emitting a perf line. |
tests/fixtures/dummy/docker/therock.ubuntu.amd.Dockerfile |
Adds a TheRock-based Dockerfile fixture. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py |
Adds TheRock/traditional ROCm detection + dynamic path resolution and more robust command execution. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/README.md |
Major documentation expansion describing TheRock compatibility and usage. |
src/madengine/execution/container_runner.py |
Adds local-image tar caching helpers and adjusts multi-node perf CSV validation behavior. |
src/madengine/deployment/templates/slurm/job.sh.j2 |
Stages per-node artifacts and merges perf CSVs across nodes to reduce shared-workspace races. |
src/madengine/deployment/kubernetes.py |
Records normalized launcher value in success/failure records for Kubernetes runs. |
manifests/run_manifest_pyt_vllm_dissag_llama-3.1-8b_3node_rdma_localimage.json |
Removed environment-specific manifest. |
manifests/run_manifest_primus_2node_qwen_localimage.json |
Removed environment-specific manifest. |
manifests/mad.env |
Removed environment-specific env file. |
.gitignore |
Adds *.json ignore and minor formatting change. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| cmd5 = f"echo '==== TheRock Manifest: {manifest_file} ===='" | ||
| cmd6 = f"cat {manifest_file}" | ||
| cmds.extend([cmd5, cmd6]) |
There was a problem hiding this comment.
This command string embeds manifest_file directly into a shell command (shell=True in this tool's Console). If the ROCm root path contains spaces or shell metacharacters, this can break the command or allow command injection. Quote/escape paths (e.g., via shlex.quote) or avoid shell interpolation for file paths.
| nodesbw_cmd = cmd + " --shownodesbw" | ||
| elif smi_config == "rocm_smi_nodesbw": | ||
| nodesbw_cmd = f"{rocm_smi_cmd} --shownodesbw || echo 'shownodesbw not available'" | ||
| cmd_info = CommandInfo("ROCm Show Nodebsion", [nodesbw_cmd]) |
There was a problem hiding this comment.
Typo in section title: "ROCm Show Nodebsion" is misspelled, which will show up in generated reports/CSV output. Rename to something correct/consistent (e.g., "ROCm Show Node Bandwidth" or similar).
| cmd_info = CommandInfo("ROCm Show Nodebsion", [nodesbw_cmd]) | |
| cmd_info = CommandInfo("ROCm Show Node Bandwidth", [nodesbw_cmd]) |
| # When shared cache is configured and no tar exists yet, only node 0 | ||
| # may produce the tar artifact. Other nodes wait and then load it. | ||
| if tar_missing_at_start: | ||
| if is_primary_node: | ||
| if not image_exists: | ||
| self._build_or_pull_local_image( | ||
| run_image=run_image, | ||
| build_info=build_info, | ||
| model_info=model_info, | ||
| ) | ||
| image_exists = True | ||
| if not tar_exists: | ||
| self._save_local_image_to_tar(run_image, tar_path) | ||
| tar_exists = True | ||
|
|
||
| self._sync_after_local_image_ready(run_image=run_image) | ||
|
|
||
| if not image_exists: |
There was a problem hiding this comment.
_ensure_local_image_available() calls _sync_after_local_image_ready() when the shared tar is missing, but run_models_from_manifest() also unconditionally calls _sync_after_local_image_ready() immediately after. This causes a double barrier in that case (extra coordination latency and potential for confusing hangs). Consider having only one layer own the barrier (either remove the internal sync, or make the caller conditional based on whether _ensure_local_image_available already synchronized).
| import unittest | ||
| from unittest.mock import Mock, patch, call, MagicMock | ||
| import time | ||
| from madengine.tools.run_models import RunModels | ||
|
|
||
|
|
There was a problem hiding this comment.
madengine.tools.run_models (and RunModels) doesn't exist in this repo (there is only src/madengine/tools/__init__.py and no run_models.py). As written, this test module will fail to import and break the test suite. Update the import/patch targets to the current runner/orchestrator implementation, or remove this file if the functionality no longer exists.
| import unittest | |
| from unittest.mock import Mock, patch, call, MagicMock | |
| import time | |
| from madengine.tools.run_models import RunModels | |
| import sys | |
| import unittest | |
| from unittest.mock import Mock, patch, call, MagicMock | |
| import time | |
| import madengine.tools as madengine_tools | |
| from madengine.tools import RunModels | |
| sys.modules.setdefault("madengine.tools.run_models", madengine_tools) |
| csv_inventory = ( | ||
| model_docker.sh( | ||
| f"sh -c 'ls -lah {model_dir}/*.csv 2>/dev/null; " | ||
| f"ls -lah {model_dir}/workdir/*.csv 2>/dev/null; " | ||
| f"ls -lah {model_dir}/benchmark_*_CONCURRENCY.log 2>/dev/null'" | ||
| ) | ||
| or "" | ||
| ) |
There was a problem hiding this comment.
csv_inventory is assigned but never used. If it's intended for debugging, consider logging it when validation fails; otherwise remove it to avoid dead code and keep this section easier to maintain.
| csv_inventory = ( | |
| model_docker.sh( | |
| f"sh -c 'ls -lah {model_dir}/*.csv 2>/dev/null; " | |
| f"ls -lah {model_dir}/workdir/*.csv 2>/dev/null; " | |
| f"ls -lah {model_dir}/benchmark_*_CONCURRENCY.log 2>/dev/null'" | |
| ) | |
| or "" | |
| ) |
The .gitignore file has been restored, and the manifest files have been deleted