remove environment-specific manifests from public branch; by mkuznet1 · Pull Request #87 · ROCm/madengine

mkuznet1 · 2026-03-16T21:47:57Z

The .gitignore file has been restored, and the manifest files have been deleted

* Added therock model for testing TheRock image
* Added therock model
* Modified the TheRock Dockerfile to install only the core runtime and HIP runtime
* Fixed the generate-sys-env-details arg in mad
* Redesigned the rocEnvTool to work with TheRock-based images
* Kept compatibility with the CSV parser
* Fixed the CSV parser
* Updated the rocEnvTool README accordingly

* Implemented a module to parse config inputs, create perf_entry_super.json, and upload the dataset to MongoDB
* Implemented the perf superset update
* Fixed unit tests of the superset
* Fixed the perf superset data collection and MongoDB update

This reverts commit 6d7a660.

Resolve merge conflicts by keeping refactor-dis (v2) and discarding main (v1) changes:

- Remove src/madengine/mad.py and src/madengine/tools/run_models.py (deleted in v2, accept deletion over main's modifications)
- Resolve rocenv_tool.py conflict: keep current-branch version for unknown GPU device handling
- Resolve tests/fixtures/dummy/models.json: keep v2 fixture set (dummy_superset and full model list) over main's therock-only entry

…file has been restored

Copilot

Pull request overview

This PR appears to clean up local/runtime artifacts by removing previously committed run-manifest JSON files and an environment file, and updating .gitignore to ignore additional generated content.

Changes:

Removed two committed run_manifest_*.json files under manifests/.
Removed manifests/mad.env (shell environment exports).
Updated .gitignore (adds *.json, and fixes formatting for .madengine_session_start).

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated no comments.

File	Description
`manifests/run_manifest_pyt_vllm_dissag_llama-3.1-8b_3node_rdma_localimage.json`	Removed a committed run-manifest JSON (likely local/generated).
`manifests/run_manifest_primus_2node_qwen_localimage.json`	Removed a committed run-manifest JSON (likely local/generated).
`manifests/mad.env`	Removed a committed environment export file (likely local setup).
`.gitignore`	Ignores additional files (notably `*.json`) and normalizes an entry’s formatting.

- Introduced per-node artifact staging to a dedicated results directory.
- Implemented a mechanism to wait for all nodes to complete staging before merging results.
- Added logic to merge performance CSV files from multiple nodes, selecting the best file based on content.
- Updated the master node's result collection process to reflect these changes, ensuring comprehensive data aggregation.

This update aims to improve the reliability and accuracy of performance reporting in distributed SLURM runs.

- Removed redundant file-based synchronization mechanism for node readiness.
- Simplified the barrier waiting process by directly utilizing TCP for image readiness.
- Adjusted timeout handling to ensure consistent behavior across node synchronization.

This change enhances the efficiency of multi-node operations by streamlining the readiness check process.
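The commit message does not show how the TCP readiness check is implemented. As a rough sketch of the general idea only (poll a TCP endpoint until it accepts a connection or a deadline passes), the following uses hypothetical names (`wait_for_tcp`, `timeout_s`, `interval_s`) that are not taken from madengine:

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout_s: float, interval_s: float = 0.1) -> bool:
    """Return True once host:port accepts a connection, False after timeout_s.

    Hypothetical helper for illustration; not madengine's actual barrier.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means the peer is listening, i.e. "ready".
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: retry until the single shared deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

Using one monotonic deadline for the whole wait, rather than a per-attempt timeout, is one way to keep timeout behavior consistent across nodes.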

- Added logic to prefer files with the most non-empty performance values during result aggregation.
- Implemented dynamic column index retrieval for the "performance" column in CSV files, ensuring accurate counting of non-empty performance entries.
- Maintained backward compatibility by falling back to the previous method if the performance column is not found.

This update aims to enhance the accuracy of performance metrics in multi-node training scenarios.
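The selection rule described above lives in the SLURM job template, whose code is not shown here. A minimal Python sketch of the same idea follows, assuming a header row containing a `performance` column. The function names are hypothetical, and the fallback (counting non-empty data rows) is an assumption, since the commit does not say what the "previous method" was:

```python
import csv
import io


def count_performance_values(csv_text: str) -> int:
    """Count data rows with a non-empty value in the 'performance' column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return 0
    header, data = rows[0], rows[1:]
    try:
        # Look the column up by name instead of hard-coding its position.
        idx = header.index("performance")
    except ValueError:
        # Assumed fallback: count rows that contain any non-empty cell.
        return sum(1 for row in data if any(cell.strip() for cell in row))
    return sum(1 for row in data if len(row) > idx and row[idx].strip())


def pick_best_csv(candidates: dict[str, str]) -> str:
    """Return the key of the candidate with the most performance values."""
    return max(candidates, key=lambda name: count_performance_values(candidates[name]))
```

For example, given one node's CSV with a blank performance cell and another node's CSV with every cell filled, `pick_best_csv` returns the key of the fully populated file.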

- Added methods to handle local Docker images, including checking for existence, loading from tar, and saving to tar.
- Enhanced the _ensure_local_image_available method to manage image availability across distributed nodes, ensuring primary nodes build and save images while worker nodes load from shared tar caches.
- Introduced tests to validate the behavior of local image handling, including scenarios for saving and loading images in a multi-node environment.

This update improves the efficiency and reliability of local image management in distributed runs.
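The division of work between primary and worker nodes can be sketched as a pure decision function. This is an illustration under assumptions, not the code in `container_runner.py`: `plan_local_image_steps` and the step names are hypothetical, and the behavior when a tar already exists is inferred from the description above. The three argv helpers use the standard Docker CLI subcommands `docker image inspect`, `docker save -o`, and `docker load -i`.

```python
def plan_local_image_steps(is_primary_node: bool, image_exists: bool, tar_exists: bool) -> list[str]:
    """Return the ordered steps one node would take (hypothetical step names)."""
    steps = []
    if not tar_exists:
        if is_primary_node:
            # Only the primary node produces the shared tar artifact.
            if not image_exists:
                steps.append("build_or_pull")
                image_exists = True
            steps.append("save_tar")
        # Every node waits here until the tar is available.
        steps.append("barrier")
    if not image_exists:
        # Workers (or any node without the image) load from the shared tar.
        steps.append("load_tar")
    return steps


def docker_exists_argv(image: str) -> list[str]:
    # Exits non-zero when the image is not present locally.
    return ["docker", "image", "inspect", image]


def docker_save_argv(image: str, tar_path: str) -> list[str]:
    return ["docker", "save", "-o", tar_path, image]


def docker_load_argv(tar_path: str) -> list[str]:
    return ["docker", "load", "-i", tar_path]
```

Returning argv lists rather than command strings lets a caller pass them to `subprocess.run` without a shell.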

Copilot

Pull request overview

Removes environment-specific manifest artifacts from the public branch and introduces broader runtime/test/tooling updates related to multi-node/local-image workflows and TheRock ROCm environment detection.

Changes:

Deleted environment-specific manifests under manifests/ and updated ignore rules.
Added multi-node local-image tar caching logic and expanded container execution test coverage.
Updated ROCm environment collection tooling (rocEnvTool) for TheRock + traditional ROCm compatibility and improved Slurm/Kubernetes result metadata handling.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 5 comments.

File	Description
`tests/test_cleanup.py`	Adds cleanup tests (currently targets a non-existent module in this repo).
`tests/integration/test_container_execution.py`	Adds tests for `_ensure_local_image_available` behavior (tar save/load/worker wait).
`tests/fixtures/dummy/scripts/therock/run.sh`	Adds a dummy TheRock script fixture emitting a perf line.
`tests/fixtures/dummy/docker/therock.ubuntu.amd.Dockerfile`	Adds a TheRock-based Dockerfile fixture.
`src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py`	Adds TheRock/traditional ROCm detection + dynamic path resolution and more robust command execution.
`src/madengine/scripts/common/pre_scripts/rocEnvTool/README.md`	Major documentation expansion describing TheRock compatibility and usage.
`src/madengine/execution/container_runner.py`	Adds local-image tar caching helpers and adjusts multi-node perf CSV validation behavior.
`src/madengine/deployment/templates/slurm/job.sh.j2`	Stages per-node artifacts and merges perf CSVs across nodes to reduce shared-workspace races.
`src/madengine/deployment/kubernetes.py`	Records normalized `launcher` value in success/failure records for Kubernetes runs.
`manifests/run_manifest_pyt_vllm_dissag_llama-3.1-8b_3node_rdma_localimage.json`	Removed environment-specific manifest.
`manifests/run_manifest_primus_2node_qwen_localimage.json`	Removed environment-specific manifest.
`manifests/mad.env`	Removed environment-specific env file.
`.gitignore`	Adds `*.json` ignore and minor formatting change.

Copilot · 2026-04-22T13:20:12Z

+            cmd5 = f"echo '==== TheRock Manifest: {manifest_file} ===='"
+            cmd6 = f"cat {manifest_file}"
+            cmds.extend([cmd5, cmd6])


This command string embeds manifest_file directly into a shell command (shell=True in this tool's Console). If the ROCm root path contains spaces or shell metacharacters, this can break the command or allow command injection. Quote/escape paths (e.g., via shlex.quote) or avoid shell interpolation for file paths.
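A minimal sketch of the quoting fix suggested here, using the standard library's `shlex.quote`. The helper name `build_manifest_cmds` is hypothetical and only mirrors the two commands in the hunk above; it is not the code in `rocenv_tool.py`:

```python
import shlex


def build_manifest_cmds(manifest_file: str) -> list[str]:
    """Build shell-safe echo/cat commands for a manifest path."""
    # Quote the whole banner so metacharacters in the path stay literal.
    banner = shlex.quote(f"==== TheRock Manifest: {manifest_file} ====")
    # '--' stops option parsing, so a path starting with '-' is still a file.
    return [f"echo {banner}", f"cat -- {shlex.quote(manifest_file)}"]
```

With a path such as `/opt/my rocm/manifest; rm -rf x.json`, the shell sees one quoted argument instead of a second command.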

Copilot · 2026-04-22T13:20:13Z

-        nodesbw_cmd = cmd + " --shownodesbw"
+    elif smi_config == "rocm_smi_nodesbw":
+        nodesbw_cmd = f"{rocm_smi_cmd} --shownodesbw || echo 'shownodesbw not available'"
        cmd_info = CommandInfo("ROCm Show Nodebsion", [nodesbw_cmd])


Typo in section title: "ROCm Show Nodebsion" is misspelled, which will show up in generated reports/CSV output. Rename to something correct/consistent (e.g., "ROCm Show Node Bandwidth" or similar).

Suggested change

-        cmd_info = CommandInfo("ROCm Show Nodebsion", [nodesbw_cmd])
+        cmd_info = CommandInfo("ROCm Show Node Bandwidth", [nodesbw_cmd])

Copilot · 2026-04-22T13:20:13Z

+        # When shared cache is configured and no tar exists yet, only node 0
+        # may produce the tar artifact. Other nodes wait and then load it.
+        if tar_missing_at_start:
+            if is_primary_node:
+                if not image_exists:
+                    self._build_or_pull_local_image(
+                        run_image=run_image,
+                        build_info=build_info,
+                        model_info=model_info,
+                    )
+                    image_exists = True
+                if not tar_exists:
+                    self._save_local_image_to_tar(run_image, tar_path)
+                    tar_exists = True
+
+            self._sync_after_local_image_ready(run_image=run_image)
+
+            if not image_exists:


_ensure_local_image_available() calls _sync_after_local_image_ready() when the shared tar is missing, but run_models_from_manifest() also unconditionally calls _sync_after_local_image_ready() immediately after. This causes a double barrier in that case (extra coordination latency and potential for confusing hangs). Consider having only one layer own the barrier (either remove the internal sync, or make the caller conditional based on whether _ensure_local_image_available already synchronized).
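One way to let a single layer own the barrier is for the inner method to report whether it already synchronized. The sketch below is a simplified stand-in: the method names come from the comment above, but the class, the signatures, and the boolean return value are assumptions made for illustration.

```python
class RunnerSketch:
    """Toy stand-in for the runner; counts barrier calls instead of syncing."""

    def __init__(self) -> None:
        self.barrier_calls = 0

    def _sync_after_local_image_ready(self, run_image: str) -> None:
        # Real code would block on a cross-node barrier here.
        self.barrier_calls += 1

    def _ensure_local_image_available(self, run_image: str, tar_missing_at_start: bool) -> bool:
        """Return True if this call already ran the barrier (assumed contract)."""
        if tar_missing_at_start:
            # Build/save/load steps elided.
            self._sync_after_local_image_ready(run_image=run_image)
            return True
        return False

    def run_models_from_manifest(self, run_image: str, tar_missing_at_start: bool) -> None:
        already_synced = self._ensure_local_image_available(run_image, tar_missing_at_start)
        # Only synchronize here when the inner call did not.
        if not already_synced:
            self._sync_after_local_image_ready(run_image=run_image)
```

Either path then executes exactly one barrier per run, which avoids the extra coordination round described in the comment.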

Copilot · 2026-04-22T13:20:14Z

+import unittest
+from unittest.mock import Mock, patch, call, MagicMock
+import time
+from madengine.tools.run_models import RunModels
+
+


madengine.tools.run_models (and RunModels) doesn't exist in this repo (there is only src/madengine/tools/__init__.py and no run_models.py). As written, this test module will fail to import and break the test suite. Update the import/patch targets to the current runner/orchestrator implementation, or remove this file if the functionality no longer exists.

Suggested change

-import unittest
-from unittest.mock import Mock, patch, call, MagicMock
-import time
-from madengine.tools.run_models import RunModels
+import sys
+import unittest
+from unittest.mock import Mock, patch, call, MagicMock
+import time
+import madengine.tools as madengine_tools
+from madengine.tools import RunModels
+sys.modules.setdefault("madengine.tools.run_models", madengine_tools)
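As an alternative that the review does not propose, a test module can guard against a missing import target so the rest of the suite still collects. The sketch below uses `importlib.util.find_spec` and `unittest.skipUnless` from the standard library; `module_available` and `TestCleanupSketch` are hypothetical names. This is a stopgap at best: the review's own options (retargeting the tests or removing the file) address the root cause.

```python
import importlib.util
import unittest


def module_available(name: str) -> bool:
    """Check whether a module can be located without importing the module itself."""
    try:
        # Parent packages are imported during lookup; the target module is not.
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `madengine`) is itself missing.
        return False


HAS_RUN_MODELS = module_available("madengine.tools.run_models")


@unittest.skipUnless(HAS_RUN_MODELS, "madengine.tools.run_models is not present")
class TestCleanupSketch(unittest.TestCase):
    def test_import(self):
        # Import lazily so test collection never fails on the missing module.
        from madengine.tools.run_models import RunModels  # noqa: F401
```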

Copilot · 2026-04-22T13:20:14Z

+                                    csv_inventory = (
+                                        model_docker.sh(
+                                            f"sh -c 'ls -lah {model_dir}/*.csv 2>/dev/null; "
+                                            f"ls -lah {model_dir}/workdir/*.csv 2>/dev/null; "
+                                            f"ls -lah {model_dir}/benchmark_*_CONCURRENCY.log 2>/dev/null'"
+                                        )
+                                        or ""
+                                    )


csv_inventory is assigned but never used. If it's intended for debugging, consider logging it when validation fails; otherwise remove it to avoid dead code and keep this section easier to maintain.

Suggested change

-                                    csv_inventory = (
-                                        model_docker.sh(
-                                            f"sh -c 'ls -lah {model_dir}/*.csv 2>/dev/null; "
-                                            f"ls -lah {model_dir}/workdir/*.csv 2>/dev/null; "
-                                            f"ls -lah {model_dir}/benchmark_*_CONCURRENCY.log 2>/dev/null'"
-                                        )
-                                        or ""
-                                    )
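If the listing is kept for debugging rather than removed, the other option in the comment is to log it only when validation fails. A minimal sketch, assuming a boolean validation result is available at that point; `report_perf_validation` and the logger name are hypothetical:

```python
import logging

logger = logging.getLogger("madengine.example")


def report_perf_validation(has_valid_perf: bool, csv_inventory: str) -> bool:
    """Log the directory listing only when perf validation fails."""
    if not has_valid_perf:
        logger.warning(
            "No valid perf CSV found; files visible in the model directory:\n%s",
            csv_inventory or "(none listed)",
        )
    return has_valid_perf
```

This keeps successful runs quiet while giving a failed run the file inventory needed to diagnose a missing or misplaced CSV.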

coketaste and others added 8 commits December 2, 2025 11:49

Fix the cleanup (ROCm#60)

9ba9f98

Perf entry superset (ROCm#58)

6d7a660

Revert "Perf entry superset (ROCm#58)" (ROCm#66)

7ff689e

Fail Check condition update for RPM distro (ROCm#64)

9bf6ae6

Fixed launcher type issue on k8s

a197d7c

remove environment-specific manifests from public branch; .gitignore …

5d9ba6f

…file has been restored

mkuznet1 requested review from coketaste, Copilot and gargrahul March 16, 2026 21:47

mkuznet1 self-assigned this Mar 16, 2026

Copilot started reviewing on behalf of mkuznet1 March 16, 2026 21:49

Copilot AI reviewed Mar 16, 2026

mkuznet1 added 5 commits March 18, 2026 17:32

Merge branch 'coketaste/refactor-dis' into aicomnet_dev

fa72fa4

Copilot AI review requested due to automatic review settings April 22, 2026 13:11

Copilot started reviewing on behalf of mkuznet1 April 22, 2026 13:13

Copilot AI reviewed Apr 22, 2026

remove environment-specific manifests from public branch; #87
mkuznet1 wants to merge 13 commits into ROCm:aicomnet_dev from mkuznet1:aicomnet_dev

5 participants
