Skip to content

feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110

Open
coketaste wants to merge 8 commits intodevelopfrom
coketaste/v2-rocm-path
Open

feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
coketaste wants to merge 8 commits intodevelopfrom
coketaste/v2-rocm-path

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Summary

Adds default ROCm install root resolution for the run phase: scan common layouts (including versioned /opt/rocm-* and TheRock markers), with clear overrides for host vs container. Aligns in-container ROCM_PATH with docker_env_vars when the image layout differs from the host.

Motivation

Users on TheRock- or CI-style images may not have a plain /opt/rocm tree; madengine should still find a valid ROCm root for validation and tool paths, while allowing explicit MAD_ROCM_PATH / --rocm-path when auto-detect is wrong or undesired.

What changed

  • New helpers: madengine.utils.rocm_path_resolver and madengine.utils.therock_markers (shared TheRock share/therock file markers).
  • Host precedence: top-level MAD_ROCM_PATH in additional context → --rocm-path → auto-detect (unless MAD_AUTO_ROCM_PATH=0) → ROCM_PATH/opt/rocm.
  • Container: docker_env_vars.MAD_ROCM_PATH sets in-container ROCM_PATH (key consumed); otherwise host-resolved path is mirrored unless the user set ROCM_PATH in docker_env_vars.
  • Wired through Context, run orchestrator, and CLI/validators; README, configuration, usage, and CHANGELOG updated.
  • therock_detector script: prepends the package parent to sys.path so imports work when run as a file; last-resort path fallback only if the script is outside the tree.
  • Unit tests (test_rocm_path, test_therock_markers) and integration GPU test adjustments.

How to test

  • pytest tests/unit/test_rocm_path.py tests/unit/test_therock_markers.py (or full tests/unit/).
  • Optional: madengine run --help and confirm --rocm-path / env behavior as documented.
  • Optional: on a TheRock or versioned-ROCm image, run a minimal madengine run without MAD_ROCM_PATH and confirm a sensible ROCM_PATH in the generated context.

Checklist (optional; trim for your team)

  • Docs / help text match intended precedence.
  • No behavior change for MAD_AUTO_ROCM_PATH=0 beyond documented legacy path.
  • CI / reviewers: link internal ticket (e.g. ROCM-21753) if required.

Add rocm_path_resolver and shared therock_markers; wire Context, run CLI,
validators, and orchestrator. Document in README, configuration, and usage;
update CHANGELOG. Extend unit and integration tests. Adjust therock_detector
(sys.path helper) without ADR docs in this commit.

Made-with: Cursor
@coketaste coketaste self-assigned this Apr 22, 2026
Copilot AI review requested due to automatic review settings April 22, 2026 23:36
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds robust ROCm root resolution for madengine run, including auto-detection for versioned /opt/rocm-* and TheRock layouts, plus explicit host/container override semantics via MAD_ROCM_PATH.

Changes:

  • Introduces ROCm root detection/normalization helpers (rocm_path_resolver) and shared TheRock file-marker helpers (therock_markers).
  • Wires host/container ROCm-path precedence through Context, run_orchestrator, CLI validation, and help text.
  • Updates docs/tests (unit + integration) to reflect the new resolution behavior and TheRock markers.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker path helpers and detection.
tests/unit/test_rocm_path.py Adds/extends unit tests for host/container ROCm path precedence and auto-detect behavior.
tests/integration/test_gpu_management.py Adjusts integration tests to patch the correct vendor-detection import path and stabilizes a Context-based test with mocks.
src/madengine/utils/therock_markers.py Adds canonical TheRock marker relative paths and helper functions.
src/madengine/utils/rocm_path_resolver.py Implements host/container ROCm root resolution, auto-detect heuristics, and legacy compatibility functions.
src/madengine/scripts/common/tools/therock_detector.py Reuses shared TheRock marker helpers and improves importability when run as a standalone script.
src/madengine/orchestration/run_orchestrator.py Defers ROCm path resolution to Context instead of pre-resolving in the orchestrator.
src/madengine/core/context.py Uses new resolver functions to compute host ROCm path and set container ROCM_PATH consistently.
src/madengine/core/constants.py Keeps get_rocm_path() as a legacy (non-auto-detect) wrapper delegating to the new resolver’s legacy function.
src/madengine/cli/validators.py Adds schema validation for host/container MAD_ROCM_PATH values in additional context.
src/madengine/cli/commands/run.py Updates --rocm-path help text to reflect new precedence/auto-detect behavior.
docs/usage.md Documents host vs container ROCm root overrides and auto-detect disabling.
docs/configuration.md Documents new precedence rules and override mechanisms for ROCm root resolution.
README.md Updates top-level docs to reflect new ROCm-path behavior and override knobs.
CHANGELOG.md Notes the new ROCm auto-detect + override behavior and references a design doc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if not isinstance(v, (str, int, float, type(None))):
_fail_structure(
"docker_env_vars['MAD_ROCM_PATH']",
"a string (container ROCm root override)",
compatibility. The resolution pipeline is **not** the historical ``rocEnvTool``/v1
``RocmPathResolver`` from madenginev1; it reuses the same *ideas* in a self-contained
helper for madengine (versioned ``/opt/rocm-*``, TheRock markers, ``PATH`` tools).
See **ADR 0001** in ``docs/adr/0001-rocm-path-resolution.md`` for the full design.
Comment on lines +261 to +266
if MAD_ROCM_PATH in d and d[MAD_ROCM_PATH] is not None:
s = d.pop(MAD_ROCM_PATH)
if isinstance(s, (int, float)):
s = str(s)
if not isinstance(s, str) or not str(s).strip():
s = host_path
Comment on lines +119 to +142
def infer_from_path_tools(self) -> OptionalPathStr:
"""Use ``which`` on rocminfo, amd-smi, rocm-smi; return first plausible root."""
from madengine.utils import rocm_path_resolver as m # same module; for patch hooks

for name in ("rocminfo", "amd-smi", "rocm-smi"):
w = self._which(name) # type: ignore[operator]
if not w:
continue
inferred = self.rocm_root_from_bin_tool(w)
if inferred is not None and m._looks_like_rocm_root(inferred):
return normalize_rocm_path(str(inferred))
return None

def auto_detect(self) -> OptionalPathStr:
"""Heuristic search for a usable ROCm installation (see class doc + module doc)."""
from madengine.utils import rocm_path_resolver as m # same module; for patch hooks

opt = Path("/opt/rocm")
if m._looks_like_rocm_root(opt):
return normalize_rocm_path(str(opt))

for cand in m._versioned_opt_rocm_dirs():
if m._looks_like_rocm_root(cand):
return normalize_rocm_path(str(cand))
Comment thread docs/configuration.md
### ROCm path (run only)

When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the **run** command option or environment variable (not JSON context):
Design rationale and precedence are recorded in [ADR 0001: ROCm path resolution](adr/0001-rocm-path-resolution.md).
Comment thread CHANGELOG.md Outdated

### Changed

- **ROCm path**: Auto-detect host ROCm root by default (traditional `/opt/rocm`, versioned `/opt/rocm-*`, TheRock `rocm-sdk` + markers, `rocminfo` / `amd-smi` / `rocm-smi` on `PATH`). **Host** overrides: top-level `MAD_ROCM_PATH` in `--additional-context`, then `--rocm-path` (alias). **Container** override: `docker_env_vars.MAD_ROCM_PATH` (sets in-container `ROCM_PATH`); if omitted, host path is mirrored. Set `MAD_AUTO_ROCM_PATH=0` to use legacy `ROCM_PATH` / `/opt/rocm` only without scanning. Code: `madengine.utils.rocm_path_resolver`, shared TheRock file markers in `madengine.utils.therock_markers`. **Design doc**: [ADR 0001](docs/adr/0001-rocm-path-resolution.md).
coketaste and others added 2 commits April 22, 2026 19:56
… checks

- csv_parser.py: resolve rocm-smi via shutil.which() so TheRock images
  (where tools live in a Python venv, not /opt/rocm/bin/) are detected
  correctly; accept path_resolver to get ROCm version without hardcoding
  /opt/rocm/.info/version; add bounds check in NVIDIA GPU info parser
- rocenv_tool.py: pass path_resolver to CSVParser so version resolution
  uses RocmPathResolver.get_version() for both TheRock and traditional installs
- container_runner.py: replace host-resolved hardcoded amd-smi/rocm-smi
  paths in container exec with PATH-based resolution; add run phase
  environment table showing host vs container installation type, ROCm/CUDA
  root, and version side-by-side (supports apt install and therock layouts)
- docs: document run phase environment table in usage.md and configuration.md

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 23, 2026 02:37
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a centralized ROCm install-root resolution flow for the run phase (host + container), including auto-detection of versioned /opt/rocm-* and TheRock marker layouts, and wires it through context creation, container execution, CLI validation, and documentation.

Changes:

  • Introduces madengine.utils.rocm_path_resolver + madengine.utils.therock_markers and updates Context to resolve host ROCm root and propagate container ROCM_PATH with clear override precedence.
  • Updates runtime execution (container runner + rocEnvTool CSV output) to better handle TheRock-style layouts where tools are discovered via PATH.
  • Adds unit tests for TheRock markers and ROCm path resolution, adjusts integration tests, and updates CLI help + docs/README/CHANGELOG to document the new behavior.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/test_therock_markers.py Adds unit coverage for TheRock marker-path helpers.
tests/unit/test_rocm_path.py Adds unit coverage for new host/container ROCm path resolution behavior and TheRock layouts.
tests/integration/test_gpu_management.py Updates integration mocking/patch points to match current vendor detection paths and stabilizes Context initialization in tests.
src/madengine/utils/therock_markers.py Defines shared TheRock on-disk marker locations and detection helper.
src/madengine/utils/rocm_path_resolver.py Implements host/container ROCm root resolution, auto-detect, and legacy compatibility functions.
src/madengine/scripts/common/tools/therock_detector.py Reuses shared marker helpers; improves importability when run as a standalone script.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Passes path resolver into CSVParser so version resolution follows detected ROCm root.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Uses PATH discovery for rocm-smi and uses resolver-based version when available.
src/madengine/orchestration/run_orchestrator.py Stops pre-resolving ROCm path in orchestrator; delegates to Context/resolver logic.
src/madengine/execution/container_runner.py Uses in-container PATH for SMI tools and prints a host vs container environment summary table.
src/madengine/core/context.py Centralizes ROCm path resolution and container ROCM_PATH propagation during runtime context init.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy behavior and delegates to get_rocm_path_legacy().
src/madengine/cli/validators.py Validates new MAD_ROCM_PATH fields in additional context (top-level + docker_env_vars).
src/madengine/cli/commands/run.py Updates --rocm-path help to match new precedence/auto-detect behavior.
docs/usage.md Documents host/container override behavior and the new run-phase environment table output.
docs/configuration.md Documents precedence, overrides, and references ADR for ROCm path resolution.
README.md Updates high-level ROCm path messaging to match new auto-detect + MAD_ROCM_PATH override model.
CHANGELOG.md Records new ROCm path resolution behavior and references design documentation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

ctr_rocm_ver = _sh(
"rocm-sdk version 2>/dev/null "
"|| cat \"${ROCM_PATH:-/opt/rocm}/.info/version\" 2>/dev/null "
"|| rocminfo 2>/dev/null | grep -i 'ROCm Version' | head -n1 | grep -oP '[\\d.]+' "
Comment on lines +132 to +150
"nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 | "
"xargs -I{} dirname $(which nvcc 2>/dev/null) 2>/dev/null | xargs dirname 2>/dev/null "
"|| echo \"${CUDA_PATH:-${CUDA_HOME:-/usr/local/cuda}}\""
)
host_cuda_ver = _host_sh(
"nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 "
"|| nvidia-smi 2>/dev/null | grep -oP 'CUDA Version: \\K[\\d.]+' | head -1 "
"|| echo unknown"
)

# ── Container side ─────────────────────────────────────────
ctr_cuda_root = _sh(
"dirname $(which nvcc 2>/dev/null) 2>/dev/null | xargs dirname 2>/dev/null "
"|| echo \"${CUDA_PATH:-${CUDA_HOME:-/usr/local/cuda}}\""
)
ctr_cuda_ver = _sh(
"nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 "
"|| nvidia-smi 2>/dev/null | grep -oP 'CUDA Version: \\K[\\d.]+' | head -1 "
"|| echo unknown"
Comment thread docs/configuration.md
### ROCm path (run only)

When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the **run** command option or environment variable (not JSON context):
Design rationale and precedence are recorded in [ADR 0001: ROCm path resolution](adr/0001-rocm-path-resolution.md).
Comment thread CHANGELOG.md Outdated

### Changed

- **ROCm path**: Auto-detect host ROCm root by default (traditional `/opt/rocm`, versioned `/opt/rocm-*`, TheRock `rocm-sdk` + markers, `rocminfo` / `amd-smi` / `rocm-smi` on `PATH`). **Host** overrides: top-level `MAD_ROCM_PATH` in `--additional-context`, then `--rocm-path` (alias). **Container** override: `docker_env_vars.MAD_ROCM_PATH` (sets in-container `ROCM_PATH`); if omitted, host path is mirrored. Set `MAD_AUTO_ROCM_PATH=0` to use legacy `ROCM_PATH` / `/opt/rocm` only without scanning. Code: `madengine.utils.rocm_path_resolver`, shared TheRock file markers in `madengine.utils.therock_markers`. **Design doc**: [ADR 0001](docs/adr/0001-rocm-path-resolution.md).
compatibility. The resolution pipeline is **not** the historical ``rocEnvTool``/v1
``RocmPathResolver`` from madenginev1; it reuses the same *ideas* in a self-contained
helper for madengine (versioned ``/opt/rocm-*``, TheRock markers, ``PATH`` tools).
See **ADR 0001** in ``docs/adr/0001-rocm-path-resolution.md`` for the full design.
coketaste and others added 3 commits April 22, 2026 21:48
…able

The previous shell command embedded \$(rocm-sdk path --root) inside
[ -f "..." ], which broke quoting when passed via docker exec bash -c "..."
causing the test to always fall through to 'unknown'.

Replace with a quoting-safe check: if rocm-sdk command exists and returns
a root path, report 'therock'; otherwise check /opt/rocm/.info/version
for traditional apt install layout.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Apply docker_env_vars overrides in context; finalize in run_container (AMD)
  from OCI image env, then docker run probe, then /opt/rocm with warning
- Update docs, README, CLI reference, CHANGELOG, and unit tests
- ADR 0001 left out of this commit (untracked)

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 24, 2026 02:11
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds robust ROCm root resolution for the run phase, including host-side auto-detection (TheRock + versioned /opt/rocm-*) and explicit host/container override behavior, and updates runtime logging + docs/tests accordingly.

Changes:

  • Introduces madengine.utils.rocm_path_resolver + madengine.utils.therock_markers to centralize ROCm/TheRock detection and documented precedence.
  • Updates runtime wiring (Context + ContainerRunner) so host ROCm root is resolved early, while container ROCM_PATH is finalized at Docker run time (OCI env → probe → fallback).
  • Adds/updates unit + integration tests and updates documentation/README/CHANGELOG to reflect the new precedence and behaviors.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/madengine/utils/rocm_path_resolver.py New host/container ROCm path resolution logic (auto-detect, overrides, OCI/probe finalization).
src/madengine/utils/therock_markers.py New shared TheRock marker path helpers.
src/madengine/core/context.py Uses new host resolver; stops mirroring host ROCm root into container env at init; applies container override mapping only.
src/madengine/execution/container_runner.py Finalizes container ROCM_PATH at run time; prints host/container environment summary; uses PATH-based SMI tool invocation in container.
src/madengine/orchestration/run_orchestrator.py Passes --rocm-path through to Context without legacy resolution.
src/madengine/core/constants.py get_rocm_path() becomes legacy wrapper delegating to resolver’s legacy behavior.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH in top-level context and docker_env_vars.
src/madengine/cli/commands/run.py Updates --rocm-path help text to describe new host behavior + auto-detect toggle.
src/madengine/scripts/common/tools/therock_detector.py Imports TheRock marker helpers reliably when run as a file.
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py Threads path_resolver into CSV parsing.
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py Uses PATH-based rocm-smi detection; uses resolver for ROCm version; adds parsing guard.
tests/unit/test_rocm_path.py Adds coverage for new resolver APIs and revised Context/container behavior.
tests/unit/test_therock_markers.py Adds unit tests for TheRock marker helpers.
tests/integration/test_gpu_management.py Fixes patch target for vendor auto-detection; stabilizes non-AMD context creation in integration tests.
docs/usage.md Documents host vs container ROCm path behavior + new run-phase environment table.
docs/configuration.md Documents new precedence and container finalization behavior (OCI/probe).
docs/installation.md Clarifies host vs container ROCm path configuration.
docs/cli-reference.md Updates CLI reference for --rocm-path and env vars.
README.md Updates high-level ROCm path behavior summary.
CHANGELOG.md Notes behavior change and new resolver modules.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

_CONTAINER_PROBE_SH = r"""try_root() {
d="$1"
[ -d "$d" ] || return 1
{ [ -f "$d/bin/rocminfo" ] || [ -f "$d/bin/amd-smi" ] || [ -f "$d/share/therock/therock_manifest.json" ] || [ -f "$d/share/therock/dist_info.json" ]; } && return 0
Comment on lines +394 to +403
if MAD_ROCM_PATH in d and d[MAD_ROCM_PATH] is not None:
s = d.pop(MAD_ROCM_PATH)
if isinstance(s, (int, float)):
s = str(s)
if not isinstance(s, str) or not str(s).strip():
s = host_path
croot = normalize_rocm_path(str(s).strip())
elif d.get("ROCM_PATH") not in (None, ""):
croot = normalize_rocm_path(str(d["ROCM_PATH"]).strip())
else:
Comment on lines +1060 to +1064

finalize_container_rocm_path(
self.context.ctx["docker_env_vars"],
docker_image,
self.context._rocm_path,
Comment on lines +126 to +140
def _host_sh(cmd: str) -> str:
try:
return subprocess.check_output(cmd, shell=True, stderr=subprocess.DEVNULL, text=True).strip()
except Exception:
return "unknown"

host_cuda_root = _host_sh(
"nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 | "
"xargs -I{} dirname $(which nvcc 2>/dev/null) 2>/dev/null | xargs dirname 2>/dev/null "
"|| echo \"${CUDA_PATH:-${CUDA_HOME:-/usr/local/cuda}}\""
)
host_cuda_ver = _host_sh(
"nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 "
"|| nvidia-smi 2>/dev/null | grep -oP 'CUDA Version: \\K[\\d.]+' | head -1 "
"|| echo unknown"
Comment on lines +36 to 40
# Use PATH resolution first so TheRock images (where rocm-smi lives
# under a Python venv, not /opt/rocm/bin/) are handled correctly.
rocm_smi_cmd = shutil.which("rocm-smi") or "/opt/rocm/bin/rocm-smi"
rocm_smi_out = console.sh(f"{rocm_smi_cmd} || true")
nv_smi_out = console.sh("nvidia-smi -L || true")
Comment thread docs/cli-reference.md Outdated
| `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) |
| `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) |
| `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. Rock, pip). |
| `--rocm-path` | | TEXT | `None` | **Host** ROCm installation root only (default: `ROCM_PATH` env or `/opt/rocm`). Use when the **runner’s** ROCm is not in `/opt/rocm` (e.g. TheRock, pip). Does not set the in-container `ROCM_PATH` for Docker; that is resolved at `run` from the image, probe, or `docker_env_vars` — see [Configuration — ROCm path](configuration.md#rocm-path-run-only). |
coketaste and others added 2 commits April 24, 2026 03:08
MI350X (gfx950) was missing from the skip_gpu_arch fixture, causing
test_commandline_argument_skip_gpu_arch to fail on this machine.

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…TH in --additional-context

The --rocm-path flag was an alias for top-level MAD_ROCM_PATH but its
help text implied it could set both host and container paths, risking
confusion when environments differ. Users should set host and container
ROCm roots independently via --additional-context:

  Host:      {"MAD_ROCM_PATH": "/path/to/host/rocm"}
  Container: {"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}

Also fixes datetime.utcnow() deprecation warnings in mongodb.py
(replaced with datetime.now(timezone.utc)).

Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 19:44
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new ROCm root resolution flow that can auto-detect host ROCm installs (including versioned /opt/rocm-* and TheRock marker layouts) and cleanly separates host ROCm resolution from in-container ROCM_PATH selection at Docker run time.

Changes:

  • Added madengine.utils.rocm_path_resolver (host auto-detect + container finalization) and madengine.utils.therock_markers (shared TheRock marker paths).
  • Updated runtime wiring so Context resolves host ROCm root via the resolver, while ContainerRunner finalizes container ROCM_PATH at run time (OCI env → probe → default).
  • Removed madengine run --rocm-path usage in favor of MAD_ROCM_PATH (top-level and docker_env_vars), and updated docs/tests accordingly.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
src/madengine/utils/rocm_path_resolver.py Implements host ROCm auto-detect and container ROCM_PATH override/finalization logic.
src/madengine/utils/therock_markers.py Defines canonical TheRock marker relative paths + helper predicates.
src/madengine/core/context.py Switches host ROCm path resolution to the new resolver and defers container ROCM_PATH finalization.
src/madengine/execution/container_runner.py Finalizes in-container ROCM_PATH during Docker runs and adds a host/container environment summary table.
src/madengine/core/constants.py Reframes get_rocm_path() as legacy resolution and delegates to resolver legacy helper.
src/madengine/cli/commands/run.py Removes --rocm-path CLI option wiring.
src/madengine/cli/validators.py Validates MAD_ROCM_PATH (host and docker_env_vars) shapes in additional context.
src/madengine/scripts/common/tools/therock_detector.py Makes the tool importable when executed as a file; reuses shared marker helpers.
src/madengine/scripts/common/pre_scripts/rocEnvTool/{rocenv_tool.py,csv_parser.py} Passes resolver into CSV parsing and improves tool/path selection for TheRock-style layouts.
tests/unit/test_rocm_path.py Adds/updates unit coverage for resolver behaviors and container override rules.
tests/unit/test_therock_markers.py Adds unit coverage for TheRock marker helpers.
tests/integration/test_gpu_management.py Updates patch points and stabilizes integration behavior with Context initialization changes.
docs/{usage.md,installation.md,configuration.md,cli-reference.md} Updates documented precedence and configuration guidance for host vs container ROCm roots.
README.md Updates top-level ROCm path guidance and examples.
CHANGELOG.md Adds an Unreleased entry describing new ROCm resolution behavior.
src/madengine/database/mongodb.py Switches metadata timestamps to timezone-aware UTC.
tests/fixtures/dummy/models.json Adds gfx950 to skip list for a dummy model.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread README.md
Comment on lines +78 to +81
madengine run --tags dummy --additional-context ‘{"MAD_ROCM_PATH": "/path/to/rocm"}’
# or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy
# Override in-container ROCm root independently:
madengine run --tags dummy --additional-context ‘{"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}’
Comment on lines +1061 to +1066
finalize_container_rocm_path(
self.context.ctx["docker_env_vars"],
docker_image,
self.context._rocm_path,
)

Comment on lines +389 to +401
d = docker_env
if MAD_ROCM_PATH in d and d[MAD_ROCM_PATH] is not None:
s = d.pop(MAD_ROCM_PATH)
if isinstance(s, (int, float)):
s = str(s)
if not isinstance(s, str) or not str(s).strip():
s = host_path
croot = normalize_rocm_path(str(s).strip())
elif d.get("ROCM_PATH") not in (None, ""):
croot = normalize_rocm_path(str(d["ROCM_PATH"]).strip())
else:
return None
d["ROCM_PATH"] = croot
Comment on lines 122 to 127
num_gpu += 1
values = line.split(":")
if len(values) < 3:
continue
name = values[1].split("(UUID")[0]
uuid = values[2]
Comment thread CHANGELOG.md

### Changed

- **ROCm path**: Auto-detect host ROCm root by default (traditional `/opt/rocm`, versioned `/opt/rocm-*`, TheRock `rocm-sdk` + markers, `rocminfo` / `amd-smi` / `rocm-smi` on `PATH`). **Host** overrides: top-level `MAD_ROCM_PATH` in `--additional-context`, then `--rocm-path` (alias). **Container** `ROCM_PATH` for Docker runs (AMD): `docker_env_vars.MAD_ROCM_PATH` if set; else `ROCM_PATH` / `ROCM_HOME` from the image OCI config (`docker image inspect`); else an in-image shell probe (`docker run --rm`); else default `/opt/rocm` with a warning. The host-resolved path is no longer mirrored into the container by default. Set `MAD_AUTO_ROCM_PATH=0` to use legacy host `ROCM_PATH` / `/opt/rocm` only without scanning. Code: `madengine.utils.rocm_path_resolver`, shared TheRock file markers in `madengine.utils.therock_markers`. **Design doc**: [ADR 0001](docs/adr/0001-rocm-path-resolution.md).
_CONTAINER_PROBE_SH = r"""try_root() {
d="$1"
[ -d "$d" ] || return 1
{ [ -f "$d/bin/rocminfo" ] || [ -f "$d/bin/amd-smi" ] || [ -f "$d/share/therock/therock_manifest.json" ] || [ -f "$d/share/therock/dist_info.json" ]; } && return 0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants