feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
feat: ROCm path resolution (auto-detect, MAD_ROCM_PATH, TheRock markers)#110
Conversation
Add rocm_path_resolver and shared therock_markers; wire Context, run CLI, validators, and orchestrator. Document in README, configuration, and usage; update CHANGELOG. Extend unit and integration tests. Adjust therock_detector (sys.path helper) without ADR docs in this commit. Made-with: Cursor
There was a problem hiding this comment.
Pull request overview
Adds robust ROCm root resolution for madengine run, including auto-detection for versioned /opt/rocm-* and TheRock layouts, plus explicit host/container override semantics via MAD_ROCM_PATH.
Changes:
- Introduces ROCm root detection/normalization helpers (
rocm_path_resolver) and shared TheRock file-marker helpers (therock_markers). - Wires host/container ROCm-path precedence through
Context,run_orchestrator, CLI validation, and help text. - Updates docs/tests (unit + integration) to reflect the new resolution behavior and TheRock markers.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit tests for TheRock marker path helpers and detection. |
| tests/unit/test_rocm_path.py | Adds/extends unit tests for host/container ROCm path precedence and auto-detect behavior. |
| tests/integration/test_gpu_management.py | Adjusts integration tests to patch the correct vendor-detection import path and stabilizes a Context-based test with mocks. |
| src/madengine/utils/therock_markers.py | Adds canonical TheRock marker relative paths and helper functions. |
| src/madengine/utils/rocm_path_resolver.py | Implements host/container ROCm root resolution, auto-detect heuristics, and legacy compatibility functions. |
| src/madengine/scripts/common/tools/therock_detector.py | Reuses shared TheRock marker helpers and improves importability when run as a standalone script. |
| src/madengine/orchestration/run_orchestrator.py | Defers ROCm path resolution to Context instead of pre-resolving in the orchestrator. |
| src/madengine/core/context.py | Uses new resolver functions to compute host ROCm path and set container ROCM_PATH consistently. |
| src/madengine/core/constants.py | Keeps get_rocm_path() as a legacy (non-auto-detect) wrapper delegating to the new resolver’s legacy function. |
| src/madengine/cli/validators.py | Adds schema validation for host/container MAD_ROCM_PATH values in additional context. |
| src/madengine/cli/commands/run.py | Updates --rocm-path help text to reflect new precedence/auto-detect behavior. |
| docs/usage.md | Documents host vs container ROCm root overrides and auto-detect disabling. |
| docs/configuration.md | Documents new precedence rules and override mechanisms for ROCm root resolution. |
| README.md | Updates top-level docs to reflect new ROCm-path behavior and override knobs. |
| CHANGELOG.md | Notes the new ROCm auto-detect + override behavior and references a design doc. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if not isinstance(v, (str, int, float, type(None))): | ||
| _fail_structure( | ||
| "docker_env_vars['MAD_ROCM_PATH']", | ||
| "a string (container ROCm root override)", |
| compatibility. The resolution pipeline is **not** the historical ``rocEnvTool``/v1 | ||
| ``RocmPathResolver`` from madenginev1; it reuses the same *ideas* in a self-contained | ||
| helper for madengine (versioned ``/opt/rocm-*``, TheRock markers, ``PATH`` tools). | ||
| See **ADR 0001** in ``docs/adr/0001-rocm-path-resolution.md`` for the full design. |
| if MAD_ROCM_PATH in d and d[MAD_ROCM_PATH] is not None: | ||
| s = d.pop(MAD_ROCM_PATH) | ||
| if isinstance(s, (int, float)): | ||
| s = str(s) | ||
| if not isinstance(s, str) or not str(s).strip(): | ||
| s = host_path |
| def infer_from_path_tools(self) -> OptionalPathStr: | ||
| """Use ``which`` on rocminfo, amd-smi, rocm-smi; return first plausible root.""" | ||
| from madengine.utils import rocm_path_resolver as m # same module; for patch hooks | ||
|
|
||
| for name in ("rocminfo", "amd-smi", "rocm-smi"): | ||
| w = self._which(name) # type: ignore[operator] | ||
| if not w: | ||
| continue | ||
| inferred = self.rocm_root_from_bin_tool(w) | ||
| if inferred is not None and m._looks_like_rocm_root(inferred): | ||
| return normalize_rocm_path(str(inferred)) | ||
| return None | ||
|
|
||
| def auto_detect(self) -> OptionalPathStr: | ||
| """Heuristic search for a usable ROCm installation (see class doc + module doc).""" | ||
| from madengine.utils import rocm_path_resolver as m # same module; for patch hooks | ||
|
|
||
| opt = Path("/opt/rocm") | ||
| if m._looks_like_rocm_root(opt): | ||
| return normalize_rocm_path(str(opt)) | ||
|
|
||
| for cand in m._versioned_opt_rocm_dirs(): | ||
| if m._looks_like_rocm_root(cand): | ||
| return normalize_rocm_path(str(cand)) |
| ### ROCm path (run only) | ||
|
|
||
| When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the **run** command option or environment variable (not JSON context): | ||
| Design rationale and precedence are recorded in [ADR 0001: ROCm path resolution](adr/0001-rocm-path-resolution.md). |
|
|
||
| ### Changed | ||
|
|
||
| - **ROCm path**: Auto-detect host ROCm root by default (traditional `/opt/rocm`, versioned `/opt/rocm-*`, TheRock `rocm-sdk` + markers, `rocminfo` / `amd-smi` / `rocm-smi` on `PATH`). **Host** overrides: top-level `MAD_ROCM_PATH` in `--additional-context`, then `--rocm-path` (alias). **Container** override: `docker_env_vars.MAD_ROCM_PATH` (sets in-container `ROCM_PATH`); if omitted, host path is mirrored. Set `MAD_AUTO_ROCM_PATH=0` to use legacy `ROCM_PATH` / `/opt/rocm` only without scanning. Code: `madengine.utils.rocm_path_resolver`, shared TheRock file markers in `madengine.utils.therock_markers`. **Design doc**: [ADR 0001](docs/adr/0001-rocm-path-resolution.md). |
… checks - csv_parser.py: resolve rocm-smi via shutil.which() so TheRock images (where tools live in a Python venv, not /opt/rocm/bin/) are detected correctly; accept path_resolver to get ROCm version without hardcoding /opt/rocm/.info/version; add bounds check in NVIDIA GPU info parser - rocenv_tool.py: pass path_resolver to CSVParser so version resolution uses RocmPathResolver.get_version() for both TheRock and traditional installs - container_runner.py: replace host-resolved hardcoded amd-smi/rocm-smi paths in container exec with PATH-based resolution; add run phase environment table showing host vs container installation type, ROCm/CUDA root, and version side-by-side (supports apt install and therock layouts) - docs: document run phase environment table in usage.md and configuration.md Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a centralized ROCm install-root resolution flow for the run phase (host + container), including auto-detection of versioned /opt/rocm-* and TheRock marker layouts, and wires it through context creation, container execution, CLI validation, and documentation.
Changes:
- Introduces
madengine.utils.rocm_path_resolver+madengine.utils.therock_markersand updatesContextto resolve host ROCm root and propagate containerROCM_PATHwith clear override precedence. - Updates runtime execution (container runner + rocEnvTool CSV output) to better handle TheRock-style layouts where tools are discovered via
PATH. - Adds unit tests for TheRock markers and ROCm path resolution, adjusts integration tests, and updates CLI help + docs/README/CHANGELOG to document the new behavior.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_therock_markers.py | Adds unit coverage for TheRock marker-path helpers. |
| tests/unit/test_rocm_path.py | Adds unit coverage for new host/container ROCm path resolution behavior and TheRock layouts. |
| tests/integration/test_gpu_management.py | Updates integration mocking/patch points to match current vendor detection paths and stabilizes Context initialization in tests. |
| src/madengine/utils/therock_markers.py | Defines shared TheRock on-disk marker locations and detection helper. |
| src/madengine/utils/rocm_path_resolver.py | Implements host/container ROCm root resolution, auto-detect, and legacy compatibility functions. |
| src/madengine/scripts/common/tools/therock_detector.py | Reuses shared marker helpers; improves importability when run as a standalone script. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py | Passes path resolver into CSVParser so version resolution follows detected ROCm root. |
| src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py | Uses PATH discovery for rocm-smi and uses resolver-based version when available. |
| src/madengine/orchestration/run_orchestrator.py | Stops pre-resolving ROCm path in orchestrator; delegates to Context/resolver logic. |
| src/madengine/execution/container_runner.py | Uses in-container PATH for SMI tools and prints a host vs container environment summary table. |
| src/madengine/core/context.py | Centralizes ROCm path resolution and container ROCM_PATH propagation during runtime context init. |
| src/madengine/core/constants.py | Reframes get_rocm_path() as legacy behavior and delegates to get_rocm_path_legacy(). |
| src/madengine/cli/validators.py | Validates new MAD_ROCM_PATH fields in additional context (top-level + docker_env_vars). |
| src/madengine/cli/commands/run.py | Updates --rocm-path help to match new precedence/auto-detect behavior. |
| docs/usage.md | Documents host/container override behavior and the new run-phase environment table output. |
| docs/configuration.md | Documents precedence, overrides, and references ADR for ROCm path resolution. |
| README.md | Updates high-level ROCm path messaging to match new auto-detect + MAD_ROCM_PATH override model. |
| CHANGELOG.md | Records new ROCm path resolution behavior and references design documentation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ctr_rocm_ver = _sh( | ||
| "rocm-sdk version 2>/dev/null " | ||
| "|| cat \"${ROCM_PATH:-/opt/rocm}/.info/version\" 2>/dev/null " | ||
| "|| rocminfo 2>/dev/null | grep -i 'ROCm Version' | head -n1 | grep -oP '[\\d.]+' " |
| "nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 | " | ||
| "xargs -I{} dirname $(which nvcc 2>/dev/null) 2>/dev/null | xargs dirname 2>/dev/null " | ||
| "|| echo \"${CUDA_PATH:-${CUDA_HOME:-/usr/local/cuda}}\"" | ||
| ) | ||
| host_cuda_ver = _host_sh( | ||
| "nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 " | ||
| "|| nvidia-smi 2>/dev/null | grep -oP 'CUDA Version: \\K[\\d.]+' | head -1 " | ||
| "|| echo unknown" | ||
| ) | ||
|
|
||
| # ── Container side ───────────────────────────────────────── | ||
| ctr_cuda_root = _sh( | ||
| "dirname $(which nvcc 2>/dev/null) 2>/dev/null | xargs dirname 2>/dev/null " | ||
| "|| echo \"${CUDA_PATH:-${CUDA_HOME:-/usr/local/cuda}}\"" | ||
| ) | ||
| ctr_cuda_ver = _sh( | ||
| "nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 " | ||
| "|| nvidia-smi 2>/dev/null | grep -oP 'CUDA Version: \\K[\\d.]+' | head -1 " | ||
| "|| echo unknown" |
| ### ROCm path (run only) | ||
|
|
||
| When ROCm is not installed under `/opt/rocm` (e.g. [TheRock](https://github.com/ROCm/TheRock) or pip), set the ROCm root so GPU detection and container environment use the correct paths. Use the **run** command option or environment variable (not JSON context): | ||
| Design rationale and precedence are recorded in [ADR 0001: ROCm path resolution](adr/0001-rocm-path-resolution.md). |
|
|
||
| ### Changed | ||
|
|
||
| - **ROCm path**: Auto-detect host ROCm root by default (traditional `/opt/rocm`, versioned `/opt/rocm-*`, TheRock `rocm-sdk` + markers, `rocminfo` / `amd-smi` / `rocm-smi` on `PATH`). **Host** overrides: top-level `MAD_ROCM_PATH` in `--additional-context`, then `--rocm-path` (alias). **Container** override: `docker_env_vars.MAD_ROCM_PATH` (sets in-container `ROCM_PATH`); if omitted, host path is mirrored. Set `MAD_AUTO_ROCM_PATH=0` to use legacy `ROCM_PATH` / `/opt/rocm` only without scanning. Code: `madengine.utils.rocm_path_resolver`, shared TheRock file markers in `madengine.utils.therock_markers`. **Design doc**: [ADR 0001](docs/adr/0001-rocm-path-resolution.md). |
| compatibility. The resolution pipeline is **not** the historical ``rocEnvTool``/v1 | ||
| ``RocmPathResolver`` from madenginev1; it reuses the same *ideas* in a self-contained | ||
| helper for madengine (versioned ``/opt/rocm-*``, TheRock markers, ``PATH`` tools). | ||
| See **ADR 0001** in ``docs/adr/0001-rocm-path-resolution.md`` for the full design. |
…able The previous shell command embedded \$(rocm-sdk path --root) inside [ -f "..." ], which broke quoting when passed via docker exec bash -c "..." causing the test to always fall through to 'unknown'. Replace with a quoting-safe check: if rocm-sdk command exists and returns a root path, report 'therock'; otherwise check /opt/rocm/.info/version for traditional apt install layout. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
- Apply docker_env_vars overrides in context; finalize in run_container (AMD) from OCI image env, then docker run probe, then /opt/rocm with warning - Update docs, README, CLI reference, CHANGELOG, and unit tests - ADR 0001 left out of this commit (untracked) Made-with: Cursor
…to coketaste/v2-rocm-path
There was a problem hiding this comment.
Pull request overview
Adds robust ROCm root resolution for the run phase, including host-side auto-detection (TheRock + versioned /opt/rocm-*) and explicit host/container override behavior, and updates runtime logging + docs/tests accordingly.
Changes:
- Introduces
madengine.utils.rocm_path_resolver+madengine.utils.therock_markersto centralize ROCm/TheRock detection and documented precedence. - Updates runtime wiring (Context + ContainerRunner) so host ROCm root is resolved early, while container
ROCM_PATHis finalized at Docker run time (OCI env → probe → fallback). - Adds/updates unit + integration tests and updates documentation/README/CHANGELOG to reflect the new precedence and behaviors.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/rocm_path_resolver.py |
New host/container ROCm path resolution logic (auto-detect, overrides, OCI/probe finalization). |
src/madengine/utils/therock_markers.py |
New shared TheRock marker path helpers. |
src/madengine/core/context.py |
Uses new host resolver; stops mirroring host ROCm root into container env at init; applies container override mapping only. |
src/madengine/execution/container_runner.py |
Finalizes container ROCM_PATH at run time; prints host/container environment summary; uses PATH-based SMI tool invocation in container. |
src/madengine/orchestration/run_orchestrator.py |
Passes --rocm-path through to Context without legacy resolution. |
src/madengine/core/constants.py |
get_rocm_path() becomes legacy wrapper delegating to resolver’s legacy behavior. |
src/madengine/cli/validators.py |
Validates MAD_ROCM_PATH in top-level context and docker_env_vars. |
src/madengine/cli/commands/run.py |
Updates --rocm-path help text to describe new host behavior + auto-detect toggle. |
src/madengine/scripts/common/tools/therock_detector.py |
Imports TheRock marker helpers reliably when run as a file. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/rocenv_tool.py |
Threads path_resolver into CSV parsing. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/csv_parser.py |
Uses PATH-based rocm-smi detection; uses resolver for ROCm version; adds parsing guard. |
tests/unit/test_rocm_path.py |
Adds coverage for new resolver APIs and revised Context/container behavior. |
tests/unit/test_therock_markers.py |
Adds unit tests for TheRock marker helpers. |
tests/integration/test_gpu_management.py |
Fixes patch target for vendor auto-detection; stabilizes non-AMD context creation in integration tests. |
docs/usage.md |
Documents host vs container ROCm path behavior + new run-phase environment table. |
docs/configuration.md |
Documents new precedence and container finalization behavior (OCI/probe). |
docs/installation.md |
Clarifies host vs container ROCm path configuration. |
docs/cli-reference.md |
Updates CLI reference for --rocm-path and env vars. |
README.md |
Updates high-level ROCm path behavior summary. |
CHANGELOG.md |
Notes behavior change and new resolver modules. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _CONTAINER_PROBE_SH = r"""try_root() { | ||
| d="$1" | ||
| [ -d "$d" ] || return 1 | ||
| { [ -f "$d/bin/rocminfo" ] || [ -f "$d/bin/amd-smi" ] || [ -f "$d/share/therock/therock_manifest.json" ] || [ -f "$d/share/therock/dist_info.json" ]; } && return 0 |
| if MAD_ROCM_PATH in d and d[MAD_ROCM_PATH] is not None: | ||
| s = d.pop(MAD_ROCM_PATH) | ||
| if isinstance(s, (int, float)): | ||
| s = str(s) | ||
| if not isinstance(s, str) or not str(s).strip(): | ||
| s = host_path | ||
| croot = normalize_rocm_path(str(s).strip()) | ||
| elif d.get("ROCM_PATH") not in (None, ""): | ||
| croot = normalize_rocm_path(str(d["ROCM_PATH"]).strip()) | ||
| else: |
|
|
||
| finalize_container_rocm_path( | ||
| self.context.ctx["docker_env_vars"], | ||
| docker_image, | ||
| self.context._rocm_path, |
| def _host_sh(cmd: str) -> str: | ||
| try: | ||
| return subprocess.check_output(cmd, shell=True, stderr=subprocess.DEVNULL, text=True).strip() | ||
| except Exception: | ||
| return "unknown" | ||
|
|
||
| host_cuda_root = _host_sh( | ||
| "nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 | " | ||
| "xargs -I{} dirname $(which nvcc 2>/dev/null) 2>/dev/null | xargs dirname 2>/dev/null " | ||
| "|| echo \"${CUDA_PATH:-${CUDA_HOME:-/usr/local/cuda}}\"" | ||
| ) | ||
| host_cuda_ver = _host_sh( | ||
| "nvcc --version 2>/dev/null | grep -oP 'release \\K[\\d.]+' | head -1 " | ||
| "|| nvidia-smi 2>/dev/null | grep -oP 'CUDA Version: \\K[\\d.]+' | head -1 " | ||
| "|| echo unknown" |
| # Use PATH resolution first so TheRock images (where rocm-smi lives | ||
| # under a Python venv, not /opt/rocm/bin/) are handled correctly. | ||
| rocm_smi_cmd = shutil.which("rocm-smi") or "/opt/rocm/bin/rocm-smi" | ||
| rocm_smi_out = console.sh(f"{rocm_smi_cmd} || true") | ||
| nv_smi_out = console.sh("nvidia-smi -L || true") |
| | `--tags` | `-t` | TEXT | `[]` | Model tags to run (can specify multiple) | | ||
| | `--manifest-file` | `-m` | TEXT | `""` | Build manifest file path (for pre-built images) | | ||
| | `--rocm-path` | | TEXT | `None` | ROCm installation root (default: `ROCM_PATH` env or `/opt/rocm`). Use when ROCm is not in `/opt/rocm` (e.g. Rock, pip). | | ||
| | `--rocm-path` | | TEXT | `None` | **Host** ROCm installation root only (default: `ROCM_PATH` env or `/opt/rocm`). Use when the **runner’s** ROCm is not in `/opt/rocm` (e.g. TheRock, pip). Does not set the in-container `ROCM_PATH` for Docker; that is resolved at `run` from the image, probe, or `docker_env_vars` — see [Configuration — ROCm path](configuration.md#rocm-path-run-only). | |
MI350X (gfx950) was missing from the skip_gpu_arch fixture, causing test_commandline_argument_skip_gpu_arch to fail on this machine. Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
…TH in --additional-context
The --rocm-path flag was an alias for top-level MAD_ROCM_PATH but its
help text implied it could set both host and container paths, risking
confusion when environments differ. Users should set host and container
ROCm roots independently via --additional-context:
Host: {"MAD_ROCM_PATH": "/path/to/host/rocm"}
Container: {"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}
Also fixes datetime.utcnow() deprecation warnings in mongodb.py
(replaced with datetime.now(timezone.utc)).
Co-Authored-By: Claude Sonnet 4 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces a new ROCm root resolution flow that can auto-detect host ROCm installs (including versioned /opt/rocm-* and TheRock marker layouts) and cleanly separates host ROCm resolution from in-container ROCM_PATH selection at Docker run time.
Changes:
- Added
madengine.utils.rocm_path_resolver(host auto-detect + container finalization) andmadengine.utils.therock_markers(shared TheRock marker paths). - Updated runtime wiring so
Contextresolves host ROCm root via the resolver, whileContainerRunnerfinalizes containerROCM_PATHat run time (OCI env → probe → default). - Removed
madengine run --rocm-pathusage in favor ofMAD_ROCM_PATH(top-level anddocker_env_vars), and updated docs/tests accordingly.
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
src/madengine/utils/rocm_path_resolver.py |
Implements host ROCm auto-detect and container ROCM_PATH override/finalization logic. |
src/madengine/utils/therock_markers.py |
Defines canonical TheRock marker relative paths + helper predicates. |
src/madengine/core/context.py |
Switches host ROCm path resolution to the new resolver and defers container ROCM_PATH finalization. |
src/madengine/execution/container_runner.py |
Finalizes in-container ROCM_PATH during Docker runs and adds a host/container environment summary table. |
src/madengine/core/constants.py |
Reframes get_rocm_path() as legacy resolution and delegates to resolver legacy helper. |
src/madengine/cli/commands/run.py |
Removes --rocm-path CLI option wiring. |
src/madengine/cli/validators.py |
Validates MAD_ROCM_PATH (host and docker_env_vars) shapes in additional context. |
src/madengine/scripts/common/tools/therock_detector.py |
Makes the tool importable when executed as a file; reuses shared marker helpers. |
src/madengine/scripts/common/pre_scripts/rocEnvTool/{rocenv_tool.py,csv_parser.py} |
Passes resolver into CSV parsing and improves tool/path selection for TheRock-style layouts. |
tests/unit/test_rocm_path.py |
Adds/updates unit coverage for resolver behaviors and container override rules. |
tests/unit/test_therock_markers.py |
Adds unit coverage for TheRock marker helpers. |
tests/integration/test_gpu_management.py |
Updates patch points and stabilizes integration behavior with Context initialization changes. |
docs/{usage.md,installation.md,configuration.md,cli-reference.md} |
Updates documented precedence and configuration guidance for host vs container ROCm roots. |
README.md |
Updates top-level ROCm path guidance and examples. |
CHANGELOG.md |
Adds an Unreleased entry describing new ROCm resolution behavior. |
src/madengine/database/mongodb.py |
Switches metadata timestamps to timezone-aware UTC. |
tests/fixtures/dummy/models.json |
Adds gfx950 to skip list for a dummy model. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| madengine run --tags dummy --additional-context ‘{"MAD_ROCM_PATH": "/path/to/rocm"}’ | ||
| # or: export ROCM_PATH=/path/to/rocm && madengine run --tags dummy | ||
| # Override in-container ROCm root independently: | ||
| madengine run --tags dummy --additional-context ‘{"docker_env_vars": {"MAD_ROCM_PATH": "/path/in/container"}}’ |
| finalize_container_rocm_path( | ||
| self.context.ctx["docker_env_vars"], | ||
| docker_image, | ||
| self.context._rocm_path, | ||
| ) | ||
|
|
| d = docker_env | ||
| if MAD_ROCM_PATH in d and d[MAD_ROCM_PATH] is not None: | ||
| s = d.pop(MAD_ROCM_PATH) | ||
| if isinstance(s, (int, float)): | ||
| s = str(s) | ||
| if not isinstance(s, str) or not str(s).strip(): | ||
| s = host_path | ||
| croot = normalize_rocm_path(str(s).strip()) | ||
| elif d.get("ROCM_PATH") not in (None, ""): | ||
| croot = normalize_rocm_path(str(d["ROCM_PATH"]).strip()) | ||
| else: | ||
| return None | ||
| d["ROCM_PATH"] = croot |
| num_gpu += 1 | ||
| values = line.split(":") | ||
| if len(values) < 3: | ||
| continue | ||
| name = values[1].split("(UUID")[0] | ||
| uuid = values[2] |
|
|
||
| ### Changed | ||
|
|
||
| - **ROCm path**: Auto-detect host ROCm root by default (traditional `/opt/rocm`, versioned `/opt/rocm-*`, TheRock `rocm-sdk` + markers, `rocminfo` / `amd-smi` / `rocm-smi` on `PATH`). **Host** overrides: top-level `MAD_ROCM_PATH` in `--additional-context`, then `--rocm-path` (alias). **Container** `ROCM_PATH` for Docker runs (AMD): `docker_env_vars.MAD_ROCM_PATH` if set; else `ROCM_PATH` / `ROCM_HOME` from the image OCI config (`docker image inspect`); else an in-image shell probe (`docker run --rm`); else default `/opt/rocm` with a warning. The host-resolved path is no longer mirrored into the container by default. Set `MAD_AUTO_ROCM_PATH=0` to use legacy host `ROCM_PATH` / `/opt/rocm` only without scanning. Code: `madengine.utils.rocm_path_resolver`, shared TheRock file markers in `madengine.utils.therock_markers`. **Design doc**: [ADR 0001](docs/adr/0001-rocm-path-resolution.md). |
| _CONTAINER_PROBE_SH = r"""try_root() { | ||
| d="$1" | ||
| [ -d "$d" ] || return 1 | ||
| { [ -f "$d/bin/rocminfo" ] || [ -f "$d/bin/amd-smi" ] || [ -f "$d/share/therock/therock_manifest.json" ] || [ -f "$d/share/therock/dist_info.json" ]; } && return 0 |
Summary
Adds default ROCm install root resolution for the run phase: scan common layouts (including versioned
/opt/rocm-*and TheRock markers), with clear overrides for host vs container. Aligns in-containerROCM_PATHwithdocker_env_varswhen the image layout differs from the host.Motivation
Users on TheRock- or CI-style images may not have a plain
/opt/rocmtree;madengineshould still find a valid ROCm root for validation and tool paths, while allowing explicitMAD_ROCM_PATH/--rocm-pathwhen auto-detect is wrong or undesired.What changed
madengine.utils.rocm_path_resolverandmadengine.utils.therock_markers(shared TheRockshare/therockfile markers).MAD_ROCM_PATHin additional context →--rocm-path→ auto-detect (unlessMAD_AUTO_ROCM_PATH=0) →ROCM_PATH→/opt/rocm.docker_env_vars.MAD_ROCM_PATHsets in-containerROCM_PATH(key consumed); otherwise host-resolved path is mirrored unless the user setROCM_PATHindocker_env_vars.therock_detectorscript: prepends the package parent tosys.pathso imports work when run as a file; last-resort path fallback only if the script is outside the tree.test_rocm_path,test_therock_markers) and integration GPU test adjustments.How to test
pytest tests/unit/test_rocm_path.py tests/unit/test_therock_markers.py(or fulltests/unit/).madengine run --helpand confirm--rocm-path/ env behavior as documented.madengine runwithoutMAD_ROCM_PATHand confirm a sensibleROCM_PATHin the generated context.Checklist (optional; trim for your team)
MAD_AUTO_ROCM_PATH=0beyond documented legacy path.