Skip to content

fix(timeout): guard signal.alarm() against None; centralize auth; fix security issues#111

Closed
srinivamd wants to merge 5 commits intoROCm:coketaste/v2-review-fixfrom
srinivamd:fix/rocm-23419-timeout-none-guard
Closed

fix(timeout): guard signal.alarm() against None; centralize auth; fix security issues#111
srinivamd wants to merge 5 commits intoROCm:coketaste/v2-review-fixfrom
srinivamd:fix/rocm-23419-timeout-none-guard

Conversation

@srinivamd
Copy link
Copy Markdown

Summary

Fixes ROCM-23419: MAD-internal models fail with TypeError: 'NoneType' object cannot be interpreted as an integer in Timeout.__enter__ when --timeout 0 is passed.

This PR incorporates all changes from PR #108 (refactor(v2-review)) and additionally resolves all 4 unresolved Copilot review comments from that PR before merging to main.


Root Cause

PR #106 (merged 2026-04-20) introduced None as a sentinel for "no timeout" in cli/commands/run.py:

elif timeout == 0:
    timeout = None  # "no timeout" sentinel

But core/timeout.py passed self.seconds directly to signal.alarm(), which requires a non-negative integer — causing an immediate TypeError for any model using --timeout 0 or the no-timeout default.

PR #107 reverted PR #106 as a hotfix. This PR re-introduces the auth centralization from PR #106 with the timeout guard and all review comments addressed.


Changes

core/timeout.py — ROCM-23419 core fix

Added early-return guards in __enter__ and __exit__ for None (and 0), preventing signal.alarm() from receiving a non-integer:

def __enter__(self) -> None:
    if not self.seconds:   # guards None AND 0
        return
    signal.signal(signal.SIGALRM, self.handle_timeout)
    signal.alarm(self.seconds)

def __exit__(self, type, value, traceback) -> None:
    if not self.seconds:
        return
    signal.alarm(0)

core/errors.py — Copilot comment #1: avoid breaking API change

Renamed ConnectionErrorNetworkError but preserved a deprecated backward-compatibility alias so downstream callers are not broken in this release:

class NetworkError(MADEngineError): ...

# Deprecated: use NetworkError instead; will be removed in a future release
ConnectionError = NetworkError

Also renamed TimeoutErrorDeploymentTimeoutError to avoid shadowing the Python built-in.

core/auth.py — Copilot comment #2: correct Optional[str] typing (new file)

New centralized credential and registry-login module extracted from 4 duplicated locations. login_to_registry now correctly typed as registry: Optional[str] (was str in PR #108, but call sites pass None for unauthenticated pulls). Registry password passed via MAD_REGISTRY_PASSWORD env var instead of argv to avoid credential exposure in ps output.

core/docker.py — Copilot comment #3: re.escape() for Docker --filter name=

Docker's --filter name= interprets the value as a Go regex — container names with dots or other metacharacters produced false positives. Added re.escape() for the regex portion (separate from shlex.quote() which only escapes for shell argv):

container_name_re = re.escape(container_name)
container_name_exists = self.console.sh(
    "docker container ps -aq --filter " + shlex.quote(f"name=^/{container_name_re}$")
)

Also applied shlex.quote() consistently for docker stop/rm args and env var values.

cli/commands/run.py — Copilot comment #4: Optional[int] + display "Disabled"

  • timeout option now converts to None for no-timeout (documented inline)
  • Panel display shows Disabled instead of None s:
    f"Timeout: [yellow]{f'{timeout}s' if timeout is not None else 'Disabled'}[/yellow]"
  • Validation message updated to mention 0 is valid

Testing

# Install from this branch
pip install git+https://github.com/srinivamd/madengine.git@fix/rocm-23419-timeout-none-guard

# Verify core fix
python3 -c "
from madengine.core.timeout import Timeout
with Timeout(None):
    print('None timeout — OK')
with Timeout(0):
    print('Zero timeout — OK')
with Timeout(5):
    print('Normal timeout — OK')
"

# Verify CLI display
madengine run --timeout 0 --tags <any_model_tag>  # should show "Timeout: Disabled"
madengine run --tags <any_model_tag>               # should show "Timeout: 7200s"

Related

When --timeout 0 is passed, run.py converts it to None as a sentinel
for "no timeout". Timeout.__enter__ passed self.seconds directly to
signal.alarm() which requires an int, causing TypeError.

Add early-return guards in __enter__ and __exit__ so None (and 0)
are treated as "no timeout" without calling signal.alarm().
Also remove unused `import typing`.
…as; rename TimeoutError->DeploymentTimeoutError; remove unused imports

- Copilot review comment ROCm#1: ConnectionError rename is a breaking API
  change; add `ConnectionError = NetworkError` deprecated alias so
  existing callers still work for one release cycle
- Remove unused `import traceback` and `from rich.table import Table`
- Rename TimeoutError -> DeploymentTimeoutError to avoid shadowing
  Python built-in and avoid confusion with madengine.core.timeout
- Remove RuntimeError = ExecutionError alias (shadowed Python built-in)
…e() for shell args

- Copilot review comment ROCm#3: Docker --filter name=^/...$ treats the
  value as a Go regex; shlex.quote() only provides shell quoting, not
  regex safety. Use re.escape() on the container name for the filter
  value, shlex.quote() for shell-level quoting of stop/rm args.
- Fix envVars: wrap values with shlex.quote(str(...)) to prevent
  shell injection from env var values containing spaces or special chars
- Fix docker exec: wrap command with shlex.quote() instead of embedding
  it inside a double-quoted string (avoids escaping issues)
- Add `import re` and `import shlex`
Extracts duplicated credential loading and Docker registry login logic
from BuildOrchestrator and RunOrchestrator into a shared core/auth.py
module (~120 lines of duplicated login logic removed).

Security improvements over PR ROCm#106:
- Registry password passed via MAD_REGISTRY_PASSWORD env var instead
  of argv (avoids password leaking in process list / shell history)
- docker ps uses --filter name=^/$  with re.escape() to avoid regex
  metachar false positives (see docker.py fix)
- build-arg values wrapped with str() before shlex.quote()

Copilot review comment ROCm#2 fix: login_to_registry typed as
Optional[str] (was str) since callers pass None for DockerHub
and tests call with None.
…play

- Copilot review comment ROCm#4: timeout option declared as int but
  assigned None after conversion; add elif timeout == 0: timeout = None
  sentinel so "no timeout" is expressed as None (not a magic int)
- Fix validation message to mention 0 is valid (--timeout 0 = no timeout)
- Fix 2 panel displays: show "Disabled" when timeout is None instead
  of "None s" which was confusing to users
Copy link
Copy Markdown
Collaborator

@coketaste coketaste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this PR be merged into PR #108 first, since there are overlapping changes?

@coketaste coketaste changed the base branch from main to coketaste/v2-review-fix April 23, 2026 13:29
@gargrahul
Copy link
Copy Markdown
Collaborator

We can go with #108

@gargrahul gargrahul closed this Apr 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants