Skip to content

refactor(v2-review): fix timeout, extract auth module, fix security issues, update test suite#108

Open
coketaste wants to merge 11 commits intodevelopfrom
coketaste/v2-review-fix
Open

refactor(v2-review): fix timeout, extract auth module, fix security issues, update test suite#108
coketaste wants to merge 11 commits intodevelopfrom
coketaste/v2-review-fix

Conversation

@coketaste
Copy link
Copy Markdown
Collaborator

Summary

This branch addresses a set of v2 code review items across the source and test layers:

  • New core/auth.py module — extracts load_credentials and login_to_registry
    from BuildOrchestrator, RunOrchestrator, DockerBuilder, and ContainerRunner,
    eliminating ~120 lines of duplicated login logic. Both callers delegate with
    raise_on_failure=True/False respectively.
  • Security fixes — registry password is now passed via MAD_REGISTRY_PASSWORD
    env var instead of printf argv (prevents process-list exposure); docker ps
    container existence check uses --filter name=^/<name>$ instead of grep to
    avoid regex metachar false positives; build-arg keys/values are wrapped with
    str() before shlex.quote().
  • Dead code removalfind_and_replace_pattern, substring_found (ops.py),
    highlight_log_section (log_formatting.py), SessionTracker.get_session_start
    and load_marker, _filter_images_by_dockerfile_context (RunOrchestrator),
    and unused imports across core, deployment, execution, orchestration, and utils.
  • Timeout fix--timeout 0 now correctly disables the timeout; previously
    Timeout.__enter__ called signal.alarm(None) because the None sentinel
    had no early-return guard.
  • Test suite cleanup — dead imports removed across 14 test files; try/assert False/except antipattern replaced with pytest.raises(match=...); bare
    except: narrowed to except Exception:; pass-only and duplicate tests removed;
    misnamed @pytest.fixture test_dir renamed to prevent pytest collecting fixtures
    as test functions; test_profiling_tools_config moved unit→integration,
    test_errors moved integration→unit; test_sh_live_output added for
    Console(live_output=True) path.
  • CLAUDE.md — added codebase architecture guide for future contributors.

Test plan

  • pytest tests/unit/ -v — 365 unit tests pass
  • pytest tests/integration/ -v — 68 integration tests pass (excluding GPU-hardware-gated tests)
  • pytest tests/e2e/ -v — requires built Docker image (dummy fixture); run on GPU machine
  • Verify madengine run --timeout 0 ... does not raise a signal error
  • Verify registry login does not expose password in ps aux output

coketaste and others added 6 commits April 17, 2026 23:25
… and dead code

- Extract `_load_credentials` from both orchestrators into `core/auth.py` (`load_credentials`)
- Remove dead `_filter_images_by_dockerfile_context` method from RunOrchestrator
- Raise `ConfigurationError` instead of `SystemExit` in BuildOrchestrator config loading
- Add CLAUDE.md with codebase guidance and architecture docs
- Update tests to cover auth module and refactored error handling

Co-authored-by: Claude Sonnet 4 <[email protected]>
… code

- Add login_to_registry() to core/auth.py; DockerBuilder and ContainerRunner
  delegate to it (raise_on_failure=True/False respectively), eliminating ~120
  lines of duplicated logic
- Remove unused functions: find_and_replace_pattern, substring_found (ops.py),
  highlight_log_section (log_formatting.py), SessionTracker.get_session_start
  and SessionTracker.load_marker (session_tracker.py)
- Clean up unused imports across core, deployment, execution, orchestration,
  and utils modules

Co-Authored-By: Claude Sonnet 4 <[email protected]>
- auth: pass registry password via MAD_REGISTRY_PASSWORD env var instead of
  printf argv to prevent process-list exposure; wrap all build-arg keys and
  values with str() before shlex.quote() to handle non-string config values
- auth: gate all RuntimeError raises (missing key, invalid format, login
  failure) on raise_on_failure so ContainerRunner can fall through to public
  image pulls
- auth: add TestLoginToRegistry unit tests covering all failure/success paths
  and raise_on_failure behaviour
- docker: replace grep-based container existence check with
  docker ps --filter name=^/<name>$ to avoid regex metachar false positives
  and substring matches
- factory: change ImportWarning → UserWarning so the missing-kubernetes
  message is visible by default
- factory: recommend madengine[kubernetes] (plus madengine[all]) in the
  install hint

Co-Authored-By: Claude Sonnet 4 <[email protected]>
Timeout.__enter__ called signal.alarm(None) when --timeout 0 was passed,
because the CLI correctly maps 0 → None but Timeout had no guard for it.
Add early-return in __enter__/__exit__ when seconds is falsy, and improve
small-value formatting in the perf table display.

Tests: add TestTimeout (None/0/positive) and a resolve_run_timeout None
passthrough case to prevent regression.

Co-Authored-By: Claude Sonnet 4 <[email protected]>
- Remove dead imports across 14 test files (sys, csv, json, MagicMock,
  mock_open, call, tempfile, ConfigurationError, DiscoveryError, pytest
  where unused, generate_additional_context_for_machine)
- Restore subprocess/Mock imports that were incorrectly removed
- Replace try/assert False/except antipattern with pytest.raises() in
  test_auth.py (3 occurrences), adding match= to validate error messages
- Narrow 5 bare except: clauses to except Exception: in utils.py and
  test_gpu_management.py
- Delete pass-only dead test test_dockerfile_executed_if_contexts_keys_are_not_common
- Remove duplicate tests: test_validate_additional_context_invalid_json
  (kept in test_validators.py) and test_filter_images_by_gpu_architecture
  (kept in test_orchestrator_workflows.py)
- Rename misnamed @pytest.fixture test_dir → tmp_dir in test_reporting_superset.py
  to prevent pytest collecting fixtures as test functions
- Reclassify test_profiling_tools_config.py: unit/ → integration/ (reads
  real disk files); test_errors.py: integration/ → unit/ (pure mocks)
- Add test_sh_live_output to test_console_integration.py for live_output=True path

Co-Authored-By: Claude Sonnet 4 <[email protected]>
@coketaste coketaste self-assigned this Apr 21, 2026
Copilot AI review requested due to automatic review settings April 21, 2026 22:57
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors shared registry authentication into madengine.core.auth, tightens several shell/security behaviors across Docker execution paths, fixes timeout disabling semantics, and performs broad dead-code/test-suite cleanup to improve maintainability and reliability across v2.

Changes:

  • Extracted shared credential loading + registry login into src/madengine/core/auth.py and updated orchestrators/builders/runners to delegate to it.
  • Hardened shell interactions (password handling, quoting, container existence checks) and fixed --timeout 0 to truly disable timeouts.
  • Removed dead code and modernized tests (cleanup imports, better exception assertions, fixture naming), plus added contributor architecture docs.

Reviewed changes

Copilot reviewed 60 out of 60 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/unit/test_reporting_superset.py Renames fixture to avoid pytest collecting it as a test; updates references.
tests/unit/test_reporting.py Removes unused imports and keeps reporting unit coverage focused.
tests/unit/test_execution.py Adds unit coverage for Timeout and resolve_run_timeout(None) sentinel behavior.
tests/unit/test_errors.py Updates unit tests to new error class names and removes legacy path hacks.
tests/unit/test_error_handling.py Aligns error-handling tests with new error classes and removes unused imports.
tests/unit/test_deployment.py Ensures is_rocprofv3_available() cache is cleared between tests.
tests/unit/test_database_mongodb.py Removes unused imports to reduce test noise.
tests/unit/test_context_logic.py Removes unused mock import.
tests/unit/test_container_runner.py Removes unused pytest import and validates safer subprocess argv usage.
tests/unit/test_constants.py Removes unused pytest import.
tests/unit/test_cli.py Cleans up unused imports and removes a redundant JSON-invalid test.
tests/unit/test_auth.py Adds unit tests for new core.auth credential loading and login behavior.
tests/integration/test_profiling_tools_config.py Reclassifies as integration and updates module docstring.
tests/integration/test_platform_integration.py Removes dead test and unused imports to match orchestrator behavior.
tests/integration/test_gpu_management.py Narrows bare except to Exception and cleans imports.
tests/integration/test_docker_integration.py Updates build-arg quoting assertions to match shlex.quote behavior.
tests/integration/test_container_execution.py Removes unused imports.
tests/integration/test_console_integration.py Adds live-output integration coverage for Console(sh, live_output=True).
tests/fixtures/utils.py Narrows bare except clauses to Exception in cleanup helpers.
tests/e2e/test_scripting_workflows.py Removes unused imports for e2e suite cleanliness.
tests/e2e/test_run_workflows.py Removes a pass-only duplicate test and unused imports.
tests/e2e/test_profiling_workflows.py Removes unused imports.
tests/e2e/test_execution_features.py Removes unused import.
tests/e2e/test_data_workflows.py Removes unused import.
tests/e2e/test_build_workflows.py Removes unused import.
src/madengine/utils/session_tracker.py Counts only non-empty CSV lines; removes unused session marker APIs.
src/madengine/utils/rocm_tool_manager.py Preserves primary stderr across fallback execution for clearer errors.
src/madengine/utils/ops.py Removes unused dictionary substring helpers and related imports.
src/madengine/utils/log_formatting.py Fixes truncation message to match tail() behavior; removes dead helper.
src/madengine/utils/gpu_validator.py Removes unused typing imports.
src/madengine/utils/gpu_config.py Replaces print() output with structured logging.
src/madengine/utils/discover_models.py Quotes shell cp paths and replaces assert with TypeError for validation.
src/madengine/utils/config_parser.py Fixes directory-walk boundary condition for stop-dir matching.
src/madengine/scripts/common/tools/rocm_smi_utils.py Narrows bare except to Exception.
src/madengine/scripts/common/tools/amd_smi_utils.py Narrows bare except to Exception.
src/madengine/reporting/update_perf_csv.py Avoids pandas single-row DataFrame errors by dropping non-scalar fields like configs.
src/madengine/reporting/csv_to_email.py Corrects return typing to Optional[str] when no CSVs exist.
src/madengine/orchestration/run_orchestrator.py Switches to shared load_credentials() and removes legacy filtering/dead code.
src/madengine/orchestration/build_orchestrator.py Uses shared load_credentials() and raises ConfigurationError for CLI handling.
src/madengine/execution/docker_builder.py Uses shared registry login; hardens build-arg quoting and shell path quoting.
src/madengine/execution/container_runner.py Delegates registry login and hardens env var quoting and exception handling.
src/madengine/deployment/slurm.py Stops assuming success when job status can’t be verified; introduces UNKNOWN handling.
src/madengine/deployment/kubernetes.py Removes unused imports and simplifies error imports.
src/madengine/deployment/factory.py Emits a warning when Kubernetes deps aren’t installed instead of silently skipping.
src/madengine/deployment/config_loader.py Removes unused import.
src/madengine/deployment/common.py Adds lru_cache for rocprofv3 availability checks.
src/madengine/deployment/base.py Adds DeploymentStatus.UNKNOWN and adjusts monitor termination + perf.csv writing logic.
src/madengine/core/timeout.py Fixes `Timeout(None
src/madengine/core/errors.py Renames error classes and removes legacy aliases; trims unused imports.
src/madengine/core/docker.py Refactors container existence check and improves quoting/default arg safety.
src/madengine/core/dataprovider.py Removes debug prints.
src/madengine/core/context.py Removes debug prints and updates NUMA balancing return typing.
src/madengine/core/auth.py New shared module for credentials + registry login using env-based password passing.
src/madengine/cli/validators.py Initializes dockerfile_matched to avoid referencing before assignment on exceptions.
src/madengine/cli/utils.py Tweaks performance formatting for small values.
src/madengine/cli/constants.py Makes ExitCode an IntEnum.
src/madengine/cli/commands/run.py Implements --timeout 0 → no-timeout sentinel and passes through run stack.
src/madengine/cli/app.py Reports package version from metadata instead of hard-coding.
pyproject.toml Removes runtime deps that don’t belong in core install (e.g., pytest).
CLAUDE.md Adds an architecture + contributor workflow guide for future maintainers.
Comments suppressed due to low confidence (2)

src/madengine/core/errors.py:199

  • Similarly, TimeoutError was replaced with DeploymentTimeoutError and the previous RuntimeError = ExecutionError compatibility alias was removed. If these names are part of the public API, this breaks callers and also conflicts with the PR description mentioning backward compatibility. Consider reintroducing the old names as deprecated aliases to the new classes.
class ConfigurationError(MADEngineError):
    """Configuration and setup errors."""

    def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwargs):
        super().__init__(
            message,
            ErrorCategory.CONFIGURATION,
            context,
            recoverable=True,
            **kwargs
        )


class DeploymentTimeoutError(MADEngineError):
    """Timeout and duration errors."""

    def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwargs):
        super().__init__(
            message,
            ErrorCategory.TIMEOUT,
            context,
            recoverable=True,
            **kwargs
        )

src/madengine/core/docker.py:115

  • container_name is still used unquoted in the docker run --name ... command, and again in the docker ps -aqf 'name=...' lookup. If container_name contains spaces/quotes or regex metacharacters (it’s derived from image refs and can include .), this can break container creation/lookup or enable shell injection via console.sh(shell=True). Use a safely quoted name consistently (and for lookups, prefer docker container inspect or an exact/escaped filter).
            for evar in envVars.keys():
                command += "-e " + evar + "=" + shlex.quote(str(envVars[evar])) + " "

        command += "--workdir /myworkspace/ "
        command += "--name " + container_name + " "
        command += image + " "
        
        # Use 'cat' to keep container alive (blocks waiting for stdin)
        # Works reliably across all deployment types (local, k8s, slurm)
        # with fresh image pulls preventing corrupted layer issues
        command += "cat "
        self.console.sh(command)

        # find container sha
        self.docker_sha = self.console.sh(
            "docker ps -aqf 'name=" + container_name + "' "
        )

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/core/errors.py
Comment thread src/madengine/core/auth.py
Comment thread src/madengine/core/docker.py
Comment thread src/madengine/cli/commands/run.py
@coketaste coketaste changed the title refactor(v2-review): extract auth module, fix security issues, clean up dead code and test suite refactor(v2-review): fix timeout, extract auth module, fix security issues, update test suite Apr 21, 2026
coketaste and others added 2 commits April 23, 2026 21:13
- docker.py: apply re.escape() to container name before embedding in
  Docker --filter regex so metacharacters (e.g. ".") are treated as
  literals, preventing false-positive container matches
- run.py: include 0 in timeout validation error message; introduce
  timeout_display so panels show "disabled" instead of "Nones" when
  --timeout 0 is used
- auth.py: fix login_to_registry signature from registry: str to
  registry: Optional[str], matching actual call sites and implementation

Co-Authored-By: Claude Sonnet 4 <[email protected]>
Resolved conflict in tests/unit/test_container_runner.py by accepting
develop's added imports (pytest, PERFORMANCE_LOG_PATTERN) which are
required by the test file.

Co-Authored-By: Claude Sonnet 4 <[email protected]>
Copilot AI review requested due to automatic review settings April 24, 2026 02:36
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 60 out of 60 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

src/madengine/core/docker.py:106

  • container_name is still interpolated unquoted into the docker run command (--name ...). Since Console.sh executes with shell=True, a container name containing spaces or shell metacharacters can break the command or be used for injection. Quote/sanitize the name (e.g., shlex.quote(container_name)) consistently everywhere it's used in shell commands.
                command += "-e " + evar + "=" + shlex.quote(str(envVars[evar])) + " "

        command += "--workdir /myworkspace/ "
        command += "--name " + container_name + " "
        command += image + " "

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/core/timeout.py
Comment thread src/madengine/core/docker.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/deployment/slurm.py
Comment thread src/madengine/core/errors.py
Comment thread src/madengine/core/docker.py
- Validate env var names against POSIX regex before injecting into docker
  commands (docker.py, container_runner.py)
- Allow None/0 timeout in Timeout class to disable signal-based timeout
- Retry sacct up to 3 times with 5s delay to handle transient SLURM
  accounting DB lag after job completion

Co-Authored-By: Claude Sonnet 4 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants