refactor(v2-review): fix timeout, extract auth module, fix security issues, update test suite#108
refactor(v2-review): fix timeout, extract auth module, fix security issues, update test suite#108
Conversation
… and dead code - Extract `_load_credentials` from both orchestrators into `core/auth.py` (`load_credentials`) - Remove dead `_filter_images_by_dockerfile_context` method from RunOrchestrator - Raise `ConfigurationError` instead of `SystemExit` in BuildOrchestrator config loading - Add CLAUDE.md with codebase guidance and architecture docs - Update tests to cover auth module and refactored error handling Co-authored-by: Claude Sonnet 4 <[email protected]>
… code - Add login_to_registry() to core/auth.py; DockerBuilder and ContainerRunner delegate to it (raise_on_failure=True/False respectively), eliminating ~120 lines of duplicated logic - Remove unused functions: find_and_replace_pattern, substring_found (ops.py), highlight_log_section (log_formatting.py), SessionTracker.get_session_start and SessionTracker.load_marker (session_tracker.py) - Clean up unused imports across core, deployment, execution, orchestration, and utils modules Co-Authored-By: Claude Sonnet 4 <[email protected]>
- auth: pass registry password via MAD_REGISTRY_PASSWORD env var instead of printf argv to prevent process-list exposure; wrap all build-arg keys and values with str() before shlex.quote() to handle non-string config values - auth: gate all RuntimeError raises (missing key, invalid format, login failure) on raise_on_failure so ContainerRunner can fall through to public image pulls - auth: add TestLoginToRegistry unit tests covering all failure/success paths and raise_on_failure behaviour - docker: replace grep-based container existence check with docker ps --filter name=^/<name>$ to avoid regex metachar false positives and substring matches - factory: change ImportWarning → UserWarning so the missing-kubernetes message is visible by default - factory: recommend madengine[kubernetes] (plus madengine[all]) in the install hint Co-Authored-By: Claude Sonnet 4 <[email protected]>
Timeout.__enter__ called signal.alarm(None) when --timeout 0 was passed, because the CLI correctly maps 0 → None but Timeout had no guard for it. Add early-return in __enter__/__exit__ when seconds is falsy, and improve small-value formatting in the perf table display. Tests: add TestTimeout (None/0/positive) and a resolve_run_timeout None passthrough case to prevent regression. Co-Authored-By: Claude Sonnet 4 <[email protected]>
- Remove dead imports across 14 test files (sys, csv, json, MagicMock, mock_open, call, tempfile, ConfigurationError, DiscoveryError, pytest where unused, generate_additional_context_for_machine) - Restore subprocess/Mock imports that were incorrectly removed - Replace try/assert False/except antipattern with pytest.raises() in test_auth.py (3 occurrences), adding match= to validate error messages - Narrow 5 bare except: clauses to except Exception: in utils.py and test_gpu_management.py - Delete pass-only dead test test_dockerfile_executed_if_contexts_keys_are_not_common - Remove duplicate tests: test_validate_additional_context_invalid_json (kept in test_validators.py) and test_filter_images_by_gpu_architecture (kept in test_orchestrator_workflows.py) - Rename misnamed @pytest.fixture test_dir → tmp_dir in test_reporting_superset.py to prevent pytest collecting fixtures as test functions - Reclassify test_profiling_tools_config.py: unit/ → integration/ (reads real disk files); test_errors.py: integration/ → unit/ (pure mocks) - Add test_sh_live_output to test_console_integration.py for live_output=True path Co-Authored-By: Claude Sonnet 4 <[email protected]>
There was a problem hiding this comment.
Pull request overview
This PR refactors shared registry authentication into madengine.core.auth, tightens several shell/security behaviors across Docker execution paths, fixes timeout disabling semantics, and performs broad dead-code/test-suite cleanup to improve maintainability and reliability across v2.
Changes:
- Extracted shared credential loading + registry login into
src/madengine/core/auth.pyand updated orchestrators/builders/runners to delegate to it. - Hardened shell interactions (password handling, quoting, container existence checks) and fixed
--timeout 0to truly disable timeouts. - Removed dead code and modernized tests (cleanup imports, better exception assertions, fixture naming), plus added contributor architecture docs.
Reviewed changes
Copilot reviewed 60 out of 60 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_reporting_superset.py | Renames fixture to avoid pytest collecting it as a test; updates references. |
| tests/unit/test_reporting.py | Removes unused imports and keeps reporting unit coverage focused. |
| tests/unit/test_execution.py | Adds unit coverage for Timeout and resolve_run_timeout(None) sentinel behavior. |
| tests/unit/test_errors.py | Updates unit tests to new error class names and removes legacy path hacks. |
| tests/unit/test_error_handling.py | Aligns error-handling tests with new error classes and removes unused imports. |
| tests/unit/test_deployment.py | Ensures is_rocprofv3_available() cache is cleared between tests. |
| tests/unit/test_database_mongodb.py | Removes unused imports to reduce test noise. |
| tests/unit/test_context_logic.py | Removes unused mock import. |
| tests/unit/test_container_runner.py | Removes unused pytest import and validates safer subprocess argv usage. |
| tests/unit/test_constants.py | Removes unused pytest import. |
| tests/unit/test_cli.py | Cleans up unused imports and removes a redundant JSON-invalid test. |
| tests/unit/test_auth.py | Adds unit tests for new core.auth credential loading and login behavior. |
| tests/integration/test_profiling_tools_config.py | Reclassifies as integration and updates module docstring. |
| tests/integration/test_platform_integration.py | Removes dead test and unused imports to match orchestrator behavior. |
| tests/integration/test_gpu_management.py | Narrows bare except to Exception and cleans imports. |
| tests/integration/test_docker_integration.py | Updates build-arg quoting assertions to match shlex.quote behavior. |
| tests/integration/test_container_execution.py | Removes unused imports. |
| tests/integration/test_console_integration.py | Adds live-output integration coverage for Console(sh, live_output=True). |
| tests/fixtures/utils.py | Narrows bare except clauses to Exception in cleanup helpers. |
| tests/e2e/test_scripting_workflows.py | Removes unused imports for e2e suite cleanliness. |
| tests/e2e/test_run_workflows.py | Removes a pass-only duplicate test and unused imports. |
| tests/e2e/test_profiling_workflows.py | Removes unused imports. |
| tests/e2e/test_execution_features.py | Removes unused import. |
| tests/e2e/test_data_workflows.py | Removes unused import. |
| tests/e2e/test_build_workflows.py | Removes unused import. |
| src/madengine/utils/session_tracker.py | Counts only non-empty CSV lines; removes unused session marker APIs. |
| src/madengine/utils/rocm_tool_manager.py | Preserves primary stderr across fallback execution for clearer errors. |
| src/madengine/utils/ops.py | Removes unused dictionary substring helpers and related imports. |
| src/madengine/utils/log_formatting.py | Fixes truncation message to match tail() behavior; removes dead helper. |
| src/madengine/utils/gpu_validator.py | Removes unused typing imports. |
| src/madengine/utils/gpu_config.py | Replaces print() output with structured logging. |
| src/madengine/utils/discover_models.py | Quotes shell cp paths and replaces assert with TypeError for validation. |
| src/madengine/utils/config_parser.py | Fixes directory-walk boundary condition for stop-dir matching. |
| src/madengine/scripts/common/tools/rocm_smi_utils.py | Narrows bare except to Exception. |
| src/madengine/scripts/common/tools/amd_smi_utils.py | Narrows bare except to Exception. |
| src/madengine/reporting/update_perf_csv.py | Avoids pandas single-row DataFrame errors by dropping non-scalar fields like configs. |
| src/madengine/reporting/csv_to_email.py | Corrects return typing to Optional[str] when no CSVs exist. |
| src/madengine/orchestration/run_orchestrator.py | Switches to shared load_credentials() and removes legacy filtering/dead code. |
| src/madengine/orchestration/build_orchestrator.py | Uses shared load_credentials() and raises ConfigurationError for CLI handling. |
| src/madengine/execution/docker_builder.py | Uses shared registry login; hardens build-arg quoting and shell path quoting. |
| src/madengine/execution/container_runner.py | Delegates registry login and hardens env var quoting and exception handling. |
| src/madengine/deployment/slurm.py | Stops assuming success when job status can’t be verified; introduces UNKNOWN handling. |
| src/madengine/deployment/kubernetes.py | Removes unused imports and simplifies error imports. |
| src/madengine/deployment/factory.py | Emits a warning when Kubernetes deps aren’t installed instead of silently skipping. |
| src/madengine/deployment/config_loader.py | Removes unused import. |
| src/madengine/deployment/common.py | Adds lru_cache for rocprofv3 availability checks. |
| src/madengine/deployment/base.py | Adds DeploymentStatus.UNKNOWN and adjusts monitor termination + perf.csv writing logic. |
| src/madengine/core/timeout.py | Fixes `Timeout(None |
| src/madengine/core/errors.py | Renames error classes and removes legacy aliases; trims unused imports. |
| src/madengine/core/docker.py | Refactors container existence check and improves quoting/default arg safety. |
| src/madengine/core/dataprovider.py | Removes debug prints. |
| src/madengine/core/context.py | Removes debug prints and updates NUMA balancing return typing. |
| src/madengine/core/auth.py | New shared module for credentials + registry login using env-based password passing. |
| src/madengine/cli/validators.py | Initializes dockerfile_matched to avoid referencing before assignment on exceptions. |
| src/madengine/cli/utils.py | Tweaks performance formatting for small values. |
| src/madengine/cli/constants.py | Makes ExitCode an IntEnum. |
| src/madengine/cli/commands/run.py | Implements --timeout 0 → no-timeout sentinel and passes through run stack. |
| src/madengine/cli/app.py | Reports package version from metadata instead of hard-coding. |
| pyproject.toml | Removes runtime deps that don’t belong in core install (e.g., pytest). |
| CLAUDE.md | Adds an architecture + contributor workflow guide for future maintainers. |
Comments suppressed due to low confidence (2)
src/madengine/core/errors.py:199
- Similarly,
TimeoutErrorwas replaced withDeploymentTimeoutErrorand the previousRuntimeError = ExecutionErrorcompatibility alias was removed. If these names are part of the public API, this breaks callers and also conflicts with the PR description mentioning backward compatibility. Consider reintroducing the old names as deprecated aliases to the new classes.
class ConfigurationError(MADEngineError):
"""Configuration and setup errors."""
def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwargs):
super().__init__(
message,
ErrorCategory.CONFIGURATION,
context,
recoverable=True,
**kwargs
)
class DeploymentTimeoutError(MADEngineError):
"""Timeout and duration errors."""
def __init__(self, message: str, context: Optional[ErrorContext] = None, **kwargs):
super().__init__(
message,
ErrorCategory.TIMEOUT,
context,
recoverable=True,
**kwargs
)
src/madengine/core/docker.py:115
container_nameis still used unquoted in thedocker run --name ...command, and again in thedocker ps -aqf 'name=...'lookup. Ifcontainer_namecontains spaces/quotes or regex metacharacters (it’s derived from image refs and can include.), this can break container creation/lookup or enable shell injection viaconsole.sh(shell=True). Use a safely quoted name consistently (and for lookups, preferdocker container inspector an exact/escaped filter).
for evar in envVars.keys():
command += "-e " + evar + "=" + shlex.quote(str(envVars[evar])) + " "
command += "--workdir /myworkspace/ "
command += "--name " + container_name + " "
command += image + " "
# Use 'cat' to keep container alive (blocks waiting for stdin)
# Works reliably across all deployment types (local, k8s, slurm)
# with fresh image pulls preventing corrupted layer issues
command += "cat "
self.console.sh(command)
# find container sha
self.docker_sha = self.console.sh(
"docker ps -aqf 'name=" + container_name + "' "
)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- docker.py: apply re.escape() to container name before embedding in Docker --filter regex so metacharacters (e.g. ".") are treated as literals, preventing false-positive container matches - run.py: include 0 in timeout validation error message; introduce timeout_display so panels show "disabled" instead of "Nones" when --timeout 0 is used - auth.py: fix login_to_registry signature from registry: str to registry: Optional[str], matching actual call sites and implementation Co-Authored-By: Claude Sonnet 4 <[email protected]>
Resolved conflict in tests/unit/test_container_runner.py by accepting develop's added imports (pytest, PERFORMANCE_LOG_PATTERN) which are required by the test file. Co-Authored-By: Claude Sonnet 4 <[email protected]>
…view-fix Co-Authored-By: Claude Sonnet 4 <[email protected]>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 60 out of 60 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
src/madengine/core/docker.py:106
container_nameis still interpolated unquoted into thedocker runcommand (--name ...). SinceConsole.shexecutes withshell=True, a container name containing spaces or shell metacharacters can break the command or be used for injection. Quote/sanitize the name (e.g.,shlex.quote(container_name)) consistently everywhere it's used in shell commands.
command += "-e " + evar + "=" + shlex.quote(str(envVars[evar])) + " "
command += "--workdir /myworkspace/ "
command += "--name " + container_name + " "
command += image + " "
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Validate env var names against POSIX regex before injecting into docker commands (docker.py, container_runner.py) - Allow None/0 timeout in Timeout class to disable signal-based timeout - Retry sacct up to 3 times with 5s delay to handle transient SLURM accounting DB lag after job completion Co-Authored-By: Claude Sonnet 4 <[email protected]>
Summary
This branch addresses a set of v2 code review items across the source and test layers:
core/auth.pymodule — extractsload_credentialsandlogin_to_registryfrom
BuildOrchestrator,RunOrchestrator,DockerBuilder, andContainerRunner,eliminating ~120 lines of duplicated login logic. Both callers delegate with
raise_on_failure=True/Falserespectively.MAD_REGISTRY_PASSWORDenv var instead of
printfargv (prevents process-list exposure);docker pscontainer existence check uses
--filter name=^/<name>$instead of grep toavoid regex metachar false positives; build-arg keys/values are wrapped with
str()beforeshlex.quote().find_and_replace_pattern,substring_found(ops.py),highlight_log_section(log_formatting.py),SessionTracker.get_session_startand
load_marker,_filter_images_by_dockerfile_context(RunOrchestrator),and unused imports across core, deployment, execution, orchestration, and utils.
--timeout 0now correctly disables the timeout; previouslyTimeout.__enter__calledsignal.alarm(None)because theNonesentinelhad no early-return guard.
try/assert False/exceptantipattern replaced withpytest.raises(match=...); bareexcept:narrowed toexcept Exception:; pass-only and duplicate tests removed;misnamed
@pytest.fixture test_dirrenamed to prevent pytest collecting fixturesas test functions;
test_profiling_tools_configmoved unit→integration,test_errorsmoved integration→unit;test_sh_live_outputadded forConsole(live_output=True)path.Test plan
pytest tests/unit/ -v— 365 unit tests passpytest tests/integration/ -v— 68 integration tests pass (excluding GPU-hardware-gated tests)pytest tests/e2e/ -v— requires built Docker image (dummyfixture); run on GPU machinemadengine run --timeout 0 ...does not raise a signal errorps auxoutput