Skip to content

ci: Improve Docker cleanup for test jobs#5438

Merged
AntoineRichard merged 4 commits intoisaac-sim:developfrom
AntoineRichard:antoiner/ci-docker-cleanup-threshold
Apr 29, 2026
Merged

ci: Improve Docker cleanup for test jobs#5438
AntoineRichard merged 4 commits intoisaac-sim:developfrom
AntoineRichard:antoiner/ci-docker-cleanup-threshold

Conversation

@AntoineRichard
Copy link
Copy Markdown
Collaborator

Summary

  • add pre-test Docker storage cleanup before image pull and test startup
  • switch to aggressive cleanup at 85% Docker storage usage, including stopped containers, unused builder cache, and images older than 24h
  • fail early with a clear runner disk-space error if pre-test cleanup cannot recover enough space

Tests

  • ./isaaclab.sh -p -m pytest .github/actions/run-package-tests/test_cleanup_docker_storage.py -q
  • ./isaaclab.sh -f

Add a pre-test Docker storage cleanup step for package test jobs so self-hosted runners can reclaim space before image pulls and docker run start.

Use threshold-based aggressive cleanup at 85% Docker storage usage, pruning stopped containers, unused builder cache, and images older than 24h while keeping the existing 72h image cleanup for lower usage.
Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This PR adds pre-test Docker storage cleanup to CI test jobs, implementing a two-tier cleanup strategy (conservative at <85% usage, aggressive at ≥85%) with early failure when space cannot be recovered. The implementation is well-structured with proper testing via fake df/docker binaries, and the cleanup correctly runs before image pull to prevent pull failures.

Architecture Impact

Self-contained CI infrastructure change. The new cleanup script is invoked from action.yml at two points: pre-test (with fail-early enabled) and post-job (fail-safe). No impact on core Isaac Lab code. The script correctly pins the IsaacSim base image during pruning to prevent accidental removal of the image needed for tests.

Implementation Verdict

Minor fixes needed

Test Coverage

Good test coverage for the new script: tests cover aggressive pruning above threshold, conservative pruning below threshold, fail-early behavior, and action.yml ordering verification. The tests use clever fake binaries to isolate from real Docker. However, one edge case is missing.

CI Status

No CI checks available yet — this is a CI infrastructure PR so it needs manual verification on self-hosted runners.

Findings

🔵 Improvement: cleanup_docker_storage.sh:99 — Error output sent to stderr twice
The echo "::error::..." line uses >&2 to redirect to stderr, but the ::error:: annotation is a GitHub Actions workflow command that works on stdout. While the message will still appear in stderr and GitHub will still process the annotation, this is inconsistent. Consider removing >&2 for the annotation line or separate the annotation from the error message:

echo "::error::Docker storage remains at ${final_usage_percent}% after cleanup."
echo "The self-hosted runner needs more free Docker storage before tests can run." >&2

🔵 Improvement: cleanup_docker_storage.sh:8 — Missing set -e could mask cleanup failures
The script uses set -u but not set -e. While individual commands have || true guards, the main function's arithmetic comparisons on lines 86 and 95-97 could fail silently if usage_percent is non-numeric (e.g., empty after df parsing edge cases). The guard on line 73 only checks for empty string, not invalid input. Consider adding validation:

if ! [[ "${usage_percent}" =~ ^[0-9]+$ ]]; then
    echo "[${CLEANUP_CONTEXT}] Invalid usage percent: ${usage_percent}; running conservative cleanup."
    ...
fi

🔵 Improvement: test_cleanup_docker_storage.py:101-105 — Missing test for empty IsaacSim image argument
The script has logic to handle empty ISAACSIM_IMAGE (line 31-34 of cleanup script), but there's no test covering this path. Add a test that passes an empty string as the first argument to verify the "skipping image pin" behavior works correctly.

🟡 Warning: cleanup_docker_storage.sh:36-37 — Race condition if pin container already exists
If a previous job crashed or was cancelled between docker create and the trap cleanup, the pin container might already exist, causing docker create to fail silently (suppressed by || true). This leaves the image unprotected during prune. The unique container names with run ID help, but stale containers from failed runs could accumulate. Consider adding explicit cleanup of any existing pin container before creating:

docker rm "${ISAACSIM_IMAGE_PIN_CONTAINER}" >/dev/null 2>&1 || true
docker create --name "${ISAACSIM_IMAGE_PIN_CONTAINER}" "${ISAACSIM_IMAGE}" true >/dev/null 2>&1 || true

🔵 Improvement: action.yml:86-91 — Hardcoded path duplicated from script defaults
The action invokes the script with explicit image/container arguments, but relies on the script's internal defaults for DOCKER_STORAGE_PATH and DOCKER_CLEANUP_THRESHOLD_PERCENT. This is fine for now, but if these ever need to be configurable at the action level, consider exposing them as inputs. Low priority since current defaults are reasonable.

🔵 Improvement: test_cleanup_docker_storage.py:46-47 — Fake df script doesn't handle all df invocations
The test's fake df handles -P and human-readable modes, but the cleanup script calls df -h on line 26. The fake script's else branch returns the correct format, but uses {first_usage}% which doesn't change between calls. This means print_docker_storage_usage() always shows first_usage even after cleanup, which could mask bugs in tests. Consider having the -h path also respect the call count for consistency.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 29, 2026

Greptile Summary

This PR refactors Docker storage cleanup into a reusable cleanup_docker_storage.sh script and adds a pre-test invocation (with FAIL_IF_STILL_HIGH=true) before the image pull, replacing the previous inline post-job-only cleanup. The two issues raised in earlier review threads — the FAIL_IF_STILL_HIGH bypass when df is unavailable, and the missing age filter on docker container prune — are both addressed in this revision.

Confidence Score: 5/5

Safe to merge — no blocking issues found; prior review concerns have been resolved.

All three files are well-structured and covered by tests. The previously flagged df-unavailable bypass and missing container prune age filter are both fixed. No P0 or P1 issues were identified in this pass.

No files require special attention.

Important Files Changed

Filename Overview
.github/actions/run-package-tests/cleanup_docker_storage.sh New cleanup script: reads Docker storage usage, applies conservative (<85%) or aggressive (≥85%) pruning, fails early when FAIL_IF_STILL_HIGH=true and space cannot be recovered; both previously-flagged issues (df-unavailable bypass, missing container prune age filter) are addressed.
.github/actions/run-package-tests/action.yml Adds a pre-test 'Prepare Docker disk space' step (FAIL_IF_STILL_HIGH=true) before the image pull, and replaces inline cleanup logic in the post-job step with a call to the shared script (FAIL_IF_STILL_HIGH=false).
.github/actions/run-package-tests/test_cleanup_docker_storage.py Five pytest tests covering aggressive/conservative branch selection, fail-early on unrecoverable space, Docker-root-dir fallback, and df-unavailable path; fake docker/df stubs are well-structured.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([cleanup_docker_storage.sh]) --> B[get_docker_storage_usage_percent\ndf -P DOCKER_STORAGE_PATH]
    B --> C{usage_percent\nempty?}
    C -- yes --> D[resolve_docker_storage_path\ndocker info --format DockerRootDir]
    D --> E[retry get_docker_storage_usage_percent]
    E --> F{still empty?}
    F -- yes --> G[run_conservative_cleanup\nimage prune -a --until 72h]
    G --> H{FAIL_IF_STILL_HIGH?}
    H -- true --> I([exit 1\ndisk space unmeasurable])
    H -- false --> J([exit 0])
    F -- no --> K{usage >= 85%?}
    C -- no --> K
    K -- yes --> L[run_aggressive_cleanup\ncontainer prune --until 24h\nbuilder prune --until 24h\nimage prune -a --until 24h]
    K -- no --> M[run_conservative_cleanup\nimage prune -a --until 72h]
    L --> N[get final usage]
    M --> N
    N --> O{FAIL_IF_STILL_HIGH\nAND final >= 85%?}
    O -- yes --> P([exit 1\nstill too full])
    O -- no --> Q([exit 0])
Loading

Reviews (2): Last reviewed commit: "ci: Handle missing Docker storage path" | Re-trigger Greptile

Comment thread .github/actions/run-package-tests/cleanup_docker_storage.sh
Comment thread .github/actions/run-package-tests/cleanup_docker_storage.sh Outdated
Resolve DockerRootDir from docker info when the configured Docker storage path cannot be measured, and fail pre-test cleanup if storage usage is still unavailable.

Also age-filter stopped container pruning to match the other aggressive cleanup commands.
@AntoineRichard
Copy link
Copy Markdown
Collaborator Author

@greptile

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This is a follow-up review. The new commit (c475a78) appears to be a rebase or minor update — the diff and file contents are identical to what was previously reviewed.

Implementation Verdict

Ship it

Findings

No new issues introduced. The previous findings remain valid but are all 🔵 improvements or a minor 🟡 warning — none are blockers. The core functionality is correct: pre-test cleanup runs before image pull, aggressive pruning triggers at 85%, and the script fails early when space cannot be recovered.

The PR is ready to merge. The previous suggestions (stderr annotation consistency, race condition on pin container, missing test for empty image arg) are nice-to-haves that can be addressed in follow-up if desired.

Copy link
Copy Markdown

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Isaac Lab Review Bot

Summary

This is a follow-up review on commit 23b3bf3. The changes are identical to the previously reviewed commit — this appears to be a rebase with no functional changes to the cleanup script, tests, or action.yml.

Implementation Verdict

Ship it

Findings

No new issues. The previous review's verdict stands: the core functionality is correct and the PR is ready to merge. The optional improvements mentioned previously (stderr annotation consistency, race condition on pin container name collisions, test for empty image arg) remain nice-to-haves for follow-up work.

@AntoineRichard AntoineRichard merged commit ad399b8 into isaac-sim:develop Apr 29, 2026
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants