ci: Improve Docker cleanup for test jobs#5438
ci: Improve Docker cleanup for test jobs#5438AntoineRichard merged 4 commits intoisaac-sim:developfrom
Conversation
Add a pre-test Docker storage cleanup step for package test jobs so self-hosted runners can reclaim space before image pulls and docker run start. Use threshold-based aggressive cleanup at 85% Docker storage usage, pruning stopped containers, unused builder cache, and images older than 24h while keeping the existing 72h image cleanup for lower usage.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This PR adds pre-test Docker storage cleanup to CI test jobs, implementing a two-tier cleanup strategy (conservative at <85% usage, aggressive at ≥85%) with early failure when space cannot be recovered. The implementation is well-structured with proper testing via fake df/docker binaries, and the cleanup correctly runs before image pull to prevent pull failures.
Architecture Impact
Self-contained CI infrastructure change. The new cleanup script is invoked from action.yml at two points: pre-test (with fail-early enabled) and post-job (fail-safe). No impact on core Isaac Lab code. The script correctly pins the IsaacSim base image during pruning to prevent accidental removal of the image needed for tests.
Implementation Verdict
Minor fixes needed
Test Coverage
Good test coverage for the new script: tests cover aggressive pruning above threshold, conservative pruning below threshold, fail-early behavior, and action.yml ordering verification. The tests use clever fake binaries to isolate from real Docker. However, one edge case is missing.
CI Status
No CI checks available yet — this is a CI infrastructure PR so it needs manual verification on self-hosted runners.
Findings
🔵 Improvement: cleanup_docker_storage.sh:99 — Error output sent to stderr twice
The echo "::error::..." line uses >&2 to redirect to stderr, but the ::error:: annotation is a GitHub Actions workflow command that works on stdout. While the message will still appear in stderr and GitHub will still process the annotation, this is inconsistent. Consider removing >&2 for the annotation line or separate the annotation from the error message:
echo "::error::Docker storage remains at ${final_usage_percent}% after cleanup."
echo "The self-hosted runner needs more free Docker storage before tests can run." >&2🔵 Improvement: cleanup_docker_storage.sh:8 — Missing set -e could mask cleanup failures
The script uses set -u but not set -e. While individual commands have || true guards, the main function's arithmetic comparisons on lines 86 and 95-97 could fail silently if usage_percent is non-numeric (e.g., empty after df parsing edge cases). The guard on line 73 only checks for empty string, not invalid input. Consider adding validation:
if ! [[ "${usage_percent}" =~ ^[0-9]+$ ]]; then
echo "[${CLEANUP_CONTEXT}] Invalid usage percent: ${usage_percent}; running conservative cleanup."
...
fi🔵 Improvement: test_cleanup_docker_storage.py:101-105 — Missing test for empty IsaacSim image argument
The script has logic to handle empty ISAACSIM_IMAGE (line 31-34 of cleanup script), but there's no test covering this path. Add a test that passes an empty string as the first argument to verify the "skipping image pin" behavior works correctly.
🟡 Warning: cleanup_docker_storage.sh:36-37 — Race condition if pin container already exists
If a previous job crashed or was cancelled between docker create and the trap cleanup, the pin container might already exist, causing docker create to fail silently (suppressed by || true). This leaves the image unprotected during prune. The unique container names with run ID help, but stale containers from failed runs could accumulate. Consider adding explicit cleanup of any existing pin container before creating:
docker rm "${ISAACSIM_IMAGE_PIN_CONTAINER}" >/dev/null 2>&1 || true
docker create --name "${ISAACSIM_IMAGE_PIN_CONTAINER}" "${ISAACSIM_IMAGE}" true >/dev/null 2>&1 || true🔵 Improvement: action.yml:86-91 — Hardcoded path duplicated from script defaults
The action invokes the script with explicit image/container arguments, but relies on the script's internal defaults for DOCKER_STORAGE_PATH and DOCKER_CLEANUP_THRESHOLD_PERCENT. This is fine for now, but if these ever need to be configurable at the action level, consider exposing them as inputs. Low priority since current defaults are reasonable.
🔵 Improvement: test_cleanup_docker_storage.py:46-47 — Fake df script doesn't handle all df invocations
The test's fake df handles -P and human-readable modes, but the cleanup script calls df -h on line 26. The fake script's else branch returns the correct format, but uses {first_usage}% which doesn't change between calls. This means print_docker_storage_usage() always shows first_usage even after cleanup, which could mask bugs in tests. Consider having the -h path also respect the call count for consistency.
Greptile SummaryThis PR refactors Docker storage cleanup into a reusable Confidence Score: 5/5Safe to merge — no blocking issues found; prior review concerns have been resolved. All three files are well-structured and covered by tests. The previously flagged df-unavailable bypass and missing container prune age filter are both fixed. No P0 or P1 issues were identified in this pass. No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A([cleanup_docker_storage.sh]) --> B[get_docker_storage_usage_percent\ndf -P DOCKER_STORAGE_PATH]
B --> C{usage_percent\nempty?}
C -- yes --> D[resolve_docker_storage_path\ndocker info --format DockerRootDir]
D --> E[retry get_docker_storage_usage_percent]
E --> F{still empty?}
F -- yes --> G[run_conservative_cleanup\nimage prune -a --until 72h]
G --> H{FAIL_IF_STILL_HIGH?}
H -- true --> I([exit 1\ndisk space unmeasurable])
H -- false --> J([exit 0])
F -- no --> K{usage >= 85%?}
C -- no --> K
K -- yes --> L[run_aggressive_cleanup\ncontainer prune --until 24h\nbuilder prune --until 24h\nimage prune -a --until 24h]
K -- no --> M[run_conservative_cleanup\nimage prune -a --until 72h]
L --> N[get final usage]
M --> N
N --> O{FAIL_IF_STILL_HIGH\nAND final >= 85%?}
O -- yes --> P([exit 1\nstill too full])
O -- no --> Q([exit 0])
Reviews (2): Last reviewed commit: "ci: Handle missing Docker storage path" | Re-trigger Greptile |
Resolve DockerRootDir from docker info when the configured Docker storage path cannot be measured, and fail pre-test cleanup if storage usage is still unavailable. Also age-filter stopped container pruning to match the other aggressive cleanup commands.
|
@greptile |
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This is a follow-up review. The new commit (c475a78) appears to be a rebase or minor update — the diff and file contents are identical to what was previously reviewed.
Implementation Verdict
Ship it
Findings
No new issues introduced. The previous findings remain valid but are all 🔵 improvements or a minor 🟡 warning — none are blockers. The core functionality is correct: pre-test cleanup runs before image pull, aggressive pruning triggers at 85%, and the script fails early when space cannot be recovered.
The PR is ready to merge. The previous suggestions (stderr annotation consistency, race condition on pin container, missing test for empty image arg) are nice-to-haves that can be addressed in follow-up if desired.
There was a problem hiding this comment.
🤖 Isaac Lab Review Bot
Summary
This is a follow-up review on commit 23b3bf3. The changes are identical to the previously reviewed commit — this appears to be a rebase with no functional changes to the cleanup script, tests, or action.yml.
Implementation Verdict
Ship it
Findings
No new issues. The previous review's verdict stands: the core functionality is correct and the PR is ready to merge. The optional improvements mentioned previously (stderr annotation consistency, race condition on pin container name collisions, test for empty image arg) remain nice-to-haves for follow-up work.
Summary
Tests