ENH: Automatic GitHub Actions cache cleanup for closed/merged PRs #6112

Merged
hjmjohnson merged 1 commit into InsightSoftwareConsortium:main from hjmjohnson:cleanup-pr-caches
Apr 24, 2026

@hjmjohnson
Member

Adds two complementary workflows that free the repository's 10 GB GitHub Actions cache budget automatically, so cache saves in long-running PRs (ccache, ExternalData, etc.) don't silently fail with "Cache reservation failed: You have reached your configured budget, your cache is now read only".

Pattern follows the GitHub documentation for force-deleting cache entries: https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#force-deleting-cache-entries.

Motivation

ITK's per-repo Actions cache is capped at 10 GB (GitHub default). A single Pixi+ARM cycle produces roughly:

| Cache | Typical size |
|---|---|
| `ccache-v4-Linux-pixi-cxx-<sha>` | ~2.2 GB |
| `ccache-v4-Windows-pixi-cxx-<sha>` | ~3.6 MB |
| `ccache-v4-macOS-pixi-cxx-<sha>` | ~1.6 GB |
| `ccache-v4-Linux-Ubuntu-24.04-arm-<sha>` | ~330 MB × multiple |
| `ccache-v4-macOS-*-<sha>` | ~1.6 GB |
| `externaldata-v<ver>-<sha>` (from #6109) | ~0.5–1 GB |

Multiplied by open PRs, the 10 GB cap is reached quickly. On PR #6109's dffb9bd1fd run, every Pixi and ARM cache save failed for this reason, despite the caching being wired correctly.

Manually sweeping closed-PR caches frees ~6 GB at a time, but that's not sustainable. These two workflows automate it.

What each workflow does

.github/workflows/cleanup-pr-caches.yml (event-driven)

  • Trigger: pull_request: types: [closed]
  • Action: enumerate caches for refs/pull/<N>/merge and DELETE each
  • Typical runtime: 2–5 seconds
  • Permissions: actions: write, contents: read
  • No overhead on regular build/test runs.
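
The on-close steps above can be sketched in shell. This is a minimal sketch under stated assumptions, not the workflow's actual script: the function names `pr_merge_ref` and `delete_pr_caches` are hypothetical, and the `gh` CLI is assumed to be authenticated with a token that carries `actions: write`. The list and delete calls use the REST endpoints named in this description.

```bash
#!/usr/bin/env bash
# Sketch of the on-close cleanup logic; function names are illustrative only.

# Build the merge ref that scopes a pull request's caches.
pr_merge_ref() {
  printf 'refs/pull/%s/merge' "$1"
}

# Delete every cache entry scoped to one PR's merge ref.
# Usage: delete_pr_caches <owner/repo> <pr-number>
delete_pr_caches() {
  local repo="$1" ref
  ref="$(pr_merge_ref "$2")"
  # List cache IDs for this ref only, then delete them one at a time.
  gh api --paginate "repos/${repo}/actions/caches?ref=${ref}" \
    --jq '.actions_caches[].id' |
  while read -r cache_id; do
    # A failed delete (for example, an entry that is already gone) is reported
    # but does not abort the loop.
    gh api --method DELETE "repos/${repo}/actions/caches/${cache_id}" ||
      echo "could not delete cache ${cache_id}" >&2
  done
}
```

Something like `delete_pr_caches InsightSoftwareConsortium/ITK 6093` would then sweep one PR. When the listing returns no IDs, the loop body never runs and the function exits 0, which is what makes a retry harmless.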

.github/workflows/cleanup-stale-caches-nightly.yml (safety net)

  • Trigger: cron: '0 6 * * *' (06:00 UTC) + workflow_dispatch
  • Action: scan every refs/pull/<N>/merge cache; for each, check PR state; if state != OPEN and closedAt < now - 3 days, delete
  • The 3-day grace period lets anyone spot-rerun a just-merged PR before its caches vanish
  • Catches edge cases the on-close workflow misses (outages, pre-existing orphans, unusual close paths)
  • Permissions: actions: write, pull-requests: read
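
The nightly decision rule (state is not OPEN and the PR closed more than 3 days ago) can be sketched as two small shell helpers. This is an illustration only: the names `pr_number_from_ref` and `is_stale` are hypothetical, and the timestamp parsing assumes GNU `date` (its `-d` option), as found on Ubuntu runners.

```bash
#!/usr/bin/env bash
# Sketch of the nightly staleness check; function names are illustrative only.

# Extract the PR number from a merge ref such as refs/pull/6093/merge.
pr_number_from_ref() {
  local n="${1#refs/pull/}"
  printf '%s' "${n%/merge}"
}

# Return 0 (stale) when the PR is not open and closed more than
# <grace-days> ago; return 1 otherwise.
# Usage: is_stale <state> <closedAt ISO-8601> [grace-days] [now-epoch]
is_stale() {
  local state="$1" closed_at="$2" grace_days="${3:-3}"
  local now="${4:-$(date -u +%s)}"
  local closed_epoch
  # Open PRs, or PRs without a close timestamp, are never stale.
  [ "$state" != "OPEN" ] || return 1
  [ -n "$closed_at" ] || return 1
  # Assumption: GNU date, which parses ISO-8601 timestamps via -d.
  closed_epoch="$(date -u -d "$closed_at" +%s)" || return 1
  [ "$closed_epoch" -lt $(( now - grace_days * 86400 )) ]
}
```

The state and close time for each PR number could come from `gh pr view <N> --json state,closedAt`; a cache would be deleted only when `is_stale` succeeds, so a PR merged yesterday keeps its caches until the grace period has passed.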

Robustness properties

  • Ref-scoped delete. DELETE /repos/{owner}/{repo}/actions/caches/{cache-id} targets one cache entry at a time; the enumeration is filtered by ref=refs/pull/N/merge, so refs/heads/main or other PR refs can never be touched by the on-close path.
  • Idempotent. A second invocation (retry, duplicate event) finds 0 matching caches and exits 0.
  • Works for PRs from forks. The pull_request event runs in the upstream repo context with the upstream's GITHUB_TOKEN, which has cache delete permissions on the upstream's cache store (where the caches actually live).
  • Terminal. A reopened PR gets fresh cache entries keyed on new commits; deletions target the previous-closure era only.
  • Permissions-minimal. On-close: actions: write, contents: read. Nightly: adds pull-requests: read. Neither can push code, comment, or modify anything outside Actions storage.

Verification

Manually dry-run the on-close logic against a recently closed PR to confirm that it finds the caches it would remove (the command below only lists them):

```bash
PR_REF=refs/pull/6093/merge # merged 2026-04-23
gh api "repos/InsightSoftwareConsortium/ITK/actions/caches?ref=${PR_REF}" \
  --jq '.actions_caches[] | {id, key, size_mb: (.size_in_bytes / 1048576 | floor)}'
```

(For this PR, #6093's caches had already been swept manually during the investigation that motivated the change.)

After merge, an end-to-end verification is to close a test PR and confirm its caches disappear within seconds.
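
The end-to-end check could be scripted along these lines. This is a hedged sketch: `remaining_pr_caches` and `verify_pr_caches_gone` are hypothetical helper names, and it relies on the `total_count` field returned by the cache-listing endpoint.

```bash
#!/usr/bin/env bash
# Sketch of a post-close check; function names are illustrative only.

# Print how many cache entries are still scoped to a PR's merge ref.
# Usage: remaining_pr_caches <owner/repo> <pr-number>
remaining_pr_caches() {
  gh api "repos/$1/actions/caches?ref=refs/pull/$2/merge" --jq '.total_count'
}

# Succeed only when no cache entries remain for the PR.
# Usage: verify_pr_caches_gone <owner/repo> <pr-number>
verify_pr_caches_gone() {
  local count
  # Exit status 2 signals that the listing call itself failed.
  count="$(remaining_pr_caches "$1" "$2")" || return 2
  if [ "$count" -eq 0 ]; then
    echo "PR #$2: no caches remain"
  else
    echo "PR #$2: ${count} cache(s) remain" >&2
    return 1
  fi
}
```

Running `verify_pr_caches_gone` with the repository and the test PR's number shortly after closing it would report whether the on-close workflow did its job.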

Ecosystem impact (follow-up)

The same pattern could be added to the shared InsightSoftwareConsortium/ITKRemoteModuleBuildTestPackageAction workflows so all 20+ remote-module repos inherit cache cleanup in one change. Deferred to a separate PR per user request.

github-actions bot added the type:Infrastructure and type:Enhancement labels on Apr 23, 2026
@hjmjohnson
Member Author

ITK.Linux failed at 12m on a PR that only adds two .github/workflows/*.yml files — no C++ / CMake / test changes. This is the known Azure dashboard-script flake (ci_completed_successfully treating build warnings as fatal, itk_common.cmake:628). Re-triggering.

/azp run ITK.Linux

@hjmjohnson
Member Author

@greptileai review this draft before I make it official

@hjmjohnson
Member Author

/azp run ITK.Linux

@greptile-apps

This comment was marked as resolved.

@hjmjohnson hjmjohnson marked this pull request as ready for review April 23, 2026 17:49
Add two complementary workflows that free the repository's 10 GB
Actions-cache budget without adding overhead to regular CI:

  - cleanup-pr-caches.yml         event-driven, fires on PR close
  - cleanup-stale-caches-nightly.yml  scheduled sweep, 3-day grace

The event-driven workflow deletes every cache scoped to the closed
PR's merge ref within seconds of close, using the minimal permission
set (actions: write, contents: read).  The nightly sweep is the safety
net: it catches caches orphaned during a cleanup-workflow outage, or
those pre-dating this workflow.

Motivation: ITK regularly hits the 10 GB per-repo cache cap because
ccache entries (2-3 GB per platform, 3 platforms) accumulate across
open and closed PRs.  Once the cap is reached, GitHub silently rejects
all subsequent cache saves with "Cache reservation failed: you have
reached your configured budget, your cache is now read only".  This
manifested on PR InsightSoftwareConsortium#6109's Pixi-Cxx + ARMBUILD runs where both the
ccache and the new externaldata-* saves failed, even though the
workflows are correctly wired.

Pattern follows the GitHub documentation for force-deleting cache
entries:
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#force-deleting-cache-entries

Robustness properties of the on-close workflow:
  - ref-scoped delete (refs/pull/N/merge) cannot touch refs/heads/main
    or other PR refs
  - idempotent: re-running finds 0 caches and exits 0
  - works for PRs from forks (runs in upstream context with fork's
    PR number and upstream's GITHUB_TOKEN)
  - closed state is terminal: a reopened PR gets fresh cache entries
    tied to new commits; deletions target the previous-closure era
@hjmjohnson
Member Author

@dzenanz A little more house cleaning. This is somewhat orthogonal to the other cache cleanups, but helps to make sense of the caches that are retained. After a merge is done, the cache for that PR can never be reused, so we should not keep it around.

@dzenanz
Member

dzenanz commented Apr 23, 2026

Description sounds good. I did not take a close look at the code.

@hjmjohnson
Member Author

@dzenanz This is recommended to keep costs down (a saving of a few GB at $0.008/GB/day, so not much). I like it mostly because it removes "dead weight" caches that can never be hit again.

@hjmjohnson hjmjohnson merged commit dc5f376 into InsightSoftwareConsortium:main Apr 24, 2026
18 checks passed