
ENH: Cache ExternalData object store across CI runs #6109

Merged: hjmjohnson merged 2 commits into InsightSoftwareConsortium:main from hjmjohnson:cache-externaldata-object-stores on Apr 25, 2026

Conversation

hjmjohnson (Member) commented Apr 23, 2026

Adds a persistent per-repo cache of the ExternalData object store to all 9 CI configurations (2 GitHub Actions workflows + 7 Azure Pipelines). The release tarball becomes a warm-boot seed — downloaded on cold cache only, short-circuited on warm cache via a sentinel file.

What changed in each CI

The cache lives next to the existing ccache cache, as a sibling directory rather than colocated under CCACHE_DIR:

  • Cache location. $(Pipeline.Workspace)/ExternalData on Azure; ${{ runner.temp }}/ExternalData on GitHub Actions.
  • Cache key. One entry per ExternalDataVersion, shared across every workflow / job / OS in the repo. ExternalData blobs are platform-agnostic, so there's no reason to key on Agent.OS, matrix name, job name, etc.
    • Save key: externaldata-v<ExternalDataVersion>-<SHA> (immutable per run)
    • Restore key: externaldata-v<ExternalDataVersion>- (fall-through to latest prior)
    • Same ccache-style pattern: every run saves an immutable cache; the next run finds the most recent prior cache via restore-keys (see the sketch after this list).
  • Tarball seed is gated by a sentinel file inside the store (.seeded-v<ExternalDataVersion>), not by actions/cache's cache-hit output. cache-hit is 'true' only for an exact primary-key match; a restore-keys fallback still reports 'false' even though data was restored. The sentinel short-circuits the tarball download in the restore-keys-fallback case too.
  • ExternalData_OBJECT_STORES env var points CMake at the cache path. Set via GITHUB_ENV on GHA; via the top-level variables: block on Azure.
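A minimal sketch of the GHA half of this wiring, assuming the actions/cache/restore@v5 / save@v5 actions named later in this thread; the hard-coded 5.4.5 version, step names, and the !cancelled() guard are illustrative rather than the exact PR YAML:

```yaml
# Illustrative sketch, not the exact PR YAML; the real workflows
# parameterize ExternalDataVersion instead of hard-coding 5.4.5.
- name: Restore ExternalData object store
  id: restore-externaldata
  uses: actions/cache/restore@v5
  with:
    path: ${{ runner.temp }}/ExternalData
    key: externaldata-v5.4.5-${{ github.sha }}  # immutable save key
    restore-keys: |
      externaldata-v5.4.5-

- name: Export ExternalData_OBJECT_STORES
  shell: bash
  run: echo "ExternalData_OBJECT_STORES=${{ runner.temp }}/ExternalData" >> "$GITHUB_ENV"

- name: Save ExternalData object store
  if: ${{ !cancelled() }}
  uses: actions/cache/save@v5
  with:
    path: ${{ runner.temp }}/ExternalData
    key: externaldata-v5.4.5-${{ github.sha }}
```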
Why sentinel file and not cache-hit

The first revision of this PR used if: steps.restore-externaldata.outputs.cache-hit != 'true' to guard the download. That's subtly wrong:

  • cache-hit: 'true' → exact primary-key match (on my SHA-suffixed key, this is rare — only same-SHA reruns).
  • cache-hit: 'false' → either nothing restored or restored via a restore-keys fallback. The step runs in both cases.

With the SHA-keyed save pattern (required for immutable GHA caches that still need to grow per-run), almost every PR run falls into the second category: the fallback key matches, data is restored, but cache-hit='false' causes a redundant tarball download.

The version-tagged sentinel ($STORE/.seeded-v<ver>) directly reflects whether the store has already been seeded from the release tarball, regardless of how the store got populated. The seed step:

  1. Checks for the sentinel and exits 0 if present.
  2. Otherwise downloads the tarball, unpacks it, moves .ExternalData/CID/ into $STORE/CID, and creates the sentinel.

Works identically for cold-cache, exact-cache-hit, and restore-keys-fallback cases; a sketch of the bash variant follows.
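A sketch of that bash seed step under the same illustrative assumptions (5.4.5 literal; TARBALL_URL is a placeholder, not the real release asset URL):

```yaml
- name: Seed ExternalData object store from release tarball
  shell: bash
  env:
    TARBALL_URL: https://example.invalid/InsightData-5.4.5.tar.gz  # placeholder
  run: |
    STORE="${{ runner.temp }}/ExternalData"
    SENTINEL="$STORE/.seeded-v5.4.5"
    if [ -f "$SENTINEL" ]; then
      echo "ExternalData store already seeded for v5.4.5; skipping tarball download."
      exit 0
    fi
    curl -fsSL -o InsightData.tar.gz "$TARBALL_URL"
    cmake -E tar xfz InsightData.tar.gz
    cmake -E make_directory "$STORE"
    cmake -E rename InsightToolkit-5.4.5/.ExternalData/CID "$STORE/CID"
    : > "$SENTINEL"  # mark the store as seeded for this version
```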

Why sibling directory, not under CCACHE_DIR

Considered and rejected colocating ExternalData under CCACHE_DIR:

  1. ccache --cleanup walks the full $CCACHE_DIR tree; future ccache versions could start pruning unknown subdirs.
  2. CCACHE_MAXSIZE=5G would silently eat into ccache's usable budget.
  3. ccache's cache key includes ${{ github.sha }} (by design) — every PR commit would unnecessarily rewrite the whole ~6 GB bundle including ExternalData, which doesn't need SHA-granularity.

Sibling directories with separate caches let each invalidate on its own natural cadence: ccache per-SHA, ExternalData per-version.

Before and after behavior

| | Before | After |
| --- | --- | --- |
| Release tarball download | Every run | Only on cold cache (first run ever at a given ExternalDataVersion) |
| Tarball-download flake ("92 bytes received", a known GHA flake) | Bites every PR | Bites the first-ever run only |
| Ad-hoc mirror fetches for PR-new CIDs | Re-fetched every run | Persist to the next run via save/restore-keys |
| #6103 (Montage) / #612 (PerfBench) pipelines, where new CIDs aren't yet in the tarball | ❌ timeouts every run | ✅ fetched once, then cached |
| Cache warm across restore-keys fallback | N/A | ✅ via sentinel |
Files touched

| File | Jobs | Shell |
| --- | --- | --- |
| .github/workflows/arm.yml | 3 matrix jobs | shell: bash |
| .github/workflows/pixi.yml | 3 matrix jobs | shell: bash |
| Testing/ContinuousIntegration/AzurePipelinesBatch.yml | 1 | cmd (script:) |
| Testing/ContinuousIntegration/AzurePipelinesLinux.yml | 3 | bash |
| Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml | 1 | bash |
| Testing/ContinuousIntegration/AzurePipelinesMacOS.yml | 1 | bash |
| Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml | 1 | bash |
| Testing/ContinuousIntegration/AzurePipelinesWindows.yml | 1 | cmd |
| Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml | 1 | cmd |

cmd steps use `if exist` for the sentinel check and `cmake -E touch` to create it (`if exist` is a cmd builtin; `cmake -E touch` is a portable CMake primitive available on every Windows runner). Bash steps use `[ -f ... ]` and `: > file`.

mkdir -p was also replaced with cmake -E make_directory for Windows compatibility.
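For completeness, a sketch of the cmd variant on Azure; the 5.4.5 literal is again illustrative and the download/unpack commands are elided:

```yaml
- script: |
    if exist "$(Pipeline.Workspace)\ExternalData\.seeded-v5.4.5" (
      echo ExternalData store already seeded; skipping tarball download.
      exit /b 0
    )
    rem ... download and unpack the release tarball here ...
    cmake -E make_directory "$(Pipeline.Workspace)\ExternalData"
    cmake -E touch "$(Pipeline.Workspace)\ExternalData\.seeded-v5.4.5"
  displayName: Seed ExternalData object store from release tarball
```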

CMake plumbing (unchanged, referenced for review)

CMake/ITKExternalData.cmake already reads $ENV{ExternalData_OBJECT_STORES} at configure time (lines 5-14) and uses it as the default for the cache variable. This PR wires the env var; CMake-side logic is untouched.

@github-actions github-actions Bot added type:Infrastructure Infrastructure/ecosystem related changes, such as CMake or buildbots type:Enhancement Improvement of existing methods or implementation type:Testing Ensure that the purpose of a class is met/the results on a wide set of test cases are correct labels Apr 23, 2026
@hjmjohnson hjmjohnson force-pushed the cache-externaldata-object-stores branch from bea5538 to cba8c6b Compare April 23, 2026 12:18
@hjmjohnson hjmjohnson marked this pull request as ready for review April 23, 2026 12:43
greptile-apps Bot (Contributor) commented Apr 23, 2026

Greptile Summary

This PR adds a persistent ExternalData object store cache to all 9 CI configurations (2 GHA workflows + 7 Azure Pipelines), using a version-tagged sentinel file to correctly gate the release-tarball seed step regardless of whether the cache was restored via an exact key match or a restore-keys fallback. The implementation is consistent across platforms: bash steps use `: > "$SENTINEL"`, Windows cmd steps use `cmake -E touch`, and ExternalData_OBJECT_STORES is wired into CMake via $GITHUB_ENV (GHA) or the top-level variables: block (Azure).

Confidence Score: 5/5

Safe to merge; one edge-case P2 around rename-on-existing-CID is theoretically possible but requires an extremely tight failure window.

All findings are P2 or lower. The rename edge case requires a disk-full (or similar) failure between a successful cmake -E rename and : > "$SENTINEL" followed by a !cancelled() cache save — an extremely unlikely sequence. The rest of the design (sentinel logic, cache key strategy, cross-platform compatibility) is sound.

See .github/workflows/arm.yml and .github/workflows/pixi.yml for the bash rename guard; the same pattern appears in AzurePipelinesLinux.yml.

Important Files Changed

| Filename | Overview |
| --- | --- |
| .github/workflows/arm.yml | Adds ExternalData cache restore/save with sentinel-gated seeding; minor edge case where rename can fail if CID exists without sentinel. |
| .github/workflows/pixi.yml | Same ExternalData caching pattern as arm.yml; identical sentinel/rename logic and same edge case applies. |
| Testing/ContinuousIntegration/AzurePipelinesBatch.yml | Adds Cache@2 task and Windows cmd sentinel check; if exist guard and cmake -E touch for sentinel creation look correct. |
| Testing/ContinuousIntegration/AzurePipelinesLinux.yml | Three jobs each get Cache@2 + bash seeding; same rename-on-existing edge case as GHA files. |
| Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml | Adds Cache@2 + bash seeding; pattern is consistent with AzurePipelinesLinux. |
| Testing/ContinuousIntegration/AzurePipelinesMacOS.yml | Adds Cache@2 + bash seeding for macOS; same consistent pattern. |
| Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml | Adds Cache@2 + bash seeding for macOS Python; consistent with other Azure macOS pipeline. |
| Testing/ContinuousIntegration/AzurePipelinesWindows.yml | Windows cmd seeding uses if exist sentinel check and cmake -E touch; make_directory is correctly ordered before rename. |
| Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml | Same Windows cmd seeding pattern as AzurePipelinesWindows; correct and consistent. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CI job starts] --> B[Restore ExternalData cache\nvia restore-keys or exact hit]
    B --> C{Sentinel file present\nin object store?}
    C -- Yes --> D[Skip tarball download]
    C -- No --> E[Download release tarball via curl]
    E --> F[Extract tarball with cmake -E tar]
    F --> G[Create STORE dir with cmake -E make_directory]
    G --> H[Move CID dir into STORE with cmake -E rename]
    H --> I[Write sentinel file to STORE]
    I --> J[Export ExternalData_OBJECT_STORES env var]
    D --> J
    J --> K[CMake configure reads env var]
    K --> L[Build and Test\nnew CIDs fetched into STORE ad-hoc]
    L --> M[Save ExternalData cache\nif not cancelled\nimmutable key per commit]
```


dzenanz (Member) left a review comment on .github/workflows/arm.yml (thread now outdated):

Code generally looks good. Does it work?

hjmjohnson (Member, Author) commented:

@dzenanz — proof the cache wiring is live on the current CI run (HEAD dffb9bd1fd):

Quick visible summary: ✅ all 3 ARM + 3 Pixi jobs have "Restore ExternalData object store" and "Seed ExternalData object store from release tarball" completed with success in their step streams. "Save ExternalData object store" is queued (runs post-job). The first save will create the inaugural cache entry.

Cache key naming

GitHub Actions jobs (.github/workflows/{pixi,arm}.yml):

| Role | Key |
| --- | --- |
| Save key (immutable per run) | externaldata-v${ExternalDataVersion}-${github.sha} → e.g. externaldata-v5.4.5-dffb9bd1fdb99d90ad4a2063bbcb1c8f7bd291aa |
| Restore-keys fallback | externaldata-v${ExternalDataVersion}- → matches any prior save for the same ExternalDataVersion |

Azure Pipelines jobs (Testing/ContinuousIntegration/AzurePipelines*.yml):

| Role | Key |
| --- | --- |
| Save key (immutable per run) | "externaldata" \| "v$(ExternalDataVersion)" \| "$(Build.SourceVersion)" |
| Restore keys | "externaldata" \| "v$(ExternalDataVersion)" |

Cache path on both systems, exported as ExternalData_OBJECT_STORES: ${{ runner.temp }}/ExternalData on GHA, $(Pipeline.Workspace)/ExternalData on Azure.

Live step evidence (current run)

From GET /repos/InsightSoftwareConsortium/ITK/actions/runs/24836880538/jobs and 24836880557/jobs at time of this comment — all 6 in-progress jobs (3 Pixi + 3 ARM) have identical step sequence:

Install ccache                                  ✅ success
Restore compiler cache                          ✅ success
Restore ExternalData object store               ✅ success
Seed ExternalData object store from tarball     ✅ success   ← sentinel-gated download
Export ExternalData_OBJECT_STORES               ✅ success
<build+test steps ...>
Save compiler cache                             ⏳ pending    (post-job)
Save ExternalData object store                  ⏳ pending    (post-job)

No externaldata-* entries in /actions/caches yet because all prior runs on this branch were cancelled by subsequent force-pushes before the post-job save could land. This is the first complete run, so the first save will write the inaugural cache entry keyed externaldata-v5.4.5-dffb9bd1fd....

What demonstrates the sentinel-gating specifically

Next-run behavior (verifiable after this run saves its cache):

  1. Push a no-op commit on this branch.
  2. New run starts. Restore ExternalData object store hits externaldata-v5.4.5- via restore-keys fallback, reports cache-hit=false (exact primary-key miss) but restores the data anyway — this is the case that broke the original cache-hit-based gating.
  3. Seed ExternalData object store from release tarball still runs (no if: gate), but the sentinel check inside the step sees ${STORE}/.seeded-v5.4.5 present and logs "ExternalData store already seeded for v5.4.5; skipping tarball download." then exits 0 in under a second.
  4. No tarball download, no cmake -E tar xfz, no mirror fetches.

This is the behavior the sentinel was introduced to produce (see your earlier conversation about cache-hit='true' semantics). Happy to trigger a no-op rerun once this one lands to produce the actual log lines as final proof.

How to inspect after merge

```bash
# List externaldata cache entries the repo is holding:
gh api "repos/InsightSoftwareConsortium/ITK/actions/caches?per_page=100" \
  --jq '.actions_caches[] | select(.key | startswith("externaldata-")) |
       {key, ref, size_mb: (.size_in_bytes / 1048576 | floor), last_accessed_at}'

# Inspect the Seed step output of any past run:
gh run view <run-id> --repo InsightSoftwareConsortium/ITK --log | \
  grep -A3 "Seed ExternalData object store"
```

blowekamp (Member) commented:

Considering the number of places this code has to be duplicated, I think keeping it simple is important. Particularly, if the data cache works, I don't think there is a need to "fall back" to the data archive. This would simplify the code a bit. Alternatively, perhaps a couple of scripts or reusable CI components could avoid the code duplication.

What do other humans think?

dzenanz (Member) commented Apr 23, 2026

As this only impacts CI infrastructure and not library content, it is fine to merge it in order to more thoroughly check that it works.

dzenanz (Member) commented Apr 23, 2026

You are right, Brad. I only looked at the arm and the first Windows file. If duplication can be avoided, that would be good.

hjmjohnson (Member, Author) commented:

Working on removing the duplication; that is hard to do in CI. I agree that "adding a lot of boilerplate code for the most optimal caching" needs to be balanced against "minimizing boilerplate maintenance for common-benefit caching".

thewtex (Member) left a review:

👍 🚀

hjmjohnson (Member, Author) commented:

@blowekamp @dzenanz — applied as a181cc86da.

Dropped the tarball-seed fallback across all 11 seed sites (2 GHA + 7 Azure files; Azure Linux has 3 jobs). Kept only:

  1. Restore ExternalData object store: actions/cache/restore@v5 (GHA) / Cache@2 (Azure)
  2. Export ExternalData_OBJECT_STORES (GHA only; Azure sets it in the top-level variables: block)
  3. Save ExternalData object store: actions/cache/save@v5 (GHA; the Azure Cache@2 task saves automatically post-job)

First-ever run on a given ExternalDataVersion populates the store from CMake's own ExternalData_URL_TEMPLATES (mirrors); every subsequent run restores from the cache. Also removed the dead rm -f InsightData-*.tar.gz / rm -rf InsightToolkit-… cleanup in pixi.yml.

Net delta versus the prior revision of this PR:

 9 files changed, 56 insertions(+), 273 deletions(-)

Net delta versus upstream/main: the PR is now a net subtraction — caching infrastructure replaces more YAML than it adds.

Companion cleanup PR #6112 automates closed/merged-PR cache purges so the budget this PR consumes stays bounded.

blowekamp (Member) commented:

I don't fully understand the "hash" part of the cache key in this PR. But in SimpleITK we use:

```shell
git log -n 1 "${{ github.workspace }}/Testing/Data/" | tee "${{ github.workspace }}/external-data.hashable"
```

and:

```yaml
key: external-data-v1-${{ hashFiles( format( '{0}/{1}', github.workspace, 'external-data.hashable') ) }}
```

I think for ITK you could use the output of `git grep "" -- "*.cid"` to generate a hashable file that would consider all of ITK's data. This may save creating extra cache entries when no data changes.

hjmjohnson added a commit to hjmjohnson/ITK that referenced this pull request Apr 23, 2026
Add two complementary workflows that free the repository's 10 GB
Actions-cache budget without adding overhead to regular CI:

  - cleanup-pr-caches.yml         event-driven, fires on PR close
  - cleanup-stale-caches-nightly.yml  scheduled sweep, 3-day grace

The event-driven workflow deletes every cache scoped to the closed
PR's merge ref within seconds of close, using the minimal permission
set (actions: write, contents: read).  The nightly sweep is the safety
net: it catches caches orphaned during a cleanup-workflow outage, or
those pre-dating this workflow.

Motivation: ITK regularly hits the 10 GB per-repo cache cap because
ccache entries (2-3 GB per platform, 3 platforms) accumulate across
open and closed PRs.  Once the cap is reached, GitHub silently rejects
all subsequent cache saves with "Cache reservation failed: you have
reached your configured budget, your cache is now read only".  This
manifested on PR InsightSoftwareConsortium#6109's Pixi-Cxx + ARMBUILD runs where both the
ccache and the new externaldata-* saves failed, even though the
workflows are correctly wired.

Pattern follows the GitHub documentation for force-deleting cache
entries:
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#force-deleting-cache-entries

Robustness properties of the on-close workflow:
  - ref-scoped delete (refs/pull/N/merge) cannot touch refs/heads/main
    or other PR refs
  - idempotent: re-running finds 0 caches and exits 0
  - works for PRs from forks (runs in upstream context with fork's
    PR number and upstream's GITHUB_TOKEN)
  - closed state is terminal: a reopened PR gets fresh cache entries
    tied to new commits; deletions target the previous-closure era
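A rough sketch of the on-close workflow shape this message describes, using only the REST endpoints already shown earlier in this thread; the real cleanup-pr-caches.yml in companion PR #6112 may differ:

```yaml
# Illustrative on-close cleanup, not the exact cleanup-pr-caches.yml.
name: cleanup-pr-caches
on:
  pull_request:
    types: [closed]
permissions:
  actions: write
  contents: read
jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Delete caches scoped to the closed PR's merge ref
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          ref="refs/pull/${{ github.event.pull_request.number }}/merge"
          # The list call is ref-scoped, so only this PR's caches match;
          # an empty list makes the loop a no-op (idempotent re-runs).
          gh api --paginate "repos/${{ github.repository }}/actions/caches?ref=$ref" \
            --jq '.actions_caches[].id' |
          while read -r id; do
            gh api -X DELETE "repos/${{ github.repository }}/actions/caches/$id"
          done
```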
hjmjohnson added a commit to hjmjohnson/ITK that referenced this pull request Apr 23, 2026
@hjmjohnson hjmjohnson force-pushed the cache-externaldata-object-stores branch from a181cc8 to fca10df Compare April 23, 2026 18:01
hjmjohnson (Member, Author) commented:

@blowekamp @dzenanz — force-pushed `fca10df899` which fixes two things:

  1. Regression repair. The previous simplification push (`a181cc86da`) had a regex bug that over-matched on Azure files and accidentally removed the dashboard-clone step and the `ExternalData Cache@2` task alongside the Seed step. That's why all 5 Azure pipelines on `a181cc86da` failed — they couldn't find `$AGENT_BUILDDIRECTORY/ITK-dashboard/azure_dashboard.cmake`. The Azure files are now fully restored and the Seed-step removal is line-walked (not regex-matched) so it targets only the Seed step itself.

  2. Brad's content-hash cache key. Switched from SHA-based keys to the SimpleITK-style content-hash pattern:

    • GHA: externaldata-v${ExternalDataVersion}-${github.sha} → externaldata-v1-${{ hashFiles('**/*.cid') }}
    • Azure: "externaldata" | "v$(ExternalDataVersion)" | "$(Build.SourceVersion)" → "externaldata" | "v1" | **/*.cid

    The key now only changes when a `.cid` file's content changes — adding a new test, updating a data hash, etc. Non-data commits (documentation, refactors, C++ edits) reuse the same cache entry indefinitely. `ExternalDataVersion: 5.4.5` variable is dropped since it's no longer referenced.

Effect: far fewer cache entries churned per PR, cleaner hit rate, less pressure on the 10 GB repo cache budget. Credit to @blowekamp for the pointer to the SimpleITK pattern.
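A sketch of the resulting GHA restore step; whether the restore-keys fallback is retained alongside the content-hash key is an assumption carried over from the earlier revision:

```yaml
- uses: actions/cache/restore@v5
  with:
    path: ${{ runner.temp }}/ExternalData
    # Rotates only when some .cid file's content changes:
    key: externaldata-v1-${{ hashFiles('**/*.cid') }}
    restore-keys: |
      externaldata-v1-
```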

hjmjohnson (Member, Author) commented:

/azp run ITK.Windows

@hjmjohnson hjmjohnson force-pushed the cache-externaldata-object-stores branch 2 times, most recently from 9916bad to 43a23e1 Compare April 24, 2026 17:45
@github-actions github-actions Bot added the area:Python wrapping Python bindings for a class label Apr 24, 2026
@hjmjohnson hjmjohnson force-pushed the cache-externaldata-object-stores branch from 238db0a to c2f2e11 Compare April 24, 2026 18:09
Point ExternalData_OBJECT_STORES at a dedicated persistent directory
($(Pipeline.Workspace)/ExternalData on Azure, ${{ runner.temp }}/ExternalData
on GitHub Actions) that is cached separately from ccache.  The release
tarball becomes a warm-boot seed: on a cold cache it is downloaded and
unpacked into the store; on a warm cache the seed step short-circuits
and no network access is needed.

Applies to all 9 CI configurations:

  .github/workflows/arm.yml                          (3 matrix jobs)
  .github/workflows/pixi.yml                         (3 matrix jobs)
  Testing/ContinuousIntegration/AzurePipelinesBatch.yml
  Testing/ContinuousIntegration/AzurePipelinesLinux.yml         (3 jobs)
  Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml
  Testing/ContinuousIntegration/AzurePipelinesMacOS.yml
  Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml
  Testing/ContinuousIntegration/AzurePipelinesWindows.yml
  Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml

ExternalData blobs are platform-agnostic, so all jobs within a given CI
system share a single cache entry keyed solely on ExternalDataVersion.
Each run saves an immutable entry under ExternalDataVersion + SourceVersion
(same pattern as ccache); restore-keys falls through to the most recent
prior cache under the same version.  Blobs fetched ad-hoc from mirrors
during a run persist to the next run under the fallthrough restore-key.

Kept as sibling directories rather than colocated under CCACHE_DIR so
ccache --cleanup does not consider the ExternalData tree, and so ccache's
SHA-pinned cache invalidation does not force redundant ExternalData cache
writes on every commit.

The tarball seed is gated on a version-tagged sentinel file
(.seeded-v<ExternalDataVersion>) inside the store, not on the
actions/cache cache-hit output.  actions/cache reports cache-hit='true'
only for an exact primary-key match; a restore-keys fallback still
reports cache-hit='false' even though the data was restored.  Keying
off the sentinel correctly skips the tarball download in the
restore-keys-fallback case too.

The GHA ExternalData cache is keyed on hashFiles('**/*.cid'), so the
saved entry should contain an object for every .cid in the tree. In
practice the in-band ExternalData fetch only pulls blobs for the
modules currently selected for compilation and whose tests run; every
other reference stays out of the store and has to be re-downloaded
from gateways on the next cold boot.

Because the shared key is platform-agnostic, every CI workflow that
writes the same key races to save first.  Only one workflow can
actually prefetch the full corpus; the others would overwrite that
save with whatever subset their build happened to fetch, and
GitHub's actions/cache/save refuses duplicate keys once the first
writer wins.  The result observed on an earlier revision of this PR
was a 117 MiB cache under a 2.43 GB corpus (~5 %).

Split the writer out into a dedicated workflow:

  .github/workflows/populate-externaldata-cache.yml

runs on PRs that touch **/*.cid, on pushes to main and release*,
nightly at 05:17 UTC, and on workflow_dispatch.  It is the only
workflow that saves the shared key.  After prefetch a completeness
gate counts the unique CIDs referenced in the tree against the
objects on disk and refuses to save if any are missing, so a partial
cache can never be written under the shared key.
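A plausible shape for that completeness gate, assuming one CID per .cid file and a flat CID/ object directory (both assumptions, not verified against the PR):

```yaml
- name: Verify ExternalData store completeness
  shell: bash
  run: |
    STORE="${{ runner.temp }}/ExternalData"
    # Unique CIDs referenced anywhere in the tree vs. objects on disk.
    want=$(git grep -h "" -- "*.cid" | sort -u | wc -l)
    have=$(find "$STORE/CID" -type f | wc -l)
    echo "referenced CIDs: $want, objects in store: $have"
    if [ "$have" -lt "$want" ]; then
      echo "ExternalData store incomplete; refusing to save shared cache" >&2
      exit 1
    fi
```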

Consumer workflows arm.yml and pixi.yml restore the cache and use it
in-band during the build; they do not save.  Azure pipelines retain
their Cache@2 wiring from the previous commit (separate cache
subsystem; follow-up).

Add Utilities/Maintenance/PrefetchCIDContentLinks.py that walks the
source tree, reads every .cid file, and downloads any missing
<store>/cid/<cid> through the same gateway list
CMake/ITKExternalData.cmake uses.  The script delegates HTTPS to
curl (available on every supported runner) so TLS verification
lives entirely in the system stack, and uses a ThreadPoolExecutor
for parallel downloads.  Idempotent — already-present objects are
skipped.
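A rough in-workflow bash equivalent of what that script does, to make the flow concrete; GATEWAY is a placeholder for the ExternalData_URL_TEMPLATES list, the flat CID/ layout is an assumption, and the sequential loop stands in for the script's ThreadPoolExecutor parallelism:

```yaml
- name: Prefetch missing CID objects (conceptual stand-in for the script)
  shell: bash
  run: |
    STORE="${{ runner.temp }}/ExternalData"
    GATEWAY="https://example-gateway.invalid"  # placeholder gateway URL
    cmake -E make_directory "$STORE/CID"
    git grep -h "" -- "*.cid" | sort -u | while read -r cid; do
      obj="$STORE/CID/$cid"
      [ -f "$obj" ] && continue  # idempotent: skip already-present objects
      curl -fsSL -o "$obj" "$GATEWAY/$cid" || rm -f "$obj"
    done
```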
@hjmjohnson hjmjohnson force-pushed the cache-externaldata-object-stores branch from c2f2e11 to 4cbf6cf Compare April 24, 2026 22:21
@hjmjohnson hjmjohnson merged commit 0f92cb9 into InsightSoftwareConsortium:main Apr 25, 2026
14 of 15 checks passed
hjmjohnson added a commit to hjmjohnson/ITK that referenced this pull request Apr 25, 2026

Labels: area:Python wrapping, type:Enhancement, type:Infrastructure, type:Testing