ENH: Cache ExternalData object store across CI runs #6109
Conversation
Force-pushed bea5538 to cba8c6b
---
| Filename | Overview |
|---|---|
| .github/workflows/arm.yml | Adds ExternalData cache restore/save with sentinel-gated seeding; minor edge case where rename can fail if CID exists without sentinel. |
| .github/workflows/pixi.yml | Same ExternalData caching pattern as arm.yml; identical sentinel/rename logic and same edge case applies. |
| Testing/ContinuousIntegration/AzurePipelinesBatch.yml | Adds Cache@2 task and Windows cmd sentinel check; if exist guard and cmake -E touch for sentinel creation look correct. |
| Testing/ContinuousIntegration/AzurePipelinesLinux.yml | Three jobs each get Cache@2 + bash seeding; same rename-on-existing edge case as GHA files. |
| Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml | Adds Cache@2 + bash seeding; pattern is consistent with AzurePipelinesLinux. |
| Testing/ContinuousIntegration/AzurePipelinesMacOS.yml | Adds Cache@2 + bash seeding for macOS; same consistent pattern. |
| Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml | Adds Cache@2 + bash seeding for macOS Python; consistent with other Azure macOS pipeline. |
| Testing/ContinuousIntegration/AzurePipelinesWindows.yml | Windows cmd seeding uses if exist sentinel check and cmake -E touch; make_directory is correctly ordered before rename. |
| Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml | Same Windows cmd seeding pattern as AzurePipelinesWindows; correct and consistent. |
Flowchart
```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[CI job starts] --> B[Restore ExternalData cache\nvia restore-keys or exact hit]
    B --> C{Sentinel file present\nin object store?}
    C -- Yes --> D[Skip tarball download]
    C -- No --> E[Download release tarball via curl]
    E --> F[Extract tarball with cmake -E tar]
    F --> G[Create STORE dir with cmake -E make_directory]
    G --> H[Move CID dir into STORE with cmake -E rename]
    H --> I[Write sentinel file to STORE]
    I --> J[Export ExternalData_OBJECT_STORES env var]
    D --> J
    J --> K[CMake configure reads env var]
    K --> L[Build and Test\nnew CIDs fetched into STORE ad-hoc]
    L --> M[Save ExternalData cache\nif not cancelled\nimmutable key per commit]
```
Reviews (1): Last reviewed commit: "ENH: Cache ExternalData object store acr..."
Force-pushed cba8c6b to cf3c44c
Force-pushed cf3c44c to dffb9bd
dzenanz left a comment:
Code generally looks good. Does it work?
---
@dzenanz — proof the cache wiring is live on the current CI run (HEAD …). Quick visible summary: ✅ all 3 ARM + 3 Pixi jobs have the Cache steps.

Cache key naming: GitHub Actions jobs (…); Azure Pipelines jobs (…). Cache path on both: …

Live step evidence (current run): …

What demonstrates the sentinel-gating specifically: next-run behavior, verifiable after this run saves its cache: …
This is the behavior the sentinel was introduced to produce (see your earlier conversation about `cache-hit`).

How to inspect after merge

```bash
# List externaldata cache entries the repo is holding:
gh api "repos/InsightSoftwareConsortium/ITK/actions/caches?per_page=100" \
  --jq '.actions_caches[] | select(.key | startswith("externaldata-")) |
        {key, ref, size_mb: (.size_in_bytes / 1048576 | floor), last_accessed_at}'

# Inspect the Seed step output of any past run:
gh run view <run-id> --repo InsightSoftwareConsortium/ITK --log | \
  grep -A3 "Seed ExternalData object store"
```
---
Considering the number of places this code has to be duplicated, I think keeping it simple is important. In particular, if the data cache works, I don't think there is a need to "fall back" to the data archive. This would simplify the code a bit. Alternatively, a couple of scripts or reusable CI components could avoid the code duplication. What do other humans think?
---
As this only impacts CI infrastructure and not library content, it is fine to merge it in order to more thoroughly check that it works.
---
You are right, Brad. I only looked at arm and the first Windows file. If duplication can be avoided, that would be good.
---
Working on removing duplication. That is hard in the CI. I agree that "adding a lot of boilerplate code for most optimal caching" needs to be balanced with "minimize boilerplate maintenance for common benefit caching". |
Force-pushed dffb9bd to a181cc8
---
@blowekamp @dzenanz — applied as …. Dropped the tarball-seed fallback across all 11 seed sites (2 GHA + 7 Azure files; Azure Linux has 3 jobs). Kept only: …

First-ever run on a given …. Net delta versus the prior revision of this PR: …. Net delta versus …: ….

Companion cleanup PR #6112 automates closed/merged-PR cache purges so the budget this PR consumes stays bounded.
---
I don't fully understand the "hash" part of the cache key in this PR. But in SimpleITK we use: …

I think for ITK you could use the output of `git grep "" -- "*.cid"` to generate a hashable file that would consider all of ITK's data. This may save creating extra cache entries when no data changes.
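A minimal sketch of that suggestion, assuming a hypothetical manifest file name and key wiring (neither is from this PR): the manifest's content changes only when the data references change, so a key hashed from it stays stable across unrelated source commits.

```bash
# Hypothetical manifest: one line per CID reference anywhere in the tree.
# git grep with an empty pattern matches every line of every .cid file.
git grep "" -- "*.cid" | sort > cid-manifest.txt

# A cache key built from this hash changes only when data references change,
# e.g. in GHA: key: externaldata-${{ hashFiles('cid-manifest.txt') }}
sha256sum cid-manifest.txt
```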
Add two complementary workflows that free the repository's 10 GB Actions-cache budget without adding overhead to regular CI:

- cleanup-pr-caches.yml — event-driven, fires on PR close
- cleanup-stale-caches-nightly.yml — scheduled sweep, 3-day grace

The event-driven workflow deletes every cache scoped to the closed PR's merge ref within seconds of close, using the minimal permission set (actions: write, contents: read). The nightly sweep is the safety net: it catches caches orphaned during a cleanup-workflow outage, or those pre-dating this workflow.

Motivation: ITK regularly hits the 10 GB per-repo cache cap because ccache entries (2-3 GB per platform, 3 platforms) accumulate across open and closed PRs. Once the cap is reached, GitHub silently rejects all subsequent cache saves with "Cache reservation failed: you have reached your configured budget, your cache is now read only". This manifested on PR InsightSoftwareConsortium#6109's Pixi-Cxx + ARMBUILD runs where both the ccache and the new externaldata-* saves failed, even though the workflows are correctly wired.

Pattern follows the GitHub documentation for force-deleting cache entries:
https://docs.github.com/en/actions/using-workflows/caching-dependencies-to-speed-up-workflows#force-deleting-cache-entries

Robustness properties of the on-close workflow:

- ref-scoped delete (refs/pull/N/merge) cannot touch refs/heads/main or other PR refs
- idempotent: re-running finds 0 caches and exits 0
- works for PRs from forks (runs in upstream context with the fork's PR number and upstream's GITHUB_TOKEN)
- closed state is terminal: a reopened PR gets fresh cache entries tied to new commits; deletions target the previous-closure era
Force-pushed a181cc8 to fca10df
---
@blowekamp @dzenanz — force-pushed `fca10df899`, which fixes two things: …
Effect: far fewer cache entries churned per PR, cleaner hit rate, less pressure on the 10 GB repo cache budget. Credit to @blowekamp for the pointer to the SimpleITK pattern.
---
/azp run ITK.Windows
Force-pushed 9916bad to 43a23e1
Force-pushed 238db0a to c2f2e11
Point ExternalData_OBJECT_STORES at a dedicated persistent directory
($(Pipeline.Workspace)/ExternalData on Azure, ${{ runner.temp }}/ExternalData
on GitHub Actions) that is cached separately from ccache. The release
tarball becomes a warm-boot seed: on a cold cache it is downloaded and
unpacked into the store; on a warm cache the seed step short-circuits
and no network access is needed.
Applies to all 9 CI configurations:
.github/workflows/arm.yml (3 matrix jobs)
.github/workflows/pixi.yml (3 matrix jobs)
Testing/ContinuousIntegration/AzurePipelinesBatch.yml
Testing/ContinuousIntegration/AzurePipelinesLinux.yml (3 jobs)
Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml
Testing/ContinuousIntegration/AzurePipelinesMacOS.yml
Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml
Testing/ContinuousIntegration/AzurePipelinesWindows.yml
Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml
ExternalData blobs are platform-agnostic, so all jobs within a given CI
system share a single cache entry keyed solely on ExternalDataVersion.
Each run saves an immutable entry under ExternalDataVersion + SourceVersion
(same pattern as ccache); restore-keys falls through to the most recent
prior cache under the same version. Blobs fetched ad-hoc from mirrors
during a run persist to the next run under the fallthrough restore-key.
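A sketch of the two key shapes this produces, written with GHA-style placeholders (`GITHUB_SHA` standing in for `SourceVersion`; Azure would use `Build.SourceVersion`):

```bash
# Immutable per-run save key: ExternalData version plus commit SHA.
SAVE_KEY="externaldata-v${ExternalDataVersion}-${GITHUB_SHA}"

# restore-keys prefix: on a primary-key miss, the cache action falls through
# to the most recently saved entry whose key starts with this prefix.
RESTORE_KEY_PREFIX="externaldata-v${ExternalDataVersion}-"

echo "save=${SAVE_KEY} restore-prefix=${RESTORE_KEY_PREFIX}"
```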
Kept as sibling directories rather than colocated under CCACHE_DIR so
ccache --cleanup does not consider the ExternalData tree, and so ccache's
SHA-pinned cache invalidation does not force redundant ExternalData cache
writes on every commit.
The tarball seed is gated on a version-tagged sentinel file
(.seeded-v<ExternalDataVersion>) inside the store, not on the
actions/cache cache-hit output. actions/cache reports cache-hit='true'
only for an exact primary-key match; a restore-keys fallback still
reports cache-hit='false' even though the data was restored. Keying
off the sentinel correctly skips the tarball download in the
restore-keys-fallback case too.
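A minimal bash sketch of the sentinel-gated seed step, assuming the flow shown in the flowchart; the tarball URL is a placeholder, and the real step bodies live in the workflow files:

```bash
STORE="${RUNNER_TEMP}/ExternalData"          # $(Pipeline.Workspace)/ExternalData on Azure
SENTINEL="${STORE}/.seeded-v${ExternalDataVersion}"

if [ -f "${SENTINEL}" ]; then
  echo "Object store already seeded; skipping tarball download."
else
  # Placeholder URL: the real release-tarball location comes from the workflow.
  curl -fL -o InsightData.tar.gz "https://example.com/InsightData-${ExternalDataVersion}.tar.gz"
  cmake -E tar xzf InsightData.tar.gz
  cmake -E make_directory "${STORE}"
  # Note: rename fails if ${STORE}/CID already exists without the sentinel,
  # the edge case flagged in the review table above.
  cmake -E rename .ExternalData/CID "${STORE}/CID"
  : > "${SENTINEL}"                          # create the version-tagged sentinel
fi
```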
The GHA ExternalData cache is keyed on hashFiles('**/*.cid'), so the
saved entry should contain an object for every .cid in the tree. In
practice the in-band ExternalData fetch only pulls blobs for the
modules currently selected for compilation and whose tests run; every
other reference stays out of the store and has to be re-downloaded
from gateways on the next cold boot.
Because the shared key is platform-agnostic, every CI workflow that
writes the same key races to save first. Only one workflow can
actually prefetch the full corpus; the others would overwrite that
save with whatever subset their build happened to fetch, and
GitHub's actions/cache/save refuses duplicate keys once the first
writer wins. The result observed on an earlier revision of this PR
was a 117 MiB cache under a 2.43 GB corpus (~5%).
Split the writer out into a dedicated workflow:
.github/workflows/populate-externaldata-cache.yml
runs on PRs that touch **/*.cid, on pushes to main and release*,
nightly at 05:17 UTC, and on workflow_dispatch. It is the only
workflow that saves the shared key. After prefetch a completeness
gate counts the unique CIDs referenced in the tree against the
objects on disk and refuses to save if any are missing, so a partial
cache can never be written under the shared key.
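A sketch of such a completeness gate, assuming a store layout of one object file per CID under `${STORE}/CID/` (the actual gate lives in the populate workflow):

```bash
# Compare every CID referenced in the tree against the objects on disk.
missing=0
for cid in $(git grep -h "" -- "*.cid" | sort -u); do
  if [ ! -f "${STORE}/CID/${cid}" ]; then
    echo "missing object: ${cid}"
    missing=$((missing + 1))
  fi
done

if [ "${missing}" -gt 0 ]; then
  echo "Store incomplete (${missing} CIDs unresolved); refusing to save cache."
  exit 1
fi
```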
Consumer workflows arm.yml and pixi.yml restore the cache and use it
in-band during the build; they do not save. Azure pipelines retain
their Cache@2 wiring from the previous commit (separate cache
subsystem; follow-up).
Add Utilities/Maintenance/PrefetchCIDContentLinks.py that walks the
source tree, reads every .cid file, and downloads any missing
<store>/cid/<cid> through the same gateway list
CMake/ITKExternalData.cmake uses. The script delegates HTTPS to
curl (available on every supported runner) so TLS verification
lives entirely in the system stack, and uses a ThreadPoolExecutor
for parallel downloads. Idempotent — already-present objects are
skipped.
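The script itself is Python (parallelized with a ThreadPoolExecutor); as an illustration of the per-CID gateway fallback it implements, here is a shell equivalent. The gateway URLs are placeholders; the real list is the one in CMake/ITKExternalData.cmake.

```bash
# Placeholder gateway list; the script reuses ITKExternalData.cmake's list.
GATEWAYS=(
  "https://gateway-a.example.com/ipfs"
  "https://gateway-b.example.com/ipfs"
)

fetch_cid() {
  local cid="$1"
  local out="${STORE}/CID/${cid}"
  [ -f "${out}" ] && return 0            # idempotent: present objects are skipped
  for gw in "${GATEWAYS[@]}"; do
    # curl handles HTTPS, so TLS verification stays in the system stack
    if curl -fsSL -o "${out}.tmp" "${gw}/${cid}"; then
      mv "${out}.tmp" "${out}"
      return 0
    fi
  done
  echo "all gateways failed for ${cid}" >&2
  return 1
}
```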
Force-pushed c2f2e11 to 4cbf6cf
Merged 0f92cb9 into InsightSoftwareConsortium:main
Adds a persistent per-repo cache of the ExternalData object store to all 9 CI configurations (2 GitHub Actions workflows + 7 Azure Pipelines). The release tarball becomes a warm-boot seed — downloaded on cold cache only, short-circuited on warm cache via a sentinel file.
What changed in each CI
- Added next to the existing ccache cache, as a sibling directory rather than colocated under `CCACHE_DIR`: `$(Pipeline.Workspace)/ExternalData` on Azure; `${{ runner.temp }}/ExternalData` on GitHub Actions.
- Keyed solely on `ExternalDataVersion`, shared across every workflow / job / OS in the repo. ExternalData blobs are platform-agnostic, so there's no reason to key on `Agent.OS`, matrix name, job name, etc.
- Save key `externaldata-v<ExternalDataVersion>-<SHA>` (immutable per run); restore key `externaldata-v<ExternalDataVersion>-` (fall-through to latest prior).
- Seeding is gated by a sentinel file (`.seeded-v<ExternalDataVersion>`), not by `actions/cache`'s `cache-hit` output.
- The `ExternalData_OBJECT_STORES` env var points CMake at the cache path. Set via `GITHUB_ENV` on GHA; via the top-level `variables:` block on Azure.

Why sentinel file and not cache-hit
The first revision of this PR used `if: steps.restore-externaldata.outputs.cache-hit != 'true'` to guard the download. That's subtly wrong:

- `cache-hit: 'true'` → exact primary-key match (on my SHA-suffixed key, this is rare — only same-SHA reruns).
- `cache-hit: 'false'` → either nothing restored, or restored via a `restore-keys` fallback. The step runs in both cases.

With the SHA-keyed save pattern (required for immutable GHA caches that still need to grow per-run), almost every PR run falls into the second category: the fallback key matches, data is restored, but `cache-hit='false'` causes a redundant tarball download.

The version-tagged sentinel (`$STORE/.seeded-v<ver>`) directly reflects whether the store has already been seeded from the release tarball, regardless of how the store got populated. The seed step downloads and extracts the tarball, moves `.ExternalData/CID/` into `$STORE/CID`, and creates the sentinel. Works identically for cold-cache, exact-cache-hit, and restore-keys-fallback cases.
Why sibling directory, not under CCACHE_DIR
Considered and rejected colocating ExternalData under `CCACHE_DIR`:

- `ccache --cleanup` walks the full `$CCACHE_DIR` tree; future ccache versions could start pruning unknown subdirs.
- ExternalData blobs counted under `CCACHE_MAXSIZE=5G` would silently eat into ccache's usable budget.
- The ccache cache is keyed on `${{ github.sha }}` (by design) — every PR commit would unnecessarily rewrite the whole ~6 GB bundle including ExternalData, which doesn't need SHA-granularity.

Sibling directories with separate caches let each invalidate on its own natural cadence: ccache per-SHA, ExternalData per-version.
Before and after behavior
[Before/after comparison table; one recoverable entry: the cold-cache tarball download occasionally fails with `92 bytes received`, a known GHA flake.]
- `.github/workflows/arm.yml` (shell: bash)
- `.github/workflows/pixi.yml` (shell: bash)
- `Testing/ContinuousIntegration/AzurePipelinesBatch.yml` (script:)
- `Testing/ContinuousIntegration/AzurePipelinesLinux.yml`
- `Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml`
- `Testing/ContinuousIntegration/AzurePipelinesMacOS.yml`
- `Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml`
- `Testing/ContinuousIntegration/AzurePipelinesWindows.yml`
- `Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml`

cmd steps use `if exist` for the sentinel check and `cmake -E touch` to create it (both cross-shell primitives available on Windows runners). Bash steps use `[ -f ... ]` and `: > file`. `mkdir -p` was also replaced with `cmake -E make_directory` for Windows compatibility.

CMake plumbing (unchanged, referenced for review)
`CMake/ITKExternalData.cmake` already reads `$ENV{ExternalData_OBJECT_STORES}` at configure time (lines 5-14) and uses it as the default for the cache variable. This PR wires the env var; CMake-side logic is untouched.
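For reference, the env-var wiring on the GHA side amounts to one line; a sketch under the paths described above (the Azure side sets the same variable in the `variables:` block instead):

```bash
# Persist the store location for every later step in the job; CMake picks it
# up at configure time via $ENV{ExternalData_OBJECT_STORES}.
echo "ExternalData_OBJECT_STORES=${RUNNER_TEMP}/ExternalData" >> "${GITHUB_ENV}"
```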