ENH: Dedicate populate-externaldata-cache workflow; consumers restore-only #6131
Closed
hjmjohnson wants to merge 3 commits into InsightSoftwareConsortium:main
Point ExternalData_OBJECT_STORES at a dedicated persistent directory
($(Pipeline.Workspace)/ExternalData on Azure, ${{ runner.temp }}/ExternalData
on GitHub Actions) that is cached separately from ccache. The release
tarball becomes a warm-boot seed: on a cold cache it is downloaded and
unpacked into the store; on a warm cache the seed step short-circuits
and no network access is needed.
Applies to all 9 CI configurations:
.github/workflows/arm.yml (3 matrix jobs)
.github/workflows/pixi.yml (3 matrix jobs)
Testing/ContinuousIntegration/AzurePipelinesBatch.yml
Testing/ContinuousIntegration/AzurePipelinesLinux.yml (3 jobs)
Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml
Testing/ContinuousIntegration/AzurePipelinesMacOS.yml
Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml
Testing/ContinuousIntegration/AzurePipelinesWindows.yml
Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml
ExternalData blobs are platform-agnostic, so all jobs within a given CI
system share a single cache entry keyed solely on ExternalDataVersion.
Each run saves an immutable entry under ExternalDataVersion + SourceVersion
(same pattern as ccache); restore-keys falls through to the most recent
prior cache under the same version. Blobs fetched ad-hoc from mirrors
during a run persist to the next run under the fallthrough restore-key.
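The key scheme above can be modelled in a few lines. This is a minimal sketch of prefix-based restore-key resolution, not the cache backend's actual code; the function name `restore`, the `RestoreResult` type, and the literal key strings are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RestoreResult:
    matched_key: Optional[str]  # key of the entry that was restored, if any
    cache_hit: bool             # True only for an exact primary-key match


def restore(primary_key: str, restore_keys: List[str], saved_keys: List[str]) -> RestoreResult:
    """Resolve a restore request; saved_keys is ordered oldest to newest."""
    if primary_key in saved_keys:
        return RestoreResult(primary_key, True)
    for prefix in restore_keys:
        matches = [k for k in saved_keys if k.startswith(prefix)]
        if matches:
            # Fall through to the most recent entry sharing the prefix.
            return RestoreResult(matches[-1], False)
    return RestoreResult(None, False)


# Hypothetical keys: "<ExternalDataVersion>-<SourceVersion>".
saved = ["externaldata-v1-aaa", "externaldata-v1-bbb"]
result = restore("externaldata-v1-ccc", ["externaldata-v1-"], saved)
print(result.matched_key, result.cache_hit)
```

A new commit therefore starts from the previous run's store even though its own primary key has never been saved.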
Kept as sibling directories rather than colocated under CCACHE_DIR so
ccache --cleanup does not consider the ExternalData tree, and so ccache's
SHA-pinned cache invalidation does not force redundant ExternalData cache
writes on every commit.
The tarball seed is gated on a version-tagged sentinel file
(.seeded-v<ExternalDataVersion>) inside the store, not on the
actions/cache cache-hit output. actions/cache reports cache-hit='true'
only for an exact primary-key match; a restore-keys fallback still
reports cache-hit='false' even though the data was restored. Keying
off the sentinel correctly skips the tarball download in the
restore-keys-fallback case too.
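The sentinel gate can be sketched as follows. This assumes the seed step is free to write a marker file inside the store; `seed_store` and the `unpack_tarball` callback are hypothetical names, and the real CI steps are shell/YAML rather than Python.

```python
from pathlib import Path
from typing import Callable


def seed_store(store: Path, version: str, unpack_tarball: Callable[[Path], None]) -> bool:
    """Seed the store once per ExternalDataVersion; return True if seeding ran."""
    sentinel = store / f".seeded-v{version}"
    if sentinel.exists():
        # Warm store: restored by an exact hit or by a restore-keys fallback.
        return False
    store.mkdir(parents=True, exist_ok=True)
    unpack_tarball(store)  # download and unpack the release tarball (caller-supplied)
    sentinel.touch()       # written only after the unpack succeeded
    return True
```

Because the sentinel travels inside the cached directory, the decision depends on what was actually restored rather than on how the cache action labelled the restore.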
The GHA ExternalData cache is keyed on hashFiles('**/*.cid'), so the
saved entry should contain an object for every .cid in the tree. In
practice the in-band ExternalData fetch only pulls blobs for the
modules currently selected for compilation and whose tests run; every
other reference stays out of the store and has to be re-downloaded
from gateways on the next cold boot.
Add Utilities/Maintenance/PrefetchCIDContentLinks.py that walks the
source tree, reads every .cid file, and downloads any missing
<store>/cid/<cid> through the same gateway list
CMake/ITKExternalData.cmake uses. The script delegates HTTPS to
curl (available on every supported runner) so TLS verification lives
entirely in the system stack, and uses a ThreadPoolExecutor for
parallel downloads. Idempotent — already-present objects are skipped.
Wire it into .github/workflows/pixi.yml immediately before
'Save ExternalData object store', guarded on a cold
restore-externaldata cache so warm boots stay instant.
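For orientation, the prefetch logic described above might look roughly like this. It is a sketch under stated assumptions, not the contents of Utilities/Maintenance/PrefetchCIDContentLinks.py: the gateway URL is a placeholder (the real list lives in CMake/ITKExternalData.cmake), and `curl_fetch`, `referenced_cids`, and `prefetch` are hypothetical names.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Set

# Placeholder gateway template (assumption, not a real ITK gateway).
GATEWAYS = ["https://gateway.example.org/ipfs/{cid}"]


def curl_fetch(cid: str, dest: Path) -> bool:
    """Try each gateway with curl; return True once one download succeeds."""
    tmp = dest.with_name(dest.name + ".part")
    for template in GATEWAYS:
        cmd = ["curl", "--fail", "--silent", "--location",
               "--output", str(tmp), template.format(cid=cid)]
        if subprocess.run(cmd).returncode == 0:
            tmp.replace(dest)  # publish only complete downloads
            return True
    tmp.unlink(missing_ok=True)
    return False


def referenced_cids(source_root: Path) -> Set[str]:
    """Collect the unique, non-empty CIDs named by .cid files under source_root."""
    cids = {p.read_text().strip() for p in source_root.rglob("*.cid")}
    cids.discard("")
    return cids


def prefetch(source_root: Path, store: Path,
             fetch: Callable[[str, Path], bool] = curl_fetch,
             workers: int = 8) -> List[str]:
    """Download objects missing from <store>/cid/; return the CIDs that failed."""
    cid_dir = store / "cid"
    cid_dir.mkdir(parents=True, exist_ok=True)
    # Idempotent: objects already on disk are never requested again.
    missing = sorted(c for c in referenced_cids(source_root)
                     if not (cid_dir / c).exists())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: fetch(c, cid_dir / c), missing))
    return [c for c, ok in zip(missing, results) if not ok]
```

Passing the downloader in as `fetch` keeps the walk-and-skip logic testable without network access.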
…-only
The shared GitHub Actions cache entry externaldata-v1-<hashFiles> populated by PR InsightSoftwareConsortium#6109 was subject to a multi-writer race: arm.yml, pixi.yml, and (via a parallel Cache@2 subsystem) the Azure pipelines all saved under the same key, and whichever job finished first won. Only pixi.yml had a Prefetch step that pulled every .cid blob; the other workflows saved only the subset the in-band ExternalData fetch happened to touch for the modules each build had enabled. On PR InsightSoftwareConsortium#6109 itself the ARM rosetta build (no prefetch) wrote 117 MiB; the real corpus is ~2.4 GB. Consumers that restored the partial cache then hit cache-hit=true, skipped prefetch by design, and propagated the hole.
Split the writer out: .github/workflows/populate-externaldata-cache.yml is now the only workflow that saves the shared key. It runs on PRs that touch .cid files, on pushes to main/release*, nightly, and on workflow_dispatch. A completeness gate after prefetch counts the unique CIDs referenced in the tree against the objects on disk and fails the save if any are missing, so a partial cache can never be written under the shared key.
Consumer workflows arm.yml and pixi.yml now only restore the ExternalData cache; their Save (and, for pixi.yml, Prefetch) steps are removed. The Azure Cache@2 subsystem is untouched by this commit; the same pattern applies there but is a separate follow-up because those pipelines cache via Azure DevOps, not GitHub Actions.
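The completeness gate reduces to a set difference. The sketch below assumes the <store>/cid/<cid> layout described for the prefetch script; `missing_objects` and `completeness_gate` are hypothetical names, and the workflow's actual gate may be implemented differently.

```python
from pathlib import Path
from typing import Set


def missing_objects(source_root: Path, store: Path) -> Set[str]:
    """Return the referenced CIDs that have no object under <store>/cid/."""
    referenced = {p.read_text().strip() for p in source_root.rglob("*.cid")}
    referenced.discard("")
    cid_dir = store / "cid"
    present = {p.name for p in cid_dir.iterdir()} if cid_dir.is_dir() else set()
    return referenced - present


def completeness_gate(source_root: Path, store: Path) -> int:
    """Return a process exit code: 0 if the store is complete, 1 otherwise."""
    missing = missing_objects(source_root, store)
    if missing:
        print(f"{len(missing)} referenced CID(s) missing from store; refusing to save")
        return 1
    return 0
```

Run as a step before the save, a nonzero exit code would fail the job, so the save step never executes against a partial store.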
Auto-closed by branch deletion. The changes have been folded into #6109 directly.
See #6109 for the current state.
Single-writer owner of the externaldata-v1-<hashFiles('**/*.cid')> GitHub Actions cache key. Stacks on top of #6109; merge after that lands. The diff will shrink to the three-file delta once #6109 is in main.
Root cause (observed on #6109)
Cache entry externaldata-v1-4cf653e4ff... was 117 MiB. Sampling 50 random CIDs against the ITKTestingData GitHub Pages mirror gave a mean object size of ~1.57 MB × 1,509 unique CIDs = ~2.43 GB expected corpus. 117 MiB ÷ 2,430 MiB = ~4.8% of content. zstd cannot compress .mha/.nrrd/.png data 20×, so the cache was incomplete, not just compressed.
Timeline on c2f2e1151639:
pixi.yml (has Prefetch) failed to start ("workflow file issue") → never saved.
arm.yml / ARMBUILD-x86_64-rosetta (no Prefetch) finished 19:32:40Z → saved 117 MiB at 19:32:34Z containing only whatever blobs the ARM build happened to fetch for its enabled modules.
Consumers that restored the partial cache hit cache-hit=true on the shared key, skipped Prefetch by design, and propagated the hole forever.
Fix
.github/workflows/populate-externaldata-cache.yml is the only writer of the shared key. Runs on: PRs that touch **/*.cid, pushes to main/release*, nightly at 05:17 UTC, and workflow_dispatch.
A completeness gate after prefetch counts the unique CIDs referenced in the tree against the objects on disk and fails before actions/cache/save if any are missing, so a partial cache can never be written under the shared key.
arm.yml and pixi.yml Save steps removed; pixi.yml Prefetch step removed. Both keep their Restore step and now participate as restore-only consumers. Explanatory comments left where the Save/Prefetch used to be to prevent someone re-introducing the race.
The Azure Cache@2 subsystem (7 pipelines) is untouched: same anti-pattern but a separate cache backend; follow-up.
Verification plan
1. Delete the existing externaldata-v1-* entries: gh cache delete --repo InsightSoftwareConsortium/ITK externaldata-v1-<digest>.
2. Run workflow_dispatch on Populate ExternalData Cache; confirm the completeness gate passes and the save succeeds.
3. Confirm that the consumer workflows report cache-hit=true on restore and proceed without network fetches.