ENH: Dedicate populate-externaldata-cache workflow; consumers restore-only #6131

Closed
hjmjohnson wants to merge 3 commits into InsightSoftwareConsortium:main from hjmjohnson:populate-externaldata-cache

@hjmjohnson
Member

Makes one workflow the single-writer owner of the externaldata-v1-<hashFiles('**/*.cid')> GitHub Actions cache key. Stacks on top of #6109: merge after that lands; the diff will shrink to the three-file delta once #6109 is in main.

Root cause (observed on #6109)

Cache entry externaldata-v1-4cf653e4ff... was 117 MiB. Sampling 50 random CIDs against the ITKTestingData GitHub Pages mirror gave a mean object size of ~1.57 MB × 1,509 unique CIDs = ~2.43 GB expected corpus. 117 MiB ÷ 2,430 MiB = ~4.8% of content — zstd cannot compress .mha/.nrrd/.png data 20×, so the cache was incomplete, not just compressed.
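
The size argument above can be re-run as a quick sanity check. This is a minimal sketch using only the rounded figures quoted in this description; the function names are illustrative and not part of the PR. With the rounded mean the product lands near 2.37 GB rather than 2.43 GB, presumably a rounding effect, and the conclusion is the same either way: a complete cache would need roughly a 20× compression ratio.

```python
def estimate_corpus_mb(mean_object_mb: float, unique_cids: int) -> float:
    """Extrapolate the total corpus size from a sampled mean object size."""
    return mean_object_mb * unique_cids


def cached_fraction(cache_mib: float, corpus_mib: float) -> float:
    """Fraction of the expected corpus that the cache entry accounts for."""
    return cache_mib / corpus_mib


# Rounded figures from the description above.
corpus_mb = estimate_corpus_mb(1.57, 1509)
fraction = cached_fraction(117, 2430)
# Compression ratio that a *complete* cache of this size would imply.
implied_ratio = 1 / fraction

print(f"corpus ~{corpus_mb:.0f} MB, cached {fraction:.1%}, "
      f"implied ratio {implied_ratio:.1f}x")
```
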

Timeline on c2f2e1151639:

  • pixi.yml (has Prefetch) failed to start ("workflow file issue") → never saved.
  • arm.yml / ARMBUILD-x86_64-rosetta (no Prefetch) finished 19:32:40Z → saved 117 MiB at 19:32:34Z containing only whatever blobs the ARM build happened to fetch for its enabled modules.
  • Every later workflow hit cache-hit=true on the shared key, skipped Prefetch by design, and propagated the hole forever.
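
The timeline can be reproduced with a toy model. This is a sketch under stated assumptions, not the actual workflow logic: `SharedCache`, `run_workflow`, and the ten-element `ALL_CIDS` corpus are hypothetical names invented for illustration. The only behavior modeled is the one described above: the first save under a key wins, and a cache hit skips prefetch.

```python
from typing import Dict, Optional, Set

# Hypothetical full corpus of ten objects.
ALL_CIDS: Set[str] = {f"cid{i}" for i in range(10)}


class SharedCache:
    """Toy key-addressed cache in which the first save under a key wins."""

    def __init__(self) -> None:
        self._entries: Dict[str, Set[str]] = {}

    def restore(self, key: str) -> Optional[Set[str]]:
        entry = self._entries.get(key)
        return set(entry) if entry is not None else None

    def save(self, key: str, blobs: Set[str]) -> bool:
        if key in self._entries:
            return False  # an existing entry is never overwritten
        self._entries[key] = set(blobs)
        return True


def run_workflow(cache: SharedCache, key: str, needed: Set[str],
                 has_prefetch: bool) -> Set[str]:
    """Restore, prefetch only on a miss, fetch in-band, then try to save."""
    restored = cache.restore(key)
    cache_hit = restored is not None
    store = restored or set()
    if has_prefetch and not cache_hit:
        store |= ALL_CIDS          # full prefetch happens only on a cold cache
    store |= needed                # in-band fetch for the enabled modules
    cache.save(key, store)
    return store


cache = SharedCache()
key = "externaldata-v1-demo"       # hypothetical key
# A writer without prefetch finishes first and saves a partial entry.
run_workflow(cache, key, {"cid0", "cid1"}, has_prefetch=False)
# A later writer with prefetch sees cache-hit and skips prefetch.
local = run_workflow(cache, key, {"cid2"}, has_prefetch=True)
print(len(cache.restore(key)), "of", len(ALL_CIDS))  # prints: 2 of 10
```

The second run fills its own local store a little further, but the shared entry stays at the partial contents written by the first finisher.
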

Fix

  • New workflow .github/workflows/populate-externaldata-cache.yml is the only writer of the shared key. Runs on: PRs that touch **/*.cid, pushes to main/release*, nightly at 05:17 UTC, and workflow_dispatch.
  • A completeness gate after prefetch compares the number of unique CIDs in the tree against the number of objects on disk; it refuses to run actions/cache/save if any are missing, so a partial cache can never be written under the shared key.
  • arm.yml and pixi.yml Save steps removed; pixi.yml Prefetch step removed. Both keep their Restore step and now participate as restore-only consumers. Explanatory comments left where the Save/Prefetch used to be to prevent someone re-introducing the race.
  • Azure Cache@2 subsystem (7 pipelines) is untouched — same anti-pattern but a separate cache backend; follow-up.
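
A completeness gate along these lines could look like the sketch below. It is not the code in this PR. It assumes each .cid file contains just the CID string and that objects live at <store>/cid/<cid>; the function names are hypothetical. It also uses a set difference rather than a bare count comparison, which is a slightly stricter variant of the check described above.

```python
from pathlib import Path
from typing import Set


def referenced_cids(source_root: Path) -> Set[str]:
    """Unique CIDs named by every .cid content link under source_root."""
    return {p.read_text().strip() for p in source_root.rglob("*.cid")}


def missing_objects(source_root: Path, store: Path) -> Set[str]:
    """CIDs referenced in the tree but absent from <store>/cid/."""
    obj_dir = store / "cid"
    present = {p.name for p in obj_dir.iterdir()} if obj_dir.is_dir() else set()
    return referenced_cids(source_root) - present


def gate(source_root: Path, store: Path) -> int:
    """Return 0 when the store is complete, 1 otherwise (usable as an exit code)."""
    missing = missing_objects(source_root, store)
    for cid in sorted(missing):
        print(f"missing object: {cid}")
    return 1 if missing else 0
```

A workflow step could call `gate(...)` and make the save step conditional on a zero result, so an incomplete store never reaches the shared key.
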

Verification plan

  1. After this PR merges, manually delete any existing partial externaldata-v1-* entries: gh cache delete --repo InsightSoftwareConsortium/ITK externaldata-v1-<digest>.
  2. Trigger workflow_dispatch on Populate ExternalData Cache; confirm the completeness gate passes and the save succeeds.
  3. Subsequent consumer runs (pixi.yml, arm.yml) should hit cache-hit=true on restore and proceed without network fetches.

Point ExternalData_OBJECT_STORES at a dedicated persistent directory
($(Pipeline.Workspace)/ExternalData on Azure, ${{ runner.temp }}/ExternalData
on GitHub Actions) that is cached separately from ccache.  The release
tarball becomes a warm-boot seed: on a cold cache it is downloaded and
unpacked into the store; on a warm cache the seed step short-circuits
and no network access is needed.

Applies to all 9 CI configurations:

  .github/workflows/arm.yml                          (3 matrix jobs)
  .github/workflows/pixi.yml                         (3 matrix jobs)
  Testing/ContinuousIntegration/AzurePipelinesBatch.yml
  Testing/ContinuousIntegration/AzurePipelinesLinux.yml         (3 jobs)
  Testing/ContinuousIntegration/AzurePipelinesLinuxPython.yml
  Testing/ContinuousIntegration/AzurePipelinesMacOS.yml
  Testing/ContinuousIntegration/AzurePipelinesMacOSPython.yml
  Testing/ContinuousIntegration/AzurePipelinesWindows.yml
  Testing/ContinuousIntegration/AzurePipelinesWindowsPython.yml

ExternalData blobs are platform-agnostic, so all jobs within a given CI
system share a single cache entry keyed solely on ExternalDataVersion.
Each run saves an immutable entry under ExternalDataVersion + SourceVersion
(same pattern as ccache); restore-keys falls through to the most recent
prior cache under the same version.  Blobs fetched ad-hoc from mirrors
during a run persist to the next run under the fallthrough restore-key.
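
The restore behavior described above can be modeled in a few lines. This is a simplified sketch of exact-key-then-prefix matching as documented for actions/cache, not its implementation; the `Entry` type, the `lookup` function, and the key strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Entry:
    key: str
    created: int  # larger means more recently created


def lookup(entries: List[Entry], primary: str,
           restore_keys: Sequence[str]) -> Tuple[Optional[Entry], bool]:
    """Return (entry, cache_hit); cache_hit is True only for an exact key match."""
    for entry in entries:
        if entry.key == primary:
            return entry, True
    # Fall through the restore-keys in order, matching by prefix and
    # preferring the most recently created entry.
    for prefix in restore_keys:
        matches = [e for e in entries if e.key.startswith(prefix)]
        if matches:
            return max(matches, key=lambda e: e.created), False
    return None, False


# Hypothetical keys of the form externaldata-<ExternalDataVersion>-<SourceVersion>.
entries = [
    Entry("externaldata-v1-sha1", created=1),
    Entry("externaldata-v1-sha2", created=2),
    Entry("externaldata-v0-sha9", created=3),
]
found, hit = lookup(entries, "externaldata-v1-sha3", ["externaldata-v1-"])
print(found.key, hit)  # prints: externaldata-v1-sha2 False
```

The fallback restores real data while still reporting a miss, which is why blobs fetched ad hoc in one run carry over to the next under the same version prefix.
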

Kept as sibling directories rather than colocated under CCACHE_DIR so
ccache --cleanup does not consider the ExternalData tree, and so ccache's
SHA-pinned cache invalidation does not force redundant ExternalData cache
writes on every commit.

The tarball seed is gated on a version-tagged sentinel file
(.seeded-v<ExternalDataVersion>) inside the store, not on the
actions/cache cache-hit output.  actions/cache reports cache-hit='true'
only for an exact primary-key match; a restore-keys fallback still
reports cache-hit='false' even though the data was restored.  Keying
off the sentinel correctly skips the tarball download in the
restore-keys-fallback case too.
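
A sentinel-gated seed step might be sketched as follows. Only the sentinel name `.seeded-v<ExternalDataVersion>` comes from the text above; `seed_store` and the `download_and_unpack` callback are hypothetical stand-ins for the real tarball download.

```python
from pathlib import Path
from typing import Callable


def seed_store(store: Path, version: str,
               download_and_unpack: Callable[[Path], None]) -> bool:
    """Seed the store from the release tarball unless this version is already seeded.

    Returns True if the seed ran and False if the sentinel short-circuited it.
    """
    sentinel = store / f".seeded-v{version}"
    if sentinel.exists():
        return False               # warm store: no network access needed
    store.mkdir(parents=True, exist_ok=True)
    download_and_unpack(store)     # caller supplies the actual tarball logic
    sentinel.touch()               # written only after a successful unpack
    return True
```

Because the check reads the restored directory itself, it behaves the same whether the store arrived through an exact key match or through a restore-keys fallback.
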

The GHA ExternalData cache is keyed on hashFiles('**/*.cid'), so the
saved entry should contain an object for every .cid in the tree. In
practice the in-band ExternalData fetch only pulls blobs for the
modules currently selected for compilation and whose tests run; every
other reference stays out of the store and has to be re-downloaded
from gateways on the next cold boot.

Add Utilities/Maintenance/PrefetchCIDContentLinks.py that walks the
source tree, reads every .cid file, and downloads any missing
<store>/cid/<cid> through the same gateway list
CMake/ITKExternalData.cmake uses. The script delegates HTTPS to
curl (available on every supported runner) so TLS verification lives
entirely in the system stack, and uses a ThreadPoolExecutor for
parallel downloads. Idempotent — already-present objects are skipped.
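
The shape of such a prefetch pass is roughly as follows. This is an illustrative sketch, not the contents of PrefetchCIDContentLinks.py: the gateway URL is a placeholder (the real list lives in CMake/ITKExternalData.cmake), the function names are hypothetical, and each .cid file is assumed to contain only the CID string.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, List, Sequence

# Placeholder gateway template; the real list is in CMake/ITKExternalData.cmake.
GATEWAYS: Sequence[str] = ("https://gateway.example.invalid/ipfs/{cid}",)


def curl_fetch(cid: str, dest: Path) -> bool:
    """Try each gateway with curl; return True once one download succeeds."""
    tmp = dest.with_name(dest.name + ".part")
    for template in GATEWAYS:
        cmd = ["curl", "--fail", "--silent", "--show-error", "--location",
               "--output", str(tmp), template.format(cid=cid)]
        if subprocess.run(cmd).returncode == 0:
            tmp.replace(dest)      # publish only complete downloads
            return True
    tmp.unlink(missing_ok=True)
    return False


def prefetch(source_root: Path, store: Path,
             fetch: Callable[[str, Path], bool] = curl_fetch,
             workers: int = 8) -> List[str]:
    """Download every referenced CID missing from <store>/cid/; return failures."""
    obj_dir = store / "cid"
    obj_dir.mkdir(parents=True, exist_ok=True)
    cids = sorted({p.read_text().strip() for p in source_root.rglob("*.cid")})
    # Idempotent: objects already on disk are skipped.
    todo = [c for c in cids if not (obj_dir / c).exists()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: fetch(c, obj_dir / c), todo))
    return [c for c, ok in zip(todo, results) if not ok]
```

Passing the fetcher in as a parameter keeps the tree walk and the skip logic testable without network access; downloading to a `.part` file and renaming avoids leaving truncated objects in the store.
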

Wire it into .github/workflows/pixi.yml immediately before
'Save ExternalData object store', guarded on a cold
restore-externaldata cache so warm boots stay instant.
…-only

The shared GitHub Actions cache entry externaldata-v1-<hashFiles>
populated by PR InsightSoftwareConsortium#6109 was subject to a multi-writer race: arm.yml,
pixi.yml, and (via a parallel Cache@2 subsystem) the Azure pipelines
all saved under the same key, and whichever job finished first won.
Only pixi.yml had a Prefetch step that pulled every .cid blob; the
other workflows saved only the subset the in-band ExternalData fetch
happened to touch for the modules each build had enabled.  On PR InsightSoftwareConsortium#6109
itself the ARM rosetta build (no prefetch) wrote 117 MiB; the real
corpus is ~2.4 GB.  Consumers that restored the partial cache then hit
cache-hit=true, skipped prefetch by design, and propagated the hole.

Split the writer out: .github/workflows/populate-externaldata-cache.yml
is now the only workflow that saves the shared key.  It runs on PRs
that touch .cid files, on pushes to main/release*, nightly, and on
workflow_dispatch.  A completeness gate after prefetch counts the
unique CIDs referenced in the tree against the objects on disk and
fails the save if any are missing, so a partial cache can never be
written under the shared key.

Consumer workflows arm.yml and pixi.yml now only restore the
ExternalData cache; their Save (and, for pixi.yml, Prefetch) steps are
removed.  The Azure Cache@2 subsystem is untouched by this commit; the
same pattern applies there but is a separate follow-up because those
pipelines cache via Azure DevOps, not GitHub Actions.
@github-actions (bot) added labels on Apr 24, 2026: type:Infrastructure (Infrastructure/ecosystem related changes, such as CMake or buildbots), type:Enhancement (Improvement of existing methods or implementation), area:Python wrapping (Python bindings for a class), type:Testing (Ensure that the purpose of a class is met/the results on a wide set of test cases are correct)
@hjmjohnson closed this on Apr 24, 2026
@hjmjohnson deleted the populate-externaldata-cache branch on April 24, 2026 22:21
@hjmjohnson
Member Author

Auto-closed by branch deletion. The changes have been folded into #6109 directly:

  • Two-commit final history on top of latest upstream/main (5840df6):
    • 25bd1af14b — ENH: Cache ExternalData object store across CI runs (9 CI configs; GHA workflows arm.yml/pixi.yml are now restore-only)
    • 4cbf6cfc75 — ENH: Dedicate populate-externaldata-cache workflow for CI prefetch (adds populate-externaldata-cache.yml + PrefetchCIDContentLinks.py with the completeness gate)

See #6109 for the current state.
