Canonical runtime stages for the US Microplex build #20

anth-volk · 2026-05-27T17:01:50Z

anth-volk
May 27, 2026
Maintainer

Summary

This proposes a 9-stage runtime taxonomy for the canonical US Microplex build.

The goal is to give microplex-us a shared language for how a US dataset build proceeds from configuration and source loading through final dataset assembly and validation. Each stage should eventually have a clear contract: what it consumes, what it produces, what diagnostics it owns, and what artifacts should be saved for later inspection.

The proposed stages are:

1. Run profile, config, and source bundle
2. Source contracts and source loading
3. Source planning, fusion planning, and scaffold selection
4. Seed construction and donor integration
5. Synthesis, candidate population, and support enforcement
6. PolicyEngine entity construction and microsimulation materialization
7. Target resolution, selection, and calibration
8. Dataset assembly and publication
9. Validation and benchmarking
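
As a rough illustration of how the nine stages could be expressed in code, here is a minimal sketch. The `BuildStage` enum, its member names, and the `stages_after` helper are hypothetical and do not come from microplex-us; only the ordering follows the list above.

```python
from enum import IntEnum


class BuildStage(IntEnum):
    """Hypothetical stage identifiers; the numeric values follow the proposed order."""

    RUN_PROFILE = 1
    SOURCE_LOADING = 2
    SOURCE_PLANNING = 3
    SEED_CONSTRUCTION = 4
    SYNTHESIS = 5
    ENTITY_CONSTRUCTION = 6
    CALIBRATION = 7
    DATASET_ASSEMBLY = 8
    VALIDATION = 9


def stages_after(stage):
    """Return the stages that come later in the build than `stage`, in order."""
    return [s for s in BuildStage if s > stage]
```

An ordered enum like this would let snapshots, overlays, and diagrams refer to stages by one stable identifier instead of free-form strings.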

Stage 1: Run profile, config, and source bundle

This stage defines the build that is about to run.

It should answer questions like:

What profile is being run?
Which source providers are included?
What target period/year is being built?
Which calibration backend is selected?
Which target database and baseline dataset are being used?
Which run-level options, seeds, sample filters, and checkpoint/defer flags are active?

Current code that fits here includes the US build config, canonical rebuild config helpers, default source provider bundle, checkpoint config, and CLI/config plumbing.

Stage outputs should include a resolved run configuration and provider/query plan that downstream stages can reference.
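
One possible shape for the resolved run configuration is sketched below. The `ResolvedRunConfig` class, its field names, and the `config_fingerprint` helper are assumptions made for illustration; the actual US build config in microplex-us may be organized differently.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResolvedRunConfig:
    """Hypothetical resolved configuration handed to downstream stages."""

    profile: str
    source_providers: tuple
    target_year: int
    calibration_backend: str
    random_seed: int = 0
    defer_calibration: bool = False


def config_fingerprint(config):
    """Return a short, stable hash of the config so saved runs can be compared."""
    payload = json.dumps(asdict(config), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

A frozen dataclass keeps downstream stages from mutating the run definition, and the fingerprint gives saved artifacts a cheap way to record which configuration produced them.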

Stage 2: Source contracts and source loading

This stage turns external datasets into Microplex source frames.

It should cover:

CPS loading and construction
PUF loading, uprating, demographics handling, and optional person expansion
ACS donor loading
SIPP donor loading
SCF donor loading
source manifests and source descriptors
source relationships, especially household-person links

The key output is a set of validated ObservationFrames: source metadata plus actual entity tables and relationships.

Stage diagnostics should include row counts, variable coverage, relationship validity, source provenance, cache/download provenance, and any remaining construction dependencies on external packages.
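
The relationship-validity diagnostic could look roughly like the sketch below. It uses plain lists of dicts instead of real ObservationFrames, and the function name and the `household_id` / `person_id` column names are assumptions.

```python
def check_household_person_links(households, persons):
    """Report row counts plus persons without a household and households without persons."""
    household_ids = {h["household_id"] for h in households}
    referenced_ids = {p["household_id"] for p in persons}
    return {
        "household_rows": len(households),
        "person_rows": len(persons),
        # Persons whose household_id does not appear in the household table.
        "orphan_persons": sorted(
            p["person_id"] for p in persons if p["household_id"] not in household_ids
        ),
        # Households that no person row points to.
        "empty_households": sorted(household_ids - referenced_ids),
    }
```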

Stage 3: Source planning, fusion planning, and scaffold selection

This stage reasons about the source mix before building the population.

It should answer:

Which variables are covered by which sources?
Which source is the scaffold or backbone of the population?
Which sources are donors?
Which variables need donor integration or synthetic generation?
Why was one scaffold selected over another?

Current code that fits here includes fusion planning, source input preparation, and scaffold selection logic.

The output should be a source plan: scaffold source, donor sources, coverage map, selected variable families, and source-selection diagnostics.
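
To make the idea of a source plan concrete, here is a minimal sketch. The "pick the source covering the most required variables" rule is a stand-in heuristic chosen for illustration, not the project's actual scaffold selection logic, and the function and key names are hypothetical.

```python
def build_source_plan(source_variables, required_variables):
    """Build a coverage map and pick a scaffold with a simple coverage heuristic.

    source_variables maps a source name to the set of variables it provides.
    """
    required = list(required_variables)
    coverage = {
        var: sorted(src for src, cols in source_variables.items() if var in cols)
        for var in required
    }

    def covered_count(source):
        return sum(1 for var in required if var in source_variables[source])

    # Sorting first makes ties resolve alphabetically and deterministically.
    scaffold = max(sorted(source_variables), key=covered_count)
    # Variables the scaffold lacks but some other source provides.
    donor_variables = {
        var: sources
        for var, sources in coverage.items()
        if sources and scaffold not in sources
    }
    # Variables no source provides; these would need synthetic generation.
    uncovered = [var for var, sources in coverage.items() if not sources]
    return {
        "scaffold": scaffold,
        "coverage": coverage,
        "donor_variables": donor_variables,
        "uncovered": uncovered,
    }
```

Whatever rule is used in practice, returning the coverage map alongside the decision is what lets the plan answer "why was one scaffold selected over another?".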

Stage 4: Seed construction and donor integration

This stage creates the initial seed population and projects donor variables onto it.

It should cover:

projecting the scaffold source into the canonical seed schema
creating canonical person/household identifiers and fields
deriving seed-level fields such as state, age group, and income bracket
integrating donor variables from PUF/ACS/SIPP/SCF or other donor sources
applying donor imputation rules, conditioning surfaces, exclusions, and authoritative overrides

The output is a seed frame that is richer than the scaffold alone because donor information has been integrated.

Stage diagnostics should include seed row/column counts, donor-integrated variables, donor source per variable/block, selected conditioning variables, excluded variables, authoritative overrides, and donor-conditioning diagnostics.
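
The sketch below illustrates donor integration with a simple random hot-deck inside conditioning cells. This is a deliberately basic technique swapped in for illustration; it is not necessarily the donor imputation method microplex-us uses, and all names here are hypothetical.

```python
import random
from collections import defaultdict


def hot_deck_integrate(seed_rows, donor_rows, condition_vars, donor_var, rng):
    """Copy donor_var onto each seed row from a random donor in the same cell."""
    pools = defaultdict(list)
    for donor in donor_rows:
        pools[tuple(donor[v] for v in condition_vars)].append(donor[donor_var])

    integrated = []
    unmatched = 0
    for row in seed_rows:
        key = tuple(row[v] for v in condition_vars)
        new_row = dict(row)  # leave the input seed rows untouched
        if pools.get(key):
            new_row[donor_var] = rng.choice(pools[key])
        else:
            # No donor shares this cell; record it instead of guessing a value.
            new_row[donor_var] = None
            unmatched += 1
        integrated.append(new_row)

    diagnostics = {
        "donor_var": donor_var,
        "condition_vars": list(condition_vars),
        "unmatched_rows": unmatched,
    }
    return integrated, diagnostics
```

Passing an explicit `random.Random(seed)` as `rng` keeps the integration reproducible, and the returned diagnostics correspond to the per-variable conditioning information listed above.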

Stage 5: Synthesis, candidate population, and support enforcement

This stage turns the seed into the candidate population that will be calibrated.

It should cover:

seed passthrough when the synthesis backend is set to seed
bootstrap or model-backed synthesis when configured
synthesis variable selection
target support enforcement so calibration targets have represented categories
synthetic/candidate population finalization

The output is the candidate population before final calibration.

Stage diagnostics should include synthesis backend, condition variables, target variables, candidate row count, support-enforcement changes, synthesizer metadata, and saved candidate artifacts such as seed_data.parquet, synthetic_data.parquet, and optionally synthesizer.pt.
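
A support check of the kind described above might be sketched as follows. The function name and row layout are assumptions; the sketch only detects target categories with no candidate rows and leaves open how enforcement would repair them.

```python
from collections import Counter


def find_unsupported_categories(candidate_rows, variable, target_categories):
    """Return target categories that have no rows in the candidate population."""
    counts = Counter(row[variable] for row in candidate_rows)
    # Counter returns 0 for categories that never appear.
    return [category for category in target_categories if counts[category] == 0]
```

A category with zero candidate rows cannot be hit by reweighting alone, which is why this check belongs before calibration.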

Stage 6: PolicyEngine entity construction and microsimulation materialization

This stage converts the candidate population into PolicyEngine-style entity tables.

It should cover construction of:

households
persons
tax units
SPM units
families
marital units

It should also cover PolicyEngine-facing (PE-facing) input augmentation, ID/link integrity, compatibility shims, pre-simulation readiness, and any materialization needed before calibration against PolicyEngine target variables.

The output is a PolicyEngine entity-table bundle suitable for calibration, export, and simulation.

Stage diagnostics should include entity row counts, relationship integrity checks, missing or filled PE input variables, direct override variables, pre-simulation compatibility warnings, and checkpoint artifacts after imputation or microsimulation materialization where available.
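
One relationship-integrity check could be sketched as below. It assumes that every tax unit, SPM unit, family, and marital unit is nested inside a single household, and it assumes the ID column names shown; both are illustrative assumptions, not a description of the existing code.

```python
from collections import defaultdict

# Hypothetical person-level columns linking each person to a group entity.
GROUP_ID_FIELDS = ("tax_unit_id", "spm_unit_id", "family_id", "marital_unit_id")


def find_cross_household_groups(persons):
    """Return, per group field, the group IDs whose members span several households."""
    problems = {}
    for field in GROUP_ID_FIELDS:
        households_by_group = defaultdict(set)
        for person in persons:
            households_by_group[person[field]].add(person["household_id"])
        bad = sorted(g for g, hh in households_by_group.items() if len(hh) > 1)
        if bad:
            problems[field] = bad
    return problems
```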

Stage 7: Target resolution, selection, and calibration

This stage resolves calibration targets and solves final weights.

It should cover:

loading target specifications
building target queries/profiles
inferring PolicyEngine variable bindings
materializing missing target variables
filtering unsupported or infeasible targets
optional household-budget selection
solving calibration weights
running deferred calibration stages when configured
summarizing calibration quality and target fit

The output is a calibrated entity-table bundle and calibration summary.

Stage diagnostics should include loaded/supported/unsupported target counts, feasibility filters, materialized variables, materialization failures, selected constraints, calibration stages, deferred-stage status, target ledger, oracle loss, active-solve loss, convergence status, sparsity, weight diagnostics, and collapse warnings.
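
The target-fit part of the calibration summary could be computed along these lines. This is a sketch only: the function name, the 5% tolerance, and the rule that a target is unsupported when its variable is missing or its value is zero are all assumptions.

```python
def summarize_target_fit(weights, values_by_target, targets, tolerance=0.05):
    """Compare weighted totals against targets and list targets that cannot be scored."""
    fit = {}
    unsupported = []
    for name, target in targets.items():
        # Skip targets with no materialized variable or an undefined relative error.
        if name not in values_by_target or target == 0:
            unsupported.append(name)
            continue
        estimate = sum(w * x for w, x in zip(weights, values_by_target[name]))
        rel_error = abs(estimate - target) / abs(target)
        fit[name] = {
            "estimate": estimate,
            "target": target,
            "rel_error": rel_error,
            "within_tolerance": rel_error <= tolerance,
        }
    return {"fit": fit, "unsupported": unsupported}
```

For example, with weights `[2.0, 3.0]` and per-record wages `[10, 20]`, the weighted total is 80, which is 20% below a target of 100 and would be flagged as outside the tolerance.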

Stage 8: Dataset assembly and publication

This stage turns the calibrated result into the distributable dataset artifact.

It should cover:

mapping Microplex/PolicyEngine entity tables to export variables
building time-period arrays
writing the final H5 dataset
writing the artifact manifest
writing the data-flow snapshot
writing local artifact bundle metadata
recording publication metadata if a release/upload path is added

The output is the assembled dataset bundle, currently centered on policyengine_us.h5 plus metadata sidecars.

Stage diagnostics should include H5 existence/loadability, exported variable maps, excluded variables, row counts, weight totals, manifest contents, checksums if added, and publication target/status if applicable.
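
If checksums are added, the manifest could record them roughly as sketched below. The `build_artifact_manifest` function and the manifest layout are hypothetical, and the demo writes placeholder bytes to a temporary directory instead of a real H5 dataset.

```python
import hashlib
import tempfile
from pathlib import Path


def build_artifact_manifest(bundle_dir):
    """Record size and SHA-256 for every file directly inside bundle_dir."""
    entries = {}
    for path in sorted(Path(bundle_dir).iterdir()):
        if path.is_file():
            entries[path.name] = {
                "bytes": path.stat().st_size,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            }
    return entries


# Demo on a throwaway directory; the file content is a placeholder, not real H5 data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "policyengine_us.h5").write_bytes(b"placeholder bytes, not a real H5 file")
    demo_manifest = build_artifact_manifest(tmp)
```

Reading whole files into memory is acceptable for a sketch; a large dataset would more sensibly be hashed in chunks.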

Stage 9: Validation and benchmarking

This stage evaluates the assembled dataset.

It should cover:

PolicyEngine simulation harness checks
target-level comparison against baselines
native PE-US-data score comparisons where configured
parity or benchmark summaries
native audits
imputation ablation evidence
run registry/index evidence for benchmark frontiers

The output is a validation and benchmark evidence bundle attached to the completed dataset artifact.

Stage diagnostics should include harness summaries, native score summaries, target win rates, candidate-vs-baseline deltas, materialization or simulation failures, audit outputs, ablation outputs, and refreshed saved-run evidence.
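
Target win rates and candidate-vs-baseline deltas could be derived as in the sketch below. It assumes a "win" means the candidate's absolute error on a target is strictly smaller than the baseline's; that definition and the function name are assumptions, not the project's stated metric.

```python
def target_win_rate(targets, candidate, baseline):
    """Compare per-target absolute errors of a candidate dataset against a baseline."""
    wins = 0
    ties = 0
    deltas = {}
    for name, target in targets.items():
        candidate_error = abs(candidate[name] - target)
        baseline_error = abs(baseline[name] - target)
        # Negative delta means the candidate is closer to the target.
        deltas[name] = candidate_error - baseline_error
        if candidate_error < baseline_error:
            wins += 1
        elif candidate_error == baseline_error:
            ties += 1
    total = len(targets)
    return {
        "win_rate": wins / total if total else 0.0,
        "wins": wins,
        "ties": ties,
        "losses": total - wins - ties,
        "deltas": deltas,
    }
```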

Why define these stages?

A canonical 9-stage taxonomy would make the US Microplex build easier to explain, inspect, and improve.

It would support:

stage contracts: required inputs, expected outputs, artifacts, diagnostics, and failure modes per stage
saved-run overlays: per-run status showing which stages completed, failed, skipped, or deferred
pipeline diagrams: a stable visual map of the canonical US build
documentation links: each stage can point to relevant library functions and source modules
rerun/resume semantics: the project can define what can be reused, recomputed, or attached after the main build
diagnostics ownership: each diagnostic artifact can live with the stage that produced the relevant behavior
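
A saved-run overlay could be as simple as a status per stage, as in the sketch below. The `StageStatus` values mirror the statuses named above plus a hypothetical `NOT_RUN`, and the resume rule (restart at the first stage that is neither completed nor skipped) is an assumption for illustration.

```python
from enum import Enum


class StageStatus(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"
    DEFERRED = "deferred"
    NOT_RUN = "not_run"


def build_overlay(stage_names, recorded):
    """Pair every canonical stage with its recorded status, defaulting to NOT_RUN."""
    return [(name, recorded.get(name, StageStatus.NOT_RUN)) for name in stage_names]


def first_stage_needing_work(overlay):
    """Return the first stage that is neither completed nor skipped, or None."""
    finished = {StageStatus.COMPLETED, StageStatus.SKIPPED}
    for name, status in overlay:
        if status not in finished:
            return name
    return None
```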

Suggested next steps

Add canonical runtime stage definitions in microplex-us.
Update saved data-flow snapshots to report these 9 stages.
Add a first-class entity construction/materialization stage.
Define the minimum artifact and diagnostic contract for each stage.
Use the taxonomy as the backbone for pipeline diagrams, saved-run overlays, and library documentation links.
