This proposes a 9-stage runtime taxonomy for the canonical US Microplex build.
The goal is to give microplex-us a shared language for how a US dataset build proceeds from configuration and source loading through final dataset assembly and validation. Each stage should eventually have a clear contract: what it consumes, what it produces, what diagnostics it owns, and what artifacts should be saved for later inspection.
The proposed stages are:
Run profile, config, and source bundle
Source contracts and source loading
Source planning, fusion planning, and scaffold selection
Seed construction and donor integration
Synthesis, candidate population, and support enforcement
PolicyEngine entity construction and microsimulation materialization
Target resolution, selection, and calibration
Dataset assembly and publication
Validation and benchmarking
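As a rough illustration, the nine stages could be captured as an ordered enumeration so that code, snapshots, and diagrams share the same identifiers. The names `RuntimeStage` and `downstream_of` are hypothetical and are not taken from microplex-us.

```python
from enum import IntEnum


class RuntimeStage(IntEnum):
    """Hypothetical identifiers for the nine proposed stages, in run order."""

    RUN_PROFILE = 1
    SOURCE_LOADING = 2
    SOURCE_PLANNING = 3
    SEED_CONSTRUCTION = 4
    SYNTHESIS = 5
    ENTITY_CONSTRUCTION = 6
    CALIBRATION = 7
    DATASET_ASSEMBLY = 8
    VALIDATION = 9


def downstream_of(stage: RuntimeStage) -> list:
    """Return the stages that run after `stage`, in order."""
    return [s for s in RuntimeStage if s > stage]
```

For example, `downstream_of(RuntimeStage.CALIBRATION)` yields the assembly and validation stages, which is the kind of lookup a rerun or resume decision might need.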
Stage 1: Run profile, config, and source bundle
This stage defines the build that is about to run.
It should answer questions like:
What profile is being run?
Which source providers are included?
What target period/year is being built?
Which calibration backend is selected?
Which target database and baseline dataset are being used?
Which run-level options, seeds, sample filters, and checkpoint/defer flags are active?
Current code that fits here includes the US build config, canonical rebuild config helpers, default source provider bundle, checkpoint config, and CLI/config plumbing.
Stage outputs should include a resolved run configuration and provider/query plan that downstream stages can reference.
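A minimal sketch of what a resolved run configuration might look like, assuming it is a frozen record that can be serialized next to the run artifacts. The class name, field names, and example values are all hypothetical; the actual US build config may be organized differently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ResolvedRunConfig:
    """Hypothetical Stage 1 output: everything later stages need to know."""

    profile: str
    source_providers: tuple
    target_period: int
    calibration_backend: str
    random_seed: int = 0
    deferred_stages: tuple = ()

    def to_json(self) -> str:
        # Sorted keys keep the serialized config stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


# Example values are placeholders, not real profile or backend names.
config = ResolvedRunConfig(
    profile="example_profile",
    source_providers=("cps", "puf"),
    target_period=2024,
    calibration_backend="example_backend",
)
```

Freezing the record is one way to make sure downstream stages reference the configuration without mutating it.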
Stage 2: Source contracts and source loading
This stage turns external datasets into Microplex source frames.
It should cover:
CPS loading and construction
PUF loading, uprating, demographics handling, and optional person expansion
ACS donor loading
SIPP donor loading
SCF donor loading
source manifests and source descriptors
source relationships, especially household-person links
The key output is a set of validated ObservationFrames: source metadata plus actual entity tables and relationships.
Stage diagnostics should include row counts, variable coverage, relationship validity, source provenance, cache/download provenance, and any remaining construction dependencies on external packages.
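The relationship-validity diagnostic could be as simple as the check below. This is a sketch that represents tables as lists of dictionaries with assumed `household_id` and `person_id` keys; it does not use the real ObservationFrame API, whose fields are not described here.

```python
def check_household_person_links(households, persons):
    """Report persons without a household and households without persons."""
    household_ids = {h["household_id"] for h in households}
    orphans = [
        p["person_id"] for p in persons if p["household_id"] not in household_ids
    ]
    populated = {p["household_id"] for p in persons}
    empty = sorted(household_ids - populated)
    return {
        "household_rows": len(households),
        "person_rows": len(persons),
        "orphan_person_ids": orphans,
        "empty_household_ids": empty,
        "valid": not orphans and not empty,
    }
```

A source frame would pass this check only when every person points at a known household and every household has at least one person.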
Stage 3: Source planning, fusion planning, and scaffold selection
This stage reasons about the source mix before building the population.
It should answer:
Which variables are covered by which sources?
Which source is the scaffold or backbone of the population?
Which sources are donors?
Which variables need donor integration or synthetic generation?
Why was one scaffold selected over another?
Current code that fits here includes fusion planning, source input preparation, and scaffold selection logic.
The output should be a source plan: scaffold source, donor sources, coverage map, selected variable families, and source-selection diagnostics.
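To make the shape of a source plan concrete, here is a small sketch. The selection rule (pick the source covering the most required variables, breaking ties alphabetically) is an assumption made for illustration and is not the project's actual scaffold selection logic; `plan_sources` and the result keys are hypothetical names.

```python
def plan_sources(source_variables, required):
    """Build a toy source plan from {source: set_of_variables}."""
    scores = {
        name: len(variables & required)
        for name, variables in source_variables.items()
    }
    # Assumed rule: highest coverage wins; sorted() makes ties deterministic.
    scaffold = max(sorted(scores), key=lambda name: scores[name])
    coverage = {
        variable: sorted(n for n, vs in source_variables.items() if variable in vs)
        for variable in sorted(required)
    }
    missing = required - source_variables[scaffold]
    return {
        "scaffold": scaffold,
        "donors": sorted({n for v in missing for n in coverage[v]}),
        "coverage": coverage,
        "needs_donor": sorted(v for v in missing if coverage[v]),
        "needs_synthesis": sorted(v for v in missing if not coverage[v]),
        "scaffold_scores": scores,
    }
```

Keeping `scaffold_scores` in the plan gives a direct answer to why one scaffold was selected over another.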
Stage 4: Seed construction and donor integration
This stage creates the initial seed population and projects donor variables onto it.
It should cover:
projecting the scaffold source into the canonical seed schema
creating canonical person/household identifiers and fields
deriving seed-level fields such as state, age group, and income bracket
integrating donor variables from PUF/ACS/SIPP/SCF or other donor sources
applying donor imputation rules, conditioning surfaces, exclusions, and authoritative overrides
The output is a seed frame that is richer than the scaffold alone because donor information has been integrated.
Stage diagnostics should include seed row/column counts, donor-integrated variables, donor source per variable/block, selected conditioning variables, excluded variables, authoritative overrides, and donor-conditioning diagnostics.
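One simple way to picture donor integration is a cell-based hot deck: seed records receive a donor value drawn at random from donor records that share the same conditioning values. This is a stand-in technique chosen for illustration, not a description of the donor imputation rules used in microplex-us, and the function and key names are hypothetical.

```python
import random


def integrate_donor_variable(seed_rows, donor_rows, variable, condition_on, rng):
    """Copy `variable` onto seed rows from donors in the same conditioning cell."""
    cells = {}
    for donor in donor_rows:
        key = tuple(donor[c] for c in condition_on)
        cells.setdefault(key, []).append(donor[variable])

    enriched, unmatched = [], 0
    for row in seed_rows:
        key = tuple(row[c] for c in condition_on)
        pool = cells.get(key)
        new_row = dict(row)  # leave the input rows untouched
        if pool:
            new_row[variable] = rng.choice(pool)
        else:
            new_row[variable] = None  # no donor in this cell
            unmatched += 1
        enriched.append(new_row)

    diagnostics = {
        "variable": variable,
        "conditioning_variables": list(condition_on),
        "donor_cells": len(cells),
        "unmatched_seed_rows": unmatched,
    }
    return enriched, diagnostics
```

Passing an explicit `random.Random(seed)` keeps the draw reproducible, and the returned diagnostics mirror the conditioning information listed above.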
Stage 5: Synthesis, candidate population, and support enforcement
This stage turns the seed into the candidate population that will be calibrated.
It should cover:
seed passthrough when the configured synthesis backend is seed
bootstrap or model-backed synthesis when configured
synthesis variable selection
target support enforcement so calibration targets have represented categories
synthetic/candidate population finalization
The output is the candidate population before final calibration.
Stage diagnostics should include synthesis backend, condition variables, target variables, candidate row count, support-enforcement changes, synthesizer metadata, and saved candidate artifacts such as seed_data.parquet, synthetic_data.parquet, and optionally synthesizer.pt.
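Support enforcement can be sketched as a check that every category a calibration target refers to appears at least once in the candidate population. The repair strategy below (append one seed record for each missing category) is an assumption for illustration only, and `enforce_support` is a hypothetical name.

```python
def enforce_support(candidates, seed_rows, variable, required_categories):
    """Ensure each required category of `variable` is represented."""
    present = {row[variable] for row in candidates}
    result = list(candidates)
    added, unsupported = [], []
    for category in required_categories:
        if category in present:
            continue
        # Assumed repair: borrow the first seed record with this category.
        donor = next((r for r in seed_rows if r[variable] == category), None)
        if donor is None:
            unsupported.append(category)  # nothing to borrow; target is at risk
        else:
            result.append(dict(donor))
            added.append(category)
    diagnostics = {
        "added_categories": added,
        "unsupported_categories": unsupported,
        "candidate_rows": len(result),
    }
    return result, diagnostics
```

Categories that end up in `unsupported_categories` are the ones a later target-filtering step would likely need to drop or flag.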
Stage 6: PolicyEngine entity construction and microsimulation materialization
This stage converts the candidate population into PolicyEngine-style entity tables.
It should cover construction of:
households
persons
tax units
SPM units
families
marital units
It should also cover PE-facing input augmentation, ID/link integrity, compatibility shims, pre-simulation readiness, and any materialization needed before calibration against PolicyEngine target variables.
The output is a PolicyEngine entity-table bundle suitable for calibration, export, and simulation.
Stage diagnostics should include entity row counts, relationship integrity checks, missing or filled PE input variables, direct override variables, pre-simulation compatibility warnings, and checkpoint artifacts after imputation or microsimulation materialization where available.
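A possible relationship integrity check for group entities is sketched below. It assumes person rows carry one ID column per group entity (for example `tax_unit_id`) and that each group entity should sit inside a single household; both the column names and the function name are hypothetical.

```python
def group_entity_integrity(persons, entity_key):
    """Count entity rows and flag entities whose members span households."""
    households_by_entity = {}
    for person in persons:
        households_by_entity.setdefault(person[entity_key], set()).add(
            person["household_id"]
        )
    straddling = sorted(
        entity_id
        for entity_id, households in households_by_entity.items()
        if len(households) > 1
    )
    return {
        "entity": entity_key,
        "entity_rows": len(households_by_entity),
        "straddling_ids": straddling,
        "valid": not straddling,
    }
```

Running this once per entity key (tax units, SPM units, families, marital units) would give the entity row counts and link checks in a single pass over the person table.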
Stage 7: Target resolution, selection, and calibration
This stage resolves calibration targets and solves final weights.
It should cover:
loading target specifications
building target queries/profiles
inferring PolicyEngine variable bindings
materializing missing target variables
filtering unsupported or infeasible targets
optional household-budget selection
solving calibration weights
running deferred calibration stages when configured
summarizing calibration quality and target fit
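To illustrate what solving calibration weights means in the simplest case, the sketch below uses raking (iterative proportional fitting) against categorical totals. This is a textbook technique swapped in for illustration; it is not the calibration backend used by microplex-us, and the function names and target layout are assumptions.

```python
def rake_weights(rows, weights, targets, iterations=50):
    """Rescale weights so weighted category totals approach `targets`.

    `targets` maps variable -> {category: target_total}.
    """
    w = [float(x) for x in weights]
    for _ in range(iterations):
        for variable, totals in targets.items():
            current = weighted_totals(rows, w, variable)
            for i, row in enumerate(rows):
                category = row[variable]
                if category in totals and current[category] > 0:
                    w[i] *= totals[category] / current[category]
    return w


def weighted_totals(rows, weights, variable):
    """Sum weights by category of `variable`."""
    totals = {}
    for row, weight in zip(rows, weights):
        totals[row[variable]] = totals.get(row[variable], 0.0) + weight
    return totals
```

Comparing `weighted_totals` before and after the solve is a basic form of the target-fit summary mentioned in the list above.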
The output is a calibrated entity-table bundle and calibration summary.
Stage diagnostics should include loaded/supported/unsupported target counts, feasibility filters, materialized variables, materialization failures, selected constraints, calibration stages, deferred-stage status, target ledger, oracle loss, active-solve loss, convergence status, sparsity, weight diagnostics, and collapse warnings.
Stage 8: Dataset assembly and publication
This stage turns the calibrated result into the distributable dataset artifact.
It should cover:
mapping Microplex/PolicyEngine entity tables to export variables
building time-period arrays
writing the final H5 dataset
writing the artifact manifest
writing the data-flow snapshot
writing local artifact bundle metadata
recording publication metadata if a release/upload path is added
The output is the assembled dataset bundle, currently centered on policyengine_us.h5 plus metadata sidecars.
Stage diagnostics should include H5 existence/loadability, exported variable maps, excluded variables, row counts, weight totals, manifest contents, checksums if added, and publication target/status if applicable.
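If checksums are added, the artifact manifest could be produced along these lines. This is a sketch using only the Python standard library; the manifest file name and layout are assumptions, not the existing manifest format.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bundle_dir: Path, manifest_name: str = "manifest.json") -> dict:
    """Record size and checksum for every file in the bundle directory."""
    entries = {}
    for path in sorted(bundle_dir.iterdir()):
        if path.is_file() and path.name != manifest_name:
            entries[path.name] = {
                "bytes": path.stat().st_size,
                "sha256": sha256_of(path),
            }
    manifest = {"files": entries}
    (bundle_dir / manifest_name).write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest


if __name__ == "__main__":
    # Demo with a placeholder file standing in for the real H5 dataset.
    with tempfile.TemporaryDirectory() as tmp:
        bundle = Path(tmp)
        (bundle / "policyengine_us.h5").write_bytes(b"placeholder")
        print(json.dumps(write_manifest(bundle), indent=2))
```

A manifest like this lets a later validation run confirm it is evaluating exactly the file that the assembly stage wrote.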
Stage 9: Validation and benchmarking
This stage evaluates the assembled dataset.
It should cover:
PolicyEngine simulation harness checks
target-level comparison against baselines
native PE-US-data score comparisons where configured
parity or benchmark summaries
native audits
imputation ablation evidence
run registry/index evidence for benchmark frontiers
The output is a validation and benchmark evidence bundle attached to the completed dataset artifact.
Stage diagnostics should include harness summaries, native score summaries, target win rates, candidate-vs-baseline deltas, materialization or simulation failures, audit outputs, ablation outputs, and refreshed saved-run evidence.
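A target win rate and candidate-vs-baseline deltas could be computed roughly as follows. The definition used here (share of targets where the candidate's absolute error is strictly smaller than the baseline's) is an assumption, since the proposal does not define the metric; `target_win_rate` is a hypothetical name.

```python
def target_win_rate(targets, candidate, baseline):
    """Compare candidate and baseline estimates against target values."""
    if not targets:
        raise ValueError("at least one target is required")
    wins = ties = 0
    deltas = {}
    for name, target in targets.items():
        candidate_error = abs(candidate[name] - target)
        baseline_error = abs(baseline[name] - target)
        # Negative delta means the candidate is closer to the target.
        deltas[name] = candidate_error - baseline_error
        if candidate_error < baseline_error:
            wins += 1
        elif candidate_error == baseline_error:
            ties += 1
    return {"win_rate": wins / len(targets), "ties": ties, "deltas": deltas}
```

Keeping the per-target deltas alongside the aggregate rate makes it possible to see which targets drive a win or a loss.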
Why define these stages?
A canonical 9-stage taxonomy would make the US Microplex build easier to explain, inspect, and improve.
It would support:
stage contracts: required inputs, expected outputs, artifacts, diagnostics, and failure modes per stage
saved-run overlays: per-run status showing which stages completed, failed, skipped, or deferred
pipeline diagrams: a stable visual map of the canonical US build
documentation links: each stage can point to relevant library functions and source modules
rerun/resume semantics: the project can define what can be reused, recomputed, or attached after the main build
diagnostics ownership: each diagnostic artifact can live with the stage that produced the relevant behavior
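The saved-run overlay and resume ideas can be sketched together: each stage number maps to a status, and a helper finds where a rerun would need to start. The status names follow the list above, with an extra `PENDING` value added here as an assumption; the rule that skipped and deferred stages do not block a resume is also an assumption.

```python
from enum import Enum
from typing import Optional


class StageStatus(str, Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    SKIPPED = "skipped"
    DEFERRED = "deferred"
    PENDING = "pending"  # assumed extra status for stages not yet reached


def first_stage_to_rerun(overlay: dict, stage_count: int = 9) -> Optional[int]:
    """Return the first stage that failed or never ran, or None if all done."""
    for stage in range(1, stage_count + 1):
        status = overlay.get(stage, StageStatus.PENDING)
        if status in (StageStatus.FAILED, StageStatus.PENDING):
            return stage
    return None
```

For example, an overlay in which stages 1 and 2 completed, stage 3 was skipped, and stage 4 failed would resume at stage 4.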
Suggested next steps
Add canonical runtime stage definitions in microplex-us.
Update saved data-flow snapshots to report these 9 stages.
Add a first-class entity construction/materialization stage.
Define the minimum artifact and diagnostic contract for each stage.
Use the taxonomy as the backbone for pipeline diagrams, saved-run overlays, and library documentation links.