Emit reform_validation.json: dataset budget effects vs JCT scores #63
PavelMakarchuk wants to merge 1 commit into
Adds a per-release reform-validation artifact, the downstream counterpart to calibration_diagnostics.json: where calibration measures fit to its targets, this measures how closely the calibrated dataset reproduces the budget effects of JCT-scored reforms. The calibration-diagnostics dashboard consumes it.

Two labelled kinds of reform:
- in-sample: the JCT tax-expenditure reforms that are themselves calibration targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the calibration's own final_estimate, with no extra simulation, flagged in_sample=True so a consumer knows agreement is expected.
- out-of-sample: OBBBA provisions the calibration never saw (obbba_reforms.json: no-tax-on-tips and no-tax-on-overtime, with their per-FY JCX-35-25 scores). OBBBA is baked into the policyengine-us baseline, so each is encoded as a counterfactual revert and the provision effect is baseline - reform (sign-comparable to the JCT enactment score), simulated at FY2026.

- packages/populace-build/.../reform_validation.py: ReformValidationSpec, the in-sample/out-of-sample spec builders, reform_validation_payload (microsim isolated behind an injected simulate() for testing), write_reform_validation.
- obbba_reforms.json: curated out-of-sample set; excludes provisions whose JCT line bundles TCJA extension (SALT/CTC/standard deduction), that lack a standalone JCT line (senior deduction), or that aren't modeled (Trump accounts). The exclusions are documented inline.
- build_us_fiscal_refresh_release.py: writes reform_validation.json after the release H5, adds it to the release manifest; --skip-reform-validation and --skip-out-of-sample-reforms flags.
- 9 unit tests (sign conventions, in-sample-from-calibration, config loading), fake-sim isolated so they need no policyengine-us. ruff clean.

Out-of-sample budget effects populate when a release build runs the OBBBA microsims; the artifact is otherwise the in-sample rows plus null estimates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
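The commit message above describes three behaviours: in-sample rows reuse the calibration's own estimate, out-of-sample rows use baseline − reform for a counterfactual revert, and estimates stay null when no microsim runs. A minimal Python sketch of that logic follows. It is not the code in reform_validation.py: ReformSpec, estimate, build_payload, fake_simulate, the simulate signature, and the payload keys are all hypothetical names chosen for illustration, and the dollar figures are made up.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed signature: simulate(reform_name, period) -> total income tax ($B).
# reform_name=None means "run the baseline".
Simulate = Callable[[Optional[str], int], float]


@dataclass(frozen=True)
class ReformSpec:
    """Illustrative stand-in for ReformValidationSpec; all fields are assumed."""

    name: str
    jct_score: float                        # JCT budget effect, $B
    period: int                             # fiscal year
    in_sample: bool
    final_estimate: Optional[float] = None  # calibration's own estimate (in-sample only)
    counterfactual: bool = False            # True when the reform reverts enacted law


def estimate(spec: ReformSpec, simulate: Optional[Simulate]) -> Optional[float]:
    """Return the dataset's budget effect for one reform, or None if not simulated."""
    if spec.in_sample:
        # In-sample reforms are calibration targets: no extra simulation.
        return spec.final_estimate
    if simulate is None:
        # No microsim available (e.g. the out-of-sample step was skipped).
        return None
    baseline = simulate(None, spec.period)
    reform = simulate(spec.name, spec.period)
    # A counterfactual revert removes an enacted provision, so the provision's
    # effect is baseline - reform, which matches the sign of an enactment score.
    return baseline - reform if spec.counterfactual else reform - baseline


def build_payload(specs: list[ReformSpec], simulate: Optional[Simulate] = None) -> dict:
    """Assemble one row per reform; the key names here are assumptions."""
    return {
        "reforms": [
            {
                "name": s.name,
                "in_sample": s.in_sample,
                "period": s.period,
                "jct_score": s.jct_score,
                "estimate": estimate(s, simulate),
            }
            for s in specs
        ]
    }


def fake_simulate(reform_name: Optional[str], period: int) -> float:
    """Fake microsim with made-up totals: reverting the provision raises tax by 10."""
    return 100.0 if reform_name is None else 110.0


specs = [
    ReformSpec("in_sample_target", 20.0, 2026, in_sample=True, final_estimate=22.0),
    ReformSpec("revert_provision", -10.0, 2026, in_sample=False, counterfactual=True),
]
with_sim = build_payload(specs, fake_simulate)
without_sim = build_payload(specs)  # out-of-sample estimate stays None
```

Injecting simulate keeps the sign convention testable without policyengine-us installed, which is the design the commit message describes for its unit tests.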
Sync with the producer schema (PolicyEngine/populace#63): each reform now carries in_sample + period. Out-of-sample reforms (OBBBA provisions the calibration never saw) are the genuine fidelity test; in-sample reforms are JCT tax-expenditure calibration targets the dataset was tuned to.

- reforms.ts: read in_sample/period; summary adds out-of-sample-only stats (n_out_of_sample, out_of_sample_within_10pct, out_of_sample_mean_abs_rel_err); history series carries in_sample.
- View: out-of-sample KPIs headline; per-reform in-sample/out-of-sample badge; out-of-sample rows sorted first; description explains the split.
- Tests assert the out-of-sample summary isolates the in-sample miss.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
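To make the out-of-sample-only statistics concrete, here is a rough Python sketch of how such a summary could be computed. The dashboard's real implementation lives in reforms.ts and may differ: the row keys, the choice to report within_10pct as a count, the exclusion of null estimates, and the relative-error definition (estimate − JCT) / |JCT| are all assumptions, and the numbers are invented.

```python
from typing import Optional


def rel_err(estimate: float, jct_score: float) -> float:
    """Signed relative error of the dataset estimate against the JCT score."""
    return (estimate - jct_score) / abs(jct_score)


def out_of_sample_summary(reforms: list[dict]) -> dict:
    """Summarise only out-of-sample rows that have a usable estimate."""
    errs = [
        abs(rel_err(r["estimate"], r["jct_score"]))
        for r in reforms
        if not r["in_sample"] and r["estimate"] is not None and r["jct_score"]
    ]
    n = len(errs)
    mean_err: Optional[float] = sum(errs) / n if n else None
    return {
        "n_out_of_sample": n,
        "out_of_sample_within_10pct": sum(e <= 0.10 for e in errs),
        "out_of_sample_mean_abs_rel_err": mean_err,
    }


# Made-up rows: the in-sample row misses badly but must not affect the summary.
rows = [
    {"name": "a", "in_sample": True, "jct_score": 20.0, "estimate": 50.0},
    {"name": "b", "in_sample": False, "jct_score": -10.0, "estimate": -9.5},
    {"name": "c", "in_sample": False, "jct_score": -10.0, "estimate": -6.0},
    {"name": "d", "in_sample": False, "jct_score": -10.0, "estimate": None},
]
summary = out_of_sample_summary(rows)
```

Filtering on in_sample before aggregating is what lets the headline KPIs reflect only reforms the calibration never saw.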
Verified end-to-end on the released populace-US dataset
Ran the reforms through policyengine-us 1.334.0 on
Data coverage on the populace dataset (this is what makes the reforms measurable):
The in-sample neutralize path was also verified to construct and run: neutralizing SALT/medical/charitable raises income_tax by $22.4B / $11.3B / $61.9B (the positive tax-expenditure values, which is the correct convention). The ~40% under-estimate is the actual validation signal: populace under-captures tip/overtime income relative to JCT's assumptions.
Corrections found while testing: the CTC path |
What
Adds a per-release reform-validation artifact, reform_validation.json, the downstream counterpart to calibration_diagnostics.json. Calibration measures fit to its targets; this measures how closely the calibrated dataset reproduces the budget effects of JCT-scored reforms. It's consumed by the calibration-diagnostics dashboard (PolicyEngine/calibration-diagnostics#16).

Two labelled kinds of reform
- In-sample: the JCT tax-expenditure reforms that are themselves calibration targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the calibration's own final_estimate (no extra simulation), flagged in_sample=true so a consumer knows agreement is expected.
- Out-of-sample: OBBBA provisions the calibration never saw (obbba_reforms.json): no-tax-on-tips and no-tax-on-overtime, with their per-fiscal-year JCX-35-25 scores. OBBBA is baked into the policyengine-us baseline, so each is encoded as a counterfactual revert; the provision effect is baseline − reform (sign-comparable to the JCT enactment score), simulated at FY2026 against JCT's FY2026 line.

Why only two OBBBA provisions (for now)
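The curated set deliberately excludes provisions where a clean validation isn't possible, documented inline in obbba_reforms.json:
- SALT, CTC, standard deduction: the JCT line bundles the TCJA extension.
- Senior deduction: no standalone JCT line.
- Trump accounts: not modeled.

That leaves tips and overtime: genuinely new provisions whose revert captures the whole provision and whose JCT line is exact.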
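As an illustration of how a curated config with inline exclusions might be loaded, here is a small Python sketch. The JSON shape below is hypothetical and is not the actual schema of obbba_reforms.json; the key names, reform names, and load_out_of_sample are invented for this example, and the JCT scores are left null rather than guessed.

```python
import json

# Hypothetical config shape. Real JCX-35-25 scores are deliberately omitted.
EXAMPLE_CONFIG = """
{
  "reforms": [
    {"name": "no_tax_on_tips", "period": 2026, "jct_score": null},
    {"name": "no_tax_on_overtime", "period": 2026, "jct_score": null}
  ],
  "excluded": [
    {"provision": "salt", "reason": "JCT line bundles TCJA extension"},
    {"provision": "ctc", "reason": "JCT line bundles TCJA extension"},
    {"provision": "standard_deduction", "reason": "JCT line bundles TCJA extension"},
    {"provision": "senior_deduction", "reason": "no standalone JCT line"},
    {"provision": "trump_accounts", "reason": "not modeled"}
  ]
}
"""

REQUIRED_KEYS = {"name", "period", "jct_score"}


def load_out_of_sample(text: str) -> list[dict]:
    """Parse the config and return the reforms to validate.

    Raises ValueError if a reform is missing a key or is also listed as excluded.
    """
    config = json.loads(text)
    excluded = {entry["provision"] for entry in config.get("excluded", [])}
    reforms = config["reforms"]
    for reform in reforms:
        missing = REQUIRED_KEYS - reform.keys()
        if missing:
            raise ValueError(f"{reform.get('name')}: missing {sorted(missing)}")
        if reform["name"] in excluded:
            raise ValueError(f"{reform['name']} is both included and excluded")
    return reforms


reforms = load_out_of_sample(EXAMPLE_CONFIG)
names = [r["name"] for r in reforms]
```

Keeping the exclusion reasons next to the included reforms means a reader of the config can see why the set is small without consulting the PR.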
Files
- packages/populace-build/src/populace/build/us/reform_validation.py: ReformValidationSpec, in/out-of-sample spec builders, reform_validation_payload (microsim isolated behind an injected simulate()), write_reform_validation.
- .../us/obbba_reforms.json: curated out-of-sample set + JCT citations.
- tools/build_us_fiscal_refresh_release.py: writes reform_validation.json after the release H5 and registers it in the release manifest; adds --skip-reform-validation / --skip-out-of-sample-reforms.

Tests
packages/populace-build/tests/test_reform_validation.py: 9 tests (sign conventions incl. the counterfactual flip, in-sample-from-calibration, shipped-config loading), fake-sim isolated so they need no policyengine-us. Existing test_us_fiscal_targets.py (20) unaffected. ruff clean.

State / follow-up
Out-of-sample budget effects populate when a release build actually runs the OBBBA microsims (build_us_fiscal_refresh_release.py); until then the artifact is the in-sample rows plus null out-of-sample estimates. I have not run a full release build here (needs the base H5 + a calibration run), so the OBBBA parameter paths and the resulting FY2026 magnitudes should be sanity-checked against a real build before merge.

🤖 Generated with Claude Code