Emit reform_validation.json: dataset budget effects vs JCT scores #63

Draft
PavelMakarchuk wants to merge 1 commit into main from reform-validation

@PavelMakarchuk

Contributor

What

Adds a per-release reform-validation artifact, reform_validation.json — the downstream counterpart to calibration_diagnostics.json. Calibration measures fit to its targets; this measures how closely the calibrated dataset reproduces the budget effects of JCT-scored reforms. It's consumed by the calibration-diagnostics dashboard (PolicyEngine/calibration-diagnostics#16).

Two labelled kinds of reform

  • in-sample — the JCT tax-expenditure reforms that are themselves calibration targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the calibration's own final_estimate (no extra simulation), flagged in_sample=true so a consumer knows agreement is expected.
  • out-of-sample — OBBBA provisions the calibration never saw (obbba_reforms.json): no-tax-on-tips and no-tax-on-overtime, with their per-fiscal-year JCX-35-25 scores. OBBBA is baked into the policyengine-us baseline, so each is encoded as a counterfactual revert; the provision effect is baseline − reform (sign-comparable to the JCT enactment score), simulated at FY2026 against JCT's FY2026 line.
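The two labels and the sign convention can be sketched roughly as follows. This is a minimal illustration, not the code in reform_validation.py: `ValidationRow`, `in_sample_row`, `out_of_sample_row`, the reform name, and the revenue figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidationRow:
    """Hypothetical row shape; the real artifact schema may differ."""

    name: str
    in_sample: bool
    estimate: Optional[float]  # $B; None until a calibration or simulation fills it


def in_sample_row(name: str, final_estimate: float) -> ValidationRow:
    # In-sample: reuse the calibration's own final_estimate, no extra simulation.
    return ValidationRow(name=name, in_sample=True, estimate=final_estimate)


def out_of_sample_row(name: str, baseline: float, reverted: float) -> ValidationRow:
    # The provision is already in the baseline and the reform reverts it, so the
    # provision effect is baseline - reform (same sign as a JCT enactment score).
    return ValidationRow(name=name, in_sample=False, estimate=baseline - reverted)


# Illustrative numbers only: reverting a tax cut raises revenue from 2000.0 to 2006.25.
row = out_of_sample_row("no_tax_on_tips", baseline=2000.0, reverted=2006.25)
print(row.estimate)  # -6.25
```

A revenue-losing provision therefore comes out negative, matching the sign of the JCT score it is compared against.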

Why only two OBBBA provisions (for now)

The curated set deliberately excludes provisions where a clean validation isn't possible — documented inline in obbba_reforms.json:

  • SALT cap, CTC, standard deduction — the JCX-35-25 line bundles TCJA extension + enhancement, so a parameter revert can't be isolated to the JCT figure.
  • Senior bonus deduction — no standalone JCX-35-25 line (it's netted inside the personal-exemption termination line).
  • Trump accounts — not modeled in policyengine-us.
  • Estate exemption — clean parameter, but estate tax rarely fires in microdata.

That leaves tips and overtime: genuinely new provisions whose revert captures the whole provision and whose JCT line is exact.

Files

  • packages/populace-build/src/populace/build/us/reform_validation.py: ReformValidationSpec, in-sample/out-of-sample spec builders, reform_validation_payload (microsim isolated behind an injected simulate()), write_reform_validation.
  • .../us/obbba_reforms.json — curated out-of-sample set + JCT citations.
  • tools/build_us_fiscal_refresh_release.py — writes reform_validation.json after the release H5 and registers it in the release manifest; adds --skip-reform-validation / --skip-out-of-sample-reforms.
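For orientation, the two skip flags could gate the work roughly like this. The flag names come from this PR; how they interact (skipping out-of-sample reforms still writes the in-sample rows) is an assumption, and `build_parser` and `plan` are hypothetical names rather than functions in the release script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of the reform-validation flags")
    parser.add_argument(
        "--skip-reform-validation",
        action="store_true",
        help="Do not write reform_validation.json at all.",
    )
    parser.add_argument(
        "--skip-out-of-sample-reforms",
        action="store_true",
        help="Write the artifact but skip the OBBBA microsimulations.",
    )
    return parser


def plan(args: argparse.Namespace) -> dict:
    # Assumed semantics: the second flag only matters when the artifact is written.
    write_artifact = not args.skip_reform_validation
    return {
        "write_artifact": write_artifact,
        "simulate_out_of_sample": write_artifact and not args.skip_out_of_sample_reforms,
    }


args = build_parser().parse_args(["--skip-out-of-sample-reforms"])
print(plan(args))  # {'write_artifact': True, 'simulate_out_of_sample': False}
```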

Tests

packages/populace-build/tests/test_reform_validation.py — 9 tests (sign conventions incl. the counterfactual flip, in-sample-from-calibration, shipped-config loading), fake-sim isolated so they need no policyengine-us. Existing test_us_fiscal_targets.py (20) unaffected. ruff clean.
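A test in this style might look like the sketch below. It is an illustration of the fake-sim pattern, not one of the 9 shipped tests: `out_of_sample_estimates`, the `Simulate` signature, and the parameter path are hypothetical, and returning `None` when no simulator is supplied is an assumption about how null estimates arise.

```python
from typing import Callable, Optional

# Hypothetical signature: takes a reform dict (None means baseline) and returns
# income tax revenue in $B.
Simulate = Callable[[Optional[dict]], float]


def out_of_sample_estimates(reforms: dict, simulate: Optional[Simulate]) -> dict:
    """Map reform name -> provision effect, or None when no simulator is supplied."""
    if simulate is None:
        return {name: None for name in reforms}
    baseline = simulate(None)
    return {name: baseline - simulate(revert) for name, revert in reforms.items()}


def test_counterfactual_flip() -> None:
    calls = []

    def fake_simulate(reform):
        # Stand-in for the microsimulation: reverting the provision raises revenue.
        calls.append(reform)
        return 1000.0 if reform is None else 1010.0

    reforms = {"tips": {"hypothetical.parameter.path": 0}}
    assert out_of_sample_estimates(reforms, fake_simulate) == {"tips": -10.0}
    assert calls[0] is None  # the baseline is simulated first
    assert out_of_sample_estimates(reforms, None) == {"tips": None}


test_counterfactual_flip()
```

Because the simulator is passed in, the test never imports policyengine-us and runs in milliseconds.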

State / follow-up

Out-of-sample budget effects populate when a release build actually runs the OBBBA microsims (build_us_fiscal_refresh_release.py); until then the artifact is the in-sample rows plus null out-of-sample estimates. I have not run a full release build here (needs the base H5 + a calibration run), so the OBBBA parameter paths and the resulting FY2026 magnitudes should be sanity-checked against a real build before merge.

🤖 Generated with Claude Code

Adds a per-release reform-validation artifact, the downstream counterpart to
calibration_diagnostics.json: where calibration measures fit to its targets,
this measures how closely the calibrated dataset reproduces the budget effects
of JCT-scored reforms. The calibration-diagnostics dashboard consumes it.

Two labelled kinds of reform:
- in-sample: the JCT tax-expenditure reforms that are themselves calibration
  targets (US_JCT_TAX_EXPENDITURE_REFORMS). Their populace estimate is the
  calibration's own final_estimate — no extra simulation — flagged
  in_sample=True so a consumer knows agreement is expected.
- out-of-sample: OBBBA provisions the calibration never saw (obbba_reforms.json:
  no-tax-on-tips and no-tax-on-overtime, with their per-FY JCX-35-25 scores).
  OBBBA is baked into the policyengine-us baseline, so each is encoded as a
  counterfactual revert and the provision effect is baseline - reform
  (sign-comparable to the JCT enactment score), simulated at FY2026.

- packages/populace-build/.../reform_validation.py: ReformValidationSpec, the
  in-sample/out-of-sample spec builders, reform_validation_payload (microsim
  isolated behind an injected simulate() for testing), write_reform_validation.
- obbba_reforms.json: curated out-of-sample set; excludes provisions whose JCT
  line bundles TCJA extension (SALT/CTC/standard deduction), lacks a standalone
  line (senior deduction), or isn't modeled (Trump accounts) — documented inline.
- build_us_fiscal_refresh_release.py: writes reform_validation.json after the
  release H5, adds it to the release manifest; --skip-reform-validation and
  --skip-out-of-sample-reforms flags.
- 9 unit tests (sign conventions, in-sample-from-calibration, config loading),
  fake-sim isolated so they need no policyengine-us. ruff clean.

Out-of-sample budget effects populate when a release build runs the OBBBA
microsims; the artifact is otherwise the in-sample rows plus null estimates.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PavelMakarchuk added a commit to PolicyEngine/calibration-diagnostics that referenced this pull request Jun 16, 2026
Sync with the producer schema (PolicyEngine/populace#63): each reform now
carries in_sample + period. Out-of-sample reforms (OBBBA provisions the
calibration never saw) are the genuine fidelity test; in-sample reforms are
JCT tax-expenditure calibration targets the dataset was tuned to.

- reforms.ts: read in_sample/period; summary adds out-of-sample-only stats
  (n_out_of_sample, out_of_sample_within_10pct, out_of_sample_mean_abs_rel_err);
  history series carries in_sample.
- View: out-of-sample KPIs headline; per-reform in-sample/out-of-sample badge;
  out-of-sample rows sorted first; description explains the split.
- Tests assert the out-of-sample summary isolates the in-sample miss.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@PavelMakarchuk

Contributor Author

Verified end-to-end on the released populace-US dataset

Ran the reforms through policyengine-us 1.334.0 on populace_us_2024.h5 (the live HF release), FY2026. Both out-of-sample provisions construct as proper counterfactual reverts and produce real, correctly-signed budget effects:

| Reform | populace | JCT FY2026 (JCX-35-25) | error |
|---|---|---|---|
| No tax on tips | −$6.27B | −$10.12B | +38% (under) |
| No tax on overtime | −$17.67B | −$32.81B | +46% (under) |
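The error column is consistent with a signed relative error, (estimate − JCT) / |JCT|. That formula is inferred from the figures above rather than quoted from the code, and the 10% threshold below is assumed from the dashboard's `out_of_sample_within_10pct` field name. A quick check:

```python
def relative_error(estimate: float, jct: float) -> float:
    # Signed error relative to the magnitude of the JCT score. With both values
    # negative, a positive result means the estimate is a smaller revenue loss.
    return (estimate - jct) / abs(jct)


# (populace estimate, JCT FY2026 score), both in $B, taken from the table above.
rows = {
    "No tax on tips": (-6.27, -10.12),
    "No tax on overtime": (-17.67, -32.81),
}

for name, (estimate, jct) in rows.items():
    err = relative_error(estimate, jct)
    print(f"{name}: {err:+.0%}, within 10%: {abs(err) <= 0.10}")
# No tax on tips: +38%, within 10%: False
# No tax on overtime: +46%, within 10%: False
```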

Data coverage on the populace dataset (this is what makes the reforms measurable): tip_income = $136.9B, fsla_overtime_premium = $118.3B (6,578 records). Note: overtime is a no-op on the default CPS (fsla_overtime_premium is an unimputed input = 0 there) but populace imputes it, so it validates on populace.

The in-sample neutralize path was also verified to construct and run (neutralizing SALT/medical/charitable raises income_tax by $22.4B / $11.3B / $61.9B — the positive tax-expenditure values, correct convention).

The 38% to 46% under-estimate is the actual validation signal: populace under-captures tip and overtime income relative to JCT's assumptions.

Corrections found while testing: the CTC path gov.irs.credits.ctc.amount.base is a bracket ParameterScale, not a scalar — it can't be set with a flat value (it was already excluded for TCJA-bundling, now also confirmed unencodable as written). Remaining gap: in-sample budget effects come from a live calibration final_estimate, which a full release build produces — not exercised here.
