Reform validation: populace estimates vs JCT scores#16

Draft
PavelMakarchuk wants to merge 3 commits into main from reform-validation

@PavelMakarchuk

Contributor

What

A new Reform validation view (consumer side) that compares populace-US's microsimulated budget effects against the official JCT scores for reforms the JCT has scored (OBBBA and others), and tracks the gap release-over-release.

This is downstream validation to complement the existing calibration diagnostics: calibration asks "does the dataset reproduce its calibration targets"; this asks "does the dataset reproduce the budget effects of reforms an authority has scored".

How it fits the pure-HF architecture

The dashboard can't run microsimulation, so the scores come from a new per-release artifact, reform_validation.json, published by the populace build pipeline (the producer side is a follow-up PR on PolicyEngine/populace). This PR is the consumer:

  • lib/populace/reforms.ts — pure-HF loader; schema v1 documented inline (the producer/consumer contract). Derives populace − JCT error per reform; buildReformHistory assembles per-reform run-over-run series across releases.
  • API: /api/populace/reforms?release= (returns 200 with available:false when a release predates the artifact, not an error) and /api/populace/reforms/history.
  • /populace/reforms page — KPIs (reforms scored, mean |error|, within-10%), a populace-vs-JCT table, and a run-over-run trend with sparklines.
  • Nav entry; React Query hooks + types; bun tests.
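For readers who want a concrete picture of the "populace − JCT error per reform" derivation, here is a minimal sketch. It is not the code in reforms.ts: the type names, the field names (`populace`, `jct`), the choice to normalise by |JCT|, and the inclusive 10% threshold are all assumptions made for illustration.

```typescript
// Hypothetical row shape; the real schema v1 is documented in lib/populace/reforms.ts.
interface ReformRowSketch {
  id: string;
  populace: number; // populace-US budget effect, in any consistent unit
  jct: number;      // official JCT score, same unit
}

interface ReformErrorSketch {
  id: string;
  error: number;               // populace − JCT
  relError: number | null;     // error / |JCT|, null when JCT is 0
  within10pct: boolean | null; // |relError| <= 0.10 (inclusive bound is an assumption)
}

function deriveReformError(row: ReformRowSketch): ReformErrorSketch {
  const error = row.populace - row.jct;
  // Zero-score guard: a relative error against a score of 0 is undefined,
  // so report the absolute gap only.
  if (row.jct === 0) {
    return { id: row.id, error, relError: null, within10pct: null };
  }
  const relError = error / Math.abs(row.jct);
  return { id: row.id, error, relError, within10pct: Math.abs(relError) <= 0.1 };
}
```

Dividing by the absolute value of the JCT score keeps the sign of the relative error aligned with the sign of the raw gap, whether the reform raises or loses revenue.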

State

Until the producer PR lands and a build publishes reform_validation.json, the page shows a clear "not published yet" empty state. Verified live against HF: the endpoint returns available:false for the current release, history is empty, page renders 200.
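One way a client could model the "200 with available:false" behaviour is a discriminated union on `available`, sketched below. Only the `available` flag comes from this PR; the other field names and the `describeReforms` helper are hypothetical.

```typescript
// Hypothetical response shape; only `available` is taken from the PR description.
type ReformsResponseSketch =
  | { available: false; release: string }
  | { available: true; release: string; reforms: { id: string }[] };

// Narrowing on `available` forces callers to handle the empty state
// before touching `reforms`.
function describeReforms(res: ReformsResponseSketch): string {
  if (!res.available) {
    return `Reform validation not published yet for ${res.release}`;
  }
  return `${res.reforms.length} reforms scored for ${res.release}`;
}
```

Treating a missing artifact as a normal response rather than an HTTP error lets the page render its empty state without a separate error path.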

Tests

bun test — 12 pass (4 new: error derivation, summary counting, chronological history delta, zero-score guard). tsc --noEmit clean.
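The chronological history delta that the tests exercise could be computed along these lines. This is a sketch, not buildReformHistory itself: the input shape, the `publishedAt` field, and the definition of the delta as the change in |relative error| between the two latest releases are assumptions.

```typescript
// Hypothetical per-release input: relative error per reform id.
interface ReleaseErrorsSketch {
  release: string;
  publishedAt: string; // ISO 8601 date string, so string order equals time order
  relErrors: { [reformId: string]: number };
}

interface ReformSeriesSketch {
  id: string;
  points: { release: string; relError: number }[];
  delta: number | null; // latest |relError| minus previous |relError|
}

function sketchReformHistory(releases: ReleaseErrorsSketch[]): ReformSeriesSketch[] {
  // Sort a copy oldest-first so the caller's array is left untouched.
  const ordered = releases.slice().sort((x, y) => x.publishedAt.localeCompare(y.publishedAt));
  const byId: { [id: string]: ReformSeriesSketch } = {};
  for (const rel of ordered) {
    for (const id of Object.keys(rel.relErrors)) {
      if (!byId[id]) byId[id] = { id, points: [], delta: null };
      byId[id].points.push({ release: rel.release, relError: rel.relErrors[id] });
    }
  }
  return Object.keys(byId).map((id) => {
    const series = byId[id];
    const n = series.points.length;
    if (n >= 2) {
      // Negative delta means the latest release moved closer to the benchmark.
      series.delta = Math.abs(series.points[n - 1].relError) - Math.abs(series.points[n - 2].relError);
    }
    return series;
  });
}
```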

Follow-up (producer)

PR on PolicyEngine/populace: a build step that scores a fixed set of JCT-scored reforms on each release and publishes reform_validation.json per the schema in reforms.ts.

🤖 Generated with Claude Code

Downstream validation to complement calibration diagnostics: how closely
populace-US reproduces the budget effects of reforms the JCT has officially
scored (OBBBA and other JCT-scored reforms), tracked release-over-release.

- lib/populace/reforms.ts: pure-HF loader for a new per-release artifact
  reform_validation.json (schema v1 documented inline), deriving the
  populace−JCT error per reform, and buildReformHistory for run-over-run.
- API: /api/populace/reforms (one release; 200 + available:false when the
  artifact isn't published yet) and /api/populace/reforms/history.
- /populace/reforms page: KPIs (reforms scored, mean |error|, within-10%),
  a populace-vs-JCT table, and a run-over-run trend with sparklines.
- Nav entry; hooks + types; bun tests for error derivation, summary, and
  the chronological history delta.

The scores are produced by the populace build pipeline (a follow-up PR on
PolicyEngine/populace publishes reform_validation.json); the dashboard reads
it live and shows a clear empty state until then.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
PavelMakarchuk and others added 2 commits June 16, 2026 00:37
Sync with the producer schema (PolicyEngine/populace#63): each reform now
carries in_sample + period. Out-of-sample reforms (OBBBA provisions the
calibration never saw) are the genuine fidelity test; in-sample reforms are
JCT tax-expenditure calibration targets the dataset was tuned to.

- reforms.ts: read in_sample/period; summary adds out-of-sample-only stats
  (n_out_of_sample, out_of_sample_within_10pct, out_of_sample_mean_abs_rel_err);
  history series carries in_sample.
- View: out-of-sample KPIs headline; per-reform in-sample/out-of-sample badge;
  out-of-sample rows sorted first; description explains the split.
- Tests assert the out-of-sample summary isolates the in-sample miss.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
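The out-of-sample summary described in this commit can be sketched as follows. The three output field names are taken from the commit message; the input shape, the `relError` field, and the inclusive 10% bound are assumptions for illustration.

```typescript
// Hypothetical input row; `in_sample` mirrors the producer field named in the commit.
interface ScoredReformSketch {
  id: string;
  in_sample: boolean;
  relError: number; // (populace − JCT) / |JCT|
}

interface OutOfSampleSummarySketch {
  n_out_of_sample: number;
  out_of_sample_within_10pct: number;
  out_of_sample_mean_abs_rel_err: number | null; // null when there are no out-of-sample rows
}

function summarizeOutOfSample(rows: ScoredReformSketch[]): OutOfSampleSummarySketch {
  // In-sample rows are calibration targets, so they are excluded from the fidelity stats.
  const oos = rows.filter((r) => !r.in_sample);
  const absErrs = oos.map((r) => Math.abs(r.relError));
  const total = absErrs.reduce((sum, e) => sum + e, 0);
  return {
    n_out_of_sample: oos.length,
    out_of_sample_within_10pct: absErrs.filter((e) => e <= 0.1).length,
    out_of_sample_mean_abs_rel_err: oos.length > 0 ? total / oos.length : null,
  };
}
```

Filtering before aggregating is what keeps a large in-sample miss from inflating the headline out-of-sample figures.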
The producer now emits big-provision tax-expenditure reforms (CTC/EITC/CDCC/
standard/itemized) benchmarked against JCT or Treasury, plus magnitude-only
rows for provisions neither scores (standard deduction, all-itemized). Update
the view labels accordingly: "JCT score" → "Benchmark", and explain that some
rows show the repeal magnitude only. The loader already handles null benchmark
(error/within-10% become null), so those rows flow through as magnitude-only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
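A sketch of how a null benchmark can flow through as a magnitude-only row, as this commit describes. The row shape and the `toDisplayRow` helper are hypothetical, not the loader's or the view's actual code.

```typescript
// Hypothetical row: `benchmark` is the JCT or Treasury figure, or null when neither scores it.
interface BenchmarkRowSketch {
  id: string;
  populace: number;
  benchmark: number | null;
}

interface DisplayRowSketch {
  id: string;
  magnitude: number;           // always shown
  error: number | null;        // null for magnitude-only rows
  within10pct: boolean | null; // null for magnitude-only rows or a zero benchmark
}

function toDisplayRow(row: BenchmarkRowSketch): DisplayRowSketch {
  if (row.benchmark === null) {
    // No authority has scored this provision: show the repeal magnitude only.
    return { id: row.id, magnitude: row.populace, error: null, within10pct: null };
  }
  const error = row.populace - row.benchmark;
  const within10pct =
    row.benchmark === 0 ? null : Math.abs(error / row.benchmark) <= 0.1;
  return { id: row.id, magnitude: row.populace, error, within10pct };
}
```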