Skip to content

Remove dead ZIP_CODE_DATASET infrastructure#7695

Merged
baogorek merged 5 commits intomainfrom
fix-zip-code-dataset-crash
Mar 6, 2026
Merged

Remove dead ZIP_CODE_DATASET infrastructure#7695
baogorek merged 5 commits intomainfrom
fix-zip-code-dataset-crash

Conversation

@baogorek
Copy link
Copy Markdown
Collaborator

@baogorek baogorek commented Mar 3, 2026

Summary

  • Removes ZIP_CODE_DATASET, which loaded from zip_codes.csv.gz — a file that never existed in this repo. Every installation had ZIP_CODE_DATASET = None.
  • The zip_code formula was dead code that crashed when reached via the LA County SLCSP rating area path on CA state-level H5 files (AttributeError: 'NoneType' object has no attribute 'state').
  • Beyond being dead, the formula was flawed — it assigned random zip codes per state, not per county, so an LA County household could get a San Francisco zip code.

Changes:

  • Delete zip_code_dataset.py and its import from data/__init__.py
  • Remove formulas from zip_code and three_digit_zip_code (kept as input-only variables)
  • Remove dead zip-code lookup branch from county.py

Fixes #7694, fixes #7664

Test plan

  • All 132 ACA YAML tests pass
  • All 17 medicaid eligibility tests pass (uses three_digit_zip_code as input)
  • make format clean

🤖 Generated with Claude Code

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.00%. Comparing base (c1470ad) to head (63a7861).
⚠️ Report is 38 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##             main     #7695       +/-   ##
============================================
+ Coverage   74.80%   100.00%   +25.19%     
============================================
  Files        3907         5     -3902     
  Lines       56380        69    -56311     
  Branches      271         2      -269     
============================================
- Hits        42177        69    -42108     
+ Misses      14159         0    -14159     
+ Partials       44         0       -44     
Flag Coverage Δ
unittests 100.00% <100.00%> (+25.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@baogorek
Copy link
Copy Markdown
Collaborator Author

baogorek commented Mar 3, 2026

Is Claude blowing smoke?

● OK so the key detail: codecov/patch is passing at 100% (our changed lines are fully covered). codecov/project is    
  failing at 70.69% (-27.13%) — that's the overall repo coverage, which has nothing to do with our PR. That -27% drop 
  is from the RRC files in the screenshot you showed earlier, plus likely other uncovered code across the repo.       
                                                                                                                      
  The codecov/project check is a repo-wide gate, not a per-PR gate. It's failing on main too — it's not something we  
  caused or can fix in this PR.   

@baogorek baogorek requested a review from PavelMakarchuk March 3, 2026 20:12
@PavelMakarchuk
Copy link
Copy Markdown
Collaborator

Consolidated Review Report: PR #7695 -- Remove Dead ZIP_CODE_DATASET Infrastructure

PR: #7695
Branch: fix-zip-code-dataset-crash
Type: Bug fix / Infrastructure cleanup
Scope: ACA geographic infrastructure (triggered by CA microsim crash)
CI Status: All 9 checks passing


Executive Summary

This PR removes the dead ZIP_CODE_DATASET infrastructure that never shipped (the backing zip_codes.csv.gz was absent from the repo) and caused AttributeError crashes during CA state-level microsimulation. It also adds defensive guards in two ACA variables (slcsp_rating_area_la_county and slcsp_age_0) to handle missing zip code and rating area data gracefully. The dead code removal is clean and complete, the guard logic is correct, and CI is green. The only actionable gap is missing test coverage for new guard paths.


Findings

CRITICAL (Must Fix)

None.

All code changes are correct. The dead code removal leaves zero dangling references. The ACA guard logic uses proper sentinel patterns. CI passes. No hard-coded values, no crash risks introduced, no incorrect formulas.


SHOULD ADDRESS (2 items)

S1: Missing test for slcsp_age_0 with rating_area=0

Source: Code validator (W1), Test validator (Finding 3 / Test C)
File: /Users/pavelmakarchuk/policyengine-us/policyengine_us/tests/policy/baseline/gov/aca/slcsp/slcsp_age_0.yaml

The PR adds a guard in slcsp_age_0.py that handles rating_area=0 (unknown) using a sentinel pattern: where(known, rating_area, 1) for the lookup, then where(known, cost, 0) for the result. This new logic has zero direct test coverage. All 13 existing slcsp_age_0 test cases use slcsp_rating_area >= 1.

Recommended test to add:

- name: slcsp_age_0 with unknown rating area (zero)
  period: 2024-01
  input:
    state_code: CA
    slcsp_rating_area: 0
  output:
    slcsp_age_0: 0

Why this matters: Without this test, a future refactor could accidentally remove the guard and reintroduce the parameter-lookup crash on key 0 (valid keys start at 1).

S2: LA County with unmapped zip prefix -- verify behavior

Source: Test validator (Finding 1 / Test A)
Files: /Users/pavelmakarchuk/policyengine-us/policyengine_us/variables/gov/aca/slspc/slcsp_rating_area_la_county.py, /Users/pavelmakarchuk/policyengine-us/policyengine_us/tests/policy/baseline/gov/aca/slcsp/slcsp_rating_area_la_county.yaml

The formula guards against zip3 == "" (empty/missing zip), but if an LA County household has a non-empty three_digit_zip_code that is NOT a key in la_county_rating_area.yaml (e.g., "999"), the p.la_county_rating_area[zip3[can_lookup]] parameter lookup could raise a KeyError.

Risk assessment: MEDIUM. In practice this requires an explicit API call setting county_str: LOS_ANGELES_COUNTY_CA with a nonsensical three_digit_zip_code. The has_zip guard catches the empty-string default, which is the primary crash vector. However, an explicit test confirming the behavior (crash or graceful fallback) would be valuable.

Recommended action: Add a test with three_digit_zip_code: "999" and county_str: LOS_ANGELES_COUNTY_CA. If the parameter framework returns a default (0), document it. If it crashes, add a try/except guard.


SUGGESTIONS (4 items)

G1: Test naming convention

Source: Test validator (Finding 4)
File: /Users/pavelmakarchuk/policyengine-us/policyengine_us/tests/policy/baseline/gov/aca/slcsp/slcsp_rating_area_la_county.yaml

Test case names use freeform descriptions (e.g., "LA county (Pasadena)") instead of the "Case N, description." convention. Also, one test has a grammar error: "Not an Los Angeles County" should be "Not a Los Angeles County".

Impact: Cosmetic only. Does not affect test execution.

G2: Unused Simulation import in slcsp_age_0.py (pre-existing)

Source: Code validator (W2)
File: /Users/pavelmakarchuk/policyengine-us/policyengine_us/variables/gov/aca/slspc/slcsp_age_0.py, line 2

from policyengine_core.simulations import Simulation is imported but never used. This is a pre-existing issue, not introduced by the PR. make format or a linter would flag it.

Impact: No runtime effect. Cleanup opportunity.

G3: Redundant is_in_la fetch inside defined_for = "in_la" formula (pre-existing)

Source: Code validator (W3), Test validator (Finding 2)
File: /Users/pavelmakarchuk/policyengine-us/policyengine_us/variables/gov/aca/slspc/slcsp_rating_area_la_county.py

The formula fetches is_in_la = household("in_la", period) and uses it in can_lookup = is_in_la & has_zip, but defined_for = "in_la" already ensures the formula only runs for LA County households. The is_in_la check is defense-in-depth. This is a pre-existing pattern that the PR preserved rather than introduced.

Impact: Minimal extra computation. Acceptable as defensive coding. No action required.

G4: data/__init__.py is now empty

Source: Code validator (I1)
File: /Users/pavelmakarchuk/policyengine-us/policyengine_us/data/__init__.py

After the PR, this file is empty. The data/ package exports nothing. If no other code depends on policyengine_us.data as an importable package, the directory could be removed in a follow-up cleanup PR. Keeping the empty __init__.py is harmless.

Impact: None. Cleanup opportunity for a future PR.


Reconciliation Notes

ZIP_CODE_DATASET removal: complete or incomplete?

The test validator flagged that zip_code.py, county.py, and zip_code_dataset.py still contained references to ZIP_CODE_DATASET, suggesting the removal was incomplete. The code validator confirmed that all references ARE removed in the PR diff -- zip_code_dataset.py is deleted, and both zip_code.py and county.py have their ZIP_CODE_DATASET imports and usage removed.

Resolution: The test validator was analyzing the pre-PR (base branch) state of the files, not the post-PR state. The code validator's analysis of the actual diff is authoritative. The removal is complete with zero dangling references. This finding is dismissed.

slcsp_age_0 guard: present or absent?

The test validator (Finding 3) noted that slcsp_age_0.py has "no guard against rating_area=0" and showed the old unguarded code. The code validator confirmed the PR does add a guard using the sentinel pattern where(known, rating_area, 1) + where(known, cost, 0).

Resolution: The test validator was looking at the pre-PR code. The PR adds the guard. The concern is valid only in terms of test coverage (the guard exists but is untested). This is captured in S1 above.


Files Changed

File Change
policyengine_us/data/zip_code_dataset.py Deleted
policyengine_us/data/__init__.py Cleared (import removed)
policyengine_us/variables/household/demographic/geographic/zip_code/zip_code.py Formula removed; input-only with default "UNKNOWN"
policyengine_us/variables/household/demographic/geographic/zip_code/three_digit_zip_code.py Formula removed; input-only with default ""
policyengine_us/variables/household/demographic/geographic/county/county.py Dead zip-code lookup branch removed
policyengine_us/variables/gov/aca/slspc/slcsp_rating_area_la_county.py Guard against missing zip data added
policyengine_us/variables/gov/aca/slspc/slcsp_age_0.py Guard against rating_area=0 added
policyengine_us/tests/policy/baseline/gov/aca/slcsp/slcsp_rating_area_la_county.yaml New test: LA County with missing zip code
changelog.d/fix-zip-code-dataset-crash.fixed.md Changelog fragment

Overall Assessment

Category Status
Dead code removal complete PASS
No dangling imports/references PASS
ACA guard logic correct PASS
No hard-coded values PASS
Sentinel pattern valid PASS
Changelog fragment correct PASS
CI green PASS (all 9 checks)
Test coverage -- happy paths PASS
Test coverage -- new guard logic PARTIAL (S1: slcsp_age_0 rating_area=0 untested)
Test coverage -- edge cases PARTIAL (S2: unmapped zip prefix unverified)
Vectorization patterns PASS
Entity-level correctness PASS
No crash risks introduced PASS

Recommended Severity: COMMENT

  • 0 critical issues -- nothing blocks merge
  • 2 should-address items -- both are missing test coverage for new guard logic (not blocking, but valuable)
  • 4 suggestions -- cosmetic/pre-existing cleanup opportunities

The PR is a clean, well-scoped infrastructure fix. The dead code removal is thorough, the guard logic is correct, and CI passes. The missing test for slcsp_age_0 with rating_area=0 is the most valuable follow-up action but does not block approval since the guard code itself is correct and the scenario requires unusual direct API input to trigger.


Generated with Claude Code

Copy link
Copy Markdown
Collaborator

@PavelMakarchuk PavelMakarchuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after test coverage is addressed and passes

Copy link
Copy Markdown
Collaborator

@PavelMakarchuk PavelMakarchuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good — clean dead code removal, correct guard logic, CI green.

Suggestions:

  • Add a test for slcsp_age_0 with rating_area=0 to cover the new sentinel guard
  • Add a test for LA County with an unmapped zip prefix (e.g., "999") to verify behavior
  • Minor: "Not an Los Angeles County""Not a Los Angeles County" in test names
  • Pre-existing: unused Simulation import in slcsp_age_0.py, redundant is_in_la fetch inside defined_for = "in_la"

@baogorek
Copy link
Copy Markdown
Collaborator Author

baogorek commented Mar 4, 2026

Thanks for the thorough review, Pavel! All suggestions addressed:

  • S1: Added test for slcsp_age_0 with rating_area=0 (sentinel guard coverage)
  • S2: Added test for LA County with unmapped zip prefix "999" — this exposed a crash, so I added a guard in slcsp_rating_area_la_county.py that validates zip codes against known parameter keys before lookup
  • G1: Fixed grammar: "Not an Los Angeles County" → "Not a Los Angeles County"
  • G2: Removed unused Simulation import from slcsp_age_0.py
  • G3: Removed redundant is_in_la fetch (already guaranteed by defined_for)
  • G4: Deleted empty policyengine_us/data/ directory

All tests passing (locally... we'll see)

baogorek and others added 5 commits March 6, 2026 14:57
ZIP_CODE_DATASET loaded from zip_codes.csv.gz, a file that never existed
in this repo. Every installation had ZIP_CODE_DATASET = None, making the
zip_code formula dead code that crashed when reached via the LA County
SLCSP rating area path on CA state-level H5 files.

Removes zip_code_dataset.py, guts the zip_code and three_digit_zip_code
formulas (keeping them as input-only variables), and removes the dead
zip-code lookup branch from county.py.

Fixes #7694, fixes #7664

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
slcsp_rating_area_la_county now returns rating area 0 when
three_digit_zip_code is empty. slcsp_age_0 treats rating area 0 as
unknown and returns cost 0, so households without zip data get no
aca_ptc rather than crashing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add test for slcsp_age_0 with rating_area=0 (sentinel guard coverage)
- Add test for LA County with unmapped zip prefix and guard against crash
- Fix grammar in test name ("Not an" -> "Not a")
- Remove unused Simulation import and redundant is_in_la fetch
- Delete empty policyengine_us/data/ package

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@baogorek baogorek force-pushed the fix-zip-code-dataset-crash branch from 5ca53e0 to 63a7861 Compare March 6, 2026 19:58
@baogorek baogorek merged commit f1dd74a into main Mar 6, 2026
9 checks passed
@baogorek baogorek deleted the fix-zip-code-dataset-crash branch March 6, 2026 20:23
MaxGhenis added a commit to MaxGhenis/policyengine-us that referenced this pull request Mar 13, 2026
The formula derives the 3-digit ZIP prefix from zip_code (e.g. 90019 → 900)
and is independent of ZIP_CODE_DATASET. Its removal silently broke ACA PTC
calculations for all LA County households that provide zip_code but not
three_digit_zip_code, causing aca_ptc to return 0 instead of ~$5,055.

Adds regression tests for the derivation and an integration test reproducing
the Amplifi-reported ACA PTC bug.

Fixes the regression introduced in 9d05912.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
MaxGhenis added a commit that referenced this pull request Mar 13, 2026
* Restore three_digit_zip_code formula removed in #7695

The formula derives the 3-digit ZIP prefix from zip_code (e.g. 90019 → 900)
and is independent of ZIP_CODE_DATASET. Its removal silently broke ACA PTC
calculations for all LA County households that provide zip_code but not
three_digit_zip_code, causing aca_ptc to return 0 instead of ~$5,055.

Adds regression tests for the derivation and an integration test reproducing
the Amplifi-reported ACA PTC bug.

Fixes the regression introduced in 9d05912.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Simplify formula and fix import path

- Use integer division (// 100) instead of float (// 1e2) in three_digit_zip_code
- Import Simulation from policyengine_us public API, not policyengine_core internals
- Add code deletion safety rule to CLAUDE.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove dead ZIP_CODE_DATASET infrastructure CA state H5: AttributeError in zip_code.py when ZIP_CODE_DATASET is None

2 participants