Fix parquet loading crash from datasets version mismatch #1140
yeyu-nvidia merged 4 commits into main
Conversation
📝 Walkthrough

Wraps parquet loading in SPEEDBench._load_dataset with a try/except for TypeError/ValueError; on error it reads parquet files with PyArrow, concatenates tables, strips the Arrow schema metadata keyed b"huggingface", and constructs a Hugging Face Dataset from the resulting table.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant SPEEDBench
    participant DatasetsLib as "datasets.load_dataset"
    participant PyArrow as "pyarrow.parquet"
    participant HF_Dataset as "datasets.Dataset"
    Caller->>SPEEDBench: request dataset load
    SPEEDBench->>DatasetsLib: load_dataset("parquet", data_files, split="test")
    alt success
        DatasetsLib-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    else TypeError or ValueError
        DatasetsLib--xSPEEDBench: raises TypeError/ValueError
        SPEEDBench->>PyArrow: read_table(file1), read_table(fileN)
        PyArrow-->>SPEEDBench: Table(s)
        SPEEDBench->>PyArrow: concat_tables(tables) (if multiple)
        PyArrow-->>SPEEDBench: concatenated Table
        SPEEDBench->>SPEEDBench: remove b"huggingface" schema metadata if present
        SPEEDBench->>HF_Dataset: Dataset(concatenated Table)
        HF_Dataset-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    end
```
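The flow in the diagram can be sketched as follows. This is an illustrative reconstruction, not the actual `speed.py` code: the function names (`load_parquet_with_fallback`, `strip_hf_metadata`) and call signatures are assumptions.

```python
# Illustrative sketch of the fallback flow above; names and exact
# signatures in speed.py may differ.

HF_METADATA_KEY = b"huggingface"

def strip_hf_metadata(metadata):
    """Drop the b'huggingface' schema-metadata entry (which embeds
    version-specific feature types); all other keys are preserved."""
    if not metadata:
        return metadata
    return {k: v for k, v in metadata.items() if k != HF_METADATA_KEY}

def load_parquet_with_fallback(files):
    from datasets import Dataset, load_dataset
    try:
        # Primary path: let `datasets` interpret the parquet files.
        return load_dataset("parquet", data_files={"test": files}, split="test")
    except (TypeError, ValueError):
        # `datasets` raises TypeError during feature deserialization and
        # ValueError for unknown feature types; bypass it with PyArrow.
        import pyarrow as pa
        import pyarrow.parquet as pq
        tables = [pq.read_table(f) for f in files]
        table = pa.concat_tables(tables) if len(tables) > 1 else tables[0]
        table = table.replace_schema_metadata(strip_hf_metadata(table.schema.metadata))
        return Dataset(table)
```

Note that stripping the metadata trades schema fidelity for loadability: the resulting Dataset infers its features from the Arrow types alone.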
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 721-730: The except clause currently catching TypeError around the
parquet fallback (the block that imports pyarrow, pq.read_table,
pyarrow.concat_tables and constructs HFDataset(table) from data_files["test"])
should be changed so the fallback actually runs for Hugging Face metadata
incompatibility errors: replace `except TypeError:` with `except ValueError:`
(or `except (TypeError, ValueError):` if you want to handle both) so the
PyArrow-to-HFDataset fallback triggers when datasets raises the ValueError about
unknown feature types.
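The effect of the suggested change can be demonstrated with a tiny self-contained snippet; the `fallback_would_run` helper below is illustrative and not part of the PR:

```python
def fallback_would_run(exc, catch):
    # Returns True if an `except catch:` clause would handle `exc`.
    try:
        raise exc
    except catch:
        return True
    except Exception:
        return False

err = ValueError("Feature type 'List' not found")

# A bare `except TypeError:` lets the ValueError propagate, so the
# PyArrow fallback never runs:
print(fallback_would_run(err, TypeError))                # False
print(fallback_would_run(err, (TypeError, ValueError)))  # True
```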
📒 Files selected for processing (1)
examples/specdec_bench/specdec_bench/datasets/speed.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #1140      +/-   ##
==========================================
+ Coverage   74.77%   76.84%   +2.06%
==========================================
  Files         352      352
  Lines       40341    40341
==========================================
+ Hits        30166    31001     +835
+ Misses      10175     9340     -835
```

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
ChenhanYu
left a comment
Code Review: Fix parquet loading crash from datasets version mismatch
1. Summary
This PR adds a fallback mechanism to handle parquet file loading failures caused by version mismatches in Hugging Face datasets metadata. When load_dataset("parquet") fails due to incompatible embedded feature schemas, the code falls back to reading the file directly via PyArrow and strips the problematic HF metadata before converting to a Dataset. The approach is sound, but there is a critical bug in exception handling that prevents the fallback from ever executing for the documented use case.
2. Key Areas Requiring Attention
| Issue | Location | Severity | Details |
|---|---|---|---|
| Wrong exception type | `speed.py:721` | 🔴 Critical | Code catches `TypeError`, but Hugging Face `datasets` raises `ValueError` when it encounters unknown feature types in embedded metadata (e.g., "Feature type 'List' not found..."). This means the fallback never executes for the documented scenario. Should be `except (TypeError, ValueError):` to handle both. |
| Incomplete version constraint relaxation | `requirements_speed.txt:1` | 🟡 Important | Changed from `datasets>=4.4.0,<5.0.0` to `datasets>=3.1.0` (no upper bound). While this increases compatibility, it removes the upper bound entirely, which could introduce instability with future major versions. Consider `datasets>=3.1.0,<6.0.0` as a safer middle ground. |
| Metadata stripping logic | `speed.py:732-737` | 🟡 Important | The code strips the `huggingface` metadata key to avoid parsing errors. This is a reasonable workaround, but it silently drops column metadata (features, descriptions, etc.). Document this trade-off: users get a working dataset but lose schema information from the original metadata. |
| No test coverage | PR description | 🟡 Important | The test plan is marked incomplete (`[ ] Run specdec_bench...`). The critical bug above would be caught by a simple test that loads a parquet file written by a newer `datasets` version using an older installed version. |
3. Architecture & Design Changes
No architecture/design changes detected. This is a defensive fallback pattern for an existing code path, not a structural refactor.
4. Test Coverage
Tests are missing. The test plan checklist is incomplete. Before merge, add at least:
- Unit test: create a parquet file with forward-incompatible HF metadata (e.g., simulating a newer `datasets` version feature) and verify the fallback path loads it successfully.
- Regression test: ensure normal (compatible) parquet files still load via the primary `load_dataset` path without triggering the fallback.
- Integration test: run the documented `specdec_bench` scenario to verify end-to-end functionality.
Without these, the bug in exception handling (#2 above) would never be caught.
5. Verdict
Must fix before merge:
- Change `except TypeError:` to `except (TypeError, ValueError):` on line 721 to actually catch the documented metadata incompatibility error.
- Add test cases to verify the fallback path is triggered and works correctly.
Should fix before merge:
3. Constrain the `datasets` version upper bound in `requirements_speed.txt` (e.g., `<6.0.0`) for safety.
4. Add a comment explaining that metadata stripping causes loss of column-level schema info.
Optional:
5. Mark the test plan items as complete before merging.
Note: This review is AI-assisted. A human reviewer must verify the exception analysis, test the actual error message from datasets, and approve all changes before merging.
ChenhanYu
left a comment
Critical fix addressed — except (TypeError, ValueError) now catches both exception types. Fallback logic and metadata stripping look correct. LGTM.
Minor items still open (non-blocking):
- `datasets>=3.1.0` has no upper bound; consider adding `<6.0.0` for safety
- No test coverage for the fallback path
When local parquet files contain HF datasets metadata written by a
different version of the `datasets` library, `load_dataset("parquet")`
can raise a TypeError during feature deserialization. Fall back to
reading via PyArrow directly in that case.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
The PyArrow fallback still failed because HFDataset(table) parses the huggingface metadata embedded in the arrow schema, hitting the same TypeError. Strip that metadata before constructing the Dataset.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>

The tensorrt_llm 1.3.0rc5 container pins datasets==3.1.0. The previous pin (>=4.4.0) caused concurrent pip installs across ranks to race and corrupt the datasets package, breaking tensorrt_llm imports entirely.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>

HF datasets raises ValueError (not just TypeError) when it encounters unknown feature types in embedded parquet metadata. Catch both so the PyArrow fallback triggers correctly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
Force-pushed 4731a12 to 95a9395
Summary

- When local parquet files contain HF `datasets` metadata written by a different library version, `load_dataset("parquet")` raises a `TypeError` during feature deserialization
- The fix catches the `TypeError` and reads the parquet files directly via PyArrow, bypassing the incompatible metadata

Test plan

- [ ] Run `specdec_bench` with EAGLE config against local parquet dataset files
- [ ] Verify compatible parquet files still load via the normal `load_dataset` path

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Chores