
Fix parquet loading crash from datasets version mismatch#1140

Merged
yeyu-nvidia merged 4 commits into main from yeyu/fix-parquet-datasets-compat
Apr 7, 2026

Conversation

@yeyu-nvidia yeyu-nvidia (Contributor) commented Mar 30, 2026

Summary

  • When local parquet files contain HF datasets metadata written by a different library version, load_dataset("parquet") raises a TypeError during feature deserialization
  • Added a fallback that catches the TypeError and reads parquet files directly via PyArrow, bypassing the incompatible metadata
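
The fallback described above follows a simple try/except pattern. Here is a minimal, self-contained control-flow sketch; the loader functions are stand-ins, not the real specdec_bench code (the real primary path is datasets.load_dataset("parquet", ...) and the real fallback reads the files with PyArrow):

```python
def load_with_fallback(primary, fallback):
    """Try the primary loader; on metadata-related errors, use the fallback."""
    try:
        return primary()
    except (TypeError, ValueError):  # datasets raises either, depending on version
        return fallback()

def broken_primary():
    # Simulates HF datasets failing to deserialize features written by a
    # newer library version (it raises ValueError for unknown feature types).
    raise ValueError("Feature type 'List' not found.")

def pyarrow_fallback():
    # Stands in for: read parquet via pyarrow, strip the b"huggingface"
    # schema metadata, and build a datasets.Dataset from the Arrow table.
    return ["row0", "row1"]

rows = load_with_fallback(broken_primary, pyarrow_fallback)
```

Note that a healthy primary loader is never bypassed: the fallback only runs when one of the two exception types is raised.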

Test plan

  • Run specdec_bench with EAGLE config against local parquet dataset files
  • Verify normal (compatible) parquet loading still works via the primary load_dataset path

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved robustness of parquet dataset loading with a safer fallback and metadata cleanup to ensure reliable reads across environments.
  • Chores

    • Broadened the required version constraint for the datasets package to increase compatibility and simplify installation.

@yeyu-nvidia yeyu-nvidia requested a review from a team as a code owner March 30, 2026 16:24
@yeyu-nvidia yeyu-nvidia requested a review from h-guo18 March 30, 2026 16:24
coderabbitai bot (Contributor) commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8d1aa5a1-c7d3-4a0f-9fc1-b9cf22e01a35

📥 Commits

Reviewing files that changed from the base of the PR and between 4731a12 and 95a9395.

📒 Files selected for processing (2)
  • examples/specdec_bench/requirements_speed.txt
  • examples/specdec_bench/specdec_bench/datasets/speed.py
✅ Files skipped from review due to trivial changes (1)
  • examples/specdec_bench/requirements_speed.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/specdec_bench/specdec_bench/datasets/speed.py

📝 Walkthrough

Walkthrough

Wraps parquet loading in SPEEDBench._load_dataset with a try/except for TypeError/ValueError; on error it reads parquet files with pyarrow, concatenates tables, strips Arrow schema metadata keyed b"huggingface", and builds a datasets.Dataset from the Arrow table.

Changes

  • Parquet load fallback (examples/specdec_bench/specdec_bench/datasets/speed.py): Wraps datasets.load_dataset(..., split="test") in try/except (TypeError, ValueError). On those exceptions, reads each parquet file with pyarrow.parquet.read_table, concatenates via pyarrow.concat_tables when needed, removes the b"huggingface" schema metadata if present, and constructs a datasets.Dataset from the Arrow table as a fallback.
  • Dependency constraint update (examples/specdec_bench/requirements_speed.txt): Relaxed the datasets requirement from >=4.4.0,<5.0.0 to >=3.1.0 (removed the upper bound).

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant SPEEDBench
    participant DatasetsLib as "datasets.load_dataset"
    participant PyArrow as "pyarrow.parquet"
    participant HF_Dataset as "datasets.Dataset"

    Caller->>SPEEDBench: request dataset load
    SPEEDBench->>DatasetsLib: load_dataset("parquet", data_files, split="test")
    alt success
        DatasetsLib-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    else TypeError or ValueError
        DatasetsLib--xSPEEDBench: raises TypeError/ValueError
        SPEEDBench->>PyArrow: read_table(file1), read_table(fileN)
        PyArrow-->>SPEEDBench: Table(s)
        SPEEDBench->>PyArrow: concat_tables(tables) (if multiple)
        PyArrow-->>SPEEDBench: concatenated Table
        SPEEDBench->>SPEEDBench: remove b"huggingface" schema metadata if present
        SPEEDBench->>HF_Dataset: Dataset(concatenated Table)
        HF_Dataset-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The pull request title clearly and specifically identifies the main problem being fixed: a parquet loading crash caused by a datasets version incompatibility.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Security Anti-Patterns: ✅ Passed. No security anti-patterns found: no torch.load with weights_only=False, no numpy.load with allow_pickle=True, no hardcoded trust_remote_code=True, no eval/exec on external input, and no nosec comments bypassing Bandit.



github-actions bot (Contributor) commented Mar 30, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-07 20:56 UTC

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1


Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 721-730: The except clause currently catching TypeError around the
parquet fallback (the block that imports pyarrow, pq.read_table,
pyarrow.concat_tables and constructs HFDataset(table) from data_files["test"])
should be changed so the fallback actually runs for Hugging Face metadata
incompatibility errors: replace `except TypeError:` with `except ValueError:`
(or `except (TypeError, ValueError):` if you want to handle both) so the
PyArrow-to-HFDataset fallback triggers when datasets raises the ValueError about
unknown feature types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: faeefce7-a99c-401f-af42-d6d7bd1addf3

📥 Commits

Reviewing files that changed from the base of the PR and between a3f5c46 and 45a33f8.

📒 Files selected for processing (1)
  • examples/specdec_bench/specdec_bench/datasets/speed.py

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.84%. Comparing base (4255bc6) to head (95a9395).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1140      +/-   ##
==========================================
+ Coverage   74.77%   76.84%   +2.06%     
==========================================
  Files         352      352              
  Lines       40341    40341              
==========================================
+ Hits        30166    31001     +835     
+ Misses      10175     9340     -835     
Flag Coverage Δ
examples 45.26% <ø> (+5.03%) ⬆️
unit 54.89% <ø> (ø)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
@ChenhanYu ChenhanYu (Collaborator) left a comment


Code Review: Fix parquet loading crash from datasets version mismatch

1. Summary

This PR adds a fallback mechanism to handle parquet file loading failures caused by version mismatches in Hugging Face datasets metadata. When load_dataset("parquet") fails due to incompatible embedded feature schemas, the code falls back to reading the file directly via PyArrow and strips the problematic HF metadata before converting to a Dataset. The approach is sound, but there is a critical bug in exception handling that prevents the fallback from ever executing for the documented use case.


2. Key Areas Requiring Attention

  • Wrong exception type (speed.py:721, 🔴 Critical): The code catches TypeError, but Hugging Face datasets raises ValueError when it encounters unknown feature types in embedded metadata (e.g., "Feature type 'List' not found..."). This means the fallback never executes for the documented scenario. It should be except (TypeError, ValueError): to handle both.
  • Incomplete version constraint relaxation (requirements_speed.txt:1, 🟡 Important): Changed from datasets>=4.4.0,<5.0.0 to datasets>=3.1.0 with no upper bound. This increases compatibility but could introduce instability with future major versions. Consider datasets>=3.1.0,<6.0.0 as a safer middle ground.
  • Metadata stripping logic (speed.py:732-737, 🟡 Important): Stripping the huggingface metadata key avoids the parsing errors and is a reasonable workaround, but it silently drops column metadata (features, descriptions, etc.). Document this trade-off: users get a working dataset but lose schema information from the original metadata.
  • No test coverage (PR description, 🟡 Important): The test plan is marked incomplete ([ ] Run specdec_bench...). The critical bug above would be caught by a simple test that loads a parquet file written by a newer datasets version using an older installed version.
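
To illustrate the metadata-stripping trade-off flagged above: Arrow schema metadata is a bytes-keyed mapping, and dropping the b"huggingface" entry discards the serialized features it carried along with the incompatible parts. The JSON payload below is a simplified stand-in for what datasets actually embeds:

```python
import json

# Simplified stand-in for schema metadata on a parquet file written by HF
# datasets (real payloads are larger; keys and values are bytes).
metadata = {
    b"huggingface": json.dumps(
        {"info": {"features": {"text": {"dtype": "string", "_type": "Value"}}}}
    ).encode(),
    b"pandas": b"{}",  # unrelated metadata, preserved by the strip
}

# The fallback drops only the huggingface entry...
stripped = {k: v for k, v in metadata.items() if k != b"huggingface"}

# ...which also discards the column features/descriptions it carried.
lost_features = json.loads(metadata[b"huggingface"])["info"]["features"]
```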

3. Architecture & Design Changes

No architecture/design changes detected. This is a defensive fallback pattern for an existing code path, not a structural refactor.


4. Test Coverage

Tests are missing. The test plan checklist is incomplete. Before merge, add at least:

  1. Unit test: Create a parquet file with forward-incompatible HF metadata (e.g., simulating a newer datasets version feature) and verify the fallback path loads it successfully.
  2. Regression test: Ensure normal (compatible) parquet files still load via the primary load_dataset path without triggering the fallback.
  3. Integration test: Run the documented specdec_bench scenario to verify end-to-end functionality.

Without these, the exception-handling bug flagged in section 2 would never be caught.


5. Verdict

⚠️ Request changes

Must fix before merge:

  1. Change except TypeError: to except (TypeError, ValueError): on line 721 to actually catch the documented metadata incompatibility error.
  2. Add test cases to verify the fallback path is triggered and works correctly.

Should fix before merge:
3. Constrain the datasets version upper bound in requirements_speed.txt (e.g., <6.0.0) for safety.
4. Add a comment explaining that metadata stripping causes loss of column-level schema info.

Optional:
5. Mark the test plan items as complete before merging.
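
The safer constraint suggested in item 3 would look like this in requirements_speed.txt (the <6.0.0 bound is the reviewer's suggestion, not what was merged):

```
datasets>=3.1.0,<6.0.0
```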


Note: This review is AI-assisted. A human reviewer must verify the exception analysis, test the actual error message from datasets, and approve all changes before merging.

@ChenhanYu ChenhanYu (Collaborator) left a comment


Critical fix addressed — except (TypeError, ValueError) now catches both exception types. Fallback logic and metadata stripping look correct. LGTM.

Minor items still open (non-blocking):

  • datasets>=3.1.0 has no upper bound — consider adding <6.0.0 for safety
  • No test coverage for the fallback path

yeyu-nvidia and others added 4 commits April 7, 2026 13:00
When local parquet files contain HF datasets metadata written by a
different version of the `datasets` library, `load_dataset("parquet")`
can raise a TypeError during feature deserialization. Fall back to
reading via PyArrow directly in that case.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
The PyArrow fallback still failed because HFDataset(table) parses
the huggingface metadata embedded in the arrow schema, hitting the
same TypeError. Strip that metadata before constructing the Dataset.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
The tensorrt_llm 1.3.0rc5 container pins datasets==3.1.0. The previous
pin (>=4.4.0) caused concurrent pip installs across ranks to race and
corrupt the datasets package, breaking tensorrt_llm imports entirely.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
HF datasets raises ValueError (not just TypeError) when it encounters
unknown feature types in embedded parquet metadata. Catch both so the
PyArrow fallback triggers correctly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
@yeyu-nvidia yeyu-nvidia force-pushed the yeyu/fix-parquet-datasets-compat branch from 4731a12 to 95a9395 on April 7, 2026 20:00
@yeyu-nvidia yeyu-nvidia enabled auto-merge (squash) April 7, 2026 20:00
@yeyu-nvidia yeyu-nvidia merged commit af2fe24 into main Apr 7, 2026
43 checks passed
@yeyu-nvidia yeyu-nvidia deleted the yeyu/fix-parquet-datasets-compat branch April 7, 2026 20:55
