
Fix parquet loading crash from datasets version mismatch#1140

Merged
yeyu-nvidia merged 4 commits into main from yeyu/fix-parquet-datasets-compat
Apr 7, 2026

Conversation

@yeyu-nvidia yeyu-nvidia (Contributor) commented Mar 30, 2026

Summary

  • When local parquet files contain HF datasets metadata written by a different library version, load_dataset("parquet") raises a TypeError during feature deserialization
  • Added a fallback that catches the TypeError and reads parquet files directly via PyArrow, bypassing the incompatible metadata
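
The fallback described above follows a simple try/except pattern. Here is a minimal, self-contained control-flow sketch; the loader functions are stand-ins, not the real specdec_bench code (the real primary path is datasets.load_dataset("parquet", ...) and the real fallback reads the files with PyArrow):

```python
def load_with_fallback(primary, fallback):
    """Try the primary loader; on metadata-related errors, use the fallback."""
    try:
        return primary()
    except (TypeError, ValueError):  # datasets raises either, depending on version
        return fallback()

def broken_primary():
    # Simulates HF datasets failing to deserialize features written by a
    # newer library version (it raises ValueError for unknown feature types).
    raise ValueError("Feature type 'List' not found.")

def pyarrow_fallback():
    # Stands in for: read parquet via pyarrow, strip the b"huggingface"
    # schema metadata, and build a datasets.Dataset from the Arrow table.
    return ["row0", "row1"]

rows = load_with_fallback(broken_primary, pyarrow_fallback)
```

Note that a healthy primary loader is never bypassed: the fallback only runs when one of the two exception types is raised.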

Test plan

  • Run specdec_bench with EAGLE config against local parquet dataset files
  • Verify normal (compatible) parquet loading still works via the primary load_dataset path

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved robustness of parquet dataset loading with a safer fallback and metadata cleanup to ensure reliable reads across environments.
  • Chores

    • Broadened the required version constraint for the datasets package to increase compatibility and simplify installation.

@yeyu-nvidia yeyu-nvidia requested a review from a team as a code owner March 30, 2026 16:24
@yeyu-nvidia yeyu-nvidia requested a review from h-guo18 March 30, 2026 16:24
coderabbitai bot (Contributor) commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8d1aa5a1-c7d3-4a0f-9fc1-b9cf22e01a35

📥 Commits

Reviewing files that changed from the base of the PR and between 4731a12 and 95a9395.

📒 Files selected for processing (2)
  • examples/specdec_bench/requirements_speed.txt
  • examples/specdec_bench/specdec_bench/datasets/speed.py
✅ Files skipped from review due to trivial changes (1)
  • examples/specdec_bench/requirements_speed.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/specdec_bench/specdec_bench/datasets/speed.py

📝 Walkthrough

Walkthrough

Wraps parquet loading in SPEEDBench._load_dataset with a try/except for TypeError/ValueError; on error it reads parquet files with pyarrow, concatenates tables, strips Arrow schema metadata keyed b"huggingface", and builds a datasets.Dataset from the Arrow table.

Changes

  • Parquet load fallback (examples/specdec_bench/specdec_bench/datasets/speed.py): Wraps datasets.load_dataset(..., split="test") in try/except (TypeError, ValueError). On those exceptions, reads each parquet file with pyarrow.parquet.read_table, concatenates via pyarrow.concat_tables when needed, removes the b"huggingface" schema metadata if present, and constructs a datasets.Dataset from the Arrow table as a fallback.
  • Dependency constraint update (examples/specdec_bench/requirements_speed.txt): Relaxed the datasets requirement from >=4.4.0,<5.0.0 to >=3.1.0 (removed the upper bound).

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant SPEEDBench
    participant DatasetsLib as "datasets.load_dataset"
    participant PyArrow as "pyarrow.parquet"
    participant HF_Dataset as "datasets.Dataset"

    Caller->>SPEEDBench: request dataset load
    SPEEDBench->>DatasetsLib: load_dataset("parquet", data_files, split="test")
    alt success
        DatasetsLib-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    else TypeError or ValueError
        DatasetsLib--xSPEEDBench: raises TypeError/ValueError
        SPEEDBench->>PyArrow: read_table(file1), read_table(fileN)
        PyArrow-->>SPEEDBench: Table(s)
        SPEEDBench->>PyArrow: concat_tables(tables) (if multiple)
        PyArrow-->>SPEEDBench: concatenated Table
        SPEEDBench->>SPEEDBench: remove b"huggingface" schema metadata if present
        SPEEDBench->>HF_Dataset: Dataset(concatenated Table)
        HF_Dataset-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The pull request title clearly and specifically identifies the main problem being fixed: a parquet loading crash caused by a datasets version incompatibility.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Security Anti-Patterns: ✅ Passed. No security anti-patterns found: no torch.load with weights_only=False, no numpy.load with allow_pickle=True, no hardcoded trust_remote_code=True, no eval/exec on external input, and no nosec comments bypassing Bandit.



github-actions bot (Contributor) commented Mar 30, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-04-07 20:56 UTC

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1


Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 721-730: The except clause currently catching TypeError around the
parquet fallback (the block that imports pyarrow, pq.read_table,
pyarrow.concat_tables and constructs HFDataset(table) from data_files["test"])
should be changed so the fallback actually runs for Hugging Face metadata
incompatibility errors: replace `except TypeError:` with `except ValueError:`
(or `except (TypeError, ValueError):` if you want to handle both) so the
PyArrow-to-HFDataset fallback triggers when datasets raises the ValueError about
unknown feature types.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: faeefce7-a99c-401f-af42-d6d7bd1addf3

📥 Commits

Reviewing files that changed from the base of the PR and between a3f5c46 and 45a33f8.

📒 Files selected for processing (1)
  • examples/specdec_bench/specdec_bench/datasets/speed.py

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.84%. Comparing base (4255bc6) to head (95a9395).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1140      +/-   ##
==========================================
+ Coverage   74.77%   76.84%   +2.06%     
==========================================
  Files         352      352              
  Lines       40341    40341              
==========================================
+ Hits        30166    31001     +835     
+ Misses      10175     9340     -835     
Flag Coverage Δ
examples 45.26% <ø> (+5.03%) ⬆️
unit 54.89% <ø> (ø)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
@ChenhanYu ChenhanYu (Collaborator) left a comment


Code Review: Fix parquet loading crash from datasets version mismatch

1. Summary

This PR adds a fallback mechanism to handle parquet file loading failures caused by version mismatches in Hugging Face datasets metadata. When load_dataset("parquet") fails due to incompatible embedded feature schemas, the code falls back to reading the file directly via PyArrow and strips the problematic HF metadata before converting to a Dataset. The approach is sound, but there is a critical bug in exception handling that prevents the fallback from ever executing for the documented use case.


2. Key Areas Requiring Attention

  • Wrong exception type (speed.py:721, 🔴 Critical): The code catches TypeError, but Hugging Face datasets raises ValueError when it encounters unknown feature types in embedded metadata (e.g., "Feature type 'List' not found..."). This means the fallback never executes for the documented scenario. It should be except (TypeError, ValueError): to handle both.
  • Incomplete version constraint relaxation (requirements_speed.txt:1, 🟡 Important): Changed from datasets>=4.4.0,<5.0.0 to datasets>=3.1.0 with no upper bound. This increases compatibility but could introduce instability with future major versions. Consider datasets>=3.1.0,<6.0.0 as a safer middle ground.
  • Metadata stripping logic (speed.py:732-737, 🟡 Important): Stripping the huggingface metadata key avoids the parsing errors and is a reasonable workaround, but it silently drops column metadata (features, descriptions, etc.). Document this trade-off: users get a working dataset but lose schema information from the original metadata.
  • No test coverage (PR description, 🟡 Important): The test plan is marked incomplete ([ ] Run specdec_bench...). The critical bug above would be caught by a simple test that loads a parquet file written by a newer datasets version using an older installed version.
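
To illustrate the metadata-stripping trade-off flagged above: Arrow schema metadata is a bytes-keyed mapping, and dropping the b"huggingface" entry discards the serialized features it carried along with the incompatible parts. The JSON payload below is a simplified stand-in for what datasets actually embeds:

```python
import json

# Simplified stand-in for schema metadata on a parquet file written by HF
# datasets (real payloads are larger; keys and values are bytes).
metadata = {
    b"huggingface": json.dumps(
        {"info": {"features": {"text": {"dtype": "string", "_type": "Value"}}}}
    ).encode(),
    b"pandas": b"{}",  # unrelated metadata, preserved by the strip
}

# The fallback drops only the huggingface entry...
stripped = {k: v for k, v in metadata.items() if k != b"huggingface"}

# ...which also discards the column features/descriptions it carried.
lost_features = json.loads(metadata[b"huggingface"])["info"]["features"]
```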

3. Architecture & Design Changes

No architecture/design changes detected. This is a defensive fallback pattern for an existing code path, not a structural refactor.


4. Test Coverage

Tests are missing. The test plan checklist is incomplete. Before merge, add at least:

  1. Unit test: Create a parquet file with forward-incompatible HF metadata (e.g., simulating a newer datasets version feature) and verify the fallback path loads it successfully.
  2. Regression test: Ensure normal (compatible) parquet files still load via the primary load_dataset path without triggering the fallback.
  3. Integration test: Run the documented specdec_bench scenario to verify end-to-end functionality.

Without these, the exception-handling bug flagged in section 2 would never be caught.


5. Verdict

⚠️ Request changes

Must fix before merge:

  1. Change except TypeError: to except (TypeError, ValueError): on line 721 to actually catch the documented metadata incompatibility error.
  2. Add test cases to verify the fallback path is triggered and works correctly.

Should fix before merge:
3. Constrain the datasets version upper bound in requirements_speed.txt (e.g., <6.0.0) for safety.
4. Add a comment explaining that metadata stripping causes loss of column-level schema info.

Optional:
5. Mark the test plan items as complete before merging.
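
The safer constraint suggested in item 3 would look like this in requirements_speed.txt (the <6.0.0 bound is the reviewer's suggestion, not what was merged):

```
datasets>=3.1.0,<6.0.0
```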


Note: This review is AI-assisted. A human reviewer must verify the exception analysis, test the actual error message from datasets, and approve all changes before merging.

@ChenhanYu ChenhanYu (Collaborator) left a comment


Critical fix addressed — except (TypeError, ValueError) now catches both exception types. Fallback logic and metadata stripping look correct. LGTM.

Minor items still open (non-blocking):

  • datasets>=3.1.0 has no upper bound — consider adding <6.0.0 for safety
  • No test coverage for the fallback path

yeyu-nvidia and others added 4 commits April 7, 2026 13:00
When local parquet files contain HF datasets metadata written by a
different version of the `datasets` library, `load_dataset("parquet")`
can raise a TypeError during feature deserialization. Fall back to
reading via PyArrow directly in that case.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
The PyArrow fallback still failed because HFDataset(table) parses
the huggingface metadata embedded in the arrow schema, hitting the
same TypeError. Strip that metadata before constructing the Dataset.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
The tensorrt_llm 1.3.0rc5 container pins datasets==3.1.0. The previous
pin (>=4.4.0) caused concurrent pip installs across ranks to race and
corrupt the datasets package, breaking tensorrt_llm imports entirely.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
HF datasets raises ValueError (not just TypeError) when it encounters
unknown feature types in embedded parquet metadata. Catch both so the
PyArrow fallback triggers correctly.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Signed-off-by: Ye Yu <[email protected]>
@yeyu-nvidia yeyu-nvidia force-pushed the yeyu/fix-parquet-datasets-compat branch from 4731a12 to 95a9395 on April 7, 2026 20:00
@yeyu-nvidia yeyu-nvidia enabled auto-merge (squash) April 7, 2026 20:00
@yeyu-nvidia yeyu-nvidia merged commit af2fe24 into main Apr 7, 2026
43 checks passed
@yeyu-nvidia yeyu-nvidia deleted the yeyu/fix-parquet-datasets-compat branch April 7, 2026 20:55
