fix: keep sentencepiece tokenizers out of xgboost routing#1641
Conversation
|
@codex review |
…t19-sentencepiece-routing-20260610
|
Codex Review: Didn't find any major issues. Breezy! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Performance BenchmarksCompared
|
|
@codex review |
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Independent review of PR #1641Reviewed exact live head: FindingsP2: valid-looking prefix suppresses malformed SentencePiece tails
Exact control: payload = _sentencepiece_model_proto() + b"\x12\x80"
Path("tokenizer.model").write_bytes(payload)The PR head classifies this as SentencePiece, skips XGBoost, and exits 0 with no issues. SentencePiece 0.2.1 rejects the same payload as an invalid P2: valid SentencePiece models can omit BOS, EOS, and PADThe classifier requires three typed identity pieces plus literal A model produced by SentencePiece 0.2.1 with those settings loads successfully, but the PR classifier returns false, XGBoost claims it, and ModelAudit exits 2 with S1004. Add dependency-produced or official fixtures with BOS/EOS/PAD disabled and a customized unknown-piece string. Local and streaming scans should avoid XGBoost ownership without weakening malformed-input handling. ValidationThe PR's focused tests passed ( |
|
@codex address that feedback on the exact current head. Preserve positive routing for valid SentencePiece models, retain fail-closed behavior for malformed tails, run dependency-produced local and streaming end-to-end controls, and request a fresh exact-head review after pushing. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81b6d146a0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Additional pinned SentencePiece QA: |
|
@codex review exact head |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 65b64807d2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Pinned QA input: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8da5a6ce54
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
Codex Review: Didn't find any major issues. What shall we delve into next? ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex add exact real-model QA for |
|
@codex address these exact-head blockers before merge:
Independent report: |
|
Additional pinned baseline reproduction for SentencePiece routing:
Please add this exact pinned model to current-head QA and verify |
|
Summary
Testing
|
|
Summary
Testing
|
|
@codex review exact head |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 00de898a21
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Independent review of exact head
Validation: XGBoost had 186 passes/2 skips; focused core 61 passes; registry 22; Hugging Face source 211; compressed scanner 73 passes excluding one environment cache failure that reproduces on base; fixture hygiene 7. Three authentic models loaded with SentencePiece 0.2.1 and passed streaming CLI QA. Exact-head CI is fully green, but two current review threads remain unresolved and both P1 regressions block merge. |
|
@codex review exact head Follow-up pushed for the two exact-head Codex P2s: bounded SentencePiece ownership reads without XGBoost false positives, and restored raw-binary routing for bare compressed wrappers. Local validation passed: Ruff format/check, mypy, focused XGBoost/compressed/filetype/core/archive slices, and broad four-worker pytest ( |
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex review exact head Follow-up pushed for the remaining bounded SentencePiece ownership thread. Local validation passed on this exact head: Ruff format/check, mypy, focused SentencePiece/XGBoost/compressed regressions, and broad four-worker pytest ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a1b7f7871e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
Independent review of exact head
Validation: 11 malformed/duplicate controls, 70 SentencePiece/compressed tests, 187 XGBoost tests with 2 skips, 211 HF selection tests, and 79 compressed tests passed. Real SentencePiece loading and protobuf merge behavior were verified. Exact-head CI is fully green, but one current thread matches the P1 unknown-field bypass. The two P1 clean-scan regressions block merge. |
|
@codex review exact head Follow-up pushed for the P1 unknown top-level SentencePiece field issue. Local validation passed on this exact head: Ruff format/check, mypy, focused SentencePiece/XGBoost/archive/core slices, and broad four-worker pytest ( |
|
@codex review exact head Follow-up pushed for the remaining exact-head review blockers: tokenizer-like gzip raw binaries now route inner payloads to PyTorch-binary unless the decompressed payload is a SentencePiece candidate, repeated TrainerSpec messages merge under proto2 semantics, and SentencePiece ownership probes are cached by stable file identity plus bounded content fingerprint. Local validation passed on this exact head: Ruff check/format-check, mypy, focused SentencePiece/XGBoost/compressed/archive/core slices, and broad four-worker pytest ( |
|
@codex review exact head Corrected exact-head request for the just-pushed follow-up. Local validation on this head: Ruff check/format-check, mypy, focused SentencePiece/XGBoost/compressed/archive/core slices, and broad four-worker pytest ( |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6660ef8b7e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review exact head Follow-up pushed for the streamed SentencePiece wire-type P2. Fields 2-5 now require length-delimited wire type in the streaming classifier, with small malformed-tail and >10MiB streaming regressions. Local validation on this head: Ruff check/format-check, mypy, full XGBoost/compressed focused files, ZIP/TAR/core routing selections, and broad four-worker pytest ( |
|
Codex Review: Didn't find any major issues. More of your lovely PRs please. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
…dangelo/codex/hf-fp-t19-sentencepiece-routing-20260610 # Conflicts: # modelaudit/utils/file/detection.py
|
@codex review exact head Maintainer follow-up pushed after merging current |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1eae3a83e6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review exact head Follow-up pushed for the exact-head |
|
Codex Review: Didn't find any major issues. Chef's kiss. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
Summary
Root Cause
The .model suffix is intentionally absent from extension-only file type validation because it is shared by multiple ecosystems. After content detection returned unknown for Mistral tokenizer.model, ScannerRegistry fell back to extension routing. XGBoostScanner.can_handle() accepted every .model file, so the SentencePiece tokenizer was scanned as an XGBoost binary and emitted inconclusive S1004 / xgboost_binary_structure_unrecognized.
Security Tradeoff
This only suppresses XGBoost routing when the file prefix strongly matches SentencePiece ModelProto pieces: multiple scored piece messages, typed pieces, identity tokens, and . Malformed, truncated, generic protobuf-like, or otherwise ambiguous .model files still reach XGBoost and fail closed. Real XGBoost .model binaries remain scanned.
Validation
Final validation on merge commit 81b6d14:
Note: one earlier pre-merge broad-lane run hit packages/modelaudit-picklescan/tests/test_api.py::test_scan_bytes_warns_when_allowlisted_module_is_unresolved once; the exact test passed on immediate serial rerun, and subsequent full broad-lane reruns passed.
Real-Model QA
Pinned model: mistralai/Mistral-7B-v0.1 @ 27d67f1b5f57dc0953326b2601d68371d40ea8da, file tokenizer.model.
Pre-fix confirmation on main SHA 2f782ba:
Post-fix final validation:
Unrestricted whole-model scans of this Mistral revision can still become inconclusive for separate reasons: the repository contains multi-gigabyte PyTorch shards, so scanner size limits/large-shard handling remain outside this tokenizer.model routing fix.