fix(routing): bound tokenizer json scanner selection #1638

Merged
mldangelo-oai merged 38 commits into main from mdangelo/codex/hf-fp-t06-tokenizer-json-routing-20260610
Jun 11, 2026

Conversation

@mldangelo-oai

Contributor

Summary

Fixes tokenizer JSON routing false positives by adding a bounded Hugging Face tokenizer.json ownership check before generic JSON model scanners claim the file.

Root cause:

  • Generic .json routing could send large tokenizer JSON files through MXNet bounded symbol routing, producing mxnet_symbol_routing_inconclusive on benign tokenizer files.
  • Jinja2 and XGBoost also selected exact tokenizer.json via generic JSON/name gates unless content proved otherwise.
  • Manifest metadata advertised tokenizer.json / tokenizer_config.json as exact Manifest filenames even though ManifestScanner.can_handle() rejects tokenizer filenames.

Fix:

  • Add is_huggingface_tokenizer_json_file() with a deterministic 16 MiB read cap.
  • Claim exact tokenizer.json only when the complete root JSON fits inside the cap, contains tokenizer schema evidence (version, added_tokens, model.type, model.vocab), and has no root-level Jinja2/MXNet/XGBoost conflict keys.
  • Keep fail-closed behavior when the file is oversized, malformed, ambiguous, or contains chat_template/template, learner, or MXNet nodes/arg_nodes/heads.
  • Remove tokenizer exact filenames from Manifest routing metadata to match scanner policy.
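A minimal sketch of the bounded ownership check described in the list above, for readers who want the shape of the logic. This is not the code in modelaudit/utils/file/detection.py: the function name `looks_like_hf_tokenizer_json`, the constant names, and the choice to require all four evidence fields are assumptions made for illustration.

```python
import json
from pathlib import Path

# Assumed values, taken from the PR description rather than the real code.
TOKENIZER_ROUTING_CAP = 16 * 1024 * 1024  # 16 MiB read cap
CONFLICT_KEYS = {"chat_template", "template", "learner", "nodes", "arg_nodes", "heads"}


def looks_like_hf_tokenizer_json(path: str) -> bool:
    """Hypothetical stand-in for is_huggingface_tokenizer_json_file()."""
    p = Path(path)
    if p.name != "tokenizer.json":
        return False
    with p.open("rb") as fh:
        # Read one byte past the cap so an oversized file can be detected.
        data = fh.read(TOKENIZER_ROUTING_CAP + 1)
    if len(data) > TOKENIZER_ROUTING_CAP:
        return False  # oversized: not claimed, other routing still applies
    try:
        root = json.loads(data)
    except (ValueError, RecursionError):
        return False  # malformed: not claimed
    if not isinstance(root, dict):
        return False
    if CONFLICT_KEYS.intersection(root):
        return False  # root-level Jinja2/XGBoost/MXNet signal: not claimed
    model = root.get("model")
    if not isinstance(model, dict):
        return False
    # Assumption: all four pieces of schema evidence must be present.
    return (
        "version" in root
        and "added_tokens" in root
        and "type" in model
        and "vocab" in model
    )
```

Returning `False` here only means "do not claim the file as a benign tokenizer"; the caller is expected to fall through to the existing scanners or fail-closed path.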

Security tradeoff:

  • This avoids expensive/false-positive binary-framework passes for proven benign tokenizer JSON.
  • It is not filename-only: malicious tokenizer-shaped controls with Jinja2, XGBoost, MXNet, or Manifest signals still route to the relevant scanner or existing fail-closed path.
  • Files above the tokenizer routing cap are not exempted.

Validation

Focused tests:

PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/utils/file/test_filetype.py -k 'tokenizer_json or generic_json_value_budget_before_mxnet_structure'
# 6 passed

PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/test_core.py -k 'tokenizer_json or tokenizer_config_preserves_direct_jinja or config_json_with_tokenizer_schema'
# 7 passed

PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/scanners/test_scanner_registry.py -k 'tokenizer_exact or hf_tokenizer_json or manifest_json'
# 3 passed

Affected suite:

PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto --maxfail=1 tests/utils/file/test_filetype.py tests/test_core.py tests/scanners/test_scanner_registry.py tests/scanners/test_jinja2_template_scanner.py tests/scanners/test_xgboost_scanner.py tests/scanners/test_mxnet_scanner.py tests/scanners/test_manifest_scanner.py
# 1450 passed, 367 skipped, 3 warnings

Full pre-commit/CI-style local checks:

uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
# 419 files already formatted

uv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
# All checks passed

uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
# Success: no issues found in 474 source files

PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1
# 16900 passed, 1292 skipped, 30 warnings

git diff --check
# clean

Real pinned Hugging Face reproductions:

PROMPTFOO_DISABLE_TELEMETRY=1 HF_HOME=/tmp/modelaudit-task06-final-hf timeout 300 uv run modelaudit scan --no-cache --format json --output /tmp/modelaudit-task06-final-deepseek-tokenizer.json --timeout 240 --max-size 200MB https://huggingface.co/deepseek-ai/DeepSeek-R1/resolve/56d4cbbb4d29f4355bab4b9a39ccb717a14ad5ad/tokenizer.json

Outcome: exit 0, success: true, files_scanned: 1, scanner_names: [], checks_len: 0, one license-info issue only. Grep over checks/issues found no mxnet, xgboost, jinja, manifest, or symbol routing.

PROMPTFOO_DISABLE_TELEMETRY=1 HF_HOME=/tmp/modelaudit-task06-final-hf timeout 300 uv run modelaudit scan --no-cache --format json --output /tmp/modelaudit-task06-final-higgs-tokenizer.json --timeout 240 --max-size 200MB https://huggingface.co/bosonai/higgs-audio-v3-tts-4b/resolve/5402f019e7f316ff513e265f0431e145afcd2cc1/tokenizer.json

Outcome: exit 0, success: true, files_scanned: 1, scanner_names: [], checks_len: 0, one license-info issue only. Grep over checks/issues found no mxnet, xgboost, jinja, manifest, or symbol routing.

@github-actions

github-actions Bot commented Jun 11, 2026

Contributor

Workflow run and artifacts

Performance Benchmarks

Compared 12 shared benchmarks with a regression threshold of 15%.
Status: 1 regressions, 0 improved, 11 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 1.439s -> 1.452s (+0.9%).

Top regressions:

  • tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_direct_malicious_upload +29.9% (456.2us -> 592.9us, direct-malicious-upload, malicious_reduce, size=52 B, files=1)
| Workload | Benchmark | Target | Size | Files | Baseline | Current | Change | Status |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| direct-malicious-upload | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_direct_malicious_upload | malicious_reduce | 52 B | 1 | 456.2us | 592.9us | +29.9% | regression |
| nested-payload-review | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_base64] | nested_base64 | 98 B | 1 | 523.0us | 508.7us | -2.7% | stable |
| warm-cache-rescan | tests/benchmarks/test_scan_benchmarks.py::test_scan_warm_cached_repository_rescan | release-candidate | 547.3 KiB | 32 | 103.34ms | 105.73ms | +2.3% | stable |
| mixed-model-repository | tests/benchmarks/test_scan_benchmarks.py::test_scan_release_candidate_repository | release-candidate | 547.3 KiB | 32 | 482.00ms | 488.63ms | +1.4% | stable |
| padded-multi-stream-upload | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_padded_multi_stream_upload | multi_stream_padded | 4.1 KiB | 1 | 584.3us | 576.8us | -1.3% | stable |
| nested-payload-review | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_hex] | nested_hex | 130 B | 1 | 546.4us | 540.3us | -1.1% | stable |
| nested-payload-review | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payload_review[nested_raw] | nested_raw | 78 B | 1 | 516.4us | 522.0us | +1.1% | stable |
| duplicate-heavy-registry | tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_registry_snapshot | registry-snapshot | 915.2 KiB | 13 | 404.86ms | 407.78ms | +0.7% | stable |
| single-checkpoint-preflight | tests/benchmarks/test_scan_benchmarks.py::test_scan_single_checkpoint_before_load | single_checkpoint.pkl | 183.0 KiB | 1 | 73.95ms | 74.38ms | +0.6% | stable |
| clean-training-checkpoint | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_clean_training_checkpoint | safe_large | 278.2 KiB | 1 | 113.48ms | 112.84ms | -0.6% | stable |
| suspicious-pickle-intake | tests/benchmarks/test_scan_benchmarks.py::test_scan_suspicious_pickle_intake | suspicious-intake | 183.8 KiB | 4 | 142.63ms | 143.34ms | +0.5% | stable |
| chunked-upload-stream | tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_chunked_upload_stream | chunked_stream | 278.2 KiB | 1 | 116.26ms | 116.17ms | -0.1% | stable |
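
For reference, the status column can be reproduced with a simple relative-change rule. The sketch below assumes the 15% threshold is applied symmetrically (slower by more than 15% is a regression, faster by more than 15% is an improvement); the function name `classify` is hypothetical and the real benchmark tooling may compute this differently.

```python
def classify(baseline: float, current: float, threshold: float = 0.15) -> str:
    """Label a benchmark by relative change; both timings must share a unit."""
    change = (current - baseline) / baseline
    if change > threshold:
        return "regression"
    if change < -threshold:
        return "improved"  # assumption: symmetric threshold for improvements
    return "stable"
```

With the reported timings, `classify(456.2, 592.9)` lands in the regression bucket and `classify(523.0, 508.7)` stays stable, which matches the bot output above.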

@mldangelo-oai

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7c18dd4561

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9659fd854

Comment thread modelaudit/utils/file/detection.py Outdated
@mldangelo-oai

Contributor Author

Additional pinned QA: FacebookAI/xlm-roberta-large@c23d21b0620b635a76227c604d44e43a9f0ee389 has a 17 MB onnx/tokenizer.json; origin/main routes it through MXNet/JAX probes and fails closed on bounded-probe limits. Validate this exact tokenizer against the PR head, with selected-scanner and malicious Jinja/MXNet/JAX controls, before requesting a fresh exact-head review.

@mldangelo-oai

Contributor Author

Additional tokenizer boundary QA: google/gemma-4-E4B-it@fee6332c1abaafb77f6f9624236c63aa2f1d0187 has a valid 32,169,626-byte tokenizer.json that exact main routes to MXNet and terminates inconclusive. baidu/NAVA@16c20287c79c0085ebc8213ca7894701948891f0 reproduces the same route at Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json. Please include both exact files in the late-security-key/bounded-routing regression matrix.

@mldangelo-oai

Contributor Author

Pinned QA input: intfloat/multilingual-e5-base@d128750597153bb5987e10b1c3493a34e5a4502a has two identical 17,082,660-byte tokenizer.json files. Current main routes each through JAX checkpoint and MXNet symbol scanners, producing four inconclusive failures. Please validate the current head identifies tokenizer ownership before those binary/checkpoint routes, deduplicates identical root causes, and preserves positive JAX/MXNet signatures.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c1e5bb294

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/core.py Outdated

Contributor Author

Cross-PR validated QA: current main 8d6c4864 silently skips ordinary Orbax/JAX state.json (clean unknown, exit 0, empty SARIF), while identical bytes under .checkpoint trigger jax_checkpoint, critical S902, and exit 1. The bypass reaches direct file, directory, ZIP/TAR member, downloaded streaming, strict, explicit-scanner, malformed-JSON, and SARIF paths.

This PR independently overlaps the ordinary-JSON routing fix, but #1609 is the primary remediation and has already passed the full fixture matrix on its current branch. Please ensure this branch either remains compatible with #1609 or narrows duplicate behavior so merging both cannot reintroduce ambiguous ownership. Preserve benign generic JSON, tokenizer JSON, and manifest controls.

Full coordinator report: modelaudit-pr-swarm/outputs/deep-scan-validation-jax-json-routing-20260611.md.

@codex compare the current branch against #1609’s ownership changes, add only non-duplicative coverage needed here, rebase after the primary fix lands if necessary, and request exact-head review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c1e5bb294

Comment thread modelaudit/core.py Outdated
Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/utils/file/detection.py Outdated

Contributor Author

Additional pinned tokenizer JSON QA: Qwen/Qwen3-1.7B@70d244cc86ccca08cf5af4e1e306ecf908b1ad5e reports tokenizer.json as inconclusive MXNet routing because the bounded JSON probe reaches its limit. Qwen/Qwen2.5-VL-3B-Instruct@66285546d2b821cf421d4f5eb2576359d3770cd3 is a negative control for the same family. @codex include the exact tokenizer files in current-head QA and preserve real MXNet symbol detection plus the validated JAX/Orbax state.json route.

@mldangelo-oai

Contributor Author

Pinned tokenizer QA completed on head 74f6c3b.

Exact benign files validated with cache disabled:

  • FacebookAI/xlm-roberta-large@c23d21b0620b635a76227c604d44e43a9f0ee389 onnx/tokenizer.json (17,082,660 bytes): tokenizer ownership true; default, Jinja-selected, MXNet-selected, and JAX-selected scans all succeeded without MXNet/JAX fail-closed reasons.
  • google/gemma-4-E4B-it@fee6332c1abaafb77f6f9624236c63aa2f1d0187 tokenizer.json (32,169,626 bytes): tokenizer ownership true; default, Jinja-selected, MXNet-selected, and JAX-selected scans all succeeded without MXNet/JAX fail-closed reasons.
  • baidu/NAVA@16c20287c79c0085ebc8213ca7894701948891f0 Wan2.2-TI2V-5B/google/umt5-xxl/tokenizer.json (16,837,417 bytes): tokenizer ownership true; default, Jinja-selected, MXNet-selected, and JAX-selected scans all succeeded without MXNet/JAX fail-closed reasons.

Exact-path malicious controls were also exercised: Jinja template evidence reaches jinja2_template, MXNet late-root evidence fails closed through MXNet routing instead of being suppressed, and selected JAX late-root evidence reaches jax_checkpoint and fails closed on the bounded JSON limit for the oversized exact file. Local validation also includes the focused tokenizer regressions, broader routing/scanner suite, ruff, mypy, and the full non-slow/non-integration pytest lane.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74f6c3b0e5

Comment thread modelaudit/utils/file/detection.py Outdated
@chatgpt-codex-connector

Summary

  • Committed 99b5288 (fix: preserve tokenizer conflict routing) and created a follow-up PR titled fix: preserve tokenizer conflict routing on PR follow-up.
  • Added decoded JSON-key probing for bounded tokenizer evidence so escaped keys such as chat\u005ftemplate are decoded before routing decisions, and reused that decoded-key probe for tokenizer suffix conflict checks.
  • Hardened incomplete oversized-tokenizer ownership paths to re-check late suffix conflicts before suppressing MXNet/Jinja/JAX/XGBoost routing.
  • Added low-level routing coverage for the requested Qwen pinned tokenizer matrix, escaped tokenizer template keys, and escaped late conflict keys.
  • Added scan-level coverage preserving ordinary Orbax/JAX state.json, escaped tokenizer Jinja routing, and the Qwen tokenizer matrix before binary/checkpoint JSON routes.
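
To illustrate why decoded-key probing matters, here is a rough sketch of the idea rather than the committed implementation. A raw byte search for `chat_template` misses a key written as `chat\u005ftemplate`, so candidate keys are pulled from a possibly truncated prefix and JSON-decoded before comparison. The helper name `decoded_keys_in_prefix` and the regex are assumptions, and the regex deliberately over-approximates what counts as a key position.

```python
import json
import re

# A JSON string literal followed by a colon, i.e. something that looks like a key.
# This is an approximation: it does not track full JSON structure.
_KEY_RE = re.compile(rb'"((?:[^"\\]|\\.)*)"\s*:')


def decoded_keys_in_prefix(raw_prefix: bytes) -> set[str]:
    """Collect JSON-decoded key names from a bounded (maybe truncated) prefix."""
    keys: set[str] = set()
    for match in _KEY_RE.finditer(raw_prefix):
        try:
            # Decode escapes such as \u005f the same way a JSON parser would.
            keys.add(json.loads(b'"' + match.group(1) + b'"'))
        except ValueError:
            continue  # not a valid JSON string; ignore this candidate
    return keys


# The escaped key is invisible to a raw byte search but visible after decoding.
raw = b'{"version": "1.0", "chat\\u005ftemplate": "{{ x }}", "model": {"type": "BPE"'
print(b"chat_template" in raw)
print("chat_template" in decoded_keys_in_prefix(raw))
```

Because the probe works on a prefix, it tolerates the truncated input shown here; a false positive only sends the file to a stricter route, which is the safer direction for this check.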

Testing

  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/utils/file/test_filetype.py -k 'qwen or escaped or routing_budget_is_claimed or late_root_overlap_after_probe' tests/test_core.py -k 'qwen or escaped_chat_template or ordinary_orbax_state_json or tokenizer_json'
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/utils/file/test_filetype.py -k 'pinned_qwen or escaped or late_conflict_key or routes_renamed_mxnet_symbol' tests/test_core.py -k 'pinned_qwen or escaped_chat_template or ordinary_orbax_state_json or late_mxnet_markers_preserve_mxnet_detection or late_jax_root_preserves_selected_jax_detection'
  • uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1 — failed in existing CLI symlink tests (test_cli_report_writers_recheck_parent_links_on_fallback, test_cli_report_writers_reject_symlinked_parent_directory) unrelated to the touched routing files; I interrupted after 10m56s once those failures were reported.
  • ⚠️ PROMPTFOO_DISABLE_TELEMETRY=1 HF_HOME=/tmp/modelaudit-qwen-qa timeout 120 uv run modelaudit scan --no-cache --format json --output /tmp/modelaudit-qwen3-tokenizer.json --timeout 90 --max-size 200MB https://huggingface.co/Qwen/Qwen3-1.7B/resolve/70d244cc86ccca08cf5af4e1e306ecf908b1ad5e/tokenizer.json — Hugging Face download returned 403 Forbidden in this environment.
  • ⚠️ PROMPTFOO_DISABLE_TELEMETRY=1 HF_HOME=/tmp/modelaudit-qwen-qa timeout 120 uv run modelaudit scan --no-cache --format json --output /tmp/modelaudit-qwen25vl-tokenizer.json --timeout 90 --max-size 200MB https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/resolve/66285546d2b821cf421d4f5eb2576359d3770cd3/tokenizer.json — Hugging Face download returned 403 Forbidden in this environment.

…er-json-routing-20260610' into mdangelo/codex/takeover-pr1638-20260611

# Conflicts:
#	modelaudit/utils/file/detection.py
@mldangelo-oai

Contributor Author

@codex add tokenizer routing QA for yuxinlu1/gemma-4-12B-it-Claude-4.6-4.8-Opus-GGUF@19e33723755977da0b9b5482a88772aad4dfce03: fp16/tokenizer.json is currently marked mxnet_symbol_routing_incomplete despite being a normal tokenizer asset.

@mldangelo-oai

Contributor Author

@codex add Qwen/Qwen2.5-Omni-7B@ae9e1690543ffd5c0221dc27f79834d0294cba00 tokenizer QA: the normal tokenizer.json is currently marked mxnet_symbol_routing_incomplete.

@mldangelo-oai

Contributor Author

@codex add BAAI/bge-reranker-base@2cfc18c9415c912f9d8155881c133215df768a70 tokenizer QA: its normal 17 MB tokenizer.json is marked both jax_json_checkpoint_analysis_size_limit and mxnet_symbol_routing_incomplete.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b37b1fe12f

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/utils/file/detection.py Fixed
…er-json-routing-20260610' into mdangelo/codex/takeover-pr1638-20260611

# Conflicts:
#	modelaudit/scanners/jax_checkpoint_scanner.py
#	modelaudit/utils/file/detection.py
@chatgpt-codex-connector

Summary

  • Added pinned-QA constants for Qwen/Qwen2.5-Omni-7B@ae9e1690543ffd5c0221dc27f79834d0294cba00 and its tokenizer.json path in both routing and scan-level test modules (tests/utils/file/test_filetype.py L77-L79, tests/test_core.py L107-L109).
  • Added low-level tokenizer routing QA that simulates the Qwen2.5-Omni tokenizer shape under reduced tokenizer/MXNet probe budgets and verifies it is claimed as benign tokenizer JSON before MXNet routing can mark it incomplete (tests/utils/file/test_filetype.py L1618-L1649).
  • Added scan-level QA verifying the same Qwen2.5-Omni-shaped tokenizer scans successfully without mxnet_symbol_routing_incomplete or MXNet/Jinja/Manifest/XGBoost scanner ownership (tests/test_core.py L9733-L9767).
  • Committed changes as 9d8d194 (test: add Qwen Omni tokenizer routing QA) and created a follow-up PR record.

Testing

  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/utils/file/test_filetype.py -k 'qwen25_omni or tokenizer_json_over_routing_budget' tests/test_core.py -k 'qwen25_omni or oversized_hf_tokenizer_json'
  • uv run ruff format --check tests/utils/file/test_filetype.py tests/test_core.py
  • uv run ruff check tests/utils/file/test_filetype.py tests/test_core.py
  • uv run mypy tests/utils/file/test_filetype.py tests/test_core.py
  • uv run ruff format modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run ruff check --fix modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • ⚠️ PROMPTFOO_DISABLE_TELEMETRY=1 HF_HOME=/tmp/modelaudit-qwen25-omni-qa timeout 120 uv run modelaudit scan --no-cache --format json --output /tmp/modelaudit-qwen25-omni-tokenizer.json --timeout 90 --max-size 200MB https://huggingface.co/Qwen/Qwen2.5-Omni-7B/resolve/ae9e1690543ffd5c0221dc27f79834d0294cba00/tokenizer.json — Hugging Face download returned 403 Forbidden in this environment.
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1 — failed in existing tests/cache/test_cache_correctness.py::test_cached_scan_persists_deterministic_validation_findings (scan_count mismatch) before I interrupted the run after 12 minutes; this is unrelated to the touched tokenizer routing tests.

@mldangelo-oai

Contributor Author

Updated head: 38859e03cb5882772b11fd63e66f9434d0d01a85.

Addressed the latest Codex routing thread:

  • preserve nested model-level template routing after an oversized non-vocab model member such as merges
  • keep stricter handling for vocab so vocab token keys named template do not route to Jinja

Validation on this head:

  • uv run ruff format modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - 419 files unchanged
  • uv run ruff check --fix modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - pass
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - pass
  • focused tokenizer/JAX/Jinja/XGBoost routing regressions - 11 passed
  • broader routing/scanner slice - 373 passed, 315 skipped, 1238 deselected

Review threads: 22/22 resolved.

@mldangelo-oai

Contributor Author

@codex review head 38859e0

…er-json-routing-20260610' into mdangelo/codex/takeover-pr1638-20260611

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 38859e03cb

Comment thread modelaudit/utils/file/detection.py
@mldangelo-oai

Contributor Author

Updated head: e2b23c57af2def9dbf303225a55a3684841904a3.

Addressed the latest Codex routing thread:

  • tokenizer template route evidence now has a bounded-memory streaming structural scan beyond the first probe and final suffix window
  • the stream scan tracks JSON context and skips model.vocab, preserving benign vocab-token handling while routing late root/model templates to Jinja
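
The context-tracking idea can be shown on an already-parsed object. This sketch is not the bounded-memory streaming scan itself, which works over the byte stream; it only illustrates why a key named template inside model.vocab is ignored while one at root or model level is reported. All names here (`template_key_paths`, `TEMPLATE_KEYS`, `SKIPPED_PATHS`) are hypothetical, and unlike the PR, the walk reports template keys at any depth outside the skipped path.

```python
from typing import Any, Iterator

TEMPLATE_KEYS = {"chat_template", "template"}
# Vocab entries are token strings, so a token literally named "template" is benign.
SKIPPED_PATHS = {("model", "vocab")}


def template_key_paths(node: Any, path: tuple[str, ...] = ()) -> Iterator[tuple[str, ...]]:
    """Yield the path of every template-like key outside the skipped subtrees."""
    if path in SKIPPED_PATHS:
        return  # do not descend into model.vocab
    if isinstance(node, dict):
        for key, value in node.items():
            child = path + (key,)
            if key in TEMPLATE_KEYS:
                yield child
            yield from template_key_paths(value, child)
    elif isinstance(node, list):
        for item in node:
            # List indices are not part of the tracked key path.
            yield from template_key_paths(item, path)
```

A document whose only template key lives under model.vocab yields nothing, while a model-level `template` member yields `("model", "template")` and would be routed to the Jinja scanner.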

Validation on this head:

  • uv run ruff format modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - pass
  • uv run ruff check --fix modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - pass
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - pass
  • focused tokenizer/JAX/Jinja/XGBoost routing regressions - 14 passed
  • broader routing/scanner slice - 376 passed, 315 skipped, 1238 deselected

Review threads: 23/23 resolved.

@mldangelo-oai

Contributor Author

@codex review head e2b23c5

…er-json-routing-20260610' into mdangelo/codex/takeover-pr1638-20260611

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e2b23c57af

Comment thread modelaudit/utils/file/detection.py
Comment thread modelaudit/utils/file/detection.py
@mldangelo-oai

Contributor Author

Updated head: 58bfc032f75b6c5defc847393000d035fb167284.

Addressed the latest Codex routing threads:

  • streaming tokenizer route scan now covers late root MXNet/XGBoost/JAX route keys beyond the probe and suffix windows
  • JAX identity keys are value-aware (framework: jax routes, framework: transformers does not)
  • streaming template-value detection now recognizes Unicode-escaped Jinja openers that JSON decoding would turn into {{, {%, or {#
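
A small sketch of the last two points, under stated assumptions. The regex treats `\u007b`, `\u0025`, and `\u0023` as the escaped forms of `{`, `%`, and `#`, so a raw JSON string value is flagged when decoding would produce `{{`, `{%`, or `{#`. The names `has_jinja_opener` and `framework_routes_to_jax` are hypothetical, and the exact-match comparison on `framework` is an assumption about what "value-aware" means here.

```python
import re

# Literal or Unicode-escaped "{" followed by a literal or escaped "{", "%", or "#".
_BRACE = rb"(?:\{|\\u007[bB])"
_SECOND = rb"(?:\{|%|#|\\u007[bB]|\\u0025|\\u0023)"
JINJA_OPENER_RE = re.compile(_BRACE + _SECOND)


def has_jinja_opener(raw_value: bytes) -> bool:
    """Check raw (undecoded) JSON bytes for something that decodes to a Jinja opener."""
    return JINJA_OPENER_RE.search(raw_value) is not None


def framework_routes_to_jax(root: dict) -> bool:
    """Value-aware identity check: the key alone is not enough, the value must be jax."""
    return root.get("framework") == "jax"
```

The byte-level match can over-report (for example when the backslash itself is escaped), which only sends more files to the Jinja scanner rather than fewer.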

Validation on this head:

  • uv run ruff format modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - pass
  • uv run ruff check --fix modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - pass
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ - pass
  • focused tokenizer/JAX/Jinja/XGBoost routing regressions - 19 passed
  • broader routing/scanner slice - 381 passed, 315 skipped, 1238 deselected

Review threads: 25/25 resolved.

@mldangelo-oai

Contributor Author

@codex review head 58bfc03

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep them coming!

…er-json-routing-20260610' into mdangelo/codex/takeover-pr1638-20260611
@mldangelo-oai

Contributor Author

@codex review head 58bfc03

@mldangelo-oai mldangelo-oai requested a review from mldangelo June 11, 2026 09:47
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Delightful!

@mldangelo-oai

Contributor Author

@codex review

Exact head: ec571c8

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Breezy!

@mldangelo-oai

Contributor Author

Independent review of exact head ec571c8a34c3508f38817e447bb82484c403475f found two routing false negatives:

  1. P1: malicious tokenizer-shaped JSON passes clean under extensionless or misleading suffixes. Tokenizer route evidence is gated to exact tokenizer.json/.json, and the preferred Jinja route also requires .json. A tokenizer object with chat_template="{{ ''.__class__.__mro__[1].__subclasses__() }}" returned success=true, scanner=unknown, and no failures as extensionless tokenizer; named tokenizer.bin, it returned success=true, scanner=pytorch_binary, and no failures. The content evidence should not be discarded solely because the suffix is absent or misleading.
  2. P2: malformed exact tokenizer.json can pass clean instead of failing closed. A small tokenizer.json containing a raw control byte inside JSON returned format=unknown, success=true, scanner=unknown, and no failures. The tokenizer proof rejects malformed content, then core silently skips the unknown format. Large malformed/late-route cases fail closed, but this small malformed/control-byte case remains clean.
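
The P2 expectation can be sketched as follows. This is an illustration of the fail-closed behavior being requested, not what head ec571c8 does; the function name `route_exact_tokenizer_json` and the verdict strings are invented for the example. Python's `json.loads` rejects a raw control byte inside a string by default, which makes the malformed case easy to reproduce.

```python
import json


def route_exact_tokenizer_json(raw: bytes) -> str:
    """Hypothetical verdict for bytes stored under the exact name tokenizer.json."""
    try:
        root = json.loads(raw)
    except (ValueError, RecursionError):
        # Malformed content under a trusted filename: fail closed instead of
        # falling back to an "unknown" format that the core silently skips.
        return "inconclusive"
    if not isinstance(root, dict):
        return "inconclusive"
    return "tokenizer"
```

For example, `route_exact_tokenizer_json(b'{"version": "1.0\x01"}')` takes the fail-closed branch because of the raw control byte, while the same document without it is accepted.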

Validation on the stable exact head: routing/filetype tests had 65 passes, core tests 52 passes, registry/CLI/streaming/archive tests 11 passes, and the Jinja/JAX/XGBoost overlap slice had 654 passes with one skip. Generated Holo-shaped 13 MiB and Qwen-shaped 18 MiB tokenizers stayed benign; late Jinja and MXNet evidence failed closed; a nested ZIP tokenizer Jinja payload was detected. The exact pinned external tokenizer bytes were unavailable, so generated same-shape corpora were used.

At the review snapshot all 25 threads were resolved and most CI was green, with four Python/Windows lanes still running. These two false negatives block merge even if CI completes successfully.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7f749d3d3e

Comment thread modelaudit/utils/file/detection.py
Comment thread modelaudit/utils/file/detection.py
Comment thread modelaudit/scanners/jinja2_template_scanner.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 599ebba8a4

Comment thread modelaudit/utils/file/detection.py Outdated
Comment thread modelaudit/scanners/xgboost_scanner.py Outdated
Comment thread modelaudit/utils/file/detection.py
@mldangelo-oai mldangelo-oai merged commit e47be19 into main Jun 11, 2026
29 checks passed
@mldangelo-oai mldangelo-oai deleted the mdangelo/codex/hf-fp-t06-tokenizer-json-routing-20260610 branch June 11, 2026 22:57