fix(text): contextualize tokenizer vocabulary indicators #1653
mldangelo-oai wants to merge 3 commits into
@codex review

Performance Benchmarks
Compared

Codex Review: Didn't find any major issues. Breezy!
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

New exact pinned QA before merge:

@codex review Exact-head QA update for
Focused/local validation:

Codex Review: Didn't find any major issues. 🚀

Pinned vocabulary QA input:

Second pinned vocabulary QA:

Codex Review: Didn't find any major issues. Nice work!

Additional pinned baseline reproduction for this PR's exact false-positive family:
This is directly in scope. Auto-merge is paused until the exact pinned artifact is rerun at the current PR head and proves zero C2 findings while the existing active-instruction controls remain actionable.

Additional exact-main vocabulary QA from Hugging Face rank 254:
This independently reproduces the rank-242 multilingual vocabulary family. Please include both exact models in current-head QA while preserving detection of these terms in executable, URL, configuration, prose-instruction, and serialized contexts. Full audit:

@codex review Pinned multilingual vocab follow-up for head
Validation run:

Codex Review: Didn't find any major issues. Bravo.

Supplemental pinned vocabulary QA for the later rank-254 note on head
This is consistent with the multilingual rank-242 QA above; vocabulary-only entries are clean, while active prose/config/URL/script/metadata-shaped contexts remain covered by the added regressions.

ianw-oai left a comment
Holding my approval for now: the tokenizer-vocabulary suppression is still pattern-specific (trojan/zombie only), which looks too tailored to the pinned QA cases for this simple-approval pass.
Summary
Fixes tokenizer vocabulary false positives where isolated lexical entries such as `trojan` and `zombie` in line-oriented `vocab.txt` files produced command-and-control findings.
Root Cause
The network communication detector reports the first matching `cc_pattern` occurrence in text. `TextScanner` previously downgraded bare vocabulary-like tokens to informational severity, but it still emitted S310 findings for benign tokenizer vocabulary entries in Hugging Face `vocab.txt` files.
Security Tradeoff
This keeps the policy narrow:
- `trojan` and `zombie`, including common tokenizer prefixes/plurals.
- `botnet` detected.
Validation
Outcomes:
- `tests/scanners/test_text_scanner.py`: 449 passed
- `ruff format`, `ruff check --fix`, `mypy`, and `git diff --check`: clean
Pinned Real-Model Streaming QA
Command run from the repository with telemetry disabled and a 2 MiB bound on each streamed `vocab.txt`:
Outcomes:
- `BAAI/bge-base-en-v1.5 @ a5beb1e3e68b9ab74eb54cfd186867f64f240e1a`: `vocab.txt` 231508 bytes / 30522 lines, 1 streamed file scanned, `success=true`, network check `passed`, `cc_pattern_checks=0`
- `ProsusAI/finbert @ 4556d13015211d73dccd3fdd39d39232506f3e43`: `vocab.txt` 231508 bytes / 30522 lines, 1 streamed file scanned, `success=true`, network check `passed`, `cc_pattern_checks=0`
Malicious Controls
Added regression coverage that keeps these actionable:
- `trojan`/`zombie` vocabulary entries.
- `curl`, `nc`, `requests.get`, and multi-token C&C instructions inside `vocab.txt`.
- `vocab.txt` files.
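
For readers who want a concrete picture of the narrow policy described under Security Tradeoff and Malicious Controls, here is a minimal sketch of a line-level check. It is not the code in this PR: the function names, the `SUPPRESSIBLE_TERMS` set, and the specific tokenizer prefix markers are all assumptions chosen for illustration. It treats a line as benign only when the whole line is a single bare token whose lexical core is `trojan` or `zombie` (optionally pluralized); anything else, including `botnet` and multi-token lines, is left to the detector.

```python
import re
from typing import Optional

# Hypothetical names and values; not the project's actual API.
SUPPRESSIBLE_TERMS = {"trojan", "zombie"}
# Assumed subword markers: WordPiece "##", SentencePiece "▁", byte-level BPE "Ġ".
TOKENIZER_PREFIXES = ("##", "\u2581", "\u0120")


def normalize_vocab_token(line: str) -> Optional[str]:
    """Return the lowercase core of a one-token line, or None if the line
    is not a single purely alphabetic token."""
    token = line.strip()
    for prefix in TOKENIZER_PREFIXES:
        if token.startswith(prefix):
            token = token[len(prefix):]
            break
    # Any whitespace, punctuation, or digits means the line carries more
    # than an isolated lexical entry, so it is not eligible for suppression.
    if not re.fullmatch(r"[A-Za-z]+", token):
        return None
    return token.lower()


def is_benign_vocab_entry(line: str) -> bool:
    """True only for isolated trojan/zombie vocabulary entries."""
    core = normalize_vocab_token(line)
    if core is None:
        return False
    if core in SUPPRESSIBLE_TERMS:
        return True
    # Simple plural form, e.g. "zombies".
    return core.endswith("s") and core[:-1] in SUPPRESSIBLE_TERMS
```

Under these assumptions, `is_benign_vocab_entry("##zombies")` is true, while `is_benign_vocab_entry("botnet")` and `is_benign_vocab_entry("zombie connect to server")` are false, so those lines would still reach the command-and-control detector.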