[WIP] tool-call: experimental migration of all parsers to peg-parser infra (w/ better test coverage) #18353
base: master
Conversation
Holy...
I'll have to look more closely, but nothing particular stands out as disagreeable.
Is this in reference to the raw string parameters that circumvent …? I intend to add a schema optimization step that should help a bit. Unfortunately, I'm starting to think we might have to loosen the grammar constraints and rely more on the agent feedback loop if model creators no longer want to emit JSON.
@ochafik actually I've been working on an alternative approach - an automatic PEG parser generator based on differential template output analysis that will handle like 90% of all templates out of the box - then we only have to handle the really hard/weird/atypical ones.
@aldehir Sorry no, I meant for the handling of …

Without a clean way to escape XML arg end blocks in string arguments (and corresponding fine-tuning), all the XML formats seem inherently fragile, unless we accept they contain a JSON string (which I'd done in my first Qwen3 parser). XML-esque escaping of … Alternatively we could introduce token-based parsing (e.g. …).

Re: string args, I've added a `schema_or_raw_string_until` helper to apply the same treatment across the afflicted parsers.
@pwilkin this sounds super exciting, I wanted to do this in Minja at some point but never finished my prototype / got desperate as I realized how diverse the practices were in Jinja templates 😓

But more importantly... lots of small models output tool calls with an alarmingly high creativity, not at all guessable by just analyzing templates (Qwen 2.5 comes to mind: I had to sample many tool calls to see what it comes up with in practice, hence the horribly long and accommodating list of triggers in the Hermes 2 grammar & parser). Not to mention outright broken templates. (Could you integrate alternative syntaxes by having multiple templates feed into the same diff engine?)

Anyway, would love to see what you come up with! (Oh and btw, feel free to host this in https://github.com/ochafik/minja if it needs too much testing infra for llama.cpp: it already has all the templates, could be a good fit / playground.)
Oh, interesting. I thought this would be handled by delegating to …

I came to the same conclusions with the XML formats. The vLLM tool call parsing seems affected by the same problems. I'm not sure there's much more we can do.
I worry this will impact model performance. I've noticed that with Devstral (not an XML format), heavy constraining is detrimental to its performance. I'd rather accept that it will fail when …

> // GBNF doesn't have token-awareness, so same as text-based until

It does now (#17816). I was slowly working my way to making the parser token-aware, but for the same reasons--i.e. the argument delimiter is not a unique token--it became a low priority. I mainly added it to handle the grammar-enforced reasoning budget.
Force-pushed from e8aa015 to 427afe6
There's a desire to refactor the common chat API (ref #18215). I was working on this, but given this PR and @pwilkin's effort, I'll wait to see what direction we go. Otherwise merging will be a challenge. This PR accomplishes part of what I had in mind: splitting the parsers into separate files. I'd also like to maintain a stateful object during an entire generation request. That would let us remove the serialization mechanics in the PEG parser and implement a variant of incremental Packrat parsing (https://ohmjs.org/pubs/sle2017/incremental-packrat-parsing.pdf).
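For readers unfamiliar with Packrat parsing, here is a toy sketch of the memoization idea it builds on (generic code with invented names, not from this PR; the incremental variant in the linked paper additionally reuses and repairs the memo table as new input is appended):

```cpp
// Toy illustration: packrat parsing memoizes each (rule, position) result so
// backtracking never re-parses the same span. A long-lived parse context could
// keep this table across streamed chunks of a generation.
#include <cctype>
#include <cstdio>
#include <map>
#include <string>
#include <utility>

struct MemoEntry { bool ok; size_t end; };

struct ParseCtx {
    std::string input;                                  // grows as tokens stream in
    std::map<std::pair<int, size_t>, MemoEntry> memo;   // keyed by (rule id, start pos)
};

// Rule 0: a run of digits. Returns true and advances `pos` on success.
static bool parse_digits(ParseCtx & ctx, size_t & pos) {
    const auto key = std::make_pair(0, pos);
    const auto it  = ctx.memo.find(key);
    if (it != ctx.memo.end()) {                         // reuse previous work
        if (it->second.ok) pos = it->second.end;
        return it->second.ok;
    }
    size_t p = pos;
    while (p < ctx.input.size() && std::isdigit((unsigned char) ctx.input[p])) p++;
    const bool ok = p > pos;
    ctx.memo[key] = {ok, p};
    if (ok) pos = p;
    return ok;
}

int main() {
    ParseCtx ctx;
    ctx.input = "12345,";
    size_t pos = 0;
    std::printf("digits: %d, end=%zu\n", parse_digits(ctx, pos), pos);
    // A second attempt at the same position hits the memo table instead of re-scanning.
    size_t pos2 = 0;
    std::printf("memoized: %d, end=%zu\n", parse_digits(ctx, pos2), pos2);
    return 0;
}
```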
@aldehir We could experiment w/ a bit of jinja hackery: ideally we'd want models to think of escaping …

Longer term we should probably start a page about best practices for template writers / finetuning folks (how to pick your tool call syntax and make it work reliably by using the right special tokens, how to escape things properly, etc.).
Sounds fab, just note that in the OG tool calls support PR it was pointed out that sharing objects between the two sides could be a bit iffy w.r.t. threading.
Not a fan either. My hope is model creators will wise up and add some way to disambiguate. It looks like Deepseek v3.2 has addressed this to a degree.
The parser is thread-safe, as it should not be mutated after creation. The parse context should share the same thread context as the accumulated generation string. I think we're good w.r.t. thread safety. I have a few nits, but I'll wait until you're done iterating.
@aldehir It should look a bit less gruesome now (sorry for the moving target!). I just wanna make another pass at the few parsers I couldn't trivially refactor to use the new helpers (…).
I’m not a fan of this level of abstraction. I had a similar concern with the XML parsing implementation: it requires changes to these builders for every new model that’s close but needs a slight tweak. If an abstraction has to be modified for every new case, it’s not really abstracting much. Even more so for the generic tool call parser.
@aldehir Agree on the principle, although it can be a good way to focus on the diffs, esp. in the building phase. I've now inlined …

Also, made tests prettier (split each parser into its own lil test file). There's lots of failures I had to skip; you can run them all (incl. skipped) w/ …
Hmm, this kind of helper is what @pwilkin wanted as well, so I guess I'm out-voted. We'll keep it as is then. Awesome. I'll start reviewing the changes, but it might take me a couple of days. It's a big PR!
Core PEG parser infrastructure improvements:
- Replace string AST tags with integer tag IDs for type safety and faster dispatch
- Mapper functions defined as lambdas to reduce boilerplate
- Add token_tag and literal_tag helper methods for token-aware parsing
- Improve GBNF grammar generation for optional repetitions and lazy mode
- Make SPACE_RULE optional in JSON schema grammar generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Migrate 28 chat template formats to the unified PEG parser approach:
- Each format gets its own module in common/chat-parsers/
- Shared internal header for common patterns and utilities
- XML toolcall infrastructure for formats using XML-style tool calls
- Updated routing in chat.cpp and chat-parser.cpp

Formats migrated: Apertus, Apriel 1.5, Command R7B, DeepSeek R1/V3.1, Firefunction V2, Functionary V3.1/V3.2, Generic, GLM 4.5, GPT-OSS, Granite, Hermes 2 Pro, Kimi K2, LFM2, Llama 3.x, Magistral, MiniMax M2, Ministral 3, Mistral Nemo, Nemotron V2/V3, Qwen3 Coder XML, Seed OSS, Xiaomi MIMO.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add systematic streaming test coverage for all 28 parser formats:
- Needle streaming tests inject unique markers into each semantic field
- Tests verify incremental parsing at every character boundary
- Cross-check declared capabilities against minja's runtime detection
- 21 formats x 6+ scenarios = 126+ regression tests

The needle technique catches:
- Content loss or truncation (both needles must be present)
- Out-of-order emission and buffering bugs (N1 before N2 per pair)
- Function name splitting during streaming (atomic tool names)
- Tool argument regression (args never get shorter)
- Key ordering violations (key N finishes before key N+1)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add feature flag infrastructure for safe migration:
- --experimental-new-parsers CLI flag (defaults to off)
- LLAMA_USE_NEW_PARSERS env var for testing
- Dual-path testing: legacy parsers active by default
- Server integration for runtime parser selection

Also includes:
- Nemotron Nano template fix
- Granite template rename for consistency
- Test model fetching updates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
When tool_choice is "required", the model MUST output tool calls without any preceding content. This change ensures that the following parsers enforce this constraint:
- apertus.cpp
- glm-4-5.cpp
- lfm2.cpp
- minimax-m2.cpp
- nemotron-v2.cpp
- nemotron-v3.cpp
- qwen3-coder-xml.cpp

Previously, these parsers would accept content before tool calls even in required mode, allowing the model to bypass the tool call requirement. Now, when require_tools is true, the parser only matches tool calls (with optional reasoning blocks where supported).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Consolidates the common pattern in XML-based chat parsers for handling schema-based parameter values:
- For string schemas: use until[_max](delimiter, maxLength?)
- For non-string schemas: use schema() with optional space wrapping

This reduces boilerplate across glm-4-5, minimax-m2, qwen3-coder-xml, and seed-oss parsers (removes ~99 lines, adds ~94 lines with docs).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
These templates' parsers don't allow content in tool_choice=required mode, but the test was generating content because tool_required_allows_content defaulted to true.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Make native mapper create tool calls lazily on TOOL_NAME, not TOOL_OPEN
- Buffer pending TOOL_ID when it comes before TOOL_NAME (Command R7B format)
- Add is_partial check for TOOL_ID to skip incomplete IDs
- Fix test_parser_with_streaming to distinguish new vs updated tool calls
- Integrate test_parser_with_streaming into needle test suite
- Improve parse error messages with context

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Fixes GBNF grammar generation for templates with Unicode tokens. DeepSeek R1/V3 use fullwidth vertical line (U+FF5C) and lower one eighth block (U+2581) in their special tokens. The byte-based trie was generating invalid UTF-8 prefixes in exclusion patterns.

Changes:
- Add unicode_trie that works with code points instead of bytes
- Update gbnf_escape_codepoint to handle non-ASCII characters
- Add tests for DeepSeek-style Unicode token formats

Enables: deepseek_r1:experimental, deepseek_v3_1:experimental

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
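A standalone illustration of the underlying issue (not code from this PR): byte-level prefixes of a multi-byte UTF-8 token are not themselves valid UTF-8, which is why the exclusion trie has to operate on code points.

```cpp
// Byte prefixes of a multi-byte UTF-8 token are truncated code points, so a
// byte-based "anything but this token" trie can emit invalid UTF-8 patterns.
#include <cstdio>
#include <string>

int main() {
    // UTF-8 encoding of U+FF5C (fullwidth vertical line), the character used
    // in DeepSeek-style special tokens.
    const std::string tok = "\xEF\xBD\x9C";

    for (unsigned char c : tok) std::printf("%02X ", c);   // prints: EF BD 9C
    std::printf("\n");

    // A byte-based trie produces the prefixes "EF" and "EF BD" while building
    // exclusion alternatives; both cut a code point in half, so the generated
    // GBNF patterns contained invalid UTF-8. A code-point trie never splits them.
    const std::string truncated = tok.substr(0, 2);
    std::printf("prefix length: %zu bytes (incomplete code point)\n", truncated.size());
    return 0;
}
```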
Two fixes for streaming/partial parsing:

1. Skip partial TOOL_CLOSE nodes to prevent premature argument closing during streaming. Without this, arguments would be closed before seeing the full closing tag.

2. Wrap tool name+delimiter in p.atomic() to prevent TOOL_NAME emission on prefix match. This fixes the case where "special_function" would match as complete when input is "special_function_with_opt", causing tool count to decrease when more input arrives.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
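A generic illustration of the prefix-match hazard in point 2 (standalone code with invented names, not the PR's combinators): a known tool name should only be reported once the delimiter after it has been seen.

```cpp
// Naive prefix matching reports "special_function" as soon as the stream
// starts with it, even though the model may still be emitting
// "special_function_with_opt". Requiring the delimiter makes the match atomic.
#include <cstdio>
#include <string_view>

static bool tool_name_complete(std::string_view input, std::string_view name, char delim) {
    if (input.substr(0, name.size()) != name) return false;   // name not (yet) present
    return input.size() > name.size() && input[name.size()] == delim;
}

int main() {
    // Still ambiguous: only a prefix of a longer name has arrived -> not emitted.
    std::printf("%d\n", tool_name_complete("special_function_wi", "special_function", '\n'));
    // Delimiter seen -> the name is complete and can be emitted.
    std::printf("%d\n", tool_name_complete("special_function\n{\"a\":1}", "special_function", '\n'));
    return 0;
}
```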
- Move magistral detection before ministral_3 in template matching
(both use [TOOL_CALLS][ARGS] patterns)
- Add "Unsloth" check to distinguish magistral templates
- Rewrite PEG parser to match actual template format:
[TOOL_CALLS]name[ARGS]{...} (not JSON array with ids)
- Update test to reflect format doesn't include tool call ids
Enables: magistral:legacy, magistral:experimental
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
Add p.space() before tool call patterns in parsers to handle models that output leading whitespace or newlines before tool calls. This is a mechanical fix applied consistently across parsers.

Affected parsers: apertus, apriel-1-5, command-r7b, deepseek-v3-1, functionary-v3-1-llama-3-1, generic, gpt-oss, granite, hermes-2-pro, kimi-k2, lfm2, llama-3-x, mistral-nemo, nemotron-v2, xiaomi-mimo

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Add `p.choice({with_tools, content_only})` pattern to parsers that were
requiring tool calls when tools were provided, even for content-only responses.
Files changed:
- glm-4-5.cpp: Add content_only fallback
- minimax-m2.cpp: Add content_only fallback
- ministral-3.cpp: Add content_only fallback + atomic wrapper for tool name
Enables: glm_4_5:experimental, minimax_m2:experimental, ministral_3:both
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Use `additional_stops` instead of inline `consume_eos()` pattern
- Add `p.space()` after JSON args to handle model's trailing whitespace
- Restructure parallel tool calls handling for cleaner parsing
- Add test case matching server test scenario

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Fix `param_ends` to not include leading `\n` (already consumed by `space()` in value parser). Simplifies content-only path.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Kimi template splits messages into hist_msgs (up to last non-tool-call assistant) and suffix_msgs (after). Both get `<think></think>` tags, but:
- hist_msgs: reasoning_content is discarded (empty think tags)
- suffix_msgs: reasoning_content is preserved

The needle tests use a single assistant message which becomes the "last non-tool-call assistant" and goes to hist_msgs, so reasoning is discarded.

- Mark `supports_disable_thinking=No` since think tags are always output
- Skip run_template_test_suite for experimental impl (needle tests incompatible with this message splitting)

Enables: kimi_k2:experimental

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Fix `p.chars("0-9")` to `p.chars("[0-9]", 1, 10)` - the first argument
is a regex character class pattern, not a range string. Also specify
min/max repetitions (1-10 digits for tool call ID).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add NEW_PARSERS_UNSUPPORTED dict to document templates with known issues when using experimental parsers in server tests:
- LFM2: requires special system message marker
- Llama 3.x: builtin tools need custom TOOL_ARG_NAME handling
- Functionary v3.2: python tool allows raw code fallback
- Nemotron v3: tiny model generates invalid parameter structure
- GPT-OSS: tiny model generates unparseable content
- Kimi K2: tiny model generates format that fails to parse

Also in test-chat.cpp:
- Change test name separator from `_` to `:` for easier grep
- Add skip logic for force_disable_thinking scenarios

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
When repeat(p, min, max) is called with max=0, return eps() instead of creating a repetition parser. This avoids issues with parsers that have no valid matches.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
The legacy lfm2 parser requires a "force json schema." marker in the system message to enable tool call grammar. Skip run_template_test_suite for legacy mode since it uses generic inputs without this marker. The explicit tests in test-lfm2.cpp still run and cover the legacy parser behavior with the proper marker.

Enables: lfm2:legacy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Upstream now defaults message content to empty string instead of null, which adds "content": "" to JSON output after tool_calls. Update both the PEG grammar and test expectation to handle this.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Force-pushed from bf5cb42 to c4ff3e4
The test_peg_parser helper was always setting experimental_new_parsers=true, which didn't match the impl parameter being used for other test decisions. Now it respects the impl parameter like other test helpers. This enables nemotron_v3:legacy which was previously failing.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Force-pushed from 7afae43 to ae718c2
I was testing ggml-org/llama.cpp#18353 and `test-chat` is unbearably slow on MSVC (always has been). The cause is MSVC's `std::regex` engine. The `regex_search()` calls take a considerable amount of time during tokenization.

I added some profiling and tested with the [Apertus-8B-Instruct.jinja](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509/blob/main/chat_template.jinja) template:

```
[Profile] parseExpression: 4.93307 s (102 calls)
[Profile] non_text_regex: 0.0153389 s (319 calls)
[Profile] block_keyword_regex: 0.0048339 s (216 calls)
[Profile] block_close_regex: 0.0068446 s (216 calls)
[Profile] block_open_regex: 0.0817274 s (535 calls)
[Profile] expr_close_regex: 0.0035717 s (102 calls)
[Profile] expr_open_regex: 0.0604591 s (637 calls)
[Profile] comment_regex: 3.11919 s (639 calls)
[Profile] tokenize: 18.1967 s
Parsed template from: .\Apertus-8B-Instruct.jinja
```

**18.2s** to parse.

This PR adds `std::regex_constants::match_continuous` to the `regex_search()` calls. This flag instructs the regex engine to only match from the start of the iterator, which aligns with the existing `match.position() == 0` check.

```
[Profile] parseExpression: 0.0225291 s (102 calls)
[Profile] non_text_regex: 0.013297 s (319 calls)
[Profile] block_keyword_regex: 0.0039072 s (216 calls)
[Profile] block_close_regex: 0.0052241 s (216 calls)
[Profile] block_open_regex: 0.0081572 s (535 calls)
[Profile] expr_close_regex: 0.0025207 s (102 calls)
[Profile] expr_open_regex: 0.0064239 s (637 calls)
[Profile] comment_regex: 0.0076846 s (639 calls)
[Profile] tokenize: 0.117085 s
Parsed template from: .\Apertus-8B-Instruct.jinja
```

**0.12s**, much better.
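For readers unfamiliar with the flag, here is a minimal standalone example of what `match_continuous` changes (illustration only, not the minja code): without it, `regex_search()` scans the whole input for a match anywhere; with it, the search only succeeds when the match begins at the first character.

```cpp
// Demonstrates std::regex_constants::match_continuous: the flag anchors the
// search to the start of the sequence, so the engine does not keep scanning
// the rest of the input when the pattern is not at position 0.
#include <cstdio>
#include <regex>
#include <string>

int main() {
    const std::string text = "some plain text {{ expr }}";
    const std::regex  expr_open(R"(\{\{)");
    std::smatch m;

    const bool anywhere = std::regex_search(text, m, expr_open);
    const bool at_start = std::regex_search(text, m, expr_open,
                                            std::regex_constants::match_continuous);

    std::printf("anywhere: %d, at start only: %d\n", anywhere, at_start); // 1, 0
    return 0;
}
```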
TL;DR: it's a lot, but there's a lot more testing than before.
Building on the PEG parser infrastructure introduced in #17136 by @aldehir, this is an experimental effort to migrate all chat template formats to the unified PEG approach.
Why migrate?

The current monolithic `common/chat.cpp` has grown to ~25 ad-hoc parser implementations that are difficult to maintain. Lots of parsing bugs are hard to reproduce and diagnose (esp. if the user wasn't in `--verbose` mode). The PEG infrastructure offers a cleaner path forward, w/ strong guarantees (modulo bugs) that what is allowed to be generated should be parseable.
How to Test
Changes:
- `common/chat-parsers/*.cpp` - 28 modular parser implementations
- `--experimental-new-parsers` - defaults to off, nothing changes by default

New "Needle" Streaming Tests
Existing streaming tests (`tools/server/tests/unit/test_tool_call.py`) require loading real models and cover only a subset of formats. This PR adds systematic coverage for all 21 formats without the model-loading overhead.

This migration was designed to be safe through systematic test constraints:
21 formats x 6+ scenarios = up to 126 regression tests (some scenarios filtered based on format capabilities)
Each format tests:
- Every semantic field (`content`, `reasoning_content`, tool `name` & `arguments`) is streamed
- `reasoning_content` + `tool_call` / `content` (or 3 of them when the template supports it)
- `json_schema` + `reasoning_content`

How Needle Tests Work
The "needle" technique injects unique marker pairs into each semantic field. For example, in Hermes 2 Pro format with thinking and a tool call:
The test parses this message at every character boundary (simulating streaming), and verifies that both needles of each pair are present and emitted in order, tool names appear atomically, and content/arguments never shrink as more input arrives.
This aims to prove parsers are truly incremental: partial input produces partial output, fields stream in proper order, and nothing is buffered unnecessarily.
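A condensed sketch of the per-prefix check (hypothetical `ParsedMsg`/`parse` stand-ins; the real suite drives each format's actual PEG parser):

```cpp
// Sketch of the needle-test invariants: every prefix of the message is parsed,
// emitted fields may only grow, tool names are atomic, and needle 1 of a pair
// must precede needle 2 in the final parse.
#include <cassert>
#include <functional>
#include <string>
#include <string_view>

struct ParsedMsg {
    std::string content;
    std::string tool_name;   // either empty or the complete name, never partial
    std::string tool_args;
};

using ParseFn = std::function<ParsedMsg(std::string_view)>;

static void check_streaming_invariants(const std::string & full, const ParseFn & parse) {
    const ParsedMsg final_msg = parse(full);
    ParsedMsg prev;
    for (size_t n = 0; n <= full.size(); n++) {
        const ParsedMsg cur = parse(std::string_view(full).substr(0, n));
        assert(cur.content.size()   >= prev.content.size());               // content never shrinks
        assert(cur.tool_args.size() >= prev.tool_args.size());             // args never shrink
        assert(cur.tool_name.empty() || cur.tool_name == final_msg.tool_name); // atomic tool names
        prev = cur;
    }
    const size_t n1 = final_msg.content.find("CONTENT-N1");
    const size_t n2 = final_msg.content.find("CONTENT-N2");
    assert(n1 != std::string::npos && n2 != std::string::npos && n1 < n2); // both needles, in order
}

int main() {
    // Trivial stand-in parser that treats everything as plain content; the
    // real tests plug each format's parser in here instead.
    const ParseFn passthrough = [](std::string_view prefix) {
        return ParsedMsg{std::string(prefix), "", ""};
    };
    check_streaming_invariants("hello CONTENT-N1 world CONTENT-N2 bye", passthrough);
    return 0;
}
```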
Known Limitations
The PEG implementation has gaps vs legacy (TBC):
- `allOf`/`anyOf`/`$ref` patterns not fully handled
- `until_max` has a weird implementation (maybe we just drop maxLength on XML formats)

Proposed Migration Plan
- `--experimental-new-parsers` …
- `common/chat-parser.cpp`: ~28 legacy parser functions (~900 lines)
- `common/chat.cpp`: ~19 legacy init functions (~600 lines)
- `common/chat-peg-parser.cpp/.h`: class-based builders/mappers (~220 lines)
- `common/chat-parser-xml-toolcall.cpp/.h`: XML grammar builder (~900 lines); new PEG parsers generate grammars directly from their parser definitions

TODOs before undrafting
Follow up work
- `supports_tool_call_id` - Whether tool calls include IDs
- `reasoning_requires_tools` - Whether thinking mode only works with tools
- `tools_emit_content_with_calls` - Whether tool calls can include content
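A hypothetical sketch of how such per-format capability flags might be declared (field names taken from the list above; the actual representation is still to be decided):

```cpp
// Hypothetical per-format capability flags; not part of this PR.
struct chat_format_caps {
    bool supports_tool_call_id          = false; // tool calls include IDs
    bool reasoning_requires_tools       = false; // thinking mode only works with tools
    bool tools_emit_content_with_calls  = false; // tool calls can include content
};
```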