Indexing large SQL file crashes native parser (Tree-sitter SQL assertion / stack overflow) #668

Description

@georgelichen

Summary

index_repository can crash the native process while indexing a large .sql file. In a full repository run I observed a Tree-sitter assertion:

Assertion failed: symbol < self->token_count, file internal/cbm/vendored/ts_runtime/src/./language.c, line 79

I then narrowed this down to a single large SQL file. I cannot attach the original file because it contains private/customer database schema and data scripts, but the failure is reproducible with that single file copied into an otherwise empty temporary repository.

Environment

  • OS: Windows
  • Shell: PowerShell
  • Binary: C:/Users/lichen/.local/bin/codebase-memory-mcp.exe
  • codebase-memory-mcp --version: codebase-memory-mcp 0.8.1
  • MCP initialize serverInfo reported: codebase-memory-mcp 0.10.0
  • Command mode: codebase-memory-mcp cli index_repository ...

Reproduction Shape

Full repository indexing crashed while scanning a repository with many SQL scripts:

codebase-memory-mcp.exe cli index_repository '{"repo_path":"C:/path/to/repo","mode":"full"}'

The full run reached SQL files under Src/Database and then aborted in the Tree-sitter runtime.

To reduce the input, I copied one SQL file into a new empty directory and indexed only that directory:

$root = "$env:TEMP/cbm-sql-repro"
New-Item -ItemType Directory -Force -Path $root | Out-Null
Copy-Item C:/path/to/CreateDB.sql "$root/CreateDB.sql" -Force
codebase-memory-mcp.exe cli index_repository "{`"repo_path`":`"$($root.Replace('\','/'))`",`"mode`":`"full`"}"

The single file is about 2.6 MiB and contains T-SQL database creation/schema/data script content.
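
The backtick-escaped JSON in the PowerShell command above is easy to get wrong. As a rough sketch, the same argument list can be built in Python with `json.dumps`, which avoids manual quoting. The helper name `build_index_args` is hypothetical. The executable name, subcommand, and JSON keys are the ones shown in the commands above.

```python
import json
from pathlib import PureWindowsPath


def build_index_args(repo_path: str, mode: str = "full") -> list[str]:
    """Build the argv list for the index_repository call shown above.

    build_index_args is a hypothetical helper; only the executable name,
    subcommand, and JSON keys come from the issue text.
    """
    payload = {
        # Convert backslashes to forward slashes, as the PowerShell
        # one-liner does with .Replace('\','/').
        "repo_path": PureWindowsPath(repo_path).as_posix(),
        "mode": mode,
    }
    return ["codebase-memory-mcp.exe", "cli", "index_repository", json.dumps(payload)]


args = build_index_args(r"C:\Temp\cbm-sql-repro")
print(args[3])
```

The list can be handed to `subprocess.run(args)` as-is, so no shell-level escaping is involved.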

Actual Result

Single-file repro exits with native crash code -1073741571 (0xC00000FD, stack overflow on Windows). The output stops during the definitions pass:

level=info msg=mem.init budget_mb=32767 total_ram_mb=65534
level=info msg=pipeline.discover files=1 elapsed_ms=1
level=info msg=pipeline.route path=full
level=info msg=pass.start pass=structure files=1
level=info msg=pass.done pass=structure nodes=2 edges=1
level=info msg=pass.timing pass=structure elapsed_ms=0
level=info msg=pipeline.mode mode=sequential files=1
level=info msg=pkgmap.scan_repo manifests=0
level=info msg=pkgmap.scan manifests_from_files=0 manifests_from_walk=0 entries=0
level=info msg=pass.start pass=definitions files=1
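
For anyone cross-checking the exit code: the signed value reported by the shell and the hexadecimal NTSTATUS are the same 32-bit pattern. A small sketch of the conversion (the function name `to_ntstatus` is just for illustration):

```python
STATUS_STACK_OVERFLOW = 0xC00000FD  # NTSTATUS value for a stack overflow


def to_ntstatus(exit_code: int) -> int:
    """Reinterpret a signed 32-bit process exit code as an unsigned NTSTATUS."""
    return exit_code & 0xFFFFFFFF


code = -1073741571  # exit code observed in the single-file repro
print(hex(to_ntstatus(code)))  # 0xc00000fd
print(to_ntstatus(code) == STATUS_STACK_OVERFLOW)  # True
```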

The full repository run also showed the Tree-sitter assertion above:

Assertion failed: symbol < self->token_count, file internal/cbm/vendored/ts_runtime/src/./language.c, line 79

Expected Result

A malformed, huge, or unsupported SQL file should not abort the whole indexing process. Ideally the file should be skipped and recorded as a failed file, or Tree-sitter parse failures should be isolated so that index_repository can continue and report which files failed.
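
To illustrate the kind of isolation meant here, below is a minimal sketch that runs each file's work in a child process, so that a native crash surfaces as a non-zero exit code instead of ending the whole run. It is not how codebase-memory-mcp is implemented. `run_isolated`, `index_files`, and `demo_cmd` are hypothetical names, and the child command is a stand-in that simply fails for one file.

```python
import subprocess
import sys


def run_isolated(cmd, timeout_s=60.0):
    """Run one per-file job in a child process and return (ok, detail)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        # A crash in the child (failed assertion, stack overflow) shows up
        # here as a non-zero exit code; the parent keeps running.
        return False, f"exit code {proc.returncode}"
    return True, "ok"


def index_files(paths, make_cmd):
    """Process every path, collecting failures instead of aborting."""
    indexed, failed = [], {}
    for path in paths:
        ok, detail = run_isolated(make_cmd(path))
        if ok:
            indexed.append(path)
        else:
            failed[path] = detail
    return indexed, failed


def demo_cmd(path):
    # Hypothetical stand-in worker: exits with code 3 for paths containing "bad".
    script = "import sys; sys.exit(3 if 'bad' in sys.argv[1] else 0)"
    return [sys.executable, "-c", script, path]


indexed, failed = index_files(["a.sql", "bad.sql", "c.sql"], demo_cmd)
print("indexed:", indexed)
print("failed:", failed)
```

Process-per-file isolation is only one option and has a startup cost per file. A cheaper variant would isolate only files above some size threshold.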

Relevant Source Observations

From the current repository source:

  • .sql maps to CBM_LANG_SQL in src/discover/language.c.
  • SQL uses the vendored Tree-sitter SQL grammar.
  • internal/cbm/vendored/grammars/sql/parser.c has TOKEN_COUNT = 429 and SYMBOL_COUNT = 770.
  • The assertion is in internal/cbm/vendored/ts_runtime/src/language.c:79 inside ts_language_table_entry, which expects symbol < self->token_count.

This looks like a Tree-sitter SQL grammar/runtime crash path triggered by large/complex T-SQL input, not a normal SQL syntax error. Normal parse errors should produce ERROR nodes rather than aborting the process.
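
To make the asserted condition concrete, here is a small model using the two constants quoted above. It assumes symbol ids below TOKEN_COUNT are tokens and ids from TOKEN_COUNT up to SYMBOL_COUNT are non-terminals, which is my understanding of the generated parser layout rather than something verified against the runtime. Tree-sitter's special built-in error symbols are not modeled, and the function names are hypothetical.

```python
TOKEN_COUNT = 429   # from internal/cbm/vendored/grammars/sql/parser.c
SYMBOL_COUNT = 770  # from the same file


def classify_symbol(symbol: int) -> str:
    """Classify a non-negative symbol id under the assumed layout."""
    if 0 <= symbol < TOKEN_COUNT:
        return "token"
    if TOKEN_COUNT <= symbol < SYMBOL_COUNT:
        return "non-terminal"
    return "out of range"


def passes_assertion(symbol: int) -> bool:
    """Mirror the asserted condition: symbol < self->token_count."""
    return symbol < TOKEN_COUNT


for sym in (0, 428, 429, 769, 770):
    print(sym, classify_symbol(sym), passes_assertion(sym))
```

Under this model, any id of 429 or above reaching ts_language_table_entry would trip the assertion, whether it is a valid non-terminal or a corrupted value.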

Notes

I can help test a patched binary or try to produce a minimized/redacted SQL repro if that would be useful. For now I avoided attaching the source SQL because it is private customer project material.
