feat(search): index prose content for BM25 full-text search #617

Open: ShauryaaSharma wants to merge 1 commit into DeusData:main from ShauryaaSharma:feat/fts-prose-content
@ShauryaaSharma

Contributor

What & why

search_graph BM25 only matched node names and headings, so it was blind to the
prose that documentation- and config-heavy repos carry. Markdown Section nodes
exposed only their heading; YAML/JSON Module nodes only their file name — the
section body and the description value were never indexed, and Section/Module
were excluded from BM25 results entirely. This indexes that prose so content is
searchable.
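
To make the Module-node half of this concrete, here is a minimal sketch in C of picking a top-level prose value out of a YAML file. It is not the code in this PR: `find_top_level_prose`, `copy_value`, and `PROSE_KEYS` are hypothetical names, the key priority order (description, then summary, then purpose) is an assumption, and the scanner only handles single-line scalars that start at column 0 (no block scalars, no JSON).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical priority order; the PR only says these three keys are promoted. */
static const char *const PROSE_KEYS[] = { "description", "summary", "purpose" };

/* Copy one single-line scalar value into out, trimming whitespace and
 * one pair of matching surrounding quotes. Returns the copied length. */
static size_t copy_value(const char *v, char *out, size_t out_size) {
    while (*v == ' ' || *v == '\t') v++;
    size_t len = strcspn(v, "\r\n");
    while (len > 0 && (v[len - 1] == ' ' || v[len - 1] == '\t')) len--;
    if (len >= 2 && (v[0] == '"' || v[0] == '\'') && v[len - 1] == v[0]) {
        v++;
        len -= 2;
    }
    if (len > out_size - 1) len = out_size - 1;
    memcpy(out, v, len);
    out[len] = '\0';
    return len;
}

/* Find the first non-empty top-level prose value, trying keys in priority
 * order. Indented (nested) keys never match because the comparison starts
 * at column 0 of each line. Returns 0 and an empty string if none is found. */
size_t find_top_level_prose(const char *yaml, char *out, size_t out_size) {
    assert(yaml != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t k = 0; k < sizeof PROSE_KEYS / sizeof PROSE_KEYS[0]; k++) {
        size_t klen = strlen(PROSE_KEYS[k]);
        const char *line = yaml;
        while (line != NULL && *line != '\0') {
            if (strncmp(line, PROSE_KEYS[k], klen) == 0 && line[klen] == ':') {
                size_t n = copy_value(line + klen + 1, out, out_size);
                if (n > 0) return n;
            }
            const char *nl = strchr(line, '\n');
            line = (nl != NULL) ? nl + 1 : NULL;
        }
    }
    return 0;
}
```

In this sketch the returned string is what would be stored as the Module node's docstring; a file with no matching key leaves the output empty, so the node stays bare.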

Closes #518
Closes #519

Changes

Testing

7 extraction cases + 3 store FTS cases added. Verified end-to-end: bodies are
extracted → indexed into nodes_fts.body → returned by BM25; json_valid() tolerates
malformed rows; legacy FTS tables upgrade on rebuild.

Notes

Backward compatible (additive column; legacy DBs upgrade on next index). No MCP
tool changes, no new deps, no new system()/popen()/network calls. #518 and #519
share the FTS body infra (#519 can't work without it), so they're together —
happy to split if preferred.

@ShauryaaSharma

Contributor Author

Rebased onto main (post-#667), all tests green ✅

Rebased on top of the #667 merge. There was one minor conflict in extract_defs.c: upstream added qn_safe_segment() immediately before push_simple_class_def as part of the markdown QN slugification work. The resolution was additive: it keeps both the new helper and the docstring parameter this PR adds.

Verification against the #518 repro in the bug suite:

The #667 markup battery (repro_grammar_markup_markdown) now correctly expects "Section" as the label for headings, which is exactly what this PR emits. Running the 7 targeted extraction tests locally confirms the rebased code is still correct:

markdown_section_body_captured          PASS  ← Playwright body captured, sibling excluded
markdown_section_no_body                PASS  ← empty heading → no docstring
markdown_section_body_capped            PASS  ← body ≤ 500 bytes
yaml_description_promoted_to_module     PASS
yaml_summary_promoted_to_module         PASS
json_description_promoted_to_module     PASS
yaml_no_description_leaves_module_bare  PASS
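
For readers who want the three markdown cases above in code form, the following is a minimal sketch in C of the behavior they describe: take the text beneath a heading, stop at the next heading, and cap the result at 500 bytes. It is an illustration under stated assumptions, not the PR's implementation: `capture_section_body`, `is_heading`, and `SECTION_BODY_CAP` are hypothetical names, only ATX (`#`) headings are recognized, and any following heading ends the body regardless of its level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SECTION_BODY_CAP 500 /* bytes; mirrors the cap named in the tests */

/* True if the line starting at p looks like an ATX heading: 1-6 '#'
 * characters followed by a space or tab. */
static int is_heading(const char *p) {
    size_t n = 0;
    while (p[n] == '#') n++;
    return n >= 1 && n <= 6 && (p[n] == ' ' || p[n] == '\t');
}

/* heading points at the start of a heading line inside a NUL-terminated
 * markdown buffer. Copies the body beneath it into out and returns its
 * length; 0 means the section has no body (so no docstring would be set). */
size_t capture_section_body(const char *heading, char *out, size_t out_size) {
    assert(heading != NULL && out != NULL && out_size > 0);
    out[0] = '\0';

    const char *p = strchr(heading, '\n');
    if (p == NULL) return 0;                 /* heading is the last line */
    p++;
    while (*p == '\n' || *p == '\r') p++;    /* skip blank lines after it */

    const char *start = p;
    const char *end = p;
    while (*p != '\0' && !is_heading(p)) {   /* consume lines up to the next heading */
        const char *nl = strchr(p, '\n');
        end = (nl != NULL) ? nl : p + strlen(p);
        p = (nl != NULL) ? nl + 1 : end;
    }
    /* Trim trailing whitespace and blank lines. */
    while (end > start && (end[-1] == '\n' || end[-1] == '\r' ||
                           end[-1] == ' ' || end[-1] == '\t')) {
        end--;
    }

    size_t len = (size_t)(end - start);
    if (len > SECTION_BODY_CAP) len = SECTION_BODY_CAP;
    if (len > out_size - 1) len = out_size - 1;
    memcpy(out, start, len);
    out[len] = '\0';
    return len;
}
```

Two simplifications are worth keeping in mind: a `#` line inside a fenced code block would be mistaken for a heading here, and the byte cap can split a multi-byte UTF-8 character, which a real extractor would probably want to avoid.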

@DeusData

Owner

Huge thanks for opening this PR and for the work you put into it.

The maintainer shop is currently full, so this may sit for a bit before it gets a proper review. We will come back to this as soon as possible with real feedback; I wanted to make sure it did not sit unacknowledged in the meantime.

@ShauryaaSharma

Contributor Author

No worries at all, @DeusData. Take your time; there's no rush on my end. Happy to wait, and I appreciate you acknowledging it. 🙂

ShauryaaSharma force-pushed the feat/fts-prose-content branch from 074e59e to 2135922 (June 29, 2026 19:18)
Section nodes (markdown) and Module nodes (YAML/JSON) previously exposed
only their heading/name to BM25, so search_graph could not match the prose
body or a config description. Index that text so content is searchable.

- store: add a `body` column to the nodes_fts FTS5 table; new
  cbm_store_fts_rebuild() drops+recreates the table (upgrading legacy
  4-column databases) and backfills `body` from each node's docstring,
  guarded by json_valid() against malformed-JSON rows
- pipeline: both FTS backfill sites now call cbm_store_fts_rebuild()
- mcp: stop excluding Section/Module from BM25 results (they rank below
  code symbols, so existing result ordering is preserved)
- internal/cbm: capture the markdown section body beneath each heading
  (DeusData#518) and promote top-level description/summary/purpose values onto
  the file's Module node (DeusData#519), reusing the existing docstring property
- tests: 7 extraction cases + 3 store FTS cases

Closes DeusData#518
Closes DeusData#519

Signed-off-by: ShauryaaSharma <shauryasofficial27@gmail.com>
ShauryaaSharma force-pushed the feat/fts-prose-content branch from 2135922 to 411ad44 (June 29, 2026 19:35)
DeusData added the enhancement, parsing/quality, and priority/normal labels (Jun 29, 2026)
Development

Successfully merging this pull request may close these issues:

- META.yaml/frontmatter description values not indexed for BM25 search
- Section nodes don't index body text — BM25 can't search markdown content