feat(search): index prose content for BM25 full-text search #617
ShauryaaSharma wants to merge 1 commit
Force-pushed: 58cd6c4 to 2135922
Rebased onto main (post-#667), all tests green ✅
Rebased on top of the #667 merge. One minor conflict in
Verification against the #518 repro in the bug suite:
The #667 markup battery (
Huge thanks for opening this PR and for the work you put into it. The maintainer shop is currently full, so this may sit for a bit before it gets a proper review. We will come back to this as soon as possible with real feedback; I wanted to make sure it did not sit unacknowledged in the meantime.
No worries at all @DeusData, take your time, there's no rush on my end! Happy to wait and appreciate you acknowledging it. 🙂
Force-pushed: 074e59e to 2135922
Section nodes (markdown) and Module nodes (YAML/JSON) previously exposed only their heading/name to BM25, so search_graph could not match the prose body or a config description. Index that text so content is searchable.

- store: add a `body` column to the nodes_fts FTS5 table; new cbm_store_fts_rebuild() drops+recreates the table (upgrading legacy 4-column databases) and backfills `body` from each node's docstring, guarded by json_valid() against malformed-JSON rows
- pipeline: both FTS backfill sites now call cbm_store_fts_rebuild()
- mcp: stop excluding Section/Module from BM25 results (they rank below code symbols, so existing result ordering is preserved)
- internal/cbm: capture the markdown section body beneath each heading (DeusData#518) and promote top-level description/summary/purpose values onto the file's Module node (DeusData#519), reusing the existing docstring property
- tests: 7 extraction cases + 3 store FTS cases

Closes DeusData#518
Closes DeusData#519

Signed-off-by: ShauryaaSharma <shauryasofficial27@gmail.com>
Force-pushed: 2135922 to 411ad44
What & why
`search_graph` BM25 only matched node names and headings, so it was blind to the
prose that documentation- and config-heavy repos carry. Markdown `Section` nodes
exposed only their heading; YAML/JSON `Module` nodes exposed only their file name.
The section body and the description value were never indexed, and `Section`/`Module`
were excluded from BM25 results entirely. This indexes that prose so content is
searchable.
Closes #518
Closes #519
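To make the #518 half concrete, here is a minimal sketch of capturing the prose beneath each markdown heading. It is an illustration only, not the extractor in internal/cbm: the `extract_sections` helper, the regex-based heading detection, and the output dictionary keys are all assumptions, and it deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX headings only ("# Title" .. "###### Title"); hypothetical simplification.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def extract_sections(markdown):
    """Return one record per heading, with the text beneath it as 'docstring'."""
    sections = []
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A new heading closes the previous section and opens a new one.
            current = {
                "heading": match.group(2),
                "level": len(match.group(1)),
                "body_lines": [],
            }
            sections.append(current)
        elif current is not None:
            # Any non-heading line belongs to the most recent heading.
            current["body_lines"].append(line)
    return [
        {
            "heading": s["heading"],
            "level": s["level"],
            "docstring": "\n".join(s["body_lines"]).strip(),
        }
        for s in sections
    ]


doc = "# Guide\nIntro text.\n\n## Retry policy\nRequests use exponential backoff.\n"
sections = extract_sections(doc)
```

With only the heading indexed, a query for "backoff" has nothing to match; once the body is stored alongside the heading, the second section becomes a candidate.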
Changes
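- store: add a `body` column to the nodes_fts FTS5 table; new cbm_store_fts_rebuild() drops+recreates it (upgrading legacy 4-column DBs) and backfills `body` from each node's docstring, guarded by json_valid().
- pipeline: both FTS backfill sites now call cbm_store_fts_rebuild().
- mcp: stop excluding Section/Module from BM25 results (code-symbol results still sort first).
- internal/cbm: capture the markdown section body beneath each heading (#518) and promote the top-level description/summary/purpose value onto the Module node (#519, "META.yaml/frontmatter description values not indexed for BM25 search"), reusing the existing docstring property.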
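The drop+recreate+backfill step can be sketched with Python's stdlib `sqlite3`. This is a rough model under stated assumptions, not the body of cbm_store_fts_rebuild(): the `nodes` table layout, the four legacy FTS column names, and the `properties` JSON column holding `docstring` are all hypothetical. It also assumes the local SQLite build includes FTS5 and the JSON functions.

```python
import sqlite3


def fts_rebuild(conn):
    """Drop and recreate nodes_fts with a body column, then backfill it."""
    # Dropping first is what upgrades a legacy table that lacks `body`.
    conn.execute("DROP TABLE IF EXISTS nodes_fts")
    conn.execute(
        "CREATE VIRTUAL TABLE nodes_fts USING fts5("
        "name, qualified_name, label, file_path, body)"
    )
    # json_valid() guards json_extract(): malformed rows get an empty body
    # instead of aborting the whole backfill.
    conn.execute(
        """
        INSERT INTO nodes_fts(rowid, name, qualified_name, label, file_path, body)
        SELECT id, name, qualified_name, label, file_path,
               CASE WHEN json_valid(properties)
                    THEN COALESCE(json_extract(properties, '$.docstring'), '')
                    ELSE ''
               END
        FROM nodes
        """
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Hypothetical schema: column names are assumptions for illustration.
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, name TEXT, qualified_name TEXT,"
    " label TEXT, file_path TEXT, properties TEXT)"
)
# Simulate a legacy 4-column FTS table that predates `body`.
conn.execute(
    "CREATE VIRTUAL TABLE nodes_fts USING fts5(name, qualified_name, label, file_path)"
)
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Retry policy", "docs.guide.retry-policy", "Section", "docs/guide.md",
         '{"docstring": "Requests use exponential backoff."}'),
        # Truncated JSON on purpose: this row must not break the rebuild.
        (2, "Broken", "docs.guide.broken", "Section", "docs/guide.md",
         '{"docstring": '),
    ],
)

fts_rebuild(conn)

columns = [d[0] for d in conn.execute("SELECT * FROM nodes_fts LIMIT 0").description]
hits = [
    row[0]
    for row in conn.execute(
        "SELECT rowid FROM nodes_fts WHERE nodes_fts MATCH ?", ("backoff",)
    )
]
row_count = conn.execute("SELECT count(*) FROM nodes_fts").fetchone()[0]
```

Rebuilding the table rather than altering it is a natural fit here, since an FTS5 table's column list is fixed when the table is created.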
Testing
7 extraction cases + 3 store FTS cases added. Verified end-to-end: bodies are
extracted → indexed into nodes_fts.body → returned by BM25; json_valid() tolerates
malformed rows; legacy FTS tables upgrade on rebuild.
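The end-to-end path (body indexed, then returned by BM25, with code symbols still sorting first) might be spot-checked along these lines. The three-column table, the sample rows, and the `ORDER BY` clause are assumptions made for the sketch; the PR text does not say how search_graph keeps Section/Module below code symbols, so the label-based sort key here is only one possible approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, hypothetical FTS layout: just enough columns for the check.
conn.execute("CREATE VIRTUAL TABLE nodes_fts USING fts5(name, label, body)")
conn.executemany(
    "INSERT INTO nodes_fts(name, label, body) VALUES (?, ?, ?)",
    [
        ("retry_with_backoff", "Function", ""),
        ("Retry policy", "Section",
         "Requests use exponential backoff before they retry."),
        ("config.yaml", "Module", "Service settings for the billing worker."),
    ],
)


def search(conn, query):
    """Full-text search; prose nodes sort after code symbols, then by BM25."""
    return conn.execute(
        """
        SELECT name, label
        FROM nodes_fts
        WHERE nodes_fts MATCH ?
        ORDER BY label IN ('Section', 'Module'),  -- 0 for code symbols, 1 for prose
                 bm25(nodes_fts)                  -- lower score = better match
        """,
        (query,),
    ).fetchall()


backoff_hits = search(conn, "backoff")
billing_hits = search(conn, "billing")
```

Here "billing" is reachable only through the Module node's body, which is the #519 case, while "backoff" matches both a code symbol and a Section body and the code symbol is listed first.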
Notes
Backward compatible (additive column; legacy DBs upgrade on next index). No MCP
tool changes, no new deps, no new system()/popen()/network calls. #518 and #519
share the FTS `body` infra (#519 can't work without it), so they're together; happy to split if preferred.