fix(valkey): post-filter metadata_filter in vector search (issue #5795)#5796
fix(valkey): post-filter metadata_filter in vector search (issue #5795)#5796devin-ai-integration[bot] wants to merge 1 commit into
Conversation
…T clauses
The memory_index FT schema only defines fields for embedding, scope,
categories, created_at and importance. Building @{key}:{value} clauses
from metadata_filter referenced fields that are not indexed, causing
FT.SEARCH to error or silently return wrong results.
Apply metadata_filter as a Python post-filter on the deserialized
records (matching the pattern used by LanceDBStorage) and oversample
the KNN limit (limit * 3) when a metadata_filter is supplied so we
don't under-return after filtering. Updated tests that previously
asserted the buggy server-side query and added regression tests for
the post-filter behaviour.
Closes #5795
Co-Authored-By: João <joao@crewai.com>
|
Prompt hidden (unlisted session) |
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Summary
Fixes #5795. In
ValkeyStorage._vector_search(introduced by PR #5703),metadata_filterwas being translated into@{key}:{value}clauses of the FT.SEARCH query. But thememory_indexFT schema only defines fields forembedding,scope,categories,created_at, andimportance— so any metadata key (e.g.agent_id) refers to a non-existent index field. On a real Valkey Search backend this either errors out or silently returns wrong results. The existing test suite only passes becauseft.searchis mocked.This PR implements Option B from the issue (post-filter outside FT.SEARCH):
_vector_searchno longer emits@{key}:{value}query parts formetadata_filter— only the indexed fieldsscopeandcategoriesremain in the FT query._metadata_matchesis applied in Python after FT.SEARCH returns, mirroring the pattern used byLanceDBStorage. Each filter value is compared first natively and then as a string, so callers can pass either{"page": 5}or{"page": "5"}.metadata_filteris supplied,_vector_searchrequestslimit * 3candidates from FT.SEARCH (and uses the same value forFtSearchLimit) so post-filtering doesn't under-return. Results are trimmed back to the originallimitafter filtering.Why post-filter (Option B) instead of materialising fields (Option A)?
Option A would require either declaring a fixed set of metadata keys upfront (incompatible with arbitrary user metadata) or dynamically mutating the FT schema, which Valkey Search does not really support and would complicate index creation/upgrade. Post-filtering matches the existing approach already used by
LanceDBStoragefor the same combination of filters.Tests
test_search_with_metadata_filter_only,test_search_with_combined_filters,test_search_with_numeric_metadata_values, andtest_search_with_multiple_metadata_filters_uses_and_logic— they previously asserted the buggy server-side query (@agent_id:{...}etc). They now assert that no@<metadata_key>clauses leak into the FT query and that the KNN limit is oversampled when needed.TestValkeyStorageMetadataFilterPostFilterclass covering:metadata_filter(wrong value or missing key entirely) are excluded.limit * 3only whenmetadata_filteris provided.limitafter post-filtering._metadata_matcheshelper handles AND logic, type coercion, missing keys, and empty filters.Lint (
ruff), type check (mypy), and all 192 tests inlib/crewai/tests/memory/storage/pass locally.Review & Testing Checklist for Human
This is a yellow-risk change — it changes the semantics of
metadata_filterfrom "broken on real Valkey Search" to "applied as a Python post-filter".devin/valkey-5703-base(a vendored copy of PR Feat/valkey 4 storage #5703's head branch pushed to the upstream repo so the diff is reviewable as a delta). Once PR Feat/valkey 4 storage #5703 merges, this PR's base should be retargeted tomain(or the fix can be cherry-picked into PR Feat/valkey 4 storage #5703 directly). The fix branch contains only the issue feat(valkey): metadata_filter fields not indexed in memory_index FT schema #5795 commit on top of PR Feat/valkey 4 storage #5703.ft.search). Spin upvalkey/valkey-bundle:latest, save a few records with distinctmetadata, and callValkeyStorage.search(...)with both matching and non-matchingmetadata_filtervalues to confirm post-filtering returns the expected records.limit * 3is the same factor used byLanceDBStorage. If your workloads expect very selective metadata filters over large indexes, you may want a larger oversample factor or to combine post-filtering with_find_records_by_metadatafor a pre-filter intersection.asearch/searchstill accurately describemetadata_filtersemantics in light of the post-filter implementation.Notes
mainbecausevalkey_storage.pydoesn't exist there yet.search,asearch, and_vector_search.@staticmethod(_metadata_matches) so the matching logic is testable on its own and documented in one place.Link to Devin session: https://app.devin.ai/sessions/068aff28e79a49b297b5d70a9bdd3b9b