Conversation
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
quickwit/quickwit-metastore/migrations/postgresql/26_add-split-soft-deleted-doc-ids.down.sql
rdettai-sk left a comment:
Not 100% through with the review (I only glanced over the search part). I think an integration test would be really nice.
```diff
     ctx: &ActorContext<Self>,
-) -> anyhow::Result<IndexedSplit> {
+) -> anyhow::Result<(IndexedSplit, HashMap<String, BTreeSet<u32>>)> {
     let (union_index_meta, split_directories) = open_split_directories(
```
nit, more readable:

```diff
-) -> anyhow::Result<(IndexedSplit, HashMap<String, BTreeSet<u32>>)> {
+) -> anyhow::Result<(IndexedSplit, HashMap<SplitId, BTreeSet<u32>>)> {
```
I also think this list of soft-deleted docs would be better off carried by `IndexedSplit.split_attrs`. It might force you to modify a couple of unrelated files, but it makes more sense there because you already have `replaced_split_ids` (which you could more or less replace with this).
Will move it. Do you really think I should replace `replaced_split_ids` with the snapshots?
```rust
/// has been updated before the merge completes), the merge still succeeds and emits a
/// warning rather than failing.
#[tokio::test]
async fn test_merge_executor_soft_delete_race_condition_warning() -> anyhow::Result<()> {
```
Suggested change:

```diff
-async fn test_merge_executor_soft_delete_race_condition_warning() -> anyhow::Result<()> {
+async fn test_merge_executor_soft_delete_race_condition() -> anyhow::Result<()> {
```
cool test (don't forget to update the docstring if we completely get rid of the warning #16 (comment))
```rust
/// Query text in Tantivy query language to match events to soft-delete.
pub query: String,
/// Fields to search on (defaults to doc mapper default fields).
#[serde(default)]
pub search_fields: Option<Vec<String>>,
/// Maximum number of events to soft-delete in a single call (default: 10000).
#[serde(default = "default_max_soft_deletes")]
pub max_hits: u64,
```
I find this API powerful and clean, but a bit dangerous:
- It's very easy to write a query that matches a lot of documents (even more so with the default fields)
- The default `max_hits` is set to `MAX_SOFT_DELETED_DOCS_PER_SPLIT`, which is very large
- It's irreversible (without going to postgres very quickly 😄)

A couple of easy things to do to make this safer:
- Remove `search_fields`; the added value is not worth the risk (fail if the default field is missing, which can be done using `query_ast_from_user_text(&request.query, request.search_fields).parse_user_query(&[])?`)
- Add a lower default and document that increasing it is risky

Nice to have but not necessary:
- delete soft_delete endpoint

Probably overkill:
- list soft_delete endpoint (you can already get that info from listing the splits)
I will reduce the limit and remove the default search fields.
I considered adding an un-delete endpoint and decided it was not worth the work since we don't need it for our use case. It would be easy to add later if we need it.
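A stdlib-only sketch of the safer request shape discussed above: no `search_fields`, and a deliberately low default cap on `max_hits`. The struct name and the 1,000 default are illustrative assumptions, not the PR's actual values:

```rust
/// Illustrative sketch of a safer soft-delete request (names assumed).
#[derive(Debug, Clone)]
pub struct SoftDeleteRequest {
    /// Query in the Tantivy query language; parsing fails if the doc
    /// mapper has no default fields configured, instead of silently
    /// matching a huge number of documents.
    pub query: String,
    /// Cap on how many documents a single call may soft-delete.
    /// Kept deliberately low; raising it should be documented as risky.
    pub max_hits: u64,
}

impl Default for SoftDeleteRequest {
    fn default() -> Self {
        Self {
            query: String::new(),
            // Conservative default, far below MAX_SOFT_DELETED_DOCS_PER_SPLIT.
            max_hits: 1_000,
        }
    }
}
```

The point of the explicit `Default` impl is that the conservative cap applies whenever the caller omits the field, rather than defaulting to the per-split maximum.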
```rust
pub doc_mapping_uid: DocMappingUid,

/// Set of tantivy doc_ids that have been soft-deleted from this split.
#[serde(default)]
```
Suggested change:

```rust
#[serde(default)]
```
```rust
index_uid: &IndexUid,
snapshot: &HashMap<String, BTreeSet<u32>>,
metastore: &MetastoreServiceClient,
ctx: &ActorContext<Publisher>,
```
nit: when you just need to be able to create protected zones, you can pass `&Progress` instead of the whole context, to better convey the usage.
I didn't know about it, will make the change!
```rust
"soft-delete race condition detected: {} doc(s) were soft-deleted on split \
 `{}` while the merge was running; they will not be reflected in the merged \
 split and will remain searchable until the next merge cycle",
```
I think in case of a race condition you lose the soft deletes here. It's a pity: if you sent the list of deleted docs to the metastore, it could actually, during the transaction, compute the diff between the last state of deleted docs on all merged splits and the docs that were removed during the merge, and commit the new split with it.
The chances of a race condition here are pretty limited, but not impossible. Do you feel it would be cleaner to perform this check at the metastore level?
I feel that we carry this list of soft-deleted docs for too long 😅
I think that since we carried it this far, it's not much to carry it just one extra step and have a bulletproof solution.
Ah, right, I forgot that it wasn't that simple to add the offsets of the original splits to the merged split. Never mind, we can stick to the warning here and just make it explicit that the delete was lost (maybe print an error).
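To make the metastore-side check discussed above concrete, here is a hedged, stdlib-only sketch of the diff the transaction could compute: compare the pre-merge snapshot of soft-deleted doc ids per split against the current state and surface the deletes that arrived mid-merge. Function and parameter names are illustrative, not quickwit's API:

```rust
use std::collections::{BTreeSet, HashMap};

/// Returns, per split, the doc ids that were soft-deleted after the
/// `snapshot` was taken (i.e. while the merge was running).
fn lost_deletes(
    snapshot: &HashMap<String, BTreeSet<u32>>,
    current: &HashMap<String, BTreeSet<u32>>,
) -> HashMap<String, BTreeSet<u32>> {
    current
        .iter()
        .filter_map(|(split_id, now)| {
            // Splits absent from the snapshot count as "no deletes seen".
            let before = snapshot.get(split_id).cloned().unwrap_or_default();
            let lost: BTreeSet<u32> = now.difference(&before).copied().collect();
            (!lost.is_empty()).then(|| (split_id.clone(), lost))
        })
        .collect()
}
```

If this diff is non-empty, the transaction could either reject the merged split or (as the thread settles on) log the lost deletes explicitly instead of a generic warning.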
```rust
fn scorer(&self, reader: &SegmentReader, boost: Score) -> tantivy::Result<Box<dyn Scorer>> {
    let inner_scorer = self.inner.scorer(reader, boost)?;
    let excluded = SortedDocIdSet::new(Arc::clone(&self.deleted_doc_ids));
    Ok(Box::new(Exclude::new(inner_scorer, excluded)))
}
```
this is so beautiful I almost cried when reading it
If we don't look at the other 300 lines it seems quite straightforward :p
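Setting tantivy aside, the core of the scorer trick above can be sketched with plain iterators: wrap the stream of matching doc ids and skip any id present in the soft-deleted set. This is an illustrative stand-in for tantivy's `Exclude`, not the PR's code:

```rust
use std::collections::BTreeSet;

/// Filters a stream of matching doc ids, dropping soft-deleted ones.
/// Plays the role of `Exclude::new(inner_scorer, excluded)` in the PR.
fn exclude_deleted<'a>(
    matches: impl Iterator<Item = u32> + 'a,
    deleted: &'a BTreeSet<u32>,
) -> impl Iterator<Item = u32> + 'a {
    matches.filter(move |doc_id| !deleted.contains(doc_id))
}
```

The real `Exclude` scorer does the same thing at the `DocSet` level, advancing both sides in sorted doc-id order instead of probing a set per document.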
```rust
timestamp_start: None,
timestamp_end: None,
num_docs: 0,
soft_deleted_doc_ids: Vec::new(),
```
Actually, I think there is a cache invalidation bug: the partial hit cache is likely to return an old result (at the beginning of `leaf_search_single_split()`).
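One hedged way to address that kind of staleness, sketched with the standard library only, is to fold the soft-delete state into the cache key, so any new soft delete produces a different key and the stale entry is simply never hit again. The function and its parameters are illustrative, not quickwit's actual cache API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Derives a cache key that changes whenever the set of soft-deleted
/// doc ids for the split changes, invalidating stale partial hits.
fn cache_key(split_id: &str, query: &str, deleted: &BTreeSet<u32>) -> u64 {
    let mut hasher = DefaultHasher::new();
    split_id.hash(&mut hasher);
    query.hash(&mut hasher);
    // Including the deleted set means old entries keyed on the previous
    // set are no longer addressable after a new soft delete lands.
    deleted.hash(&mut hasher);
    hasher.finish()
}
```

The alternative is explicit invalidation (evicting the split's entries when a soft delete is committed), which avoids leaving dead entries behind but requires plumbing the event to the cache.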
Description
This PR allows soft-deleting documents from an index.
Tasks