feat: Soft Delete documents #16

Draft
Darkheir wants to merge 7 commits into sekoia from feat/soft_delete_documents
Conversation


@Darkheir Darkheir commented Mar 16, 2026

Description

This PR allows soft-deleting documents from an index.

Tasks

  • Update quickwit-proto with new messages
  • Update metastore to store soft deleted documents
  • Update quickwit search to ignore soft deleted documents
  • Add new REST endpoint to soft delete documents
  • Add Elasticsearch compatible endpoint to delete documents
  • Ignore deleted documents on merge of splits
  • Enforce limit of soft deleted events for a given split

@Darkheir Darkheir force-pushed the feat/soft_delete_documents branch from b6c0dec to bb179a6 Compare March 17, 2026 08:20
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
@Darkheir Darkheir force-pushed the feat/soft_delete_documents branch from bb179a6 to 3c56d41 Compare March 17, 2026 08:39
@Darkheir Darkheir force-pushed the feat/soft_delete_documents branch from 89355de to 0734ed6 Compare March 18, 2026 09:49
@Darkheir Darkheir requested a review from rdettai-sk March 18, 2026 11:25
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
@Darkheir Darkheir force-pushed the feat/soft_delete_documents branch from c86fbfb to 817c329 Compare March 19, 2026 14:36
Signed-off-by: Darkheir <raphael.cohen@sekoia.io>
@Darkheir Darkheir requested a review from rdettai-sk March 19, 2026 17:00

@rdettai-sk rdettai-sk left a comment


Not 100% through with the review (I only glanced over the search part). I think an integration test would be really nice.

ctx: &ActorContext<Self>,
) -> anyhow::Result<IndexedSplit> {
let (union_index_meta, split_directories) = open_split_directories(
) -> anyhow::Result<(IndexedSplit, HashMap<String, BTreeSet<u32>>)> {
rdettai-sk (reviewer):

nit, more readable

Suggested change
) -> anyhow::Result<(IndexedSplit, HashMap<String, BTreeSet<u32>>)> {
) -> anyhow::Result<(IndexedSplit, HashMap<SplitId, BTreeSet<u32>>)> {

I also think this list of soft-deleted docs would be better off carried by IndexedSplit.split_attrs. It might force you to modify a couple of unrelated files, but it makes more sense there because you already have replaced_split_ids (which you could more or less replace with this).

Darkheir (author):

Will move it. Do you really think I should replace replaced_split_ids with the snapshots?

/// has been updated before the merge completes), the merge still succeeds and emits a
/// warning rather than failing.
#[tokio::test]
async fn test_merge_executor_soft_delete_race_condition_warning() -> anyhow::Result<()> {
rdettai-sk (reviewer):

Suggested change
async fn test_merge_executor_soft_delete_race_condition_warning() -> anyhow::Result<()> {
async fn test_merge_executor_soft_delete_race_condition() -> anyhow::Result<()> {

cool test (don't forget to update the docstring if we completely get rid of the warning #16 (comment))

Comment on lines +49 to +56
/// Query text in Tantivy query language to match events to soft-delete.
pub query: String,
/// Fields to search on (defaults to doc mapper default fields).
#[serde(default)]
pub search_fields: Option<Vec<String>>,
/// Maximum number of events to soft-delete in a single call (default: 10000).
#[serde(default = "default_max_soft_deletes")]
pub max_hits: u64,
rdettai-sk (reviewer):

I find this API powerful and clean, but a bit dangerous:

  • It's very easy to write a query that matches a lot of documents (even more so with the default fields)
  • The default max_hits is set to MAX_SOFT_DELETED_DOCS_PER_SPLIT, which is very large
  • It's irreversible (unless you go into Postgres very quickly 😄)

A couple of easy things to do to make this safer:

  • Remove search_fields; the added value is not worth the risk (fail if the default field is missing, which can be done using query_ast_from_user_text(&request.query, request.search_fields).parse_user_query(&[])?)
  • Add a lower default and document that increasing it is risky

Nice to have but not necessary:

  • delete soft_delete endpoint

Probably overkill:

  • list soft_delete endpoint (you can already get that info from listing the splits)
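The "lower default" suggestion amounts to clamping the requested limit; a minimal sketch (the constant names and values are hypothetical, not the PR's actual ones):

```rust
// Hypothetical limits; the PR's real constants and values may differ.
const DEFAULT_MAX_SOFT_DELETES: u64 = 1_000;
const MAX_SOFT_DELETED_DOCS_PER_SPLIT: u64 = 100_000;

/// Clamp a user-provided max_hits to the hard per-split ceiling,
/// falling back to a conservative default when unset.
fn effective_max_hits(requested: Option<u64>) -> u64 {
    requested
        .unwrap_or(DEFAULT_MAX_SOFT_DELETES)
        .min(MAX_SOFT_DELETED_DOCS_PER_SPLIT)
}

fn main() {
    assert_eq!(effective_max_hits(None), 1_000);
    assert_eq!(effective_max_hits(Some(500)), 500);
    assert_eq!(effective_max_hits(Some(1_000_000)), 100_000);
}
```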

Darkheir (author):

I will reduce the limit and remove the default search fields.

I considered adding an un-delete endpoint and decided it was not worth the work since we don't need it for our use case.

It would be easy to add later if we need it.

pub doc_mapping_uid: DocMappingUid,

/// Set of tantivy doc_ids that have been soft-deleted from this split.
#[serde(default)]
rdettai-sk (reviewer):

Suggested change
#[serde(default)]

index_uid: &IndexUid,
snapshot: &HashMap<String, BTreeSet<u32>>,
metastore: &MetastoreServiceClient,
ctx: &ActorContext<Publisher>,
rdettai-sk (reviewer):

nit: when you just need to be able to create protected zones, you can pass &Progress instead of the whole context, to better convey the usage

Darkheir (author):

I didn't know about it, will make the change!

Comment on lines +274 to +276
"soft-delete race condition detected: {} doc(s) were soft-deleted on split \
`{}` while the merge was running; they will not be reflected in the merged \
split and will remain searchable until the next merge cycle",
rdettai-sk (reviewer):

I think in case of a race condition you lose the deletes here. It's a pity: if you sent the list of deleted docs to the metastore, it could actually, during the transaction, perform the diff between the last state of deleted docs on all merged splits and the docs that were removed during the merge, and commit the new split with it.

Darkheir (author):

The chances of a race condition here are pretty limited, but not impossible. Do you feel it would be cleaner to perform this check at the metastore level?

I feel that we carry this list of soft deleted docs for too long 😅

rdettai-sk (reviewer):

I think that since we've carried it this far, it's not much to carry it just one extra step and have a bulletproof solution.

rdettai-sk (reviewer):

Ah, right, I forgot that it wasn't that simple to add the offsets of the original splits to the merged split. Never mind, we can stick to the warning here and just make it explicit that the delete was lost (maybe print an error).

Comment on lines +79 to +83
fn scorer(&self, reader: &SegmentReader, boost: Score) -> tantivy::Result<Box<dyn Scorer>> {
let inner_scorer = self.inner.scorer(reader, boost)?;
let excluded = SortedDocIdSet::new(Arc::clone(&self.deleted_doc_ids));
Ok(Box::new(Exclude::new(inner_scorer, excluded)))
}
rdettai-sk (reviewer):

this is so beautiful I almost cried when reading it

Darkheir (author):

If we don't look at the other 300 lines it seems quite straightforward :p

timestamp_start: None,
timestamp_end: None,
num_docs: 0,
soft_deleted_doc_ids: Vec::new(),
rdettai-sk (reviewer):

Actually I think there is a cache invalidation bug. The partial hit cache is likely to return an old result (beginning of leaf_search_single_split()).
