Description of the Problem
When executing a search query that hits a massive number of splits (e.g., in a 50 PB-scale log deployment, with a broad time range and weak filters), Quickwit exhibits extremely heavy continuous reads from S3. This pattern can saturate the network ingress bandwidth on the search nodes.
Upon analyzing the S3 access pattern (s3_compatible_storage.rs, bundle_storage.rs, and SplitCache), it's clear that Quickwit is already highly optimized (utilizing HTTP Range Requests efficiently). The problem stems from the concurrency logic in the leaf.rs module.
Currently, in single_doc_mapping_leaf_search, all matched uncached splits are scheduled and dispatched concurrently via run_local_search_tasks.
While Quickwit has a great CanSplitDoBetter pruning mechanism, its effectiveness is limited in this extreme scenario. Because hundreds or thousands of splits are spawned and enter their warmup phase (triggering S3 I/O) almost simultaneously, the pruning state (worst_hit_found) is not populated fast enough to prevent unnecessary splits from aggressively downloading their fast_fields and term dictionaries.
As a result, an enormous amount of unnecessary S3 Range Request bandwidth is wasted on splits that would have been pruned anyway if they were processed slightly later.
Proposed Solution: Progressive Two-Phase Search
To maximize the power of the existing CanSplitDoBetter mechanism without impacting normal query latency (where split_count is reasonable), we propose modifying single_doc_mapping_leaf_search to implement a "Two-Phase" or "Progressive" execution path when the split count is large.
Logic implementation:
- New Configuration: Introduce progressive_search_batch_size: usize in SearcherConfig (defaulting to 20 or similar). If set to 0, it falls back to the current behavior.
- Phase 1 (Priority Batch): In leaf.rs, we take the first batch_size splits (already sorted by optimize_split_order, so they are the most promising candidates) and execute them, using tokio::join! to wait for all of them to complete.
- Data Population: Upon Phase 1 completion, the CanSplitDoBetter structure (which holds the incremental_merge_collector) is filled with concrete worst_hit scores from the most relevant splits.
- Phase 2 (Remaining Splits): We then dispatch the remaining splits. Because CanSplitDoBetter is now populated with a strong baseline, the simplify_search_request function can aggressively convert unnecessary splits into count-only queries or skip them completely (if a count is not needed), preventing them from making any heavy S3 I/O requests.
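The two-phase logic above can be sketched as follows. This is a simplified, synchronous illustration only: Split, max_possible_score, Plan, and plan_splits are hypothetical names, the real worst_hit would come from the incremental_merge_collector after Phase 1 completes (here it is approximated by the weakest score upper bound seen in the priority batch), and the actual dispatch in leaf.rs would remain async.

```rust
// Hypothetical sketch of the proposed two-phase split planning.
#[derive(Debug, Clone)]
struct Split {
    id: String,
    // Upper bound on any score this split can produce (e.g. derived from
    // per-split metadata); stands in for the CanSplitDoBetter check.
    max_possible_score: f32,
}

#[derive(Debug, PartialEq)]
enum Plan {
    FullSearch, // full warmup + search (heavy S3 I/O)
    CountOnly,  // count docs only, no fast-field/dictionary warmup
    Skip,       // pruned entirely
}

/// Decide how to execute each split, assuming the slice is already sorted
/// best-first (as optimize_split_order does). batch_size == 0 keeps
/// today's fully concurrent behavior.
fn plan_splits(splits: &[Split], batch_size: usize, need_count: bool) -> Vec<(String, Plan)> {
    if batch_size == 0 || splits.len() <= batch_size {
        // Degrade to the current behavior: every split runs a full search.
        return splits.iter().map(|s| (s.id.clone(), Plan::FullSearch)).collect();
    }
    let (phase1, phase2) = splits.split_at(batch_size);
    let mut plans: Vec<(String, Plan)> =
        phase1.iter().map(|s| (s.id.clone(), Plan::FullSearch)).collect();
    // After Phase 1, the collector would hold a concrete worst_hit score;
    // approximated here by the weakest upper bound in the priority batch.
    let worst_hit = phase1
        .iter()
        .map(|s| s.max_possible_score)
        .fold(f32::INFINITY, f32::min);
    for s in phase2 {
        let plan = if s.max_possible_score > worst_hit {
            Plan::FullSearch // this split could still improve the Top-K
        } else if need_count {
            Plan::CountOnly // keep counts exact without heavy warmup
        } else {
            Plan::Skip
        };
        plans.push((s.id.clone(), plan));
    }
    plans
}
```

The key property is that only the priority batch plus the (usually few) splits that can still beat the established baseline incur warmup I/O.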
Impact & Safety:
- Pagination (count accuracy) remains unaffected. simplify_search_request correctly respects CountAll and safely converts splits to count-only queries (bypassing expensive field-data downloads) instead of dropping them.
- This effectively bounds the peak S3 GET request burst length to exactly what is needed for the Top-K.
- It degrades seamlessly back to the original concurrency behavior if the query matches fewer than progressive_search_batch_size splits.
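For concreteness, the new knob could be surfaced in the node configuration roughly like this (a hypothetical fragment; the exact key name and placement would follow the existing SearcherConfig conventions):

```yaml
searcher:
  # Hypothetical setting: 0 keeps today's fully concurrent dispatch;
  # a positive value enables the progressive two-phase path with that
  # many splits in the priority batch.
  progressive_search_batch_size: 20
```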
Environment details:
- OS: Linux / Kubernetes
- Quickwit Version: Latest main